Costs and benefits of audiovisual interactions

Abstract

A strong temporal correlation promotes integration of concurrent sensory signals, either within a single sensory modality, or from different modalities. Although the benefits of such integration are well known, far less attention has been given to possible costs incurred when concurrent sensory signals are uncorrelated. In two experiments, subjects categorized the rate at which a visual object modulated in size, while they also tried to ignore a concurrent task-irrelevant broadband sound. Overall, the experiments showed that (i) losses in accuracy from mismatched auditory and visual rates were larger than gains from matched rates and (ii) mismatched auditory and visual rates slowed responses more than they were sped up when rates matched. Experiment One showed that audiovisual interaction varied with the difference between the visual modulation rate and the modulation rate of a concurrent auditory stimulus. Experiment Two showed that audiovisual interaction depended upon the strength of the task-irrelevant auditory modulation. Although our stimuli involved abstract, low-dimensional stimuli, not speech, the effects we observed parallel key findings on interference in multi-speaker settings.

Keywords

Decision making; multisensory interaction; audiovisual stimuli; selective attention

During speech, the visible movements of a speaker’s face closely parallel the acoustic signals produced by the speaker’s vocal apparatus (Chandrasekaran et al., 2009). This audiovisual temporal relationship boosts the intelligibility of face-to-face conversation (Sumby and Pollack, 1954), which is particularly helpful in noisy environments, such as crowded restaurants and parties (Golumbic et al., 2013), or when a listener is hearing-impaired (Dias et al., 2021). Just as temporal correlation can boost intelligibility in multi-talker settings, sounds that are uncorrelated with a speaker’s facial kinematics can reduce speech intelligibility (Golumbic et al., 2013; Li et al., 2018). Filtering out irrelevant sounds puts a premium on selective attention (Kerlin et al., 2010; Passow et al., 2012). To be clear, by that term we mean “the processes that allow an individual to select and focus on particular input for further processing while simultaneously suppressing irrelevant or distracting information” (Stevens and Bavelier, 2012).

Because the intelligibility of face-to-face speech depends upon interactions among high-dimensional sensory and cognitive variables (Schneider et al., 2002; Peelle and Sommers, 2015; Conway et al., 2001), studying audiovisual speech presents difficult challenges. These challenges have led some researchers to develop low-dimensional analogues to face-to-face speech. For example, to isolate critical features of a speaker’s facial kinematics, speech intelligibility has been studied with a size-modulating disc or other dynamic visual object standing in for a speaker’s face (Bernstein et al., 2004; Strand et al., 2020; Yuan et al., 2021).

Several groups have examined selective listening with simple dynamic visual and auditory stimuli whose temporal frequencies resemble those of normal speech (Maddox et al., 2015; Varghese et al., 2017). To study how task-relevant and task-irrelevant amplitude-modulated sound influences perception of a size-modulating visual stimulus, our laboratory has used simple, low-dimensional, temporally modulated visual and auditory stimuli embedded in a video game, Fish Police!! (Sun et al., 2017; Varghese et al., 2017; Zhou, 2019; Sun and Sekuler, 2021). Importantly, some conditions of the game demand selective attention in order to filter out distracting task-irrelevant signals. This feature allows the game to estimate not only audiovisual interactions that benefit performance, but also ones that degrade it.

Subjects play Fish Police!! by categorizing the rate at which the image of a fish modulates in size. Over trials, this temporal modulation randomly takes on either of two alternative rates, for example, 5 and 6 Hz. Subjects quickly learn to categorize the modulation rates as either “slower” or “faster”. The modulating visual object can be accompanied by a task-irrelevant sound whose rate of amplitude modulation either matches the visual modulation rate or not. When visual and auditory modulations are matched, signals from the two modalities are temporally correlated. However, when visual and auditory modulation rates are mismatched, the mismatch produces competition between the modalities. Specifically, the slower visual modulation might be paired with the faster auditory rate, or vice versa. In that way, auditory and visual stimuli are not merely uncorrelated or imperfectly correlated: they directly conflict with one another. Although different operations may be responsible, the conflict from mismatched audiovisual stimuli in Fish Police!! can be analogized to the conflict represented in Eriksen’s Flanker task or in the Stroop task (Lamers and Roelofs, 2011).

One result has been confirmed repeatedly: visual modulation rate is more accurately categorized when the visual stimulus is accompanied by a concurrent, synchronized auditory modulation than when the two compete. The first study with Fish Police!! (Goldberg et al., 2015) was limited to just two conditions: auditory-visual modulations that matched in rate and phase (hereafter, Congruent) and auditory-visual modulations that were mismatched (hereafter, Incongruent). Accuracy in categorizing size modulation rate differed substantially between the two conditions, with the Congruent condition producing more accurate categorization. This result was robust, reaching statistical reliability after as few as 20-30 trials (Goldberg et al., 2015), but ambiguous. The absence of any neutral control condition made it impossible to tell whether the difference was caused by a gain in accuracy with Congruent stimuli, by a loss in accuracy with Incongruent stimuli, or by some combination of the two. So it is not clear that the difference between Congruent and Incongruent conditions actually reflected an advantage from audiovisual binding of the stimuli. This ambiguity must be resolved before the task’s usefulness as a stand-in for audiovisual speech can be fairly evaluated.

Several follow-up experiments have tried to resolve the ambiguity. They included an additional Control condition in which a task-irrelevant sound was either absent or present but unmodulated. Every follow-up experiment replicated the significant advantage of the Congruent over the Incongruent condition, but they diverged in how performance in the Incongruent condition related to performance in the Control condition. In some experiments, results from the latter two conditions were indistinguishable from one another (Sun et al., 2017; Zhou, 2019), while in others, accuracy in the Incongruent condition fell significantly below that of the Control condition (Varghese et al., 2017; Zhou, 2019; Sun and Sekuler, 2021). The first of these outcomes is surprising because it shows that performance in the Incongruent condition can be unimpaired relative to the Control condition.

The divergent outcomes might have arisen from differences in the variables summarized in Table 1. The table shows that experiments differed in (i) the number of subjects tested (n), (ii) the difference between the low and high modulation frequencies ( $Δ$ Freq) with which the stimuli were modulated, and (iii) the depth at which the nominally irrelevant, auditory stimulus was modulated. The difference in accuracy, Congruent > Incongruent, held over a wide range of sample sizes, from n = 11 to n = 27, so that variable probably does not explain the inconsistent relationship between Control and Incongruent conditions in the experiments that compared those conditions. Instead, we decided to focus on the other two variables, the frequency difference and the depth of modulation. Equation 1 defines what we mean by “depth of auditory modulation.”

$$\text{Percent Modulation Depth} = \frac{\text{Maximum} - \text{Minimum}}{\text{Maximum} + \text{Minimum}} \times 100 \qquad (1)$$
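As a quick check on Equation 1, the following minimal Python sketch evaluates the formula for a few envelope extremes. The function name and the example values are illustrative assumptions, not values taken from the experiments.

```python
def percent_modulation_depth(maximum: float, minimum: float) -> float:
    """Equation 1: (Maximum - Minimum) / (Maximum + Minimum) * 100."""
    if maximum + minimum == 0:
        raise ValueError("Maximum + Minimum must be nonzero")
    return (maximum - minimum) / (maximum + minimum) * 100.0


# Hypothetical extremes, chosen only to illustrate the formula.
full_depth = percent_modulation_depth(2.0, 0.0)      # minimum reaches zero
shallow_depth = percent_modulation_depth(1.25, 0.75)  # small swing around 1.0
unmodulated = percent_modulation_depth(1.0, 1.0)      # no swing at all
```

With these inputs the three results are 100%, 25%, and 0%, respectively: depth depends only on the ratio of the swing to the mean level, not on the absolute level.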

Table 1.

Key Results and Conditions From Previous Experiments.

Source	$n$	$Δ F$	Auditory modulation	Outcome
Sun et al. 2017-1 $^{1}$	12	2 Hz	25%	Cong>Incong; Incong $\sim$ Ctrl
Sun et al. 2017-2	11	2 Hz	25%	Cong>Incong; Incong $\sim$ Ctrl
Varghese et al. 2017	13	1 Hz	50%	Cong>Incong; Incong $≪$ Ctrl
Zhou 2019 $^{2, 3}$	27	1.5 Hz	100%	Cong>Incong; Incong $\sim$ Ctrl
Sun and Sekuler 2021 $^{4}$	19	3 Hz	25%	Cong>Incong; Incong $\leq$ Ctrl

$Δ$ F = frequency difference between possible visual stimuli; Ctrl = Control condition; Cong = condition in which auditory and visual stimuli matched in frequency; Incong = condition in which auditory and visual stimuli modulated at different frequencies. $^{1}$ Ctrl condition included no sound. $^{2}$ Cong accuracy was near upper limit. $^{3}$ Unpublished Master’s thesis. $^{4}$ Data only from the portion of study with a constant inter-trial interval.

Experiment One

Experiment One used the Fish Police!! platform to examine how audiovisual interaction was affected by the difference between visual and auditory modulation frequencies. Multiple accounts of audiovisual interaction make a simple prediction for accuracy of categorization in the Incongruent condition: the closer the two conflicting frequencies are to one another, the poorer will be the accuracy in that condition.

Based on preliminary tests, three different frequency pairs were selected: 5 and 6 Hz, 5 and 7 Hz, and 5 and 8 Hz. Note that these values spanned the range used by previous studies with Fish Police!! (see Table 1). Throughout the experiment, 5 Hz was always the lower of the two possible frequencies; subjects were instructed to categorize a fish that oscillated at 5 Hz as a slow fish, making their response as rapidly as possible while trying to ignore the accompanying sound. The alternate frequency varied from one block of trials to another, taking values of 6, 7, and 8 Hz. Whatever that alternate frequency might be, though, fish modulating at that rate were to be categorized as faster.

Subjects

Subjects were 30 Brandeis undergraduate and graduate students, 19 to 24 years old. Twenty-six of them participated for course credit, the rest for pay; 23 of the 30 subjects were female. Visual acuity was measured prior to the experiment using the ETDRS vision chart at 60 cm viewing distance (Rosser et al., 2001). All subjects had normal or corrected-to-normal visual acuity (logMAR acuity < 0.1 for all subjects), and all self-reported normal hearing. Procedures were approved by Brandeis University’s Committee for the Protection of Human Subjects.

Stimuli

The visual stimulus was an image of a brightly colored clown fish (Amphiprion ocellaris), which swam steadily across a computer monitor at 25 $^{\circ}$ /sec. The fish’s mean luminance was 22 cd/m $^{2}$ , a value approximating one used in a previous study (Sun and Sekuler, 2021). On each trial, the fish appeared at either left or right side of the display and swam steadily to the other side. As the fish swam, its entire body modulated sinusoidally in size around a mean dorsal-ventral height of 2.5 $^{\circ}$ visual angle. Modulation depth for the visual stimulus, as defined in [Eq. 1], was 20%. For visual modulation, maximum and minimum values corresponded to the fish’s body size; for auditory modulation, those values corresponded to the amplitude of the auditory stimulus. Oscillation frequencies were selected from the set [5, 6, 7, and 8] Hz.

Auditory stimuli were delivered through a computer speaker (BOKA 81000), at a mean level of 48 dB$_{SPL}$, measured at a subject’s ears. The sound stream comprised a six-component harmonic series of 220–1320 Hz, whose components’ power tapered off with frequency. The amplitude of the sound stream either modulated sinusoidally over time with 100% modulation depth [Eq. 1], or in a Control condition was unmodulated. To avoid a distracting transient at stimulus onset, the Control condition’s unmodulated sound ramped on during its first 100 msec. As Table 2 suggests, when visual and auditory stimuli were modulated at non-matching frequencies, the mismatch was designed to promote a kind of competition between the two, much as the opposite pointing arrowheads do in some versions of Eriksen’s Flanker Task.
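The structure of the sound stream can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the stimulus code used in the experiment (which was written in Matlab): the 1/k amplitude taper across harmonics, the 1 kHz envelope sampling, and all function names are assumptions, since the text states only that component power tapered off with frequency.

```python
import math


def envelope(t: float, mod_rate_hz: float, depth_pct: float) -> float:
    """Sinusoidal amplitude envelope; depth_pct follows Eq. 1 (0 = unmodulated)."""
    m = depth_pct / 100.0
    return 1.0 + m * math.sin(2.0 * math.pi * mod_rate_hz * t)


def carrier(t: float, f0: float = 220.0, n_harmonics: int = 6) -> float:
    """Six-component harmonic series, 220-1320 Hz.

    ASSUMPTION: amplitudes taper as 1/k; the paper does not give the taper.
    """
    return sum(math.sin(2.0 * math.pi * k * f0 * t) / k
               for k in range(1, n_harmonics + 1))


def am_sample(t: float, mod_rate_hz: float, depth_pct: float) -> float:
    """One sample of the amplitude-modulated sound at time t (seconds)."""
    return envelope(t, mod_rate_hz, depth_pct) * carrier(t)


# Recover the modulation depth (Eq. 1) from one second of a 5 Hz envelope.
env = [envelope(i / 1000.0, 5.0, 100.0) for i in range(1000)]
measured_depth = (max(env) - min(env)) / (max(env) + min(env)) * 100.0
```

Applying Equation 1 to the sampled envelope returns the nominal depth, which is one way to confirm that a synthesized stimulus has the intended modulation.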

Table 2.

Constitution of the three categories of audiovisual match (avMatch)

Rate of size modulation	Sound modulation rate
	5 Hz	None	5 Hz + $Δ$ Freq
5 Hz	Congruent	Control	Incongruent
5 Hz + $Δ$ Freq	Incongruent	Control	Congruent

Schematic depiction of Experiment One’s stimulus conditions. On Congruent trials the size modulation frequency and starting sine phase matched the frequency and starting sine phase of the concurrent, synchronized sound modulation; those matched frequencies could be either both 5 Hz or 5 Hz + $Δ$ Freq, where $Δ$ Freq was 1, 2, or 3 Hz. On Incongruent trials, the rate of size and auditory modulations were mismatched. On Control trials, the sound was unmodulated.

Procedure

The experiment, programmed in Matlab R2015a, presented stimuli on a 21.5 inch iMac under OSX Yosemite (version 10.10.5). For subjects’ comfort, a lamp located behind the subject provided a constant, low level of illumination in the testing room. An adjustable chin rest, located 57 cm from the monitor, supported the subject’s head and chin.

Before each trial, a small black fixation cross was presented for 500 ms on a gray background. Then, after a random delay of 300–800 ms, the visual and auditory stimuli were presented, along with a background image depicting a sea floor. Subjects categorized the frequency at which the fish’s size modulated, while trying to disregard the accompanying sound. Responses were signaled by pressing either the P or Q key on the computer keyboard; the mapping of keys was counterbalanced across subjects. Both fish and background disappeared, and the sound stream ended when a response was made; otherwise they disappeared after two seconds. Brief distinctive high and low pitch tones provided immediate feedback about response correctness. The next trial began two seconds later. We chose that inter-trial interval expecting it would suffice to control time pressure and subject stress (Sun and Sekuler, 2021; Sussman et al., 2021).

Each block of trials comprised randomly interspersed, equal numbers of trials from three categories of audiovisual match (hereafter, avMatch) between the visual and auditory stimulus: Congruent (visual and auditory stimuli modulated in synchrony at the same frequency), Control (the size of fish oscillated but the accompanying sound amplitude was unmodulated), and Incongruent (visual and auditory stimuli modulated at different frequencies). Table 2 summarizes the conditions.

In every block of trials, the lower modulation frequency for both visual and auditory stimuli was 5 Hz; the higher modulation frequency varied between blocks of trials, taking on values of 6, 7, or 8 Hz. We will use the term $Δ$ Freq to designate the difference between the high and low frequencies. Note that $Δ$ Freq differentially affects each category of avMatch. For stimuli from the Incongruent category, whether the visual stimulus was 5 Hz or 5+ $Δ$ Freq Hz, the auditory stimulus was constrained to be the opposite, that is, 5+ $Δ$ Freq or 5 Hz, respectively. So for such stimuli, $Δ$ Freq was the difference in Hz between auditory and visual stimuli. That was not the case for the other categories of avMatch. Specifically, no matter whether the visual stimulus in the Control category was 5 Hz or 5+ $Δ$ Freq Hz, the auditory stimulus was the same steady, unmodulated tone. In the Congruent category, no matter whether the visual stimulus was 5 Hz or 5+ $Δ$ Freq Hz, the accompanying auditory stimulus always had that same frequency.

Within each 90-trial block, there were 30 trials each of Congruent, Control, and Incongruent stimuli, presented in random order. Ad lib breaks were allowed between blocks. Subjects completed two blocks of trials with each of the three frequency pairs. Within each set of three blocks, the order of blocks was randomized. Each subject received 540 trials, spread over six block-randomized sets of 90 trials each. As mentioned already, $Δ$ Freq was fixed within each block.
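The design described above can be summarized in a short Python sketch that builds one subject's trial schedule. It is a hypothetical reconstruction, not the original Matlab code: the function names are invented, and the equal split between slow and fast fish within each avMatch category is an assumption the text does not state.

```python
import random

BASE_HZ = 5
AV_MATCH = ("Congruent", "Control", "Incongruent")


def auditory_rate(visual_hz, av_match, delta_freq):
    """Sound modulation rate in Hz, or None for the unmodulated Control sound."""
    if av_match == "Control":
        return None
    if av_match == "Congruent":
        return visual_hz
    # Incongruent: the sound takes whichever rate the fish does not have.
    return BASE_HZ + delta_freq if visual_hz == BASE_HZ else BASE_HZ


def build_block(delta_freq, rng):
    """One 90-trial block: 30 trials per avMatch category, shuffled."""
    trials = []
    for av_match in AV_MATCH:
        for i in range(30):
            # ASSUMPTION: slow and fast fish occur equally often per category.
            visual_hz = BASE_HZ if i % 2 == 0 else BASE_HZ + delta_freq
            trials.append({
                "delta_freq": delta_freq,
                "av_match": av_match,
                "visual_hz": visual_hz,
                "auditory_hz": auditory_rate(visual_hz, av_match, delta_freq),
            })
    rng.shuffle(trials)
    return trials


def build_session(rng):
    """Two sets of three blocks; block order is randomized within each set."""
    blocks = []
    for _ in range(2):
        deltas = [1, 2, 3]
        rng.shuffle(deltas)
        blocks.extend(build_block(d, rng) for d in deltas)
    return blocks


session = build_session(random.Random(0))
total_trials = sum(len(block) for block in session)
```

The resulting schedule has six blocks and 540 trials, matching the counts given in the text.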

The experiment consisted of one practice block and six experimental blocks per subject. The practice block comprised six trials, in which all three categories of avMatch were represented equally. The modulation frequencies in the practice block were 5 and 7 Hz, representing the central value of $Δ$ Freq that would be used in the experiment. An entire experimental session for each subject took about 60 minutes.

For each subject, response accuracy was defined by the proportion of trials on which the visual stimulus was correctly categorized (as either “slower” or “faster”). One subject’s data were discarded for low mean accuracy, 0.66 proportion correct responses over all conditions. That value was 2.7 standard deviations below the mean, 0.85 correct, for all other subjects. After this subject was excluded, 29 subjects’ data were left for formal analysis. Running several analyses with and without that subject confirmed that the exclusion had a negligible effect. The 29 retained subjects failed to respond within the 2 s time limit just 27 times out of a total of 15,640 trials. Those trials were not included in accuracy calculations.

Analyses for both experiments used several R packages, most importantly Afex: Analysis of Factorial EXperiments (Singmann et al., 2016), emmeans (Lenth, 2021), and lme4 (Bates et al., 2015). Analysis of variance (ANOVAs) generated Type III sum of squares using subjects’ mean accuracy in each condition. lme4’s glmer() model was applied to individual trial-by-trial accuracy data; the analysis included a random intercept for subjects. This reduced some between-subject variance from our within-subject, repeated measures design in which responses from any one person were more similar than responses from other people. Because accuracy responses were binary (that is, correct or not correct), the regression used a logit link function.

Results: Response Accuracy

Figure 1 shows how accuracy of categorization varied with frequency difference, $Δ$ Freq, and with various conditions of avMatch. An ANOVA confirmed that each independent variable, $Δ$ Freq and category of avMatch, significantly affected response accuracy, as summarized in Table 3. The interaction between the two independent variables was also significant, though, as Table 3 shows, with a smaller effect size. The figure suggests that differences among the three conditions, Congruent, Control, and Incongruent, varied across levels of $Δ$ Freq, with the differences among them being largest at $Δ$ Freq = 1 Hz, and smallest at $Δ$ Freq = 3 Hz. One other feature of Figure 1 deserves comment: the change in accuracy in the Congruent condition with changes in $Δ$ Freq. In order to understand that change, it is important to remember the subjects’ task. Auditory and visual modulations matched one another on every Congruent trial, though subjects’ task was not to judge the match, but simply to categorize the frequency as higher or lower. As a result, it is unsurprising that a larger difference between possible frequencies, say $Δ$ Freq = 3, would make the task easier than a smaller one, say $Δ$ Freq = 1.

Figure 1.

Categorization accuracy for various combinations of (i) difference between the high and low modulation frequencies, which also determines the difference between visual and auditory modulation frequencies on Incongruent trials, and (ii) three classes of audiovisual match. Each error bar spans the 95% confidence interval. Data points represent individual subjects.

Table 3.

Analysis of Variance (ANOVA) on Accuracy in Experiment One

	dfs	$ε$	SSn	SSd	F value	p value	$η_{partial}^{2}$
$Δ$ Freq	1.94,54.23	0.97	1.74	0.39	124.13	<.001	0.816
avMatch	1.42,39.87	0.71	0.98	0.46	60.42	<.001	0.683
avMatch $\times Δ$ Freq	2.93,82.02	0.73	0.04	0.38	3.34	<0.024	0.107

$ε$ is the Greenhouse-Geisser correction factor; $η_{partial}^{2}$ is a measure of effect size.

As the Introduction explained, we were interested in resolving discrepancies among previous experiments. This interest lent special importance to the interaction between avMatch and $Δ$ Freq. So we followed up the omnibus ANOVA’s significant interaction by examining key pairwise differences that might have contributed to that interaction. The data for this analysis were subjects’ mean accuracies in each of the nine combinations of $Δ$ Freq and Condition. Because not all of the (9 $\times$ 8)/2 possible pairwise comparisons were equally relevant to the experiment’s purpose, we narrowed the analysis to the nine differences in accuracy for avMatch categories within the same $Δ$ Freq value.

Table 4 shows the resulting pairwise comparisons within each level of $Δ$ Freq. Note first that for each avMatch comparison, the difference in accuracy decreases monotonically as $Δ$ Freq increases. Although the table shows accuracy differences among categories of avMatch, as we anticipated, the table’s results do not fully explain inconsistencies among the previous experiments shown in Table 1. In particular, our results do not explain why the relationship between Control and Incongruent varied among those previous experiments. In our results, Control was consistently more accurate than Incongruent, which suggests that some other variable must be considered in order to explain the previous discrepancies. We do so later, in Experiment Two.

Because categorization was so accurate for many subjects, particularly at $Δ$ Freq = 3 Hz, the significant interaction between avMatch and $Δ$ Freq might have been a ceiling effect, a situation in which the measured variable approaches or hits the upper limit of the measurement scale. To address this possibility, we analyzed individual trials rather than aggregate mean proportions correct. Each response was treated as a binary value (correct or incorrect), and entered into a generalized linear mixed effects model with a logit link function (Warton and Hui, 2011). This allowed us to account not only for variability from trial type, but also for the nesting of trials within subjects. Because multiple responses from the same subject are likely to be more similar than responses from other subjects, accounting for both trial type and subject-level variance could reduce error in our models (Singmann and Kellen, 2019). In other words, we assumed that each subject had an idiosyncratic overall level of accuracy. By taking account of subject-dependent performance, this approach would more adequately address possible ceiling effects, while also giving the added advantage of increased power.
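Why a ceiling can mimic an interaction on the proportion-correct scale, and why the logit scale helps, can be illustrated with a small Python sketch. The baseline accuracies and the size of the shift are hypothetical values chosen for illustration; they are not estimates from the experiment.

```python
import math


def logit(p: float) -> float:
    """Log-odds of a proportion p (0 < p < 1)."""
    return math.log(p / (1.0 - p))


def inv_logit(x: float) -> float:
    """Proportion corresponding to log-odds x."""
    return 1.0 / (1.0 + math.exp(-x))


def accuracy_change(baseline_p: float, shift_log_odds: float) -> float:
    """Change in proportion correct produced by a fixed log-odds shift."""
    return inv_logit(logit(baseline_p) + shift_log_odds) - baseline_p


# The same hypothetical cost (-0.5 log-odds) applied at two baselines.
change_mid = accuracy_change(0.70, -0.5)   # baseline far from ceiling
change_high = accuracy_change(0.97, -0.5)  # baseline near ceiling
```

An identical effect on the log-odds scale produces a much smaller change in proportion correct when the baseline is near ceiling. Unequal differences in raw accuracy across levels of $Δ$ Freq could therefore arise without any true interaction, which is the concern that the logit-link model is meant to address.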

Table 4.

Select Pairwise Differences From Experiment One: Accuracy Results

$Δ$ Freq		Cong – Ctrl	Ctrl – Incong	Cong – Incong
1	Difference	0.0706	0.1195	0.1900
	p value	<.0002	<.0001	<.0001
2	Difference	0.0455	0.0901	0.1357
	p value	<.0002	<.0002	<.0001
3	Difference	0.0255	0.0894	0.1149
	p value	<.0033	<.0001	<.0001

p values corrected with Holm-Bonferroni method.
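For readers unfamiliar with the Holm-Bonferroni step-down procedure, a minimal Python sketch follows. The input p values are hypothetical, and the sketch does not assume which family of comparisons was corrected in Table 4.

```python
def holm_bonferroni(p_values):
    """Return Holm-adjusted p values in the original order.

    The smallest raw p is multiplied by m, the next by m - 1, and so on;
    a running maximum keeps the adjusted values monotone, capped at 1.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Hypothetical raw p values for three comparisons.
adjusted = holm_bonferroni([0.01, 0.04, 0.03])
```

For these inputs the adjusted values are approximately 0.03, 0.06, and 0.06. The largest raw p value (0.04) inherits the adjusted value of the one ranked before it, because adjusted values may never decrease with rank.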

The generalized linear model included a random subject-dependent intercept, six fixed effects, and a term for residual variance. Table 5 includes the subject-dependent intercept (term [1]) along with the fixed effects, and gives the results associated with each. A note of explanation may be helpful in interpreting the table. When lme4’s glmer() routine includes a factor, e.g., avMatch, its levels are evaluated relative to a single, designated reference level, by default the level that would be first alphabetically. Because the Control condition seemed a more natural reference category, we forced the model to reference other categories to it.

Table 5.

Summary of GLMER for Experiment 1: Accuracy data

	Effect	Estimate	Std Error	z value	p (>|z|)
$[1]$	(Intercept)	0.276	0.148	1.860	.063
$[2]$	$Δ$ Freq	0.950	0.059	16.187	< 2 × $10^{-16}$
$[3]$	avMatch.Incong	$-0.308$	0.134	$-2.299$	.021
$[4]$	avMatch.Cong	0.301	0.160	1.875	.061
$[5]$	$Δ$ Freq $\times$ avMatch.Incong	$-0.261$	0.073	$-3.566$	<.001
$[6]$	$Δ$ Freq $\times$ avMatch.Cong	0.150	0.096	1.571	.116

Fixed effect $[1]$ captures individual subjects’ regression intercepts; a second fixed effect, $Δ$ Freq ( $[2]$ ), represents the three levels of difference between the auditory and visual modulation frequencies. Two fixed effects, $[3]$ and $[4]$, represent the main effect of the match between auditory and visual stimuli: Congruent, Incongruent, and Control. The model treated these as ordered categorical variables, one of which ( $[3]$ ) contrasted Incongruent against Control, and the other of which ( $[4]$ ) contrasted Congruent against Control. The two remaining fixed effects ( $[5]$ and $[6]$ ) represent the interaction between $Δ$ Freq and each of the avMatch contrasts in turn.

In Table 5, term [1] shows a significant subject-dependent intercept, reflecting the fact that subjects differ from one another in overall accuracy. Term [2] shows that $Δ$ Freq strongly affected accuracy. Referring back to Figure 1, it is clear that overall, accuracy is lowest when the difference between the alternative frequencies is smallest. The Estimate and z-value for Term [3] show that, as a whole, over levels of $Δ$ Freq, accuracy is worse in the Incongruent condition than it is in the model’s reference Control condition. Note that the term representing the Congruent condition, Term [4], is not only opposite in sign to Term [3], as expected, but also smaller. So the loss in accuracy with an incongruent auditory stimulus is larger than the benefit with a congruent auditory stimulus. The effects represented by Terms [5] and [6] extend this inequality between performance costs and benefits to the pair of interactions between $Δ$ Freq and avMatch. Those interactions are opposite in sign and differ considerably in magnitude, with the interaction involving Incongruent auditory stimuli being much larger than the interaction involving its Congruent counterpart. Referring back to Figure 1, this interaction is shown by the variation in the difference, Control - Incongruent, across levels of $Δ$ Freq.
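To make the log-odds estimates in Table 5 more concrete, the following Python sketch converts them to predicted proportions correct. It rests on two assumptions that the text does not confirm: that $Δ$ Freq entered the model as a numeric predictor coded 1, 2, or 3 (in Hz), and that the prediction is for a subject whose random intercept is zero. The function and dictionary names are illustrative.

```python
import math

# Fixed-effect estimates copied from Table 5 (log-odds scale).
COEF = {
    "intercept": 0.276,
    "dfreq": 0.950,
    "incong": -0.308,
    "cong": 0.301,
    "dfreq_x_incong": -0.261,
    "dfreq_x_cong": 0.150,
}


def predicted_accuracy(delta_freq: float, av_match: str) -> float:
    """Predicted proportion correct, with Control as the reference level.

    ASSUMPTIONS: delta_freq is coded numerically in Hz; random intercept = 0.
    """
    eta = COEF["intercept"] + COEF["dfreq"] * delta_freq
    if av_match == "Incongruent":
        eta += COEF["incong"] + COEF["dfreq_x_incong"] * delta_freq
    elif av_match == "Congruent":
        eta += COEF["cong"] + COEF["dfreq_x_cong"] * delta_freq
    elif av_match != "Control":
        raise ValueError(f"unknown avMatch category: {av_match}")
    return 1.0 / (1.0 + math.exp(-eta))


# Cost of a mismatched sound and benefit of a matched sound at 1 Hz.
cost = predicted_accuracy(1, "Control") - predicted_accuracy(1, "Incongruent")
benefit = predicted_accuracy(1, "Congruent") - predicted_accuracy(1, "Control")
```

Under these assumptions the predicted cost at $Δ$ Freq = 1 Hz is about 0.11 and the predicted benefit about 0.07, close to the corresponding differences in Table 4 and consistent with the conclusion that costs exceed benefits.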

As just mentioned, the GLMER showed a significant interaction, but only for the difference between Control and Incongruent conditions. To check the importance of including an interaction term in the model, we recomputed the model, this time omitting the interaction between the two fixed variables. Note that the model with the interaction had two more degrees of freedom than the model without. We then compared the two nested models with an analysis of variance. The result, $χ^{2}$ (df = 2) = 27.698, p < .0001, showed that the interaction term improved model fit above what was expected from the additional degrees of freedom alone.
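The p value for this likelihood-ratio comparison can be checked by hand, because the upper-tail probability of a chi-square variable with exactly two degrees of freedom has the closed form exp(-x/2). The Python sketch below applies that identity to the reported statistic; the function name is illustrative, and the shortcut is valid only for df = 2.

```python
import math


def chi2_upper_tail_df2(x: float) -> float:
    """P(X > x) for a chi-square variable with 2 degrees of freedom."""
    if x < 0:
        raise ValueError("chi-square statistic must be non-negative")
    return math.exp(-x / 2.0)


# Statistic reported for the comparison of the two nested models.
p_value = chi2_upper_tail_df2(27.698)
```

The result is on the order of one in a million, comfortably below the reported bound of .0001.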

Results: Response Times

As explained in the “Introduction” section, our principal focus was on how accurately subjects categorized visual modulation frequency under various conditions. Although response times held just secondary interest, for the sake of completeness, we decided to supplement the analysis of accuracy by examining response times as well. The analysis was based on subjects’ median response time in each condition, considering only trials on which the response was correct. Figure 2 shows results for individual subjects in each condition. Table 6 presents a corresponding, within-subject repeated measures ANOVA. Both main effects, the types of avMatch and values of $Δ$ Freq, were statistically significant ( $p <$ .001), and each accounted for a significant portion of the overall variance. In contrast, the interaction between the effects was not significant. Comparisons between pairs of conditions (Table 7) show that for all categories of $Δ$ Freq, the same relationships hold: response times on Incongruent trials were reliably longer than on either Control or Congruent trials, but the difference between Congruent and Control did not differ reliably.

Figure 2.

Response times for various combinations of (i) the difference on Incongruent trials between modulation frequencies of the visual stimulus and auditory stimulus, and (ii) three classes of audiovisual match. Each error bar spans the 95% confidence interval. Data points represent individual subjects.

Table 6.

Analysis of Variance (ANOVA) on Response Times in Experiment One

	dfs	$ε$	SSn	SSd	F value	p value	$η_{partial}^{2}$
$Δ$ Freq	1.63,45.61	0.81	762182.49	907735.68	23.51	<.001	0.456
avMatch	1.81,50.72	0.91	177234.76	155864.74	31.84	<.001	0.532
avMatch $\times Δ$ Freq	2.90,81.14	0.72	2923.80	130901.37	0.63	.595	0.022

$ε$ is the Greenhouse-Geisser correction factor; $η_{partial}^{2}$ is a measure of effect size.
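The F values and effect sizes in Table 6 follow directly from the tabled sums of squares, as the Python sketch below shows. It assumes the standard uncorrected degrees of freedom for a fully within-subject design with 29 subjects and three levels per factor; because the Greenhouse-Geisser correction scales numerator and denominator degrees of freedom by the same $ε$, it leaves F unchanged. Function and variable names are illustrative.

```python
N_SUBJECTS = 29
LEVELS = 3  # levels of both Delta-Freq and avMatch


def f_and_partial_eta_sq(ss_effect, ss_error, df_effect, df_error):
    """F ratio and partial eta-squared from sums of squares."""
    f_value = (ss_effect / df_effect) / (ss_error / df_error)
    partial_eta_sq = ss_effect / (ss_effect + ss_error)
    return f_value, partial_eta_sq


# ASSUMPTION: uncorrected dfs, i.e. (2, 56) for main effects, (4, 112) for
# the interaction.
df_main = (LEVELS - 1, (LEVELS - 1) * (N_SUBJECTS - 1))
df_inter = ((LEVELS - 1) ** 2, (LEVELS - 1) ** 2 * (N_SUBJECTS - 1))

# SSn and SSd copied from Table 6.
f_dfreq, eta_dfreq = f_and_partial_eta_sq(762182.49, 907735.68, *df_main)
f_match, eta_match = f_and_partial_eta_sq(177234.76, 155864.74, *df_main)
f_inter, eta_inter = f_and_partial_eta_sq(2923.80, 130901.37, *df_inter)
```

Rounded to the precision of the table, these computations reproduce the reported F values (23.51, 31.84, 0.63) and partial eta-squared values (0.456, 0.532, 0.022).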

Table 7.

Select Pairwise Differences From Experiment One: Response times

$Δ$ Freq		Cong – Ctrl	Ctrl – Incong	Cong – Incong
1	Difference	$- 9.7586$	$- 42.9483$	$- 52.7069$
	p value	0.5072	<.0021	<.0005
2	Difference	$- 13.7931$	$- 44.1034$	$- 57.8966$
	p value	0.1129	<.0007	<.0001
3	Difference	$- 16.1552$	$- 55.3276$	$- 71.4828$
	p value	<0.0252	<.0001	<.0001

p values corrected with Holm-Bonferroni method.

Experiment One focused on $Δ$ Freq, one of three variables singled out in Table 1’s summary of previous Fish Police!! studies. Before introducing Experiment Two, which will examine the impact of modulation depth, another of those three variables, it is worth considering the third variable, n, the number of subjects. The considerable, nearly threefold variation in n across experiments encourages the question: How many subjects are actually needed to produce reliable effects with the Fish Police!! protocol? For an answer, the R package Superpower (Lakens and Caldwell, 2021) used the data from Experiment One to simulate the experiment’s outcome 1,000 times. For each main effect in the ANOVA, 99% power was exceeded by n = 10, the smallest number of subjects simulated; for the ANOVA’s interaction term, 99% power was exceeded when n $\geq$ 23 (80% power was reached by n = 15). Clearly, Experiment One tested more subjects than actually needed to generate reliable results.

Experiment Two

The task-irrelevant auditory stimulus in Experiment One was modulated at 100%, a higher value than in most previous experiments with the Fish Police!! protocol: 25% in Sun et al. (2017) and in Sun and Sekuler (2021), and 50% in Varghese et al. (2017). Experiment One showed that with $Δ$ Freq = 1, accuracy in the Incongruent condition was clearly reduced relative to the Control condition, an important result not found in several previous experiments (Sun et al., 2017; Varghese et al., 2017). Experiment Two built on that result, using only $Δ$ Freq = 1 so as to reduce complications introduced when accuracy was at or near the upper limit of psychophysical performance. As in Experiment One, subjects were instructed to respond as rapidly as possible while trying to ignore the accompanying sound. Throughout the experiment, fish whose size modulated at 5 Hz were to be categorized by subjects as “slower,” and fish whose size modulated at 6 Hz were to be categorized by subjects as “faster.”

Subjects

Ten Brandeis undergraduate and graduate students, nine female, participated in this experiment. Four subjects had participated in Experiment One. Each subject received $10 for participation. Visual acuity was again measured using the ETDRS vision chart at a viewing distance of 60 cm (Rosser et al., 2001). Subjects had normal or corrected-to-normal logMAR visual acuity ( $\leq$ 0.2), and reported normal hearing. Through an oversight, subjects’ ages were not recorded.

Stimuli

Except for the following changes, the experiment used the same apparatus, task, and testing environment as Experiment One. For the amplitude modulated sound, two new modulation depths (60% and 20%) were added to the 100% modulation carried over from Experiment One. Throughout, $Δ$ Freq was maintained at 1 Hz, with auditory and visual modulation rates of 5 or 6 Hz. Note that subjects confirmed the audibility of even the lowest depth of auditory modulation, 20% (Joris et al., 2004).

Procedure

In Experiment Two, the task-irrelevant auditory stimulus’ modulation depth varied between blocks in block-randomized fashion. $Δ$ Freq was fixed within each 48-trial block, which was repeated three times. Each block began with six practice trials. Testing took about 50 minutes for each subject, yielding 432 trials per condition, fewer than in Experiment One. Curtailed access to the test equipment limited testing to just ten subjects.

Results: Response Accuracy

Figure 3 shows mean categorization accuracy for individual subjects as a function of the strength (modulation depth) of the concurrent auditory stimulus. Within each level of modulation depth, results are separated according to the category of avMatch. As a reminder, these were Congruent (both stimuli modulating in phase at the same frequency), Control (the auditory stimulus was unmodulated), and Incongruent (the visual and auditory stimuli modulated at frequencies that differed by one Hz). Table 8 summarizes the results of an ANOVA on subjects’ mean accuracy data. Both main effects and the interaction between them were statistically significant.

Figure 3.

Categorization accuracy for various combinations of (i) the difference on Incongruent trials between modulation frequencies of the visual stimulus and auditory stimulus, and (ii) three classes of audiovisual match. $Δ$ Freq = 1 for all conditions. Each error bar spans the 95% confidence interval. Data points represent individual subjects.

Table 8.

ANOVA on Accuracy Results From Experiment Two

	df	$ε$	SSn	SSd	F value	p value	$η_{\mathrm{partial}}^{2}$
Modulation	1.37,12.30	0.68	0.08	0.06	12.56	<.002	0.583
avMatch	1.24,11.20	0.62	0.34	0.16	18.53	<.001	0.673
avMatch $\times$ Modulation	2.60,23.40	0.65	0.85	0.10	7.46	<.002	0.453

p values corrected with Holm-Bonferroni method. $ε$ is the Greenhouse-Geisser correction factor. $η_{\mathrm{partial}}^{2}$ is a measure of effect size.

To uncover the components of the significant interaction between Modulation and avMatch, we examined pairwise differences between categories of avMatch within each level of Modulation. The results are summarized in Table 9. The table reveals a strong connection between depth of modulation and differences among types of avMatch. Not only are there no reliable differences at the lowest level of Modulation (20%), but also pairwise differences between avMatch categories increase systematically with Modulation depth.

Table 9.

Select Pairwise Differences From Experiment Two: Accuracy Results

Modulation		Cong – Ctrl	Ctrl – Incong	Cong – Incong
20%	Difference	0.0229	0.0354	0.0583
	p value	0.5330	0.5330	0.2128
60%	Difference	0.0771	0.0741	0.1512
	p value	<0.0374	<0.0374	<0.0001
100%	Difference	0.1000	0.1400	0.2410
	p value	<0.0027	<0.0001	<0.0001

p values corrected with Holm-Bonferroni method.
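
For readers unfamiliar with the Holm-Bonferroni procedure, a minimal sketch follows. It implements the standard step-down adjustment; the raw p values are hypothetical, because the uncorrected values behind Table 9 are not reported here.

```python
def holm_adjust(p_values):
    """Holm-Bonferroni adjusted p values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # Multiply the k-th smallest p value by (m - k + 1), cap it at 1,
        # and never let an adjusted value fall below an earlier one.
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted

raw = [0.01, 0.04, 0.03]  # hypothetical uncorrected p values for three contrasts
adjusted = holm_adjust(raw)
```

The monotonicity step can leave two contrasts with identical adjusted values, which may be why some neighboring entries in Table 9 share the same p value.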

As was the case in Experiment One, several subjects’ accuracy in some conditions approached the limit of measurable performance (a ceiling effect). To guard against the possibility that this was primarily responsible for the significant interaction in the ANOVA, we applied a generalized linear mixed model to Experiment Two’s trial-by-trial accuracy data. The model included a random, subject-dependent intercept, six main and interaction fixed effects, and a term for residual variance. Table 10 gives the results associated with each term. Term $[1]$ captures individual subjects’ regression intercepts. Term $[2]$ , Modulation, represents the three depths of modulation of the task-irrelevant auditory stimulus: 20%, 60%, and 100%. Two fixed effects, $[3]$ and $[4]$ , represent the main effect of the match between the auditory and visual frequencies: Congruent, Incongruent, and Control. The model treated these as categorical variables, one of which ( $[3]$ ) contrasted Incongruent against Control, and the other of which ( $[4]$ ) contrasted Congruent against Control. The two remaining fixed effects ( $[5]$ and $[6]$ ) represent the interaction between Modulation and each type of avMatch, both relative to Control.

Table 10.

Summary of GLMER for Experiment 2: Accuracy Data

	Effect	Estimate	Std Error	z value	p>z
$[1]$	(Intercept)	1.690	0.002	−2.501	1.24 $10^{8}$
$[2]$	Modulation depth (Mod)	−0.005	0.059	16.187	0.012
$[3]$	avMatch.Incong	−0.012	0.195	−0.443	0.658
$[4]$	avMatch.Cong	0.080	0.209	0.383	0.702
$[5]$	Mod $\times$ avMatch.Incong	−0.006	0.003	−2.226	0.026
$[6]$	Mod $\times$ avMatch.Cong	0.006	0.003	2.070	0.039

Table 10 summarizes the regression’s results. First, the depth of modulation of the task-irrelevant auditory stimulus (Term [2]) was statistically reliable. Although neither comparison involving avMatch ([3] and [4]) was significant, both interactions between avMatch and depth of modulation were ([5] and [6]). This last result is consistent with the finding shown in Table 9 that differences among categories of avMatch vary with the depth of the auditory stimulus’ modulation.
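
One way to read the estimates in Table 10 is to convert them into predicted accuracies. The sketch below rests on two assumptions that the text does not state: that the model used a logit link (the usual choice for binary accuracy data), and that Modulation entered the model in percent units (20, 60, 100). It also sets the random intercept to zero, so the predictions describe a typical subject only.

```python
import math

# Fixed-effect estimates copied from Table 10 (assumed to be on the log-odds scale).
B = {"intercept": 1.690, "mod": -0.005,
     "incong": -0.012, "cong": 0.080,
     "mod_x_incong": -0.006, "mod_x_cong": 0.006}

def predicted_accuracy(mod_percent, av_match):
    """Predicted P(correct) for a typical subject (random intercept = 0)."""
    eta = B["intercept"] + B["mod"] * mod_percent
    if av_match == "Incongruent":
        eta += B["incong"] + B["mod_x_incong"] * mod_percent
    elif av_match == "Congruent":
        eta += B["cong"] + B["mod_x_cong"] * mod_percent
    elif av_match != "Control":
        raise ValueError(f"unknown avMatch category: {av_match}")
    # Inverse logit maps log-odds to probability.
    return 1.0 / (1.0 + math.exp(-eta))

predictions = {(m, c): predicted_accuracy(m, c)
               for m in (20, 60, 100)
               for c in ("Congruent", "Control", "Incongruent")}
```

Under these assumptions the predicted gap between Congruent and Incongruent trials widens as modulation depth grows, which is what the two interaction terms express.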

In order to unpack the interaction terms in Table 10, we followed up with three post hoc GLMERs, one for each Modulation depth. The analyses treated avMatch as a fixed effect and individual subjects as a random effect. The results are summarized in Table 11, which omits the significant intercept value from each GLMER. Note first the opposite signs associated with estimates for the contrasts at each Modulation level: they confirm what was already discussed, namely, that, compared to the Control condition, the Incongruent auditory stimulus reduced accuracy, while the Congruent auditory stimulus increased it. Moreover, the effect of avMatch tended to increase with modulation depth, which is consistent with what can be seen in Figure 3. Finally, at the weakest Modulation of the task-irrelevant auditory stimulus (20%), avMatch had negligible effect, with neither contrast against the Control condition approaching statistical significance.

Table 11.

Summary of GLMERs for separate Modulation levels

	Estimate	Std Error	z value	p>z
Modulation = 20%
$[1]$ Match.Incong	−0.238	0.168	−1.421	0.155
$[2]$ Match.Cong	0.170	0.018	0.966	0.334
Modulation = 60%
$[3]$ Match.Incong	−0.408	0.153	−2.678	0.007
$[4]$ Match.Cong	0.526	0.170	3.086	0.002
Modulation = 100%
$[5]$ Match.Incong	−0.701	0.147	−4.780	1.75e-06
$[6]$ Match.Cong	0.658	0.168	3.920	8.85e-05
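
The p values in Table 11 follow from the z values under the standard normal distribution. As a minimal sketch, assuming two-sided Wald tests (the usual output of a GLMER summary), the conversion is:

```python
import math

def two_sided_p(z):
    """Two-sided p value for a standard-normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))

# z values copied from Table 11.
z_values = {"20% Incong": -1.421, "20% Cong": 0.966,
            "60% Incong": -2.678, "60% Cong": 3.086,
            "100% Incong": -4.780, "100% Cong": 3.920}
p_values = {label: two_sided_p(z) for label, z in z_values.items()}
```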

Results: Response Times

Taking the same approach as for Experiment One, we next examined subjects’ response times on correct trials. The data were each subject’s median response time in each condition. Figure 4 shows each condition’s results for individual subjects along with their mean and 95% confidence intervals.
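
The data reduction described above can be sketched as follows. The record layout and field names (`subject`, `modulation`, `av_match`, `rt_ms`, `correct`) are hypothetical and are not taken from the published analysis code.

```python
import statistics
from collections import defaultdict

def median_rt_correct(trials):
    """Median response time per (subject, modulation, avMatch) cell, correct trials only."""
    cells = defaultdict(list)
    for t in trials:
        if t["correct"]:  # error trials are excluded before taking the median
            cells[(t["subject"], t["modulation"], t["av_match"])].append(t["rt_ms"])
    return {cell: statistics.median(rts) for cell, rts in cells.items()}

# Hypothetical trial records for one subject in one condition.
trials = [
    {"subject": 1, "modulation": 100, "av_match": "Incongruent", "rt_ms": 640, "correct": True},
    {"subject": 1, "modulation": 100, "av_match": "Incongruent", "rt_ms": 700, "correct": True},
    {"subject": 1, "modulation": 100, "av_match": "Incongruent", "rt_ms": 2000, "correct": False},
    {"subject": 1, "modulation": 100, "av_match": "Incongruent", "rt_ms": 720, "correct": True},
]
medians = median_rt_correct(trials)
```

Using the median rather than the mean limits the influence of occasional very slow responses on each subject's summary value.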

Figure 4.

Median response times for various combinations of (i) the difference on Incongruent trials between modulation frequencies of the visual stimulus and auditory stimulus, and (ii) three classes of audiovisual match. $Δ$ Freq = 1 for all conditions. Each error bar spans the 95% confidence interval. Data points represent individual subjects.

Table 12 summarizes an ANOVA on subjects’ median response times from correct trials. Neither Modulation depth nor type of avMatch had a significant effect on speed of response, although their interaction did. As in Experiment One, we followed up the ANOVA by examining select pairwise differences. In particular, we restricted the focus to pairs within the same level of Modulation depth; Table 13 shows the result. Only at the highest Modulation did any pairwise response time comparison reach statistical significance. Interestingly, at that highest modulation (100%), response times in the Incongruent condition were significantly slower than in the Control condition, whereas times in the Congruent condition did not differ significantly from those in the Control condition.

Table 12.

ANOVA on Response Time Results From Experiment Two

	df	$ε$	SSn	SSd	F	p	$η_{\mathrm{partial}}^{2}$
Modulation	1.61,14.46	0.80	15029.59	124057.85	1.09	.348	.108
avMatch	1.47,13.21	0.73	21256.56	52566.05	3.64	.066	.288
avMatch $\times$ Modulation	2.77,24.91	0.69	15996.86	39238.93	3.67	<.028	.290

p values corrected with Holm-Bonferroni method. $ε$ is the Greenhouse-Geisser correction factor. $η_{\mathrm{partial}}^{2}$ is a measure of effect size.
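
The F and effect-size columns of Table 12 can be recovered from the sums of squares. The sketch below assumes that SSn and SSd are the effect and error sums of squares, and it uses the uncorrected degrees of freedom for a 3 × 3 within-subject design with ten subjects. The Greenhouse-Geisser factor scales both degrees of freedom equally, so it changes the p value but not F.

```python
# (SSn, SSd, df_effect, df_error): sums of squares copied from Table 12,
# degrees of freedom assumed from the design (uncorrected).
rows = {
    "Modulation": (15029.59, 124057.85, 2, 18),
    "avMatch": (21256.56, 52566.05, 2, 18),
    "avMatch x Modulation": (15996.86, 39238.93, 4, 36),
}

def partial_eta_squared(ss_n, ss_d):
    """Effect sum of squares as a share of effect plus error sums of squares."""
    return ss_n / (ss_n + ss_d)

def f_ratio(ss_n, ss_d, df_n, df_d):
    """Ratio of the effect mean square to the error mean square."""
    return (ss_n / df_n) / (ss_d / df_d)

summary = {name: (round(f_ratio(*vals), 2), round(partial_eta_squared(vals[0], vals[1]), 3))
           for name, vals in rows.items()}
```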

Table 13.

Select Pairwise Differences From Experiment Two: Response Time Results

Modulation		Cong – Ctrl	Ctrl – Incong	Cong – Incong
20%	Difference	−1.3073	−6.0182	−7.3254
	p value	1.000	1.000	1.000
60%	Difference	−20.8383	−11.3991	−32.2374
	p value	.7972	.7972	.3312
100%	Difference	7.7917	−72.4157	−64.6240
	p value	.6569	<.0164	.0942

p values corrected with Holm-Bonferroni method.

Discussion and Conclusions

Our experiments show that categorizing the rate of size modulation is affected by both $Δ$ Freq, the difference between the slower and faster stimuli in a block of trials, and the depth of modulation of a task-irrelevant auditory stimulus. Overall, categorization accuracy increased as $Δ$ Freq grew, but decreased with stronger sound modulation. Additionally, response times tended to decrease with $Δ$ Freq, but were not appreciably affected by the irrelevant sound’s depth of modulation, except at the highest modulation. Relative to what is seen with the non-modulated Control, the general decrease in accuracy with a frequency mismatch tends to be larger than the increase in accuracy with a frequency match, that is, $|\text{Ctrl} - \text{Incong}| > |\text{Cong} - \text{Ctrl}|$ . Table 4 shows that in Experiment One that same inequality held for all values of $Δ$ Freq so long as Modulation = 100%. Experiment Two showed the same inequality with Modulation = 100% (Table 9). The response times in Tables 7 and 13 show a similar pattern of inequalities, although with an inverted sign because of the speed-accuracy trade-off in our results. Thus, in Experiment One for all values of $Δ$ Freq, $|\text{Cong} - \text{Ctrl}| < |\text{Ctrl} - \text{Incong}|$ , and the same holds in Experiment Two so long as Modulation = 100%.
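
The cost-versus-benefit inequality for Experiment Two can be checked directly against the pairwise differences in Tables 9 and 13. The following sketch simply compares magnitudes; it ignores whether the individual differences were statistically reliable, and the variable names are ours.

```python
# (Cong - Ctrl, Ctrl - Incong) per modulation depth, copied from Tables 9 and 13.
accuracy = {20: (0.0229, 0.0354), 60: (0.0771, 0.0741), 100: (0.1000, 0.1400)}
response_time = {20: (-1.3073, -6.0182), 60: (-20.8383, -11.3991), 100: (7.7917, -72.4157)}

def cost_exceeds_benefit(cong_minus_ctrl, ctrl_minus_incong):
    """True when the Incongruent cost is larger in magnitude than the Congruent benefit."""
    return abs(ctrl_minus_incong) > abs(cong_minus_ctrl)

asymmetry = {
    measure: {depth: cost_exceeds_benefit(*pair) for depth, pair in table.items()}
    for measure, table in (("accuracy", accuracy), ("response_time", response_time))
}
```

For both measures the inequality holds at Modulation = 100%, the depth at which the pairwise differences were most reliable.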

The accuracy results from Experiments One and Two go some way toward explaining many of the discrepancies among previous applications of Fish Police!! (shown in Table 1). Extrapolating from our results, combining small $Δ$ Freq with high modulation depth should cause accuracy on Incongruent trials to fall significantly below that on Control trials, as found in several previous studies with Fish Police!!. Two previous results, however, depart from this expected pattern. First, Sun and Sekuler (2021) found a large accuracy loss with Incongruent stimuli relative to accuracy on Control trials despite using a large value of $Δ$ Freq and a low auditory modulation depth. The replicability of this finding might be tested in some future study that combined low Modulation amplitudes with a wide range of $Δ$ Freq values. Second, despite a small $Δ$ Freq and strong modulation of the irrelevant sound, Zhou (2019) failed to find a substantial deficit of Incongruent relative to Control, a result for which we have no explanation.

Experiments One and Two had a single condition in common: both included a condition in which $Δ$ Freq = 1 was combined with Modulation = 100%. We exploited this commonality to make a post hoc comparison of results from the two experiments. Figure 5’s panels show side-by-side the two sets of results, one based on more than twice as many subjects as the other. The experiments produced comparable performance, not just in overall level of accuracy, but also in the relationships among the various types of avMatch. Although the similarity of results from separate experiments seems to suggest that Fish Police!! results are highly replicable, we cannot rule out the possibility that the similarity might have been influenced by the fact that 40% of Experiment Two’s subjects had also served in Experiment One.

Figure 5.

Accuracy of judgments in the single condition common to both experiments, $Δ$ Freq = 1 and modulation = 100%. Each error bar spans the 95% confidence interval. Data points represent individual subjects. Panels are repeated from Figures 1 and 3.

Our experiments showed that audiovisual interaction depends on the degree of mismatch between auditory and visual signals and on the modulation depth of a task-irrelevant sound. Both variables have counterparts in the interference seen in multi-speaker environments and in other noisy settings. Interference with speech reception depends upon the similarity between the characteristics of a speaker’s voice and speech, on one hand, and the characteristics of irrelevant, background speech or noise, on the other (Brungart and Simpson, 2002, 2007); interference also depends upon the relative loudness of the irrelevant background speech or noise (Rhebergen et al., 2008). Of course, to understand audiovisual interference fully, many other variables need to be considered, including ones not usually considered in perception or in decision-making research. As one example, shortening the interval between successive trials in Fish Police!! severely disrupts performance (Sun and Sekuler, 2021). This disruption was disproportionately large on Incongruent trials, perhaps mediated by diminished capacity to engage selective attention. Interestingly, the effect is accompanied by (i) heightened autonomic activation and (ii) self-reports of increased stress. Shrinking the interval between trials, without altering other aspects of an experiment, also disrupts performance on Eriksen’s Flanker Task, especially on trials most dependent on selective attention (Sussman et al., 2021). It would be worthwhile to explore these seemingly parallel effects from the Flanker Task and Fish Police!! to determine what, if any, shared mechanisms may be behind the reduced accuracy seen in both.

Although a temporal correlation between auditory and visual stimuli usually benefits performance, that is not always true. Consider a case in which subjects’ task was very different from the one in our experiments. Strand et al. (2020) presented listeners with a disc whose size fluctuations were synchronized to amplitude variations in spoken sentences or words. The visual stimulus, then, provided information that tracked the acoustic envelope of what subjects were hearing. The presence of the dynamic disc led subjects to report that the speech recognition task was easier, but paradoxically, it failed to improve recognition of the speech stimuli. Strand et al. speculated that the disc had been distracting while not providing sufficient phonetic detail to aid task performance.

On Fish Police!!’s Incongruent trials, discounting the frequency of the task-irrelevant auditory stimulus requires that subjects filter out some or all of that task-irrelevant information. Otherwise, the auditory stimulus could influence or even dominate judgments of visual frequency. Ineffective or incomplete filtering would allow the task-irrelevant stimulus to reduce accuracy and lengthen response times relative to the other conditions. The fact that accuracy on Incongruent trials tended to fall below that of other trials suggests that filtering was imperfect. Experiment One showed that filtering became less effective as the target visual stimulus and the irrelevant auditory stimulus grew more similar; Experiment Two showed that filtering became less effective when the task-irrelevant auditory stimulus was stronger, that is, when its modulation was greater. These effects fit well with what is known about selective attention (Carrasco, 2011).

The random intermixing of Congruent, Control, and Incongruent trials kept subjects from knowing before the trial what stimulus type would be presented. That temporal unpredictability differs from what is encountered when competing background speech or noise is relatively constant over extended periods. As a result, our task might demand more transient selective attention than would be called for in some common situations, like a noisy restaurant whose background din may be relatively constant over time. To bring selective attention online in Fish Police!!, subjects had to detect the presence of an audiovisual mismatch in real time, preferably early in the stimulus. To identify a possible marker of that detection process, Sun and Sekuler (2021) measured scalp electroencephalographic signals while subjects were playing Fish Police!!. When concurrent auditory and visual stimuli were mismatched in frequency (that is, on Incongruent trials), a large transient increase in theta band (4–8 Hz) power was seen $\sim$ 150 ms after stimulus onset. That result was not found with either Congruent or Control conditions. Sun and Sekuler suggested that this increase in theta power marked the onset of a control process needed to filter the mismatched, task-irrelevant auditory stimulus. Understanding the possible role of the EEG signal demands further work, including a trial-by-trial comparison of cortical oscillatory activity and performance.

Footnotes

Acknowledgements

The authors thank Mercedes B. Villalonga, Long Yi, and Professor Xiaodong Liu for very helpful advice at various stages of this project. Jiayue Tai: ORCID orcid.org/0000-0003-3624-9556; Jack Forrester: ORCID orcid.org/0000-0001-9755-6854; Robert Sekuler: ORCID orcid.org/0000-0002-2519-4943. Raw data for each experiment and the R code used to analyze the data have been published to the Open Source Foundation repository.

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship and/or publication of this article.

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported in part by the National Science Foundation (Center for Excellence in Education, Science, and Technology). JT was supported by Brandeis’ Provost Research fund.

ORCID iDs

Jiayue Tai

Jack Forrester

Robert Sekuler

References

Bates, Mächler, Bolker, & Walker (2015). Fitting linear mixed-effects models using lme4. Journal of Statistical Software, 67(1), 1–48. https://doi.org/10.18637/jss.v067.i01

Bernstein, L. E., Auer Jr, E. T., & Takayanagi (2004). Auditory speech detection in noise enhanced by lipreading. Speech Communication, 44, 5–18. https://doi.org/10.1016/j.specom.2004.10.011

Brungart, D. S., & Simpson, B. D. (2002). Within-ear and across-ear interference in a cocktail-party listening task. The Journal of the Acoustical Society of America, 112(6), 2985–2995. https://doi.org/10.1121/1.1512703

Brungart, D. S., & Simpson, B. D. (2007). Effect of target-masker similarity on across-ear interference in a dichotic cocktail-party listening task. The Journal of the Acoustical Society of America, 122(3), 1724. https://doi.org/10.1121/1.2756797

Carrasco (2011). Visual attention: The past 25 years. Vision Research, 51, 1484–1525. https://doi.org/10.1016/j.visres.2011.04.012

Chandrasekaran, Trubanova, Stillittano, Caplier, & Ghazanfar, A. A. (2009). The natural statistics of audiovisual speech. PLoS Computational Biology, 5(7), e1000436. https://doi.org/10.1371/journal.pcbi.1000436

Conway, A. R. A., Cowan, & Bunting, M. F. (2001). The cocktail party phenomenon revisited: The importance of working memory capacity. Psychonomic Bulletin and Review, 8(2), 331–335. https://doi.org/10.3758/bf03196169

Dias, J. W., McClaskey, C. M., & Harris, K. C. (2021). Audiovisual speech is more than the sum of its parts: Auditory-visual superadditivity compensates for age-related declines in audible and lipread speech intelligibility. Psychology and Aging, 36(4), 520–530. https://doi.org/10.1037/pag0000613

Goldberg, Sun, Hickey, Shinn-Cunningham, & Sekuler (2015). Policing fish at Boston’s museum of science: Studying audiovisual interaction in the wild. i-Perception, 6(4), 1–11. https://doi.org/10.1177/2041669515599332

Golumbic, E. Z., Cogan, G. B., Schroeder, C. E., & Poeppel (2013). Visual input enhances selective speech envelope tracking in auditory cortex at a “cocktail party”. The Journal of Neuroscience, 33(4), 1417–1426.

Joris, P. X., Schreiner, C. E., & Rees (2004). Neural processing of amplitude-modulated sounds. Physiological Reviews, 84(2), 541–577. https://doi.org/10.1152/physrev.00029.2003

Kerlin, J. R., Shahin, A. J., & Miller, L. M. (2010). Attentional gain control of ongoing cortical speech representations in a “cocktail party”. Journal of Neuroscience, 30(2), 620–628. https://doi.org/10.1523/JNEUROSCI.3631-09.2010

Lakens, & Caldwell (2021). Superpower: Simulation-based power analysis for factorial analysis of variance designs. Advances in Methods and Practices in Psychological Science, 4(1), 251524592095150.

Lamers, M. J. M., & Roelofs (2011). Attentional control adjustments in Eriksen and Stroop task performance can be independent of response conflict. Quarterly Journal of Experimental Psychology (Hove), 64(6), 1056–1081. https://doi.org/10.1080/17470218.2010.523792

Lenth, R. V. (2021). emmeans: Estimated Marginal Means, aka Least-Squares Means. https://CRAN.R-project.org/package=emmeans. R package version 1.6.0.

Li, Wang, Chen, Cichocki, & Sejnowski (2018). The effects of audiovisual inputs on solving the Cocktail party problem in the human brain: An fMRI study. Cerebral Cortex, 28(10), 3623–3637. https://doi.org/10.1093/cercor/bhx235

Maddox, R. K., Atilgan, Bizley, J. K., & Lee, A. K. C. (2015). Auditory selective attention is enhanced by a task-irrelevant temporally coherent visual stimulus in human listeners. eLife, 4.

Passow, Westerhausen, Wartenburger, Hugdahl, Heekeren, H. R., Lindenberger, & Li, S. C. (2012). Human aging compromises attentional control of auditory perception. Psychology and Aging, 27(1), 99–105. https://doi.org/10.1037/a0025667

Peelle, J. E., & Sommers, M. S. (2015). Prediction and constraint in audiovisual speech perception. Cortex; a Journal Devoted to the Study of the Nervous System and Behavior, 68, 169–181. https://doi.org/10.1016/j.cortex.2015.03.006

Rhebergen, K. S., Versfeld, N. J., & Dreschler, W. A. (2008). Prediction of the intelligibility for speech in real-life background noises for subjects with normal hearing. Ear and Hearing, 29(2), 169–175. https://doi.org/10.1097/AUD.0b013e31816476d4

Rosser, D. A., Laidlaw, D. A., & Murdoch, I. E. (2001). The development of a “reduced logMAR” visual acuity chart for use in routine clinical practice. British Journal of Ophthalmology, 85(4), 432–436. https://doi.org/10.1136/bjo.85.4.432

Schneider, B. A., Daneman, & Pichora-Fuller, M. K. (2002). Listening in aging adults: From discourse comprehension to psychoacoustics. Canadian Journal of Experimental Psychology, 56(3), 139–152. https://doi.org/10.1037/h0087392

Singmann, Bolker, Westfall, & Aust (2016). afex: Analysis of Factorial Experiments. https://CRAN.R-project.org/package=afex. R package version 0.28-1.

Singmann, & Kellen (2019). An introduction to mixed models for experimental psychology. In D. H. Spieler and E. Schumacher (Eds.), New Methods in Cognitive Psychology (pp. 4–31). Psychology Press.

Stevens, & Bavelier (2012). The role of selective attention on academic foundations: A cognitive neuroscience perspective. Developmental Cognitive Neuroscience, 2 Suppl 1, S30–S48. https://doi.org/10.1016/j.dcn.2011.11.001

Strand, J. F., Brown, V. A., & Barbour, D. L. (2020). Talking points: A modulating circle increases listening effort without improving speech recognition in young adults. Psychonomic Bulletin and Review, 27(3), 536–543. https://doi.org/10.3758/s13423-020-01713-y

Sumby, W. H., & Pollack (1954). Visual contribution to speech intelligibility in noise. Journal of the Acoustical Society of America, 26, 212–215. https://doi.org/10.1121/1.1907309

Sun, Hickey, Shinn-Cunningham, & Sekuler (2017). Catching audiovisual interactions with a first-person fisherman video game. Perception, 46(7), 793–814. https://doi.org/10.1177/0301006616682755

Sun, & Sekuler (2021). Decision-making and multisensory combination under time stress. Perception, 50(7), 627–645. https://doi.org/10.1177/03010066211017458

Sussman, R. F., Villalonga, M. B., & Sekuler (2021). Flanker task under (perceived) time pressure. Journal of Vision, 21(9), 2953. https://doi.org/10.1167/jov.21.9.2953

Varghese, Mathias, S. R., Bensussen, Chou, Goldberg, H. R., Sun, Sekuler, & Shinn-Cunningham, B. G. (2017). Bi-directional audiovisual influences on temporal modulation discrimination. The Journal of the Acoustical Society of America, 141(4), 2474–2488. https://doi.org/10.1121/1.4979470

Warton, D. I., & Hui, F. K. C. (2011). The arcsine is asinine: The analysis of proportions in ecology. Ecology, 92(1), 3–10. https://doi.org/10.1890/10-0340.1

Yuan, Meyers, Borges, Lleo, & Fiorentino, K. A. (2021). Effects of visual speech envelope on audiovisual speech perception in multitalker listening environments. Journal of Speech, Language, and Hearing Research, 64(7), 2845–2853. https://doi.org/10.1044/2021_JSLHR-20-00688

Zhou (2019). Onset asynchrony influences audiovisual interaction [Master’s Thesis]. Graduate School of Arts and Science, Brandeis University.