Abstract
In spite of accumulating evidence for the spatial rule governing cross-modal interaction according to the spatial consistency of stimuli, it is still unclear whether 3D spatial consistency (i.e., front/rear of the body) of stimuli also regulates audiovisual interaction. We investigated how sounds with increasing/decreasing intensity (looming/receding sound) presented from the front and rear space of the body impact the size perception of a dynamic visual object. Participants performed a size-matching task (Experiments 1 and 2) and a size adjustment task (Experiment 3) of visual stimuli with increasing/decreasing diameter, while being exposed to a front- or rear-presented sound with increasing/decreasing intensity. Throughout these experiments, we demonstrated that only the front-presented looming sound caused overestimation of the spatially consistent looming visual stimulus in size, but not of the spatially inconsistent and the receding visual stimulus. The receding sound had no significant effect on vision. Our results revealed that looming sound alters dynamic visual size perception depending on the consistency in the approaching quality and the front–rear spatial location of audiovisual stimuli, suggesting that the human brain differently processes audiovisual inputs based on their 3D spatial consistency. This selective interaction between looming signals should contribute to faster detection of approaching threats. Our findings extend the spatial rule governing audiovisual interaction into 3D space.
Introduction
We are incessantly exposed to multisensory information of the outer world. To understand how we process multimodal information simultaneously and perceive real-life events, it is necessary to reveal the general principles that govern multisensory interaction (Cappe, Thelen, Romei, Thut, & Murray, 2012; Ghazanfar & Schroeder, 2006; Wallace et al., 2004). Previous research has shown that the spatial consistency of multimodal stimuli plays a key role in encoding multisensory signals into a unified representation for perceiving a unified event (“unity assumption”; Chen & Spence, 2017; Chen & Vroomen, 2013; Vatakis & Spence, 2007; Welch & Warren, 1980). The spatial consistency of multisensory information, the so-called “spatial rule,” has been suggested to regulate the multisensory integration process; the integration of information and perceptual interaction can be facilitated when multisensory stimuli are presented in a spatially consistent manner (Leo, Romei, Freeman, Ladavas, & Driver, 2011; Lewald & Guski, 2003; for review, Spence, 2013). In the audiovisual domain so far, the spatial consistency of stimuli along the horizontal and vertical planes has been shown to facilitate information integration (Girard, Pelland, Lepore, & Collignon, 2013; Jones & Jarick, 2006; Kadunce, Vaughan, Wallace, Benedek, & Stein, 1997; Li, Yang, Sun, & Wu, 2015; Romei, De Haas, Mok, & Driver, 2011). For instance, Leo et al. (2011) demonstrated that an auditory cue enhanced participants’ visual orientation sensitivity only for a target in the same hemifield as the auditory cue.
It is still an open question whether multimodal information is integrated depending on the spatial consistency in a depth plane, especially in the front and rear spaces of the body (e.g., Cappe et al., 2012; Spence, Lee, & Van der Stoep, 2017; Van der Stoep, Nijboer, Van der Stigchel, & Spence, 2015; Van der Stoep, Serino, Farnè, Di Luca, & Spence, 2016). Some studies have reported no spatial consistency effect on cross-modal integration in the front/rear spaces. In the audiotactile domain, perceptual interaction has been shown to occur in the frontal space (Zampini et al., 2005), in the rear space (Kitagawa, Zampini, & Spence, 2005), and even regardless of the front/rear co-location of the multisensory stimulation (Zampini, Torresan, Spence, & Murray, 2007; see Occelli, Spence, & Zampini, 2011, for review). In the audiovisual domain, Lee and Spence (2015) reported that the exact co-location in the front/rear spaces is not necessary for an audiovisual cueing effect to occur, though spatial alignment in depth was less relevant to their spatial attention tasks for lateralized visual targets (see also Van der Stoep, Nijboer, & Van der Stigchel, 2016). Therefore, the current study focused on the auditory effect on visual size perception, for which the spatial consistency of audiovisual information in the depth plane is considered to be more relevant.
In 3D space, we need to detect objects to effectively avoid potentially damaging collisions; the ability to detect approaching objects in space is indispensable for survival (Bach, Neuhoff, Perrig, & Seifritz, 2009; Wann, Poulter, & Purcell, 2011). In fact, studies have shown that approaching or looming stimuli can be threats that elicit typical defensive actions in human infants (Ball & Tronick, 1971; Náñez & Yonas, 1994), adults (King, Dykeman, Redgrave, & Dean, 1992), and even monkeys (Schiff, Caviness, & Gibson, 1962). In addition, looming stimuli are known to receive priority processing (Finlayson, Remington, & Grove, 2012; Franconeri & Simons, 2003). Likewise, Shirai and Yamaguchi (2004) showed that the visual system is more sensitive to a looming than to a receding stimulus, which could facilitate faster responses to looming events toward the observer.
Detection of approaching objects requires integration of dynamic multisensory spatial information, especially visual and auditory distance information. In the auditory domain, sound intensity can be an important monaural cue for the perception of distance (Zahorik, 2002), and changing intensity induces an impression of motion in depth. Looming sound (i.e., sound with increasing intensity) is a salient class of auditory stimuli that causes a strong perception of an approaching sound source (Bach et al., 2009; Neuhoff, 1998). Humans often overestimate the intensity and duration of looming sounds (Grassi & Darwin, 2006; Schlauch, Ries, & DiGiovanni, 2001) and underestimate their distance and the time-to-arrival (Neuhoff, 2001; Vagnoni, Lourenco, & Longo, 2012). Neurophysiological research showed that, compared to receding sound, looming sound more strongly activates auditory cortex and superior temporal sulcus (Maier, Chandrasekaran, & Ghazanfar, 2008), as well as amygdala (Bach et al., 2008). It has been proposed that the approaching quality of looming sound represents a potential threat in the environment and causes these adaptive biases, advantageous for the observer to detect and deal with dangerous events faster (Guski, 1992; Neuhoff, 1998).
In the visual domain, the size of an object can be a reliable monocular cue for the perception of distance (Dosher, Sperling, & Wurst, 1986). The apparent size of the dynamic object contributes to a large degree to the effective detection of approaching objects because, with a few exceptions, the retinal size of an object changes according to its distance from the observer. Thus, changing the size of an object induces strong perception of motion in depth; the increase and decrease in retinal size represent the looming and receding of an object, respectively (Beverley & Regan, 1979; Howard, Fujii, & Allison, 2014; Regan & Beverley, 1978). Accumulating evidence indicates that the perception of visual size can be altered by threatening information (Stefanucci & Proffitt, 2009; Van Ulzen, Semin, Oudejans, & Beek, 2008; Vasey et al., 2012). Looming stimuli can be so threatening as to elicit escape-like actions (e.g., Ball & Tronick, 1971). In fact, a visual looming stimulus on a collision course with an observer can be perceived as larger than it actually is (Chen, Yuan, Xu, Wang, & Jiang, 2016; Whitaker, McGraw, & Pearson, 1999).
Interestingly, visual size perception has been shown to be modifiable by nonvisual, auditory information. Specifically, Sutherland, Thut, and Romei (2014) demonstrated that a static visual stimulus was overestimated in size when it was paired with looming sound, but not with receding or static sounds, which confirms the importance of looming stimuli as potential threats (Chen et al., 2016; Whitaker et al., 1999). This phenomenon has been explained as the auditory distance information transferring onto the visual dimension, causing the object to appear closer to the observer than it actually is.
The study by Sutherland et al. (2014), however, left some questions unanswered. First, their visual stimuli were presented only briefly, while the auditory stimuli dynamically changed in intensity. According to the “unity assumption,” it is less likely that a static visual stimulus and a dynamic auditory stimulus were integrated into a unified event (e.g., Chen & Spence, 2017). Thus, we employed dynamic visual stimuli with changing size to investigate how looming/receding sounds modulate the perceived size of visual looming/receding objects. Furthermore, as the perceived visual size of a static object has been shown to change merely by a high-intensity tone (Takeshima & Gyoba, 2013), sound intensity alone might be sufficient to alter perceived object size. To rule out this effect, we equalized sound intensity at its onset across conditions (Experiments 1 and 3) and accumulative sound intensity across conditions (Experiment 2). Moreover, there is a potential confound of response bias in the sequential comparison task of Sutherland et al. (2014), in which participants judged whether the second stimulus paired with a sound (target) was larger or smaller than the first stimulus (reference). For example, looming sound may result in a bias toward a “larger” response, and, as the authors acknowledged, it is difficult to rule out this kind of bias in a binary response. Thus, in this study, we devised a simultaneous size-matching task of a dynamically changing target stimulus with a reference stimulus, which provides continuous measures of participants’ online perception (Experiments 1 and 2). We also conducted a third experiment requiring participants to reproduce the perceived size of a target stimulus subjectively. Consistent results across the different experiments would provide strong support for the perceptual effect.
In summary, the current study aimed to evaluate the front/rear spatial consistency effect on the integration of dynamic audiovisual distance information. We investigated whether the sounds with changing intensity affect dynamic visual size perception depending on the front/rear spatial alignment of audiovisual stimuli. Based on the previous evidence for looming specific integration (Cappe et al., 2009, 2012), we hypothesized that audiovisual looming stimuli would be selectively integrated. Furthermore, if the front/rear co-location governs the integration of dynamic audiovisual information, according to the spatial rule (e.g., Girard et al., 2013; Lewald & Guski, 2003; Spence, 2013), visual size perception should be selectively modulated by the front-presented auditory stimuli.
Experiment 1
Methods
Participants
Ten participants (mean age = 22.3, SD = 1.19, four females) took part in this experiment. All participants had normal hearing and normal or corrected vision. They all gave written informed consent to the procedure, which was in accordance with the ethics standards in the Declaration of Helsinki and was approved by the local ethics committee of Kyoto University. All were naïve to the purpose of the experiment. The data from one participant were discarded due to failure to understand the requirements of the task in the main experiment.
Apparatus and stimuli
The audiovisual stimuli were presented using PsychoPy software and its Pyo back-end for the sound presentation (version 1.83, Peirce, 2007, 2009) on a computer (Apple Mac Pro). The visual stimuli were presented on an OLED monitor screen (SONY PVM-A170, 60 Hz refresh rate), which has good temporal responses (Cooper, Jiang, Vildavski, Farrel, & Norcia, 2013; Ito, Ogawa, & Sunaga, 2013). The screen luminance was measured using a colorimeter (Photo Research PR-655). The auditory stimuli were presented through one of two loudspeakers (SONY PCVA-SP2) that were placed 50 cm in front of and behind a chin rest. The intensity of the sounds was measured at the chin rest (i.e., roughly at the location of the participants’ ears) using a noise meter (SMART SENSOR AR814). The relative timing of the visual and auditory stimuli was measured and calibrated using an Arduino microcomputer attached to a photocell and a condenser microphone.
Auditory stimuli were 440 Hz pure tones with increasing intensity (55−75 dB SPL, looming sound: AL), decreasing intensity (55−35 dB SPL, receding sound: AR), or constant intensity (55 dB SPL, static sound: AS). We also tested a no-sound (AN) condition as a control. The sounds were presented for 5,000 ms and linearly changed in intensity over the duration (Figure 1). Previous research showed that tonal stimuli can induce reliable perception of motion in depth and can be involved in multisensory integration (Neuhoff, 2001). All sounds were generated with the free software Audacity (version 2.0.6.0).
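As an illustration of the three sound types described above, the following minimal Python sketch synthesizes a 440 Hz tone whose level changes linearly in dB over 5,000 ms. It is not the authors' procedure (the sounds were generated in Audacity). The sample rate, the choice of 75 dB SPL as digital full scale, and the function names `level_db` and `ramp_tone` are assumptions made for this example; relating digital amplitude to actual SPL requires calibration with a sound level meter.

```python
import math

def level_db(t, start_db, end_db, duration=5.0):
    """Sound level (dB SPL) at time t for a ramp that is linear in dB."""
    return start_db + (end_db - start_db) * t / duration

def ramp_tone(start_db, end_db, duration=5.0, freq=440.0,
              sample_rate=44100, ref_db=75.0):
    """Return samples of a pure tone whose level ramps linearly in dB.

    Amplitudes are relative to ref_db, which is mapped to full scale (1.0).
    This mapping is an assumption of the sketch, not a calibrated value.
    """
    n = int(duration * sample_rate)
    samples = []
    for i in range(n):
        t = i / sample_rate
        # A difference of 20 dB corresponds to a factor of 10 in amplitude.
        gain = 10 ** ((level_db(t, start_db, end_db, duration) - ref_db) / 20)
        samples.append(gain * math.sin(2 * math.pi * freq * t))
    return samples

looming = ramp_tone(55, 75)   # AL: 55 -> 75 dB SPL
receding = ramp_tone(55, 35)  # AR: 55 -> 35 dB SPL
static = ramp_tone(55, 55)    # AS: constant 55 dB SPL
```

All three stimuli start at the same level (55 dB SPL), so they differ only in how the level evolves after onset.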
Figure 1. (a) Visual stimuli in Experiment 1. Visual stimuli consisted of a gray-colored target disk expanding (VL: solid) or contracting (VR: dotted) monotonically. Gray lines show the diameters of the static reference disks for each target condition. (b) Auditory stimuli in Experiment 1. A pure tone of 440 Hz was presented for 5,000 ms with increasing (AL: solid), static (AS: gray), or decreasing (AR: dotted) intensity, or no sound was presented (AN: not shown).
Visual stimuli were gray-colored disks (Figure 2). In each trial, a pair of gray disks (17.5 cd/m2), a target and a reference, appeared on a black background (0.01 cd/m2) from the distance of 50 cm. A reference disk was presented at an upper location on the monitor, and a red-colored fixation dot was presented at the center of the reference. A target disk was presented 4.5° below the fixation dot to maximize the auditory effect on vision (Alink et al., 2012; Hidaka et al., 2008; Shams, Kamitani, & Shimojo, 2002; Sutherland et al., 2014). The target disk changed size monotonically in each trial. The diameter of the target disk was 300 pixels (8.5° for visual angle) at the start and then increased by 25 pixels (0.7°) per second linearly (visual looming condition: VL) or decreased by 25 pixels per second (visual receding condition: VR). The reference disk was static, with the diameter of 400 pixels (11.3°) in VL condition and 200 pixels (5.7°) in VR condition. In both conditions, the size of the target and reference became physically the same at 4,000 ms after trial onset.
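The geometry of the visual stimuli can be checked with a short calculation. The sketch below is illustrative only: the pixel pitch is not reported in the text, so it is inferred here from the stated equivalence of 300 pixels and 8.5° at a 50 cm viewing distance, and the function names are hypothetical.

```python
import math

VIEW_DIST_CM = 50.0
# Assumption: pixel pitch inferred from "300 pixels = 8.5 deg at 50 cm".
CM_PER_PX = 2 * VIEW_DIST_CM * math.tan(math.radians(8.5 / 2)) / 300

def visual_angle_deg(diameter_px):
    """Visual angle subtended by a disk of the given diameter in pixels."""
    size_cm = diameter_px * CM_PER_PX
    return math.degrees(2 * math.atan(size_cm / (2 * VIEW_DIST_CM)))

def target_diameter_px(t_s, start_px=300.0, rate_px_per_s=25.0):
    """Target diameter after t_s seconds of linear size change."""
    return start_px + rate_px_per_s * t_s

def match_time_s(reference_px, start_px=300.0, rate_px_per_s=25.0):
    """Time at which the target physically equals the reference."""
    return (reference_px - start_px) / rate_px_per_s

print(round(visual_angle_deg(400), 1))            # VL reference diameter
print(round(visual_angle_deg(200), 1))            # VR reference diameter
print(match_time_s(400))                          # VL: expanding target
print(match_time_s(200, rate_px_per_s=-25.0))     # VR: contracting target
```

Under the inferred pixel pitch, the computed angles agree with the reported values for the reference disks, and both conditions reach physical equality 4 s after onset.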
Figure 2. Time course of the experimental trial in Experiments 1 and 2. In each trial, a pair of the target (lower disk) and the reference (upper disk) visual stimuli and an auditory stimulus was simultaneously presented. The target and the sound changed in size or intensity. Auditory stimuli were presented from either of the two loudspeakers located in front of and behind the participants. Participants were asked to press a key when the two visual stimuli were perceived as the same size.
It is important to note that auditory stimuli were assumed to impact the target disk more strongly than the reference disk. Auditory modulation on vision occurs when visual information is ambiguous, such as when a visual stimulus is presented in a peripheral field (e.g., Shams et al., 2002) as the target disk here. In addition, auditory information from dynamic auditory stimuli should be attributed only to a similarly dynamic object based on the unity assumption between audiovisual stimuli (e.g., Chen & Spence, 2017). Thus, if auditory stimuli modulate vision, they were assumed to primarily affect the dynamic target disks but not the static reference disks that appeared in the fovea. Even in the unlikely event that the auditory modulation is as effective on the reference disk as on the target, the two effects would cancel out and simply produce a null result in our design.
Procedure
The experiment was conducted in a dark shielded room with normal room reverberation. The background noise level was about 31.0 dB SPL.
Before the main experiment, participants performed a sound localization test. They were asked to localize a 440 Hz pure tone of 500 ms duration that was randomly presented from the front- or the rear-placed loudspeakers. The localization test consisted of 30 repetitions for each sound. The trial order was randomized for each participant. All participants localized each sound with a high accuracy (average percent correct: 95.4, range: 85.0–100.0, SD = 4.42).
In the main experiment, participants performed a size-matching task of visual target and reference disks, while being exposed to the auditory stimuli that were presented in front of or behind their body. There were in total 14 experimental conditions, consisting of 12 audiovisual conditions (two visual stimuli: VL and VR; three auditory stimuli: AL, AR, and AS; two sound positions: front and rear) and two unisensory visual conditions for baseline (VL-AN and VR-AN). Participants performed four experimental blocks of the task. Each of the blocks consisted of 30 repetitions for each of the 14 conditions. Trial order was randomized for each block and participant.
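The factorial structure of the design can be made explicit with a small Python sketch. This is not the authors' experiment script; the function names, the tuple representation of a condition, and the seeding are assumptions made for illustration.

```python
import itertools
import random

VISUAL = ["VL", "VR"]
AUDITORY = ["AL", "AR", "AS"]
POSITION = ["front", "rear"]

def build_conditions():
    """Return the 12 audiovisual conditions plus the 2 no-sound baselines."""
    audiovisual = list(itertools.product(VISUAL, AUDITORY, POSITION))
    # The no-sound (AN) baseline has no loudspeaker position.
    baseline = [(v, "AN", None) for v in VISUAL]
    return audiovisual + baseline

def randomized_trials(conditions, repetitions, seed=None):
    """Repeat every condition and shuffle the resulting trial list."""
    rng = random.Random(seed)
    trials = [c for c in conditions for _ in range(repetitions)]
    rng.shuffle(trials)
    return trials

conditions = build_conditions()
trials = randomized_trials(conditions, repetitions=30, seed=1)
print(len(conditions), len(trials))
```

Passing a different seed for each block and participant would give the independent randomization described above.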
The series of events in a trial is illustrated in Figure 2. Participants were instructed to attend to the visual stimuli by gazing at the fixation dot throughout a trial and to disregard the auditory stimulus. The reference disk and the fixation dot were presented in the upper part of the monitor, and the target disk was presented 4.5° below the fixation dot. After the 500 ms presentation of the fixation dot, the target, the reference, and the auditory stimulus were presented simultaneously. Then, the size of the target disk and the intensity of the auditory stimulus gradually changed. Participants were required to respond by pressing a key with their right index finger as soon as the target and reference disks were subjectively perceived as the same size. The visual stimuli remained on the screen until a response was made. In each trial, the elapsed time from stimulus onset to a key press was recorded as response latency. Shortening of the response latency should reflect overestimation of VL and underestimation of VR target size.
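To make the link between response latency and perceived size concrete, the sketch below converts a latency into the physical diameter of the target at the moment of the key press, and expresses the reference diameter relative to it. The latencies used are hypothetical values chosen for illustration, not data from the study, and the function names are assumptions.

```python
def size_at_response_px(latency_ms, start_px, rate_px_per_s):
    """Physical target diameter at the moment of the key press."""
    return start_px + rate_px_per_s * latency_ms / 1000.0

def apparent_size_ratio(latency_ms, start_px, rate_px_per_s, reference_px):
    """Reference diameter divided by the target diameter at the key press.

    A ratio above 1 means a physically smaller target was judged equal to
    the reference (overestimation); below 1 means underestimation.
    """
    return reference_px / size_at_response_px(latency_ms, start_px, rate_px_per_s)

# Hypothetical latency of 3,600 ms in each visual condition of Experiment 1.
vl_ratio = apparent_size_ratio(3600, 300, 25, 400)    # expanding target
vr_ratio = apparent_size_ratio(3600, 300, -25, 200)   # contracting target
print(size_at_response_px(3600, 300, 25), vl_ratio > 1, vr_ratio < 1)
```

Because an early response in the expanding condition means the target was still smaller than the reference, and in the contracting condition still larger, a shorter latency maps onto overestimation for VL and underestimation for VR.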
Data analysis
We analyzed all the individual data with linear mixed models, using the lme4 and lmerTest packages in R (version 3.3.1, http://www.r-project.org/). We first made a statistical inference using a maximum model (model with all possible fixed and random effects) and then conducted an automated model selection with backward elimination by using the step function of the lmerTest package. In the present case of the balanced factorial experiment, a full-factorial analysis of variance (ANOVA) is known to provide test statistics for all fixed effects: all main effects and interactions (Matuschek et al., 2017). Thus, we conducted backward elimination only for random effects, and, based on the selected model, we implemented an ANOVA F test for fixed effects using the anova function of the lmerTest package. Furthermore, to conduct multiple comparisons across the conditions, we used the difflsmeans function of the lmerTest package. To correct p values for multiple comparisons, we applied the Bonferroni method. This analytical approach was used throughout the article.
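The Bonferroni step can be illustrated in a few lines of Python. This sketch is independent of the R pipeline described above, and the p values in it are hypothetical: each uncorrected p value is multiplied by the number of comparisons and capped at 1.

```python
def bonferroni(p_values):
    """Bonferroni-adjusted p values: multiply by the number of tests, cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

# Hypothetical uncorrected p values for three pairwise comparisons.
adjusted = bonferroni([0.004, 0.2, 0.5])
print(adjusted)
```

Because of the cap, clearly nonsignificant comparisons end up with adjusted p values at or near 1.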
Results
A backward elimination of random effects selected the random by-participant intercepts and random slopes of Visual stimulus. Thus, we adopted the model including the main effects of Visual stimulus, Auditory stimulus, Position, and all possible interactions, as well as these random effects.
Figure 3 shows the least-squares mean of response latency in each condition estimated by the selected model. The model estimates and the measured means for each participant are shown in the Supplementary Document (Figure S1). The primary interest of this study is the synergistic interactions between visual, auditory, and positional factors in 12 multisensory conditions; we excluded the no-sound conditions (i.e., VL-AN and VR-AN) from the following factorial analysis accordingly. An ANOVA revealed no significant three-way interaction of Visual stimulus, Auditory stimulus, and Position, F(2, 3,222) = 0.99, p = .37. All two-way interactions were significant, Vision–Audition: F(2, 3,222) = 4.53, p = .011; Vision–Position: F(1, 3,222) = 5.61, p = .018; Audition–Position: F(2, 3,222) = 5.89, p = .003. There were significant main effects of Visual stimulus, F(1, 9) = 8.37, p = .018, Auditory stimulus, F(2, 41) = 7.67, p = .001, and Position, F(1, 108) = 8.29, p = .005.
Figure 3. Results of Experiment 1. Bars show the least-squares mean of response latency in each condition estimated using a linear mixed model (described in the Data analysis section). Error bars refer to 95% confidence intervals. In a series of figures, V refers to the visual condition and A refers to the auditory condition. L, R, S, and N denote looming, receding, static, and no-sound conditions, respectively. *p < .05. **p < .01.
Given the significant two-way interactions, we split the data according to Visual stimulus (VL/VR) to test the effects of Auditory stimulus and Position in more detail. The data of VL/VR conditions were analyzed separately using the linear mixed model. As a result of the stepwise model selection for random effects, we adopted a model including all possible fixed effects (the main effects of Auditory stimulus (AL/AR/AS), Position (front/rear), and their interaction) and random by-participant intercepts. In the VL condition, we observed a significant two-way interaction between Auditory stimulus and Position, F(2, 1,606) = 5.14, p = .006, and significant main effects of Auditory stimulus and Position, Auditory stimulus: F(2, 1,606) = 10.85, p < .001; Position: F(1, 1,606) = 12.48, p < .001. In contrast, there was no significant two-way interaction, F(2, 1,606) = 1.24, p = .29, nor main effects, Auditory stimulus: F(2, 1,606) = 1.05, p = .35; Position: F(1, 1,606) = 0.18, p = .67, in the VR condition.
We then conducted multiple comparisons across all conditions in the VL condition and found that the response latency for the AL-front condition was significantly shorter than for the other conditions, AL-front vs. AL-rear: t(1,606) = −4.46, p < .001; AL-front vs. AR-front: t(1,606) = −4.79, p < .001; AL-front vs. AS-front: t(1,606) = −4.47, p < .001; AL-front vs. AR-rear: t(1,606) = −4.76, p < .001; AL-front vs. AS-rear: t(1,606) = −6.16, p < .001. No significant difference was found for the other pairs of conditions, AR-front vs. AR-rear: t(1,606) = 0.03, p > .99; AR-front vs. AS-front: t(1,606) = 0.32, p > .99; AR-front vs. AS-rear: t(1,606) = −1.37, p > .99; AR-rear vs. AS-front: t(1,606) = −0.29, p > .99; AR-rear vs. AS-rear: t(1,606) = −1.40, p > .99. Here, it is worth mentioning that the mean response latency for the AN condition was within the range of 95% confidence intervals for the other conditions except for the AL-front condition. The analyses showed that only the looming visual object was overestimated in size when paired with the front-presented looming sound. The receding sound had no significant influence on the response latency in either position.
Discussion
Experiment 1 revealed an asymmetric auditory effect on response latency, which is considered to reflect perceived visual size. Only the looming sound presented in front of the participants caused overestimation of the visual size of a looming visual target. The observed perceptual interaction was looming-specific and was further governed by the front/rear spatial consistency of audiovisual looming stimuli, which is our novel finding.
These results are consistent with a few previous studies showing that a congruent combination of audiovisual looming signals can be preferentially processed compared to a unisensory or incongruent audiovisual pair. Using electroencephalography recording, Cappe et al. (2012) demonstrated that the combination of audiovisual looming signals caused nonlinear interactions at an early stage of processing, and that those synergistic interactions positively correlated with the facilitation of response time. Likewise, the combination of audiovisual looming information has been reported to selectively enhance neural signaling in the superior temporal sulcus and low-level auditory/visual cortices (Tyll et al., 2013). These lines of evidence suggest that audiovisual looming signals can elicit supra-additive brain activities that lead to selective perceptual facilitation. Accordingly, the result that looming sound selectively affects the size perception of a looming object, but not of a receding object, might be explained by the synergistic interaction of audiovisual looming signals.
Response latencies in all the conditions were shorter than 4,000 ms, the time at which the target physically became the same size as the reference. This is not surprising if participants responded as soon as the target size came within the just noticeable difference (JND) of the reference. Also, the response latencies were shorter for the VR condition than for the VL condition. This might be due to the different sizes of the reference disks in the two visual conditions. The diameter of the reference disk in the VR condition was half as large as that in the VL condition (see Figure 1); the smaller disks could be more difficult to discriminate and hence yielded a larger JND. A similar pattern of results was observed for the no-sound conditions (VL-AN and VR-AN), suggesting that it reflected a property of the current task rather than an auditory influence.
Another potentially more confounding factor is that the accumulative sound intensity was different across the auditory conditions. In this experiment, we unified the initial intensity of all auditory stimuli, in accordance with Cappe et al. (2012). However, lack of significant influence of receding sound on vision could be explained merely by lower accumulative intensity, considering the evidence that a static object can be overestimated in size when paired with a high-intensity tone (Takeshima & Gyoba, 2013). These differences in the stimuli might have caused the asymmetrical pattern of the results. To explore this possibility, in the next experiment, we controlled the reference size and the accumulative sound intensity across conditions.
Experiment 2
This experiment aimed to elucidate the influences of intensity and looming quality of auditory stimuli on visual size perception using a different set of audiovisual stimuli than those used in Experiment 1. We equalized the diameter of the reference disks and the accumulative intensity of auditory stimuli across conditions in order to reject possible artifacts derived from the stimulus differences. If the modulation of visual size perception is caused by the looming quality of the sound, not by the sound intensity, the pattern of the results in Experiment 1 should be replicated.
Methods
Participants
A different group of 10 healthy adults (mean age = 22.0, SD = 1.73, six females) with normal hearing and normal or corrected vision participated in this experiment. Participants gave written informed consent to the procedure, which was approved by the local ethics committee of Kyoto University, and were paid according to the university's standard rate.
Stimuli and procedure
We used the same apparatus and task as Experiment 1, except for the changing rates of the size and the intensity of audiovisual stimuli as shown in Figure 4. We tested three sound conditions: increasing intensity (55−75 dB SPL, looming sound: AL), decreasing intensity (75−55 dB SPL, receding sound: AR), and no-sound (AN) for control. Static sound was not used in this experiment. Note that the initial intensity was higher for receding sound than looming sound. Both looming and receding sounds were presented for 5,000 ms and linearly changed in intensity by 4 dB per second.
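The effect of this change on accumulative intensity can be checked numerically. The text does not define how accumulative intensity was computed, so the sketch below evaluates two plausible definitions, both of which are assumptions: the time-averaged level in dB, and the energy-averaged equivalent level. The function names and the number of time steps are likewise illustrative.

```python
import math

def ramp_levels(start_db, end_db, steps=5000):
    """Levels (dB SPL) sampled at the midpoints of equal time steps."""
    return [start_db + (end_db - start_db) * (i + 0.5) / steps
            for i in range(steps)]

def mean_level_db(levels):
    """Time-averaged level in dB."""
    return sum(levels) / len(levels)

def equivalent_level_db(levels):
    """Energy-averaged level: 10 * log10 of the mean of 10^(L/10)."""
    mean_energy = sum(10 ** (level / 10) for level in levels) / len(levels)
    return 10 * math.log10(mean_energy)

# Experiment 1: looming and receding ramps share the onset level (55 dB SPL).
exp1_al, exp1_ar = ramp_levels(55, 75), ramp_levels(55, 35)
# Experiment 2: the receding ramp is the time reversal of the looming ramp.
exp2_al, exp2_ar = ramp_levels(55, 75), ramp_levels(75, 55)

print(mean_level_db(exp1_al) - mean_level_db(exp1_ar))
print(mean_level_db(exp2_al) - mean_level_db(exp2_ar))
print(equivalent_level_db(exp2_al) - equivalent_level_db(exp2_ar))
```

Under either definition, the time-reversed pair used in Experiment 2 is matched, whereas the onset-matched pair of Experiment 1 differs by about 20 dB.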
Figure 4. (a) Visual stimuli in Experiment 2. Visual stimuli consisted of a target disk expanding (VL: solid) or contracting (VR: dotted) monotonically. A gray line shows the static reference disk for both target conditions. (b) Auditory stimuli in Experiment 2. Auditory stimuli were sine tones of 5,000 ms duration with increasing (AL: solid) or decreasing (AR: dotted) intensity, or no sound (AN: none).
We used the same gray-colored disks as visual stimuli as in the former experiment. The initial diameter of the target disk was 200 pixels (5.7°) in the VL condition and 400 pixels (11.3°) in the VR condition. The diameter increased or decreased linearly by 25 pixels (0.7°) per second. The diameter of the reference disk was always 300 pixels (8.5°) in Experiment 2. In both conditions, the target and reference disks physically became the same size at 4,000 ms from the onset.
Participants performed three blocks of the size-matching task, which consisted of eight audiovisual conditions (two visual stimuli: VL and VR; two auditory stimuli: AL and AR; two positions: front and rear) and two no-sound conditions (VL-AN and VR-AN). Each experimental block consisted of 30 repetitions of each of the 10 conditions, and trial order was randomized for each block and participant.
Before the main experiment, participants performed a sound localization task as in the previous experiment. All participants localized the sounds with high accuracy (average percent correct: 94.0, range: 86.7–100.0, SD = 3.74).
Results
As in Experiment 1, we conducted a model selection for the random effects, while keeping all the fixed effects (the main effects of Visual stimulus, Auditory stimulus, and Position, and all their possible interactions). Only the random by-participant intercepts and random slopes of Visual stimulus were selected.
Figure 5 shows the least-squares mean response latencies in each condition estimated by the selected model (For the model estimates and the measured means for each participant, see Supplementary Figure S2). To investigate whether response latencies differed across the multisensory conditions, we analyzed data of eight audiovisual conditions (excluding the AN conditions) using the selected model. We conducted an ANOVA for the fixed effects and found that the three-way interaction was not significant, F(1, 2,374) = 2.74, p = .098. While the two-way interactions between Visual stimulus and Auditory stimulus, F(1, 2,374) = 4.21, p = .04, and Visual stimulus and Position, F(1, 2,374) = 7.72, p = .005, were significant, the one between Auditory stimulus and Position was not significant, F(1, 2,374) = 1.67, p = .20. There were significant main effects of Visual stimulus, F(1, 9) = 23.03, p < .001, and Position, F(1, 2,374) = 8.61, p = .003, not of Auditory stimulus, F(1, 2,374) = 1.56, p = .21.
Figure 5. Results of Experiment 2. Bars show the least-squares mean of response latency in each condition estimated using a linear mixed model (described in the Data analysis section). Error bars refer to 95% confidence intervals. *p < .05. **p < .01.
To investigate the influence of each sound on vision, as in Experiment 1, we split the data by visual conditions to assess the auditory and positional effects on response latencies in detail. We reanalyzed each VL and VR data set using a linear mixed model that included all possible fixed effects (the main effects of Auditory stimulus and Position, and their interaction) and random by-participant intercepts. For the VL condition, an ANOVA revealed significant main effects of Auditory stimulus and Position, Auditory stimulus: F(1, 1,187) = 4.73, p = .03; Position: F(1, 1,187) = 14.14, p < .001. The interaction between Auditory stimulus and Position was only marginally significant, F(1, 1,187) = 3.77, p = .053. However, for the VR condition, there were no significant main effects of Auditory stimulus, F(1, 1,187) = 0.38, p = .54, or Position, F(1, 1,187) = 0.02, p = .90, and no significant interaction between Auditory stimulus and Position, F(1, 1,187) = 0.08, p = .78.
Finally, as in Experiment 1, we compared response latencies across the auditory conditions within the VL condition. The response latency in the AL-front condition was significantly shorter than in the other three conditions, AL-front vs. AL-rear: t(1,187) = −4.03, p < .001; AL-front vs. AR-front: t(1,187) = −2.91, p = .024; AL-front vs. AR-rear: t(1,187) = −4.20, p < .001. There were no significant differences among the other conditions, AL-rear vs. AR-front: t(1,187) = −1.12, p > .99; AL-rear vs. AR-rear: t(1,187) = −0.17, p > .99; AR-front vs. AR-rear: t(1,187) = −1.29, p > .99.
The analyses showed that only the front-presented looming sound shortened response latencies, and that it did so selectively for the expanding target. Here again, the mean response latency for the AN condition was within the 95% confidence intervals of the response latencies for the other conditions, but not for the AL-front condition.
Discussion
The pattern of results in Experiment 2 was consistent with Experiment 1. Looming sound from the front position shortened response latency only for the VL condition, with no significant effect for the VR condition, suggesting that participants overestimated the size of an expanding visual target when it was presented together with a spatially consistent looming sound. Receding sound did not affect perceived visual size, even though its intensity was higher than that of a looming sound at the start of a trial. This confirms that the intensity of sound is not crucial in the modulation of dynamic visual size perception. The results also confirm that the observed asymmetric effect was not an artifact derived from different visual sizes of the reference disks in Experiment 1. Finally, as in Experiment 1, none of the auditory stimuli presented from the rear loudspeaker had any influence on vision, indicating the importance of front/rear spatial consistency for audiovisual interaction.
As in Experiment 1, response latencies in all conditions were shorter than the actual latencies for a pair of visual stimuli of the same diameter, and they were shorter for the VR than the VL condition. A similar pattern of results was observed for the no-sound conditions (VL-AN and VR-AN), again suggesting that it did not reflect the auditory influences but the property of the current task.
Experiment 3
Experiments 1 and 2 showed that the front-presented looming sound significantly shortened the response latency only for the expanding visual object. One might suspect that these results reflect simple facilitation of the motor response to looming signals (e.g., Cappe et al., 2012; Noel et al., 2015) without any change in perceived size. To rule out this possibility, we conducted another experiment, in which participants reproduced the perceived size of the dynamic visual target after each trial.
Methods
Participants
A different group of seven healthy adults (mean age = 23.1, SD = 1.77, three females) with normal hearing and normal or corrected vision participated in this experiment. Participants gave written informed consent to the procedure, which was approved by the local ethical committee at Kyoto University, and were paid according to the university standard.
Stimuli and procedure
We asked participants to perform a size adjustment task. Participants were exposed to dynamic audiovisual stimuli that changed in size and intensity, as in the previous experiments, and which disappeared at a certain point in time. The reference disk was not presented. After each presentation, they were required to reproduce the perceived size of the visual target at the time of its disappearance, by using a computer mouse to adjust the size of the probe disk.
The auditory stimuli comprised the same three types of sound as in Experiment 1: increasing intensity (55−75 dB SPL, looming sound: AL), decreasing intensity (55−35 dB SPL, receding sound: AR), and no sound (AN) as a control. The dynamic profiles of changing intensity for the looming and receding sounds were the same as in Experiment 1. In each trial, as in the former experiments, either the front- or rear-located loudspeaker presented an auditory stimulus. Hence, there were five auditory conditions (AL-front, AL-rear, AR-front, AR-rear, and AN). The visual stimuli consisted of a fixation dot and a gray-colored target disk presented 4.5° below the fixation dot, as in Experiment 1. There were two visual target conditions: VL and VR. The initial diameter and the dynamic profile of each target were the same as in Experiment 1; in both conditions, the target disk initially had a diameter of 300 pixels (8.5°) and increased or decreased linearly by 25 pixels (0.7°) per second.
The series of events in a trial is shown in Figure 6. In each trial, the audiovisual stimulus was presented after the initial presentation of the fixation dot for 500 ms. The duration of the audiovisual stimulus was randomly 3,800, 4,000, or 4,200 ms, which resulted in final target sizes of 395, 400, or 405 pixels in the VL condition, and 205, 200, or 195 pixels in the VR condition, respectively. At 100 ms after the offset of the audiovisual stimulus, a probe disk was presented at the same location as the fixation dot (i.e., the same location as the reference disk in Experiments 1 and 2). Participants were required to adjust the probe size to the perceived target size at the point of disappearance, using the computer mouse with their dominant hand. They could increase and decrease the diameter of the probe disk by moving the mouse forward and backward, respectively. When they felt that they had reproduced the perceived target size correctly, they pressed the mouse button to respond and start the next trial. In each trial, we recorded the reported probe size in pixels. The initial diameter of the probe was randomly set to either half or twice the diameter of the target at 4,000 ms from its onset (i.e., VL condition: 200 pixels (5.7°) or 800 pixels (22.6°); VR condition: 100 pixels (2.9°) or 400 pixels (11.4°), respectively). There was no pressure for fast responses.
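The stimulus arithmetic in this paragraph can be checked with a short sketch. The numeric constants are taken from the text (300-pixel initial diameter, 25 pixels per second, three durations); the function and variable names are illustrative assumptions and are not the authors' experiment code.

```python
# Illustrative check of the target and probe sizes described in the text.
# All names are hypothetical; only the numeric constants come from the paper.

INITIAL_DIAMETER_PX = 300            # starting diameter of the target disk
RATE_PX_PER_S = 25                   # linear change in diameter per second
DURATIONS_MS = (3800, 4000, 4200)    # possible stimulus durations


def final_diameter(duration_ms, looming):
    """Diameter (pixels) of the target at the moment it disappears."""
    change = RATE_PX_PER_S * duration_ms / 1000
    if looming:
        return INITIAL_DIAMETER_PX + change
    return INITIAL_DIAMETER_PX - change


def probe_start_diameters(looming):
    """Half and twice the target diameter at 4,000 ms from onset."""
    reference = final_diameter(4000, looming)
    return reference / 2, reference * 2


for label, looming in (("VL", True), ("VR", False)):
    finals = [final_diameter(d, looming) for d in DURATIONS_MS]
    print(label, "final sizes:", finals,
          "probe starts:", probe_start_diameters(looming))
```

Running the sketch reproduces the final sizes (395, 400, 405 pixels for VL; 205, 200, 195 pixels for VR) and the initial probe diameters (200 and 800 pixels for VL; 100 and 400 pixels for VR) stated above.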
Figure 6. Time course of a trial in Experiment 3. In each trial, a visual target (lower disk) and an auditory stimulus were simultaneously presented and changed in size or intensity. After a 100-ms blank, a visual probe was presented, and participants were asked to adjust the probe size to the target size at the point of disappearance.
Each participant performed four blocks of trials. There were eight audiovisual conditions (two visual stimuli: VL and VR; two auditory stimuli: AL and AR; and two positions: front and rear) and two no-sound conditions (VL-AN and VR-AN), and each experimental block consisted of six repetitions of each condition with the trial order randomized. Before the main experiment, participants performed the same sound localization test as in Experiments 1 and 2. Participants localized the front/rear sounds with high accuracy (average percent correct: 94.3, range: 83.3–100.0, SD = 5.76).
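The block structure described above can be sketched as follows. The condition counts (eight audiovisual plus two no-sound conditions, six repetitions per block, four blocks) come from the text; the condition label strings, the fixed seed, and the use of a simple within-block shuffle are assumptions made for illustration, since the text states only that the trial order was randomized.

```python
import itertools
import random

# Factor levels as named in the text; label strings are hypothetical.
VISUAL = ("VL", "VR")
AUDITORY = ("AL", "AR")
POSITION = ("front", "rear")
REPETITIONS_PER_BLOCK = 6
N_BLOCKS = 4


def build_block(rng):
    """Return one block: six repetitions of each of the ten conditions, shuffled."""
    audiovisual = [f"{v}-{a}-{p}"
                   for v, a, p in itertools.product(VISUAL, AUDITORY, POSITION)]
    no_sound = [f"{v}-AN" for v in VISUAL]
    trials = (audiovisual + no_sound) * REPETITIONS_PER_BLOCK
    rng.shuffle(trials)  # assumed randomization scheme
    return trials


rng = random.Random(0)  # fixed seed only so that the sketch is reproducible
blocks = [build_block(rng) for _ in range(N_BLOCKS)]
trials_per_block = len(blocks[0])
trials_per_participant = sum(len(block) for block in blocks)
print(trials_per_block, trials_per_participant)
```

Under these assumptions each block contains 60 trials, giving 240 trials per participant, of which 192 belong to the eight audiovisual conditions.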
Results
We again conducted model selection only for the random effects, using backward elimination, while keeping all the fixed effects (the main effects of Visual stimulus, Auditory stimulus, and Position, and all their possible interactions). The random by-participant intercepts and the random slopes of Visual stimulus were retained.
Figure 7 shows the least-squares mean of the adjusted probe size for each condition, estimated by the selected model (for the model estimates and the measured means for each participant, see Supplementary Figure S3). As in Experiment 1, we analyzed the data for the eight audiovisual conditions (excluding the VL-AN and VR-AN conditions, which did not include the factor of Position). We conducted an ANOVA and found a significant three-way interaction of Visual stimulus, Auditory stimulus, and Position, F(1, 1,318) = 9.35, p = .002. The two-way interactions were significant between Visual stimulus and Auditory stimulus, F(1, 1,318) = 4.38, p = .037, and between Auditory stimulus and Position, F(1, 1,318) = 9.59, p = .002, but not between Visual stimulus and Position, F(1, 1,318) = 2.0, p = .16. Although there were significant main effects of Visual stimulus, F(1, 6) = 488.9, p < .001, and Position, F(1, 1,318) = 5.23, p = .022, the main effect of Auditory stimulus was not significant, F(1, 1,318) = 1.87, p = .17.
Figure 7. Results of Experiment 3. Bars show the least-squares mean of reported probe size in each condition estimated by using a linear mixed model (described in the procedure section). Error bars refer to 95% confidence intervals. *p < .05. **p < .01. ***p < .001.
Next, we split the data by Visual condition to see in detail how the auditory and positional factors affected perceived size. Following model selection with backward elimination of the random effects for each of the VL and VR data sets, the selected model included all possible fixed effects and the random by-participant intercepts. For the VL condition, an ANOVA showed a significant interaction between Auditory stimulus and Position, F(1, 660) = 12.9, p < .001. The main effects of Auditory stimulus and Position were also significant, Auditory stimulus: F(1, 660) = 4.08, p = .044; Position: F(1, 660) = 4.67, p = .031. For the VR condition, there was no significant interaction between Auditory stimulus and Position, F(1, 657) = 0.002, p = .97, and no main effect of Auditory stimulus, F(1, 657) = 0.50, p = .48, or Position, F(1, 657) = 0.72, p = .40.
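As a rough sanity check on how the F and p values for the VL condition relate, the following sketch uses a large-sample normal approximation. It is not the authors' analysis: it assumes only that, with one numerator degree of freedom and a large denominator df (660 here), F behaves like a squared standard normal deviate, so that p is approximately erfc(sqrt(F / 2)). The function name is hypothetical.

```python
import math


def approx_p_from_f(f_value):
    """Approximate p for F(1, df2) when df2 is large.

    With one numerator degree of freedom, F equals t squared, and for large
    df2 the t distribution is close to the standard normal, so the two-sided
    p value is approximately erfc(sqrt(F / 2)).
    """
    return math.erfc(math.sqrt(f_value / 2.0))


# F values reported in the text for the VL condition (denominator df = 660)
reported = {"interaction": 12.9, "auditory": 4.08, "position": 4.67}
for name, f_value in reported.items():
    print(f"{name}: F = {f_value}, approximate p = {approx_p_from_f(f_value):.4f}")
```

Under this approximation the results fall close to the reported p = .044 and p = .031 for the two main effects and below .001 for the interaction. Small differences in the third decimal place are expected, because the exact F distribution has slightly heavier tails than the normal approximation.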
Finally, we compared the estimated probe size across the auditory conditions within the VL condition. The mean adjusted probe size was significantly larger for the AL-front condition than for the other three conditions, AL-front vs. AL-rear: t(661) = 4.06, p < .001; AL-front vs. AR-front: t(661) = 3.97, p < .001; AL-front vs. AR-rear: t(661) = 2.96, p = .018. There were no significant differences among the other conditions, AL-rear vs. AR-front: t(661) = −9.96, p = .92; AL-rear vs. AR-rear: t(661) = −16.41, p = .27; AR-front vs. AR-rear: t(661) = −15.87, p = .31.
The analysis showed that participants overestimated the size of the looming target only when it was paired with the front-presented looming sound. By contrast, the perceived size of the receding target was not significantly biased by any of the sound conditions.
Discussion
The results were consistent with those obtained in Experiments 1 and 2: When participants were asked to reproduce the perceived size, we found a specific effect for the combination of audiovisual looming signals with spatial consistency. Only in the VL condition did the looming sound from the front loudspeaker enlarge the reported probe size; a receding sound did not modulate perceived visual size. Rear-presented sounds had no significant influence on visual size perception. These results again indicate the importance of front/rear spatial consistency for audiovisual interaction of looming signals. The results cannot be explained by motor response facilitation by looming signals, providing additional evidence for the auditory modulation of the perceived visual size per se. Moreover, given the consistent and specific effect of looming sound only in the case of spatial congruency with the visual stimulus, mere response bias can be ruled out as the major factor in the results.
The reported probe size tended to be larger in the VL condition and smaller in the VR condition, compared to the averaged final target size in each condition (VL: 400 pixels, VR: 200 pixels). This constant overshooting of size estimation might be due to representational momentum (e.g., Freyd & Finke, 1984; Hubbard, 2005). Given that a similar effect was observed for the no-sound conditions (VL-AN and VR-AN), it is unlikely to reflect any auditory influence.
General Discussion
In three experiments, we consistently demonstrated that only the looming sound presented in spatial and directional congruence with the dynamic visual stimulus caused participants to overestimate the size of the visual stimulus. The perceived visual size was measured by response latency in Experiments 1 and 2 and by probe size adjustment in Experiment 3. The series of results demonstrates that looming sound has an impact on visual size perception not only of a static object (Sutherland et al., 2014) but also of a dynamic visual object that is perceived as moving in depth. Looming sound altered the visual perception of a looming (i.e., expanding) object but did not affect that of a receding (i.e., contracting) one.
There are several sources of potential confounds, but none were crucial for our results. First, the sound intensity itself cannot account for the modulation of vision, suggesting that the observed overestimation of visual size does not merely reflect an association between size and intensity that has been indicated to modulate the apparent size of a static object (Takeshima & Gyoba, 2013). Second, the results of Experiment 3 eliminate any confounding effects of looming sound on perceived target duration (Grassi & Darwin, 2006; Schlauch et al., 2001) or on motor responses (Cappe et al., 2009; Noel et al., 2015) as a warning signal (Ball & Tronick, 1971; Schiff et al., 1962). Third, while participants might have predicted the target motion direction at the start of the trial in Experiments 1 and 2, it was hardly possible in Experiment 3. The consistent results across the three experiments indicate that predictability is unlikely to be a crucial factor. Finally, the cross-modal effects of front/rear sounds should depend on successful localization of the sounds. The participants showed good localization abilities in our pretest, possibly resulting in the specific effect of the front-presented sound. It remains possible, however, that the results might have been clearer had we used sounds with more frequency components (e.g., white noise) that are more clearly localized in front/rear spaces (Heffner, Koay, & Heffner, 1996).
Our findings suggest the importance of 3D spatial consistency of audiovisual stimuli for synergistic neural activity and the audiovisual interaction of looming signals (Cappe et al., 2009, 2012) and extend the spatial rule of audiovisual integration into 3D space. Our results showed clearly that looming sound modulated the perceived size of an expanding object only when both were presented in front of the observer; rear-presented sounds did not affect frontal vision. One plausible explanation is that audiovisual stimuli are selectively integrated when they are assumed to be a unified event based on their spatial relationship in 3D space (unity assumption, Chen & Vroomen, 2013; Chen & Spence, 2017; Vatakis & Spence, 2008; Welch & Warren, 1980).
Sound of decreasing intensity did not significantly alter visual size perception in the three experiments, though it is known to produce the impression of a receding sound source motion (Neuhoff, 1998). This result is consistent with our prediction that receding sound has no effect on dynamic vision because it is not associated with threatening circumstances such as collision, and thus does not elicit adaptive perceptual biases (Neuhoff, 1998).
We provide evidence that vision can be modulated by auditory inputs, even in the case of size perception that is thought to be vision-dominated (Guttman, Gilroy, & Blake, 2005; Hassan, Thompson, & Hammet, 2012; Jain, Sally, & Papathomas, 2008; Kitagawa & Ichihara, 2002; Sanabria, Lupiáñez, & Spence, 2007; Soto-Faraco, Kingstone, & Spence, 2003; Soto-Faraco, Lyons, Gazzaniga, Spence, & Kingstone, 2002; Soto-Faraco, Morein-Zamir, & Kingstone, 2005; Soto-Faraco, Spence, Lloyd, & Kingstone, 2004). The observed effect recalls previous reports that ambiguous visual stimuli are captured by salient auditory information (Alink et al., 2012; Freeman & Driver, 2008; Hidaka et al., 2008; Maeda, Kanai, & Shimojo, 2004; Sekuler & Sekuler, 1997; Shams et al., 2002). In ambiguous visual conditions, auditory cues become more informative for perceiving external events. In the current experiments, the actual size of the target stimuli was difficult to determine because they changed dynamically and were presented peripherally (4.5° below a fixation dot). Thus, vision may be prone to modulation by significant auditory spatial information such as the looming sound.
Brain mechanisms underlying the observed audiovisual interaction remain to be clarified. Given the present results, one could speculate that the brain uses not only visual but also auditory information to perceive the size of objects. Sutherland et al. (2014) explained that perceived size of a static object paired with a looming sound could be modulated because the significant distance information of the sound is remapped onto the visual domain. Previous research has suggested that visual size perception is shaped depending on visual distance information (Qian & Petrov, 2016). Likewise, Tanaka and Fujita (2015) found that V4 neurons encode the actual size of objects and accomplish size constancy by combining retinal image size and visual distance information. Together with these pieces of evidence, our findings suggest that information about distance change conveyed by looming sound impacts visual size perception. Strong spatial information from looming sound (i.e., decreasing distance of the sound source) could be utilized to compensate for ambiguity in perceived visual size.
To summarize, the current study provides the novel finding that looming sound modulates the perceived size of a looming visual object only in the case of audiovisual front/rear spatial consistency. Together with several established psychological and neurophysiological findings, our results indicate specific multimodal influences of looming sound. The revealed front-rear spatial specificity of audiovisual interaction provides a new insight for understanding the spatial rule governing the integration of multisensory information. The brain may differentially process audiovisual inputs according to whether they are assumed to be a unified looming event or not. This selective process might facilitate efficient perception of natural dynamic events and faster reactions to approaching threats.
Supplemental Material
Supplemental material for Front-Presented Looming Sound Selectively Alters the Perceived Size of a Visual Looming Object by Daiki Yamasaki, Kiyofumi Miyoshi, Christian F. Altmann and Hiroshi Ashida in Perception
Footnotes
Acknowledgement
Portions of the results were presented at International Multisensory Research Forum 2017.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was supported by Japan Society for the Promotion of Science Grant-in-Aid for Scientific Research (KAKENHI) #26285165 for HA.
Supplementary Material
Supplementary material for this article is available online.
References
