Abstract
When the two eyes view dissimilar images, an observer typically reports an ambiguous perception called binocular rivalry, in which subjective perception fluctuates between the two inputs. This perceptual instability typically comprises exclusive dominance of each image and a transition state, called the piecemeal state, in which the two images are intermingled in a patchwork manner. Herein, we investigated the effects of multimodal association of sensory congruent pairs, arbitrary pairs, and reverse pairs on the piecemeal state in order to see how each level of association affects ambiguous perception during binocular rivalry. To induce the multisensory associations, we designed a matching task with audiovisual feedback in which subjects were required to respond according to given pairing rules. We found that explicit audiovisual associations can substantially affect the piecemeal state during binocular rivalry and that this congruency effect, which reduces the amount of visual ambiguity, originates primarily from explicit audiovisual association training rather than from common sensory features. Furthermore, when one piece of information is associated with multiple pieces of information, recent and preexisting associations work collectively to influence perceptual ambiguity during rivalry. Our findings show that learned multimodal association directly affects the temporal dynamics of ambiguous perception during binocular rivalry by modulating not only the exclusive dominance but also the piecemeal state in a systematic manner.
Introduction
When a pair of incompatible images is shown, one to each eye, subjective perception becomes ambiguous even though the optical input to each eye remains unchanged (Wheatstone, 1838). To resolve the unstable state, the brain endeavors to achieve a single unequivocal interpretation by constantly alternating between the two sensory inputs. Each percept takes its turn, occupying only a few seconds of the subject’s experience, and this competition continues as long as the rival stimuli are present (Blake, 2001). Because of this dissociation between the physical input and the subjective experience, binocular rivalry has been considered an important experimental tool for exploring the neural correlates of conscious visual awareness and how the brain deals with ambiguous information.
Perhaps one of the most striking differences between binocular rivalry and other forms of perceptual bistability experienced with ambiguous figures (Boring, 1930; Necker, 1832; Rubin, 1921) is the existence of a mixed percept in which the two given images are perceived simultaneously (Paffen, Naber, & Verstraten, 2008). Also called the piecemeal state because of its patchwork-like appearance, this transient state is the common ground before exclusive selection or reversal takes place. Thus, inspecting the piecemeal state can offer insight into the mechanism of perceptual disambiguation. For example, it has been shown that the piecemeal state increases with larger visual angle, in which distributed zones undergo rivalry independently, suggesting that visual disambiguation transpires not globally between the eyes but in each separate receptive field of monocular vision (Blake, 2001). Another study using complementary patchworks of intermingled rivalrous images reported that subjects experienced competition between unscrambled coherent images rather than between the separate patchwork images (Kovacs, Papathomas, Yang, & Feher, 1996). This suggests that rather than oscillating between the given low-level optical inputs themselves, the brain first converts them into familiar subjective patterns and then tries to conjoin them in order to disambiguate the apparent paradox. These studies reveal that the dynamics of the piecemeal state are affected not only by low-level neural signals but also by higher level cognitive processes such as semantics, making piecemeal rivalry an intriguing phenomenon for testing the mechanism of perceptual disambiguation.
Over decades, the mechanism of perceptual disambiguation in binocular rivalry has been thoroughly investigated by studying several bottom-up and top-down factors that influence the behavior of binocular rivalry (Blake & Logothetis, 2002; Brascamp, Klink, & Levelt, 2015; Tong, Meng, & Blake, 2006). Earlier studies using luminance difference (Fox & Rasche, 1969; Kakizaki, 1960), contrast polarity (Hollins, 1980; Whittle, 1965), or color difference (Carney, Shadlen, & Switkes, 1987; Hollins & Leung, 1978) showed that any difference between the left- and right-eye stimuli could instigate the ambiguity, pointing to lateral inhibition and self-adaptation of optical neural signals as the main building blocks of ambiguity control (Wilson, 2003). Studies using top-down factors such as attention (Lack, 1969, 1974; Ooi & He, 1999; Paffen & Alais, 2011; von Helmholtz & Southall, 1924) or emotional salience (Alpers & Gerdes, 2007) showed the possibility of attentional control during the disambiguation process. Other studies in a similar vein using induced color difference (Andrews & Lotto, 2004; Hong & Shevell, 2008) or the orientation aftereffect (Chopin, Mamassian, & Blake, 2012) showed that even physically identical stimuli could cause rivalry when they signify different information at the level of subjective experience, supporting the idea that the disambiguation process takes place in higher level neurons along the visual pathway rather than in low-level neural signals (Dijkstra, van de Nieuwenhuijzen, & van Gerven, 2016; Scocchia, Valsecchi, & Triesch, 2014).
Furthermore, brain-imaging studies using functional magnetic resonance imaging, magnetoencephalography, or electroencephalography imply that there is no isolated cortical area selectively correlating with the participant’s current percept during binocular rivalry (Kornmeier & Bach, 2012; Lumer, Friston, & Rees, 1998; Maier et al., 2008; Polonsky, Blake, Braun, & Heeger, 2000; Srinivasan, Russell, Edelman, & Tononi, 1999; Tong, Nakayama, Vaughan, & Kanwisher, 1998; Wilke, Logothetis, & Leopold, 2006).
Recently, several studies reported the influence of even higher levels of information processing such as multimodal congruence (Chen, Yeh, & Spence, 2011; Conrad, Bartels, Kleiner, & Noppeney, 2010; Einhäuser, Methfessel, & Bendixen, 2017; Kang & Blake, 2005; Lunghi & Alais, 2013; Lunghi, Binda, & Morrone, 2010; Lunghi, Morrone, & Alais, 2014; Maruya, Yang, & Blake, 2007; Pápai & Soto-Faraco, 2017; Piazza, Denison, & Silver, 2018; van Ee, van Boxtel, Parker, & Alais, 2009) or prior experience (Klink, Boucherie, Denys, Roelfsema, & Self, 2017; Klink, Brascamp, Blake, & van Wezel, 2010). The fact that even cross-modal influence or prior experience can directly modulate the dynamics of rivalry implies that the mechanism of conscious univocal interpretation of given environmental cues is more plastic and integrative than previous psychophysics and brain-imaging studies have speculated. Moreover, these studies showed that previous experience as recent as laboratory-induced multimodal association tasks could significantly influence the contents of conscious perception under ambiguity.
However, these studies primarily focused on unitary rivalry, investigating the cross-modal influence of disambiguation on initial selection or dominance enhancement of exclusively congruent stimuli during multimodal binocular rivalry. Dissociating the effects of cross-modal processing on unitary rivalry and on piecemeal rivalry will allow us to better understand how multimodal interaction affects different states of perceptual ambiguity during binocular rivalry. Herein, we investigated the effects of novel multimodal association on the piecemeal state, the common ground of selection and alternation, in an attempt to shed more light on how perceptual ambiguity is resolved in a multidimensional environment.
Experiment 1
Objective
In our first experiment, we investigated the effects of multimodal association of sensory congruent audiovisual pairs on the piecemeal state. Specifically, we compared the disambiguation effects of low-level sensory congruence and high-level cognitive congruence by probing the temporal dynamics of the piecemeal state. Studies on multisensory integration show that visual and auditory information can significantly affect each other and dramatically change the perceptual interpretation of a given input (Alais & Burr, 2004; McGurk & MacDonald, 1976; Shimojo & Shams, 2001). Although some studies argued for a perceptual synergy of low-level sensory congruence such as flicker frequency (Kang & Blake, 2005), motion direction (Conrad et al., 2010), or spatial orientation (Stein & Stanford, 2008), many agree that the multisensory congruence effect becomes stronger with high-level, more naturalistic audiovisual pairs formed through repetitive experience in complex natural scenes over time (Conrad et al., 2013; Gilbert & Sigman, 2007; Piazza et al., 2018; Shams & Seitz, 2008; Stein, Stanford, & Rowland, 2014; Talsma, 2015). Although the mechanisms for forming new multisensory associations that may influence the contents of conscious perception are unclear, recent works utilizing explicit multisensory association showed that laboratory-induced audiovisual association tasks can significantly modulate the predominance during unitary rivalry. Thus, in this experiment, we investigated the modulatory effect of auditory information on the piecemeal state before and after the audiovisual association task to compare the disambiguation effects of sensory congruence and cognitive congruence.

Figure 1. Stimuli set and association rules for Experiment 1. For the low-level sensory congruent audiovisual stimuli, flickering sinusoidal gratings (1 Hz and 3 Hz) and amplitude-modulated tones (1 Hz and 3 Hz) were used. In the matching task, the audiovisual stimuli were paired by matching temporal frequencies.
Stimuli and Apparatus
For the low-level sensory congruent audiovisual stimuli, we designed a pair of flickering sinusoidal gratings (1 Hz and 3 Hz) and amplitude-modulated (AM) tones (1 Hz and 3 Hz) so that the audiovisual stimuli could form sensory congruent pairs at each temporal frequency (Figure 1). The sinusoidal gratings and AM tones were generated using MATLAB in conjunction with the PsychToolBox-3 (Brainard, 1997; Kleiner, Brainard, & Pelli, 2007). Both gratings had a spatial frequency of 2.5 c/deg and were oriented either horizontally (1 Hz flicker) or vertically (3 Hz flicker). The gratings appeared within a circular Gaussian mask. Both patches were surrounded by a black line-drawing box with a red fixation cross in the middle in order to help subjects maintain a stable stereoscopic match during the rivalry procedure. The fixation cross subtended 0.57° with a line width of 0.11° at a viewing distance of 50 cm. The contrast of both gratings ranged from 0 (invisible) to 1 with a π/4 phase shift to ensure that at least one of the stimuli would be visible at any given time. The horizontal and vertical gratings were 600 × 600 pixels in size, which subtended approximately 8.02° at a viewing distance of 50 cm. Relatively large visual stimuli were used on purpose in order to instigate more piecemeal rivalry during the sessions (Hollins & Hudnell, 1980). All the visual stimuli were presented on a gray background (mid value of the color look-up table of the screen). The sampling rate of the AM tones was 48 kbps, and the carrier was a 500 Hz monotone. The 1 Hz and 3 Hz AM tones were also phase shifted by π/4 to sync with the horizontal and vertical flickering gratings, respectively. Both tones were prepared for stereo playback.
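The modulation scheme described above can be illustrated with a minimal sketch. It is not the authors' MATLAB code, and several choices are assumptions rather than facts from the text: the envelope is taken to be a raised cosine, the π/4 shift is applied to both the 1 Hz and the 3 Hz envelope, and the stated rate is read as 48,000 samples per second. The function names are hypothetical. Under these assumptions, the sketch builds a mono AM tone and checks numerically whether the two contrast envelopes ever reach zero at the same moment.

```python
import math

SAMPLE_RATE = 48_000   # samples per second (assumed reading of the stated rate)
CARRIER_HZ = 500.0     # carrier frequency given in the text
PHASE = math.pi / 4    # phase shift given in the text


def envelope(t, mod_hz, phase=PHASE):
    """Raised-cosine envelope in [0, 1] (assumed shape, not confirmed by the text)."""
    return 0.5 * (1.0 + math.cos(2.0 * math.pi * mod_hz * t + phase))


def am_tone(mod_hz, seconds, phase=PHASE):
    """Mono AM tone: the carrier is scaled sample by sample by the envelope."""
    n = int(round(seconds * SAMPLE_RATE))
    return [
        envelope(i / SAMPLE_RATE, mod_hz, phase)
        * math.sin(2.0 * math.pi * CARRIER_HZ * i / SAMPLE_RATE)
        for i in range(n)
    ]


def lowest_joint_contrast(phase, step=0.001, seconds=1.0):
    """Smallest value over time of the larger of the 1 Hz and 3 Hz envelopes."""
    n = int(round(seconds / step))
    return min(
        max(envelope(i * step, 1.0, phase), envelope(i * step, 3.0, phase))
        for i in range(n + 1)
    )


tone_1hz = am_tone(1.0, seconds=1.0)
print(len(tone_1hz))                        # 48000
print(lowest_joint_contrast(0.0))           # close to 0: both envelopes vanish together
print(lowest_joint_contrast(math.pi / 4))   # stays above zero
```

With an unshifted raised cosine, both envelopes reach zero at t = 0.5 s, whereas with the π/4 shift the larger of the two never does. This is consistent with the stated purpose of the phase shift, although the actual envelope used in the study may differ.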
The visual stimuli were presented on the 27-in. LCD display of an iMac with a screen resolution of 2,560 × 1,440. The subjects viewed the monitor through a mirror stereoscope, with each eye seeing only the left or the right half of the monitor. The mirror stereoscope could be adjusted to fit each subject’s vantage point to ensure binocular superposition of the left and right visual fields at the center. An adjustable chin rest was used to help subjects keep their head still during the calibration and rivalry procedure. The auditory stimuli were presented through Sony noise-canceling headphones to shut out irrelevant outside noise.
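The visual angles quoted for the stimuli can be related to physical size on the screen with the standard formula θ = 2·atan(s / 2d). The following sketch uses hypothetical helper names and does not depend on any assumption about the pixel pitch of the display; it simply converts the stated angles back to on-screen sizes at the stated viewing distance of 50 cm.

```python
import math

VIEWING_DISTANCE_CM = 50.0  # viewing distance stated in the text


def visual_angle_deg(size_cm, distance_cm=VIEWING_DISTANCE_CM):
    """Angle subtended by an object of size_cm centred on the line of sight."""
    return math.degrees(2.0 * math.atan(size_cm / (2.0 * distance_cm)))


def size_for_angle_cm(angle_deg, distance_cm=VIEWING_DISTANCE_CM):
    """Inverse of visual_angle_deg: physical size that subtends angle_deg."""
    return 2.0 * distance_cm * math.tan(math.radians(angle_deg) / 2.0)


grating_cm = size_for_angle_cm(8.02)   # about 7.0 cm on screen
cross_cm = size_for_angle_cm(0.57)     # about 0.5 cm on screen
print(round(grating_cm, 2), round(cross_cm, 2))
```

Under this formula, a stimulus subtending 8.02° at 50 cm is roughly 7 cm wide, and the 0.57° fixation cross is roughly 0.5 cm.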
Subjects
Fourteen undergraduate volunteers (4 females, mean age = 21.2 years, standard deviation [SD] = 1.8 years, range 19–25) with normal or corrected-to-normal vision and normal hearing participated in this experiment in exchange for course credit. The subjects were naive as to the specific purpose of this study. The subjects were also naive to the visual and auditory stimuli and their combinations, meaning that they had no explicit knowledge of any association from which to infer cross-modal congruence from the given auditory information. Note also that the subjects did not know the exact temporal frequencies (1 Hz or 3 Hz) of the visual and auditory stimuli, nor did they know that the stimuli could be audiovisually paired by matching temporal frequencies. All subjects provided written informed consent before participating in this study, and the tenets of the Declaration of Helsinki were followed. Every aspect of this study was approved by and carried out in accordance with the regulations of the Korea Advanced Institute of Science and Technology (KAIST) Institutional Review Board.
Procedure
Subjects were initially told to sit in a comfortable position and adjust the mirror stereoscope and the chin rest to fit their view of the monitor. The subjects were trained in advance to report their subjective experience using three predesignated keyboard keys (‘d’ for vertical grating, ‘v’ for horizontal grating, and ‘f’ for piecemeal). Each key was to be pressed only once as soon as an image came into dominance. Here, the criterion for the piecemeal state was defined as ‘when neither one of the stimuli is exclusively dominant.’
The recording procedure began with three 60-second rivalry sessions (no sound, with 3 Hz sound, and with 1 Hz sound), followed by an explicit matching task with given association rules, and finished by repeating the three 60-second rivalry sessions (no sound, with 3 Hz sound, and with 1 Hz sound) to see the intervention effect of sound and audiovisual association. Before each 60-second rivalry session, the guiding box and the red fixation cross appeared on a gray background to let the subjects adjust the mirror stereoscope and calibrate the views. After stereoscopic matching, the subjects pressed the space bar to begin the actual rivalry session. During the audiovisual rivalry, the visual and auditory stimuli were presented continuously from the beginning to the end of the session. The subjects were given sufficient break time between 60-second sessions in order to recover from any eye strain and minimize the effects of adaptation. Also, the dichoptic presentation of the visual stimuli was counterbalanced across the eyes for each subject in order to eliminate any effects of eye dominance.
In the audiovisual matching task, the subjects were told the exact temporal frequencies of the stimuli and that the stimuli could be audiovisually paired by matching frequencies. After learning the audiovisual association rules, the subjects went through the matching task. During the matching task, one of the four possible audiovisual combinations (1 Hz grating–1 Hz sound, 1 Hz grating–3 Hz sound, 3 Hz grating–1 Hz sound, and 3 Hz grating–3 Hz sound) was presented to the subjects. The subjects pressed the ‘O’ key when the temporal frequencies were congruent and the ‘X’ key when they were incongruent. Simple audiovisual feedback with an O/X image and a correct/incorrect tone was given on each keyboard response. This procedure was repeated until the subjects scored 20 consecutive correct answers, making them practice the rule exhaustively. If a subject made a mistake during the procedure, the score count was reset to 0 and the subject started again from the beginning. The average time taken for the subjects to pass the matching task was approximately 4 minutes (M = 251 seconds, SD = 32.70 seconds).
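The stopping rule of the matching task (20 consecutive correct answers, with any mistake resetting the count) can be sketched as follows. The function names and the response log are hypothetical and serve only to illustrate the criterion; they are not taken from the study's software.

```python
def is_congruent(grating_hz, tone_hz):
    """Experiment 1 rule: a pair matches when the temporal frequencies agree."""
    return grating_hz == tone_hz


def trials_to_criterion(trials, criterion=20):
    """Return the number of trials used to reach `criterion` consecutive
    correct answers, or None if the criterion is never reached.

    Each trial is (grating_hz, tone_hz, key), with key 'O' or 'X'.
    """
    streak = 0
    for n, (grating_hz, tone_hz, key) in enumerate(trials, start=1):
        expected = "O" if is_congruent(grating_hz, tone_hz) else "X"
        if key == expected:
            streak += 1
        else:
            streak = 0          # any mistake restarts the count
        if streak == criterion:
            return n
    return None


# Hypothetical response log: 5 correct, 1 error, then 20 correct.
correct = [(1, 1, "O"), (3, 1, "X"), (1, 3, "X"), (3, 3, "O")]
log = correct + [(1, 1, "O"), (3, 3, "X")] + correct * 5
print(trials_to_criterion(log))   # 26
```

Because a single error discards the whole streak, the number of trials needed can be considerably larger than 20 even for a subject who answers correctly most of the time.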
Results
The duration of time for 3 Hz, 1 Hz, and piecemeal percepts was recorded for each subject. As some of the subjects occasionally pressed the same key twice by accident (average error response rate = 0.79%, SD = 0.020), the logged data were trimmed to exclude these events (both initial and repeated keypress events) leaving only the valid events. The mean dominance duration, total dominance, and switch rates were then analyzed using SPSS statistics software version 22.0. Two-way repeated measures analyses of variance (ANOVAs) were conducted on the influence of two independent variables (association and sound) to compare the main effects of association and sound and the interaction effects between association and sound on 3 Hz, 1 Hz, and piecemeal percepts and switch rate, respectively. Association included two levels (before and after), and sound consisted of three levels (no sound, 3 Hz sound, and 1 Hz sound). Here, the basic idea was that if a congruent auditory stimulus helps the selection of coinciding visual stimulus during intermittent ambiguous states, the overall mean and total durations of the piecemeal state should decrease with 3 Hz or 1 Hz sound in addition to its congruence effects on complete dominance. Also, if the low-level temporal congruence of visual and auditory stimuli themselves was sufficient to induce multisensory integration, the auditory modulatory effects should be present without any explicit association process. Should this not be the case, the piecemeal resolution should occur only after learning the audiovisual pairing rules and forming higher level cognitive congruence. However, if the intervention of sound or association cannot cause any difference in conscious perception during piecemeal state, neither sound nor association would result in significant main effects on piecemeal durations.
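As an illustration of the trimming step, the sketch below derives percept durations from a keypress log and drops accidental repeats. The text does not spell out how durations were computed around the removed events, so two assumptions are made here: a percept lasts from its keypress to the next keypress (or to the end of the session), and a repeated key invalidates both the initial and the repeated event. The function names and the example log are hypothetical.

```python
SESSION_SECONDS = 60.0  # session length stated in the text


def percept_durations(events, session_end=SESSION_SECONDS):
    """events: time-ordered list of (time_s, key). Returns {key: [durations]}.

    Assumption: a percept lasts from its keypress to the next keypress
    (or to the end of the session), and an accidental repeat invalidates
    both the initial and the repeated event.
    """
    invalid = set()
    for i in range(len(events) - 1):
        if events[i][1] == events[i + 1][1]:
            invalid.update((i, i + 1))

    durations = {}
    for i, (t, key) in enumerate(events):
        if i in invalid:
            continue
        t_next = events[i + 1][0] if i + 1 < len(events) else session_end
        durations.setdefault(key, []).append(t_next - t)
    return durations


def summarize(durations):
    """Mean and total duration per key."""
    return {
        key: {"mean": sum(d) / len(d), "total": sum(d)}
        for key, d in durations.items()
    }


# Hypothetical 15-second log using the Experiment 1 keys
# ('d' vertical, 'v' horizontal, 'f' piecemeal); 'v' is pressed twice in a row.
log = [(0.0, "f"), (2.0, "d"), (5.0, "f"), (6.0, "v"),
       (9.0, "v"), (10.0, "f"), (12.0, "d")]
summary = summarize(percept_durations(log, session_end=15.0))
print(summary)
```

In this example, both ‘v’ events are discarded, so the horizontal percept contributes no valid durations, while the piecemeal and vertical percepts keep three and two valid events, respectively.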
First, the analysis on complete dominance of 3 Hz and 1 Hz percepts revealed consistent enhancement effects of congruent auditory stimuli after association in mean and total dominance duration (Figure 2). In the mean dominance duration of 3 Hz percept, there was a significant interaction effect between association and sound, F(2, 26) = 5.81, p = .008,

Figure 2. Results of Experiment 1. Mean and total dominance duration of 3 Hz visual percept (top row), 1 Hz visual percept (second row), piecemeal percept (third row), and switch rate (bottom row). The small cross symbols represent individual data points. Error bars represent 1 standard error of the mean. ns = not statistically significant. *p < 0.05, **p < 0.01.
In the mean duration of piecemeal state, the results revealed a significant interaction effect between association and sound, F(2, 26) = 5.81, p = .008,
This implies that the matching temporal frequency without explicit association was insufficient for subjects to infer audiovisual congruence, whereas learning the audiovisual pairing rules could help the subjects break away from the visual ambiguity during the rivalry.
Experiment 2
Objective
In our previous experiment, we showed that the multimodal association of congruent audiovisual pairs could significantly reduce the duration of piecemeal state. However, one can speculate that this congruence effect emerged not entirely from the explicit association but also in part from the apparent sensory congruence itself. In this case, the cognitive congruence acquired by the association task could have merely facilitated the preexisting sensory congruence into significance. Also, as mentioned in ‘Objective’ section of Experiment 1, the cross-modal congruence effect is expected to be stronger with ecologically valid audiovisual pairs which in many cases bear the intrinsic low-level sensory congruence such as spatiotemporal synchrony (De Gelder & Bertelson, 2003; Macaluso & Driver, 2005; Macaluso, George, Dolan, Spence, & Driver, 2004; Stevenson, Fister, Barnett, Nidiffer, & Wallace, 2012).
However, recent studies using motion or simple gratings paired with pure tones have demonstrated that learning the given rules in the laboratory could bias the initial perception or increase the dominance duration to some extent during binocular rivalry (Einhäuser et al., 2017; Piazza et al., 2018). This implies that arbitrary pairings of nonnaturalistic stimuli could also affect the temporal dynamics of piecemeal state. Thus, in Experiment 2, drawing from the results of Experiment 1, we did an in-depth investigation of multimodal association on piecemeal state using arbitrary audiovisual pairs void of any sensory congruence in order to isolate the effects of cognitive congruence.

Figure 3. Stimuli set and association rules for Experiment 2. For the arbitrary audiovisual stimuli, images of animals (dog/duck) and amplitude-modulated tones (1 Hz/3 Hz) were used. In the matching task, the given audiovisual pairing rules were ‘Dog–3 Hz’ and ‘Duck–1 Hz.’
Stimuli and Apparatus
For the arbitrary audiovisual pairs without any apparent sensory congruence, we adopted line drawings of animals (dog/duck) and AM tones (1 Hz/3 Hz) (Figure 3). The line drawings were adapted from the ‘Snodgrass and Vanderwart-Like Objects’ (Rossion & Pourtois, 2004, stimulus set files #073, #081) to subtend approximately 8.02° at a viewing distance of 50 cm. To induce extended periods of the piecemeal state, the images were flipped horizontally to face opposite directions (dog looking right and duck looking left) so that relatively large portions of the images would not overlap. Both images were presented with the same black box and red fixation cross used in Experiment 1 for stable stereoscopic matching. All the visual stimuli were presented on a gray background (mid value of the color look-up table of the monitor). The AM tones were generated using MATLAB with PsychToolBox-3 with the same specifications used in Experiment 1 but without the phase shift. The sampling rate of the AM tones was 48 kbps, and the carrier was a 500 Hz monotone. Both tones were prepared for stereo playback.
The visual stimuli were presented on the 27-in. LCD display of an iMac with a screen resolution of 2,560 × 1,440. The subjects viewed the monitor through a mirror stereoscope with an adjustable chin rest. The auditory stimuli were presented through Sony noise-canceling headphones to minimize outside noise.
Subjects
Thirteen undergraduate volunteers (5 females, mean age = 22.5 years, SD = 2.3 years, range 19–26) with normal or corrected-to-normal vision and normal hearing participated in this experiment in exchange for course credit. The subjects were naive as to the specific purpose of this study. The subjects were also naive to the visual and auditory stimuli other than the fact that the images signify ‘dog’ and ‘duck’. All subjects provided written informed consent before participating in this study, and the tenets of the Declaration of Helsinki were followed. Every aspect of this study was carried out in accordance with the regulations of the KAIST Institutional Review Board.
Procedure
The experimental flow remained the same as in Experiment 1. The subjects first adjusted the mirror stereoscope and the chin rest to achieve a secure vantage point that perfectly matches the guiding cues (black box and red fixation cross). After the calibration process, the recording began with three 60-second rivalry sessions (no sound, with 3 Hz sound, and with 1 Hz sound) as a baseline period. The subjects were trained in advance to report their subjective experience using three keyboard keys (‘h’ for dog, ‘m’ for duck, and ‘j’ for piecemeal) pressing only once as soon as an image came into dominance. The criteria for piecemeal state were defined as ‘when neither dog nor duck is exclusively dominant.’
After the baseline recording, the subjects learned the pairing rules (dog–3 Hz tone, duck–1 Hz tone) and practiced the rules using the explicit matching task. During the matching task, one of the four possible combinations (dog–3 Hz tone, dog–1 Hz tone, duck–3 Hz tone, and duck–1 Hz tone) was randomly presented on the monitor, and the subjects were told to press the ‘O’ key when the given stimuli matched the pairing rules and the ‘X’ key when they did not. Simple audiovisual feedback with an O/X image and a correct/incorrect tone was given on each keyboard response. This procedure continued until the subjects scored 20 consecutive correct answers, making them practice the rule exhaustively. Any mistake during the procedure reset the score count to 0. The average time taken for the subjects to pass the matching task was approximately 4 minutes (M = 233 seconds, SD = 23.34 seconds).
Finally, another set of three 60-second rivalry sessions (no sound, with 3 Hz sound, and with 1 Hz sound) was recorded to see the effects of multimodal association. The subjects were given sufficient break time between 60-second sessions in order to recover from any eye strain and minimize the effects of adaptation.
Results
The dominance duration of each percept was recorded and trimmed as in Experiment 1 to eliminate any erroneous responses (average error response rate = 0.13%, SD = 0.003), leaving only the valid events. The mean duration, total duration, and switch rate values were then analyzed using SPSS statistics software. Two-way repeated measures ANOVAs were conducted on the influence of two independent variables (association and sound) to compare the main effects of association and sound and the interaction effects between association and sound on dog, duck, piecemeal percepts, and switch rate, respectively. Association included two levels (before and after), and sound consisted of three levels (no sound, 3 Hz sound, and 1 Hz sound).

Figure 4. Results of Experiment 2. Mean and total dominance duration of dog visual percept (top row), duck visual percept (second row), piecemeal percept (third row), and switch rate (bottom row). The small cross symbols represent individual data points. Error bars represent 1 standard error of the mean. ns = not statistically significant. *p < 0.05, **p < 0.01.
First, the analysis on complete dominance of dog and duck percepts revealed consistent enhancement effects of explicitly paired auditory stimuli after association in mean and total dominance duration (Figure 4). In the mean dominance duration of dog percept, there was a significant interaction effect between association and sound, F(2, 24) = 6.82, p = .005,
In the mean piecemeal duration, the results revealed a significant interaction effect between association and sound, F(2, 24) = 9.40, p = .001,
This implies that, with appropriate audiovisual matching procedure, previously irrelevant auditory information void of any low-level sensory congruence can become a relevant cue to induce piecemeal resolution. Furthermore, the results of Experiment 2 support the idea that the multisensory disambiguation effect is determined not primarily by the intrinsic sensory congruence such as spatiotemporal synchrony but rather by explicit associations between distinct audiovisual information which form higher level cognitive congruence. Also, notice that the planned cognitive congruence here is acquired through repetitive exposure to novel stimuli, judging the appropriate pairing patterns and feedback process to reinforce the decision which resembles how we experience multisensory pairs in natural scenes.
Experiment 3
Objective
In the previous experiments, we showed that brief explicit associations can directly alter ambiguous perception during binocular rivalry by modulating the duration of the piecemeal state, even with arbitrary audiovisual stimuli void of any apparent sensory congruence. Reflecting on the results of this study thus far and on previous reports of the effects of explicit associations on binocular rivalry (Einhäuser et al., 2017; Piazza et al., 2018), we noticed that the multisensory congruence acquired in these laboratory-designed tasks consisted of newly formed, recent associations, in contrast to common naturalistic associations. As mentioned in the ‘Objective’ section of Experiment 1, the influence of multisensory congruence is expected to become more evident over time with repetitive experience (e.g., the association between mouth and speech leading to the ventriloquist illusion), possibly even becoming an automatic process. However, other studies on top-down effects of multisensory integration suggest, on the contrary, that automatic multisensory responses can be overruled depending on the current task or the expectations of the observer (Koelewijn, Bronkhorst, & Theeuwes, 2010; Macaluso et al., 2016; van Atteveldt, Formisano, Goebel, & Blomert, 2007).
Thus, in Experiment 3, we aimed to examine how recent audiovisual associations compare to already existing strong associations in terms of piecemeal modulation during binocular rivalry. Specifically, we adopted a pair of semantically congruent stimuli (images and soundtracks of animals) and designed a multimodal reverse association task to compare the modulatory effects of recent reverse associations and long-standing natural associations.

Figure 5. Stimuli set and association rules for Experiment 3. For the semantically congruent audiovisual stimuli, images and soundtracks of a dog and a duck were used. Subjects were highly familiar with these stimuli and could correctly categorize each visual and auditory stimulus as either ‘dog’ or ‘duck’ without any instruction. In the matching task, reverse association rules were given in which subjects had to pair the dog image with the crying duck sound and vice versa.
Stimuli and Apparatus
For the semantically congruent audiovisual stimulus pairs with robust natural associations, we chose the images and soundtracks of a dog and a duck (Figure 5). The line drawings of animals were adapted from the ‘Snodgrass and Vanderwart-Like Objects’ as in Experiment 2. The images were flipped horizontally to instigate more of the piecemeal state, with each image subtending approximately 8.02° at a viewing distance of 50 cm. Both images were presented with the guiding cues (black box/red fixation cross) for stable stereoscopic matching. All the visual stimuli were presented on a gray background. The real-life recordings of barking dog and crying duck soundtracks were downloaded from the Soundsnap website (www.soundsnap.com) and prepared for 60-second stereo playback to ensure smooth, continuous presentation of the auditory context. The subjects could adjust the volume of each soundtrack before the presentation so that the two had similar subjective strength. The apparatus used to present the stimuli was the same as in the previous experiments.
Subjects
Fifteen undergraduate volunteers (3 females, mean age = 22.2 years, SD = 1.5 years, range 19–24) with normal or corrected-to-normal vision and normal hearing participated in this experiment in exchange for course credit. The subjects were naive to the specific purpose of this study but were highly familiar with the semantic contents of the visual and auditory stimuli, such that they could easily categorize the stimuli into two groups (dog/duck) without any instruction. This means that the subjects already retained the natural associations (dog image–barking dog soundtrack and duck image–crying duck soundtrack) from previous experience. All subjects provided written informed consent before participating in this study, and the tenets of the Declaration of Helsinki were followed. Every aspect of this study was carried out in accordance with the regulations of the KAIST Institutional Review Board.
Procedure
The subjects first calibrated the mirror stereoscope to superimpose the left and right guiding cues. The rivalry report began with three 60-second recordings (no sound, with barking dog sound, and with crying duck sound) to see the effects of semantically congruent soundtracks in natural associations.
Then, the subjects underwent the matching task to learn the reverse associations. During the matching task, one of the four possible audiovisual combinations (dog—barking dog soundtrack, dog—crying duck soundtrack, duck—barking dog soundtrack, and duck—crying duck soundtrack) was randomly presented on the monitor. The subjects were told to press ‘O’ key when the presented pair matches the given reverse association rules (either dog image with crying duck sound or duck image with barking dog sound) and ‘X’ when mismatch. Every response was reinforced using an audiovisual feedback of O/X image with correct/incorrect tone. The matching task continued until the subjects could score 20 correct answers. Any mistake during the task reset the score count to 0, making the subjects stay attentive to the rules. The average task completion time was 4 minutes (M = 245 seconds, SD = 28.16 seconds).
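The relation between the natural and the reverse pairing rules can be made concrete with a small sketch. The labels "bark" and "quack" are hypothetical shorthand for the two soundtracks, and the helper name is likewise an assumption. The sketch lists the four image and sound combinations together with the response each rule would count as correct.

```python
from itertools import product

IMAGES = ("dog", "duck")
SOUNDS = ("bark", "quack")   # hypothetical labels for the two soundtracks

NATURAL = {"dog": "bark", "duck": "quack"}   # preexisting association
REVERSE = {"dog": "quack", "duck": "bark"}   # rule trained in the matching task


def expected_key(image, sound, rule):
    """'O' if the pair matches the rule in force, otherwise 'X'."""
    return "O" if rule[image] == sound else "X"


table = [
    (image, sound,
     expected_key(image, sound, NATURAL),
     expected_key(image, sound, REVERSE))
    for image, sound in product(IMAGES, SOUNDS)
]
for row in table:
    print(row)
```

With two images and two sounds, every combination receives opposite answers under the two rules, so each trial of the reverse task requires a response that conflicts with the natural association.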
Next, the rivalry report was repeated in the same manner (no sound, with barking dog sound, and with crying duck sound) to see whether the same soundtracks would have different effects under the reverse associations. The criterion for piecemeal was ‘when neither dog nor duck is exclusively dominant’ for all rivalry sessions. Subjects could rest between sessions to recover from any eye strain, in order to minimize the effects of adaptation.
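The reports from one 60-second session can be reduced to mean and total durations per percept and a switch rate. The following Python sketch shows one way to do this. The input format (a chronological list of percept and duration pairs) and the definition of a switch as any change of reported state are assumptions made for illustration, not details given in the text.

```python
from collections import defaultdict


def summarize_session(segments, session_length=60.0):
    """Summarize one rivalry session.

    `segments` is a chronological list of (percept, duration_in_seconds),
    with percept in {"dog", "duck", "piecemeal"} (hypothetical format).
    Returns mean and total duration per percept plus a switch rate.
    """
    durations = defaultdict(list)
    for percept, duration in segments:
        durations[percept].append(duration)

    summary = {}
    for percept in ("dog", "duck", "piecemeal"):
        values = durations.get(percept, [])
        total = sum(values)
        mean = total / len(values) if values else 0.0
        summary[percept] = {"mean": mean, "total": total}

    # Assumed definition: every change of reported state counts as one switch.
    switches = sum(1 for a, b in zip(segments, segments[1:]) if a[0] != b[0])
    summary["switch_rate"] = switches / session_length
    return summary
```

If switches were instead counted only between exclusive dog and duck percepts, the counting line would need to skip piecemeal segments.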

Figure 6. Results of Experiment 3. Mean and total dominance duration of dog visual percept (top row), duck visual percept (second row), piecemeal percept (third row), and switch rate (bottom row). The small cross symbols represent individual data points. Error bars represent 1 standard error of the mean. ns = not statistically significant. *p<0.05, **p<0.01.
Results
The duration of each dominant percept was recorded for each subject. After eliminating accidental keypresses (average error response rate = 1.10%, SD = 0.023), the trimmed data were analyzed using SPSS statistics software. Two-way repeated measures ANOVAs with association and sound as independent variables were conducted separately on the dog percept, the duck percept, the piecemeal percept, and the switch rate to test the main effects of association and sound and their interaction. Association included two levels (before and after), and sound consisted of three levels (no sound, barking dog sound, and crying duck sound).
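For readers who want to see how the interaction term of such a design is computed, the following Python sketch works through the sums of squares by hand. It is an illustration under stated assumptions, not the SPSS procedure used here: it takes one score per subject per cell, applies no sphericity correction, and does not compute p values. The function name and data layout are hypothetical.

```python
def rm_anova_interaction(data):
    """Interaction F for a two-way repeated measures design.

    `data[s][i][j]` is the score of subject s at level i of factor A
    (e.g. association) and level j of factor B (e.g. sound).
    Returns (F, df_effect, df_error). Assumes the error term is nonzero.
    """
    n, a, b = len(data), len(data[0]), len(data[0][0])

    def mean(values):
        return sum(values) / len(values)

    S, A, B = range(n), range(a), range(b)
    grand = mean([data[s][i][j] for s in S for i in A for j in B])
    m_s = [mean([data[s][i][j] for i in A for j in B]) for s in S]
    m_a = [mean([data[s][i][j] for s in S for j in B]) for i in A]
    m_b = [mean([data[s][i][j] for s in S for i in A]) for j in B]
    m_ab = [[mean([data[s][i][j] for s in S]) for j in B] for i in A]
    m_sa = [[mean(data[s][i]) for i in A] for s in S]
    m_sb = [[mean([data[s][i][j] for i in A]) for j in B] for s in S]

    # Sum of squares for the A x B interaction.
    ss_ab = n * sum(
        (m_ab[i][j] - m_a[i] - m_b[j] + grand) ** 2 for i in A for j in B
    )
    # Error term: the A x B x subject residual.
    ss_err = sum(
        (data[s][i][j] - m_sa[s][i] - m_sb[s][j] - m_ab[i][j]
         + m_s[s] + m_a[i] + m_b[j] - grand) ** 2
        for s in S for i in A for j in B
    )
    df_ab = (a - 1) * (b - 1)
    df_err = df_ab * (n - 1)
    f_value = (ss_ab / df_ab) / (ss_err / df_err)
    return f_value, df_ab, df_err
```

With 15 subjects, two association levels, and three sound levels, the degrees of freedom are (2 - 1)(3 - 1) = 2 and 2 × (15 - 1) = 28, which matches the F(2, 28) form of the interaction tests reported for this experiment.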
First, the analysis on complete dominance of dog and duck percepts revealed enhancement effects of semantically congruent auditory stimuli before reverse association in mean and total dominance duration (Figure 6). In the mean dominance duration of dog percept, there was a significant interaction effect between association and sound, F(2, 28) = 3.52, p = .043,
In the mean piecemeal duration, the results revealed a significant interaction effect between association and sound, F(2, 28) = 10.41, p = .0004,
This result raises several questions regarding how multisensory associations affect perceptual ambiguity and how recent novel associations interact with preexisting associations. We discuss these and related questions in the following section.
Discussions and Conclusion
Our findings show that a novel multisensory association formed through an explicit matching task can significantly affect not only complete dominance but also the piecemeal state during binocular rivalry. Specifically, in Experiment 1, we compared implicit sensory association and explicit cognitive association in order to investigate the mechanism of the cross-modal modulation effect. We showed that sensory congruence based on a shared low-level feature is not sufficient to guarantee the resolution of visual ambiguity, whereas a high-level cognitive association formed by the matching task could instigate the multisensory congruence effect and significantly reduce the piecemeal duration. Experiment 2 extended these findings and further clarified that the observed multisensory congruence effect on the piecemeal state derived primarily from the arbitrary mappings between dissimilar sensory information. Finally, in Experiment 3, we utilized explicit reverse associations to examine how recent novel audiovisual associations compare to long-standing associations in terms of piecemeal modulation and showed that reversing preexisting associations results in a significant increase of the piecemeal percept during binocular rivalry. To our knowledge, this is the first systematic study of the effects of laboratory-induced multisensory association on the piecemeal state.
Although several previous studies have investigated multisensory congruence effects on binocular rivalry, our study design differs in a few critical ways. First, instead of treating binocular rivalry as a series of unitary alternations, we used large visual stimuli and focused on the piecemeal state. This state not only differentiates binocular rivalry from other forms of multistable phenomena but also provides insight into how ambiguous sensory information is handled, as it is the bridge between exclusive selection and reversal during visual competition. This enabled us to see how different types of audiovisual associations could either mitigate or aggravate the visual confusion. Next, we separated the congruence paradigms into sensory congruence and cognitive congruence by designing appropriate stimulus pairs. Rather than using familiar audiovisual pairs and assuming the subjects’ ability to integrate distinct visual and auditory information, we presented novel stimuli and controlled the level of association to minimize confounds in the congruence effects. Last but not least, the procedure for inducing audiovisual association differed from other studies using multisensory learning. We used a more engaging explicit training with active judgment of match/mismatch for every given pair during the task, whereas previous studies used a go/no-go paradigm or passive statistical learning in which subjects either ignored the mismatched pairs or simply did not respond during the whole induction phase (Einhäuser et al., 2017; Piazza et al., 2018). We also presented audiovisual feedback at every iteration to reinforce the subjects’ decisions. With as few as 20 total repetitions per training and a duration of about 4 minutes, the association training time was considerably shorter than the 8 to 20 minutes reported previously.
It should be noted that, in our experimental design, the association could be induced only after testing without explicit association, and our participants had to go through six consecutive binocular rivalry sessions, which together may introduce ordering effects, practice effects, and adaptation. First, to minimize the effects of adaptation, we not only counterbalanced the visual stimuli but also instructed the subjects to rest between the sessions until they felt no eye strain and saw no afterimage. In addition to minimizing the adaptation carryover between sessions, we also tried to minimize adaptation within a session by limiting the length of a single session to 60 seconds. While keeping the single session relatively short, we tried to obtain the best quality data within this period by having the subjects go through a warm-up session before the actual recording, repeated several times if needed, to ensure the subjects’ performance while minimizing the practice effect in the recorded data. Last but not least, we argue that the ordering effect did not play a significant role in our current results, as several lines of previous evidence strongly support the idea that the temporal dynamics of binocular rivalry are mostly involuntary and robustly chaotic (Blake, 2001; Blake & Logothetis, 2002). However, despite all these efforts, we acknowledge that we cannot fully rule out the possibility of an ordering effect, and a larger dataset may provide more comprehensive results.
There have been several speculations on how multisensory integration occurs and subsequently resolves the given perceptual ambiguity to achieve an unequivocal interpretation (Deroy, Spence, & Noppeney, 2016; Faivre, Filevich, Solovey, Kühn, & Blanke, 2018; Hsiao, Chen, Spence, & Yeh, 2012; Hu & Knill, 2010; Lunghi, Lo Verde, & Alais, 2017; Salomon et al., 2016; Salomon, Kaliuzhna, Herbelin, & Blanke, 2015; Salomon, Lim, Herbelin, Hesselmann, & Blanke, 2013; Shams & Beierholm, 2010; Smith, Grabowecky, & Suzuki, 2007). One highly plausible explanation is that some intrinsic commonality between the stimuli allows one stimulus to bias the perception of the other. Such intrinsic features have included spatiotemporally synchronized motion (Conrad et al., 2010), temporal structure (Lunghi & Alais, 2015; Lunghi et al., 2014; van Ee et al., 2009), and semantic relatedness (Chen et al., 2011; Zhou, Jiang, He, & Chen, 2010). However, it is worth noting that for an intrinsic commonality such as spatiotemporal synchrony to take effect, near-perfect temporal congruence is required (Sekuler, Sekuler, & Lau, 1997; Shimojo & Shams, 2001). Furthermore, these studies were conducted with adult subjects, which makes it difficult to evaluate whether the intrinsic commonality arises from the sensory inputs themselves or from the subjects’ previous experience. Based on the findings of this study, we suggest that the cross-modal information that may serve to modulate the weighting of elements in visual competition emerges from associations learned in an observer’s daily life.
Another aspect to consider is the existence of multisensory neurons and their neurophysiological behavior. Developmental studies of multisensory integration in the cat have shown that neither superior colliculus nor anterior ectosylvian sulcus neurons have multisensory properties at birth, and that they are incapable of generating enhanced multisensory responses (Stein, Labos, & Kruger, 1973; Wallace, Carriere, Perrault, Vaughan, & Stein, 2006; Wallace & Stein, 1997). Rather, the integrative capacity develops over time with cumulative cross-modal sensory experience and recalibration of the senses (Gori, Del Viva, Sandini, & Burr, 2008; Stein, Wallace, Stanford, & Jiang, 2002; Wallace & Stein, 2000). A study with human infants has also shown a delay in their ability to integrate visual and auditory cues for spatial localization, suggesting that humans might also acquire visual–auditory multisensory integration only after substantial postnatal experience with cross-modal stimuli (Neil, Chee‐Ruiter, Scheier, Lewkowicz, & Shimojo, 2006). These studies suggest that although intrinsic commonality of cross-modal stimuli can be an effective factor in facilitating multisensory integration, experience with those inputs is also essential for enhanced multisensory responses. Given that binocular rivalry is a visual competition between two different images within separate intraocular pathways, the mixed percept presumably arises at a higher level of the neural pathway. Thus, the cross-modal influence on the mixed percept found in this study suggests that cross-modal disambiguation is likely a form of top-down mechanism based on learned associations rather than bottom-up modulation by multisensory neurons.
Several binocular rivalry studies using multisensory stimuli have argued for attentional effects in which a congruent auxiliary input biases the predominance of visual perception by either boosting the coinciding stimulus (Conrad et al., 2010; Ooi & He, 1999; Shams, Kamitani, & Shimojo, 2000; Talsma, Senkowski, Soto-Faraco, & Woldorff, 2010; van Ee, Van Dam, & Brouwer, 2005) or suppressing the dominance of the nonassociated stimulus (Brascamp et al., 2015). Although these studies do not address piecemeal dynamics directly, they do acknowledge that implicit or explicit attention can alter visual perception to some degree. The current finding that explicit audiovisual association modulates piecemeal duration can also be explained in terms of response bias. In Experiments 1 and 2, the novel visual and auditory stimuli could not interact at initial presentation, as there was no existing relation between them. However, once subjects were trained with the explicit matching task, the given rules between the visual and auditory cues could have caused response biases in favor of congruent percepts during the piecemeal state, leading to an overall decrease in the duration of the piecemeal percept. The same logic extends to the results of Experiment 3 with one caveat: learning the reverse associations did not eliminate the previously existing natural associations, as the subjects could still clearly recognize and correctly attribute the semantic contents of the stimuli. Thus, learning a reverse association and creating a response bias that contradicts prior associations could have interfered with the natural association rules and led to an increased duration of the piecemeal percept. This also relates to one of the questions we aimed to answer in Experiment 3: how multiple associations, although formed on vastly different timescales, interact to affect ambiguous perception during rivalry.
In a statistical sense, when the visual stimuli are ambiguous and the auditory stimulus is unambiguous, the optimal use of environmental cues would be to rely on the auditory cue and put more weight on the perceptually certain interpretation (Alais & Burr, 2004; Einhäuser et al., 2017). However, under reverse association, the previously unambiguous auditory cues are now also ambiguous, as the barking dog sound refers to the duck image and vice versa, in addition to the natural association. For example, when hearing the barking dog sound, the subjects are not simply directed to the duck image as given by the reverse association rules; they first recognize the semantic content of the auditory cue as dog and then redirect it to the duck image. This suggests two important points regarding how multisensory associations affect the piecemeal state during rivalry. First, although formed on vastly different timescales, a recent novel association does not simply overrule a preexisting association, if there is one; instead, the two work in tandem to exert a combined effect on the ambiguous state. Second, when multiple associations contradict each other (a single auditory cue associated with multiple visual stimuli), an ambiguous auditory cue cannot selectively boost or suppress either visual stimulus, which results in an increase in piecemeal duration. Further research is required to clarify how different levels of congruency (arbitrary, sensory, semantic), different timescales (short, long), and multiplicity of associations (singular, plural) affect perceptual ambiguity during binocular rivalry.
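The statistical argument at the start of the preceding paragraph can be made concrete with inverse-variance (maximum-likelihood) cue weighting, a standard formalization of optimal cue combination. The Python sketch below is illustrative only. Treating "ambiguity" as the standard deviation of a cue's estimate is an assumption, the function name is hypothetical, the numbers are made up, and no such model was fitted to the present data.

```python
def cue_weights(sigma_visual, sigma_auditory):
    """Inverse-variance weights for combining two cues.

    Each weight is proportional to the cue's reliability (1 / variance),
    and the two weights sum to 1. Sigma values are illustrative stand-ins
    for how ambiguous each cue is.
    """
    reliability_v = 1.0 / sigma_visual ** 2
    reliability_a = 1.0 / sigma_auditory ** 2
    total = reliability_v + reliability_a
    return reliability_v / total, reliability_a / total


# An unambiguous sound (small sigma) dominates an ambiguous image.
w_visual, w_auditory = cue_weights(sigma_visual=2.0, sigma_auditory=1.0)

# When the sound becomes as ambiguous as the image, neither cue dominates.
w_visual_eq, w_auditory_eq = cue_weights(sigma_visual=2.0, sigma_auditory=2.0)
```

In this reading, making the auditory cue ambiguous through a contradicting association raises its sigma, so its weight falls and it can no longer select one visual interpretation over the other.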
In conclusion, we demonstrate that explicit audiovisual associations with given rules substantially affect the piecemeal state during binocular rivalry. The congruency effect of a shared audiovisual representation, which can subsequently reduce the amount of visual ambiguity, originates primarily from repetitive active judgment of given pairs rather than from common sensory features between different modalities. Furthermore, when one piece of information is associated with multiple others, recent and preexisting associations work collectively to influence perceptual ambiguity during binocular rivalry. These results suggest that the temporal dynamics of perceptual ambiguity, represented by the piecemeal state in this study, are determined by a highly plastic process that involves evaluating arbitrary mappings and the multiplicity of associations.
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
