Abstract
Two studies were conducted on cross-modal matching between pitch and sound source localization on the vertical axis, and between pitch and size. In the first study, 100 Hz, 200 Hz, 600 Hz, and 800 Hz tones were emitted by a loudspeaker positioned 60 cm above or below the participant’s ear level. Using a speeded classification task, 30 participants had to indicate the sound source in 160 trials. Both reaction times and errors were analyzed. The results showed that in the congruent condition of high-pitched tones emitted from the upper loudspeaker, reaction times were significantly faster and the number of errors was significantly lower. Pitch was mapped on the vertical axis for sound localization. A main effect for sound source direction was also found. Tones coming from the upper loudspeaker were recognized faster and more accurately. Males were faster than females in identifying sound source direction. In the second study, 22 participants had to match 21 tones varying in pitch with 9 circles differing in visual angle on 42 trials. The results showed a clear inverse linear association between log-spaced tone pitch and circle diameter.
Although there is a strong tendency to analyze and investigate perceptual processes separately for each modality, in everyday life we often integrate information across our senses, combining disparate signals to gain a seamless conscious percept of events in the world. Multisensory integration has received extensive attention in recent years (for a review, see Spence, 2011). Spatial and temporal coincidences are the usual cues for integration, but the brain may also rely on a feature correspondence between the different sensory inputs (Welch & Warren, 1986).
Pitch, the auditory feature considered in the two studies of our article, has an important role in many cross-modal correspondences (Küssner, Tidhar, Prior, & Leech-Wilkinson, 2014; Marks, Ben-Artzi, & Lakatos, 2003). Stumpf (1883/2013) was the first to observe that several languages apply the labels “high” and “low” to the pitch of tones. The first experiment that confirmed Stumpf’s observation was conducted by Pratt (1930), while Pedley, Harper, and College (1959) showed that the categorization in “high” and “low” pitch was not absolute, but relative to the pitch range of available tones. Bernstein and Edelstein (1971) found that reaction times were more rapid when both frequency and vertical position of visual stimuli were similar (both high or both low), rather than when they were opposite.
Similarly, in Melara and O’Brien (1987) participants were presented simultaneously with a sequence of visual stimuli whose elevation (e.g., higher vs. lower) varied with a low- or high-frequency tone (174.6 vs. 1046.5 Hz, respectively). Response latencies to classify the elevation of the visual stimuli were longer when the irrelevant tones were synesthetically incongruent with the target (e.g., a higher light presented with a lower frequency sound), than when they were synesthetically congruent (e.g., a higher visual stimulus presented with a higher frequency sound). These results were also confirmed by Patching and Quinlan (2002), and Wagner, Winner, Cicchetti, and Gardner (1981) who found that young infants “matched” visual arrows pointing up or down with tones sweeping up or down in frequency. Moreover, Evans and Treisman (2010) reported a bidirectional correspondence between pitch and size, and pitch and spatial frequency.
Lidji, Kolinsky, Lochy, and Morais (2007), and Rusconi, Kwan, Giordano, Umiltà, and Butterworth (2006) showed a stimulus-response compatibility when the response buttons were positioned in an upper or lower location: responding to a high (low) tone was faster with a button at an upper (lower) location (SMARC effect). The effect was also present, although with reduced effect size, on the horizontal axis, with high-pitched tones mapped on the right and low-pitched tones mapped on the left, mirroring the spatial representation of numbers.
The horizontal mapping of pitch was first demonstrated by Mudd (1963), and then confirmed also by Hartmann (2017) in a line bisection task, while Wolter, Dudschig, and Kaup (2016) confirmed this horizontal association only for pianists. Pitteri, Marchetti, Priftis, and Grassi (2017) showed that the SMARC effect depended on both pitch and tone brightness.
From a developmental perspective, Nava, Grassi, and Turati (2016) tested the ability of 4–5-year-old children to match auditory pitch to the spatial motion of visual objects and to the spatial motion of touch. When the children had to report whether two presented stimuli fitted well together, the authors found a significant but rather weak sensitivity to the congruency in both the audio-visual and audio-tactile conditions. A cross-modal correspondence between audition and verticality was also reported by Chiou and Rich (2012, 2014), who demonstrated that high and low tones induced attention shifts to upper or lower locations, depending on pitch height. Furthermore, they found that the pitch-induced cuing effect was susceptible to contextual manipulations and volitional control, suggesting that the interaction between pitch and location originates from an attentional level rather than from response mapping alone.
A test on the influence of musical pitch on haptic height perception was performed by Geronazzo, Avanzini, and Grassi (2015). Participants had to estimate the height of a virtual step by haptic exploration while listening to tones differing in pitch. The results showed no effect of musical pitch on height estimation with haptic feedback, showing that in this specific case the pitch-verticality mapping was not confirmed.
The use of continuously changing sounds instead of discrete ones tends to induce an illusory motion perception. As reported by Kitagawa and Ichihara (2002), a sound of fixed intensity tends to be judged as decreasing in intensity after adaptation to looming visual stimuli, or as increasing in intensity after adaptation to receding visual stimuli. Hidaka et al. (2009) have shown that sounds alternating in space can induce strong illusory visual motion perception, and Teramoto et al. (2010) reported that sounds that alternate in vertical space (up/down) also induce vertical illusory motion. Sounds that alternate in higher/lower pitch fail to induce the illusion (Hidaka, Teramoto, Keetels, & Vroomen, 2013).
Vertical elevation mapping also applies to sound loudness (Eitan, Schupak, & Marks, 2008), where louder sounds are mapped higher than softer sounds. Loud sounds can improve the perception of bright lights and large objects, whereas soft sounds facilitate the perception of dim lights and small objects (Marks, 1987; L. Smith & Sera, 1992). Auditory pitch is also matched to other visual attributes. One is lightness (i.e., the perceptual dimension of surfaces varying from black to white), with participants showing faster responses to a higher pitch presented with a lighter surface, and a lower pitch matched to a darker surface (Martino & Marks, 1999; Marks, 1987; Melara, 1989). Other studies have investigated the association between pitch and brightness (i.e., the perceptual dimension of lights varying from dim to bright). In Marks, Hammeal, and Bornstein (1987), adult subjects were virtually unanimous in matching the brighter of two lights with both the higher-pitched and the louder of two sounds. Furthermore, the vast majority of 4-year-old children agreed with adults, as assessed by their cross-modal matches, regarding the congruence of both loudness and brightness (75% of 4-year-olds), and pitch and brightness (90% of 4-year-olds). These results show that the congruence between loudness and brightness, and between pitch and brightness, characterizes audiovisual perception in young children as well as in adults.
Another visual attribute associated with pitch is size. In Walker and Smith (1984), participants were required to press one of two keys as quickly as possible depending on which of four possible words appeared in the center of the screen. A 50 Hz or a 5500 Hz tone accompanied each test word, and participants responded on two keys that differed in size. Participants responded more slowly when either the pitch of the incidental sound or the size of the key on which they responded was incongruent with the multimodal features represented by the test word. Finally, pitch can also generate cross-modal correspondences with visual shape, where high-pitched tones are matched to angular rather than rounded shapes (Marks, 1987).
The origin of the pitch effect on visual-spatial tasks may be due to cross-modal mappings affecting response-selection processes. According to the polarity-correspondence principle (Proctor & Cho, 2006), stimuli and response alternatives are coded as positive and negative polarity along physical or conceptual dimensions. Thus, a conflict in polarity codes means that stimuli and responses activate opposite ends of a polarity spectrum, which slows down response-selection processes. Another hypothesis is that cross-modal mappings may be based on shifts in decision criteria rather than perceptual phenomena (see Keetels & Vroomen, 2011, for pitch-size mapping; and Marks et al., 2003, for pitch-brightness mapping). A third hypothesis is that pitch biases spatial attention towards an upper or lower location in space, suggesting that cross-modal correspondence occurs at an earlier processing stage than response selection.
In this paper we conducted two studies that matched pitch to sound source localization on the vertical axis, and to visual size.
Study 1: Pitch and sound source elevation
Previous studies on pitch and verticality have mainly manipulated the spatial dimension visually (Bernstein & Edelstein, 1971; Chiou & Rich, 2012; Lidji et al., 2007; Melara & O’Brien, 1987; Patching & Quinlan, 2002; Rusconi et al., 2006). In this study, by contrast, verticality was manipulated as the elevation of a sound source. The aim was to test whether the vertical mapping of pitch could also apply to sound source elevation on the median plane. In line with previous literature we adopted the “speeded classification” paradigm, a task that has long been used to study a variety of issues in selective attention research (see Marks, 2004, for a review). In a typical study, participants have to discriminate one characteristic of a stimulus (e.g., its spatial location) as rapidly as possible while trying to ignore any “irrelevant” characteristic of the stimulus (e.g., its pitch), which may also vary on a trial-by-trial basis.
In our study participants had to discriminate between auditory stimuli that were emitted from a loudspeaker positioned up or down in comparison to their ear level, ignoring tone pitch. The interaction between tone pitch and auditory source direction was tested with both reaction times and errors. We hypothesized faster reaction times and lower error rate in congruent conditions (i.e., high pitch – high loudspeaker and low pitch – low loudspeaker), in comparison to incongruent conditions (i.e., high pitch – low loudspeaker and low pitch – high loudspeaker).
Method
Participants
Thirty participants without self-reported auditory deficits were enrolled in the study. They were 15 males (mean age 32.13 ±4.76) and 15 females (mean age 34.93 ±11.46). Age range was 25–60 years. A univariate ANOVA between sex and age showed no significant difference between the two groups (p = .76). Every aspect of this study was carried out with the approval of the Ethical Committee of the University of Bologna, and informed consent was required prior to participation.
A post-hoc power analysis performed with G*Power (Faul, Erdfelder, Lang, & Buchner, 2007), considering an effect size of .25, 30 participants, and 10 repeated measures for each condition resulted in a power index of .99.
Material and procedure
The study was conducted with E-prime 2.0 software, and consisted of 160 trials (80 congruent and 80 incongruent), plus 8 practice trials. Auditory stimuli were sine-wave tones differing in pitch at 4 levels: 100 Hz, 200 Hz, 600 Hz, and 800 Hz. The first two tones were considered low-pitched, while 600 Hz and 800 Hz were considered high-pitched. Four congruent conditions (100 Hz – low loudspeaker, 200 Hz – low loudspeaker, 600 Hz – high loudspeaker, 800 Hz – high loudspeaker) were compared with four incongruent conditions (100 Hz – high loudspeaker, 200 Hz – high loudspeaker, 600 Hz – low loudspeaker, 800 Hz – low loudspeaker; Figure 1). Loudness was calibrated with a Delta Ohm HD2010 UC/A class 1 sound level meter with A-weighting. Stimuli were presented at 70 dB(A). For each auditory stimulus 20 trials were congruent (high-pitched – high loudspeaker; low-pitched – low loudspeaker), and 20 trials were incongruent (high-pitched – low loudspeaker; low-pitched – high loudspeaker). Trial order was randomized by the E-prime 2.0 software. Participants were asked to attend to the sound source direction, pressing the “6” key on a QWERTY keyboard as fast as possible if the auditory stimulus was perceived to come from the high loudspeaker, or the “b” key if the auditory stimulus was perceived to come from the low loudspeaker.
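The factorial structure of the trial list can be sketched in a few lines. This is only an illustration of the design described above, not the E-prime script used in the study; the function names, the dictionary layout, and the seed are assumptions introduced here.

```python
import itertools
import random

# Design values taken from the text; names and data layout are illustrative.
TONES_HZ = [100, 200, 600, 800]
LOW_PITCHED = {100, 200}      # 600 Hz and 800 Hz count as high-pitched
SPEAKERS = ["high", "low"]
REPETITIONS = 20              # trials per tone-by-loudspeaker cell


def is_congruent(tone_hz, speaker):
    """True for high pitch with the high loudspeaker, or low pitch with the low one."""
    pitch_class = "low" if tone_hz in LOW_PITCHED else "high"
    return pitch_class == speaker


def build_trials(seed=None):
    """Return the shuffled list of experimental trials (practice trials excluded)."""
    trials = [
        {"tone_hz": tone, "speaker": spk, "congruent": is_congruent(tone, spk)}
        for tone, spk in itertools.product(TONES_HZ, SPEAKERS)
        for _ in range(REPETITIONS)
    ]
    random.Random(seed).shuffle(trials)
    return trials


trials = build_trials(seed=1)
n_congruent = sum(t["congruent"] for t in trials)
print(len(trials), n_congruent, len(trials) - n_congruent)
```

Crossing 4 tones with 2 loudspeaker positions and 20 repetitions per cell gives the 160 trials reported, split evenly into 80 congruent and 80 incongruent trials.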

Figure 1. Experimental setting for Study 1, showing congruent and incongruent conditions for tones presented in the upper and lower loudspeaker.
The high loudspeaker was positioned 60 cm higher than the participant’s ear level, while the low loudspeaker was positioned 60 cm lower than the participant’s ear level. Both loudspeakers were aligned with the vertical midline of the participant’s head and were positioned 50 cm in front of the participant’s ears (Figure 1). Sound sources were therefore at an elevation angle of ±50.2° relative to ear level.
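The 50.2° value follows from the loudspeaker geometry given above. A minimal check, assuming the 60 cm vertical offset and the 50 cm frontal distance form the two legs of a right triangle with the ear at the vertex (the function and constant names are illustrative):

```python
import math

VERTICAL_OFFSET_CM = 60.0      # loudspeaker height above or below ear level
FRONTAL_DISTANCE_CM = 50.0     # distance in front of the participant's ears


def elevation_deg(vertical_cm, frontal_cm):
    """Elevation angle of the source relative to ear level, in degrees."""
    return math.degrees(math.atan2(vertical_cm, frontal_cm))


def source_distance_cm(vertical_cm, frontal_cm):
    """Straight-line distance from the ear to the loudspeaker."""
    return math.hypot(vertical_cm, frontal_cm)


upper = elevation_deg(VERTICAL_OFFSET_CM, FRONTAL_DISTANCE_CM)
lower = elevation_deg(-VERTICAL_OFFSET_CM, FRONTAL_DISTANCE_CM)
print(round(upper, 1), round(lower, 1))   # 50.2 -50.2
print(round(source_distance_cm(VERTICAL_OFFSET_CM, FRONTAL_DISTANCE_CM), 1))  # 78.1
```

Under the same assumption, each loudspeaker was about 78 cm from the participant’s ears.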
Eight trials (4 congruent, 4 incongruent) were used for training, to familiarize the participant with the procedure. The intertrial interval ranged from 1.5 s to 3 s. During the intertrial interval a fixation cross was presented at the center of the computer screen to keep the participant’s attention directed on the task. Each auditory stimulus had a duration of 1 s. At the onset of the auditory stimulus, a “6” was presented on the top of the screen, and a “b” was presented at the bottom of the screen to show possible responses. The participant was allowed a maximum of 3 s after stimulus onset to answer. After the response, the participant was given feedback to keep motivation high for accuracy, showing if the response was correct or wrong and the reaction time. For half of the participants the index finger of the left hand was positioned on the “6” key and the index finger of the right hand was positioned on the “b” key; vice versa for the remaining half of the participants.
Data analysis
Reaction times for valid trials and errors were analyzed separately. Reaction times exceeding 2 s were excluded (0.6%). When analyzing reaction times, only valid answers were included. The proportion of errors was on average .25 ±.08. A first analysis was performed comparing congruent and incongruent trials with an ANOVA including pitch (4 levels: 100 Hz, 200 Hz, 600 Hz, 800 Hz), and tone direction (2 levels: high loudspeaker, low loudspeaker) as within-subject factors, and sex as between-subject factor. The dependent variable was mean reaction time for valid trials. Pairwise comparisons were adjusted with Bonferroni correction. The difference in errors between congruent and incongruent trials was tested with a chi-square test. The error rate for each participant as a function of pitch, loudspeaker position, and sex was compared with a mixed ANOVA. Greenhouse-Geisser correction was applied when analyzing within-subject factors, and effect size was computed as partial eta-squared (ηp2).
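The preprocessing step described above (keep correct responses, drop reaction times over 2 s, then average within each participant-by-pitch-by-direction cell) might look like the sketch below. The record fields (`participant`, `tone_hz`, `speaker`, `rt_ms`, `correct`) and the demo values are invented for illustration and are not the study’s data or file format.

```python
from collections import defaultdict
from statistics import mean

RT_CUTOFF_MS = 2000   # reaction times exceeding 2 s are excluded


def cell_means(trials):
    """Mean RT of valid trials for each (participant, tone_hz, speaker) cell."""
    cells = defaultdict(list)
    for t in trials:
        # A trial is valid when the response is correct and within the cutoff.
        if t["correct"] and t["rt_ms"] <= RT_CUTOFF_MS:
            cells[(t["participant"], t["tone_hz"], t["speaker"])].append(t["rt_ms"])
    return {cell: mean(rts) for cell, rts in cells.items()}


# Invented demo records (hypothetical, not data from the study).
demo = [
    {"participant": 1, "tone_hz": 600, "speaker": "high", "rt_ms": 650, "correct": True},
    {"participant": 1, "tone_hz": 600, "speaker": "high", "rt_ms": 700, "correct": True},
    {"participant": 1, "tone_hz": 600, "speaker": "high", "rt_ms": 2400, "correct": True},
    {"participant": 1, "tone_hz": 600, "speaker": "high", "rt_ms": 620, "correct": False},
]
means = cell_means(demo)
print(means[(1, 600, "high")])
```

The resulting cell means are the values that would enter the mixed ANOVA as the dependent variable.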
Results
Reaction times
There was a significant main effect of congruency: F(1, 28) = 6.95, p = .01, ηp2 = .20. Mean reaction time was 687 ±152 ms for congruent trials and 711 ±170 ms for incongruent trials. The main effect of sex was also significant: F(1, 28) = 5.69, p = .02, ηp2 = .17. Mean reaction time was 764 ±162 ms for females, and 635 ±135 ms for males.
A more specific analysis was performed with an ANOVA that included tone pitch, tone direction, and sex. The main effect of pitch was significant: F(3, 78) = 11.07, p < .001, ηp2 = .30. Mean reaction times for the 4 tones were: 722 ±204 ms for 100 Hz, 747 ±226 ms for 200 Hz, 696 ±201 ms for 600 Hz, and 724 ±219 ms for 800 Hz. The main effect of tone direction was significant: F(1, 26) = 10.87, p < .001, ηp2 = .29. Mean reaction times were 698 ±214 ms for tones emitted by the high loudspeaker, and 758 ±236 ms for tones emitted by the low loudspeaker. The interaction between pitch and direction was significant: F(3, 78) = 4.67, p = .007, ηp2 = .15. Pairwise comparisons showed that tone direction was significant only for 600 Hz (p = .001), and 800 Hz (p < .001) (Figure 2). The main effect of sex was also significant: F(1, 26) = 6.43, p = .02, ηp2 = .20, confirming the sex difference found in the previous analysis of congruent and incongruent trials. None of the interactions with sex or the other factors were significant.
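Partial eta-squared can be recovered from an F ratio and its degrees of freedom with the standard identity ηp2 = (F × df1) / (F × df1 + df2). The sketch below applies it to the reaction-time effects reported above; the function name and the dictionary of effects are illustrative.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta-squared from an F ratio and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# (F, df_effect, df_error) as reported for the reaction-time analyses.
reported = {
    "congruency": (6.95, 1, 28),
    "pitch": (11.07, 3, 78),
    "direction": (10.87, 1, 26),
    "pitch x direction": (4.67, 3, 78),
    "sex": (6.43, 1, 26),
}

effect_sizes = {
    name: round(partial_eta_squared(*stats), 2) for name, stats in reported.items()
}
for name, eta in effect_sizes.items():
    print(f"{name}: {eta:.2f}")
```

The recomputed values (.20, .30, .29, .15, and .20) agree with the effect sizes given in the text.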

Figure 2. Boxplots showing reaction time distributions for the interaction between tone pitch and tone direction (** p < .01). The central lines in the boxes represent the median reaction time, the boxes indicate the first and third quartiles, and the whiskers indicate the range of the data.
Errors
Of 1244 total errors, 448 (36.01%) were in congruent trials, whereas 796 (63.99%) were in incongruent trials. The difference was significant: χ2 = 97.35, p < .001.
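The chi-square value can be reproduced as a goodness-of-fit test against an even split of errors between congruent and incongruent trials, which is reasonable given the equal number of trials per condition. This is a sketch under that assumption; the function name is illustrative, and the p value uses the closed form for one degree of freedom.

```python
import math


def chi_square_equal_split(observed):
    """Goodness-of-fit chi-square against equal expected counts."""
    expected = sum(observed) / len(observed)
    return sum((count - expected) ** 2 / expected for count in observed)


errors = {"congruent": 448, "incongruent": 796}
total = sum(errors.values())

chi2 = chi_square_equal_split(list(errors.values()))
# With 1 degree of freedom the upper-tail probability is erfc(sqrt(chi2 / 2)).
p_value = math.erfc(math.sqrt(chi2 / 2))

print(total, round(100 * errors["congruent"] / total, 2))   # 1244 36.01
print(round(chi2, 2), p_value < 0.001)                      # 97.35 True
```

The total of 1244 errors over 30 participants × 160 trials also corresponds to an overall error proportion of about .26, in line with the average reported in the data analysis section.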
The ANOVA that included tone pitch, tone direction, and sex showed significant main effects for tone pitch: F(3, 78) = 5.87, p = .004, ηp2 = .18, tone direction: F(1, 26) = 77.74, p < .001, ηp2 = .75, and for the interaction between tone pitch and direction: F(3, 78) = 29.39, p < .001, ηp2 = .53. The main effect of sex was not significant (p = .16).
Mean error numbers were 4.5 ±2.66 for 100 Hz, 6.01 ±3.27 for 200 Hz, 5.66 ±2.65 for 600 Hz, and 5.21 ±3.01 for 800 Hz. Mean error number for tones emitted from the upper loudspeaker was 3.44 ±2.56, and 7.25 ±3.24 for those emitted from the lower loudspeaker. Boxplots for the interaction between pitch and direction are shown in Figure 3. Pairwise comparisons showed that sound direction was significant for 100 Hz (p < .001), 600 Hz (p < .001), and 800 Hz (p < .001), as highlighted in Figure 3.

Figure 3. Boxplots showing the distributions of errors in the interaction between tone pitch and tone direction (*** p < .001).
Study 2: Pitch and size
Vertical elevation is only one of the visual-spatial features associated with pitch. Another important spatial dimension connected to pitch perception is visual size. In this paper, to better map pitch on a two-dimensional space, we complemented the pitch-elevation results with a second study focused on pitch-size matching.
Previous studies that have investigated the mapping of pitch on size showed a connection between high pitch and small size and between low pitch and big size in simple categorical designs, using dichotomic levels for both pitch and size.
In a perceptual matching paradigm, Marks, Hammeal, and Bornstein (1987) showed that children between 9- and 13-years-old matched high-pitched tones with small sizes and low-pitched tones with large sizes. Three-year-olds reliably matched high-pitched tones with small bouncing balls and low-pitched tones with large ones (Mondloch & Maurer, 2004). Fernández-Prieto, Navarra, and Pons (2015) showed a cross-modal association between pitch and size in 6-month-old infants. In Gallace and Spence (2006) adults judged the size of a variable visual stimulus relative to a standard more rapidly when the irrelevant sound frequency presented simultaneously was congruent with the size of the visual stimulus (e.g., high-pitched tone with a small visual stimulus).
Similarly, Parise and Spence (2009) investigated the effects of congruent or incongruent synesthetic pairs between two tone pitches and two circle sizes. Participants had to report the relative order of presentation of the two stimuli that appeared with different asynchrony intervals. In the case of matched pairs, participants exhibited a higher differential threshold in detecting the temporal asynchrony between visual and auditory stimuli.
The aim of our second study was to assess the association, determining its function, between a broad range of pitches and nine circles varying in diameter. We hypothesized an inverse linear function between pitch (log-spaced) and circle size.
Method
Participants
Twenty-two university students without self-reported auditory deficits participated in the study. The sample was composed of 11 males (mean age: 33.09 ± 13.53) and 11 females (mean age: 30.63 ± 12.67). Age range was 24–60 years. A univariate ANOVA on sex and age showed no significant difference between the two groups (p = .67). Every aspect of this study was carried out with the approval of the Ethical Committee of the University of Bologna, and an informed consent was signed by each participant.
A post-hoc power analysis performed with G*Power (Faul et al., 2007), considering an effect size of .5, 22 participants, and two repeated measures for each condition resulted in a power index of .99.
Materials and procedure
Auditory stimuli were 21 sine-wave tones, lasting 1 s, whose pitches were logarithmically spaced. The frequencies were 100, 125, 158, 199, 251, 316, 398, 501, 630, 794, 1000, 1258, 1584, 1995, 2511, 3162, 3981, 5011, 6309, 7943, and 10000 Hz. We did not include tones with a frequency greater than 10000 Hz since their detectability is strongly affected by age (Robinson & Sutton, 1979). Each of the 21 tones was presented twice for a total of 42 trials. Trial order was randomized by the E-prime 2.0 software. The stimuli were presented through Sennheiser HD 437 headphones (frequency band 14 Hz–26 kHz). Loudness for the 21 stimuli was calibrated and equalized with a Delta Ohm HD2010 UC/A class 1 sound level meter with A-weighting, and the stimuli were presented at 70 dB(A). The participant was asked to match the pitch of the sound with the size of 9 circles varying in diameter and presented on the screen (Figure 4). The visual angles subtended by the 9 circles were: 0.41° (1), 0.66° (2), 1.33° (3), 2.45° (4), 4.69° (5), 6.92° (6), 8.73° (7), 11.51° (8), 14.32° (9), with a magnitude ratio of 34.9 between the smallest and biggest circles. Each circle was labeled with a number ranging from 1 to 9. The participant had to answer by pressing the corresponding number key. The answer was not time limited. Responses were recorded automatically by the E-prime software.
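The listed frequencies are consistent with ten equal logarithmic steps per decade starting at 100 Hz, that is 100 × 10^(k/10) for k = 0 to 20, with the decimal part dropped. The sketch below reproduces the list under that assumption; the generating rule is inferred from the values themselves and is not stated in the text, and the function name is illustrative.

```python
import math

REPORTED_HZ = [100, 125, 158, 199, 251, 316, 398, 501, 630, 794, 1000,
               1258, 1584, 1995, 2511, 3162, 3981, 5011, 6309, 7943, 10000]
VISUAL_ANGLES_DEG = [0.41, 0.66, 1.33, 2.45, 4.69, 6.92, 8.73, 11.51, 14.32]


def log_spaced_tones(f_min=100.0, steps_per_decade=10, n_tones=21):
    """Frequencies f_min * 10**(k / steps_per_decade), truncated to whole Hz."""
    tones = []
    for k in range(n_tones):
        exact = f_min * 10 ** (k / steps_per_decade)
        # Round away float noise before truncating, so 1000.0 stays 1000.
        tones.append(math.floor(round(exact, 6)))
    return tones


tones = log_spaced_tones()
size_ratio = VISUAL_ANGLES_DEG[-1] / VISUAL_ANGLES_DEG[0]
print(tones == REPORTED_HZ)    # True
print(round(size_ratio, 1))    # 34.9
```

Each step multiplies frequency by 10^0.1, roughly 1.26, so adjacent tones are equally spaced on a logarithmic axis, and the two decades from 100 Hz to 10000 Hz yield the 21 tones used.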

Figure 4. Circles varying in diameter that had to be associated with 21 tones varying in pitch in Study 2.
Results
The association between auditory stimulus pitch (Hz) and circle diameter expressed as visual angle (°) was tested with a linear regression that was significant: F(1, 922) = 335.57, p < .001; Adjusted R² = 0.27, β = –.51.
To test the influence of sex we performed a repeated-measures ANOVA entering pitch (21 levels) as within-subjects factor and sex as between-subjects factor. Consistent with the previous regression, the ANOVA showed a significant effect for pitch, F(20, 400) = 36.57, p < .001, ηp2 = .65, while the interaction between pitch and sex was not significant (p = .62).
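The degrees of freedom and effect sizes reported for the regression and the ANOVA can be cross-checked from the design (22 participants, 21 tones, 42 trials each). The sketch below uses standard identities: for a single-predictor regression, R² = F / (F + df_residual), and for the ANOVA, ηp2 = (F × df1) / (F × df1 + df2). Variable names are illustrative.

```python
N_PARTICIPANTS = 22
N_TONES = 21
TRIALS_PER_PARTICIPANT = 42

# Linear regression with one predictor: residual df = observations - 2.
n_obs = N_PARTICIPANTS * TRIALS_PER_PARTICIPANT
df_residual = n_obs - 2
f_regression = 335.57
r_squared = f_regression / (f_regression + df_residual)
adjusted_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / df_residual

# Mixed ANOVA with sex as a two-level between-subjects factor.
df_pitch = N_TONES - 1
df_error = df_pitch * (N_PARTICIPANTS - 2)
f_pitch = 36.57
eta_p2 = (f_pitch * df_pitch) / (f_pitch * df_pitch + df_error)

print(df_residual, round(adjusted_r_squared, 2))   # 922 0.27
print(df_pitch, df_error, round(eta_p2, 2))        # 20 400 0.65
```

Both sets of degrees of freedom match a sample of 22 participants, and the recomputed adjusted R² and ηp2 agree with the reported values.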
Boxplots showing the circle size distributions for the 21 pitch classes are reported in Figure 5.

Figure 5. Boxplots showing response distributions for the circle size assigned to the 21 tone pitch categories in Study 2.
General discussion
Our two studies reported evidence about audio-visual cross-modal associations, matching pitch with sound source localization on the vertical axis and visual size. In the first study, using a speeded classification task, we obtained a facilitating effect in the congruent condition of high-pitched tones coming from the upper loudspeaker, but not in the symmetrical congruent condition of low-pitched tones coming from the lower loudspeaker. Error analysis confirmed that the vertical match for high-pitched tones resulted in a better accuracy. This asymmetry might be explained by the finding that low frequencies are harder to localize in space, both for azimuth and elevation (Middlebrooks & Green, 1991; Moore, 2012; Schnupp, Nelken, & King, 2011; Stevens & Newman, 1936).
All previous studies on verticality and pitch have manipulated verticality on a visual dimension presenting stimuli on the higher or lower part of the screen (Bernstein & Edelstein, 1971; Chiou & Rich, 2012; Lidji et al., 2007; Melara & O’Brien, 1987; Patching & Quinlan, 2002; Rusconi et al., 2006). Our study is the first that has mapped pitch on the sound source elevation space. The results can probably be extended to frequencies different from those used in our study (i.e., 100, 200, 600, 800 Hz), considering that Pedley et al. (1959) have demonstrated that the vertical scaling of “high” and “low” pitch is not absolute, but relative to the range of frequencies that are presented to the participant.
In the first experiment, the evaluation of sound source direction, independently from tone pitch, led to an error rate of .25, which is moderately high. This result could be explained by the greater difficulty in judging sound source localization on the vertical axis compared to the azimuthal one (Fay & Popper, 2005). Sound source localization on the horizontal axis relies on interaural time difference (ITD), and interaural intensity difference (IID) parameters. Intensity difference caused by the head shadowing effect is the major index for azimuthal localization in high frequencies (over 2 kHz). Below 1 kHz, interaural intensity difference is much lower and sound localization is dominated by comparison of timing differences at each ear (ITD). On the azimuthal plane, interaural level differences below 200 Hz are very small and a precise evaluation of the input direction is nearly impossible on the basis of level difference alone. For frequencies below 80 Hz it is impossible to use either time or level differences to determine the lateral position of a sound source, since the phase difference between the ears becomes too small for a directional evaluation (R. Smith & Price, 2014). Auditory acuity displays a non-linear variation with azimuthal angle, being around 1° at the midline and increasing to around 10° laterally (Mills, 1958).
Auditory stimuli presented on the median plane produce no interaural differences, and therefore vertical elevation on the median plane cannot rely on IID and ITD cues. The pinna and the head produce a spectrum change of the sound source that reaches the tympanic membrane (Middlebrooks & Green, 1991). This spectrum change is the most critical factor in sound elevation perception. The head-related transfer function (HRTF) defines the spectrum changes that produce spatial cues that account for vertical localization. Pinna shape, size and convolution lengths determine the constraints of the frequency bandwidth on which spectra changes can operate. For this reason, and as confirmed by our results, accurate vertical localization is observed only when an auditory stimulus contains energy at high frequencies. The results of the second study, in which low tones were associated with large sizes, could also help explain why low-pitched tones were not facilitated when emitted by a sound source positioned under ear level. The association with large size implies a lower spatial resolution. For high-pitched tones, by contrast, the association with small sizes translates into better spatial distinctiveness and acuity.
Reaction time to sound source direction on the vertical median plane was faster for males than females. Previous research by Lewald and Hausmann (2013), and Zündorf, Karnath, and Lewald (2011) demonstrated that men outperform women in spatial analysis of complex auditory scenes. Their focus was specifically on the ability to extract auditory information of a specific sound source when multiple competing sound sources were present. In that context males exhibited better performance than females for localizing target sounds in a multi-source sound environment. Lewald (2004) investigated monaural sound localization in the vertical plane using a simple pointing task, finding a male advantage for precision. Our study shows that this gender difference also holds in the case of binaural presentation, and that this was not due to a difference in accuracy since error rate between males and females was similar. This gender difference in spatial sound source localization could be counted among the well-documented differences in specific spatial abilities between males and females (Reilly & Neumann, 2013).
An effect that was superimposed on the interaction between pitch and sound source elevation, in both reaction times and error rate, was the facilitating effect of tones coming from the high loudspeaker in comparison to tones coming from the low loudspeaker, regardless of their pitch. This result is new to the literature and refers to elevation angles of +50.2° and −50.2°. As the sound sources were only 50 cm away from the participant’s frontal plane, it is possible that the tones emitted by the low loudspeaker were more shadowed by the participant’s body. This result needs to be further investigated by varying elevation over a wider range.
In our second study the results showed a strong association between pitch and visual size, consistent with previous literature (Gallace & Spence, 2006; Marks et al., 1987; Mondloch & Maurer, 2004). Specifically, the matching was demonstrated over an extended bandwidth ranging from 100 Hz to 10000 Hz, with a linear inverse relation to circle diameter.
The inverse relation between frequency and size is shared by different sensory modalities. Békésy (1957, 1959) noted that in auditory pure tones, vibrations, and alternating electric current applied to the skin, as sinusoidal frequency increased, the perceived “size” of the sensory percept decreased. Walker and Smith (1984) demonstrated that high-pitched sounds and small objects share cross-modal qualities (e.g., being sharp, thin, light, fast, and little), as do low-pitched sounds and large objects. This association, at least in the auditory domain, could be explained by pitch-size similarity in experience: larger objects tend to resonate at lower frequencies than smaller objects (Grassi, 2005; Grassi, Pastore, & Lemaitre, 2013). The inverse physical correlation between resonance frequency and size also appears in the human voice, with children having higher-pitched voices than adults, and females characterized by higher-pitched voices than males (Pisanski, Jones, et al., 2016; Pisanski, Mora, et al., 2016).
People are able to discriminate between objects of different sizes by hearing the sounds that the objects make when they are dropped onto a surface (see, e.g., Carello, Anderson, & Kunkler-Peck, 1998; Spence & Zampini, 2006). In a comparative perspective, information regarding vocalization pitch is used by certain species of animals to estimate the size of their competitors (see, e.g., Charlton et al., 2011; Harris, Fitch, Goldstein, & Fashing, 2006; Ohala, 1983, 1984). Also, 30- to 36-month-old children exhibit an association between the sizes of balls and the pitches of sounds (i.e., higher-pitched sounds were associated with smaller balls, and lower-pitched sounds with larger balls; Mondloch & Maurer, 2004).
A Bayesian explanation of cross-modal correspondences as “internalization” of statistical co-occurrences that can be found in the physical domain has been developed by Parise, Knorre, and Ernst (2014). Recording a large sample of environmental sounds, these authors showed a statistically consistent mapping between the frequency of sounds and the average elevation of their sources in the external space. High-frequency sounds, particularly in the middle range of the spectrum, between 1 and 6 kHz, have a tendency to originate from elevated sources in natural auditory scenes.
However, for frequencies below 1 kHz, Parise et al. (2014) did not find a frequency-elevation mapping in environmental sound recordings. Furthermore, Pedley et al. (1959) showed that the classification of pitch as “low” and “high” is not strictly absolute, but relative to the frequency range to which the listener is exposed. In our first study, low-pitched stimuli were 100 and 200 Hz tones, while high-pitched stimuli were 600 and 800 Hz tones, all below 1 kHz, and hence in a range that, according to Parise et al. (2014), does not mirror any frequency-elevation mapping in the physical domain. Nevertheless, we found a significant effect of pitch on the evaluation of sound source elevation. For these reasons, the hypothesis that cross-modal mapping represents an internalization of cross-modal statistical co-occurrences in the physical domain needs further investigation in the range of frequencies below 1 kHz, a range that is very important because it includes the fundamental frequencies of the human voice (Horii, 1975).
The need for cognitive and neural maturation in the emergence of pitch-related cross-modal associations is supported by Nava et al. (2016), who found the development of audio-visual, visual-tactile, and audio-tactile correspondences only in 4- to 5-year-old preschool children. In contrast, Walker et al. (2010, 2014) reported that 3- to 4-month-old preverbal infants showed longer preferential looking at cross-modal congruent bimodal displays. These results, however, were not confirmed by Lewkowicz and Minar (2014), who showed that 4- to 12-month-old infants were not able to link auditory pitch to visual-spatial height.
The direct explanation of cross-modal associations as statistical matching between two sensory dimensions cannot be generalized to all cases. There are examples, such as the association between high-pitched tones and angular shapes, and low-pitched tones and curvilinear shapes (Marks, 1987), or the association between pitch, lightness, and brightness (Marks, 1987; Martino & Marks, 1999; Melara, 1989), in which the auditory and visual dimensions are unrelated in the physical domain. High-pitched sounds are not more frequently emitted by luminous, bright, or V-shaped sources, and low-pitched sounds are not emitted with a higher probability by dark, dull, or U-shaped sources. It can be suggested that all of these cross-modal matches rely on second-order perceptual attributes. For example, a higher vertical position is invested with positive attributes (Costa & Bonetti, 2016; Gottwald, Elsner, & Pollatos, 2015; Meier & Robinson, 2004), and higher pitch during human speech is perceived as expressing more positive emotions (e.g., Gobl & Chasaide, 2003; Patel, Scherer, Björkner, & Sundberg, 2011; Scherer, 1995). The overlap in perceived attributes between the acoustic and the visual domains might evoke the cross-modal match.
Future studies could address the role of musical features other than pitch in sound source spatial localization. It can be suggested that loudness, musical mode, and specific aspects of timbre could significantly interact with vertical sound source localization. One specific result that needs further exploration is the facilitating effect of spatially high sources on sound localization.
Furthermore, considering previous evidence that linked musical preferences and perception to intelligence (e.g., Bonetti & Costa, 2016, 2017) and personality (e.g., Chamorro-Premuzic, Fagan, & Furnham, 2010; Rentfrow & Gosling, 2003), future research could investigate possible connections between individual differences on cognitive and personality scales and the efficiency of cross-modal integration.
Footnotes
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
