Abstract
Theories on the origins of cross-modal correspondences involving pitch speculate on statistical, semantically-mediated, and structural factors. We hypothesize that five apparently different conceptualizations of pitch sequencing are based on an underlying structure consisting of at least two conceptual primitives:
Musical cross-modal correspondences (CMC) represent an area of study with a long tradition. Numerous experiments have tested the interactions between pitch and its perceived correspondence in height, size, or brightness (extensively reviewed in Marks, 2004; Spence, 2011; Walker, 2016). As a result, the prevailing opinion on cross-modal correspondences is that they do not represent a single phenomenon, but rather “utilize a variety of psychological mechanisms [and] have different origins, […] ranging from basic perceptual and motor functions to the shaping of language metaphors and cultural practices” (Eitan, 2017, p. 214). Despite experimental and anthropological data corroborating such diversity (Eitan & Timmers, 2010), many authors propose only one epistemological framework such as: implicit statistical inference from cross-modal bindings in nature (e.g., high frequencies tend to be produced by elevated sources, Parise, Knorre, & Ernst, 2014); “structural” motivation, which may be either neurological (e.g., an increase in stimulus intensity tends to increase neural firing, Spence, 2011) or “amodal” (wherein distant pitches correspond to any contrasting features used to probe them, be it brightness or aromas or haptic opposites, Walker, 2016); and finally, semantic mediation, wherein the use of a particular metaphor in the (native) language influences the construction of CMCs (Martino & Marks, 1999).
These last two theories provide an interesting connection with linguistic semantics, where many scholars agree that concepts are grounded in some sort of schematic relation – “schemas” (Rumelhart, 1980), “image schemas” (Johnson, 1987), or “conceptual primitives” (Mandler, 1992; Jackendoff, 1990) – yet the question remains whether conceptual “cross-field parallelisms are derivational [… or rather are] parallel instantiations of a more abstract schema” (Jackendoff, 2002, p. 359). Regarding cross-modal bindings in pitch relations, the dilemma is therefore (1) whether an abstract principle underlies the various instances of a CMC, suggesting that the correspondence is conceptual rather than merely perceptual, and if so, (2) whether its motivation is predominantly “semantically mediated” or more generally “amodal.” To address this matter, the present study considers this debate in light of the perception of pitch relations in scales.
Numerous studies of pitch relations have revealed differences in the way musical concepts are constructed cross-culturally. Eitan and Timmers (2010) located 35 cross-cultural antonym pairs, from thick/thin and old/young to apparent idiosyncrasies such as “crocodiles/those who follow crocodiles,” seemingly confounding the possibility for universal commonalities beneath such variegated lexical choices (but see Walker et al., 2010). In their study of “height” and “thickness” descriptions of pitch relations cross-linguistically, Dolscheid, Shayan, Majid, and Casasanto (2013) concluded that one’s mother tongue may motivate one’s conceptualization of pitch relations (although with a brief fill-in-the-blanks task, adults can still be trained to use a foreign conceptualization).
In contrast, our previous studies (Antović, 2009; Antović, Bennett, & Turner, 2013) show that children, despite the expressions common to their mother tongue, freely describe pitch relationships using conceptualizations common to foreign languages, such as “thick/thin” in Turkish and Farsi (Shayan, Ozturk, Bowerman, & Majid, 2014; Shayan, Ozturk, & Sicoli, 2011) and “large/small” in Javanese (Perlman, 2004), African Manza (Stone, 1981), and Venda (Blacking, 1970/1995). Furthermore, certain remote populations associate pitch with verticality despite the non-vertical metaphor used in their language (Parkinson, Kohler, Sievers, & Wheatley, 2012). Even infants (Mondloch & Maurer, 2004) and non-human primates (Ludwig, Adachi, & Matsuzawa, 2011) regularly associate pitch with spatial dimensions. Without a mother tongue to speak of among such populations, these examples question the deterministic role of language in constructing conceptualizations; however, one may still propose that one’s native language strengthens pre-existing associations of spatial dimensions with pitch relations (Casasanto, 2010; Dolscheid, Hunnius, Casasanto, & Majid, 2014). Therefore, the aim of the present study is to address the following questions: 1) does a longer exposure to a language enhance a particular CMC (e.g., pitches as heights) at the expense of others?; 2) within a language group, is one CMC “primary,” or do many, or even all, share an underlying abstract structure?
This study includes stimuli representing five different modes – “vertical movement,” “shrinking in size,” “thinning in width,” “rotation,” and “hue change” – corresponding to two different scales – diatonic and non-diatonic. We hypothesize that the participants’ conceptualizations of pitch sequencing are implicitly informed by an abstract schematic structure, comprising two conceptual primitives –
Hypothesis 1: Participants rate the visual representations of a musical scale conceptually, based on the aforementioned primitives, rather than on their visual agreement with the lexical item from their mother tongue or on lower-level perceptual factors;
Hypothesis 2: Longer exposure to the mother tongue and/or musical training does not affect ratings.
Experiment 1
Method
Participants
The sample size was determined in G*Power (inputs: ANOVA, repeated measures, within-factors, ηp2 = .1, f = .33, Power = .95, p = .05, two groups, three measurements, correlation among repeated measures = .20). This necessitated 80 participants: non-musicians, native speakers of Serbian (a “pitch/height” language), of whom 40 were primary-school third-graders (mean age 8.8 years, SD = 0.4, 37.5% male) and 40 were adults (mean age 20.15, SD = 0.85, 17.5% male). No participants were excluded.
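The two effect-size inputs listed above are linked by the standard conversion f = √(ηp² / (1 − ηp²)). The following is a minimal sketch of that conversion only; the function name is ours, chosen for illustration, and the sketch does not reproduce the G*Power sample-size computation itself.

```python
import math


def cohens_f_from_partial_eta_sq(eta_p_sq: float) -> float:
    """Convert partial eta squared to Cohen's f (standard identity)."""
    if not 0.0 <= eta_p_sq < 1.0:
        raise ValueError("partial eta squared must lie in [0, 1)")
    return math.sqrt(eta_p_sq / (1.0 - eta_p_sq))


# Input reported for Experiment 1: partial eta squared = .1
print(round(cohens_f_from_partial_eta_sq(0.10), 2))
```

With ηp² = .1 the conversion gives a value that rounds to the f = .33 entered above.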
Materials
In each trial, a sine-wave C-major scale was played (C4-C5, 0.5 s per tone, equal in dynamics at approximately 70 dB) through A4 Tech HS800 headphones and was accompanied by visual animations on a 17-inch monitor, observed from a distance of 0.5 meters (refresh rate: 60 Hz). The mp4 files were presented in GOM Player, with one trial per animation. The 12 visual sequences corresponded to three scale conceptualizations: Vertical Movement (down → up, pitch/height), Shrinking (big → small, pitch/size) and Thinning (thick → thin, pitch/width). The directionality of the Shrinking dimension was determined in accordance with previous studies (Antović, 2009; Fernández-Prieto, Navarra, & Pons, 2015), yet the matter is controversial and remains a limitation of this experiment (we address this separately in Experiment 3). Each sequence appeared as a black square on a white surface and demonstrated all permutations of the two postulated conceptual primitives in four conditions. The “two-primitives” condition included both
Procedure
Participants were asked to listen to each of the counterbalanced stimuli and circle a number on a 10-point Likert scale (0–9) to assess how well the animation “agreed” with the music (9 meaning perfect correspondence). Response time was not limited, but never exceeded five seconds. We expected scores to increase with each additional primitive, but irrespective of the animation type, such that there would be no significant differences among “zero-,” “one-,” and “two-primitives” conditions across the three animations. To assess any effect of mother tongue bias within the older population (i.e., preference for Vertical Movement), we tested two different age groups.
Participants’ ratings followed an ordinal scale and were not normally distributed, motivating us to use non-parametric Friedman’s ANOVA. For “zero” and “two-primitives” conditions, we analyzed data directly from the database. However, because the “one-primitive” condition contained two stimuli per animation type (
We then compared the distributions both “within-animation” (to determine if adding primitives increased the resulting scores) and “cross-animation” (to compare respective scores for “zero-,” “one-,” and “two-primitives” conditions across animation types).
Results
Adults
There were no differences in any pairwise “one-primitive” comparison (Wilcoxon signed-rank test, Bonferroni corrected to p < .01 due to six pairwise comparisons in each ANOVA): Vertical Movement: Z = -0.66, p = .51, r = .07; Shrinking: Z = -0.24, p = .81, r = .03; Thinning: Z = -1.25, p = .21, r = .14.
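The effect sizes reported for these Wilcoxon comparisons are consistent with the common convention r = |Z| / √N, where N is the total number of observations (two ratings from each participant). The sketch below assumes that convention, which the text does not state explicitly; the function name is illustrative.

```python
import math


def wilcoxon_effect_size(z: float, n_observations: int) -> float:
    """Effect size r = |Z| / sqrt(N) for a Wilcoxon signed-rank test."""
    if n_observations <= 0:
        raise ValueError("n_observations must be positive")
    return abs(z) / math.sqrt(n_observations)


# Assumption: N counts both paired ratings from each of the 40 adults.
N_OBS = 2 * 40

for label, z in [("Vertical Movement", -0.66),
                 ("Shrinking", -0.24),
                 ("Thinning", -1.25)]:
    print(label, round(wilcoxon_effect_size(z, N_OBS), 2))
```

Under this assumption the three Z values above yield r = .07, .03, and .14, matching the reported figures.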
With no differences, we then averaged these two scores, to establish our final “one-primitive” variable alongside “zero-” and “two-primitives.” Figure 1 presents within-animation results for adults.

Figure 1. Distributions of grades, adults, Experiment 1 (n = 40). X axis – percentage of participants; Y axis – configuration of primitives; shades of purple (0 to 9) – scores.
There was a significant difference in the scores awarded to “zero-,” “one-,” and “two-primitives” conditions (Friedman’s ANOVA): Vertical Movement: χ2(2) = 65.66, p < .01; Shrinking: χ2(2) = 66.76, p < .01; Thinning: χ2(2) = 58.51, p < .01. In a post-hoc analysis, we ran pairwise Wilcoxon comparisons between adjacent conditions. In all cases, the significance remained well below the Bonferroni-corrected p < .02 (for three pairs, .05/3), with effect sizes r ranging from 0.36 to 0.62.
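The corrected threshold mentioned above is the family-wise alpha divided by the number of comparisons. A minimal sketch of that arithmetic follows; the function name is illustrative.

```python
def bonferroni_alpha(family_alpha: float, n_comparisons: int) -> float:
    """Per-comparison significance threshold under Bonferroni correction."""
    if n_comparisons < 1:
        raise ValueError("n_comparisons must be at least 1")
    return family_alpha / n_comparisons


# Three pairwise post-hoc comparisons at a family-wise alpha of .05
print(round(bonferroni_alpha(0.05, 3), 4))
```

For three pairs the threshold is .05/3, which the text reports rounded to .02.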
Across animation, we ran Friedman’s ANOVAs on the respective scores for “zero-,” “one-,” and “two-primitives” conditions through three animation types: “zero-primitives”: χ2(2) = 8.20, p < .05; “one-primitive”: χ2(2) = 3.74, p = .15; “two-primitives”: χ2(2) = 4.61, p = .10. The difference in the “zero-primitives” condition occurred because grades for the largest static square in Thinning were higher than those for the smallest static square in Vertical Movement, Bonferroni-corrected (Z = -2.67, p < .01, r = .30).
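For a test statistic with two degrees of freedom, the chi-square upper-tail probability has the closed form p = exp(−χ²/2). The sketch below uses it to check the cross-animation p values, assuming they come from the usual chi-square approximation to Friedman's test; the function name is illustrative.

```python
import math


def chi2_upper_tail_df2(statistic: float) -> float:
    """Upper-tail probability of a chi-square variable with 2 degrees of freedom."""
    if statistic < 0:
        raise ValueError("a chi-square statistic cannot be negative")
    return math.exp(-statistic / 2.0)


# Cross-animation statistics reported for adults in Experiment 1
for label, chi2 in [("zero-primitives", 8.20),
                    ("one-primitive", 3.74),
                    ("two-primitives", 4.61)]:
    print(label, round(chi2_upper_tail_df2(chi2), 2))
```

Under that assumption the three statistics give p values of about .02, .15, and .10, in line with the reported p < .05, p = .15, and p = .10.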
Children
Comparing individual “one-primitive” responses by means of a Wilcoxon signed-rank test, Bonferroni-corrected (p < .01), we found no differences in two out of the three pairwise comparisons: Vertical Movement: Z = -3.30, p < .01, r = .37; Shrinking: Z = -0.56, p = .58, r = .06; Thinning: Z = -2.49, p = .013, r = .28. Only in Vertical Movement did

Figure 2. Distributions of grades, children, Experiment 1 (n = 40). X axis – percentage of participants; Y axis – configuration of primitives; shades of purple (0 to 9) – scores.
Again, within animation, the distributions for “zero-,” “one-,” and “two-primitives” conditions differed: Vertical Movement: χ2(3) = 65.56, p < .01; Shrinking: χ2(2) = 24.70, p < .01; Thinning: χ2(2) = 29.66, p < .01. Post-hoc Bonferroni-adjusted pairwise Wilcoxon comparisons showed significant differences between the “zero-” and “one-primitive” conditions (p < .01), but no difference between “one-” and “two-primitives” conditions, as follows: Vertical Movement 1 (
There were no differences for the same number of primitives cross-animation, with “zero-”: χ2(2) = 1.86, p = .39; “one-” (averaged): χ2(2) = 2.31, p = .31; and “two-primitives”: χ2(2) = 1.14, p = .56. With “one-primitive” conditions separated, the difference emerged with
Comparison of adults and children
To compare the two groups and test any interactions between subjects, we rank-transformed all responses and then performed a mixed ANOVA with repeated measures on one factor.
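The rank transformation mentioned above replaces each rating with its rank before the parametric analysis is run. The sketch below shows one common variant, in which tied ratings share the mean of the ranks they occupy; the text does not specify how ties were handled or how responses were pooled, so both the tie rule and the sample ratings are assumptions for illustration. The subsequent mixed ANOVA is not shown.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values receive the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        # Extend the run while the next sorted value equals the current one.
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2.0 + 1.0
        for position in range(start, end + 1):
            ranks[order[position]] = mean_rank
        start = end + 1
    return ranks


# Hypothetical ratings on the 0-9 scale, not data from the study
ratings = [9, 0, 5, 5, 7]
print(average_ranks(ratings))
```

Ordinal ratings with many ties are typical of a 10-point scale, which is why the tie rule matters for this step.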
The main effect of group was found in the “zero-primitives” condition, in which children gave higher scores than adults, F(1, 78) = 8.15, p < .01, ηp2 = .095, and adults gave higher scores than children in the “two-primitives” condition, F(1, 78) = 3.88, p < .05, ηp2 = .047. There was no difference between groups in the “one-primitive” condition, F(1, 78) = 0.56, p = .455, ηp2 = .007. There were no significant interactions between group and parameter in any of the three conditions.
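The partial eta squared values reported with these F ratios follow from the standard identity ηp² = (F · df_effect) / (F · df_effect + df_error). A minimal sketch of that check, with an illustrative function name:

```python
def partial_eta_sq_from_f(f_value: float, df_effect: int, df_error: int) -> float:
    """Recover partial eta squared from an F ratio and its degrees of freedom."""
    if f_value < 0 or df_effect < 1 or df_error < 1:
        raise ValueError("F must be non-negative and both df values positive")
    numerator = f_value * df_effect
    return numerator / (numerator + df_error)


# Group effects in Experiment 1, each reported as F(1, 78)
for label, f_value in [("zero-primitives", 8.15),
                       ("one-primitive", 0.56),
                       ("two-primitives", 3.88)]:
    print(label, round(partial_eta_sq_from_f(f_value, 1, 78), 3))
```

The three F ratios give .095, .007, and .047, matching the effect sizes reported above.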
Discussion
Adults’ results seem to support Hypothesis 1. Whether the scale was represented as a square ascending, shrinking, or thinning, they increased their ratings whenever underlying primitives appeared. Importantly for the discussion on language bias, adults did not seem to prefer an animation type for the “one-primitive” conditions (
The results with children support Hypothesis 1, yet less definitively. First, children assessed the “one-primitive” animations differently for Vertical Movement, wherein the presence of
We included children, however, to test Hypothesis 2: whether a discernible effect of native language would be present in older participants. This does not seem to obtain. Considering both groups, 21 out of 24 pairwise cross-animation comparisons indicate no preference for Vertical Movement, which would reflect lexicalization from the mother tongue. The following three isolated cases do not seem to interfere with this conclusion: “zero–primitive” Thinning received slightly higher scores among adults, but this stimulus does not display vertical movement. Secondly, Vertical Movement was rated more highly than Thinning and slightly more than Shrinking (“one-primitive,
Experiment 2
Experiment 2 additionally tested directionality. Here one half of the stimuli were directed congruently with pitch movement (i.e., in following the scale, the square goes up and down, becomes smaller and bigger, or gets thinner and thicker). The other half were directed incongruently (i.e., the square first goes down then up, becomes bigger then smaller, or becomes thicker then thinner). Hypothesis 3 assumes that not even this reversal of movement would interfere with scores either within-animation or cross-animation. In this segment as well, our assumption was that in the second visual presentation, congruent movement following an ascending scale is that of reduction rather than increase in size (Antović, 2009; Fernández-Prieto et al., 2015). Yet this choice is not uncontroversial (Eitan, 2013) and is additionally addressed in Experiment 3.
Method
Participants
The sample size was determined as in Experiment 1. Anticipating equally high effect sizes within-stimulus for adults (r ranging from .36 to .62), we increased the expected effect size here to ηp2 = .15 but left the other parameters intact. This necessitated 52 participants, native speakers of Serbian, divided into 26 students with neither formal nor informal musical instruction (mean age 21.35 years, SD = 0.69, 42.6% male) and 26 students with at least five years of professional musical instruction (mean age 21.80, SD = 2.06, 38.5% male). Musicians were included to test whether their familiarity with the jargon for pitch movement and vertical pitch notation would result in a bias toward the “Original-Direction Vertical” stimuli. There were no exclusions in the non-musician group. Two musicians were immediately replaced from within the pool due to inappropriate behavior.
Materials
Because the scale moved in two directions (C4-C5-C4), the stimuli included a directionality variable. Thus, each animation type appeared in eight variants, resembling all permutations of the same two postulated primitives with congruent or incongruent directionality. The “two-primitives” condition included both
Procedure
The procedure was identical to Experiment 1 with twice the number of calculations to account for original- and reversed-direction stimuli.
Results
Non-musicians
We first looked for any differences between the two individual “one-primitive” conditions (Wilcoxon signed-rank test), again finding no differences in any of the six comparisons: Vertical Movement, Original: Z = -1.15, p = .25, r = .16, and Reversed: Z = -0.23, p = .82, r = .03; Shrinking, Original: Z = -0.99, p = .32, r = .14, and Reversed: Z = -0.82, p = .41, r = .11; Thinning, Original: Z = -1.08, p = .28, r = .15, and Reversed: Z = -1.78, p = .07, r = .25.
We thereby averaged each pair, established the new grand “one-primitive” variable, and compared it to the “zero-” and “two-primitives” variables. Distributions are given in Figure 3.

Figure 3. Distribution of grades, non-musician adults, Experiment 2 (n = 26). X axis – percentage of participants; Y axis – configuration of primitives; shades of purple (0 to 9) – scores.
For all animations, grades for “zero-,” “one-,” and “two-primitives” significantly differed (Friedman’s ANOVA): Original: Vertical Movement: χ2(2) = 48.06, p < .01; Shrinking: χ2(2) = 44.17, p < .01; Thinning: χ2(2) = 41.18, p < .01. Reversed: Vertical Movement: χ2(2) = 46.39, p < .01; Shrinking: χ2(2) = 40.10, p < .01; Thinning: χ2(2) = 47.94, p < .01.
All post-hoc pairwise comparisons were significantly different, well below the Bonferroni-corrected threshold (p < .01, with effect sizes r between 0.39 and 0.62). Comparing the respective scores for “zero-,” “one-,” and “two-primitives” conditions cross-animation, we found no differences for original-direction conditions: Zero Primitives: χ2(2) = 3.93, p = .14; One Primitive: χ2(2) = 5.77, p = .06; Two Primitives: χ2(2) = 0.57, p = .75. For opposite-direction conditions, scores for “one-” and “two-primitives” treatments differed: Zero Primitives: χ2(2) = 0.52, p = .77; One Primitive: χ2(2) = 19.00, p < .01; Two Primitives: χ2(2) = 14.52, p < .01. Post-hoc analysis (Wilcoxon signed-rank tests, Bonferroni adjusted to p < .02) reveals that these last two differences arose from the lower scores for Reversed Vertical Movement (“one-primitive” Shrinking vs. Vertical Movement, Z = -2.88, p < .01, r = .40; “one-primitive” Thinning vs. Vertical Movement, Z = -3.36, p < .01, r = .47; “two-primitives” Shrinking vs. Vertical Movement, Z = -2.57, p < .01, r = .36; “two-primitives” Thinning vs. Vertical Movement, Z = -3.17, p < .01, r = .44). Opposite Shrinking and Opposite Thinning did not differ (“one-primitive,” Z = -0.31, p = .75, r = .04; “two-primitives,” Z = -0.69, p = .49, r = .10).
Musicians
In this group only,
Thus, we considered two “one-primitive” treatments separately in all calculations for musicians. Distributions are given in Figure 4.

Figure 4. Distribution of grades, musician adults, Experiment 2 (n = 26). X axis – percentage of participants; Y axis – configuration of primitives; shades of purple (0 to 9) – scores.
Scores for “zero-,” “one-,” and “two-primitives” conditions differed within each animation type: Original: Vertical Movement: χ2(3) = 62.78, p < .01; Shrinking: χ2(3) = 53.42, p < .01; Thinning: χ2(3) = 58.41, p < .01; Reversed: Vertical Movement: χ2(3) = 55.58, p < .01; Shrinking: χ2(3) = 54.40, p < .01; Thinning: χ2(3) = 62.10, p < .01.
Post-hoc tests showed a clear difference between “zero-primitives” and any of the “one-primitive” conditions (p < .01, effect size r ranging from 0.57 to 0.64). Between individual “one-primitive” conditions and the “two-primitives” condition, the latter received higher scores overall, with particular consistency when compared with
For reversed-direction animations, scores for the “zero-primitives” condition and “one–primitive,
Comparison of musicians and non-musicians
In the same procedure as in Experiment 1, the main effect of group was found in the “zero-primitives” condition, in which musicians gave the lowest grade (0) while non-musicians’ scores slightly varied: original direction F(1, 50) = 5.06, p < .05, ηp2 = .092; opposite direction F(1, 50) = 3.78, p = .06, ηp2 = .070. There were no differences in either “one-primitive, +
Discussion
The population new to this experiment (musically-untrained adults) again significantly increased the scores in all six animations as primitives were added, with no preference for a particular “one-primitive” configuration. There were no differences across-animation with original directionality, although with opposite directionality, grades for Opposite Vertical Movement were significantly lower in the “one-” and “two-primitives” conditions than in those for Original Vertical Movement.
Musicians consistently preferred
In Shrinking and Thinning, directionality seemed irrelevant. Indeed, in prior experiments participants have naturally interpreted higher pitch as “bigger” (Antović, 2009; Krugliak & Noppeney, 2015), and falling pitch as “shrinking” (Eitan, Schupak, Zotler, & Marks, 2014). This occasional “reversed direction” preference for size is confounded by the fact that static low pitches are often considered “big” while dynamic movement from low to high is often perceived as expanding (Eitan, 2013). Experiment 3 addresses this problem in more detail.
Additionally, musicians’ preference for
Experiment 3
The first two experiments were limited in three ways. First, by using a major scale, we may have evoked unintended musical confounds (e.g., the sense of closure or tonal center) in participants’ choices. Secondly, the visual dimensions of verticality, size, and thickness used in this study have proven pertinent in other studies, not necessarily suggesting the crucial influence of a conceptual, primitive structure. Finally, the issue of “congruence” in directionality was somewhat sensitive, considering the common assumption that ascending pitches become smaller (e.g., Fernández-Prieto et al., 2015) and the surmise by Dolscheid et al. (2013) that “greater spatial height corresponds to higher frequency, but greater spatial thickness corresponds to lower frequency” (p. 615). Yet, some authors note that size corresponds to pitch in opposite ways depending on whether the tones are played individually or in sequence (Eitan, 2013). To corroborate this critique, we asked participants in a post-hoc test to listen twice to two extreme tones of a scale and indicate which was bigger and then which was thicker (counterbalanced with smaller and thinner). Thereafter, we played the entire scale and asked whether they perceived it as “growing” and then also as “thickening.” Admittedly, responses for size perception were random, while those for thickness were not (Table 1). To address these issues, the third experiment tested CMCs of an upward, non-diatonic Bohlen-Pierce scale, in four animation types: two (verticality and thickness) latent in previous cross-linguistic work and two (rotation and hue change) both absent in cross-linguistic work (Eitan & Timmers, 2010) and not perceptually salient (Bernstein, Eason, & Schurman, 1971; Iwamiya, 2013). Additionally, we recruited children one year older, to better distinguish between “one-” and “two-primitives” conditions.
Table 1. Percentages of forced-choice, binary verbal responses to static and dynamic stimuli.
Method
Participants
We gathered 26 primary-school fourth-graders (mean age 9.8 years, SD = 0.4, 38.5% male) and 26 adults (mean age 20.8, SD = 0.9, 42.3% male), all native speakers of the same “pitch/height” language, Serbian. No adults were excluded. Three children were immediately replaced from within the pool due to not following instructions.
Materials
In each trial, participants heard a seven-tone ascending sine-wave Bohlen-Pierce scale (approximate frequencies in Hz: 261, 283, 370, 440, 476, 568, 610) with 0.5 s per tone at 70 dB and, in response to the limitation of the first two experiments, with equal loudness correction (Parise, 2016). By using seven non-diatonic tones, we removed the effects of musical closure and tonal hierarchy. While equipment and procedure were identical to Experiments 1 and 2, the stimuli comprised 16 visual sequences, corresponding to four scale conceptualizations: Vertical Movement (down → up, pitch/height), Thinning (thick → thin, pitch/width), Rotation (clockwise → counterclockwise, pitch/rotation), and Hue Change (blue → red, pitch/hue).
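The approximate frequencies listed above can be related to the steps of a Bohlen-Pierce scale with a short computation. The sketch below assumes the equal-tempered variant, in which the 3:1 "tritave" is divided into 13 equal steps, and a starting pitch of C4 at about 261.63 Hz; the text specifies neither the tuning variant nor the exact starting frequency, so both are assumptions, and the function names are illustrative.

```python
import math

C4_HZ = 261.63            # assumed starting pitch (approximate C4)
STEPS_PER_TRITAVE = 13    # equal-tempered Bohlen-Pierce: 13 steps span a 3:1 ratio


def bp_frequency(step: int, base_hz: float = C4_HZ) -> float:
    """Frequency of the given equal-tempered Bohlen-Pierce step above the base."""
    return base_hz * 3.0 ** (step / STEPS_PER_TRITAVE)


def nearest_bp_step(freq_hz: float, base_hz: float = C4_HZ) -> int:
    """Index of the Bohlen-Pierce step closest to freq_hz in log-frequency."""
    return round(STEPS_PER_TRITAVE * math.log(freq_hz / base_hz) / math.log(3.0))


# Approximate frequencies (Hz) as listed for the seven-tone stimulus
reported_hz = [261, 283, 370, 440, 476, 568, 610]
steps = [nearest_bp_step(f) for f in reported_hz]
print(steps)
print([round(bp_frequency(s), 1) for s in steps])
```

Under these assumptions the seven tones fall nearest to steps 0, 1, 4, 6, 7, 9, and 10 of the 13-step division, so the sequence rises monotonically without completing a tritave.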
The manipulation of the two conceptual primitives in each presentation mirrored Experiment 1. However, the square sizes and steps were altered to match the frequency changes in the Bohlen-Pierce scale. Thus, in the “two-primitives” condition in which both
Procedure
The procedure was identical to Experiment 1, except that, to avoid potential spatial interference (Lidji, Kolinsky, Lochy, & Morais, 2007; Rusconi, Kwan, Giordano, Umilta, & Butterworth, 2006), we asked participants to state their score rather than circle it on a horizontal scale. We again expected that the scores would increase with the addition of each primitive but would not differ across-animation for the same number of primitives. If this obtained with rotation and hue change, animations neither salient cross-linguistically nor preferred in prior perceptual studies, it would strengthen our case for an underlying primitive structure.
Results
Adults
We first looked for any differences in the distributions of two individual “one-primitive” responses (Wilcoxon signed-rank test, Bonferroni corrected to p < .01). The distributions differed in two (Vertical Movement and Hue Change) out of four animations: Vertical Movement: Z = -2.91, p < .01, r = .40; Thinning: Z = -2.39, p < .05, r = .33; Rotation: Z = -2.05, p < .05, r = .28; Hue Change: Z = -3.76, p < .01, r = .52. Figure 5 provides within-animation results.

Figure 5. Distributions of grades, adults, Experiment 3 (n = 26). X axis – percentage of participants; Y axis – configuration of primitives; shades of purple (0 to 9) – scores.
Within-animation, the scores awarded to “zero-,” “one-,” and “two-primitives” conditions differed (Friedman’s ANOVA): Vertical Movement: χ2(2) = 47.62, p < .01; Thinning: χ2(2) = 44.25, p < .01; Rotation: χ2(2) = 42.56, p < .01; Hue Change: χ2(2) = 51.36, p < .01. In a post-hoc analysis, calculated with “one-primitive” conditions separated for Verticality and Hue Change and averaged for Thinning and Rotation, all relevant adjacent values were significantly different, p < .01. The only exception was “one-primitive
Across animations, there were no differences in any of the four comparisons (Friedman’s ANOVA): “zero-primitives,” χ2(3) = 3.70, p = .30; “one-primitive
Children
There were no differences in any of the four pairwise “one-primitive” comparisons (Bonferroni-corrected to p < .01): Vertical Movement, Z = -0.29, p = .77, r = .04; Thinning, Z = -1.59, p = .11, r = .22; Rotation, Z = -.3, p = .76, r = .04; Hue Change, Z = -.36, p = .72, r = .05. Therefore, we used mean values for “one-primitive” conditions in all four animations with children. Figure 6 provides distributions for all conditions.

Figure 6. Distributions of grades, children, Experiment 3 (n = 26). X axis – percentage of participants; Y axis – configuration of primitives; shades of purple (0 to 9) – scores.
Again, within animation types, the distributions for “zero-,” “one-,” and “two-primitives” conditions differed: Vertical Movement, χ2(3) = 29.41, p < .01; Thinning, χ2(2) = 27.53, p < .01; Rotation, χ2(2) = 18.57, p < .01; Hue Change, χ2(2) = 24.82, p < .01.
Post-hoc, all cases showed differences between the “zero-” and “one-primitive” treatments (p < .01), but no difference between “one-” and “two-primitives” conditions: Vertical Movement, Z = -1.26, p = .21, r = .15; Thinning, Z = -1.16, p = .25, r = .16; Rotation, Z = -1.06, p = .29, r = .15; Hue Change, Z = -0.33, p = .74, r = .05.
Across animations, there were no differences for the same number of primitives with children: “zero-”: χ2(3) = 7.92, p = .05; “one-” (averaged): χ2(3) = 1.08, p = .78; “two-primitives”: χ2(3) = 3.81, p = .28. The borderline difference in the “zero-primitives” condition was from slightly lower average grades for the Verticality static square (insignificant in individual post-hoc Wilcoxon comparisons).
Comparison of adults and children
The main effect of group was found in all conditions, with children giving higher scores than adults in the “zero-primitives” treatment, F(1, 50) = 9.08, p < .01, ηp2 = .154, and “one-primitive” treatment, F(1, 50) = 6.74, p < .05, ηp2 = .119, but with adults giving higher grades to “two-primitives,” F(1, 50) = 4.29, p < .05, ηp2 = .079. There were no significant interactions between group and parameter.
Discussion
Similarly to Experiment 1, adults increased the grades as primitives were added, equally across animations. The only real exception was “one-primitive
Children also responded as in Experiment 1. They distinguished between “zero-” and “one-primitive” conditions, but never between “one-” and “two-primitives” treatments; likewise, across-animation, they provided practically identical scores to “zero-,” “one-,” and “two-primitives” tasks.
While 10-year-olds may still have been too young to focus on the nuance between “one-” and “two-primitives,” it is more important that both populations equally scored each primitive across animations. This strongly suggests that participants were influenced more by an internal schematic structure than by the animation type, solving the task conceptually rather than perceptually. Strikingly, this pattern also obtained with two animations that were neither cross-linguistically prevalent nor salient in prior perceptual studies, further supporting the thesis that an abstract schematic structure motivates the construction of this concept. Moreover, the Verticality conceptualization was not preferred by either population, questioning the thesis that longer use of a language strengthens one particular CMC.
The only exception was the highly ranked “one-primitive
Conclusions
Results suggest that participants experience a sequence of pitches as a concept, not simply a percept. They seem to infer an underlying, schematic structure, prompting them to increase grades with the addition of conceptual primitives. This holds even when the directionality is reversed, and, crucially, when the animations are inconsistent with cross-linguistic options or previous experimental results. We believe this finding supports the proposal for conceptual primitives by linguist Ray Jackendoff, and also psychologist Peter Walker’s claim that different cross-modal correspondences may involve a “modality independent conceptual representation of stimulus features” (Walker, 2016, p. 105). And one need not conclude this just based on the use of five animation types. Additionally, when we introduced a non-diatonic scale, removed the horizontal (visual) Likert sequence, or recruited new groups of participants, the result remained the same. In fact, nothing changed even in spite of two technical omissions in the first two experiments – the failure to apply equal loudness correction and the use of the mistaken 3:2 ratio for semitones (instead of 2:1). With these two issues resolved, results again obtained in Experiment 3. Participants seemed unaffected by structural subtleties in the stimuli across the three experiments because they were basing their responses on a deep abstract structure rather than on any “lower-level” clues – perceptual, statistical, or psychophysical.
It turns out that not two, but at least three schematic factors inform the conceptualizing of scales.
Finally, the influence of the mother tongue on scale conceptualizations remains unclear. Unfortunately, none of our participants were speakers of a “pitch/thickness” language, and we are motivated to recruit some for further work. Yet these results already call into question the thesis that a lexicalization from the native language strengthens one mapping at the expense of others. If such were the case, then 12 more years of exposure to the language (adults vs. children) and knowledge of conventional visual and technical cues (musicians vs. non-musicians) would result in at least a slight preference for vertical movement in the older and musically-trained populations. Apart from slightly lower grades for incongruent vertical movement in Experiment 2, this was practically never the case with our 184 participants in three separate experiments. Further cross-linguistic and cross-cultural work is needed to address this most interesting problem.
In short, we have put forward a case for the idea that musical concepts are based on abstract, schematic mental representations. Naturally, the model needs development, yet if some of the findings prove relevant to psychologists, musicologists, and linguistic semanticists, this could spark further, exciting debate in theories of musical conceptualization.
Footnotes
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the Serbian Ministry of Science (project no. 179013).
