Conceptual rather than perceptual: Cross-modal binding of pitch sequencing is based on an underlying schematic structure

Abstract

Theories on the origins of cross-modal correspondences involving pitch speculate on statistical, semantically-mediated, and structural factors. We hypothesize that five apparently different conceptualizations of pitch sequencing are based on an underlying structure consisting of at least two conceptual primitives: discrete distance and unidirectional scalar change. A total of 184 children and adults were asked to rate 52 animations set to a tonal and non-tonal scale and presented as a square moving vertically, shrinking/expanding in size, narrowing/thickening in width, rotating clockwise/counterclockwise, and changing in hue. We varied the underlying structure of each visual stimulus by including or excluding one or both postulated primitives. The scores generally increased as one and then two primitives were added but did not differ across animation type (“pitch/height,” “pitch/size,” “pitch/width,” “pitch/rotation,” “pitch/hue”) if the same number of primitives was present. Overt movement may be preferred to static representations as the third primitive factor. Results suggest that cross-modal binding of pitch-sequencing is a conceptual task – based on an abstract schematic structure rather than lexicalizations from the mother tongue or lower-level perceptual clues.

Keywords

concepts, cross-modal correspondence, movement, pitch, schemas

Musical cross-modal correspondences (CMC) represent an area of study with a long tradition. Numerous experiments have tested the interactions between pitch and its perceived correspondence in height, size, or brightness (extensively reviewed in Marks, 2004; Spence, 2011; Walker, 2016). As a result, the prevailing opinion on cross-modal correspondences is that they do not represent a single phenomenon, but rather “utilize a variety of psychological mechanisms [and] have different origins, […] ranging from basic perceptual and motor functions to the shaping of language metaphors and cultural practices” (Eitan, 2017, p. 214). Despite experimental and anthropological data corroborating such diversity (Eitan & Timmers, 2010), many authors propose only one epistemological framework, such as: implicit statistical inference from cross-modal bindings in nature (e.g., high frequencies tend to be produced by elevated sources, Parise, Knorre, & Ernst, 2014); “structural” motivation, which may be either neurological (e.g., an increase in stimulus intensity tends to increase neural firing, Spence, 2011) or “amodal” (wherein distant pitches correspond to any contrasting features used to probe them, be it brightness or aromas or haptic opposites, Walker, 2016); and finally, semantic mediation, wherein the use of a particular metaphor in the (native) language influences the construction of CMCs (Martino & Marks, 1999).

These last two theories provide an interesting connection with linguistic semantics, where many scholars agree that concepts are grounded in some sort of schematic relation – “schemas” (Rumelhart, 1980), “image schemas” (Johnson, 1987), or “conceptual primitives” (Mandler, 1992; Jackendoff, 1990) – yet the question remains whether conceptual “cross-field parallelisms are derivational [… or rather are] parallel instantiations of a more abstract schema” (Jackendoff, 2002, p. 359). Regarding cross-modal bindings in pitch relations, the dilemma is therefore (1) whether an abstract principle underlies the various instances of a CMC, suggesting that the correspondence is conceptual rather than merely perceptual, and if so, (2) whether its motivation is predominantly “semantically mediated” or more generally “amodal.” To address this matter, the present study considers this debate in light of the perception of pitch relations in scales.

Numerous studies of pitch relations have revealed differences in the way musical concepts are constructed cross-culturally. Eitan and Timmers (2010) located 35 cross-cultural antonym pairs, from thick/thin and old/young to apparent idiosyncrasies such as “crocodiles/those who follow crocodiles,” seemingly confounding the possibility for universal commonalities beneath such variegated lexical choices (but see Walker et al., 2010). In their study of “height” and “thickness” descriptions of pitch relations cross-linguistically, Dolscheid, Shayan, Majid, and Casasanto (2013) concluded that one’s mother tongue may motivate one’s conceptualization of pitch relations (although with a brief fill-in-the-blanks task, adults can still be trained to use a foreign conceptualization).

In contrast, our previous studies (Antović, 2009; Antović, Bennett, & Turner, 2013) show that children, despite the expressions common to their mother tongue, freely describe pitch relationships using conceptualizations common to foreign languages, such as “thick/thin” in Turkish and Farsi (Shayan, Ozturk, Bowerman, & Majid, 2014; Shayan, Ozturk, & Sicoli, 2011) and “large/small” in Javanese (Perlman, 2004), African Manza (Stone, 1981), and Venda (Blacking, 1970/1995). Furthermore, certain remote populations associate pitch with verticality despite the non-vertical metaphor used in their language (Parkinson, Kohler, Sievers, & Wheatley, 2012). Even infants (Mondloch & Maurer, 2004) and non-human primates (Ludwig, Adachi, & Matsuzawa, 2011) regularly associate pitch with spatial dimensions. Without a mother tongue to speak of among such populations, these examples question the deterministic role of language on constructing conceptualizations; however, one may still propose that one’s native language strengthens pre-existing associations of spatial dimensions with pitch relations (Casasanto, 2010; Dolscheid, Hunnius, Casasanto, & Majid, 2014). Therefore, the aim of the present study is to address the following questions: 1) does a longer exposure to a language enhance a particular CMC (e.g., pitches as heights) at the expense of others?; 2) within a language group, is one CMC “primary,” or do many, or even all, share an underlying abstract structure?

This study includes stimuli representing five different modes – “vertical movement,” “shrinking in size,” “thinning in width,” “rotation,” and “hue change” – corresponding to two different scales – diatonic and non-diatonic. We hypothesize that the participants’ conceptualizations of pitch sequencing are implicitly informed by an abstract schematic structure, comprising two conceptual primitives – discrete distance (stepwise movement in accordance with pitch intervals) and unidirectional scalar change (preservation of a unidirectional path/transformation). We predict the following:

1) Participants rate the visual representations of a musical scale conceptually, based on the aforementioned primitives, rather than on its visual agreement with the lexical item from their mother tongue or on lower-level perceptual factors;

2) Longer exposure to the mother tongue and/or musical training does not affect ratings.

Experiment 1

Method

Participants

The sample size was determined in G*Power (inputs: ANOVA, repeated measures, within-factors, η_p² = .1, f = .33, Power = .95, α = .05, two groups, three measurements, correlation among repeated measures = .20). This necessitated 80 participants: non-musicians, native speakers of Serbian (a “pitch/height” language), of whom 40 were primary-school third-graders (mean age 8.8 years, SD = 0.4, 37.5% male) and 40 were adults (mean age 20.15, SD = 0.85, 17.5% male).¹ No participants were excluded.
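The relation between the two effect-size inputs listed above can be checked with the standard conversion f = √(η_p² / (1 − η_p²)). The following is a minimal sketch of that conversion only; the function name is hypothetical, and it does not reproduce G*Power's sample-size computation.

```python
import math


def cohens_f(partial_eta_sq: float) -> float:
    """Convert partial eta squared to Cohen's f (hypothetical helper name)."""
    if not 0.0 <= partial_eta_sq < 1.0:
        raise ValueError("partial eta squared must lie in [0, 1)")
    return math.sqrt(partial_eta_sq / (1.0 - partial_eta_sq))


# The reported input of eta_p^2 = .1 corresponds to f of about .33.
print(round(cohens_f(0.10), 2))
```

Under this conversion, a partial eta squared of .1 yields the f = .33 reported among the inputs.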

Materials

In each trial, a sine-wave C-major scale was played (C₄-C₅, 0.5 s per tone, equal in dynamics at approximately 70 dB) through A4 Tech HS800 headphones and was accompanied by visual animations on a 17-inch monitor, observed from the distance of 0.5 meters (refresh rate: 60 Hz). The mp4 files were presented in GOM Player, with one trial per animation. The 12 visual sequences corresponded to three scale conceptualizations: Vertical Movement (down → up, pitch/height), Shrinking (big → small, pitch/size) and Thinning (thick → thin, pitch/width). The directionality of the Shrinking dimension was determined in accordance with previous studies (Antović, 2009; Fernández-Prieto, Navarra, & Pons, 2015), yet the matter is controversial and remains a limitation of this experiment (we address this separately in Experiment 3). Each sequence appeared as a black square on a white surface and demonstrated all permutations of the two postulated conceptual primitives in four conditions. The “two-primitives” condition included both discrete distance and scalar change. As a result, the square moved or transformed from initial to final position in eight unidirectional, discrete steps of 0.60° or 0.40° according to the whole tones or semitones in the scale (this ratio of 3:2 instead of 2:1 was an oversight of the first author and a limitation of the study, yet given the full results of all three experiments, it did not seem to significantly influence the outcome; see Conclusions). The Vertical Movement square ascended discretely from lowest to highest position; the Shrinking square shrank discretely until reaching the 0.60° centered square; and the Thinning square narrowed discretely until reaching the thinnest, 0.60°-in-width centered, stretched rectangle.² The “one-primitive” condition comprised two variants. 
The first included discrete movement but excluded scalar change, resulting in vertical movement/transformation by discrete steps in one direction for the first five pitches but a “return” for the final three pitches. In Vertical Movement, the square ascended for the first five pitches but descended for the last three. Similarly, the Shrinking square shrank by height and width for the first five pitches but grew “back” for the final three; and the Thinning square shrank horizontally for five tones but expanded “back” for the final three. The second variant of the “one-primitive” condition excluded discrete distance but included scalar change. Thus, the squares moved/transformed smoothly (without discrete steps) and unidirectionally from initial to final position, as follows: in Vertical Movement from lowest to highest location; in Shrinking from largest to smallest size; and in Thinning from widest to thinnest size. With both primitives excluded, the “zero-primitives” condition consisted of a motionless black square in the following initial size/positions: for Thinning, a large, centered square (visual angle 4.98°); for Shrinking, a medium centered square (2.59°); and for Vertical Movement, the smallest square in the lowest position (0.60°). These sizes were selected to make the final square in the “two-primitives” condition identical in all examples.
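For readers who wish to translate the visual angles above into physical on-screen sizes, the usual relation is size = 2 · d · tan(θ/2). The sketch below assumes the stated viewing distance of 0.5 m and treats each stimulus as centered on the line of sight; the function name is hypothetical and the conversion is not taken from the original stimulus files.

```python
import math


def visual_angle_to_size_cm(angle_deg: float, distance_cm: float = 50.0) -> float:
    """Physical extent (cm) subtending `angle_deg` at `distance_cm` from the eye."""
    return 2.0 * distance_cm * math.tan(math.radians(angle_deg) / 2.0)


# Visual angles named in the text: smallest, medium, and largest square.
for angle in (0.60, 2.59, 4.98):
    print(f"{angle:.2f} deg -> {visual_angle_to_size_cm(angle):.2f} cm")
```

At these small angles the relation is nearly linear, so 0.60° corresponds to roughly half a centimeter at the assumed distance.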

Procedure

Participants were asked to listen to each of the counterbalanced stimuli and circle a number on a 10-point Likert scale (0–9) to assess how well the animation “agreed” with the music (9 meaning perfect correspondence). Response time was not limited, but never exceeded five seconds. We expected scores to increase with each additional primitive irrespective of the animation type, such that conditions with the same number of primitives (“zero-,” “one-,” or “two-primitives”) would not differ significantly across the three animations. To assess any effect of mother tongue bias within the older population (i.e., preference for Vertical Movement), we tested two different age groups.

Participants’ ratings followed an ordinal scale and were not normally distributed, motivating us to use non-parametric Friedman’s ANOVA. For “zero” and “two-primitives” conditions, we analyzed data directly from the database. However, because the “one-primitive” condition contained two stimuli per animation type (+discrete distance and +scalar change), we needed to determine if there were differences in the distributions of these two individual variants before treating the “one-primitive” variable as a single condition. If there were no differences between the two “one-primitive” variants, we averaged the two grades and then compared the result with the grades for “zero-” and “two-primitives.” Whenever we did find a difference between individual “one-primitive” conditions, we ran Friedman’s ANOVAs on all four conditions. This helped us assess the prevalence of either discrete distance or scalar change in such cases.

We then compared the distributions both “within-animation” (to determine if adding primitives increased the resulting scores) and “cross-animation” (to compare respective scores for “zero-,” “one-,” and “two-primitives” conditions across animation types).
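The analysis plan described above can be sketched in code. The example below is an illustration under stated assumptions rather than the authors' actual pipeline: the ratings are invented toy data, the function names are hypothetical, the outcome of the Wilcoxon comparison of the two “one-primitive” variants is passed in as a flag, and the Friedman statistic uses the standard tie-corrected formula implemented in plain Python.

```python
from collections import Counter


def average_ranks(values):
    """Rank values 1..k, giving tied values the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2.0 + 1.0
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks


def friedman_chi_square(rows):
    """Tie-corrected Friedman statistic and its df; one row of k ratings per participant."""
    n, k = len(rows), len(rows[0])
    rank_sums = [0.0] * k
    tie_term = 0.0
    for row in rows:
        for j, rank in enumerate(average_ranks(row)):
            rank_sums[j] += rank
        tie_term += sum(t ** 3 - t for t in Counter(row).values())
    correction = 1.0 - tie_term / (n * k * (k ** 2 - 1))
    if correction == 0:
        raise ValueError("every participant gave identical ratings to all conditions")
    stat = 12.0 / (n * k * (k + 1)) * sum(s ** 2 for s in rank_sums) - 3.0 * n * (k + 1)
    return stat / correction, k - 1


def build_conditions(zero, discrete_only, scalar_only, two, variants_differ):
    """Average the two one-primitive variants unless they were found to differ."""
    if variants_differ:
        return [list(r) for r in zip(zero, discrete_only, scalar_only, two)]
    one = [(d + s) / 2.0 for d, s in zip(discrete_only, scalar_only)]
    return [list(r) for r in zip(zero, one, two)]


# Toy ratings (0-9) for four hypothetical participants, one animation type.
zero, discrete_only = [0, 1, 0, 2], [4, 5, 3, 6]
scalar_only, two = [5, 5, 4, 6], [8, 9, 7, 9]
rows = build_conditions(zero, discrete_only, scalar_only, two, variants_differ=False)
stat, df = friedman_chi_square(rows)
print(f"chi-square({df}) = {stat:.2f}")
```

When the flag is set, the same function yields four conditions and the test has three degrees of freedom, which matches the cases in which the two “one-primitive” variants are reported separately.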

Results

Adults

There were no differences in any pairwise “one-primitive” comparison (Wilcoxon signed-rank test, Bonferroni corrected to p < .01 due to six pairwise comparisons in each ANOVA): Vertical Movement: Z = -0.66, p = .51, r = .07; Shrinking: Z = -0.24, p = .81, r = .03; Thinning: Z = -1.25, p = .21, r = .14.
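The effect sizes reported with these Wilcoxon tests appear consistent with the common conversion r = |Z| / √N, where N is the total number of observations across the two compared conditions (here 2 × 40 = 80). That reading is an inference from the reported values rather than something stated in the text; the sketch below, with a hypothetical function name, simply reproduces the arithmetic.

```python
import math


def wilcoxon_effect_size(z: float, n_observations: int) -> float:
    """Effect size r = |Z| / sqrt(N), with N the total number of observations."""
    return abs(z) / math.sqrt(n_observations)


N_OBS = 2 * 40  # assumption: two paired conditions x 40 adults
for label, z in [("Vertical Movement", -0.66), ("Shrinking", -0.24), ("Thinning", -1.25)]:
    print(label, round(wilcoxon_effect_size(z, N_OBS), 2))
```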

With no differences, we then averaged these two scores, to establish our final “one-primitive” variable alongside “zero-” and “two-primitives.” Figure 1 presents within-animation results for adults.

Figure 1.

Distributions of grades, adults, Experiment 1 (n = 40). X axis – percentage of participants; Y axis – configuration of primitives; shades of purple (0 to 9) – scores.

There was a significant difference in the scores awarded to “zero-,” “one-,” and “two-primitives” conditions (Friedman’s ANOVA): Vertical Movement: χ²(2) = 65.66, p < .01; Shrinking: χ²(2) = 66.76, p < .01; Thinning: χ²(2) = 58.51, p < .01. In a post-hoc analysis, we ran pairwise Wilcoxon comparisons between adjacent conditions. In all cases, the significance remained well below the Bonferroni-corrected p < .02 (for three pairs, .05/3), with effect sizes r ranging from 0.36 to 0.62.

Across animation, we ran Friedman’s ANOVAs on the respective scores for “zero-,” “one-,” and “two-primitives” conditions through three animation types: “zero-primitives”: χ²(2) = 8.20, p < .05; “one-primitive”: χ²(2) = 3.74, p = .15; “two-primitives”: χ²(2) = 4.61, p = .10. The difference in the “zero-primitives” condition occurred because grades for the largest static square in Thinning were higher than those for the smallest static square in Vertical Movement, Bonferroni-corrected (Z = -2.67, p < .01, r = .30).
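For a chi-square statistic with two degrees of freedom, the upper-tail probability has the closed form p = exp(−χ²/2), which makes the cross-animation values above easy to check by hand. The sketch below applies only to df = 2 and assumes the large-sample chi-square approximation to the Friedman statistic; the function name is hypothetical.

```python
import math


def chi2_sf_df2(stat: float) -> float:
    """Upper-tail p-value of a chi-square statistic with exactly 2 degrees of freedom."""
    if stat < 0:
        raise ValueError("a chi-square statistic cannot be negative")
    return math.exp(-stat / 2.0)


# Cross-animation statistics reported for adults in Experiment 1.
for label, stat in [("zero", 8.20), ("one", 3.74), ("two", 4.61)]:
    print(f"{label}-primitives: chi-square(2) = {stat:.2f}, p = {chi2_sf_df2(stat):.3f}")
```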

Children

Comparing individual “one-primitive” responses by means of a Wilcoxon signed-rank test, Bonferroni-corrected (p < .01), we found no differences in two out of the three pairwise comparisons: Vertical Movement: Z = -3.30, p < .01, r = .37; Shrinking: Z = -0.56, p = .58, r = .06; Thinning: Z = -2.49, p = .013, r = .28. Only in Vertical Movement did scalar change score better than discrete distance. The calculations below therefore use an averaged “one-primitive” variable for Shrinking and Thinning but preserve all four variables for Vertical Movement, distinguishing the two “one-primitive” conditions. Distributions are given in Figure 2.

Figure 2.

Distributions of grades, children, Experiment 1 (n = 40). X axis – percentage of participants; Y axis – configuration of primitives; shades of purple (0 to 9) – scores.

Again, within animation, the distributions for “zero-,” “one-,” and “two-primitives” conditions differed: Vertical Movement: χ²(3) = 65.56, p < .01; Shrinking: χ²(2) = 24.70, p < .01; Thinning: χ²(2) = 29.66, p < .01. Post-hoc Bonferroni-adjusted pairwise Wilcoxon comparisons showed significant differences between the “zero-” and “one-primitive” conditions (p < .01), but no difference between “one-” and “two-primitives” conditions, as follows: Vertical Movement 1 (+scalar change only vs. “two-primitives”): Z = -1.36, p = .17, r = .15; Vertical Movement 2 (+discrete distance only vs. “two-primitives”): Z = -2.18, p = .03, r = .24; Shrinking (“one-” vs. “two-primitives”): Z = -0.53, p = .59, r = .06; Thinning (“one-” vs. “two-primitives”): Z = -0.83, p = .41, r = .09.

There were no differences for the same number of primitives cross-animation, with “zero-”: χ²(2) = 1.86, p = .39; “one-” (averaged): χ²(2) = 2.31, p = .31; and “two-primitives”: χ²(2) = 1.14, p = .56. With “one-primitive” conditions separated, the difference emerged with +scalar change: χ²(2) = 10.94, p < .01. Pairwise, Vertical Movement scored higher than Thinning (Z = -2.86, p < .01, r = 0.32) and slightly higher than Shrinking (Z = -2.42, p < .02, r = 0.27), Bonferroni adjusted to p < .02.

Comparison of adults and children

To compare the two groups and test any interactions between subjects, we rank-transformed all responses and then performed a mixed ANOVA with repeated measures on one factor.

The main effect of group was found in the “zero-primitives” condition, in which children gave higher scores than adults, F(1, 78) = 8.15, p < .01, η_p² = .095, and adults gave higher scores than children in the “two-primitives” condition, F(1, 78) = 3.88, p < .05, η_p² = .047. There was no difference between groups in the “one-primitive” condition, F(1, 78) = 0.56, p = .455, η_p² = .007. There were no significant interactions between group and parameter in any of the three conditions.
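The partial eta squared values above can be recovered from the reported F ratios and degrees of freedom through η_p² = (F · df_effect) / (F · df_effect + df_error). The following is a minimal sketch of that identity with a hypothetical function name; it is a consistency check on the reported figures, not a reanalysis of the data.

```python
def partial_eta_squared(f_value: float, df_effect: int, df_error: int) -> float:
    """Partial eta squared implied by an F ratio and its degrees of freedom."""
    numerator = f_value * df_effect
    return numerator / (numerator + df_error)


# Group effects reported for Experiment 1, each with F(1, 78).
for label, f_value in [("zero", 8.15), ("two", 3.88), ("one", 0.56)]:
    print(f"{label}-primitives: eta_p^2 = {partial_eta_squared(f_value, 1, 78):.3f}")
```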

Discussion

Adults’ results seem to support Hypothesis 1. Whether the scale was represented as a square ascending, shrinking, or thinning, they increased their ratings whenever underlying primitives appeared. Importantly for the discussion on language bias, adults did not seem to prefer an animation type for the “one-primitive” conditions (discrete distance only or scalar change only) and rated similarly the “one” and “two-primitives” conditions across animations. The difference for the “zero-primitives” conditions occurred only in the pairwise comparison of the largest (Thinning) and smallest (Vertical Movement) initial square, which was likely a result of the squares’ different sizes, inevitable in the present design.

The results with children support Hypothesis 1, yet less definitively. First, children assessed the “one-primitive” animations differently for Vertical Movement, wherein the presence of scalar change scored higher than discrete distance. More conspicuously, in all three animations, children increased the scores for the addition of the first primitive, but not further for the addition of the second. They may have simply been too young (e.g., before Piaget’s [1952] “formal-operational stage”) to focus on the fine nuance between “one-” and “two-primitives” conditions. There were no differences, however, between animation type in grades given to “zero-,” “one-,” and “two-primitives” conditions, the only exception out of 12 being a somewhat stronger score for “one-primitive, +scalar change” in Vertical Movement. Otherwise, it seems that 9-year-olds rated the stimuli without preference for the animation type.

We included children, however, to test Hypothesis 2: whether a discernible effect of native language would be present in older participants. This does not seem to obtain. Considering both groups, 21 out of 24 pairwise cross-animation comparisons indicate no preference for Vertical Movement, which would reflect lexicalization from the mother tongue. The following three isolated cases do not seem to interfere with this conclusion. First, “zero-primitives” Thinning received slightly higher scores among adults, but this stimulus does not display vertical movement. Secondly, Vertical Movement was rated more highly than Thinning and slightly more than Shrinking (“one-primitive, +scalar change”), yet in children and not adults, negating the possibility for a lexical strengthening of a particular cross-modal mapping in older participants. Together, our participants’ responses seem motivated by the primitives underlying the stimuli and not by the animation type. Moreover, their choices look hardly biased by lexicalization for pitches and scales standard in their native language.

Experiment 2

Experiment 2 additionally tested directionality. Here one half of the stimuli were directed congruently with pitch movement (i.e., in following the scale, the square goes up and down, becomes smaller and bigger, or gets thinner and thicker). The other half were directed incongruently (i.e., the square first goes down then up, becomes bigger then smaller, or becomes thicker then thinner). Hypothesis 3 assumes that not even this reversal of movement would interfere with scores either within-animation or cross-animation. In this segment as well, our assumption was that in the second visual presentation, congruent movement following an ascending scale is that of reduction rather than increase in size (Antović, 2009; Fernández-Prieto et al., 2015). Yet this choice is not uncontroversial (Eitan, 2013) and is additionally addressed in Experiment 3.

Method

Participants

The sample size was determined as in Experiment 1. Anticipating equally high effect sizes within-stimulus for adults (r ranging from .36 to .62), we increased the expected effect size here to η_p² = .15 but left the other parameters intact. This necessitated 52 participants, native speakers of Serbian, divided into 26 students with neither formal nor informal musical instruction (mean age 21.35 years, SD = 0.69, 42.6% male) and 26 students with at least five years of professional musical instruction (mean age 21.80, SD = 2.06, 38.5% male). Musicians were included to test whether their familiarity with the jargon for pitch movement and vertical pitch notation would result in a bias toward the “Original-Direction Vertical” stimuli. There were no exclusions in the non-musician group. Two musicians were immediately replaced from within the pool due to inappropriate behavior.

Materials

Because the scale moved in two directions (C₄-C₅-C₄), the stimuli included a directionality variable. Thus, each animation type appeared in eight variants, resembling all permutations of the same two postulated primitives with congruent or incongruent directionality. The “two-primitives” condition included both discrete distance and scalar change, resulting in eight unidirectional, discrete steps from initial to final position in either original or reversed directions, followed by seven discrete steps in the opposite direction from final to initial position. Similar to Experiment 1, the “one-primitive” condition either included unidirectional scalar change and excluded discrete distance, or did the reverse. In the first scenario, the square moved or transformed smoothly (without discrete steps) from initial to final position: from bottom to top and back for Original Vertical Movement, biggest to smallest and back for Original Shrinking, thickest to thinnest and back for Original Thinning, and vice versa in opposite movement conditions (top to bottom and back, etc.). In the second scenario, the squares moved or changed their shape accordingly, yet now in discrete steps corresponding to whole- and half-tones: five steps in one direction, three steps back, then three more steps in the opposite (original) direction, and finally four steps back. The static “zero-primitives” conditions appeared as follows: a 4.98° centered square for Original Thinning and a 0.60°-wide centered rectangle for Reversed Thinning; a 2.59° centered square for Original Shrinking and a 0.60° centered square for Reversed Shrinking; a 0.60° lowermost square for Original Vertical Movement and a 0.60° topmost square for Reversed Vertical Movement.²

Procedure

The procedure was identical to Experiment 1 with twice the number of calculations to account for original- and reversed-direction stimuli.

Results

Non-musicians

We first looked for any differences in two individual “one-primitive” conditions (Wilcoxon signed-rank test), again finding no differences in any of the six comparisons: Vertical Movement, Original: Z = -1.15, p = .25, r = .16, and Reversed: Z = -0.23, p = .82, r = .03; Shrinking, Original: Z = -0.99, p = .32, r = .14, and Reversed: Z = -0.82, p = .41, r = .11; Thinning, Original: Z = -1.08, p = .28, r = .15, and Reversed: Z = -1.78, p = .07, r = .25.

We thereby averaged each pair, established the new grand “one-primitive” variable, and compared it to the “zero-” and “two-primitives” variables. Distributions are given in Figure 3.

Figure 3.

Distribution of grades, non-musician adults, Experiment 2 (n = 26). X axis – percentage of participants; Y axis – configuration of primitives; shades of purple (0 to 9) – scores.

For all animations, grades for “zero-,” “one-,” and “two-primitives” significantly differed (Friedman’s ANOVA): Original: Vertical Movement: χ²(2) = 48.06, p < .01; Shrinking: χ²(2) = 44.17, p < .01; Thinning: χ²(2) = 41.18, p < .01. Reversed: Vertical Movement: χ²(2) = 46.39, p < .01; Shrinking: χ²(2) = 40.10, p < .01; Thinning: χ²(2) = 47.94, p < .01.

All post-hoc pairwise comparisons were significantly different, well below the Bonferroni-corrected threshold (p < .01, with effect sizes r between 0.39 and 0.62). Comparing the respective scores for “zero-,” “one-,” and “two-primitives” conditions cross-animation, we found there were no differences for original-direction conditions: Zero Primitives: χ²(2) = 3.93, p = .14; One Primitive: χ²(2) = 5.77, p = .06; Two Primitives: χ²(2) = 0.57, p = .75. For opposite-direction conditions, scores for “one-” and “two-primitives” treatments differed: Zero Primitives: χ²(2) = 0.52, p = .77; One Primitive: χ²(2) = 19.00, p < .01; Two Primitives: χ²(2) = 14.52, p < .01. Post-hoc analysis (Wilcoxon signed-rank tests, Bonferroni adjusted to p < .02) reveals that these last two differences stemmed from the lower scores for Reversed Vertical Movement (“one-primitive” Shrinking vs. Vertical Movement, Z = -2.88, p < .01, r = .40; “one-primitive” Thinning vs. Vertical Movement, Z = -3.36, p < .01, r = .47; “two-primitives” Shrinking vs. Vertical Movement, Z = -2.57, p < .01, r = .36; “two-primitives” Thinning vs. Vertical Movement, Z = -3.17, p < .01, r = .44). Opposite Shrinking and Opposite Thinning did not differ (“one-primitive,” Z = -0.31, p = .75, r = .04; “two-primitives,” Z = -0.69, p = .49, r = .10).

Musicians

In this group only, scalar change engendered higher grades than discrete distance in all pairwise “one-primitive” comparisons, exceeding the strict Bonferroni threshold in all cases but Original Shrinking: Vertical Movement, Original: Z = -2.71, p < .01, r = .37; Reversed: Z = -2.68, p < .01, r = .37; Shrinking, Original: Z = -1.98, p < .05, r = .27; Reversed: Z = -3.45, p < .01, r = .48; Thinning, Original: Z = -2.84, p < .01, r = .39; Reversed: Z = -3.45, p < .01, r = .48.

Thus, we considered two “one-primitive” treatments separately in all calculations for musicians. Distributions are given in Figure 4.

Figure 4.

Distribution of grades, musician adults, Experiment 2 (n = 26). X axis – percentage of participants; Y axis – configuration of primitives; shades of purple (0 to 9) – scores.

Scores for “zero-,” “one-,” and “two-primitives” conditions differed within each animation type: Original: Vertical Movement: χ²(3) = 62.78, p < .01; Shrinking: χ²(3) = 53.42, p < .01; Thinning: χ²(3) = 58.41, p < .01; Reversed: Vertical Movement: χ²(3) = 55.58, p < .01; Shrinking: χ²(3) = 54.40, p < .01; Thinning: χ²(3) = 62.10, p < .01.

Post-hoc tests showed a clear difference between “zero-primitives” and any of the “one-primitive” conditions (p < .01, effect size r ranging from 0.57 to 0.64). Between individual “one-primitive” conditions and the “two-primitives” condition, the latter received higher scores overall, with particular consistency when compared with +discrete distance only (p < .01, effect size r between 0.41 and 0.57). Across animations, no differences were found in the “zero-” or “two-primitives” conditions for original-direction stimuli, but a difference did emerge in a single “one-primitive” comparison: Original: Zero Primitives, χ²(2) = 3.85, p = .15; One Primitive, +scalar change: χ²(2) = 5.65, p = .06; One Primitive, +discrete distance: χ²(2) = 7.41, p < .05; Two Primitives: χ²(2) = 3.19, p = .20. The difference in the second “one-primitive” condition stemmed from lower scores for Shrinking as compared to Vertical Movement: Z = -2.94, p < .01, r = 0.41.

For reversed-direction animations, scores for the “zero-primitives” condition and “one-primitive, +discrete distance” did not differ, but those for “one-primitive, +scalar change” and “two-primitives” treatments did: Opposite: Zero Primitives, χ²(2) = 1.64, p = .44; One Primitive, +scalar change: χ²(2) = 8.80, p < .05; One Primitive, +discrete distance: χ²(2) = 4.28, p = .12; Two Primitives: χ²(2) = 13.84, p < .01. Post-hoc Wilcoxon tests reveal that these differences originated from lower scores for Reversed Vertical Movement as opposed to both Shrinking and Thinning, exceeding the Bonferroni threshold in all cases except “+scalar change, Shrinking” vs. “+scalar change, Vertical Movement” (Z = -2.27, p < .05, r = 0.31).

Comparison of musicians and non-musicians

In the same procedure as in Experiment 1, the main effect of group was found in the “zero-primitives” condition, in which musicians gave the lowest grade (0) while non-musicians’ scores slightly varied: original direction F(1, 50) = 5.06, p < .05, η_p² = .092; opposite direction F(1, 50) = 3.78, p = .06, η_p² = .070. There were no differences in either “one-primitive, +scalar change”: original direction, F(1, 50) = .66, p = .42, η_p² = .013; opposite direction, F(1, 50) = .75, p = .39, η_p² = .015; or “one-primitive, +discrete distance”: original direction, F(1, 50) = .56, p = .46, η_p² = .011; opposite direction, F(1, 50) = 1.85, p = .18, η_p² = .036. In “two-primitives,” there were no differences: original direction F(1, 50) = 1.36, p = .25, η_p² = .027; opposite direction: F(1, 50) = 0.001, p = .98, η_p² < .001. There was just one case of significant interaction between group and parameter – “one-primitive +discrete distance, original direction,” F(2, 50) = 3.41, p < .05, η_p² = .064 – wherein musicians gave Shrinking lower scores, while non-musicians gave Thinning slightly higher scores.

Discussion

The population new to this experiment (musically-untrained adults) again significantly increased the scores in all six animations as primitives were added, with no preference for a particular “one-primitive” configuration. There were no differences across-animation with original directionality, although with opposite directionality, grades for Opposite Vertical Movement were significantly lower in the “one-” and “two-primitives” conditions than in those for Original Vertical Movement.

Musicians consistently preferred scalar change to discrete distance in “one-primitive” conditions and were less pronounced in their preference for “two-primitive” conditions to “one-primitive” conditions, not exceeding the Bonferroni threshold in one original-direction animation and two opposite-direction animations, all comparing “one-primitive, +scalar change” with “two-primitives.” There were no differences among all conditions across animations with original directionality, except for “one-primitive” Shrinking. As for opposite-direction animations, musicians gave lower scores for “one-primitive, +scalar change” and “two-primitives” in Vertical Movement than in the other animation types. This judgment of lower suitability for reversed direction in Vertical Movement may or may not be a bias of their mother tongue or the common association of upward vertical movement with increased pitch.

In Shrinking and Thinning, directionality seemed irrelevant. Indeed, in prior experiments participants have naturally interpreted higher pitch as “bigger” (Antović, 2009; Krugliak & Noppeney, 2015), and falling pitch as “shrinking” (Eitan, Schupak, Zotler, & Marks, 2014). This occasional “reversed direction” preference for size is confounded by the fact that static low pitches are often considered “big” while dynamic movement from low to high is often perceived as expanding (Eitan, 2013). Experiment 3 addresses this problem in more detail.

Additionally, musicians’ preference for scalar change to discrete distance in “one-primitive” configurations may reflect musical training and should be considered in future studies. The same applies to the peculiar result in which “one-primitive” Shrinking was rated lower by musicians exclusively, perhaps due to this group’s experience with piano or violin, wherein lower pitches do not correspond to smaller size.

Experiment 3

The first two experiments were limited in three ways. First, by using a major scale, we may have evoked unintended musical confounds (e.g., the sense of closure or tonal center) in participants’ choices. Second, the visual dimensions of verticality, size, and thickness used in this study have proven pertinent in other studies, so their effects do not necessarily indicate the crucial influence of a conceptual, primitive structure. Finally, the issue of “congruence” in directionality was somewhat sensitive, considering the common assumption that ascending pitches become smaller (e.g., Fernández-Prieto et al., 2015) and the surmise by Dolscheid et al. (2013) that “greater spatial height corresponds to higher frequency, but greater spatial thickness corresponds to lower frequency” (p. 615). Yet, some authors note that size corresponds to pitch in opposite ways depending on whether the tones are played individually or in sequence (Eitan, 2013). To corroborate this critique, we asked participants in a post-hoc test to listen twice to two extreme tones of a scale and indicate which was bigger and then which was thicker (counterbalanced with smaller and thinner). Thereafter, we played the entire scale and asked whether they perceived it as “growing” and then also as “thickening.” Admittedly, responses for size perception were random, while those for thickness were not (Table 1). To address these issues, the third experiment tested CMCs of an upward, non-diatonic Bohlen-Pierce scale, in four animation types: two (verticality and thickness) latent in previous cross-linguistic work and two (rotation and hue change) both absent in cross-linguistic work (Eitan & Timmers, 2010) and not perceptually salient (Bernstein, Eason, & Schurman, 1971; Iwamiya, 2013). Additionally, we recruited children one year older, to better distinguish between “one-” and “two-primitives” conditions.

Table 1.

Percentages of forced-choice, binary verbal responses to static and dynamic stimuli.

Response	Children (%)	Adults (%)
STATIC
Lower is bigger	69.2	42.3
Lower is smaller	30.8	57.7
Total	100	100
Lower is thicker	88.5	100
Lower is thinner	11.5	0
Total	100	100
DYNAMIC
Upward gets bigger	57.7	73.1
Upward gets smaller	42.3	26.9
Total	100	100
Upward gets thinner	80.8	100
Upward gets thicker	19.2	0
Total	100	100

Method

Participants

We gathered 26 primary-school fourth-graders (mean age 9.8 years, SD = 0.4, 38.5% male) and 26 adults (mean age 20.8, SD = 0.9, 42.3% male), all native speakers of the same “pitch/height” language, Serbian. No adults were excluded. Three children were immediately replaced from within the pool due to not following instructions.

Materials

In each trial, participants heard a seven-tone ascending sine-wave Bohlen-Pierce scale (approximate frequencies in Hz: 261, 283, 370, 440, 476, 568, 610) with 0.5 s per tone at 70 dB and, in response to the limitation of the first two experiments, with equal loudness correction (Parise, 2016). By using seven non-diatonic tones, we removed the effects of musical closure and tonal hierarchy. While equipment and procedure were identical to Experiments 1 and 2, the stimuli comprised 16 visual sequences, corresponding to four scale conceptualizations: Vertical Movement (down → up, pitch/height), Thinning (thick → thin, pitch/width), Rotation (clockwise → counterclockwise, pitch/rotation), and Hue Change (blue → red, pitch/hue).

The manipulation of the two conceptual primitives in each presentation mirrored Experiment 1. However, the square sizes and steps were altered to match the frequency changes in the Bohlen-Pierce scale. Thus, in the “two-primitives” condition in which both discrete movement and unidirectional scalar change were included, the squares moved/transformed from initial to final position in seven unidirectionally discrete steps. The “one-primitive” condition again comprised two variants per conceptualization, the first including discrete movement but excluding scalar change. The stimulus permutations for Vertical Movement and Thinning mimicked those for Experiment 1, but used the discrete steps corresponding to the frequency leaps in the Bohlen-Pierce scale, wherein a 1 Hz change amounted to 0.02° of movement. For Rotation, the square first rotated clockwise four times and then counterclockwise three times, in the ratio of two degrees per Hz, approximating the length of the step in the first two animation types. For Hue Change, the square discretely changed from pure blue to pure red for four steps and then changed “back” three steps. We calculated the “step” for Hue Change by dividing the distance in Hz between the first and last tones in the scale by the number of steps between pure blue and pure red on the RGB scale (237 to 357), with saturation and brightness constant. The result was an approximate change of 1 step on the RGB scale for every 3 Hz of changed frequency. The other “one-primitive” variant, as before, excluded discrete movement but included scalar change, wherein the square for each conceptualization transformed/moved from its initial to final position without discretization.
Finally, the “zero-primitives” static squares/rectangles were presented in the initial size relative to the conceptualization: in Thinning (visual angle 7.61°); in Vertical Movement (0.60°), this time centered both vertically and horizontally to test for any difference from the lowermost position from Experiment 1; in Rotation (colored black, visual angle 1.89°); and in Hue Change (colored blue, 1.89° – the mean of the smallest and largest squares in Experiment 1).
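
The step arithmetic described above can be retraced in a short sketch. This is only an illustration under stated assumptions, not the software used to build the stimuli: the function and constant names are hypothetical, the frequencies are the approximate values listed in Materials, and the endpoints 237 and 357 are simply taken as the first and last hue values.

```python
# Approximate scale frequencies in Hz, as listed in the Materials section.
FREQS_HZ = [261, 283, 370, 440, 476, 568, 610]

DEG_PER_HZ = 0.02               # visual angle per Hz (Vertical Movement, Thinning)
HUE_START, HUE_END = 237, 357   # assumed endpoints: pure blue to pure red


def frequency_leaps(freqs):
    """Differences in Hz between successive tones of the scale."""
    return [b - a for a, b in zip(freqs, freqs[1:])]


def visual_steps(freqs, deg_per_hz=DEG_PER_HZ):
    """Size of each discrete visual step, in degrees of visual angle."""
    return [round(leap * deg_per_hz, 2) for leap in frequency_leaps(freqs)]


def hz_per_hue_step(freqs, hue_start=HUE_START, hue_end=HUE_END):
    """Total frequency range divided by the number of hue steps."""
    return (freqs[-1] - freqs[0]) / (hue_end - hue_start)


if __name__ == "__main__":
    print("Leaps (Hz):", frequency_leaps(FREQS_HZ))
    print("Visual steps (deg):", visual_steps(FREQS_HZ))
    print("Hz per hue step:", round(hz_per_hue_step(FREQS_HZ), 2))
```

With these inputs the frequency range is 349 Hz over 120 hue steps, that is, roughly 3 Hz per step, in line with the approximation given in the text.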

Procedure

The procedure was identical to Experiment 1, except that, to avoid potential spatial interference (Lidji, Kolinsky, Lochy, & Morais, 2007; Rusconi, Kwan, Giordano, Umilta, & Butterworth, 2006), we asked participants to state their score rather than circle it on a horizontal scale. We again expected that the scores would increase with the addition of each primitive but would not differ across-animation for the same number of primitives. If this obtained with rotation and hue change, animations neither salient cross-linguistically nor preferred in prior perceptual studies, it would strengthen our case for an underlying primitive structure.

Results

Adults

We first looked for any differences in the distributions of two individual “one-primitive” responses (Wilcoxon signed-rank test, Bonferroni corrected to p < .01). The distributions differed in two (Vertical Movement and Hue Change) out of four animations: Vertical Movement: Z = -2.91, p < .01, r = .40; Thinning: Z = -2.39, p < .05, r = .33; Rotation: Z = -2.05, p < .05, r = .28; Hue Change: Z = -3.76, p < .01, r = .52. Figure 5 provides within-animation results.
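
For readers who wish to retrace the effect sizes, the following sketch recomputes r from the reported Z values. It assumes the common convention r = |Z| / √N with N = 52 observations (26 participants × 2 paired conditions) and a normal approximation for the two-sided p value; both are assumptions on our part, and the names used are hypothetical.

```python
import math

N_OBSERVATIONS = 52      # assumption: 26 participants x 2 paired conditions
CORRECTED_ALPHA = 0.01   # corrected threshold as stated in the text

# Z values as reported for the adult "one-primitive" comparisons.
REPORTED_Z = {
    "Vertical Movement": -2.91,
    "Thinning": -2.39,
    "Rotation": -2.05,
    "Hue Change": -3.76,
}


def effect_size_r(z, n_observations=N_OBSERVATIONS):
    """Effect size r = |Z| / sqrt(N) for a Wilcoxon signed-rank test."""
    return abs(z) / math.sqrt(n_observations)


def two_sided_p(z):
    """Two-sided p value from Z under the normal approximation."""
    return math.erfc(abs(z) / math.sqrt(2))


def significant_animations(z_by_animation, alpha=CORRECTED_ALPHA):
    """Animations whose approximate p value falls below the corrected alpha."""
    return [name for name, z in z_by_animation.items() if two_sided_p(z) < alpha]


if __name__ == "__main__":
    for name, z in REPORTED_Z.items():
        print(f"{name}: r = {effect_size_r(z):.2f}, p = {two_sided_p(z):.4f}")
    print("Below corrected threshold:", significant_animations(REPORTED_Z))
```

Under these assumptions only Vertical Movement and Hue Change fall below the corrected threshold, which agrees with the pattern reported above.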

Figure 5.

Distributions of grades, adults, Experiment 3 (n = 26). X axis – percentage of participants; Y axis – configuration of primitives; shades of purple (0 to 9) – scores.

Within-animation, the scores awarded to “zero-,” “one-,” and “two-primitives” conditions differed (Friedman’s ANOVA): Vertical Movement: χ²(2) = 47.62, p < .01; Thinning: χ²(2) = 44.25, p < .01; Rotation: χ²(2) = 42.56, p < .01; Hue Change: χ²(2) = 51.36, p < .01. In a post-hoc analysis, calculated with “one-primitive” conditions separated for Vertical Movement and Hue Change and averaged for Thinning and Rotation, all relevant adjacent values were significantly different, p < .01. The only exception was “one-primitive +discrete distance” compared to “two-primitives” in Hue Change, with no differences, Z = 1.67, p = .09, r = .23. In Vertical Movement, “one-primitive +discrete distance” also ranked higher than “one-primitive +scalar change,” but both scored lower than “two-primitives.”

Across animations, there were no differences in any of the four comparisons (Friedman’s ANOVA): “zero-primitives”: χ²(3) = 3.70, p = .30; “one-primitive +discrete distance”: χ²(3) = 7.12, p = .07; “one-primitive +scalar change”: χ²(3) = 4.53, p = .21; “two-primitives”: χ²(3) = 2.37, p = .50.
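
As a plausibility check on the across-animation comparisons, the sketch below converts each reported χ² statistic into a p value. It assumes three degrees of freedom (four animation types minus one) and uses the closed-form upper tail of the chi-square distribution for that case; the function name is hypothetical.

```python
import math


def chi2_upper_tail_df3(x):
    """Upper-tail probability of a chi-square variable with 3 degrees of freedom.

    Closed form: erfc(sqrt(x/2)) + sqrt(2x/pi) * exp(-x/2).
    """
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


# Chi-square statistics as reported for adults across the four animations.
REPORTED_CHI2 = {
    "zero-primitives": 3.70,
    "one-primitive +discrete distance": 7.12,
    "one-primitive +scalar change": 4.53,
    "two-primitives": 2.37,
}

if __name__ == "__main__":
    for condition, statistic in REPORTED_CHI2.items():
        print(f"{condition}: p = {chi2_upper_tail_df3(statistic):.2f}")
```

With three degrees of freedom the computed values round to the reported p values of .30, .07, .21, and .50.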

Children

There were no differences in any of the four pairwise “one-primitive” comparisons (Bonferroni-corrected to p < .01): Vertical Movement, Z = -.29, p = .77, r = .04; Thinning, Z = -1.59, p = .11, r = .22; Rotation, Z = -.30, p = .76, r = .04; Hue Change, Z = -.36, p = .72, r = .05. Therefore, we used mean values for “one-primitive” conditions in all four animations with children. Figure 6 provides distributions for all conditions.

Figure 6.

Distributions of grades, children, Experiment 3 (n = 26). X axis – percentage of participants; Y axis – configuration of primitives; shades of purple (0 to 9) – scores.

Again, within animation types, the distributions for “zero-,” “one-,” and “two-primitives” conditions differed: Vertical Movement, χ²(2) = 29.41, p < .01; Thinning, χ²(2) = 27.53, p < .01; Rotation, χ²(2) = 18.57, p < .01; Hue Change, χ²(2) = 24.82, p < .01.

Post-hoc, all cases showed differences between the “zero-” and “one-primitive” treatments (p < .01), but no difference between “one-” and “two-primitives” conditions: Vertical Movement, Z = -1.26, p = .21, r = .17; Thinning, Z = -1.16, p = .25, r = .16; Rotation, Z = -1.06, p = .29, r = .15; Hue Change, Z = -0.33, p = .74, r = .05.

Across animations, there were no differences for the same number of primitives with children: “zero-”: χ²(3) = 7.92, p = .05; “one-” (averaged): χ²(3) = 1.08, p = .78; “two-primitives”: χ²(3) = 3.81, p = .28. The borderline difference in the “zero-primitives” condition stemmed from slightly lower average grades for the Verticality static square (non-significant in individual post-hoc Wilcoxon comparisons).

Comparison of adults and children

The main effect of group was found in all conditions, with children giving higher scores than adults in the “zero-primitives” treatment, F(1, 50) = 9.08, p < .01, η_p² = .154, and “one-primitive” treatment, F(1, 50) = 6.74, p < .05, η_p² = .119, but with adults giving higher grades to “two-primitives”, F(1, 50) = 4.29, p < .05, η_p² = .079. There were no significant interactions between group and parameter.
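
The partial eta squared values reported here follow directly from the F statistics and their degrees of freedom, via η_p² = (F × df_effect) / (F × df_effect + df_error). A minimal sketch of that conversion, with a hypothetical function name, is given for readers who wish to verify the figures.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared recovered from an F statistic and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# Group effects as reported: (condition, F, df_effect, df_error).
REPORTED_F = [
    ("zero-primitives", 9.08, 1, 50),
    ("one-primitive", 6.74, 1, 50),
    ("two-primitives", 4.29, 1, 50),
]

if __name__ == "__main__":
    for condition, f_value, df_effect, df_error in REPORTED_F:
        eta = partial_eta_squared(f_value, df_effect, df_error)
        print(f"{condition}: partial eta squared = {eta:.3f}")
```

The three computed values round to .154, .119, and .079, matching the reported effect sizes.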

Discussion

Similarly to Experiment 1, adults increased the grades as primitives were added, equally across animations. The only real exception was “one-primitive +discrete distance” in Hue Change, which ranked as highly as “two-primitives” in Hue Change.

Children also responded as in Experiment 1. They distinguished between “zero-” and “one-primitive” conditions, but never between “one-” and “two-primitives” treatments; likewise, across-animation, they provided practically identical scores to “zero-,” “one-,” and “two-primitives” tasks.

While 10-year-olds may still have been too young to focus on the nuance between “one-” and “two-primitives,” it is more important that both populations equally scored each primitive across animations. This strongly suggests that participants were influenced more by an internal schematic structure than by the animation type, solving the task conceptually rather than perceptually. Strikingly, this pattern also obtained with two animations that were neither cross-linguistically prevalent nor salient in prior perceptual studies, further supporting the thesis that an abstract schematic structure motivates the construction of this concept. Moreover, the Verticality conceptualization was not preferred by either population, questioning the thesis that longer use of a language strengthens one particular CMC.

The only exception was the highly ranked “one-primitive +discrete distance” with Hue Change in adults. First, this particular conceptualization might not have been perceptually salient enough for the participants to infer direction. For instance, the color did progress “toward red” and then regress “toward blue,” but changed into several semantically salient colors in between, such as pink and purple, potentially interfering with participants’ sense of directionality. While a follow-up experiment with a (monochromatic) change in brightness could address this hypothesis, brightness has been a pitch descriptor in numerous studies. Second, clear results in the four cases in which squares actually moved or transformed, but not in color change, may imply that overt movement is more salient in conceptualizing pitch sequences than the more generic scalar change. In that sense, there may be three, not two, underlying primitives – discrete distance, scalar change, and overt movement – aligned on the generality/specificity scale.

Conclusions

Results suggest that participants experience a sequence of pitches as a concept, not simply a percept. They seem to infer an underlying, schematic structure, prompting them to increase grades with the addition of conceptual primitives. This holds even when the directionality is reversed, and, crucially, when the animations are inconsistent with cross-linguistic options or previous experimental results. We believe this finding supports the proposal for conceptual primitives by linguist Ray Jackendoff, and also psychologist Peter Walker’s claim that different cross-modal correspondences may involve a “modality independent conceptual representation of stimulus features” (Walker, 2016, p. 105). And one need not conclude this just based on the use of five animation types. Additionally, when we introduced a non-diatonic scale, removed the horizontal (visual) Likert sequence, or recruited new groups of participants, the result remained the same. In fact, nothing changed even in spite of two technical omissions in the first two experiments – the failure to apply equal loudness correction and the use of the mistaken 3:2 ratio for semitones (instead of 2:1). With these two issues resolved, results again obtained in Experiment 3. Participants seemed unaffected by structural subtleties in the stimuli across the three experiments because they were basing their responses on a deep abstract structure rather than on any “lower-level” clues – perceptual, statistical, or psychophysical.

It turns out that not two, but at least three schematic factors inform the conceptualizing of scales. Discrete distance corresponds to temporal alignment, an important factor in any multimodal binding (Iwamiya, 2013) and perhaps the most fundamental schema for the given correspondence. Scalar change, then, is a mid-specific factor, and overt movement yet more specific. The model is, of course, still in an early stage of development, and requires further work.

Finally, the influence of the mother tongue on scale conceptualizations remains unclear. Unfortunately, none of our participants were speakers of a “pitch/thickness” language, and we are motivated to recruit some for further work. Yet these results already call into question the thesis that a lexicalization from the native language strengthens one mapping at the expense of others. If such were the case, then 12 more years of exposure to the language (adults vs. children) and knowledge of conventional visual and technical cues (musicians vs. non-musicians) would result in at least a slight preference for vertical movement in the older and musically-trained populations. Apart from slightly lower grades for incongruent vertical movement in Experiment 2, this was practically never the case with our 184 participants in three separate experiments. Further cross-linguistic and cross-cultural work is needed to address this most interesting problem.

In short, we have put forward a case for the idea that musical concepts are based on abstract, schematic mental representations. Naturally, the model needs development, yet if some of the findings prove relevant to psychologists, musicologists, and linguistic semanticists, this could spark further, exciting debate in theories of musical conceptualization.

Footnotes

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the Serbian Ministry of Science (project no. 179013).

References

Antović (2009). Musical metaphors in Serbian and Romani children: An empirical study. Metaphor and Symbol, 24(3), 184–202.

Antović, Bennett, & Turner (2013). Running in circles or moving along lines: Conceptualization of musical elements in sighted and blind children. Musicae Scientiae, 17(2), 229–245.

Bernstein, I. H., Eason, T. R., & Schurman, D. L. (1971). Hue–tone interaction: A negative result. Perceptual and Motor Skills, 33, 1327–1330.

Blacking (1995). The music of Venda girls’ initiation. In J. Blacking, Music, culture and experience: Selected papers of John Blacking (pp. 73–126). Chicago, IL: The University of Chicago Press. (Original work published 1970)

Casasanto (2010). Space for thinking. In Evans & Chilton (Eds.), Language, cognition and space: The state of the art and new directions (pp. 453–478). London, UK: Equinox.

Dolscheid, Hunnius, Casasanto, & Majid (2014). Prelinguistic infants are sensitive to space-pitch associations found across cultures. Psychological Science, 25(6), 1256–1261.

Dolscheid, Shayan, Majid, & Casasanto (2013). The thickness of musical pitch: Psychophysical evidence for linguistic relativity. Psychological Science, 24(5), 613–621.

Eitan (2013). How pitch and loudness shape musical space and motion: New findings and persisting questions. In Tan, S. L., Cohen, Lipscomb, & Kendall (Eds.), The psychology of music in multimedia (pp. 165–191). Oxford, UK: Oxford University Press.

Eitan (2017). Musical connections: Cross-modal correspondences. In Timmers & Ashley (Eds.), Routledge companion to music cognition (pp. 213–224). Abingdon, UK: Routledge.

Eitan, Schupak, Gotler, & Marks, L. E. (2014). Lower pitch is larger, yet falling pitches shrink. Experimental Psychology, 61(4), 273–284.

Eitan & Timmers (2010). Beethoven’s last piano sonata and those who follow crocodiles: Cross-domain mappings of auditory pitch in a musical context. Cognition, 114(3), 405–422.

Fernández-Prieto, Navarra, & Pons (2015). How big is this sound? Crossmodal association between pitch and size in infants. Infant Behavior and Development, 38, 77–81.

Iwamiya, S. I. (2013). Perceived congruence between auditory and visual elements in multimedia. In Tan, S. L., Cohen, Lipscomb, & Kendall (Eds.), The psychology of music in multimedia (pp. 141–164). Oxford, UK: Oxford University Press.

Jackendoff (1990). Semantic structures. Cambridge, MA: MIT Press.

Jackendoff (2002). Foundations of language. New York, NY: Oxford University Press.

Johnson (1987). The body in the mind: The bodily basis of meaning, imagination, and reason. Chicago, IL: The University of Chicago Press.

Krugliak & Noppeney (2015). Synaesthetic interactions across vision and audition. Neuropsychologia, 88, 65–73.

Lidji, Kolinsky, Lochy, & Morais (2007). Spatial associations for musical stimuli: A piano in the head? Journal of Experimental Psychology: Human Perception and Performance, 33(5), 1189.

Ludwig, V. U., Adachi, & Matsuzawa (2011). Visuoauditory mappings between high luminance and high pitch are shared by chimpanzees (Pan troglodytes) and humans. Proceedings of the National Academy of Sciences, 108(51), 20661–20665.

Mandler (1992). How to build a baby: II. Conceptual primitives. Psychological Review, 99(4), 587–604.

Marks, L. E. (2004). Cross-modal interactions in speeded classification. In Calvert, Spence, & Stein, B. E. (Eds.), Handbook of multisensory processes (pp. 85–106). Cambridge, MA: MIT Press.

Martino & Marks, L. E. (1999). Perceptual and linguistic interactions in speeded classification: Tests of the semantic coding hypothesis. Perception, 28, 903–923.

Mondloch, C. J., & Maurer (2004). Do small white balls squeak? Pitch-object correspondences in young children. Cognitive, Affective, & Behavioral Neuroscience, 4(2), 133–136.

Parise, C. V. (2016). Crossmodal correspondences: Standing issues and experimental guidelines. Multisensory Research, 29, 7–28.

Parise, C. V., Knorre, & Ernst, M. O. (2014). Natural auditory scene statistics shapes human spatial hearing. Proceedings of the National Academy of Sciences, 111(16), 6104–6108.

Parkinson, Kohler, P. J., Sievers, & Wheatley (2012). Associations between auditory pitch and visual elevation do not depend on language: Evidence from a remote population. Perception, 41(7), 854–861.

Perlman (2004). Unplayed melodies: Javanese gamelan and the genesis of music theory. Berkeley: University of California Press.

Piaget (1952). The origins of intelligence in children. New York, NY: International Universities Press.

Rumelhart (1980). Schemata: The building blocks of cognition. In Spiro, R. J., Bruce, B. C., & Brewer, W. F. (Eds.), Theoretical issues in reading comprehension (pp. 33–58). Hillsdale, NJ: Erlbaum.

Rusconi, Kwan, Giordano, B. L., Umilta, & Butterworth (2006). Spatial representation of pitch height: The SMARC effect. Cognition, 99(2), 113–129.

Shayan, Ozturk, Bowerman, & Majid (2014). Spatial metaphor in language can promote the development of cross-modal mappings in children. Developmental Science, 17(4), 636–643.

Shayan, Ozturk, & Sicoli, M. A. (2011). The thickness of pitch: Crossmodal metaphors in Farsi, Turkish, and Zapotec. The Senses and Society, 6(1), 96–105.

Spence (2011). Crossmodal correspondences: A tutorial review. Attention, Perception & Psychophysics, 73, 971–995.

Stone, R. M. (1981). Toward a Kpelle conceptualization of music performance. Journal of African Folklore, 94, 188–206.

Walker (2016). Cross-sensory correspondences: A theoretical framework and their relevance to music. Psychomusicology, 26, 103–116.

Walker, Bremner, J. G., Mason, Spring, Mattock, Slater, & Johnson, S. P. (2010). Preverbal infants’ sensitivity to synaesthetic cross-modality correspondences. Psychological Science, 21(1), 21–25.