Abstract
The ability to recognize a pitch centre in sound sequences belongs to the basic mental tools which are intuitively used by humans when they listen to music. It is also one of the abilities used by listeners in order to establish a tonal hierarchy. The organization of pitches around a pitch centre is one of the most ubiquitous syntactic rules observed in music of all cultures. As far as we know, there is nothing similar to this rule in other human sound expressions nor in animals’ vocal communication. Thus, the recognition of pitch centricity seems to be the unique and species-specific ability of Homo sapiens, which suggests its evolutionary origin. It is proposed in the article that in the course of hominine evolution, the ability of pitch centre recognition became an adaptive innovation which enabled a more effective social consolidation. It is also suggested that the origin of this ability has its roots in the ‘Baldwin effect’ which led to the emergence of a predisposition to join three originally separate abilities – the implicit recognition of the frequency of pitch occurrence, working memory and the emotional assessment of predicted stimuli – into a new mental tool.
Pitch centricity (PCT) is one of the most ubiquitous features of music and, from a structural point of view, refers to the preferential use of a certain pitch class by repeating it in important places, for example on the downbeat of a metrical cycle or at the end of a phrase. PCT is made possible by mentally establishing which of the pitch classes is pivotal. Although the attribution of a privileged status to one pitch class – pitch centre (PC) or tonic – is mainly associated with Western tonal music, this phenomenon is also present in non-Western music (Thomson, 1999). In order to avoid Eurocentric phraseology, in the case of non-Western music, a variety of terms are used as an equivalent to tonic such as ‘ground-pitch’, or ‘pitch focus’ (Thomson, 1999, p. 215). In all of these examples, however, one pitch class always gains the predominant role.
Although pitch centre recognition (PCR) is a complex cognitive phenomenon, the psychological nature of which has been the subject of competing explanations (Bharucha, 1996; Butler, 1989; Huovinen, 2002; Huron, 2006; Krumhansl, 1990; Krumhansl & Cuddy, 2010; Lerdahl, 2001; Temperley, 1999), it is assumed that the crucial psychological mechanism underlying it is the implicit statistical analysis of pitch-frequency distribution (Huron, 2006; Krumhansl, 1990). In addition, the whole process of establishing PC is intuitive and the acquisition of tonal hierarchy is implicit (Tillmann, Bharucha, & Bigand, 2000). Pointing to similarities between visual and auditory perception (Huovinen, 2002), as well as to mechanisms of prediction (Huron, 2006), scientists usually explain the natural character of this ability as one of the general principles of cognition (Krumhansl & Cuddy, 2010). This article will present another point of view in which PCR is claimed to be an adaptive innovation for sound communication which emerged in the biological evolution of hominines by means of the Baldwin effect. The evolutionary advantage of PCR is related to the survival value of musical rituals in which pitch order plays an important role in the facilitation of social cohesion.
Evolution of mental abilities and the Baldwin effect
In 1871, Darwin proposed natural selection as the reason for the origin of human intellect and emotions. Since that time, many scholars have appreciated the importance of natural selection in shaping the human mind (Cosmides & Tooby, 1992; Gazzaniga, 2008; Wilson, 1978), although there is still no agreement as to which mental abilities are actually the result of adaptations and what role the cultural environment has played in their evolution. Some scientists claim that if a particular faculty is observed among the whole population of a given species, then it is most likely to be an adaptation (Dawkins, 1982; Wilson, 1978). Others indicate that certain ubiquitous traits can be evolutionary by-products, as famously described by the spandrel analogy (Gould & Lewontin, 1979; Millikan, 1995). There is also the possibility that a trait which has been selected for a particular role can later gain a new adaptive role. Such a phenomenon is known as exaptation (Gould & Vrba, 1982). Despite the functional differences between all of these phenomena, their emergence is either directly or indirectly influenced by natural selection. In accordance with classical Darwinism, the source of variability which enables natural selection is entirely accidental (Dawkins, 1982).
However, since selection also acts on organisms whose behaviour can be learned, there is the possibility that ontogenetic adaptation can influence the course of evolution. According to Dennett (1991, 1995) this influence can direct evolution. In such a process, described independently by Baldwin (1896), Morgan (1896) and Osborn (1896), and termed the ‘Baldwin effect’ (Simpson, 1953), learning plays the initial role in the appearance of adaptive abilities. If animals are able to learn socially, then some adaptive changes, as responses to new challenges, are first achieved by social invention and then proliferated by the means of social learning. After that, if the process of learning is strenuous and costly and if the selective pressure lasts long enough, then instinctive learning which emerges accidentally is preferred by natural selection. Thus, Baldwinian evolution is a process in which natural selection transforms the learned response of an organism into an instinctive response (Dor & Jablonka, 2001). As a computational model of the Baldwin effect shows (Hinton & Nowlan, 1987), it can also change the rate of evolution (Dennett, 1995). According to Maynard Smith (1987) the Baldwin effect speeds up evolution ‘by altering the search space in which evolution operates (p. 762).’ Although the Baldwin effect does not explain the origins of all adaptive phenomena, it seems a probable mechanism which has led to the emergence of complex adaptive behaviours, such as songbird songs (Morgan, 1896) or grammar in language (Dor & Jablonka, 2000), that still partly depend on learning.
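The logic of the computational model mentioned above can be illustrated with a short simulation. What follows is only a minimal sketch in the spirit of Hinton and Nowlan (1987), not a reproduction of their model: the genome length, population size, number of learning trials and the exact fitness formula are scaled-down assumptions chosen for illustration, and the function names are hypothetical. Each locus carries a fixed correct allele (`1`), a fixed wrong allele (`0`) or a plastic allele (`?`) that is set by random guessing during the individual's lifetime. Individuals that find the target configuration in fewer guesses receive higher fitness, so selection can gradually favour innately correct alleles over learned ones.

```python
import random


def learning_fitness(genome, trials, rng):
    """Return the fitness of one individual (illustrative formula).

    A genome containing a fixed wrong allele ("0") can never reach the
    target, so it gets the baseline fitness of 1.0. Otherwise the
    individual guesses all plastic loci ("?") at random on each trial;
    the fewer trials it wastes before success, the higher its fitness.
    """
    if "0" in genome:
        return 1.0
    plastic = genome.count("?")
    for wasted in range(trials):
        # Every plastic locus must be guessed correctly in the same trial.
        if all(rng.random() < 0.5 for _ in range(plastic)):
            return 1.0 + (len(genome) - 1) * (trials - wasted) / trials
    return 1.0


def evolve(pop_size=200, length=8, trials=100, generations=30, seed=0):
    """Run the sketch and return the share of innately correct alleles
    ("1") in the population at the start of each generation."""
    rng = random.Random(seed)
    # Assumed initial allele proportions: 25% "1", 25% "0", 50% "?".
    pop = ["".join(rng.choices("10?", weights=[1, 1, 2], k=length))
           for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        history.append(sum(g.count("1") for g in pop) / (pop_size * length))
        fits = [learning_fitness(g, trials, rng) for g in pop]
        # Fitness-proportional choice of parents, one-point crossover.
        parents = rng.choices(pop, weights=fits, k=2 * pop_size)
        pop = []
        for i in range(pop_size):
            a, b = parents[2 * i], parents[2 * i + 1]
            cut = rng.randrange(1, length)
            pop.append(a[:cut] + b[cut:])
    return history


if __name__ == "__main__":
    for generation, share in enumerate(evolve()):
        print(generation, round(share, 3))
```

In this sketch learning does not alter the genome. It only changes which genomes are rewarded, which is the sense in which learning is said to alter the search space in which evolution operates.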
Theories of human vocal evolution and PCR
Current theories have neglected the role of PCR in the evolution of human vocal communication. There are however some exceptions. According to Merker (2006), it is very probable that PCR predates music and is present among nonhuman primates. Based on the observation that rhesus monkeys (Macaca mulatta) are able to recognize one- and two-octave transpositions of simple tonal melodies, but not of atonal melodies, as similar (Wright, Rivera, Hulse, Shyan, & Neiworth, 2000), Merker (2006) has hypothesized that macaques can recognize a melody as a perceptual whole using one pitch as a reference point. Yet, he has not indicated what particular role PCR played in the evolution of human singing (Merker, 2005, 2012). It is worth mentioning that rhesus monkeys do not sing and do not use any sequences of stable pitch categories in their vocal communication (cf. Rowell & Hinde, 1962). Thus, their categorization of a musical tune is most probably based on a relational memory that is not facilitated by an emotional/motivational component of the kind which is crucial when macaques (Gouzoules & Gouzoules, 2000) and other social mammals (Juslin & Scherer, 2005), as well as birds (Rothenberg, Roeske, Voss, Naguib, & Tchernichovski, 2014), recognize their species-specific calls. Because human PCR is accompanied by the emotions of stability and attraction – examples of emotional/motivational states – it is reasonable to suppose that human PCR differs from that of macaques with respect to its emotional component.
In contrast to Merker’s view, Bannan (2012) suggests that the appearance of tonality became possible thanks to the ability to vocally control the fundamental frequency of sounds. In addition, he claims that the establishment of tonality was a critical adaptation which led to the emergence of a proto-song considered to be a stage of vocal communication from which music and language eventually branched out. Although such a stage has been proposed by other scholars (Brown, 2000; Fitch, 2006, 2013; Mithen, 2006; Morley, 2002, 2013) they have not enumerated PCT as a feature of proto-music. However, since PCT is absent from speech but present in music, Bannan’s suggestion would be more probable if the proto-song was not a stage from which speech evolved. The absence of PCT in speech indicates the separate origin of PCR rather than being a part of speech evolution.
Pitch order as an innovative communicative tool
Pitch order (PO) is the arrangement of pitches in time which are recognizable by individuals belonging to a particular culture. PO is one of the most important features of musical syntax (Lerdahl, 2013; Lerdahl & Jackendoff, 1983). It is also specific solely to music. Emotions are often emphasized as the most common content of musical communication (Juslin, 2005). Importantly, music is able to communicate emotions not only indirectly by eliciting in listeners a conscious association between a musical structure and a particular emotion, but also by direct induction of the emotional reactions (Krumhansl, 1997). In fact, listening to music may induce emotions by the means of many different mental mechanisms (Juslin & Västfjäll, 2008). From an evolutionary point of view, some of them are old evolutionary adaptations. A telling example of such an adaptation is ‘affective prosody’ (Merker, 2003; Zimmermann, Leliveld, & Schehka, 2013). However, music is also composed of additional features which convey affective meaning. I propose that PCT is one of these features.
As far as it is known, different kinds of music based on PCT are observed in all human cultures (Bannan, 2012). PC is even perceived within certain atonal compositions (Krumhansl, Sandell, & Sergeant, 1987; Ockelford & Sergeant, 2013). However, the popularity of avant-garde music, which is deprived of PC, is restricted to select groups of people (Dutton, 2009). Thus, although PCT is not observed in every musical utterance, ‘rooting of songs in the tonic (ground-pitch) of whatever scale type’ is often indicated as an example of musical universals (Brown & Jordania, 2013, p. 10) thanks to its statistical predominance in music around the world. Considering this, even if one agrees that ‘music’ is an ethnocentric category (List, 1971; Nettl, 2000), it is still reasonable to suppose that syntactic organization of pitches based on PCT is a universal feature of human sound expressions.
Syntax is a characteristic feature in both language and music. Although there are examples of a restricted combinatorial organization in non-human primate vocalizations (Clay & Zuberbühler, 2009; Ouattara, Lemasson, Zuberbühler, & Poon, 2009), they lack complex generativity, hierarchy and abstract structural relations – features which are assumed to be the key properties of syntactic structure within music and language (Patel, 2013). Apart from this, humans are the only living primates capable of complex vocal learning – an ability that is meant to be crucial for speech and song acquisition and production (Fitch & Jarvis, 2013). Admittedly, there are examples of songbird songs that display a combinatorial organization which is meaningful and recognizable by conspecifics (Gentner, Fenn, Margoliash, & Nusbaum, 2006; Rothenberg et al., 2014; Soard & Ritchison, 2009; Woolley & Doupe, 2008). In addition, some birds seem to use discrete pitch sets as elements in their songs (Cross et al., 2013). Songs in which single units are arranged into whole phrases are often called syntactic (Berwick, Okanoya, Beckers, & Bolhuis, 2011; Cross et al., 2013). However, both the language and music syntaxes are more complex than any currently known rules of animal song (Fitch & Zuberbühler, 2013). These observations support the claims that language and music syntaxes are human-specific traits of vocal communication.
The ubiquity of the intuitively recognizable generative character of every language and almost all music suggests that there is at least some inborn developmental predisposition towards the acquisition of syntactic rules in the music and language environment (Fitch & Jarvis, 2013). However, in contrast to language syntax which is strictly linked with semantics (Bickerton, 2009; Dor, 2000), in music there is nothing even close to resembling this kind of connection. Rather than being connected with conceptual meaning, musical syntax influences the induction of emotion in its listeners. The violation of the syntactic rules of PO can cause a measurable emotional reaction (Koelsch, 2005). But the relation between PO and emotion is something more than merely causing surprise or not. Listening to particular pitches in a particular context elicits a variety of subtle emotional states known as qualia (Huron, 2006; Margulis, 2014; Scherer, 2013). Each pitch of a musical system has its own emotional tinge that depends on a specific tonal context. These emotional states are the basis of a tonal hierarchy which is stored in long-term memory (Krumhansl, 1990). The ability of PCR seems to be a fundamental cognitive characteristic which plays a crucial role in creating such a hierarchy. The difference between central and non-central pitch is the first step in establishing more complex functional relations between pitch categories. These types of relations (e.g. tonic–dominant–tonic) are restricted to PO. There are in fact examples of human cognitive phenomena in which, similar to tonality, recursion creates some forms of hierarchies (Kinsella, 2010). Reiteration is also observed in many domains of music (Margulis, 2014) especially in rhythm (London, 2004). However, the rhythm hierarchy lacks ‘dominating–subordinating constituencies’ (Thompson-Schill et al., 2013, p. 294) and all language hierarchies ultimately serve to communicate propositional meaning. Thus, all these examples of hierarchies differ from the tonal hierarchy.
PO, PCT, emotions and cognition
A stable context-dependent emotional hierarchy is something which seems to be specific only to music (Podlipniak, 2013). Tonal hierarchy is built by the means of statistical learning of pitch-frequency distribution and is the basis of the listeners’ pitch expectations (Krumhansl, 1990). Because particular pitches usually occur in different contexts with various frequencies, they can influence a person’s predictions and their emotional reaction (Huron, 2006). Interestingly, the veridical memory (Bharucha, 1994) of a particular musical piece does not influence the emotional reaction to scale degrees. The feeling induced by a leading tone retains its character even though a listener knows the piece by heart and can precisely predict the occurrence of this leading tone.
Equally interesting is that people are able to make almost accurate cross-cultural tonal predictions (Castellano, Bharucha, & Krumhansl, 1984; Eerola, 2004; Eerola, Louhivuori, & Lebaka, 2009; Kessler, Hansen, & Shepard, 1984; Lantz, Kim, & Cuddy, 2013). This suggests that people who listen to unfamiliar music can attribute different degrees of stability to pitches based on an ad hoc implicit analysis of the pitch-frequency distribution (Krumhansl, 1990), although they can also refer to their culture-specific tonal hierarchy (Curtis & Bharucha, 2009). In both cases however, implicit statistical learning is the main mechanism underlying the establishment of pitch stability. Thus, this mental strategy seems universal (Cross, 2005). Additionally, PCR is realized by the means of an emotional reaction facilitating a cognitive hierarchy of pitches.
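The idea of an ad hoc analysis of pitch-frequency distribution can be made concrete with a small example. The sketch below is a deliberately crude illustration and not a model taken from the cited literature: it assumes that notes are given as MIDI note numbers with durations, that octave-equivalent pitches belong to one pitch class (pitch modulo 12), and that the pitch class accumulating the greatest total duration is the candidate PC. The function names and the duration weighting are hypothetical choices made for this illustration; the sketch ignores metrical position, phrase endings and any learned tonal hierarchy.

```python
from collections import defaultdict


def pitch_class_weights(notes):
    """Map each pitch class (0-11) to its share of the total sounding time.

    `notes` is an iterable of (midi_pitch, duration) pairs; pitches an
    octave apart are pooled into the same pitch class.
    """
    weights = defaultdict(float)
    for midi_pitch, duration in notes:
        weights[midi_pitch % 12] += duration
    total = sum(weights.values())
    return {pc: w / total for pc, w in weights.items()}


def estimate_pitch_centre(notes):
    """Return the pitch class with the largest duration-weighted share."""
    distribution = pitch_class_weights(notes)
    if not distribution:
        raise ValueError("at least one note is required")
    return max(distribution, key=distribution.get)


if __name__ == "__main__":
    # A short invented melody that keeps returning to C (MIDI 60).
    melody = [(60, 1), (62, 1), (64, 1), (60, 2),
              (67, 1), (64, 1), (62, 1), (60, 2)]
    print(pitch_class_weights(melody))
    print(estimate_pitch_centre(melody))
```

Because the estimate depends only on the distribution within the excerpt, it can be computed for music in an unfamiliar tonal system, which is the point made about cross-cultural listening above. It says nothing about the emotional component of PCR.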
But what is the reason for such a specific emotional reaction to PC as being stable (cohering, agreeing, etc.)? The basic biological function of emotion is to assess and control an organism’s interactions with the environment, which enhances survival value (Panksepp, 1998). More specifically, emotions serve both as a motivational tool (Mortillaro, Mehu, & Scherer, 2013) and as a means of evaluating the relevance of environmental stimuli to the organism’s needs (Scherer, 2013). Because sounds usually deliver important information about the environment, it is clear that hearing something can cause an emotional reaction. Thus, the basic cues which affect sound sensation are acoustical features.
There are theories that point to psychoacoustic properties as an explanation for tonal relations. These theories assume that dissonance is the main source of tonal tension whereas consonance causes tonal resolution (Bharucha, 1984; Lerdahl, 2001; Large, 2005; Sethares, 2005). Such reasoning has a long history in Western music theory (e.g. Riemann, 1896) as well as in science (von Helmholtz, 1863). According to von Helmholtz, sensory dissonance is a source of tension as a result of the interference of harmonic partials of simultaneously sounding complex tones. This seems promising because sensory dissonance is usually resolved to consonance in Western tonal music. However, PCT is not a perceptual but a cognitive phenomenon (Huron, 2006). This means that the assessment of centricity depends on pitch context, and its emotional meaning cannot be inferred exclusively from acoustical traits of sound. Apart from this, sensory dissonance is also used together with PC, which would be difficult to explain if sensory dissonance were the main source of tonal tension. Such uses are observed both in Western music (e.g. some jazz styles: Butler & Brown, 1994) and in non-Western music (Ambrazevičius & Wiśniewska, 2009).
More recently, another psychoacoustic explanation of tonal stability has been proposed (Large, 2011; Large & Tretakis, 2005). According to Large and Almonte (2012), the recognition of tonal relations is due to nonlinear resonance in the auditory neuronal networks which can lead to ‘stability and attraction relationships among neural frequencies. (p. E1)’ In contrast to other psychoacoustic theories Large proposes that tonal relations are due to the generic properties of nonlinear frequency transformation taking place in neural networks. This assumption is based on the hypothesis that nonlinear transformation of sound frequencies occurs in the mammalian central auditory nervous system (Large & Tretakis, 2005). From this point of view, tonal relations specific to Western music are refined by learning rather than being established in the process of enculturation (Large & Almonte, 2012). However, this theory does not explain the fact that PC is easily recognizable in music based on equidistant scales, which is observed in some pre-instrumental cultures (cf. e.g. Ellis, 1965) as well as in some forms of non-European instrumental music composed of intervals whose size differs significantly from European intervals (cf. e.g. Keefe, Burns, & Nguyen, 1991).
Thus, although the aforementioned psychoacoustic theory may account for a culturally specific use of pitch characteristics which refines tonal relations (e.g. in Western major-minor tonality), the stability relationship among neural frequencies is not a necessary condition to establish PC. This rather suggests that pitch hierarchy in Western harmony only facilitates PCR because harmonic context is probably an additional factor important for statistical learning of pitch hierarchy. As a result, PC is more closely affiliated with certain chords than others (cf. Arthur, 2014). Yet, this does not mean that the ability to discriminate partials of harmonic sounds is not important for PCR in monophony. The discrimination of partials was definitely an important point of departure for the emergence of monotony and monophony in our ancestors’ vocal communications (Bannan, 2012). It is also a necessary ability to divide pitch space into more or less stable pitch categories as well as to recognize the octave equivalence. However, the organization of pitches around a PC necessitates something more than the recognition of a particular pitch class.
Acoustical cues are not the only source of expressiveness (Scherer & Zentner, 2001). Sounds also became a useful tool for intentional communication of animals. An example of such communication is the vocal expression of emotions which have evolved as a nonverbal signalling system in order to facilitate social interaction (Juslin & Scherer, 2005). This type of system serves to communicate an organism’s reactions to, states, and intentions regarding its social surroundings (Scherer, 2013). Yet, apart from this system based on fixed expressive meaning, human vocal expression is additionally composed of messages that are coded in a socially learned fashion (Juslin & Scherer, 2005). In such cases, stable emotional reactions to particular stimuli can be the effect of social learning. The frequent co-occurrence of certain sounds with particular events can lead to a stable emotional association. In fact, the emotional judgment of tonality is usually assumed as being culture-specific (Juslin, 2001), which suggests that social learning is the reason for emotional meaning of tonal relations.
Nonetheless, although there are cross-cultural differences between musical tonal systems (Ellis, 1965; Keefe et al., 1991; Nettl, 2005) and diverse nuances in the emotional reaction to scale degrees (Curtis & Bharucha, 2009), as far as it is known, PCR is accompanied by the feelings of stability, coherence, contentment and agreement cross-culturally. This stable and ubiquitous attribution is difficult to explain in terms of social learning (Podlipniak, 2013, 2014). There is nothing functionally unchangeable in the extra-musical context of a particular pitch distribution. If social learning were exclusively responsible for the stable connection between PC and a particular emotion, then one could expect that various social contexts of music in different cultures could lead to the establishment of many diverse culture-specific connections between PC and emotions. But this is not the case.
Another explanation suggests that a specific emotional reaction to PC is the result of misattribution (Huron, 2006). According to Huron, the predictions of stimuli generate emotional reactions as a result of the adaptive value of the general ability of prediction which is the ultimate function of the nervous system (Llinás, 2002). The stabilizing emotional reaction to PC is the consequence of a correct prediction that is based on statistical learning of a scale degree occurrence in particular pitch contexts in previously experienced music. Positive emotion is incorrectly attributed towards PC because the actual adaptive reason for the positive emotion is successful prediction, not the occurrence of a particular pitch. However, because this explanation reduces the emotional reaction to PC to the principles of general cognition, the same way of reasoning should be applied to predictions of all kinds of stimuli, for example a frequently seen picture, or a spoken word. But the feeling of stability which accompanies PC seems to be incomparable to the feelings which accompany other well-predicted stimuli. Indeed, according to Panksepp and Biven (2012), a stronger emotional reaction to well-predicted auditory stimuli in comparison to well-predicted visual stimuli could be explained by the fact that the inferior colliculi (the structures involved in the transmission of auditory and mechanosensory information: Striedter, 2005) are more caudal, that is, probably evolutionarily older than the superior colliculi (the structures involved in the transmission of visual information: Striedter, 2005). Nevertheless, it does not explain why emotional reactions to a well-predicted phoneme or word seem to be at best less impressive. The emotional specificity of PO perception is perhaps the main reason why tonal tension is understood as a uniquely musical phenomenon (Lerdahl & Krumhansl, 2007). However, accurately predicted timbre or changes in volume do not elicit an emotional reaction similar to that elicited by a well-predicted PC.
All of these issues can provoke crucial questions. Firstly, why do emotional reactions to pitch distribution differ from the reactions to timbre distribution in music and speech perception? Why do people spontaneously organize music using ‘pitch centricity’ while they do not do likewise, for example, using ‘timbre centricity’? And finally, why do people intuitively experience pitch centricity whereas they do not have the sense of what could be called ‘timbre’ or ‘phonemic centricity’? Of course, there are some general neurophysiological traits which result in pitch perception being multidimensional (Patel, 2008). This specificity of pitch perception, and the physical characteristics of periodic sounds, favours the use of pitch contrast as a basis for sound organization in music (Patel, 2008). However, these characteristics do not explain why in all cultures humans organize pitches around PC. To explain the cross-cultural ubiquity of PCT as a mere by-product of general cognition is ungrounded. If an observed trait is unique to a given species and ubiquitous among its representatives in the way PCT is unique to Homo sapiens, it is reasonable to suppose that such a trait is a result of a domain-specific inborn predisposition (Gazzaniga, 2008). This predisposition should be understood as a specific motivational mechanism which leads to the development of a domain-specific ability. This means that humans are endowed with the proclivity for a spontaneous assessment of pitch-frequency distribution and can use this for creating the mental gradation of the perceived pitch classes out of which PC is prominent.
The biological function of pitch structure
Every evolutionary explanation demands that an adaptive function of an evolved trait should be pointed out. Thus, it is necessary to indicate which adaptive function is performed by PCR. Although it has been indicated that musical syntax results from innate capacities (Lerdahl & Jackendoff, 1983), PCR has not been explained in terms of its adaptive functions. In language, particular semantic categories determine some universal aspects of grammar (Dor, 2000). Thus, the adaptive function of linguistic grammar can be understood in terms of the communication of referential meaning. In music, the feeling of stability which is specific to PCR cannot be replaced by the feeling of tension. Additionally, listeners are unaware that some feelings are determined by PO. In this respect, the relationship between semantic categorization and the grammatical trait is similar to the relationship between PO and emotional reaction.
The connection between PCR and the emotion of stability suggests that the establishment of PC is related to the adaptive results of music behaviour. This is possible because music performers share similar mental representations (structural, emotional, and kinaesthetic), that is, alignment of their brain states, which enables them to become synchronized and in turn promotes group identity (Bharucha, Curtis, & Paroo, 2012). These performers, thanks to the shared feeling of stability (emotional alignment) triggered by a collective performance of PC (structural alignment), subconsciously experience the feeling that the co-performers are all members of one community. Emotional alignment informs them that they have similar intentions, needs and goals (Tomasello, 2008; Tomasello, Carpenter, Call, Behne, & Moll, 2005). Co-performers have to collectively accept a particular hierarchy of pitches even if they have previously competed for certain resources. Therefore, a musical performance that is organized around PC can serve as a tool for reducing tension between group members and as a result can enhance mutual trust. Although processes of rhythmic entrainment are often believed to be the actual reason for the socializing capability of music (Cross, 2014; McNeill, 1997), pitch synchronization also seems to be an important factor in the facilitation of social bonding (Bharucha et al., 2012). This is indicated by the fact that phase synchronization in respiration and heart rate variability between choir singers is higher during unison singing than when singing different voice parts (Müller, Lindenberger, & Kurths, 2011). It is not simply ‘when’, but also ‘what’ is sung that is important for a well-synchronized performance. As a consequence, information related to pitches performed by others is desirable for the performers’ overall cooperation. Thus, it is reasonable to suppose that the establishment of PC enables the performers to expect the subsequent pitches and makes cooperation during music-making easier.
Additionally, PCR facilitates orientation in pitch space (Snyder, 2000) and allows the manipulation of PO during music making. This ability enables people to intentionally play with the emotions of anxiety and stability. The use of pitch-frequency distribution in order to create the feelings of tension and stability necessitates, apart from the activity of the limbic areas, the cortical processing of pitch as well as the employment of working memory (Koelsch et al., 2009). One of the significant human evolutionary achievements is the developed ability to control and modulate emotional processing (Hariri, Bookheimer, & Mazziotta, 2000). This control is achieved by the cortical inhibition of subcortical activity (Hariri et al., 2000; Ochsner & Gross, 2005). The cortical processing of pitch and the retention of PC in working memory may have served as a tool for rehearsing this cortical inhibition. Such an extended control of emotions is very useful in sustaining social relations (cf. Rhoades, Greenberg, & Domitrovich, 2009).
Although it has been proposed that both the emergence of new types of feelings of certainty and an inhibitory control of them coevolved together with linguistic communication (Jablonka, Ginsburg, & Dor, 2012), PO is in this respect at least as effective as language. In some sense, eliciting feelings of certainty by means of PO is more efficient than by the use of words. People can unconsciously assess the level of social cohesion through the establishment of PC instead of communicating this verbally. In this respect, PCR is less time-consuming than understanding spoken declarations and assessing their reliability. Because an understanding of emotions and intentions as well as the goals of others is essential to survive in the social environment (Moll & Tomasello, 2007), the social function of PO based on PCT in transmitting all this information is adaptive. This is consistent with Scherer’s (2013) claim that emotional expression and impression gained a new function in the domain of social bonding. What became unique in our ancestors in respect to the use of PO was a voluntary, vocal control of subsequently sung pitches (Bannan, 2012; Morley, 2013). This ability enabled our ancestors to communicate emotionally about coherence (agreement) or tension (uncertainty), much as the control of subsequently spoken phonemes enabled the communication of propositional meaning. All these hypothetical functions of PO are possible on the condition that people are able to recognize PC.
The evolution of PCT and the Baldwin effect
The pitch structure characteristic of the majority of musical styles is culturally transmitted information. Clearly, cultural transmission is possible thanks to learning abilities that are observed among animals. There are at least two ways in which behavioural traits are learned: (i) through non-imitative social learning (i.e., the observation of the conditions and the consequences of a particular behaviour) and (ii) imitation (Jablonka & Lamb, 2005). Although these two kinds of learning are not usually independent of each other in humans, the character of pitch structure transmission (i.e., the importance of fidelity in the reproduction of pitch sequences) indicates that it is learned solely by means of imitation. What characterizes this kind of learning is the exact copying of the actions of other individuals (Jablonka & Lamb, 2005). In contrast to other forms of learning, imitation is related to so-called ‘ritual culture’ (Merker, 2005) in which ‘the exact form and particulars of execution are primary’ (Merker, 2012, p. 219). In this respect, pitch sequences are comparable to non-human species’ ritualized displays, for example to the songs of songbirds (Fitch & Jarvis, 2013). In these instances, specifically arranged pitch sequences serve to communicate information regarding the sender’s condition, intention, and motivation (Alcock, 2001), although the patterns of pitches are arbitrary in accordance with the ritual’s function (Merker, 2005).
Rituals can be slightly modified in the long term, often as a result of the fluctuating preferences of the recipients. Because human rituals are usually collective (Merker, 2009), their execution demands an alignment of the performers’ brain states. Additionally, the social character of human rituals means that the success of the ritual depends on the reaction of a social audience (Merker, 2005). Therefore, the social environment becomes an important factor in establishing an acceptable form of the ritual. As a result, ritualized pitch sequences can be more or less modified during the process of cultural evolution depending on a group’s preferences. In turn, the cultural evolution of songs can lead to the differentiation of song dialects.
In order to start this process, a set of mental capabilities is necessary. Among them are: (i) the ability to analyse auditory pitch patterns, (ii) an expanded auditory memory, and (iii) an auditory learning ability. Since singing most probably preceded instrumental music (Morley, 2013), the imitation of pitch structure additionally demanded vocal learning – the ability to vocally reproduce what is heard (Merker, 2012). Complex vocal learning is rare among mammals and absent in living primates other than humans (Janik & Slater, 1997); it therefore had to evolve among our predecessors after our evolutionary lineage had branched off from the common ancestor of chimpanzees and humans. Vocal learning enabled our ancestors to sing. But how did PCR evolve? It seems impossible to explain the emergence of PCR solely by means of an accidental genetic mutation. Since there is no causal relation between the structure and the function of a ritual, no selective pressure for this preference exists; thus, the proliferation of such a mutation seems impossible.
One possible solution to this problem is the ‘Baldwin effect’. The Baldwinian scenario for the evolution of PCR starts with the invention of a pitch sequence ritual. Hominines able to vocally learn pitch sequences began to use these ritualized sequences in some socially important contexts. It is only a matter of speculation whether this ritual was religious (Alcorta, Sosis, & Finkel, 2008) or served other social purposes. In all these cases the selective pressure favoured the best performers, that is, those who learned most effectively and were infallible during ritual display. In the beginning, because of limited working memory capacity, the number of pitches in such rituals was restricted to a few notes. At this stage no pitch held greater importance than the others. If the performance of the ritual resulted in a positive effect (e.g., the affirmation of group membership), then the closure of the ritual gained an informative role. Since one may assume that in the first rituals singing was inseparable from dance activity (Morley, 2009), such a ritual would end with a particular pitch. This view is supported by the notion that pre-linguistic hominines would lack the ‘cognitive fluidity’ necessary to create complex artefacts and behaviours (Mithen, 1996, 2006).
Because information transmitted by means of a ritual usually affects autonomic, endocrine and behavioural responses in the recipient (Alcorta et al., 2008), it is reasonable to assume that the positive effects of the ritual triggered these responses in our predecessors too. The autonomic and endocrine responses are closely connected with emotions (Panksepp, 1998), which must have been experienced by the hominines participating in the ritual. However, the ultimate cause of the emotional reaction to the ritual display is the function of the ritual (Merker, 2005). Thus, the emotions that accompanied the recognition of a particular pitch at the end of the ritual, and which were induced originally by the positive effect of the ritual, were misattributed to a well-predicted pitch (cf. Dutton & Aron, 1974). At such a moment, a successful performance was needed in order for a particular emotional reaction to be misattributed to the final pitch. If the adaptive function of the ritual consisted in group consolidation, the social context of group integration during singing elicited a feeling of stability whenever singers reached a particular pitch together. Additionally, the emotional reinforcement of the final pitch conferred on it a privileged role, resulting in its more frequent use. This process led to the culturally induced emergence of a cognitive-behavioural loop in which the performance of a tune finished with a particular pitch that was associated with the ultimate function of the ritual; the ritual induced an emotional reaction that in turn facilitated the memorization of the privileged pitch, which became PC. Although emotions facilitate working memory (Lindström & Bohlin, 2011), at that time the connection between the emotional reaction and PCR was much weaker than among contemporary humans. Thus, hominines had to acquire the rules of repetition through strenuous practice that resembled the contemporary learning of writing.
The cultural niche became a selective environment that favoured the fastest learners. At some point in time, an accidental mutation predisposed one individual to combine the implicit knowledge of pitch statistics with emotional reaction more effectively than others, which resulted in faster learning. The fastest learners consolidated most successfully, and thus started to predominate in the whole species. But this time, the emotion which accompanied PCR was not the result of misattribution. The result that was previously achieved by means of many culturally imposed repetitions became an instinctive response to musical stimuli. In consequence, PCR became an implicit mnemonic strategy used by all people to improve their memory of melodies. From a neurobiological perspective, this evolutionary innovation was a subtle change (Podlipniak, 2014). It consisted of an inborn predisposition to couple already existing structures, dedicated to (i) emotional reaction, (ii) extracting implicit knowledge of pitch distribution, and (iii) sustaining representations of pitch units in working memory, into one loop. However, from an ethological and psychological perspective, it drastically changed human musicality.
Conclusion
This article has attempted to show that PCR is a human-specific mental ability that should be understood as an evolutionary innovation of vocal communication. This view is based on two main observations: (i) the psychological specificity of PCR and (ii) the cross-cultural and historical ubiquity of PCT in music. The human-specific and music-specific character of PCT stands in contradiction to the general nature of the mental mechanisms by which PCR has been explained by other authors. Thus, it seems impossible to explain PCR solely by means of these mechanisms. A more convincing hypothesis is that in the course of evolution all of these evolutionarily ancient mechanisms (i.e., emotional reaction based on prediction, the statistical analysis of stimuli, and the ability to sustain pitches in working memory) were incorporated into an evolutionarily new mental tool. This is consistent with evolutionary logic, i.e., tinkering in an opportunistic way instead of constructing new tools from scratch (Jacob, 1977).
As has been indicated, it is highly probable that the evolution of PCR was closely related to the emergence of human ritual culture and the social character of our species. It was also emphasized that the cultural environment played an important role in this process by causing the Baldwin effect. As a result, music structure became a tool of emotion control which enhanced social bonding. The presented view suggests that PCR evolved because of its function, which is absent from language. What distinguishes music from language in this respect is the intuitive involvement of the emotional reaction in the processing of segmental organization. This is possible thanks to PCT, which conveys emotionally based meaning. In speech, by contrast, the rules of segmental sound organization are poor at eliciting emotions. The propensity to organize pitches around PC has influenced the pitch structures of different music idioms around the world. However, this influence has not been as restrictive as, for example, the influence of the human propensity to laugh on the acoustical characteristics of laughter (Zeifman, 2001), which is much less culturally ductile than phonetics or the acoustical specificity of music (Patel, 2008). In fact, PCR has allowed considerable space for cultural invention. As a result, pitch syntaxes differ cross-culturally and historically due to the propagation of musical conventions. Despite this variability, the majority of pitch syntaxes possess PCT (Bannan, 2012). In this respect, the world-wide variety of music pitch structures resembles the diversity of language grammars.
Admittedly, because of the scarcity of data, the presented evolutionary scenario has a speculative character. According to Hauser (1996, p. 2), one of the most useful approaches to understanding communicative phenomena in the living world is Tinbergen’s framework, which uses four different perspectives: mechanistic, ontogenetic, functional and phylogenetic. Only a few facts related to these perspectives concerning PCR have been explained; therefore, a full understanding of PCR requires further research. One possible way to test the proposed specificity of PCR would be to compare emotional reactions to syntactically unexpected speech and music sequences. This could be achieved by measuring somatic markers of autonomic nervous system activity in response to speech and music stimuli. This kind of study would be especially useful if the participants came from different cultures.
Although the uniqueness and ubiquity of PCT are accepted by many scholars (Bannan, 2012; Brown & Jordania, 2013; Cross, 2005; Huovinen, 2002; Sloboda, 1986), its functional characteristics remain unclear. If the ability of PCR is a result of a human inborn propensity, this would mean that musical pitch syntaxes cannot be autonomous from their functions. Therefore, a functional explanation of PCT has been proposed here. The culturally invented consolidatory function of PCT triggered interactions between genetic and cultural evolution which resulted in the emergence of a biased cognition of pitch. This bias explains the observed easier learning of tonal compared to atonal sequences. In order to test the presented point of view, the ontogenetic and phylogenetic details concerning PCR should be elucidated. For example, the remaining unanswered questions are why the human mind became sensitive both to formants (as distinctive features of phonemes) and to relative pitch (as a distinctive feature of musical pitch), and what role the sexual dimorphism of adult male and female voices played in the evolution of PCR. The neurobiological mechanisms responsible for PCR must also be better understood. Nevertheless, the proposed origin of PCR is consistent with what would be expected as a result of the Baldwin effect. On the one hand, people instinctively recognize PC in music, whilst on the other, musical syntaxes are culturally flexible and variable.
Acknowledgements
I would like to thank the reviewers for their helpful suggestions and comments. I would also like to thank Peter Kośmider-Jones for language consultation.
Funding
This work was supported by the Polish National Center of Science [grant number 0101/B/H03/2011/40].
