Abstract
There is widespread interest in the possibility that music training enhances nonmusical abilities. This possibility has been examined primarily for speech perception and domain-general abilities such as IQ. Although social and emotional processes are central to many musical activities, transfer from music training to socioemotional skills remains underexplored. Here we synthesize results from studies examining associations between music training and emotion recognition in voices and faces. Enhancements are typically observed for vocal emotions but not for faces, although most evidence is cross-sectional. These findings are discussed in light of the design features of the studies. Future research could further explore the neurocognitive mechanisms underlying musician-related differences in emotion recognition, the role of predispositions, and the implications for broader aspects of socioemotional functioning.
Introduction
Music training has been used widely as a model for studying brain plasticity (Habib & Besson, 2009; Herholz & Zatorre, 2012; Kraus & Chandrasekaran, 2010; Moreno & Bidelman, 2014; Pantev & Herholz, 2011), which refers to changes in brain structure and/or function that occur as a consequence of learning and experience. Learning to play a musical instrument is a demanding multimodal task that usually starts in childhood and involves large amounts of practice, often for several years. It requires high-precision motor control, perceptual tuning to the fine-grained acoustics of sounds, integration of information from several sensory modalities, and attention, memory, and emotional processes.
Due to potential theoretical and practical implications, there is widespread interest in the possibility that music training has consequences that extend to nonmusical abilities. Such generalization of knowledge is called “transfer.” A distinction is often made between near transfer, which occurs between domains that are closely related (e.g., between two different pitch discrimination tasks), and far transfer, which occurs between domains that have less in common (e.g., between music and mathematics; Barnett & Ceci, 2002). If music training promotes transfer of skills, this would inform debates on learning and plasticity (Herholz & Zatorre, 2012), the organization of cognitive functions (e.g., modularity; Besson et al., 2011), the biological basis of musicality (e.g., Clark et al., 2015), and the use of music as a tool in clinical and educational contexts (Dumont et al., 2017; Grau-Sánchez et al., 2020).
Many studies have examined far transfer from music to speech perception and domain-general abilities such as IQ and executive functions. Music training has been associated with enhancements in areas such as speech-in-noise perception (Coffey, Chepesiuk, et al., 2017; Coffey, Mogilever, & Zatorre, 2017; but see Boebinger et al., 2015) and prosody perception (e.g., Marques et al., 2007). A theoretical account has been put forward for understanding these effects, based on plasticity mechanisms and the overlap between sensory and cognitive processes in both music and speech (OPERA hypothesis; Patel, 2011, 2014). Enhancements have also been documented for working memory (Roden et al., 2014), executive functions (Bugos & DeMarie, 2017; Moreno et al., 2011), and IQ (Schellenberg, 2004, 2011). Although it is often claimed that music training causes advantages in nonmusical tasks, most evidence comes from cross-sectional comparisons between musicians and nonmusicians, which do not establish causality (Schellenberg, 2019). Sala and Gobet (2017a, 2017b, 2020) reviewed longitudinal training studies conducted with children, and found that far transfer from music to cognitive and academic skills is in fact weak and most likely to be found when study designs are not optimal (e.g., no random allocation of participants). Claims about music training and transfer therefore remain a matter of debate.
In this review, we address the association between music training and a central aspect of socioemotional processing—the ability to recognize emotions expressed by others. Music is linked fundamentally to emotional and social processes, but potential transfer of skills to these domains is typically overlooked, in contrast to the large number of studies focusing on speech perception and cognitive abilities. Our evaluation of the available literature on music training and the recognition of vocal and facial emotions focuses on three issues: (a) design features of the studies, (b) specificity and scope of the associations, and (c) underlying cognitive and neural mechanisms.
Music and Socioemotional Processes
Music’s prevalence across human societies relates to its power to express, induce, and regulate emotions (for a review, see Swaminathan & Schellenberg, 2015). This connection between music and emotion is part of our social interactions from early life, for instance as a communicative channel between caregivers and infants. Caregivers’ speech to infants is music-like (Fernald et al., 1989), and singing to infants is ubiquitous, with effects on sustained attention, arousal, social regulation, and emotional synchrony (Nakata & Trehub, 2004; Trehub, 2003). Moreover, adolescents and adults report listening to music primarily to regulate their moods (Lonsdale & North, 2011; North et al., 2000). Music also triggers synchronization behaviors such as dancing or foot tapping, which induce positive emotions (Trost et al., 2017) and promote bonding and prosocial behavior (Tarr et al., 2014).
Like other socioemotional signals such as faces and vocalizations, music expresses emotions that can be recognized quickly and consistently across listeners (e.g., Bigand et al., 2005; Thompson, 2009). There is also evidence of above-chance recognition of musical emotions across cultures, suggesting that there are universal cues to emotion in music (Balkwill & Thompson, 1999; Fritz et al., 2009). Emotion recognition in music is also systematically linked to factors such as age and music training. For example, the ability to identify negative emotions, but not positive ones, starts to decline in middle age (Castro & Lima, 2014; Lima & Castro, 2011), and music training is associated with enhanced sensitivity to musically expressed emotions (Bhatara et al., 2011; Castro & Lima, 2014; Lima & Castro, 2011).
Consistent with the fundamental link between music and emotion, a meta-analysis of neuroimaging studies shows that music listening recruits core brain systems underlying emotional processes, including the amygdala, nucleus accumbens, hypothalamus, hippocampus, insula, cingulate cortex, and orbitofrontal cortex (Koelsch, 2014). According to Koelsch (2013, 2014), the social functions of music are at the core of these emotional effects. Actively participating in musical activities, especially in a group, would engage social functions that are essential for the survival of the individual and the species. Playing music promotes contact with others, social cognition processes including mental state attribution, copathy (emotional states becoming more homogenous across individuals), communication, coordination of actions, and cooperation, leading to increased social cohesion. By involving these functions, music would fulfil several basic human needs in an effortless manner, possibly explaining its role in evolution and development, and its power to produce pleasure and reward value.
In line with this argument, Clark et al. (2014, 2015) highlight the potential role of musical activities in promoting the ability to analyze others’ mental states and motivations. They argue that music has evolved from the call signals of our hominid ancestors as a vehicle to model and express emotional mental states, allowing socially relevant routines to be abstracted, rehearsed, and transformed in a low-cost manner (i.e., without the costs of enacting the corresponding scenarios). The proposal of a connection between music and social cognition is supported by findings indicating that the decoding of mental states in music is impaired in frontotemporal dementia, a paradigmatic acquired disorder of social behavior (Downey et al., 2013). This musical impairment correlates with measures of social inference and empathic capacity and, notably, has a neuroanatomical substrate in medial prefrontal and temporal areas that are implicated in mentalizing and social behavior. Furthermore, adults with a developmental deficit in music processing (congenital amusia) show impairments in socioemotional cognition, including a decreased ability to infer the emotional authenticity of laughter and to decode emotions from vocal and facial expressions (Lima, Brancatisano, et al., 2016). These results suggest that intact musical behavior and capacities might be needed for the normal development of socioemotional processing.
It is therefore plausible to hypothesize that systematic engagement in musical activities could lead to improvements in socioemotional skills and corresponding brain networks. Music could additionally improve socioemotional skills as a consequence of low-level sensory enhancements. For example, music training can fine-tune aspects of auditory processing, including pitch (Habibi et al., 2016; Moreno et al., 2009), timing (Chobert et al., 2014; Frey et al., 2019), and timbre (Putkinen et al., 2019). Processing these acoustic cues is critical for sensory stages of socioemotional perception in the auditory domain, particularly for vocal emotions. The acoustic profile of musical and vocal emotions is in fact partly shared, with notable commonalities in the use of acoustic cues such as pitch, intensity, and high-frequency energy (Curtis & Bharucha, 2010; Ilie & Thompson, 2006; Juslin & Laukka, 2003). For instance, in both music and speech, anger is communicated by fast tempo, rising pitch, high intensity, and high-frequency energy, whereas sadness and tenderness are typically associated with reductions in these features (Juslin & Laukka, 2003). The locus for putative transfer from music to socioemotional processing could thus be at an auditory-perceptual level of processing, at a higher order social cognition level, or both.
Emotion Recognition Abilities
Emotion recognition is a crucial socioemotional skill. In everyday interactions, we receive information about the emotional states of others through a multitude of nonverbal cues expressed by the face, body, and voice. The voice is a particularly rich communicative tool: emotions can be conveyed through prosody in actual speech (emotional prosody) and through nonverbal vocalizations. Both prosody and nonverbal vocalizations are vocal signals, but their underlying production and perception mechanisms are partly distinct (Pell et al., 2015; Scott et al., 2010). Emotional prosody corresponds to variations in aspects of speech, including fundamental frequency, amplitude, timing, and timbre (Banse & Scherer, 1996; Scherer, 1995). Nonverbal vocalizations are nonspeech vocal sounds such as laughs, screams, or sobs.
Forced-choice classification and rating studies indicate that listeners can recognize a wide range of emotions in facial, prosodic, and nonverbal vocal cues, with performance levels well above chance (e.g., Bänziger et al., 2012; Langner et al., 2010; Lima et al., 2013). This recognition process is fast, automatic (Lima et al., 2019; Tracy & Robins, 2008), and partly governed by universal principles (e.g., Chronaki et al., 2018; Laukka & Elfenbein, 2020; Sauter et al., 2010, 2015; but see Gendron et al., 2018). Emotion recognition abilities also vary widely across individuals. For instance, older adults perform worse than younger adults on facial and vocal emotion recognition tasks (e.g., Amorim et al., 2019; Lima et al., 2014; Ruffman et al., 2008), and some studies show associations between poor empathy and poor emotion recognition (e.g., Dawel et al., 2012; Gery et al., 2009; but see Vilaverde et al., 2020). Understanding variability in emotion recognition is important because of its potential implications for everyday behavior and communication. Emotion recognition abilities relate to personal and social adjustment in children (e.g., Blair & Coles, 2000; Bowen & Nowicki, 2007; Stevens et al., 2001) and adults (e.g., Carton et al., 1999; Dawel et al., 2012; Hall et al., 2009), and are central to the notions of emotional competence and emotional intelligence (Mayer et al., 2003; Scherer & Scherer, 2011). Moreover, intervention studies focused on emotion recognition have shown that these abilities are amenable to training (e.g., Golan et al., 2010; Schlegel et al., 2017).
Several neural systems are involved in emotion recognition, including the amygdala, fusiform gyrus, superior temporal sulcus and gyrus, and medial and lateral prefrontal cortices (e.g., Frühholz et al., 2016; Schirmer, 2018; Schirmer & Adolphs, 2017). Some of these systems, namely the medial prefrontal cortex and the posterior part of the superior temporal sulcus, are engaged regardless of modality (e.g., Peelen et al., 2010) and play a more general role in socioemotional processing. Other systems show modality-specific effects. Vocal emotions seem to engage the superior temporal sulcus and gyrus more readily, whereas the amygdala is more readily engaged by facial emotions (Grandjean, 2020; Schirmer, 2018; Young et al., 2020). Recent studies have also suggested that the motor system plays a role in socioemotional processing, both for vocal (Correia et al., 2019; Lima, Krishnan, & Scott, 2016; McGettigan et al., 2015; O’Nions et al., 2017) and facial expressions (Ethofer et al., 2013; Johnston et al., 2013; Kreifelts et al., 2013; Rochas et al., 2013). For instance, a stronger recruitment of the motor system predicts better performance in vocal emotion recognition tasks (Correia et al., 2019; McGettigan et al., 2015).
Music Training and Emotion Recognition Abilities
Several studies have examined whether music training is associated with enhanced performance in emotion recognition tasks. The results of these studies are summarized in Figure 1 and supplemental Tables S1 and S2. Most of them are based on cross-sectional comparisons between trained and untrained participants (n = 17), but a few have used longitudinal training designs (n = 4).

Figure 1. Studies on emotion recognition in musically trained individuals, organized according to design (cross-sectional or longitudinal) and stimulus type (speech prosody, speech prosody analogues, nonverbal vocalizations, faces, and audiovisual).
Cross-Sectional Evidence
Most cross-sectional studies ask how trained and untrained listeners recognize vocal emotions, using prosodic stimuli in the majority of cases (e.g., Correia et al., 2020; Dmitrieva et al., 2006; Farmer et al., 2020; Fuller et al., 2014; Lima & Castro, 2011; Park et al., 2015; Pinheiro et al., 2015), but also melodic analogues of emotional prosody (Thompson et al., 2004; Trimmer & Cuddy, 2008) and purely nonverbal vocalizations (Correia et al., 2020; Parsons et al., 2014; Young et al., 2012). Only a few studies examined emotion recognition for other modalities, including faces (Correia et al., 2020; Farmer et al., 2020; Weijkamp & Sadakata, 2016) and audiovisual stimuli (Farmer et al., 2020; Weijkamp & Sadakata, 2016). The focus is typically on the recognition of specific emotions (e.g., happiness, sadness), evaluated via forced-choice tasks in which participants select the emotion being expressed by each stimulus from a list of alternatives. Performance is then measured in terms of the accuracy with which emotions are recognized. There are also studies evaluating other aspects of emotional processing, including valence and arousal perception (Dibben et al., 2018), and inferences of distress (Parsons et al., 2014; Young et al., 2012) and depression in voices (Nilsonne & Sundberg, 1985). Trained participants are usually adults with at least 5 years of classical music training, in line with the broader literature on music training (Zhang et al., 2018), and they are compared to individuals with little or no training.
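The forced-choice scoring described above can be illustrated with a minimal sketch. The function name and the trial data are hypothetical and are not taken from any of the cited studies. The sketch computes the proportion of correct responses per intended emotion and the chance level implied by the number of response alternatives.

```python
from collections import defaultdict


def score_forced_choice(trials, n_alternatives):
    """Return per-emotion accuracy and the chance level of a forced-choice task.

    trials: iterable of (intended_emotion, chosen_emotion) pairs.
    n_alternatives: number of response options offered on each trial.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for intended, chosen in trials:
        total[intended] += 1
        if chosen == intended:
            correct[intended] += 1
    accuracy = {emotion: correct[emotion] / total[emotion] for emotion in total}
    chance = 1 / n_alternatives  # expected accuracy under random guessing
    return accuracy, chance


# Hypothetical responses from one participant (illustration only).
trials = [
    ("happiness", "happiness"), ("happiness", "happiness"),
    ("sadness", "sadness"), ("sadness", "fear"),
    ("fear", "fear"), ("fear", "sadness"),
    ("anger", "anger"), ("anger", "anger"),
]
accuracy, chance = score_forced_choice(trials, n_alternatives=4)
```

Raw accuracy of this kind does not correct for response biases, so it should be read as the simplest possible scoring scheme rather than as the procedure used in any particular study.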
A musician advantage emerges in the majority of studies on vocal emotion recognition (11 out of 17; 65%). For instance, Lima and Castro (2011) compared 40 musicians and 40 untrained listeners from a wide age range (18–60 years) on their ability to recognize seven prosodic emotions in sentences with emotionally neutral semantic content. Musicians showed higher recognition accuracy, an effect that was similar across the age range and across emotions (for a recent replication, see Correia et al., 2020). Similar advantages have been found for recognizing emotions from tone sequences that mimic the prosody of spoken sentences (Thompson et al., 2004, Experiment 1), for recognizing emotions in pseudowords (Fuller et al., 2014), and for inferring whether a vocal expression was produced during a depressive state (Nilsonne & Sundberg, 1985). In six of the studies reporting an advantage of music training, however, group differences were restricted to a subgroup of participants (Dmitrieva et al., 2006; Parsons et al., 2014; Young et al., 2012) or a subset of emotions (Correia et al., 2020; Pinheiro et al., 2015; Thompson et al., 2004). For example, Thompson et al. (2004, Experiment 2) compared 28 trained listeners and 26 untrained ones on their ability to recognize prosodic emotions in tone analogues of spoken sentences, and in intact English and Tagalog sentences. The training advantage was observed across stimulus types, but only for sadness, fear, and neutrality, and not for happiness and anger. More recently, Correia et al. (2020) showed that musician advantages extend to the recognition of purely nonverbal vocalizations, but again not for all emotions (e.g., trained and untrained participants performed similarly in the recognition of anger). Importantly, six studies failed to find training effects.
Specifically, null findings were observed for emotional prosody recognition in linguistic stimuli including words, pseudowords, and sentences (Başkent et al., 2018; Dibben et al., 2018; Mualem & Lavidor, 2015; Park et al., 2015; Trimmer & Cuddy, 2008), for tone analogues of spoken sentences (Trimmer & Cuddy, 2008), and for humming voices (Weijkamp & Sadakata, 2016). When Trimmer and Cuddy (2008) tested 100 participants with varying levels of music training, they found that emotion recognition in sentences and tone analogues correlated with emotional intelligence but not with music training.
Therefore, there is evidence for enhanced vocal emotion recognition in musicians, but the advantage is not always apparent. One possibility is that the effect is small (e.g., Correia et al., 2020), such that relatively large samples are required for it to emerge. Consistent with this possibility, four of the studies reporting null results tested only 10 to 16 trained participants (Başkent et al., 2018; Trimmer & Cuddy, 2008; Weijkamp & Sadakata, 2016), or did not specify the number of trained participants (Trimmer & Cuddy, 2008). It is also possible that, in some cases, the samples included participants who were no longer practicing their instruments. Only three of the studies reporting null results specified that participants were active musicians (Başkent et al., 2018; Dibben et al., 2018; Mualem & Lavidor, 2015). This issue is important because behavioral and brain effects of learning-induced plasticity might be lost when the trained skill is no longer practiced, in line with evidence from taxi drivers (Woollett et al., 2009) and jugglers (Draganski et al., 2004).
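The sample-size argument can be made concrete with a rough power calculation. The sketch below uses the standard normal approximation for a two-sided comparison of two group means. The effect sizes passed to it are hypothetical values of Cohen's d chosen for illustration, not estimates reported in the cited studies.

```python
from math import ceil, sqrt
from statistics import NormalDist

_Z = NormalDist()  # standard normal distribution


def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate participants needed per group to detect effect size d."""
    z_alpha = _Z.inv_cdf(1 - alpha / 2)
    z_power = _Z.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) / d) ** 2)


def approx_power(d, n, alpha=0.05):
    """Approximate power with n participants per group (same approximation)."""
    z_alpha = _Z.inv_cdf(1 - alpha / 2)
    return _Z.cdf(d * sqrt(n / 2) - z_alpha)


# Hypothetical small-to-medium effect (d = 0.3), illustration only.
needed = n_per_group(0.3)
power_with_16 = approx_power(0.3, 16)
```

Under these assumptions, groups of 10 to 16 participants leave the probability of detecting such an effect far below the conventional 80% target, which is consistent with the interpretation that some null results may reflect low power.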
The few studies that examined associations between music training and emotion recognition for modalities other than the voice found null effects. In a task involving emotion recognition in static facial expressions, Correia et al. (2020) found that trained musicians performed as well as untrained ones. Moreover, Bayesian analyses provided substantial evidence for the null hypothesis. Farmer et al. (2020) explored emotion recognition in audiovisual clips including biological motion and prosodic cues, and observed that the training advantage was restricted to the auditory condition. Musicians were more accurate than untrained participants when they assessed prosodic cues alone, but not when they assessed biological motion or the two types of cues combined. Finally, using a Stroop-like task involving facial and vocal stimuli combined, Weijkamp and Sadakata (2016) found that musicians’ emotional evaluations were less susceptible to interference from the to-be-ignored modality, which could reflect improved attention or more efficient audiovisual integration. Performance was similar across groups, however, when the focus was limited to facial expressions.
Although more research on the visual modality and on multimodal integration is warranted, the available evidence suggests that music training is selectively related to vocal emotions. What remains poorly explored are the mechanisms underlying this advantage. Because music training correlates with higher domain-general cognitive abilities (e.g., Swaminathan & Schellenberg, 2018; Swaminathan et al., 2017), emotion recognition advantages could arise because trained individuals tend to be high-functioning and not because of a meaningful association between music and emotion processes. Cognitive abilities seem like an obvious confounding variable, yet only a few studies have measured them (see Figure 1). Among the studies that measured cognitive abilities and documented musician advantages in emotion recognition, all found positive associations between music training and general cognitive abilities (Correia et al., 2020; Lima & Castro, 2011; Thompson et al., 2004). Thompson et al. (2004) showed, however, that group differences in emotional prosody recognition could not be explained by cognitive performance as measured by Raven’s Advanced Progressive Matrices. Similarly, Lima and Castro (2011) found that the musician advantage in emotional prosody recognition remained significant after accounting for performance on the Montreal Cognitive Assessment (MoCA), Raven’s Advanced Progressive Matrices, and the Stroop test. In the study by Correia et al. (2020), digit span did not explain training advantages in the case of nonverbal vocalizations, but it partly did in the case of emotional prosody. When digit span was held constant, the correlation between music training and prosody recognition only approached significance. Because the digit span test has a strong auditory component, it remains unclear if it represents a proper confounding variable or if it reflects more specific aspects of the training. 
In sum, there is initial support for the idea that the musician advantage in emotion recognition is not an artefact of general cognitive advantages. Nevertheless, the inclusion of cognitive measures needs to become standard practice, and the precise role of different domain-general processes will need to be delineated in future research.
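What it means to hold a cognitive measure constant can be shown with a first-order partial correlation. This is a minimal sketch: the correlation values are hypothetical and do not come from Correia et al. (2020) or any other cited study, and the original analyses may have used different models.

```python
from math import sqrt


def partial_correlation(r_xy, r_xz, r_yz):
    """First-order partial correlation between x and y, holding z constant."""
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Hypothetical zero-order correlations (illustration only).
r_training_prosody = 0.30  # music training vs. prosody recognition
r_training_span = 0.40     # music training vs. digit span
r_prosody_span = 0.45      # prosody recognition vs. digit span

r_partial = partial_correlation(
    r_training_prosody, r_training_span, r_prosody_span
)
```

With these illustrative values, the association between training and prosody recognition shrinks by roughly half once digit span is held constant. A reduction of this kind is what would be expected if part of the training advantage overlapped with the cognitive measure.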
Studies on the neural correlates of vocal emotional processing in musicians are rare, and they will be crucial to inform the debate on the mechanisms underlying the association between musical experience and vocal emotions. Initial evidence points to an important role of auditory-perceptual enhancements. For example, Strait et al. (2009) recorded brainstem potentials of musicians and nonmusicians in response to an infant’s crying sound. Musicians exhibited enhanced subcortical timing, as well as enhanced representations of frequencies important for pitch and timbre perception. In an event-related potential study, Pinheiro et al. (2015) found that music training relates to distinct responses to emotional prosody in two components that are markers of early sensory processing. Specifically, musicians showed reduced P50 amplitude compared to untrained listeners in response to prosodic emotions in emotionally neutral semantic sentences. For both P50 and N100, musicians showed similar amplitudes regardless of stimulus type (prosodic stimuli with or without semantic content), whereas untrained listeners showed more negative amplitudes for stimuli without semantic content. These two studies indicate that neurocognitive pathways for processing music and vocal emotions may overlap at an early sensory level, and that this overlap may contribute to musician advantages in emotion recognition. A similar view has been used to account for transfer effects from music to speech processing (Patel, 2011, 2014). The role of auditory-perceptual enhancements in emotion recognition tasks remains poorly understood. In one recent study, music perception abilities, as assessed through a range of tasks (e.g., musical beat perception, pitch discrimination), correlated positively with emotion recognition in prosody and nonverbal vocalizations (Correia et al., 2020). 
Importantly, when these perceptual abilities were held constant, the association between music training and vocal emotion recognition was no longer significant.
Whether music training also relates to emotion recognition at higher order levels of socioemotional processing is another avenue for future studies. On the one hand, the lack of associations between training and visual emotion recognition suggests that this might not be the case. On the other hand, in an fMRI study, Park et al. (2015) found that musicians show enhanced responses to sad prosody, though not to happy or fearful prosody, in regions involved in general socioemotional processing, including the medial prefrontal and anterior cingulate cortices.
Longitudinal Evidence
Central to our understanding of music training effects is the issue of causality, which can only be established with longitudinal data, random assignment, and appropriate (active) control groups (Schellenberg, 2020). We identified only four longitudinal studies that included emotion recognition measures (see Figure 1 and Table S2). One was conducted with typically developing children, but the children were tested only once on the emotion recognition task, so the design was not truly longitudinal in this regard (Thompson et al., 2004). The remaining three were conducted with deaf children or adults with cochlear implants (CIs; Chari et al., 2019; Fuller et al., 2018; Good et al., 2017). All of these studies focused on emotional prosody recognition, although Good et al. (2017) also included an audiovisual condition that combined prosody with facial expressions.
Thompson et al. (2004) recruited 43 children aged 6 years who had previously been assigned to 1 year of keyboard, vocal, or drama lessons, or to a no-lessons group (n ≈ 10 per group), and examined their emotion recognition abilities. In other words, the children were not assessed on the experimental task before training. In a two-alternative forced-choice task, children decided whether prosodic stimuli were happy or sad in one condition, and angry or fearful in a second condition. The groups did not differ on IQ before training, but there were differences after training for the recognition of angry/fearful stimuli, with the music groups performing better than the no-lessons group. This advantage raises the possibility of a training benefit, but the music groups did not differ from the drama group, and no effects were observed for happy/sad trials. Moreover, in separate analyses per group, only the keyboard group, and not the vocal group, significantly outperformed children who had had no lessons. These findings thus provide only limited evidence of an effect of music training on emotional prosody recognition, and they do not establish causality because no pretraining emotion recognition data were available.
The three studies conducted with CI users included random assignment and control groups (active in two of the three studies), but they all reported null results. Good et al. (2017) compared two groups of nine children with CIs, who completed 6 months of either music or visual arts training. The two groups were tested for music perception skills, and for emotion recognition in prosody alone and in audiovisual clips. While music perception skills improved significantly more in the music group than in the visual arts group, emotion recognition did not. Emotion recognition performance was generally better after training, but the interaction between time (pre/post) and group was not significant. In a different study, Fuller et al. (2018) randomly assigned 19 adult CI users to one of three groups that completed six 2-hour training sessions: a pitch/timbre group (n = 6), a music therapy group (n = 7), and an active control group that involved activities such as writing, cooking, and woodworking (n = 6). The three groups were tested for several aspects of speech perception, melodic contour identification, quality of life, and emotional prosody recognition in pseudowords. For melodic contour identification, the pitch/timbre group improved significantly more than the music therapy and control groups. For emotion recognition and for the remaining measures, however, no benefits of training were found. There was a small improvement for the music therapy group only, but there was no interaction between time and group. Finally, Chari et al. (2019) randomly allocated 18 participants to one of three groups that completed 4 weeks of training: an auditory-motor group (n = 7), an auditory-only group (n = 7), and a passive control group (n = 4). Participants were tested for speech perception, pitch discrimination, melodic contour identification, and emotional prosody recognition in sentences.
For melodic contour identification, the auditory-motor group improved significantly more than the other groups, but no benefits were observed for the remaining measures, including emotion recognition.
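The logic of the time by group interaction reported in these studies can be sketched with gain scores. In a design with two time points and two groups, a pooled-variance t test on pre-to-post gains is equivalent to the interaction test (F = t squared). The scores below are hypothetical percent-correct values, not data from the cited studies, and the function name is an illustrative choice.

```python
from math import sqrt
from statistics import mean, variance


def gain_score_t(pre_a, post_a, pre_b, post_b):
    """Compare pre-to-post gains of two groups with a pooled-variance t test.

    Returns (mean gain A, mean gain B, t statistic, degrees of freedom).
    """
    gains_a = [post - pre for pre, post in zip(pre_a, post_a)]
    gains_b = [post - pre for pre, post in zip(pre_b, post_b)]
    n_a, n_b = len(gains_a), len(gains_b)
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * variance(gains_a)
                  + (n_b - 1) * variance(gains_b)) / df
    se = sqrt(pooled_var * (1 / n_a + 1 / n_b))
    t = (mean(gains_a) - mean(gains_b)) / se
    return mean(gains_a), mean(gains_b), t, df


# Hypothetical percent-correct scores (illustration only).
music_pre, music_post = [50, 55, 60, 45], [58, 60, 66, 54]
control_pre, control_post = [52, 48, 57, 61], [57, 56, 62, 68]

gain_music, gain_control, t_stat, df = gain_score_t(
    music_pre, music_post, control_pre, control_post
)
```

In this illustration both groups improve from pretest to posttest, but their gains are similar, so the t statistic is small. This is the pattern described above: performance is better after training in general, without evidence that the music group benefited more than the control group.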
In short, evidence of a causal role for music training in emotion recognition is currently nonexistent. Nevertheless, the null results cannot be taken as evidence of absence of effects either, because of the limited number of studies and the small sample sizes.
Music Training Versus Preexisting Factors
Because associations between music training and vocal emotion recognition are reported primarily from cross-sectional comparisons, it is possible that they stem from preexisting factors. Cross-sectional studies involving highly trained musicians are valuable because they offer a model for examining the correlates of training when expertise and experience are highest. But they do not allow us to tease apart environmental from genetic contributions, and often they do not account for other relevant confounding variables. Cognitive abilities, but also personality and socioeconomic variables, might determine who takes music lessons, and this possibility is rarely considered in cross-sectional comparisons (Swaminathan & Schellenberg, 2018; for a review, see Schellenberg, 2020). Importantly, musical abilities, the propensity for music training, and associations between music training and musical abilities have a genetic component (for a review, see Mosing & Ullén, 2016). Twin studies indicate that individual differences in music practice are heritable to a substantial degree (40% to 70%), and when genetic predispositions are held constant, practice is no longer associated with better musical abilities (Mosing et al., 2014). In the same vein, associations between music training and IQ, often interpreted as transfer, appear to stem from shared genetic influences that determine practice behavior and cognitive performance (Mosing et al., 2016).
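As background for the heritability figures, the classical way of estimating heritability from twin correlations is Falconer's formula. The sketch below is a textbook approximation under the usual twin-design assumptions and is not necessarily the model-fitting approach used in the cited twin studies. The twin correlations are hypothetical.

```python
def falconer_estimates(r_mz, r_dz):
    """Classical variance decomposition from twin correlations.

    r_mz: trait correlation in monozygotic twin pairs.
    r_dz: trait correlation in dizygotic twin pairs.
    Returns (heritability, shared environment, nonshared environment).
    """
    heritability = 2 * (r_mz - r_dz)  # additive genetic influences
    shared_env = 2 * r_dz - r_mz      # environment shared by co-twins
    nonshared_env = 1 - r_mz          # unique environment plus error
    return heritability, shared_env, nonshared_env


# Hypothetical twin correlations for hours of music practice.
h2, c2, e2 = falconer_estimates(r_mz=0.60, r_dz=0.35)
```

With these illustrative correlations, about half of the variance in practice would be attributed to genetic influences, which falls within the 40% to 70% range mentioned above.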
In short, samples of trained musicians differ from untrained individuals in music training, but they are also likely to differ in genetic and other environmental factors that affect our measures of interest. A promising line of research is starting to delineate the potential role of preexisting factors in associations between musical and nonmusical domains by asking whether musician-like advantages can be found in untrained individuals. These studies can be done by identifying individuals who have “naturally” good musical abilities. Such good abilities, in the absence of formal training, must be a consequence of genetics and/or informal engagement with music. For example, Mankel and Bidelman (2018) found that untrained adults with higher music perception abilities show a more efficient neural encoding of noise-degraded speech sounds, mirroring the benefits previously observed in musicians. Similarly, Swaminathan and Schellenberg (2017) found that better music perception abilities are associated with enhanced phoneme perception in a foreign language, regardless of music training. In a study of 6- to 9-year-old children, Swaminathan and Schellenberg (2019) documented an association between natural musical abilities and two aspects of language ability: phoneme perception and grammar. As for emotion recognition abilities, Correia et al. (2020) found that both self-report and performance-based measures of music perception skills correlate positively with emotion recognition in prosody and nonverbal vocalizations, even with music training held constant. Crucially, untrained participants with high musical abilities were as good as trained musicians at recognizing both types of vocal expressions.
These response patterns indicate that training might not be a necessary condition for advantages in vocal emotion recognition to emerge. Although cross-sectional differences are often attributed to training (Schellenberg, 2019), similar advantages can be seen in listeners without any training. In addition to highlighting the need to interpret cross-sectional evidence with caution, these findings also point to the multifaceted nature of musical expertise. While the musical expertise literature typically equates musical abilities with formal music training, an exciting avenue of research will be to consider the potential effects of other forms of engagement with music (Müllensiefen et al., 2014), as well as interactions with genetic influences. Such an approach will lead to a more complete understanding of associations between music and emotion recognition.
From Emotion Recognition to Broader Aspects of Socioemotional Functioning
An interesting question is whether music training relates to emotion recognition only, or whether it also relates to broader aspects of socioemotional competence such as empathy, emotion regulation, mentalizing, or social adjustment. Most studies focusing on these aspects are longitudinal studies with children, and a diversity of measures have been considered, such as general adaptive and maladaptive behavior (Gerry et al., 2012; Schellenberg, 2004, 2006), self-esteem (Costa-Giomi, 2004; Rickard et al., 2012, 2013), empathy (Rabinowitch et al., 2012), motivation (Degé & Schwarzer, 2017; Rickard et al., 2012), and socioemotional well-being (Kim & Kim, 2017; Rose et al., 2017; Welch et al., 2014).
Evidence of positive effects of music training is currently weak and mixed. Schellenberg et al. (2015), for instance, tested whether 8- and 9-year-old children improved in prosocial skills after 10 months of music lessons, in comparison to children who took no lessons. Music training led to improvements in sympathy and prosocial behaviors, but only in children who had poor prosocial skills before training. In a prior study with 6-year-olds, Schellenberg (2004) found that drama lessons improved adaptive social behavior, but music training (including keyboard and vocal training) did not. Similarly, music lessons were not related to social behavior in a later cross-sectional study with 6- to 11-year-old children (Schellenberg, 2006).
Mixed evidence of training effects has also been found for measures of self-esteem. In a study with 9-year-olds, Costa-Giomi (2004) found that individual piano lessons for 3 years did not improve self-esteem more than no lessons, even though there was a trend-level advantage. By contrast, Rickard et al. (2013) found that school-based music classes prevented the decline in self-esteem measures that was observed in the control group in children from two age cohorts (≈ 6-year-olds and ≈ 9-year-olds). The different response patterns may stem from the fact that music training was group-based in Rickard et al. (2013), and individual in Costa-Giomi (2004). In fact, most studies reporting positive effects of music training on socioemotional skills relied on group-based programs (e.g., Gerry et al., 2012; Schellenberg et al., 2015). Group-based programs arguably emphasize the engagement of the social functions of music (Koelsch, 2014), which could allow for socioemotional effects to emerge. It is crucial to examine systematically whether differential results emerge depending on the specific type of music training (group-based vs. individual), and whether benefits observed for emotion recognition abilities relate to other socioemotional skills and to everyday outcomes of social functioning.
Concluding Remarks
The reviewed studies suggest an association between music training and an enhanced ability to recognize emotions in vocal expressions. This association has been shown for a variety of vocal stimuli, including emotional prosody and purely nonverbal vocalizations. Initial evidence further indicates that the advantage might not simply result from better general cognitive abilities. Nevertheless, it appears to be restricted to the auditory domain, because it does not seem to extend to emotion recognition in faces. This conclusion remains tentative, however, because only a few studies have examined emotional stimuli other than voices. Finding associations between music training and socioemotional processing highlights the importance of including socioemotional processes as part of the quest to identify links between musical and nonmusical domains, which has focused primarily on speech perception and general cognitive abilities.
A crucial goal of future research is to delineate the precise mechanisms of the musician advantage in vocal emotion recognition. There is some evidence that auditory-perceptual enhancements might play a role, which is consistent with the idea of overlapping sensory pathways for processing music and vocal emotional information. Whether this overlap extends to higher order stages of socioemotional processing remains unclear. EEG and fMRI studies will be useful to address these issues by helping to determine when and where in the brain these cross-domain interactions occur. Importantly, the causal direction of the effect also remains to be determined. It has been assumed that music training causes emotion recognition enhancements, but longitudinal evidence for this is limited. The issue of causality is particularly relevant in light of evidence that musical abilities and training have a genetic component, and that musician-like advantages in emotion recognition can be seen in untrained individuals with naturally good musical abilities. Well-powered and well-designed longitudinal studies, which include random assignment, an active control group, and careful assessment of confounding variables (e.g., socioeconomic status, personality), are needed to address the issue of causality. Such studies will improve our understanding of the associations between music and socioemotional processing, and will inform debates on music-based interventions in clinical, community, and educational contexts.
Supplemental Material
Supplemental material, sj-docx-1-emr-10.1177_17540739211022035 for Does Music Training Improve Emotion Recognition Abilities? A Critical Review by Marta Martins, Ana P. Pinheiro and César F. Lima in Emotion Review
Footnotes
Correction (October 2023):
The article has been updated for funding statements.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was funded by the Portuguese Foundation for Science and Technology (FCT; grants IF/00172/2015 and PTDC/PSI-GER/28274/2017), and co-funded by the European Regional Development Fund (ERDF) through the Lisbon Regional Operational Program (LISBOA-01-0145-FEDER-028274) and the Operational Program for Competitiveness and Internationalization (POCI-01-0145-FEDER-028274).
Supplemental Material
Supplemental material for this article is available online.
