Abstract
Most research on cross-cultural emotion recognition has focused on facial expressions. To integrate the body of evidence on vocal expression, we present a meta-analysis of 37 cross-cultural studies of emotion recognition from speech prosody and nonlinguistic vocalizations, including expressers from 26 cultural groups and perceivers from 44 different cultures. Results showed that a wide variety of positive and negative emotions could be recognized with above-chance accuracy in cross-cultural conditions. However, there was also evidence for in-group advantage with higher accuracy in within- versus cross-cultural conditions. The distance between expresser and perceiver culture, measured via Hofstede’s cultural dimensions, was negatively correlated with recognition accuracy and positively correlated with in-group advantage. Results are discussed in relation to the dialect theory of emotion.
Introduction
Evidence for cross-cultural differences and universals in emotion has largely depended on the study of facial expressions (e.g., Elfenbein & Ambady, 2002; Scherer, Clark-Polner, & Mortillaro, 2011). Photographs created in the United States were shown to observers from countries as diverse as Argentina, Brazil, Chile, Japan, and Papua New Guinea, who were asked to make multiple-choice judgments of which emotional states were intended among the six “basic” categories of anger, disgust, fear, happiness, sadness, and surprise (Ekman, 1972; Izard, 1971). Observers generally achieved greater accuracy than would be expected from chance guessing alone (that is, 16.7% with six response categories), and so began an account of emotion as “universal.” Since that time, more nuanced accounts have raised the question of whether the nature of this evidence is sufficient to claim universality (e.g., Russell, 1994). Given how central this body of work has been to the continuing debate about universals and cultural differences, the current article asks to what extent such evidence exists for emotion expressed through the voice.
The voice provides a rich medium for nonverbal communication of emotion—whenever we vocalize, we do not only convey the meaning contained in the words we use, but also convey emotional information through speech prosody and nonlinguistic vocalizations. Recent studies have demonstrated how a wide variety of both positive and negative emotions can be recognized from vocal expressions (e.g., Cowen, Elfenbein, Laukka, & Keltner, 2019). There is also a sizable literature on cross-cultural vocal emotion recognition (for reviews, see Laukka et al., 2016; Scherer et al., 2011). In this article, we present a quantitative review of this work, in which findings from this literature are aggregated into a meta-analysis that assesses the extent to which emotions expressed through the voice can be recognized across cultures.
The present investigation is grounded in dialect theory (Elfenbein, 2013). Research in the field of linguistics proposes that spoken language continually evolves, which leads to different dialects across groups of people who become separated by cultural or geographical boundaries (O’Grady, Archibald, Aronoff, & Rees-Miller, 2001). Using a linguistic metaphor, dialect theory argues that there are subtle differences across cultures in the style of emotional expression—much like the dialects of a spoken language—which become greater the more culturally distant the groups are from each other. As in linguistic dialects, expressive dialects are largely recognizable across groups and yet some of the meaning can get lost along the way. This leads to an empirical observation of in-group advantage, namely that emotion recognition is more accurate when judging expressions from one’s own cultural in-group compared to cultural out-groups. The current meta-analysis was designed to test three corresponding hypotheses from dialect theory:
Hypothesis 1: There is accuracy at levels above chance in the cross-cultural recognition of emotion in the voice.
Hypothesis 2: There is in-group advantage across cultures in the recognition of emotion in the voice.
Hypothesis 3: In-group advantage in vocal emotion recognition is greater the more cultural distance there is between the expresser and perceiver cultural groups.
Two previous meta-analyses have considered cross-cultural recognition of emotion in the voice (Elfenbein & Ambady, 2002; Juslin & Laukka, 2003). Both reported that vocal expressions were recognized with better than chance accuracy across cultures, and both found evidence for in-group advantage. However, these research syntheses were limited by the small number of studies available at the time, and thus could report results for only a limited number of individual emotions. For example, Juslin and Laukka (2003) reported cross-cultural recognition rates for only five emotions, and Elfenbein and Ambady (2002) reported vocal expression results for only four emotions. In our update, we benefit from all the cross-cultural studies published in the past 2 decades and consider a wide range of both positive and negative emotions.
Materials and Methods
Study Details and Criteria for Selection and Inclusion
A total of 37 cross-cultural vocal expression studies were included in the meta-analysis (see online supplemental Table S1 for details). These studies provide data from more than 4,210 perceivers’ judgments of vocal expressions from more than 272 expressers (some studies do not report the numbers of perceivers and/or expressers). The median number of perceivers and expressers for each pairwise cultural comparison was 30 and 3, respectively. Expressers came from 26 different cultural groups or nations and perceivers came from 44 different cultures. Culture was usually defined by the original authors as either country of origin or as the native language of expressers and perceivers.
Studies for inclusion were identified through previously published reviews on the topic (Elfenbein & Ambady, 2002; Juslin & Laukka, 2003; Laukka et al., 2016; Scherer et al., 2011). In addition, computer-based searches of databases (PubMed, Web of Knowledge, PsycArticles, IEEE Xplore) were performed; search string = (speech OR voice OR vocal OR prosody) AND (emotion* OR affect*) AND cultur*.
Only studies that provided estimates of the accuracy of emotion recognition were included. This means that studies had to include an objective criterion for accuracy. Typically, this criterion was whether perceivers recognized the emotion that the expresser was trying to convey. Nearly all included studies used expressions portrayed by actors, which means that the intended expression was known to the experimenters. One possible exception was a study by Chung (2000) that used material selected from television interviews. The majority of studies also used a forced-choice method wherein participants were asked to choose their response from a list of predefined response alternatives. However, a few studies that used rating scales or free responses were included because they also provided data on the proportion of accurate responses (Abelin & Allwood, 2000; Gendron, Roberson, van der Vyver, & Barrett, 2014; Huang, Erickson, & Akagi, 2008). The accuracy criterion excluded studies that investigated emotion recognition using dimensional ratings (e.g., activation, valence, emotion intensity, appraisal dimensions), which do not provide measures of accuracy (e.g., Koeda et al., 2013; Nordström, Laukka, Thingujam, Schubert, & Elfenbein, 2017; Pfitzinger, Amir, Mixdorff, & Bösel, 2011).
We included any emotion that individual researchers considered worthy of inclusion in their experiments. However, we excluded judgments of personality traits, attitudes, and linguistic functions of prosody. In total, the included studies provided accuracy estimates for 25 different emotions: achievement, amusement, anger, awe, calmness, contempt, contentment, desire (for food), disgust, doubt, embarrassment, fear, happiness, interest, irony/sarcasm, lust, pain, pride, relief, sadness, shame, surprise, sympathy, tenderness, and neutral. The median number of emotions in the included studies was 5. For more details about how different labels were classified, see online supplemental Table S2.
Further, only studies that provided data from at least one within-cultural condition and at least one cross-cultural condition were included. This could be achieved in two different ways. The same set of vocal stimuli could be judged both by perceivers from the same cultural group as the stimuli and by perceivers from another cultural group. We call this type of study the “many-on-one” design in Table S1. Alternatively, the same group of perceivers could judge stimuli both from expressers from their own culture and from a different cultural group. This type of design is labelled “one-on-many” in Table S1. Balanced designs (where vocal stimuli from at least two different cultures are judged by perceivers from each culture) provide data for both types of cultural comparisons. Studies that investigated emotion perception only in cross-cultural conditions were not included because they do not provide data on in-group advantage (e.g., Laukka et al., 2013).
Only studies that investigated healthy adults (approximately 18–65 years) were included. In cases where studies provided data from several participant samples (e.g., from different age groups), we only made use of the data based on adults (e.g., Chronaki, Wigelsworth, Pell, & Kotz, 2018; McCluskey & Albas, 1981). Studies on clinical populations were not included.
Finally, we only included studies which utilized nondegraded vocal stimuli, either in the form of short sentences or words (speech prosody) or in the form of nonlinguistic vocalizations (e.g., laughter, crying, shrieks, grunts, sighs). Studies that used stimuli that were manipulated (e.g., content-filtered) or based on speech synthesis were not included (e.g., Yanushevskaya, Gobl, & Ní Chasaide, 2018), because stimulus manipulations may obscure cultural variation in expressive style. In cases where several types of stimuli were used in the same study, we only used data for nonmanipulated speech (e.g., Kramer, 1964).
Recorded Variables and Data Collection
The most common way of reporting data in emotion recognition studies is to present the proportion of correct responses (percentage accuracy). Therefore, percentage accuracy provides a natural and inclusive effect size index that is easy to interpret and can be used in the analysis of both cross-cultural emotion recognition and in-group advantage (Elfenbein & Ambady, 2002). The included 37 studies provided data on 191 pairwise comparisons between expresser and perceiver culture. We extracted percentage accuracy for each reported emotion and Expresser × Perceiver combination from each included study. One problem with comparing percentage accuracy across studies is that different studies use different numbers of response alternatives in the judgment task. Following Juslin and Laukka (2003), percentage accuracy values were therefore transformed into Rosenthal and Rubin’s (1989) proportion index (PI). PI values were also used to calculate a measure of in-group advantage (defined as the difference between within-cultural and cross-cultural accuracy) for each pairwise cultural comparison.
PI presents an effect size index for one-sample, multiple-choice data, and enables the transformation of accuracy scores involving any number of response alternatives into a standard scale of dichotomous choice, where .50 corresponds to chance-level responding and 1.00 corresponds to perfect recognition. In other words, in cases where there are more than two response alternatives, the raw proportion of correct responses is converted to the proportion of correct responses made if there had been only two response alternatives. PI is calculated using the formula PI = P(k − 1) / (1 + P(k − 2)), where P is the proportion of correct responses and k is the number of response alternatives (Rosenthal & Rubin, 1989). When calculating PI values for studies that used free responses to collect perceiver judgments, and thus did not have a fixed number of response alternatives, we used the number of expressed emotions as a proxy for the number of response alternatives.
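The transformation described above can be sketched in a few lines of Python. This is a minimal illustration of the stated formula only; the function name `proportion_index` and its argument names are our own assumptions and are not taken from any analysis script used for the meta-analysis.

```python
def proportion_index(p_correct: float, n_alternatives: int) -> float:
    """Rosenthal and Rubin's (1989) proportion index (PI).

    p_correct      -- raw proportion of correct responses (P)
    n_alternatives -- number of response alternatives (k)
    """
    if not 0.0 <= p_correct <= 1.0:
        raise ValueError("p_correct must be a proportion between 0 and 1")
    if n_alternatives < 2:
        raise ValueError("n_alternatives must be at least 2")
    k = n_alternatives
    # PI = P(k - 1) / (1 + P(k - 2))
    return p_correct * (k - 1) / (1 + p_correct * (k - 2))


# Chance-level responding (P = 1/k) maps to .50 for any number of alternatives.
chance_pi = [proportion_index(1 / k, k) for k in (2, 4, 6, 10)]

# Perfect recognition maps to 1.00.
perfect_pi = proportion_index(1.0, 6)

# With two alternatives, PI equals the raw proportion correct.
two_choice_pi = proportion_index(0.75, 2)
```

For free-response studies, `n_alternatives` would be set to the number of expressed emotions, following the proxy described above.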
The following variables were extracted from each study using data available in tables and figures: (a) characteristics of the sample (expresser culture, perceiver culture, number of expressers, number of perceivers); (b) study design (design type: balanced or not balanced; vocal stimulus type: speech prosody or nonlinguistic vocalizations; perception test type: forced-choice or rating scale/free responses; number of emotions; number of response alternatives); and (c) descriptive statistics of participants’ performance for each combination of emotion and expresser and perceiver culture (percentage accuracy). For two of the included studies (Jiang, Paulmann, Robin, & Pell, 2015; Sauter, 2013), authors kindly provided percentage accuracy rates not included in the original publications.
In addition, we used Hofstede’s (2001) values for the dimensions power distance, individualism, masculinity, and uncertainty avoidance to calculate an index of cultural distance for each combination of expresser and perceiver culture. This was done by calculating the Euclidean distance between culture pairs in a four-dimensional space using cluster analysis. A handful of studies included participants from isolated cultural contexts to investigate emotion recognition across cultures with very little common exposure, such as mass media (Bryant & Barrett, 2008; Cordaro, Keltner, Tshering, Wangchuk, & Flynn, 2016; Gendron et al., 2014; Sauter, 2013; Sauter, Eisner, Ekman, & Scott, 2010). For these cultures, there were no available data regarding Hofstede’s dimensions. In order not to lose these valuable studies from the analysis, we conservatively gave comparisons including remote cultures the same index value as the largest difference that was observed among the included globalized cultures (which was the difference between Japan and The Netherlands).
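The distance index can be illustrated with a short Python sketch. The culture names and dimension scores below are placeholders invented for the example; they are NOT Hofstede's published values. The helper names are hypothetical, and treating within-cultural comparisons as distance zero is our assumption for the illustration.

```python
import math
from itertools import combinations

# Order of dimensions: power distance, individualism, masculinity,
# uncertainty avoidance. Scores are illustrative placeholders only.
scores = {
    "culture_a": (40, 90, 60, 45),
    "culture_b": (55, 45, 95, 90),
    "culture_c": (38, 80, 14, 53),
}


def cultural_distance(a, b):
    """Euclidean distance between two cultures in the four-dimensional space."""
    return math.dist(a, b)


# Distances for every pair of cultures that has dimension scores.
pairwise = {
    frozenset(pair): cultural_distance(scores[pair[0]], scores[pair[1]])
    for pair in combinations(scores, 2)
}
max_distance = max(pairwise.values())


def distance_or_max(expresser, perceiver):
    """Distance for a culture pair; pairs involving a culture without
    scores receive the largest distance observed among scored cultures."""
    if expresser == perceiver:
        return 0.0  # assumption: within-cultural comparisons have zero distance
    if expresser not in scores or perceiver not in scores:
        return max_distance
    return pairwise[frozenset((expresser, perceiver))]


remote_pair = distance_or_max("culture_a", "remote_culture")
```

In this sketch, a comparison involving `"remote_culture"` (a hypothetical group without scores) is assigned `max_distance`, mirroring the conservative rule described above.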
Results
Table 1 displays cross-cultural recognition accuracy in terms of Rosenthal and Rubin’s (1989) proportion index (PI) for the most frequently investigated emotions, as well as overall performance averaged across positive and negative emotions. Results for the full set of 25 emotions are shown in online supplemental Table S3. All individual emotions were recognized with cross-cultural accuracy above chance, as indicated by 95% confidence intervals (CIs) not including .50 (which is the level of chance responding for PI), although it should be noted that for a few emotions (calmness, love/tenderness), the number of data points was too small for calculating CIs. Negative emotions were overall recognized with higher cross-cultural accuracy (M = 0.849, SD = 0.095) than positive emotions (M = 0.797, SD = 0.098), t(107) = 4.74, p < .001, d = 0.561.
Table 1. Cross-cultural recognition accuracy and in-group advantage for selected emotions.
Note. (a) Overall recognition accuracy is based on all included emotion categories in each respective study. (b) Positive emotions included: achievement, amusement, awe, calm, contentment, desire (for food), happiness, interest, love, lust, pride, relief, and sympathy. (c) Negative emotions included: anger, contempt, doubt, disgust, embarrassment, fear, pain, sadness, and shame.
Regarding in-group advantage, recognition accuracy was overall higher in within-cultural conditions (M = 0.897, SD = 0.061) compared to cross-cultural conditions (M = 0.843, SD = 0.075; more details about within-cultural recognition rates in the current sample are available in Table S3). In-group advantage (defined as the difference in recognition accuracy between in-group and out-group comparisons) was overall significant for both “many-on-one” contrasts (M = 0.055, SD = 0.067, d = 0.821) and “one-on-many” contrasts (M = 0.051, SD = 0.065, d = 0.785), as indicated by 95% CIs not including zero (see Table 1). In-group advantage was significantly larger for positive emotions (M = 0.072, SD = 0.091) than for negative emotions (M = 0.046, SD = 0.070) for “one-on-many” contrasts, t(52) = 2.04, p = .046, d = 0.325. However, the difference between positive (M = 0.071, SD = 0.091) and negative (M = 0.052, SD = 0.084) in-group advantage was only marginally significant for “many-on-one” contrasts, t(96) = 1.96, p = .053, d = 0.219. Across all emotions, cross-cultural recognition accuracy was negatively correlated with in-group advantage for both “many-on-one” contrasts (r = −.74, p < .001, N = 119) and “one-on-many” contrasts (r = −.61, p < .001, N = 69), suggesting that in-group advantage was larger for expressions that were relatively harder to recognize.
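As a rough check on the effect sizes above, the reported d values for in-group advantage are consistent with a one-sample standardized mean difference, that is, the mean difference score divided by its standard deviation. The sketch below assumes this is how d was obtained, which the text does not state explicitly, and uses the rounded means and standard deviations reported above; the function name is hypothetical.

```python
def one_sample_d(mean_diff: float, sd_diff: float) -> float:
    """Standardized mean difference against zero: d = M / SD."""
    return mean_diff / sd_diff


# Reported (rounded) in-group advantage statistics in PI units.
many_on_one_d = one_sample_d(0.055, 0.067)
one_on_many_d = one_sample_d(0.051, 0.065)

print(round(many_on_one_d, 3))  # compare with the reported d = 0.821
print(round(one_on_many_d, 3))  # compare with the reported d = 0.785
```

Because the inputs are rounded to three decimals, agreement with the reported values should be read as approximate rather than exact.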
We further explored differences in accuracy beyond the dichotomous status of in-group versus out-group, and examined to what extent cultural distance led to relatively lower accuracy. Overall recognition accuracy (including both within- and cross-cultural data points) was negatively correlated with cultural distance (r = −.37, p < .001, N = 186). A negative correlation was also observed between cross-cultural recognition accuracy (including only cross-cultural data points) and cultural distance (r = −.22, p = .014, N = 129). We also directly calculated the correlations between cultural distance and in-group advantage, and observed positive correlations for both “many-on-one” contrasts (r = .21, p = .026, N = 116) and “one-on-many” contrasts (r = .22, p = .072, N = 69), although the latter association did not reach conventional criteria for statistical significance. As a whole, these correlations suggest that emotion recognition accuracy decreases, and in-group advantage increases, as cultural distance between expresser and perceiver increases.
Finally, we investigated the effect of moderator variables (see Table 2). Large studies (as indicated by larger numbers of expressers, perceivers, and emotions) were associated with higher cross-cultural recognition rates and smaller in-group advantage. Balanced studies were further associated with smaller in-group advantage compared to unbalanced studies, and forced-choice studies were associated with smaller in-group advantage compared to studies using free responses and rating scales. No significant associations were found for stimulus type (speech prosody or nonlinguistic vocalization).
Table 2. Correlations (Pearson r) between cross-cultural recognition accuracy and in-group advantage and moderator variables.
Note. Boldfaced results indicate p < .05.
Discussion
To supplement a large volume of research on facial expression, the goal of this study was to document the state of the evidence for cross-cultural similarities and differences in emotion expressed via the voice. Drawing from dialect theory, we hypothesized and found that emotion expressed through the voice was recognized across cultures at levels substantially above what would be expected from chance guessing alone. However, we also found evidence for cultural differences in the form of in-group advantage, in that vocal emotion was recognized more accurately when expressers and perceivers were from the same cultural group. Further, this in-group advantage was larger the greater the cultural distance between the expresser and perceiver groups. Taken together, the results of this study support the dialect theory of emotion (Elfenbein, 2013; Elfenbein & Ambady, 2002).
Although individual studies usually included a fairly limited number of emotions, the aggregated results suggest that a rich repertoire of emotions—not limited to the usual six or so basic emotions, and including several positive emotions beyond happiness—can be recognized cross-culturally with accuracy well above chance. Indeed, out of 25 included emotions, all were recognized with better than chance accuracy. (Note that for two emotions, there were not enough data to calculate interval estimates.) In terms of Rosenthal and Rubin’s (1989) PI, our estimate of overall cross-cultural accuracy (M = 0.843) was very similar to the one obtained in a previous meta-analysis (see Juslin & Laukka, 2003)—although the current meta-analysis included 5 times more studies and emotions. A PI value of 0.84 corresponds to a raw accuracy score of 0.57 in a forced-choice task with five response options (which was the median number among the included studies), which in turn is approximately 3 times higher than the 1 in 5 rate that would be expected by chance responding. Based on the current evidence, this seems to be a fairly robust estimate of average cross-cultural accuracy for vocally expressed emotions.
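The conversion from PI back to raw accuracy follows from solving the PI formula for P, which gives P = PI / ((k − 1) − PI(k − 2)). The sketch below works through the arithmetic for a PI of 0.84 and five response options; the function names are our own and purely illustrative.

```python
def proportion_index(p_correct: float, n_alternatives: int) -> float:
    """PI = P(k - 1) / (1 + P(k - 2)) (Rosenthal & Rubin, 1989)."""
    k = n_alternatives
    return p_correct * (k - 1) / (1 + p_correct * (k - 2))


def accuracy_from_pi(pi_value: float, n_alternatives: int) -> float:
    """Invert the PI formula: P = PI / ((k - 1) - PI(k - 2))."""
    k = n_alternatives
    return pi_value / ((k - 1) - pi_value * (k - 2))


raw_accuracy = accuracy_from_pi(0.84, 5)        # about 0.57
ratio_to_chance = raw_accuracy / (1 / 5)        # a little under 3 times chance
round_trip = proportion_index(raw_accuracy, 5)  # recovers 0.84

print(round(raw_accuracy, 2), round(ratio_to_chance, 2), round(round_trip, 2))
```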
In terms of cultural differences, our estimate of overall in-group advantage (M ≈ 0.05) was also very similar to that reported by Juslin and Laukka (2003). Vocal expressions are considered to be at least partly based on appraisal and emotion-related physiological changes (Scherer, 1986), and we have previously argued that this may prevent drift of expressive styles past the point of mutual unintelligibility (Laukka et al., 2016). Accordingly, overall in-group advantage was not large in terms of proportion of correct responses, although it was systematic enough that it did represent a large effect size (d ≈ 0.80).
Results further underscore that accuracy and in-group advantage are not static phenomena, but vary across cultural conditions. Dialect theory proposes that differences in expressive style grow as groups become more culturally distant, resulting in less accurate cross-cultural communication. We accordingly observed that increased cultural distance was associated with decreased accuracy and increased in-group advantage. Although effect sizes were only small to medium, these findings demonstrate the benefits of meta-analysis, because it has been difficult for individual studies to obtain samples with enough cultural variability to investigate this issue (e.g., Laukka et al., 2016; Scherer, Banse, & Wallbott, 2001).
In-group advantage also seemed to vary across emotions. The number of observations was rather small for several emotions (see Table S3), which makes it difficult to draw conclusions about the significance of in-group advantage for individual emotions. However, we note that among the most frequently investigated ones (presented in Table 1), relief and sadness did not show consistent evidence for in-group advantage. Interestingly, sadness had small in-group advantage for vocal expression in both Elfenbein and Ambady (2002) and in Juslin and Laukka (2003). Results further suggest that in-group advantage may be larger for expressions with positive versus negative affective valence. This observation can be interpreted in terms of the functions of positive and negative emotions. Sauter et al. (2010) suggested that communication of positive emotions facilitates social cohesion with group members, but that such affiliative signals may not always be shared with nonmembers. Signals of negative emotions may instead be more closely related to biological reactions to various threats, and less influenced by cultural learning. On a more general level, in-group advantage was strongly negatively correlated with cross-cultural recognition accuracy. We speculate that knowledge of cultural differences in expressive style may become more important when judging expressions that are more ambiguous. Clearly, more research is needed to understand how and why in-group advantage varies across emotions.
Results are in line with dialect theory, but do not provide direct evidence for the mechanisms of the theory. However, we and others have recently tested propositions of dialect theory more directly. For example, Laukka, Neiberg, and Elfenbein (2014) used speech stimuli from Australia, India, Kenya, Singapore, and the United States, and trained acoustics-based classifier programs to recognize emotional expressions. For each model, training was based on stimuli from one cultural group, and recognition was based either on stimuli from the same group on which it was trained or on stimuli from a different group. Accuracy was at greater than chance levels regardless of whether classifiers were trained and tested on stimuli from the same versus different cultures. This provides evidence for basic universality in acoustic patterns. However, accuracy was higher when classifiers were trained and tested on stimuli from the same culture versus different cultures. This, in turn, provides direct evidence for systematic cultural differences in expressive style.
In another study, we examined patterns of fundamental acoustic properties in emotional expression within and across cultural groups (Laukka et al., 2016). It was found that acoustic cues were used in a relatively similar way across groups both to produce and judge emotions, but there were also subtle cultural differences. Findings thus suggested that expressers appear to have culturally nuanced schemas for enacting vocal expressions, and perceivers appear to have culturally nuanced schemas for judging expressions. Consistent with dialect theory, in-group judgments showed a greater match between these schemas used for emotional expression and perception.
As a final example, Sauter (2013) showed that in-group advantage was observed when perceivers judged vocalizations from their own cultural group, even though they were unable to reliably infer the group membership of the expresser. Taken together, these studies speak against perceiver biases—such as increased motivation for judging in- versus out-group expressions (e.g., Thibault, Bourgeois, & Hess, 2006) or possible effects of decoding rules on emotion inferences (e.g., Matsumoto, 2002)—as the main origin of in-group advantage. Instead, they suggest strongly that the mechanism is subtle differences in expressive style.
Like any meta-analysis, this one is limited by the availability of original empirical work. As such, this quantitative review also exposes several important gaps in the literature. Notably, nearly all data points included in the meta-analysis were based on recognition of expressions portrayed by actors. This means that studies using spontaneous expressions will be an important next step, especially because recent research has revealed small but systematic differences in the acoustic patterns between posed and spontaneous expressions (e.g., Juslin, Laukka, & Bänziger, 2018). The majority of included studies also used a forced-choice methodology with a relatively small number of response alternatives, which has been criticized for artificially inflating recognition rates and imposing categorical interpretations on affective stimuli (e.g., Russell, 1994). Our analysis of moderator variables suggested that forced-choice studies were associated with smaller in-group advantage compared to studies using other response formats (see Table 2), but this finding is limited by the very small number of studies using other formats included in the current analysis. Measures have been proposed that account for the possibility that systematic biases in participants’ use of forced-choice response alternatives could inflate apparent accuracy (see, e.g., Wagner, 1993), and future cross-cultural work could report such sensitivity measures in addition to traditional accuracy rates. In a broader perspective, we believe that studies that combine forced-choice methods with open-ended responses and continuous scale ratings will be important for making progress in understanding cultural similarities and differences with regard to how perceivers infer meaning from vocal expressions (Cowen, Laukka, Elfenbein, Liu, & Keltner, 2019).
We did not find any effects of type of stimulus (speech prosody or nonlinguistic vocalizations) on cross-cultural recognition accuracy. This is in contrast to several within-cultural studies proposing higher recognition rates for vocalizations compared to speech prosody (e.g., Hawk, van Kleef, Fischer, & van der Schalk, 2009). However, the number of cross-cultural studies on vocalizations is currently rather small, which calls for more cross-cultural research that directly compares different types of vocal expressions. Such studies could preferably employ balanced designs, which are especially useful for evaluating in-group advantage. A balanced design can control for possible extraneous main effects across expresser and perceiver groups that are not related to culture, and test cultural differences in the form of an interaction effect. As more balanced cross-cultural studies become available, future meta-analyses could employ these interaction effects as a measure of effect size.
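To illustrate why a balanced design is useful here, the following sketch builds a hypothetical 2 × 2 table of PI accuracies from additive expresser and perceiver main effects plus an in-group bonus, and then recovers the bonus as an interaction contrast. All numbers are invented for the illustration, the names are our own, and this is not the effect size used in the present meta-analysis.

```python
# Hypothetical ingredients of accuracy (PI units); all values are made up.
baseline = 0.80
expresser_effect = {"A": 0.05, "B": -0.05}  # e.g., group A stimuli are clearer
perceiver_effect = {"A": -0.03, "B": 0.03}  # e.g., group B perceivers score higher
ingroup_bonus = 0.04                        # the cultural effect of interest

# accuracy[(expresser, perceiver)] for a balanced two-culture design
accuracy = {
    (e, p): baseline
    + expresser_effect[e]
    + perceiver_effect[p]
    + (ingroup_bonus if e == p else 0.0)
    for e in "AB"
    for p in "AB"
}


def interaction_ingroup_advantage(acc, a, b):
    """Mean of the two within-cultural cells minus mean of the two
    cross-cultural cells; additive main effects cancel out."""
    within = (acc[(a, a)] + acc[(b, b)]) / 2
    cross = (acc[(a, b)] + acc[(b, a)]) / 2
    return within - cross


balanced_estimate = interaction_ingroup_advantage(accuracy, "A", "B")

# A single unbalanced contrast (perceivers B judging own-group versus
# group A expressers) is confounded with the expresser main effect.
unbalanced_estimate = accuracy[("B", "B")] - accuracy[("A", "B")]

print(round(balanced_estimate, 3), round(unbalanced_estimate, 3))
```

In this toy example the interaction contrast returns the in-group bonus, whereas the single unbalanced contrast even has the wrong sign because group A stimuli are easier to recognize overall.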
Finally, we used Hofstede (2001) as the source for calculating our index of cultural distance, but other ways of measuring cultural distance may capture aspects that are more relevant for nonverbal behavior. We welcome more research that tries to link various measures of cultural differences with emotion recognition.
We conclude this review by noting that, in contrast with hundreds of studies on facial expressions, the collection of 37 studies of vocal emotion that we located is a far smaller number—but worthwhile in the corroborating evidence it provides. Taken together, the evidence reveals that perceivers can recognize emotional expressions via nonverbal behavior across cultural divides, and yet they do so better when staying within their own cultural boundaries.
Supplemental Material
Supplemental material, LaukkaandElfenbein_meta-analysis_suppl_material_R2 for Cross-Cultural Emotion Recognition and In-Group Advantage in Vocal Expression: A Meta-Analysis by Petri Laukka and Hillary Anger Elfenbein in Emotion Review
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was supported by the Marianne and Marcus Wallenberg Foundation (MMW 2018.0059).
Supplemental Material
Supplemental material for this article is available online.
