Abstract
The current study investigated what can be understood from another person’s tone of voice. Participants from five English-speaking nations (Australia, India, Kenya, Singapore, and the United States) listened to vocal expressions of nine positive and nine negative affective states recorded by actors from their own nation. In response, they wrote open-ended judgments of what they believed the actor was trying to express. Responses cut across the chronological emotion process and included descriptions of situations, cognitive appraisals, feeling states, physiological arousal, expressive behaviors, emotion regulation, and attempts at social influence. Accuracy in terms of emotion categories was overall modest, whereas accuracy in terms of valence and arousal was more substantial. Coding participants’ 57,380 responses yielded a taxonomy of 56 categories, which included affective states as well as person descriptors, communication behaviors, and abnormal states. Open-ended responses thus reveal a wide range of ways in which people spontaneously perceive the intent behind emotional speech prosody.
The study of emotion perception has been a long-standing and fundamental research topic within psychology (e.g., Darwin, 1872/1965; Schlosberg, 1952). A better understanding of how individuals perceive nonverbal cues of emotion has bearing on theoretical issues at the heart of what we understand about emotion as well as practical issues in our understanding of interpersonal effectiveness and functioning (Buck, 2014; Hall et al., 2009; Rosenthal et al., 1979). The most common research in this tradition has investigated still photographs of facial expressions (Calvo & Nummenmaa, 2016; Elfenbein & Ambady, 2002). In addition to facial behavior, the voice is another important channel of nonverbal communication. The voice provides a rich source of affective information: as the old expression goes, it is not what you say but how you say it. Indeed, vocal cues are the most frequently reported cues to emotion in naturally occurring situations (Planalp et al., 1996). The current investigation attempts to add to what we know about perceiving vocal emotions and asks the basic question: What can be understood from another person’s tone of voice? In order to get a nuanced picture of the ways in which people spontaneously perceive emotional speech prosody, we investigate open-ended judgments of a wide range of both positive and negative emotions.
Open-Ended Judgment of Emotional Expression
In the most common source of evidence for the perception of emotion, participants are presented with stimuli that are intended to express various emotional states—usually performed by actors—and judgments are recorded by instructing participants to select the most appropriate label from a fixed list of options. This is called the forced-choice paradigm. Although it has provided much insight, the dominant paradigm has also been the subject of critique. Typical forced-choice studies include a relatively small number of options, which may artificially inflate recognition rates due to various reasons. Studies on facial expression suggest that perceivers tend to use informed guessing strategies (Russell, 1994). Some states may have clear signal characteristics, whereas other states do not and yet still appear to be well recognized after other response choices are eliminated. In implementing such strategies, perceivers may judge the same stimulus differently depending on which response options are available (Frank & Stennett, 2001; Russell, 1993). DiGirolamo and Russell (2017) even found that participants could reach consensus on nonsense options when other options were made available to them strategically (see also Nelson & Russell, 2016). The current work attempts to expand the body of evidence outside of this dominant paradigm. In particular, we use open-ended responses that do not limit participants to experimenter-imposed selections.
The forced-choice design is the most common because it is convenient and readily analyzed, and yet it does not capture all the nuances and complexities that may be present in people’s emotion inferences. Because the vast majority of existing research uses the forced-choice design, we know little about how listeners interpret emotion expressions naturalistically. The use of open-ended responses may be the most suitable for assessing participants’ spontaneous impressions, but this free response method is relatively rare due to the work required for coding, and the challenge in defining accurate performance.
Given this extra difficulty, typically free-response studies have added the constraint that respondents should only provide answers using single-word emotion labels—that is, they are free-labeling, rather than completely open-ended judgments. Several studies have compared recognition accuracy between free-labeling and forced-choice methods for facial expressions (e.g., Boucher & Carlson, 1980; Limbrecht-Ecklundt et al., 2013; Rosenberg & Ekman, 1995), vocal expressions (e.g., Abelin & Allwood, 2000; Gendron et al., 2014; Greasley et al., 2000), and music performance (Juslin, 1997). These studies show that free labeling results in accuracy that is lower than that of forced-choice but still acceptable. Restricting comparisons to samples judging stimuli from their own culture, accuracy with exact word matches ranged from 20.7% (Greasley et al., 2000) to 78.4% (Rosenberg & Ekman, 1995).
In the current study, we use free responses rather than free labeling, an approach that has been used in only a few studies. Frijda (1953) asked participants to rate film clips and film still images of facial expressions, and to describe what might be going on with the person who was being shown. He found that judges scored relatively well (about 50% received scores of 3 or 4 on an accuracy scale of 0–4 based on how close the response was to the intended state). Interestingly, he also found that they frequently described the situations that might elicit emotion rather than emotional states per se. Haidt and Keltner (1999) showed participants photographs of facial expressions and asked them to describe what happened to make a person feel that way. They also found that participants recognized the emotions with a wide range of success rates (from 16% for shame to 94% for happiness), and that the open-ended responses often included a range of situational descriptions. We expand on this small amount of existing work by studying the open-ended judgments of emotional prosody.
Emotional Prosody Across Emotions
This research focuses on one particular kind of vocal expression of emotion, namely vocal or speech prosody—that is, the tone of voice used while speaking words. Emotional prosody holds a special place among channels of nonverbal communication, because it is the only channel that necessarily involves simultaneous verbal communication.
Most research on the recognition of emotion via the voice has focused on a small number of emotions that are relatively well recognized (see Juslin & Laukka, 2003), usually, those that have historically been considered “basic” on the basis of research with facial expression—namely, anger, fear, disgust, happiness, sadness, and surprise (e.g., Ekman, 1992). Meta-analyses of forced-choice studies have shown that basic emotion categories are generally recognized at accuracy rates well above chance, both within and across cultures (Elfenbein & Ambady, 2002; Juslin & Laukka, 2003). Recent research in the area of vocal bursts (e.g., crying, laughter, grunts, and sighs), which are brief non-linguistic vocalizations that communicate emotion (Scherer, 1994), has started to expand the number of emotions. For example, Cordaro et al. (2016) examined vocal bursts of 16 emotional states and found above-chance accuracy in the recognition of the positive emotions of amusement, awe, contentment, desire for food and sex, interest, relief, serenity, and triumph, and the negative emotions of anger, contempt, disgust, embarrassment, fear, pain, sadness, and surprise.
The current work tests a far larger number of states than most other prosody research, including nine positive states, nine negative states, and neutral. The positive emotions are affection, amusement, happiness, interest, lust, pride, relief, serenity, and (positive) surprise. Although the small number of basic emotions tested typically includes only one positive state, namely, happiness, there has been more recent interest in expanding the number of positive states being tested in general (Sauter, 2017). Research suggests that a variety of positive emotions can be recognized from vocal bursts (Cordaro et al., 2016; Simon-Thomas et al., 2009), and the current study on vocal prosody therefore also samples a large number of positive states. The negative emotions are anger, contempt, disgust, distress, fear, guilt, sadness, shame, and (negative) surprise. Overall, these 18 emotions were selected to represent a diverse set that includes basic emotions, self-conscious emotions, interpersonal emotions, low arousal emotions, and emotions that are rarely studied in emotion recognition research. They vary in the extent to which they are considered prototypical (Fehr & Russell, 1984). It is important to include a wide range of emotions in open-ended studies to be able to capture the possible variety of judgments.
Research Questions
We attempt to expand the existing body of work in a number of ways. This study focuses on the underemphasized nonverbal channel of emotional prosody, attempts to expand the methodological repertoire beyond the dominant research paradigm, and samples from a wide variety of positive and negative emotions. It is a large-scale attempt to understand what listeners can hear in the voice. Given the relative dearth of research on the open-ended understanding of emotional prosody, we organize our investigation around three research questions, rather than around traditional hypothesis testing.
These research questions revolve around the more general question: What can people hear in the voice? Participants were provided with an open-ended task, namely to describe what they thought the speaker in each prosody stimulus was trying to express to the listener. We approach the analysis of participant responses in two different ways. In the first analysis, we examined the location of their response within the emotion componential process (Frijda, 2007; Scherer & Moors, 2019), which is described in the section that follows. In the second analysis, we examined the content of the response. That is, we coded the underlying meaning of the text that they wrote, notably the emotion that participants listed if any. This meaning was examined for its accuracy vis-à-vis the intended emotional state as well as for developing a typology of these responses.
Question 1: What Aspects of Emotion Do We Hear in Vocal Prosody?
In this research, we explore what reaction participants generate naturally with little constraint. Without instructing participants as to what the stimuli should contain, we ask them to describe what the speaker is trying to express to the listener.
Most research on the perception of affect takes as given that the stimuli will be judged as expressing a feeling state. This assumption is reflected in the nature of the response choices, which typically refer to discrete emotional categories. However, providing only feeling states as response choices may be presumptive (Russell, 1994). Vocal expression has the ability to convey not only categories but also emotion dimensions such as activation, valence, and potency (Belyk & Brown, 2014; Goudbeek & Scherer, 2010; Laukka et al., 2005). Listeners may also hear other phenomena in emotional prosody, including affect-related person attributes such as dominance and politeness (Hönemann et al., 2015; Shochi et al., 2009). The current study steps back from the feeling state assumption and actively evaluates what people think about emotional expressions.
Emotional feeling states are considered just one stage of the larger emotion process that consists of multiple components (Frijda, 2007; Scherer & Moors, 2019). These are: (a) the Situation, which is the underlying stimulus, event, or cause that elicits an emotion (e.g., “good news,” “someone did something to him”), (b) Cognitive appraisal, which is the evaluation of a situation, notably along the dimensions predicted by appraisal theories (e.g., Ellsworth & Scherer, 2003; Laukka & Elfenbein, 2012; Nordström et al., 2017), such as a stimulus’ relevance, goal conduciveness, normative significance, predictability, and the individual’s coping potential for the situation (e.g., “expected,” “important”), (c) Feeling state, which is the individual’s subjective emotional state (e.g., “angry,” “anxious”), (d) Physiological arousal, which is the level of arousal the individual experiences (e.g., “excited,” “calm”), (e) Expressive behavior, which is any verbal or nonverbal behavior that conveys an emotional state either intentionally or unintentionally (e.g., “crying,” “laughing”), (f) Emotion regulation, which is any attempt to change an individual’s feeling states or their expression (e.g., “concealing,” “hiding”), and (g) Influence, which is any action taken with respect to another person (e.g., “commanding,” “empathizing”). It has been proposed that emotion expressions may provide information about several of the components of the emotion process (e.g., Ekman, 1997). The current study provides the first descriptive data on this issue based on spontaneous impressions.
Question 2: How Accurate Are Open-Ended Judgments of Emotions in Vocal Prosody?
The degree of accurate recognition is a question that looms large in the study of emotion. The current study takes four perspectives on what it means for an open-ended response to be correct. First, we examine whether the intended emotional category matches the category that was reported by the participants. The other three perspectives involve dimensions rather than discrete categories. The circumplex model of affect maps distinct emotional categories into a two-dimensional space. One axis refers to valence or hedonic tone—the intrinsic pleasantness of a stimulus—and the other axis refers to the level of intensity of arousal (Russell, 1980, 2003; Russell & Feldman Barrett, 1999). Along these lines, we examine accuracy in terms of whether the judgment matches the intended category in terms of its valence, its arousal, or its quadrant within the 2×2 of valence and arousal. We note that our analyses do not further distinguish the 18 emotional categories along the dimension of potency, for two reasons. First, the model of valence and arousal has been the dominant model in modern writing. Second, including potency would create an unwieldy 2×2×2 set of 8 cells, not all of which would be populated by the 18 emotional categories represented in this study.
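The four perspectives on accuracy can be illustrated with a minimal sketch. The valence and arousal assignments below are hypothetical placeholders for a handful of states, chosen only for illustration; the assignments actually used for the 18 categories are not given in this section, and the function name `score` is likewise an assumption.

```python
# Hypothetical valence/arousal assignments for five states (illustration only;
# these are NOT the assignments used in the study).
CIRCUMPLEX = {
    "anger": ("negative", "high"),
    "fear": ("negative", "high"),
    "sadness": ("negative", "low"),
    "happiness": ("positive", "high"),
    "serenity": ("positive", "low"),
}


def score(intended, judged):
    """Return the four accuracy flags for one intended/judged pair."""
    intended_valence, intended_arousal = CIRCUMPLEX[intended]
    judged_valence, judged_arousal = CIRCUMPLEX[judged]
    return {
        "categorical": intended == judged,
        "valence": intended_valence == judged_valence,
        "arousal": intended_arousal == judged_arousal,
        # Quadrant accuracy requires both dimensions to match.
        "quadrant": (intended_valence, intended_arousal)
        == (judged_valence, judged_arousal),
    }


# An anger portrayal judged as fear misses the category but, under the
# placeholder assignments above, matches on valence, arousal, and quadrant.
anger_as_fear = score("anger", "fear")
# An anger portrayal judged as sadness matches on valence only.
anger_as_sadness = score("anger", "sadness")
```

Under this scheme a judgment can be wrong categorically yet correct dimensionally, which is why the dimension-based measures can exceed categorical accuracy.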
In examining the accuracy of open-ended judgments of vocal prosody, we note that the term “accuracy” can be hard to define in a free-response format. Not only do judges need to be able to recognize emotional states from the acoustical cues but they also need to find those emotional states to be the most relevant feature of the stimulus when reporting their impressions to researchers. For this reason, one can argue that the content of such a judgment is as much a matter of salience as it is a matter of accuracy. For example, a response that a participant is “talking to a young child” is not necessarily right or wrong, because it does not contain a judgment of an emotional state. As an attempt to assuage this concern, we include instructions to the participants that direct their attention to the notion that actors are trying to express something in these stimuli, and their stated task is to report what the actors are attempting to express. Furthermore, below we limit the analysis of accuracy to those responses that were coded as feeling states.
Question 3: What Else Do People Hear in Vocal Prosody Other Than Emotional States?
The final goal of this free-response study is to develop a nuanced portrait of the inferences people make from others’ emotional expressions. These inferences are not limited to emotional categories and, indeed, we use participant responses to understand the nature of spontaneous impressions of vocal expressions. After coding the content of open-ended judgments, we rely on expert emotions researchers—full members of the International Society for Research in Emotion (ISRE)—in order to develop a taxonomy from these codes.
The Current Study
This empirical work uses data collected in five distinct nations—namely, Australia, India, Kenya, Singapore, and the United States. To prevent confounds across linguistic backgrounds, each of these groups is from an English-speaking country. Even with this linguistic commonality, the selected countries vary greatly with respect to their cultures. Considering the five dimensions in Hofstede’s model (www.geert-hofstede.com; Hofstede, 2001), (a) Australia and the United States are often considered individualistic societies, whereas India, Kenya, and Singapore are more collectivistic, (b) India, Kenya, and Singapore are considered high in power distance, whereas the United States and Australia are low, (c) Singapore is low in uncertainty avoidance, with the other four cultures having moderate levels, (d) in terms of masculinity, levels from lowest to highest are Singapore, India, Kenya, the United States, and Australia, and (e) long-term orientation is lower in Australia, Kenya, and the United States, and higher in India and Singapore. Collecting data in multiple nations allows us to increase diversity in the participant samples.
Stimuli come from the largest available database of vocal expressions (Vocal Expressions of Nineteen Emotions across Cultures [VENEC]; Laukka et al., 2010, 2016), which contains 18 different emotional states (9 positive and 9 negative), as well as neutral states, portrayed by 20 actors from each of five English speaking nations. In an effort to maximize the quality of vocal portrayals in producing diagnostic cues, stimuli were collected from professional actors. The developers sampled a number of speakers that is larger than usual from each culture, in order to reduce the possibility of specific item effects that are idiosyncratic to particular individuals. Overall, this stimulus set is of a far larger scale than those used in previous research on emotional prosody, with a rich set of emotional states and diverse set of national groups represented.
Taken together, this study is designed to examine the perception of emotion through the voice in a setting that imposes the lowest level of constraint feasible for participant responses. It is exploratory in that it embraces the diversity of potential responses and catalogs the open-ended impressions of the emotional prosody stimuli. We attempt to increase greatly the amount of data available to understand the open-ended judgment of affective states expressed through emotional prosody.
Method
The data that support the findings of this study are openly available via the Center for Open Science at: https://osf.io/8x6s7/?view_only=b41e545daca6412c82a5324b8dfe80dd.
Stimulus Materials
Vocal stimuli were sampled from the VENEC database, which consists of emotion portrayals by 100 professional actors across five English-speaking countries (Australia, India, Kenya, Singapore, and the United States; Laukka et al., 2010), and which was developed by the same author team as the current article. For the sake of consistency, we defined a “professional actor” as an individual who had been paid on at least one occasion for their acting. The database contains recordings from 20 actors from each culture (50% women). Each actor conveyed 18 affective states (affection, anger, amusement, contempt, disgust, distress, fear, guilt, happiness, interest, lust, negative surprise, positive surprise, pride, relief, sadness, serenity, and shame) with a moderately high level of emotion intensity, and also recorded emotionally neutral expressions. To maintain consistency across portrayals, the words spoken were one of two emotionally neutral phrases: “Let me tell you something” or “That is exactly what happened.”
The actors were first provided with scenarios based on the appraisal theory of emotion (e.g., Ellsworth & Scherer, 2003; Lazarus, 1991; Ortony et al., 1988). These scenarios describe typical situations in which each of the above emotions may be elicited, and actors were instructed to try to enact finding themselves in these situations (see Laukka et al., 2016). The protocol further asked the speakers to try to remember similar situations that they had experienced personally and that had evoked the specified emotion. They were asked, if possible, to try to put themselves into the same emotional state of mind. The method of acting out emotional episodes by reactivating past emotional experiences is common among actors (e.g., Stanislavski, 1936), and these instructions were used to encourage consistency in the methods used across the individuals providing stimuli.
The recordings were conducted on location in each country (Brisbane, Australia; Pune, India; Nairobi, Kenya; Singapore, Singapore; and Los Angeles, United States) with conditions standardized across locations. The actors’ speech was recorded directly onto a computer with 44 kHz sampling frequency using a high-quality microphone (sE Electronics USB2200A, Shanghai, China). To enable a wide dynamic range while avoiding clipping, each actor first produced sample stimuli, and recording levels were optimized and held constant based on the loudest sample. The sound level of the stimuli was normalized separately for each speaker.
The stimulus set consisted of 380 items from each of the five cultures (total number of stimuli = 1,900), with 20 actors posing 18 emotional states plus neutral. To reduce fatigue, each participant listened to a subset of half of these 380 stimulus items. Two equivalent sets of 190 stimuli were created that contained five male and five female actors each posing these states.
Participants
A total of 302 participants took part in the study, from Australia (N = 61, 26 females), India (N = 60, 30 females), Kenya (N = 58, 29 females), Singapore (N = 51, 27 females), and the United States (N = 72, sex not recorded). In each nation, participants were undergraduate university students who were born and raised in their respective countries, and self-reported that they spoke English at native fluency. For determining the sample size, there are no specific power tests for open-ended qualitative data, and so we conducted a power test for the ANOVA comparing quantitative results across 18 emotions for one culture at a time. Using G*Power (Faul et al., 2009), assuming a small effect size (r = 0.15), a sample of N = 50 was predicted a priori to have power of .85 for a within-subjects design with 18 repeated measures. Each nation’s sample exceeded this number.
Procedure
Participants listened to each stimulus and provided open-ended judgments of the stimuli from their own cultural group. MediaLab software (Jarvis, 2008) presented stimuli in a randomized order for each participant and recorded responses. Instructions were as follows:
In this study we want to examine what kinds of information people can tell from another person’s vocal tone. You will hear a number of phrases that were spoken by professional actors trying to show how they would express themselves in various situations. These expressions might involve emotions, intentions, or other kinds of feelings that a person could have. For each phrase expressed, we want you to think about what kind of feeling would likely have elicited such a vocal expression. For each phrase, we ask you to describe what you think the actor is trying to express to you, the listener. You can use a single word for your response, up to a maximum of one line of text.
A total of 57,380 responses were generated from the 302 participants judging 190 stimuli each. These consisted of 18,217 unique text strings.
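The design counts reported above can be checked with simple arithmetic. The sketch below uses only figures stated in the text; the variable names are illustrative.

```python
# Stimulus set: 20 actors per nation, each posing 18 emotions plus neutral.
actors_per_nation = 20
states = 18 + 1
nations = 5

items_per_nation = actors_per_nation * states   # 380
total_stimuli = items_per_nation * nations      # 1,900

# Each participant heard one of two half-sets: 5 male + 5 female actors.
actors_per_subset = 5 + 5
items_per_participant = actors_per_subset * states  # 190

# Participants per nation, as reported in the Participants section.
participants = {
    "Australia": 61,
    "India": 60,
    "Kenya": 58,
    "Singapore": 51,
    "United States": 72,
}
total_participants = sum(participants.values())  # 302

total_responses = total_participants * items_per_participant  # 57,380

print(items_per_nation, total_stimuli, items_per_participant)
print(total_participants, total_responses)
```

The half-set of 190 items also implies that each participant judged 10 portrayals of each of the 19 states.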
Coding for Emotion Process
We coded the open-ended responses for seven stages of the componential emotion process. Based on theoretical models, these stages were: Situation, Cognitive appraisal, Feeling state, Physiological arousal, Expressive behavior, Emotion regulation, and Influence attempts, plus an Other category for responses that could not be coded. The second author (PL) coded all stimuli, and a research assistant provided independent ratings of all stimuli. Each text string could be assigned to multiple emotion process categories. The coder could list categories in order of their first, second, and even third choice. Before calculating reliability, partial credit was assigned to the top two process codes if more than one code was listed. The average (r) and total (R) inter-rater reliability varied for each stage of the emotion process: Situation r = .50, R = .66, Cognitive appraisal r = .63, R = .77, Feeling state r = .71, R = .83, Physiological arousal r = .79, R = .88, Expressive behavior r = .29, R = .45, Emotion regulation r = .53, R = .70, and Influence attempts r = .71, R = .83. Based on these reliability figures, data on expressive behavior appear for completeness, and the tables in which these data appear note that the coding did not reach conventional reliability. Disagreements were handled by assigning proportional credit to the different codes chosen. This meant that any particular text string could be considered half part of one category and half part of another.
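The proportional-credit rule for disagreements can be sketched as follows. This is a simplified illustration that assumes each coder contributes exactly one code per text string; it does not model the ranked first, second, and third choices described above, and the function name `proportional_credit` is hypothetical.

```python
from collections import Counter
from fractions import Fraction


def proportional_credit(codes):
    """Split one unit of credit across the categories chosen by the coders.

    `codes` holds one category label per coder (a simplifying assumption).
    Each category receives credit equal to the share of coders who chose it.
    """
    counts = Counter(codes)
    n_coders = len(codes)
    return {category: Fraction(k, n_coders) for category, k in counts.items()}


# Two coders who disagree: the string counts half toward each category.
split = proportional_credit(["Feeling state", "Influence attempts"])

# Two coders who agree: the string counts fully toward one category.
agreed = proportional_credit(["Feeling state", "Feeling state"])
```

Because every text string always contributes exactly one unit of credit in total, category percentages computed from these weights still sum to 100%.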
Coding for Stimulus Content
To capture the diversity of the 18,217 unique text strings within a manageable yet comprehensive set of categories, the first author (HAE) coded 5,000 strings from the first three countries in which data were collected (India, Kenya, and the United States). These 5,000 included every text string that appeared at least twice. Beginning with the originally intended categories portrayed by the actors, she added a new category any time that a frequently-used text string could not be coded into the existing categories. She also added a new category any time that at least three separate text strings were related to each other and could not be coded into an existing category. When data arrived from the other two countries (Australia and Singapore), she coded the 115 new text strings that appeared 5 or more times, and confirmed that there were no further changes to the coding system. These 5,115 unique text strings used to generate the coding system represented 43,441 (75.7%) of the total participant responses. This process resulted in a total of 56 unique categories, of which 32 referred to affective states. Both these 32 affective states and the remaining 24 categories are described below.
Four raters used this coding system. In addition to the first author, two research assistants entered independent responses for each text string from all five nations, and an additional research assistant entered independent responses for the first three countries that had complete data. The average inter-rater reliability (r) was .75, and total reliability (R) was .90 on the basis of an average of 3.6 ratings per text string. As with the emotion process coding, disagreements were handled by assigning proportional credit to the different categories chosen.
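The text does not state how total reliability (R) was derived from average inter-rater reliability (r). The reported pairs are close to what the Spearman-Brown formula yields, so the sketch below assumes that formula; this is an assumption for illustration, not a description of the authors' computation.

```python
def spearman_brown(r, k):
    """Reliability of the mean of k ratings, given average pairwise r.

    Assumed formula: R = k * r / (1 + (k - 1) * r).
    """
    return k * r / (1 + (k - 1) * r)


# Emotion process coding used two coders (k = 2).
process_r = {
    "Feeling state": 0.71,          # reported R = .83
    "Physiological arousal": 0.79,  # reported R = .88
    "Expressive behavior": 0.29,    # reported R = .45
}
process_R = {stage: spearman_brown(r, 2) for stage, r in process_r.items()}

# Stimulus content coding averaged 3.6 ratings per text string.
content_R = spearman_brown(0.75, 3.6)

for stage, value in process_R.items():
    print(f"{stage}: {value:.2f}")
print(f"Content coding: {content_R:.2f}")
```

For the content coding, this formula gives a value slightly above the reported .90. Such a gap could arise from rounding of r or from the varying number of raters per string, so the match should be read as approximate.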
Results
Research Question 1: What Aspects of Emotion Do We Hear in Vocal Prosody?
Table 1 summarizes the categorization of open-ended judgments of vocal prosody into stages of the emotion process, broken down by the intended states (18 emotions and neutral). In general, feeling states were implicated in nearly half of all responses (43.2%), followed by attempts to influence others (20.0%), cognitive appraisals (16.6%), and physiological arousal (15.0%). Less common but still present were references to situations (2.2%), expressive behavior (1.7%, which was not a sufficiently reliable variable), and a very small number of entries related to emotion regulation (0.2%).
Table 1. Categorization of Open-Ended Judgments of Vocal Tones Into Stages of the Emotion Process (N = 302).
Note. Expressive behavior appears in the table for completeness; however, coding did not reach conventional reliability and coefficients appear in italics.
Positive emotions were more likely than negative emotions to be seen as physiological reactions, whereas negative emotions were more likely than positive emotions to be seen as cognitive appraisals and feeling states. We note that many of the social/moral emotions (anger, contempt, pride, guilt, and shame) had the highest percentages of responses coded as cognitive appraisals. The emotions most frequently coded as influencing others were interest, lust, and affection.
Supplemental Tables 1a through 1g contain these data for every combination of emotion process, emotional state, and nation. Judgments of stimuli as representing situations averaged 2.2% and ranged from 0.5% in Kenya to 5.0% in Singapore. Cognitive appraisals averaged 16.6%, ranging from 15.1% in India to 18.8% in Singapore. Feeling states had an average of 43.2%, with a wide range from 33.2% in Singapore to 51.9% in Australia. For physiological arousal, the average was 15.0%, ranging from 12.9% in India to 17.6% in Singapore. Attempts at emotion regulation were rarely used, at 0.2% in total. As mentioned above, coding of expressive behavior did not reach conventional reliability, but we note here that it averaged 1.7% of responses, ranging from 0.9% in Australia to 2.5% in India. Finally, attempts to influence others were coded at 20.0% of responses, and ranged from 12.9% in Australia to 26.2% in India.
Research Question 2: How Accurate Are Open-Ended Judgments of Emotions in Vocal Prosody?
The convergence between emotion perceptions and intended states has been a central question within research on vocal tone, and we attempt to assess it here. To address this question, we focus on the 43.2% of open-ended judgments that referred to feeling states, because strictly speaking only these responses can be considered accurate vs. inaccurate. If a participant did not label any of the 10 trials for a given emotion as a feeling state, the mean was substituted for that combination of emotion and culture. This occurred in only 1.43% of cases, ranging from 0% for distress, fear, happiness, and sadness, up to 7% for neutral.
Table 2 contains a confusion matrix that lists the 19 intended emotional categories in the columns and the 32 feeling states that emerged from stimulus content coding in the rows. Supplemental Tables 2a through 2e contain these matrices separately for each of the five nations sampled. In these tables, accuracy is based on the convergence between the open-ended responses and the intended state, which is indicated in bold and underlined typeface. Whether or not this convergence is greater than that predicted by chance guessing is not tested formally, because there is no multiple-choice response that a participant could choose at random. As such, we treat all convergence with intended responses as if it is better than that based on chance alone.
Table 2. Categories Coded for 19 Affective States That Were Judged as Feeling States Based on Open-Ended Responses From Five Nations (N = 302).
Note. Trials were included only if at least one rater coded the response as representing a feeling state, as opposed to another stage of the emotion process. AFF = affection; AMU = amusement; HAP = happy; INT = interest; LUS = lust; PRI = pride; PSU = positive surprise; REL = relief; SER = serenity; ANG = anger; CON = contempt; DSG = disgust; DST = distress; FEA = fear; GUI = guilt; NSU = negative surprise; SAD = sad; SHA = shame; NEU = neutral.
We examined four different types of accuracy: based on emotional categories (categorical), based on affective valence (valence-based), based on affective arousal (arousal-based), and based on the quadrant in the affective circumplex marked by valence and arousal (quadrant-based). These values for the five nations and overall appear in Table 3. For each type of accuracy, we provide formal tests of differences in accuracy levels from 5 (Culture Between-Subjects) × 19 (Emotion Within-Subjects) ANOVAs.
Judgment Accuracy for 19 Affective States That Were Judged as Feeling States Based on Open-Ended Responses From Five Nations (N = 302).
Note. Values may sum to less than 100% due to responses that are neutral or that could not be coded into the four quadrants. Bold font indicates judgments of emotional states in the intended quadrant. Some values in Supplemental Table 3 differ from corresponding values in the one-tenth of one-percent place because the values in this table are weighted by culture. AUS = Australia; IND = India; KEN = Kenya; SING = Singapore; USA = United States.
Categorical accuracy
There was a main effect for Emotion, F(18, 5346) = 276.25, η2 = .48, p < .001, a main effect for Culture, F(4, 297) = 43.57, η2 = .37, p < .001, and an interaction effect of Culture × Emotion, F(72, 5346) = 9.58, η2 = .11, p < .001. It is worth noting that the overall categorical accuracy is relatively modest, at 14.2%. However, the results of the ANOVA emphasize that this varied by emotional state, and some individual emotions received substantial recognition rates. In general, negative states (18.2%) were better recognized than positive states (10.9%). The states with highest accuracy were anger (61.2%), sadness (38.6%), happiness (30.9%), and fear (29.8%). Those with the lowest accuracy were positive surprise (1.9%), disgust (2.3%), guilt (2.8%), and affection (3.3%). Participants from India were the most accurate (17.7%), followed by those from the United States (15.5%), Australia (15.3%), Singapore (13.8%), and Kenya (8.4%).
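As a reading aid, the degrees of freedom reported above follow from a 5 (Culture, between-subjects) × 19 (Emotion, within-subjects) design with N = 302, and the reported effect sizes are numerically consistent with partial eta squared recovered from F and its degrees of freedom. The sketch below checks both. It assumes uncorrected degrees of freedom (no sphericity adjustment) and assumes the partial variant of eta squared, which the text does not state explicitly. The function names are hypothetical.

```python
def mixed_design_dfs(n_participants, n_groups, n_levels):
    """Degrees of freedom for a groups (between) x levels (within) mixed ANOVA."""
    df_between = n_groups - 1
    df_error_between = n_participants - n_groups
    df_within = n_levels - 1
    df_error_within = df_within * df_error_between
    return {
        "culture": (df_between, df_error_between),
        "emotion": (df_within, df_error_within),
        "interaction": (df_between * df_within, df_error_within),
    }


def partial_eta_squared(f_value, df_effect, df_error):
    """Recover partial eta squared from an F ratio and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


dfs = mixed_design_dfs(n_participants=302, n_groups=5, n_levels=19)

# F values as reported for categorical accuracy.
reported_f = {"emotion": 276.25, "culture": 43.57, "interaction": 9.58}
effect_sizes = {
    name: round(partial_eta_squared(f, *dfs[name]), 2)
    for name, f in reported_f.items()
}
print(dfs)
print(effect_sizes)
```

The same two functions reproduce the degrees of freedom for the quadrant-based analysis when `n_levels=18`, which corresponds to the 18 non-neutral states.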
Valence-based accuracy
There was a main effect for Emotion, F(18, 5346) = 295.65, η2 = .50, p < .001, a main effect for Culture, F(4, 297) = 19.74, η2 = .21, p < .001, and an interaction effect of Culture × Emotion, F(72, 5346) = 5.31, η2 = .07, p < .001. Valence accuracy was overall substantially higher than that for individual categories, and averaged 56.3%. In general, negative states (74.5%) were far better recognized than positive states (43.1%). The best recognized states were sadness (83.8%), anger (80.9%), distress (80.2%), and fear (79.2%). Those with the lowest accuracy were serenity (25.1%), relief (28.7%), lust (29.6%), and affection (47.1%). The most accurate participants were from Australia (60.6%), followed by those from the United States (58.5%), India (58.4%), Singapore (53.2%), and Kenya (49.5%).
Arousal-based accuracy
Based on Scherer (2005) and Gillioz et al. (2016), arousal levels were coded as low for the positive emotions affection, relief, and serenity, and the negative emotions guilt, sadness, and shame. The remainder were coded as high arousal.
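The arousal coding just described, together with the positive and negative groupings given in the table notes, is enough to sketch how a single trial could be scored on all four accuracy types. This is a simplified illustration rather than the study's scoring procedure: it assumes the judged label is one of the same 18 emotion names, whereas the study coded responses into 32 feeling-state categories whose valence and arousal assignments are not listed here. Treating neutral as having no valence or arousal is also an assumption.

```python
POSITIVE = {"affection", "amusement", "happiness", "interest", "lust",
            "pride", "positive surprise", "relief", "serenity"}
NEGATIVE = {"anger", "contempt", "disgust", "distress", "fear",
            "guilt", "negative surprise", "sadness", "shame"}
# Low-arousal states as listed in the text; all other emotions count as high arousal.
LOW_AROUSAL = {"affection", "relief", "serenity", "guilt", "sadness", "shame"}


def valence(emotion):
    if emotion in POSITIVE:
        return "positive"
    if emotion in NEGATIVE:
        return "negative"
    return None  # neutral or uncodable (assumption)


def arousal(emotion):
    if emotion in LOW_AROUSAL:
        return "low"
    if emotion in POSITIVE or emotion in NEGATIVE:
        return "high"
    return None  # neutral or uncodable (assumption)


def quadrant(emotion):
    v, a = valence(emotion), arousal(emotion)
    return None if v is None or a is None else (v, a)


def _same(dimension, intended, judged):
    """True when the intended state has a value on this dimension and the judgment matches it."""
    target = dimension(intended)
    return target is not None and dimension(judged) == target


def score(intended, judged):
    """Return hit/miss indicators for one trial on the four accuracy types."""
    return {
        "categorical": judged == intended,
        "valence": _same(valence, intended, judged),
        "arousal": _same(arousal, intended, judged),
        "quadrant": _same(quadrant, intended, judged),
    }


guilt_as_sadness = score("guilt", "sadness")
serenity_as_sadness = score("serenity", "sadness")
```

Under these assumptions, guilt judged as sadness misses the category but hits valence, arousal, and quadrant, while serenity judged as sadness hits only arousal.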
There was a main effect for Emotion, F(18, 5346) = 128.28, η2 = .30, p < .001, a main effect for Culture, F(4, 297) = 9.44, η2 = .11, p < .001, and an interaction effect of Culture × Emotion, F(72, 5346) = 2.90, η2 = .04, p < .001. Arousal-based accuracy was also substantially higher than that for individual categories, with an average of 60.1%, and was approximately equal for negative (59.4%) and positive (61.9%) states. The most highly recognized states were anger (83.0%), happiness (80.0%), amusement (79.8%), and positive surprise (75.1%). The most poorly recognized states were affection (27.6%), relief (37.7%), guilt (42.5%), and shame (48.0%). Note that the effect size for cultural differences was smaller than that for categorical and valence-based accuracy. Values were highest in Australia (64.6%), followed by India (62.2%), Kenya (58.4%), Singapore (57.8%), and the United States (57.5%).
Affective quadrant accuracy
There was a main effect for Emotion, F(17, 5049) = 201.39, η2 = .40, p < .001, a main effect for Culture, F(4, 297) = 19.51, η2 = .21, p < .001, and an interaction effect of Culture × Emotion, F(68, 5049) = 5.40, η2 = .07, p < .001. Affective quadrant accuracy was again higher than that for categorical accuracy, and averaged 41.5%. As with categorical and valence-based accuracy, negative states (51.4%) were better recognized than positive states (31.6%). Those with the highest accuracy were anger (75.3%), amusement (60.5%), happiness (60.3%), and fear (57.6%). Those with the lowest accuracy were affection (9.1%), serenity (9.9%), relief (12.4%), and lust (24.0%), which tended to be low arousal emotions. Accuracy was highest in Australia (45.2%), followed by the United States (43.8%), India (43.3%), Singapore (39.2%), and Kenya (34.9%).
To understand these data further, we present a confusion matrix in Table 4, which illustrates to what extent the erroneous judgments were random vs. systematic. The most visible trend is that low activation emotions, whether positive or negative, appear to be judged most often as negative states. In particular, low activation positive states appear to be challenging to judge, with 10.0% of judgments appearing in the intended quadrant, and the remaining judgments split about equally among the three others (22.3%, 22.9%, and 25.5% for high activation positive, high activation negative, and low activation negative, respectively).
Confusion Matrix of Affective Circumplex Quadrants Judged for Each Emotional State (N = 302).
Note. Values may sum to less than 100% due to responses that are neutral or that could not be coded into the four quadrants. Bold font indicates judgments of emotional states in the intended quadrant. Some values in Supplemental Table 3 differ from corresponding values in the one-tenth of one-percent place because the values in this table are weighted by culture.
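A quadrant confusion matrix of this kind can be tabulated as row percentages, as in the rough sketch below. It is not the authors' procedure: it omits the weighting by culture mentioned in the note, and the trial format, function name, and toy counts are hypothetical. Judgments that are neutral or uncodable are represented as `None`. They count toward the row total but get no cell of their own, which is why a row can sum to less than 100%.

```python
from collections import Counter


def confusion_percentages(trials):
    """Row-percentage confusion table from (intended_quadrant, judged_quadrant) pairs.

    A judged quadrant of None marks a neutral or uncodable response.
    """
    trials = list(trials)
    row_totals = Counter(intended for intended, _ in trials)
    cell_counts = Counter(
        (intended, judged) for intended, judged in trials if judged is not None
    )
    table = {}
    for (intended, judged), n in cell_counts.items():
        table.setdefault(intended, {})[judged] = 100.0 * n / row_totals[intended]
    return table


# Hypothetical toy data: ten trials whose intended quadrant is low activation positive.
toy_trials = (
    [("low positive", "low positive")] * 1
    + [("low positive", "high positive")] * 2
    + [("low positive", "high negative")] * 2
    + [("low positive", "low negative")] * 3
    + [("low positive", None)] * 2
)
table = confusion_percentages(toy_trials)
row_sum = sum(table["low positive"].values())
```

With these toy counts the intended quadrant receives 10% of judgments and the row sums to 80%, because two of the ten responses could not be placed in any quadrant.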
Overall findings
Taken together, the three forms of accuracy that draw from the affective circumplex—that is, valence-based, arousal-based, and quadrant-based—tended to have far higher accuracy than discrete emotional categories. Those that tended to be best recognized were the basic emotions of anger, sadness, happiness, and fear. These are also the emotions that Research Question 1 found to be the most often coded as feeling states.
Supplemental analyses by gender
In the case of data from Australia, India, and Singapore, individual participant data could be matched with self-reported gender, and accuracy by gender appears in Supplemental Tables 3a, 3b, 3c, and 3d for categorical, valence, arousal, and quadrant accuracy, respectively. Given findings of gender differences in emotion recognition (Hall et al., 2016; Lausen & Schacht, 2018), we analyzed these data using a 2 (Gender) × 3 (Culture) × 19 (Emotion) ANOVA separately for each dependent measure. The advantage for women that is apparent by visual inspection of the tables was significant for categorical accuracy, F(1, 166) = 5.24, η2 = .03, p = .023, reached marginal significance for arousal-based accuracy, F(1, 166) = 2.93, η2 = .02, p = .089, and quadrant-based accuracy, F(1, 166) = 3.38, η2 = .02, p = .068, and was not significant for valence-based accuracy, F(1, 166) = 1.10, η2 = .01, p = .295. There were no significant two-way interaction terms of Gender × Culture or Gender × Emotion for any of the four types of accuracy.
Research Question 3: What Else Do People Hear in Vocal Prosody Other Than Emotional States?
Table 5 summarizes the frequency with which the intended states were categorized into the entire set of 56 open-ended categories. Separate versions of this table for each of the five nations appear in Supplemental Tables 4a through 4e.
Categories Coded for 19 Affective States Based on Open-Ended Responses From Five Nations (N = 302).
Note. Entries in underlined bold font represent the intended categories. AFF = affection; AMU = amusement; HAP = happy; INT = interest; LUS = lust; PRI = pride; PSU = positive surprise; REL = relief; SER = serenity; ANG = anger; CON = contempt; DSG = disgust; DST = distress; FEA = fear; GUI = guilt; NSU = negative surprise; SAD = sad; SHA = shame; NEU = neutral.
To group these categories into larger themes, we recruited expert emotions researchers, who were all full members of the ISRE. They participated in a “card sorting” task, which provided them with 56 “cards” that they could sort into as many categories as they wished, and they provided their own label for each category. The resulting data produced a 56 × 56 similarity matrix, which consisted of the number of times that any given pair of codes appeared in the same user-generated category. These data were examined using participant-centric analysis (https://support.optimalworkshop.com/en/articles/2626877-interpret-the-optimalsort-participant-centric-analysis-pca), which displays the sorting methods that converged most closely with those of multiple participants. To serve as a power test for card-sorting tasks, a simulation model determined that N = 15 achieves a correlation of .90 between similarity matrices and those using a sample over ten times that number (Tullis & Wood, 2004). Sixteen scholars took part in this card sorting task.
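The similarity matrix described above is a co-occurrence count over card sorts, which can be sketched as follows. The input format (one list of groups per sorter, with cards identified by integer index) and the function name are assumptions made for illustration. The sketch does not reproduce the participant-centric analysis itself, which was run in the card-sorting software.

```python
from itertools import combinations


def cooccurrence_matrix(sorts, n_cards):
    """Count how often each pair of cards was placed in the same group.

    `sorts` holds one entry per sorter; each entry is a list of groups,
    and each group is a list of card indices in range(n_cards).
    The diagonal is left at zero.
    """
    sim = [[0] * n_cards for _ in range(n_cards)]
    for groups in sorts:
        for group in groups:
            for a, b in combinations(sorted(set(group)), 2):
                sim[a][b] += 1
                sim[b][a] += 1
    return sim


# Hypothetical toy example: three sorters, four cards.
toy_sorts = [
    [[0, 1], [2, 3]],
    [[0, 1, 2], [3]],
    [[0], [1], [2, 3]],
]
sim = cooccurrence_matrix(toy_sorts, n_cards=4)
```

With 56 cards and 16 sorters, the same function would return a symmetric 56 × 56 matrix whose off-diagonal entries range from 0 to 16.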
The resulting grouping of the 56 categories is indicated in the headings of Table 5. It is noteworthy that affective states accounted for 32 of the 56 categories, which suggests that such states account for a substantial proportion of open-ended judgments. However, 24 of the categories were grouped into themes that the experts did not consider affective states, even though these responses were generated by participants after listening to stimuli designed to represent such states. In particular, additional categories could be classified under the heading of person descriptors, communication behaviors, and abnormal states. Person descriptors included wanting attention, being dominant, hesitating, being humble, being polite, rushing, being secretive, sincerity, seriousness, sympathy, thoughtfulness, wanting to be understood, and being weak. Communication behaviors included advising, complaining, confessing, confirming, asking questions, making requests, sarcasm, warning, and whispering. Abnormal states included being crazy or under the influence of narcotics.
Several patterns are worth mentioning that consisted of 9% or more responses. In terms of feeling states, amusement was often attributed to happiness and excitement. Happiness was often attributed to amusement and excitement. Serenity was often attributed to sadness and neutrality. Contempt and disgust were often attributed to anger. Distress was often attributed to anger, fear, and sadness. Fear was often attributed to distress. Guilt was often attributed to sadness. Negative surprise was often attributed to anger. Sadness was often attributed to distress. Shame was often attributed to sadness. In terms of person descriptors, the negative emotions of anger, contempt, disgust, and negative surprise were frequently attributed to dominance, as were the positive emotions of affection, interest, and pride. Lust was often attributed to being secretive. None of the communication behaviors or abnormal states was part of major confusion patterns.
Discussion
We started this investigation with a basic question: What do we hear in the voice? Taking a large-scale approach to examining the judgment of emotional prosody, we extended a long-standing tradition of emotion recognition research that has focused primarily on multiple-choice judgments of a small number of emotions expressed via still photographs of the face. Emotional prosody is unique among communication channels, in that it necessarily conveys verbal and nonverbal content simultaneously. Emotional expressions came from 100 professional actors from five English-speaking nations that span four continents. These actors portrayed 18 distinct emotional categories, particularly including more positive states than typically sampled in research on emotion perception. Participants had little constraint on their judgment process and were asked only what the speaker was trying to convey to the listener. In conducting a project of this unusual magnitude, the goal was to provide additional data to fill in the basic science of understanding the human voice in expressing emotion.
Rather than organizing the paper around hypotheses, we examined three research questions within the larger question of what people hear in the voice.
First, we examined what aspects of emotion listeners hear in vocal prosody. Participants spontaneously reported responses that cut across the chronological emotion process (Frijda, 2007). This includes not only feeling states—which have been the focus of response choices in conventional studies—but also the situation that elicited the emotion, the cognitive appraisals that analyze the situation, the physiological arousal associated with the feeling state, the expressive behavior that intentionally or unintentionally conveys information about feeling states, attempts at regulating emotions, and also attempts at influencing social partners. The current study provides the first descriptive data of this kind. Some components were more frequently mentioned for some emotions, which suggests that not all emotions are equally linked to all components in the participants’ perceptions. For example, social/moral emotions had the highest percentages of responses coded as cognitive appraisal; and interest, lust, and affection had the highest percentages of responses coded as influencing others. There were also some relative differences across cultures. Singaporean participants provided judgments that were less often coded as feeling states, and more often as situations, cognitive appraisals, and physiological arousal. Participants from India judged stimuli less often as cognitive appraisals and more often as attempts to influence others. These results should be replicated to see if similar patterns can also be observed in other samples of expressive stimuli and participants. Research on how people spontaneously infer different components of the emotion process from expressive behavior is still emerging, but has the potential to widen our understanding of the social function of emotion.
Second, we asked how accurate open-ended judgments of emotions in vocal prosody are. For this, we examined only those responses that consisted of feeling states, which strictly speaking are the only ones that can be considered accurate vs. inaccurate. Doing so, we found modest average accuracy of 14.2% in responding with the same intended emotional category. Although average categorical accuracy was low, it was fairly high for some emotions, with anger reaching the highest rate at 61.2%. We suggest that accurate open-ended vocal emotion recognition requires both that the emotion in question has a distinct acoustic pattern that is being recognized, and that the emotion concept in question is salient and readily accessible when recording the written response. (In addition, participants also need to find emotional states to be the most relevant feature of the stimulus when reporting their impressions to researchers, and therefore accuracy is here based only on responses that were coded as feeling states in the first place.) This seems to be the case for at least some emotion categories (anger, fear, happiness, and sadness), which received fairly high categorical recognition rates. We note that these emotions all belong to those that are historically considered as “basic” (Ekman, 1992). Future research is required to find out if the low recognition rates for the other emotions were caused by less distinct acoustic patterns, lower salience of the emotion concepts in question, or a combination of both.
Accuracy in terms of affective dimensions was overall more substantial. Participants matched the valence of the intended emotion an average of 56.3% of the time and the arousal level of the intended emotion an average of 60.1% of the time. When valence and arousal were combined into four quadrants, participant responses were considered accurate an average of 41.5% of the time. Arousal-based accuracy also showed the smallest effect size for cultural differences, compared with categorical and valence-based accuracy. These findings are in accordance with previous rating studies, which show that especially arousal may modulate many vocal cues and is well recognized in the voice (e.g., Laukka et al., 2005). When participants’ responses do not directly match the intended emotion, these findings suggest that they still often match another emotion that is similar to the intended emotion in terms of arousal and valence. This suggests that listeners’ perceptions are not as fine-grained as simply falling into emotion categories. Although these findings provide potentially suggestive data, more research is needed to address explicitly the controversial question of to what extent vocal expressions are perceived in terms of emotion categories or dimensions (see Cowen et al., 2019).
It was interesting to note that women significantly outperformed men in categorical accuracy, whereas their advantage was only marginal for arousal-based and quadrant-based accuracy and not significant for valence-based accuracy. This suggests that women in our sample were more attuned to nuance in recognizing specific emotions, with weaker evidence of greater sensitivity to the underlying dimensions of core affect.
Finally, we examined what else people hear in vocal prosody other than emotional states. We took a holistic approach to explore what types of responses emerged when listeners had no constraints on their reactions to a diverse pool of vocal emotional expressions. In doing so, we found that there was a wide range of ways in which people spontaneously perceive the intent behind affective vocal cues. As Frijda (1953) and Haidt and Keltner (1999) had previously shown for facial expressions, our results showed that participants did not only describe feeling states when given the opportunity to provide open-ended responses to vocal expressions. Our taxonomy of these responses thus included not only emotional categories but also cues about person descriptors, communication behaviors, and abnormal states. These categories emerged from a card-sorting task completed by expert scholars in the field of emotion. Whereas participants are typically directed to discrete categories of affective experience in research studies, we found that participants who could freely convey what they hear in the voice produced a diverse set of judgments. The responses were coded into 56 categories, which emotions research experts grouped into 32 affective states, 13 person descriptors, 9 communication behaviors, and 2 abnormal states. The complex taxonomy inherent in these findings points to the complexity of social cues that listeners hear in vocal tone. The inclusion of communication behaviors speaks to the debate about whether emotional stimuli should be considered “expression” vs. “communication” (Ekman, 1997). That is, the internal experience of emotion can surface in terms of involuntary expressive behavior, but in addition, we may have evolved to imitate these involuntary cues to convey deliberate messages to others.
To the extent that listeners perceive communication behaviors in the vocal signals of others, this supports the notion that emotional displays represent more than spontaneous reflexes. The inclusion of person descriptors suggests potential links between emotion perception and broader aspects of social perception (e.g., Todorov, 2017), which could be investigated in future studies.
Overall, our findings of open-ended accuracy and the diversity of judgments may also provide new insights for research on the real-life outcomes of emotion recognition skills (e.g., Schmid Mast & Hall, 2018). Future studies could, for example, investigate which kind of inferences (other than emotion categories) are associated with successful social interactions.
Limitations
A number of important limitations qualify the findings reported above.
First, we note the wording of our stimulus sentences. Our goal was to have sentences that were long enough for variance in prosody and that could apply to 19 different affective states. However, participants could have interpreted the particular sentences (“Let me tell you something” and “That is exactly what happened”) as implying influence attempts. This would have influenced the results for the first and third research questions. In particular, it may have increased the proportion of influence attempts appearing in Table 1, which accounted for an average of 20.0% of all responses, and may have increased the proportion of communication behaviors appearing in Table 5. Future research could use nonsense words, content filtering, or sentences that are more truly neutral in content with respect to emotional processes.
A related limitation concerns the wording of the instructions. Listeners were told that the “expressions might involve emotions, intentions, or other kinds of feelings that a person could have,” and were instructed to think about “what kind of feeling would likely have elicited such a vocal expression.” This may also have influenced the results for the first and third research questions. It could have increased the proportion of feeling states in participant responses, which accounted for an average of 43.2% of the text strings in Table 1, and the number of emotional states listed in Table 5, which account for 32 of the 56 categories coded.
The coding process also introduced error, as it involved reading details about participants’ intended meanings that are limited by the precision of everyday emotional vocabulary in relatively brief text strings. For example, participants may have used the same words to refer to feeling states as they do to refer to physiological arousal (e.g., calm), and may have used more generalized vocabulary to refer to more specific emotions (e.g., happy for affection, and sadness for guilt).
A potential limit to generality is that all five nations represented in the study are English-speaking. This was chosen deliberately so that linguistic properties would not be confounded with paralinguistic ones. This also meant that the emotion categories were consistent in their meaning across the five nations. It would be worthwhile also to conduct large-scale research on vocal prosody with other language groups. We note that all data were collected for individuals’ judgments of members of their own cultural group, that is, within-cultural data versus cross-cultural data. A cross-cultural judgment study using open-ended responses would be welcome. Such a study could provide novel information about possible differences in spontaneous impressions of emotion expressions from one’s own vs. other cultural groups.
Another concern is that the studies made use of posed expressions instead of spontaneous speech, consistent with most research on the perception of vocal tone. We made this decision for the feasibility of including 18 different emotions and neutral portrayals. However, actors may enact vocal tone in a way that differs from spontaneous speech. For example, in a comparison of acoustic patterns in spontaneous and posed expression databases, Juslin et al. (2018) reported small but systematic differences. We attempted to assuage this concern by using stimuli developed via the Stanislavski (1936) technique of method acting, which increases authenticity by inviting actors to relive past experiences (Scherer & Bänziger, 2010). Furthermore, we note Scherer’s (2013) challenge to the notion that using acted samples in empirical research is problematic, based on several sources of evidence (see also Scherer & Bänziger, 2010). Notably, Scherer (2013) found that speech samples from mood induction and acting were comparable in terms of their acoustic properties. Taken together, we believe that the limitation of using posed expressions is an important one, but that it does not invalidate the research presented here. Related to this concern is the fact that we informed participants that the stimuli were produced by actors, which may have increased their awareness that these were intended to be emotional expressions.
Although maintaining the argument that these stimuli are valid representations of emotional expression, we acknowledge that it is not possible to compare directly the accuracy rates in this open-ended study to those in forced-choice studies for which stimuli were hand-selected to be prototypical representations. The current stimulus set was not hand selected, as all actor portrayals were included, unlike other studies that use pre-screening to ensure a baseline intensity of emotional cues.
Finally, our data on accuracy come from only those responses in which participants judged stimuli as representing a feeling state. This is because in other responses the feeling state was not the most salient feature to the listener, and so their response could not be judged as accurate vs. inaccurate per se. In further work, researchers may choose to capture responses as generated spontaneously by participants being prompted only to describe what the actor is trying to express, or they may limit participants to emotion labels in order to analyze the entire data set.
Conclusion
In asking the question of what we hear in the voice, we conclude that people hear quite a bit. Beyond emotional states, people hear a wide range of messages. This diversity in what people perceive in prosody indicates the far-reaching value of emotion as a means of communication, including information about the environment, other people’s characteristics, and their behavioral intentions. People make a wealth of spontaneous inferences from expressive cues, and many of these inferences were captured in the current open-ended study and contribute to a fuller understanding of vocal emotion perception. We note that the usable nature of this information is at the heart of the social function of emotion—after all, perceiving is for doing (Gibson, 1979). The recognition of emotional content embedded within nonverbal behavior is critical for real-life outcomes (Schmid Mast & Hall, 2018), and the perception of emotion in prosody is crucial within this important human process.
Supplemental Material
sj-docx-1-psp-10.1177_01461672211029786 – Supplemental material for What Do We Hear in the Voice? An Open-Ended Judgment Study of Emotional Speech Prosody
Supplemental material, sj-docx-1-psp-10.1177_01461672211029786 for What Do We Hear in the Voice? An Open-Ended Judgment Study of Emotional Speech Prosody by Hillary Anger Elfenbein, Petri Laukka, Jean Althoff, Wanda Chui, Frederick K. Iraki, Thomas Rockstuhl and Nutankumar S. Thingujam in Personality and Social Psychology Bulletin
Supplemental Material
sj-docx-2-psp-10.1177_01461672211029786 – Supplemental material for What Do We Hear in the Voice? An Open-Ended Judgment Study of Emotional Speech Prosody
Supplemental material, sj-docx-2-psp-10.1177_01461672211029786 for What Do We Hear in the Voice? An Open-Ended Judgment Study of Emotional Speech Prosody by Hillary Anger Elfenbein, Petri Laukka, Jean Althoff, Wanda Chui, Frederick K. Iraki, Thomas Rockstuhl and Nutankumar S. Thingujam in Personality and Social Psychology Bulletin
Footnotes
Acknowledgements
We thank Ann Liu and Alex Morel for research assistance and Alan Cowen for helpful suggestions. Yochi Cohen-Charash, Shlomo Hareli, Marcello Mortillaro, James Russell, Disa Sauter, Craig Smith, and 10 anonymous colleagues contributed expert ratings.
Author Contributions
The third, fourth, fifth, sixth, and seventh authors contributed equally, and appear in alphabetical order.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: We acknowledge United States National Science Foundation BCS-0617624 to Hillary Anger Elfenbein and Swedish Research Council 2006-1360 to Petri Laukka.
Supplemental Material
Supplemental material is available online with this article.
