Abstract
First impressions formed after seeing someone’s face or hearing their voice can affect many social decisions, including voting in political elections. Despite the many studies investigating the independent contribution of face and voice cues to electoral success, their integration is still not well understood. Here, we examine a novel electoral context, student representative ballots, allowing us to test the generalizability of previous studies. We also examine the independent contributions of visual, auditory, and audiovisual information to social judgments of the candidates, and their relationship to election outcomes. Results showed that perceived trustworthiness was the only trait significantly related to election success. These findings contrast with previous reports on the importance of perceived competence using audio or visual cues only in the context of national political elections. The present study highlights the role of real-world context and emphasizes the importance of using ecologically valid stimulus presentation in understanding real-life social judgment.
Introduction
We form first impressions of unfamiliar people the moment we meet them. In such a situation, we are usually presented with audio (i.e., voice) and visual (i.e., face) cues simultaneously, yet the vast majority of the social evaluation literature has focused on their independent effects (McAleer, Todorov, & Belin, 2014; Oosterhof & Todorov, 2008; Todorov, Said, Engell, & Oosterhof, 2008). First impressions from faces and voices have many parallels: they are both formed very quickly (after a 100-ms exposure for faces, Willis & Todorov, 2013, and after brief utterances for voices, McAleer et al., 2014) and have the same underlying structure, with dominance and trustworthiness emerging as fundamental dimensions (McAleer et al., 2014; Oosterhof & Todorov, 2008). Although evidence for the accuracy of social judgments is limited at best (Klofstad & Anderson, 2018; Todorov, Olivola, Dotsch, & Mende-Siedlecki, 2015), people seem to agree with each other’s evaluations, implying they are based on some consistent visual information in the face or acoustic information in the voice (McAleer et al., 2014; Zebrowitz & Montepare, 2008). Most importantly, first impressions have been shown to influence our behavior and decisions both in situations where appearance might be relevant, for example, dating (Doll et al., 2014; Wells, Dunn, Sergeant, & Davies, 2009), and in situations where we should be making more objective and informed choices, such as political elections (Ballew & Todorov, 2007; Klofstad, 2016; Klofstad, Anderson, & Peters, 2012; Olivola & Todorov, 2010; Sussman, Petkova, & Todorov, 2013; Todorov, Mandisodza, Goren, & Hall, 2005), business and finance decisions (Dean, 2017; Fruhen, Watkins, & Jones, 2015; Rule & Ambady, 2008), and court sentencing (Wilson & Rule, 2015; Zebrowitz & McDonald, 1991; also see Olivola, Funk, & Todorov, 2014; Todorov et al., 2015 for reviews).
Here, we focus on how audio and visual cues are integrated to inform social judgments relevant to the most studied choice domain—leadership elections. Empirical work exploring the dimensions voters use when evaluating political candidates shows that competence is deemed one of the most important traits to possess (Miller, Wattenberg, & Malanchuk, 1986; Trent, Mongeau, Trent, Kendall, & Cushing, 1993). Therefore, competence has been the focus of social evaluation research on political decisions, with studies consistently demonstrating that political candidates perceived to have a more competent-looking face than their opponents are more likely to win U.S. Senate, House of Representatives, gubernatorial, and even Presidential elections (see Hall, Goren, Chaiken, & Todorov, 2009; Olivola & Todorov, 2010 for reviews). This effect has been replicated across different exposure durations (100 ms, 250 ms, or unlimited time in Ballew & Todorov, 2007) and different decision tasks (2AFC in Ballew & Todorov, 2007 and Todorov et al., 2005, and rating the competence of multiple candidates in Sussman et al., 2013). Such findings support the assumption that first impressions represent rapid and unreflective (also referred to as “system 1”) judgments, which means their effect might be unnoticed by voters (Chaiken & Trope, 1999; Kahneman, 2003). In fact, one way to disturb the relationship between competence evaluations and election success is to instruct participants to make a deliberate, rather than an intuitive, decision (Ballew & Todorov, 2007).
In the voice perception literature, research on political and leadership decisions has exclusively focused on the role of vocal pitch. Tigue, Borak, O’Connor, Schandl, and Feinberg (2012), for example, presented participants with pairs of voice recordings (one with a high and one with a low pitch) and asked them to select the person who sounded like a better leader and the one they would vote for. The results showed a significant preference for low-pitched voices both in terms of leadership and hypothetical votes. These findings have also been replicated with audio recordings and data from the U.S. House of Representatives elections, demonstrating a negative correlation between vocal pitch and vote share for both male and female candidates (Klofstad, 2016).
Despite all we already know about evaluating faces or voices along social dimensions, a more realistic approach to first impressions would acknowledge and explore their integration. Historically, most audiovisual literature has focused on identity and emotion recognition (Campanella & Belin, 2007; Massaro & Egan, 1996; Robertson & Schweinberger, 2010; Schweinberger, Robertson, & Kaufmann, 2007) with relatively fewer studies on social evaluation (Mileva, Tompkinson, Watt, & Burton, 2018; Rezlescu et al., 2015; Tsankova et al., 2015). These studies show that the relative importance of face and voice cues depends on the social dimension, with visual information from the face being more diagnostic of attractiveness (Rezlescu et al., 2015) and trustworthiness judgments (Mileva et al., 2018; Tsankova et al., 2015) and auditory information from the voice being more diagnostic of dominance judgments (Mileva et al., 2018; Rezlescu et al., 2015).
The relative contribution of face and voice cues to competence judgments, as well as their integration in the context of leadership decisions, however, is not well understood. In one of the few studies addressing this issue, Benjamin and Shapiro (2009) showed participants 10-s video clips of political candidates in a debate. The participants’ task was to rate each person on attractiveness, likeability, leadership, and political orientation (liberal vs. conservative) as well as to guess which candidate won the election. The videos were presented with full sound, with muddled sound, or with no sound. Their results showed that participants were able to predict the winner of the election above chance levels in all three audio conditions, with no significant differences between them. However, as this was not the focus of the paper, Benjamin and Shapiro did not provide much information about the influence of social ratings. A recent study by Klofstad (2017) also explored the relative contribution of face and voice cues to election success. However, it focused on a single social trait (competence) and on a single acoustic characteristic (vocal pitch). In that study, images of House of Representatives members rated as the most and the least competent were paired with a separate set of voice recordings manipulated to have either a higher or a lower pitch. Participants were then presented with two such pairs and asked to cast a hypothetical vote. The results showed that candidates with competent faces and competent voices (i.e., voices with low pitch, see Klofstad et al., 2012) won the largest proportion of votes; however, the effect of facial competence was 2.8 times larger than the effect of vocal competence. Findings from both studies imply that visual information in the face might be of higher importance than vocal characteristics when it comes to political and leadership decisions.
An interesting question following from studies that integrate audio and visual information is whether such cues lead to the same social evaluation. In other words, are people with trustworthy or dominant faces also perceived to have trustworthy or dominant voices? Previous literature has mainly focused on judgments based on a single modality rather than on their integration. Attractiveness is the only exception within this context, with some evidence that people perceived as more attractive from their faces are also perceived as more attractive based on their voices (Collins & Missing, 2003; Saxton, Burriss, Murray, Rowland, & Craig, 2009; Saxton, Caryl, & Craig, 2006). There are also studies exploring the perception of physical characteristics from both face and voice cues (Puts, Jones, & Debruine, 2012; Re, DeBruine, Jones, & Perrett, 2013). Some studies show that both visual and acoustic characteristics are highly correlated with measures of a person’s strength, height, and weight (Burton & Rule, 2013; Hodges-Simeon, Gurven, Puts, & Gaulin, 2014), whereas others report high correlations between face- and voice-based ratings for masculinity, health, and height (Smith, Dunn, Baguley, & Stacey, 2016). Together with findings of the strong relationship between facial and vocal perceived threat (Han et al., 2017), these studies suggest a possible link between dominance judgments inferred from the face and the voice.
The present study aims to extend previous literature in two ways. First, we examine a very different electoral context to that usually studied. Student representative elections are common across colleges and universities worldwide. As such, they are part of life for a very large number of people. (It would be hard to estimate how many, but any estimate would presumably be in the millions, every year.) Of course, these are of no geopolitical consequence in comparison with the elections typically studied in psychology, which universally focus on political voting. As described above, the large body of research shows the importance of perceived competence in national political contests, but we do not know whether this factor will be so important in elections of all types. Candidates in student elections may attract a different type of support, perhaps based on social factors or influenced by the fact that the winners of such elections receive relatively little real power. For this reason, it is important to establish whether the influence of perceived competence is universal, or tied to a particular context.
Second, in this study, we examine the independent contribution of faces, voices, and their combination across different social judgments. We do this using real election material (student campaign material) and relate social judgments to real outcomes (election results). The use of genuine election material, rather than hypothetical elections, allows us to examine whether differences in social judgment are powerful enough to survive the highly variable, “messy,” context of a real ballot.
Throughout the study, participants were presented with short video clips of unfamiliar candidates running for student representative elections and then asked to rate each person for the fundamental social dimensions of trustworthiness, dominance, and attractiveness as well as competence. The experimental stimuli comprised audio cues only (voice recording extracted from each video), visual cues only (muted video clips), or audio and visual cues together (unedited clips). If the results from the social evaluation literature on political elections generalize across social contexts, then we would expect that competence, as judged from both the face and the voice, would be the trait most closely related to election success. Following from Klofstad (2017), we would also anticipate that visual information from the face would be more diagnostic of election success than acoustic information from the voice. However, we are interested to observe whether these patterns hold in the present context. More generally, we also predict that there will be positive correlations between face- and voice-based ratings, at least for judgments of attractiveness and dominance, where such patterns have previously been observed in neutral, lab-based settings.
Method
Participants
A total of 99 participants (seven male, mean age = 19 years, age range = 18-50 years) took part in the experiment. All were first-year students from the University of York, who were unfamiliar with candidates from student elections held in earlier years, whose campaign material was used in the study. Participants had normal or corrected-to-normal vision and received course credit or payment for their participation. Sample size was determined with an a priori power analysis in G*Power (Erdfelder, Faul, & Buchner, 1996). Sussman et al. (2013) report one of the few studies that use a wider range of candidates and correlate the percent of votes received with their trait ratings. They collected data for 18 candidates with at least 32 participants rating each image and report a correlation of .53 between competence and vote percent. Based on their results, our power analysis revealed that to detect an effect of a similar size, with 90% power using an alpha of .05 (two-tailed), we needed a sample of 33 participants per condition. The study was approved by the ethics committee of the psychology department at the University of York and informed consent was provided prior to participation.
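The sample-size reasoning above can be roughly checked with a short sketch. It assumes the common Fisher z approximation for a two-tailed test of a correlation; G*Power uses its own exact routine, so the result may differ from the reported 33 by about one participant. The function name `n_for_correlation` is hypothetical and used only for illustration.

```python
import math
from statistics import NormalDist


def n_for_correlation(r, alpha=0.05, power=0.90):
    """Approximate n needed to detect correlation r (two-tailed test).

    Uses the Fisher z approximation; this is an assumption of the sketch,
    not the exact method implemented in G*Power.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for alpha
    z_power = NormalDist().inv_cdf(power)          # quantile for desired power
    effect = math.atanh(r)                         # Fisher z of the expected r
    return ((z_alpha + z_power) / effect) ** 2 + 3


# Effect size taken from Sussman et al. (2013) as cited in the text.
n_required = n_for_correlation(0.53)
print(f"approximate n per condition: {n_required:.1f}")
```

Under these assumptions the approximation lands between 33 and 34, in line with the 33 participants per condition reported in the text.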
Materials
The study used 22 videos produced by Student Television (http://ystv.co.uk/) as manifestos from candidates running in the University of York Student Union elections. We used videos of candidates running for the positions of student union president (11/22) and sports president between 2015 and 2017 (original videos can be found at https://ystv.co.uk/watch/Elections/). There were 7/22 videos of female candidates. An average of 2,524 votes were cast per year for each position and winning candidates received an average of 1,247 votes. Given that there were at least four candidates in each election, successful candidates won by a comfortable margin, securing about 50% of votes cast.
All videos were cut to capture only candidates presenting themselves and the position they were running for (mean video length = 3.41 s, video length range = 2-6 s). These short clips were used in the audiovisual condition. Participants in the visual condition saw the same 22 clips presented silently, while participants in the audio condition heard the voice of the candidates extracted from the same clips.
Procedure
The study used the online platform Qualtrics (2015; Provo, UT) to collect data; however, participants were tested in the lab. Each participant was randomly assigned to one of the three conditions: audiovisual, video only, or audio only. Participants were presented with all 22 clips and asked to rate each candidate for attractiveness, trustworthiness, dominance, and competence on a 9-point Likert-type scale. Each trait was rated in a separate block to minimize any carryover effects (Rhodes, 2006). Block and stimulus presentation order were randomized individually for each participant.
Results
Social Traits and Election Success
All trait ratings showed good interrater reliability (Cronbach’s α ranging from .75 to .94) and we, therefore, calculated an average score for each candidate within each trait × modality condition. Table 1 shows these average scores together with information about the total number and relative proportion of votes received by each candidate. The average trait scores were then correlated with the proportion of votes received by each candidate separately for the auditory, visual and audiovisual stimulus presentation (Table 2).
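As a minimal sketch of the reliability check and averaging step described above, the snippet below computes Cronbach's alpha with raters treated as "items" and candidates as cases, then averages ratings per candidate. Treating raters as items is an assumption of this sketch (the text does not spell out the computation), and the ratings shown are hypothetical, not the study's data.

```python
from statistics import mean, variance


def cronbach_alpha(ratings):
    """ratings[i][j] is the rating given by rater i to candidate j."""
    k = len(ratings)                                 # number of raters
    rater_vars = [variance(r) for r in ratings]      # variance per rater
    totals = [sum(col) for col in zip(*ratings)]     # summed rating per candidate
    return k / (k - 1) * (1 - sum(rater_vars) / variance(totals))


def candidate_means(ratings):
    """Average rating per candidate across raters."""
    return [mean(col) for col in zip(*ratings)]


# Hypothetical 9-point ratings: 3 raters x 4 candidates.
ratings = [
    [2, 4, 6, 8],
    [3, 5, 6, 9],
    [1, 4, 7, 7],
]
alpha = cronbach_alpha(ratings)
means = candidate_means(ratings)
print(f"alpha = {alpha:.2f}, candidate means = {means}")
```

In the study this would be done separately for each trait × modality cell, with the resulting candidate means then correlated with vote proportion.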
Table 1. Mean proportion of votes and social ratings (A = attractiveness, T = trustworthiness, D = dominance, C = competence) across the three conditions (auditory, visual, and audiovisual) for each election candidate.
Note. Vote proportion reflects the proportion of votes that each candidate received in their respective race. Some candidates did not record a manifesto video; however, the votes they received have been taken into consideration when calculating the proportion of votes received by the present candidates.
Table 2. Uncorrected Pearson’s Correlations Between the Proportion of Votes Received by Each Candidate and Each Social Trait Across All Three Presentation Modalities.
Note. Significant correlations are presented in bold. n = 22.
Pearson correlations identified trustworthiness as the only trait related to election success. This relationship was significant in the auditory (r = .50, p = .017, 95% CI [confidence interval] = [0.10, 0.91]), visual (r = .45, p = .038, 95% CI = [0.03, 0.86]), and in the audiovisual condition (r = .44, p = .040, 95% CI = [0.02, 0.86]). Figure 1 shows the relationship between trustworthiness ratings and proportion of votes across modality. No other trait was significantly correlated with the proportion of votes received by the candidates. To check the reliability of these results, we also used the Benjamini and Hochberg (1995) correction for multiple comparisons with a false discovery rate of 0.2. Trustworthiness remained significantly correlated with vote proportion in all three conditions after the correction (auditory condition: p = .017, visual condition: p = .033, audiovisual condition: p = .05).

Figure 1. Correlations between trustworthiness and proportion of votes in the auditory only (top), visual only (bottom left), and audiovisual (bottom right) conditions.
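The Benjamini and Hochberg (1995) step-up procedure mentioned above can be sketched as follows. The three trustworthiness p-values are those reported in the text; the remaining nine are hypothetical placeholders standing in for the other trait × modality tests, so the output illustrates the procedure and does not reproduce the study's corrected values.

```python
def benjamini_hochberg(p_values, q=0.2):
    """Return booleans marking which tests survive BH at false discovery rate q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    k_max = 0
    # Step-up rule: find the largest rank k with p_(k) <= (k / m) * q.
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            k_max = rank
    keep = [False] * m
    for idx in order[:k_max]:  # every test ranked at or below k_max survives
        keep[idx] = True
    return keep


reported = [0.017, 0.038, 0.040]  # uncorrected trustworthiness p-values from the text
hypothetical = [0.21, 0.26, 0.34, 0.41, 0.52, 0.60, 0.71, 0.83, 0.92]  # placeholders
survives = benjamini_hochberg(reported + hypothetical, q=0.2)
print(survives[:3])
```

With these inputs the third-ranked p-value meets its threshold, so the step-up rule retains all three trustworthiness tests even though the two smallest fall just above their own rank-specific thresholds.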
Although participants rated candidates on each trait in separate blocks with randomized order, there were only 22 candidates, which could still lead to carryover effects (Rhodes, 2006). To address this issue, we reanalyzed only the first rating block completed by each participant, pooling ratings across all three conditions because these showed a very similar pattern of results. This resulted in 23 participants rating attractiveness, 30 participants rating trustworthiness, 22 participants rating dominance, and 24 participants rating competence. Consistent with our earlier findings, Pearson correlations showed a significant relationship between vote proportion and trustworthiness (r = .67, p = .001, 95% CI = [0.33, 1.02]). No other traits were significantly correlated with the proportion of votes received by the candidates.
Relationship Between Face and Voice Cues
Such findings imply that the effect of face and voice cues might be complementary, rather than independent. To explore this further, we looked at the correlations between ratings attributed to each candidate when participants were presented with auditory or visual cues only (Figure 2). Pearson correlations showed a positive relationship between ratings attributed to faces and voices for all social traits. These correlations were strongest for ratings of trustworthiness (r = .63, p = .002, 95% CI = [0.27, 0.99]) and dominance (r = .47, p = .028, 95% CI = [0.06, 0.88]), demonstrating that people who are perceived as more trustworthy and dominant as judged from their faces, receive similar ratings based on their voices. The correlations between face and voice ratings for attractiveness (r = .36, p = .105, 95% CI = [–0.08, 0.79]) and competence (r = .33, p = .132, 95% CI = [–0.11, 0.77]) also followed the same direction, but were not significant.

Figure 2. Relationships between ratings based on auditory and visual cues in the perception of trustworthiness (top left), dominance (top right), attractiveness (bottom left), and competence (bottom right).
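As a rough consistency check on the face-voice correlations reported above, the sketch below converts each reported r (n = 22 candidates) into a t statistic and compares it with the two-tailed .05 critical value for 20 degrees of freedom. The helper names are hypothetical, and the standard t test for a Pearson correlation is assumed; the text does not state which software routine was used.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def t_from_r(r, n):
    """t statistic for testing r against zero with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))


T_CRIT_DF20 = 2.086  # two-tailed .05 critical t for df = 20

# Face-voice correlations as reported in the text.
reported = {
    "trustworthiness": 0.63,
    "dominance": 0.47,
    "attractiveness": 0.36,
    "competence": 0.33,
}
significant = {
    trait: abs(t_from_r(r, 22)) > T_CRIT_DF20 for trait, r in reported.items()
}
print(significant)
```

Under these assumptions only trustworthiness and dominance exceed the critical value, matching the pattern of significance described in the text.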
Acoustic Measures
Because trustworthiness as judged from the voice was significantly correlated with election success, we extracted a number of acoustic measures using the ProsodyPro script (Xu, 2013) in Praat (Boersma & Weenink, 2016). The acoustic parameters included the following: (a) speech rate, calculated as the average number of syllables produced per second of speech, as a measure of how quickly each utterance was produced; (b) fundamental frequency (F0) range, as a measure of how much variation in intonation was present; (c) median F0, as an average measure of how high-pitched a speaker’s voice was (a measure preferable to mean F0, as it reduces influence from outliers due to octave jumps, Lindh, 2006); (d) mean intensity; (e) formant dispersion between F1 and F3, calculated as the average distance between the first three formants; (f) vocal jitter, measured as the “mean absolute difference between consecutive periods, divided by [the] mean period” (Xu, 2013); (g) vocal shimmer, measured as the “mean absolute difference between amplitudes of consecutive periods, divided by the mean amplitude” (Xu, 2013); and (h) harmonic-to-noise ratio, as a measure of the “degree of acoustic periodicity” (Xu, 2013). For male speakers, the pitch calculation range in Praat was set between 75 and 300 Hz, whereas for female speakers, the range was set between 100 and 500 Hz. These values conform to the normative values recommended by Boersma and Weenink (2016). No single acoustic cue in the voice acted as a consistently reliable predictor of voting behavior, implying that participants were using other cues to inform their social judgments.
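Several of the acoustic definitions quoted above reduce to simple arithmetic, sketched below. This is an illustration of the stated definitions only, not the ProsodyPro or Praat implementation; the function names and input values are hypothetical, and real use would start from glottal periods, amplitudes, and formant tracks extracted by Praat.

```python
def relative_mean_abs_diff(values):
    """Mean absolute difference between consecutive values, divided by their mean.

    Applied to glottal periods this matches the quoted jitter definition;
    applied to per-period amplitudes it matches the quoted shimmer definition.
    """
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    return (sum(diffs) / len(diffs)) / (sum(values) / len(values))


def formant_dispersion(f1, f2, f3):
    """Average distance between the first three formants (Hz)."""
    return ((f2 - f1) + (f3 - f2)) / 2


def speech_rate(n_syllables, duration_s):
    """Average number of syllables per second of speech."""
    return n_syllables / duration_s


# Hypothetical values for a single short utterance.
periods_ms = [8.0, 8.2, 7.9, 8.1]      # consecutive glottal periods
amplitudes = [0.50, 0.52, 0.49, 0.51]  # peak amplitude of each period

jitter = relative_mean_abs_diff(periods_ms)
shimmer = relative_mean_abs_diff(amplitudes)
dispersion = formant_dispersion(500, 1500, 2500)
rate = speech_rate(12, 3.0)
print(jitter, shimmer, dispersion, rate)
```

Because the formant distances telescope, the dispersion across F1 to F3 is simply (F3 - F1) / 2, which is why the intermediate value of F2 does not affect the result.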
Discussion
This article aimed to explore the relative contribution of auditory and visual cues to social traits associated with success in a novel context: student elections. Participants rated student representative campaign videos capturing candidates’ own faces and voices on a number of social dimensions. These ratings were then correlated with the proportion of votes received by each candidate. Our findings showed that trustworthiness was the only trait related to the election outcome. Although this was true in all three modalities (auditory, visual, and audiovisual), trustworthiness as judged from the candidates’ voices was the best predictor of election success.
These results are particularly interesting given previous data on the role of perceived competence in election and leadership decisions. It is possible that these different findings reflect the use of real, dynamic (rather than photographic) stimuli. However, it seems highly likely that they are influenced by the different social contexts; for example, electing a President of the United States is rather unlike electing a student sports representative. Indeed, there is already evidence that the context of an election can guide the dimensions people use when making their decisions. Little, Burriss, Jones, and Roberts (2007), for example, collected social ratings based on unrecognizable morphed images of George W. Bush and John Kerry and asked participants to cast hypothetical votes in two different contexts: a time of war and a time of peace. The results showed a strong preference for the morphed face of Bush in time of war, but the morphed face of Kerry received a higher proportion of votes in time of peace. Critically, Bush’s morphed face was perceived as more dominant and masculine, whereas Kerry’s was seen as more likable and intelligent. There is also evidence for changes in the importance assigned to each social trait in different cultures and different countries. Berggren, Jordahl, and Poutvaara (2010), for example, show that ratings of attractiveness are a better predictor of election success in Finland, while Rule et al. (2010) report that judgments of warmth best predict election outcomes for Japanese participants.
Our findings on the importance of voice cues for election decisions are in contrast to those of Klofstad (2017) as well as to previous findings on the greater contribution of face cues when judging trustworthiness (Mileva et al., 2018; Tsankova et al., 2015). It should, however, be noted that trustworthiness judgments from the face and from the voice were strongly correlated, suggesting that first impressions from faces and voices both signal the same integrated person evaluation. This is further supported by the significant positive correlations between face- and voice-based ratings of trustworthiness and dominance, as well as by the positive, though nonsignificant, correlations for attractiveness and competence. It is, therefore, possible that the effects of face and voice cues in social evaluation are complementary, rather than independent. This is an important finding as most previous research has been unable to address this issue. Most studies have used face and voice stimuli of different identities paired together or manipulated voices artificially (Klofstad, 2017; Mileva et al., 2018) instead of the ecologically valid approach we adopt here. It should, however, be noted that our analysis was based on a relatively small stimulus sample, which could potentially affect the reliability and generalizability of our findings. Nevertheless, we report a very consistent pattern of results across all three presentation conditions, which helps strengthen our interpretation and conclusions.
Overall, this study shows that trustworthiness emerges as the most important trait in student elections when information about candidates’ own faces and voices is available. Our results support the role of context in the selection of social traits associated with electoral success. Most importantly, given that this is the first study to integrate face and voice cues in a more ecologically valid way, its findings provide a more complete account of the role of first impressions in predicting the outcome of leadership decisions.
Supplemental Material
Supplemental material, Mileva_OnlineAppendix for The Role of Face and Voice Cues in Predicting the Outcome of Student Representative Elections by Mila Mileva, James Tompkinson, Dominic Watt and A. Mike Burton in Personality and Social Psychology Bulletin
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The research leading to these results has received funding from the European Research Council under the European Union’s Seventh Framework Programme (FP/2007-2013)/ERC Grant Agreement n.323262 to A. Mike Burton.
