Abstract
Despite recent evidence of a positive relationship between cortisol levels and voice pitch in stressed speakers, the extent to which human listeners can reliably judge stress from the voice remains unknown. Here, we tested whether voice-based judgments of stress co-vary with the free cortisol levels and vocal parameters of speakers recorded in a real-life stressful situation (oral examination) and baseline (2 weeks prior). Hormone and acoustic analyses indicated elevated salivary cortisol levels and corresponding changes in voice pitch, vocal tract resonances (formants), and speed of speech during stress. In turn, listeners’ stress ratings correlated significantly with speakers’ cortisol levels. Higher pitched voices were consistently perceived as more stressed; however, the influence of formant frequencies, vocal perturbation and noise parameters on stress ratings varied across contexts, suggesting that listeners utilize different strategies when assessing calm versus stressed speech. These results indicate that nonverbal vocal cues can convey honest information about a speaker’s underlying physiological level of stress that listeners can, to some extent, detect and utilize, while underscoring the necessity to control for individual differences in the biological stress response.
Introduction
Vocal communication of stress has been a central topic of research for decades (Kreiman & Sidtis, 2011b; Scherer, 1986). Yet, due to the inherent complexities of defining, evoking, and measuring stress (Murray et al., 1996), and individual differences in the magnitude of the stress response and in its behavioral manifestation (Pisanski et al., 2018; Scherer, 1986), the effects of stress on vocal production and perception remain unclear.
Attempts to identify the nonverbal vocal correlates of stressed speech have most consistently shown an increase in fundamental frequency (F0, perceived as voice pitch) during stress (for reviews, see Buchanan et al., 2014; Giddens et al., 2013; Kirchhübel et al., 2011). Fundamental frequency is positively related to the rate of vocal fold vibration, and thus increases as the vocal folds tense or as subglottal pressure increases, as often occurs under stress. Past research has shown less consistent effects of stress on other nonverbal vocal characteristics including voice perturbation and noise parameters, such as jitter (a measure of instability in vocal fold vibration causing small asymmetries in periodicity), shimmer (instability in voice amplitude), and harmonics-to-noise ratio (HNR, representing the degree of additive noise in the voice signal) (Kreiman & Sidtis, 2011a; Titze, 1994). The influence of psychological stress on resonances of the vocal tract (formant frequencies) and their relative spacing (ΔF, Pisanski et al., 2014; Reby & McComb, 2003) is also not known.
Importantly, in two recent papers, we show that increases in cortisol levels under lab-induced psychosocial stress (Trier Social Stress Test) and under naturally occurring stress (oral examination) predict increases in voice fundamental frequency (F0, pitch) among women (Pisanski, Nowak, et al., 2016; Pisanski et al., 2018). Similarly, Buchanan et al. (2014) showed that pauses in speech during stress are more pronounced in speakers with a relatively high cortisol stress response. Together, these studies indicate a degree of response coherence between vocal and endocrinological stress systems (Andrews et al., 2013) but, crucially, also show that vocal changes under stress may occur only or predominantly among speakers who show a corresponding biological stress response.
Researchers examining voice-based stress perception have focused largely on developing computer algorithms for speaker identification and speech recognition (both of which are degraded in stressed speech, Hansen & Patil, 2007) and related technologies for lie detection (for a review, see Hollien et al., 2014). Few studies have examined human listeners’ capacity to assess stress from the voice, and to the authors’ knowledge, no previous study has tested accuracy in voice-based stress detection, which we test here by measuring how closely listeners’ judgments of stress map onto a key biological marker of stress in the speaker (cortisol). Earlier playback studies have, however, examined which vocal parameters predict listeners’ ratings of stress using either re-synthesized speech or natural speech stimuli recorded in real-life stressful scenarios, or under induced or imagined stress. These studies have generally shown a positive correlation between the mean or maximum F0 of speech and listeners’ perceptions of stress (Protopapas & Lieberman, 1997; Streeter et al., 1983) but have produced inconsistent results regarding the influence of other vocal parameters on perceived stress. While re-synthesized speech allows for careful control of vocal parameters, natural speech from real-life situations benefits from ecological validity and, critically, allows for direct tests of accuracy in listeners’ stress detection.
In the present study, we tested whether changes in the salivary cortisol levels and vocal parameters of speakers who were recorded under real-life stress predict listeners’ voice-based judgments of those speakers’ stress levels. Importantly, in light of recent evidence that vocal changes under stress covary with the magnitude of the biological stress response (Buchanan et al., 2014; Pisanski, Nowak, et al., 2016; Pisanski et al., 2018), we test the prediction that listeners’ stress ratings will correlate positively with the underlying cortisol levels of speakers, and with their stress-linked vocal parameters, particularly voice pitch.
Methods
Speakers and Stimuli
Twenty voice recordings were taken from a sub-sample of 10 female students registered in an upper-level psychology course who were audio-recorded in a quiet room immediately before an oral examination (stress context) and approximately 2 weeks prior to the exam (M = 12.3, SD = 2.2 days; baseline context), as part of a study examining response coherence in stress systems (see Pisanski, Nowak, et al., 2016 for detailed sample description and recording procedures). In both contexts, speakers were asked to familiarize themselves with and subsequently read the first five sentences of the Rainbow Passage (translation) from which we extracted the third sentence, translating to: “The rainbow is a division of white light into many beautiful colors.” Audio was digitally encoded using an M-Audio Fast Track Ultra interface at a sampling rate of 44.1 kHz and 16-bit amplitude quantization and stored onto a computer as WAV files. Voice stimuli were amplitude normalized to 70 dB RMS SPL in Praat (Boersma & Weenink, 2020) for playback.
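The amplitude normalization step can be illustrated with a minimal sketch. It is not the Praat procedure itself: it assumes that sample values are treated as sound pressure in pascals and that the conventional 20 µPa reference is used for dB SPL. The function names and the test tone are hypothetical.

```python
import math

REF_PRESSURE_PA = 2e-5  # assumed reference pressure (20 µPa) for dB SPL


def rms(samples):
    """Root-mean-square amplitude of a sequence of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def level_db(samples):
    """RMS level in dB relative to the assumed reference pressure."""
    return 20 * math.log10(rms(samples) / REF_PRESSURE_PA)


def scale_to_db(samples, target_db=70.0):
    """Apply one gain factor so that the RMS level equals target_db."""
    current = rms(samples)
    if current == 0:
        raise ValueError("cannot scale a silent signal")
    target_rms = REF_PRESSURE_PA * 10 ** (target_db / 20)
    gain = target_rms / current
    return [s * gain for s in samples]


# Hypothetical stimulus: one second of a 220 Hz tone sampled at 44.1 kHz.
tone = [0.1 * math.sin(2 * math.pi * 220 * n / 44100) for n in range(44100)]
normalized = scale_to_db(tone, target_db=70.0)
```

Because a single gain is applied to the whole recording, relative amplitude variation within each stimulus (and therefore shimmer) is unchanged; only the overall playback level is equated across stimuli.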
Saliva samples were collected in duplicate from each speaker into 2 ml microtubes using the passive drool method (Gröschl, 2008) and were taken 15 min after the baseline and stress voice recordings to allow for cortisol produced by the adrenal glands to manifest in saliva (Kirschbaum et al., 1993). Saliva samples were frozen at –70°C until analysis. Free cortisol levels were measured with enzyme-linked immunosorbent assay (ELISA) commercial kits (DEMEDITEC®, Germany; assay sensitivity 0.014 ng/ml, intra-assay CV 5.9%, inter-assay CV 9.4%). For additional details, see Pisanski, Nowak, et al. (2016).
Acoustic Analysis
Acoustic analysis was performed in Praat (Boersma & Weenink, 2020) on voiced segments of speech. Fundamental frequency parameters included F0 mean, minimum, maximum, and standard deviation (SD) measured using Praat's autocorrelation algorithm (search range 100–600 Hz). Formant frequencies (F1–F4) were measured using Praat’s Burg Linear Predictive Coding algorithm (ceiling 5500 Hz); formant number was adjusted to fit predicted values to observed formants. Formant spacing (ΔF) was taken as a measure of the distance among adjacent formants using equations described in Reby and McComb (2003) and Pisanski et al. (2014). We also measured noise (harmonics-to-noise ratio, HNR), frequency perturbation (jitter: local, local absolute, rap, ppq5, ddp), and amplitude perturbation (shimmer: local, local dB, apq3, apq5, apq11, dda) using Praat's cross-correlation algorithm. Due to strong collinearity, jitter measures and shimmer measures were respectively grouped into two principal components that explained 89% (Jitter PC) and 77% (Shimmer PC) of the variance in each perturbation parameter. Finally, speed of speech was measured as duration and words-per-minute (WPM).
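As an illustration of the formant spacing measure, the sketch below follows the general approach of Reby and McComb (2003), in which each formant is modeled as F_i = ΔF × (2i − 1) / 2 and ΔF is estimated as the slope of a regression through the origin. The function name and the formant values are hypothetical and are not taken from the present data.

```python
def formant_spacing(formants_hz):
    """Estimate formant spacing (delta F, in Hz) from formants F1..Fn.

    Each formant F_i is regressed on (2i - 1) / 2 with the intercept
    fixed at zero; the least-squares slope is the estimate of delta F.
    """
    xs = [(2 * i - 1) / 2 for i in range(1, len(formants_hz) + 1)]
    numerator = sum(x * f for x, f in zip(xs, formants_hz))
    denominator = sum(x * x for x in xs)
    return numerator / denominator


# Hypothetical F1-F4 values (Hz) for an idealized uniform vocal tract.
example_formants = [500.0, 1500.0, 2500.0, 3500.0]
delta_f = formant_spacing(example_formants)
```

For the idealized values above, the formants are spaced exactly 1000 Hz apart, so the estimate is 1000 Hz. With measured formants the regression averages over deviations of individual formants from this uniform pattern.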
Listeners
Fifty listeners (aged 19–28, M= 23.5, SD= 2.4; 52% female) were recruited from two universities in a large European city to rate voice stimuli. All participants provided informed consent. The study was performed in accordance with the Declaration of Helsinki on Biomedical Studies Involving Human Subjects and was approved by the Institutional Ethics Review Board (project no. 2014/13/B/HS6/02636).
Playback Procedure
Participants completed the experiment in individual sessions in a designated lab space. Voice stimuli were presented via a custom computer interface and Sennheiser HD 201 professional headphones at a constant, pre-set volume. Participants rated all 20 voice stimuli (10 baseline, 10 stress) in a randomized order on a scale from 1 (this person is not at all stressed) to 10 (this person is very stressed).
Results
Changes in Voice and Cortisol Under Stress
Linear mixed models (LMMs; see Table 1 footnotes for model structure) fit by restricted maximum likelihood confirmed that speakers had a higher voice pitch (F0), greater formant spacing (ΔF), and spoke faster (more WPM, shorter duration) under stress relative to their unstressed, baseline speech. There were no changes in speakers’ voice perturbation or noise parameters (jitter, shimmer, or HNR). A separate LMM confirmed that speakers’ cortisol levels were, on average, significantly higher under stress relative to baseline (Table 1).
Table 1. Increased Cortisol Levels and Changes in Vocal Parameters During Stress. Descriptive statistics and results of linear mixed models.
a. LMMs: Each speaker parameter was entered as the dependent variable, and context (baseline, stress) as a fixed factor. Speaker ID was included as a random subject variable with random intercept.
**p < .001. *p < .01. †p < .10.
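The within-speaker comparison described in the table footnote can be approximated with a simpler stand-in. The sketch below is a paired t test, not a linear mixed model. It relies on the assumption that, with exactly one observation per speaker per context and a positive estimated speaker variance, the F test for context in a random-intercept model should coincide with the squared paired t statistic. All values are hypothetical.

```python
import math
from statistics import mean, stdev


def paired_t(baseline, stress):
    """Paired t statistic and degrees of freedom for stress minus baseline."""
    if len(baseline) != len(stress):
        raise ValueError("each speaker needs one value per context")
    diffs = [s - b for b, s in zip(baseline, stress)]
    n = len(diffs)
    standard_error = stdev(diffs) / math.sqrt(n)
    return mean(diffs) / standard_error, n - 1


# Hypothetical mean F0 values (Hz) for 10 speakers in each context.
baseline_f0 = [205.0, 198.0, 221.0, 210.0, 190.0, 215.0, 202.0, 208.0, 196.0, 212.0]
stress_f0 = [214.0, 203.0, 226.0, 222.0, 195.0, 219.0, 213.0, 211.0, 204.0, 220.0]

t_value, df = paired_t(baseline_f0, stress_f0)
f_equivalent = t_value ** 2  # comparable to an F statistic on (1, df)
```

The paired test makes the logic of the random intercept explicit: each speaker serves as her own control, so stable between-speaker differences in a parameter do not enter the test of the context effect.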
Voice-Based Judgments of Stress
On average, listeners’ ratings of stress were not higher for voices under stress (M ± SEM = 5.2 ± 0.07) than at baseline (5.4 ± 0.07). To examine whether speakers’ cortisol levels predicted listeners’ stress ratings, stress ratings were then entered into two separate linear regression models, with speakers’ cortisol levels as the predictor. These significant models confirmed that listeners’ ratings of stress increased with the speakers’ cortisol levels (stress model: standardized β = 0.38, S.E. = 0.03, t = 9.1, p < .001; baseline model: β = 0.09, S.E. = 0.01, t = 2.1, p = .04). Simple zero-order Spearman’s correlations, pooling listeners’ stress ratings for each individual speaker, confirmed a positive relationship between mean stress ratings and speakers’ cortisol levels (rs = 0.27). Stress ratings were unimodal but slightly positively skewed; however, regression models with log-transformed stress ratings produced comparable results.
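The zero-order rank correlation reported above can be sketched as follows, using only the Python standard library. A Spearman correlation is a Pearson correlation computed on ranks, with tied values receiving their average rank. The function names are hypothetical, and no values from the study are reproduced apart from the reported coefficient.

```python
import math
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson product-moment correlation between two sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Squaring the reported coefficient gives the proportion of shared
# (rank) variance between mean stress ratings and cortisol levels.
reported_rs = 0.27
shared_variance = reported_rs ** 2
```

Squaring the reported coefficient gives roughly .07, which appears consistent with the figure of about 7% of variance explained that is noted in the Discussion.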
To examine which voice parameters predicted listeners’ stress ratings, all 10 voice parameters were first subjected to principal component analysis (PCA) with varimax rotation to control for collinearity (see Table S1). The PCA produced four components explaining 88% of the variance in voice parameters, corresponding respectively to vocal perturbation and noise (PC1: jitter, shimmer, and HNR); speech rate (PC2: duration and WPM); voice pitch parameters (PC3: F0 mean, max, and SD); and formant spacing (PC4: ΔF).
These acoustic principal components were then entered into an LMM with context (baseline, stress) as a fixed factor, controlling for Speaker ID and Listener ID (Table S2, see footnotes for full model structure). The model showed a significant main effect of voice pitch parameters (PC3) on stress ratings overall. Crucially, however, the model revealed significant interactions between stress context and acoustic principal components, indicating that different vocal parameters predicted listeners’ stress ratings in the baseline and stress contexts.
To investigate this further, separate LMMs were conducted for each context. In the Stress model (F(1, 495) = 4354.2, p < .001), the acoustic principal components corresponding to voice pitch parameters (PC3: F(1, 495) = 29.5, p < .001) and to voice perturbation and noise (PC1: F(1, 495) = 22.8, p < .001) significantly predicted listeners’ ratings of stress. Zero-order correlations, averaging across listeners’ ratings (n = 50) within speakers and contexts, confirmed moderate positive relationships between listeners’ stress ratings and the F0 mean (r = .37), F0 min (r = .45), F0 SD (r = .39), and jitter (r = .52) of speakers’ voices in the stress context. Relatively weaker correlations were observed between listeners’ ratings and the shimmer (r = .29) and HNR (r = –.19) voice parameters of stressed speakers (see Table S3). In the Baseline model (F(1, 495) = 3682.3, p < .001), the voice principal components corresponding to formant spacing (PC4: F(1, 495) = 13.7, p < .001) and again to voice pitch (PC3: F(1, 495) = 7.5, p = .006) explained significant variance in listeners’ stress ratings. Zero-order bivariate correlations were, however, small in effect size, with the largest correlation observed for mean F0 (r = .28; see Table S3 for all correlations). Although the LMMs indicated that speech rate parameters (PC2) did not predict listeners’ stress ratings in either the stress or the baseline context, bivariate correlations showed moderate relationships of duration (r = –.38) and words-per-minute (r = .33) with ratings of stressed speech.
Discussion
Our results show that listeners’ voice-based judgments of stress can be predicted both by the underlying biological stress response of speakers and by their vocal parameters.
Increases in the free cortisol levels of university students recorded during an oral examination, and 2 weeks prior, positively predicted listeners’ voice-based ratings of stress. Acoustic analyses further showed that fundamental frequency (pitch), vocal tract resonances (formants) and speed of speech all increased in speakers during stress, and that higher voice pitch consistently predicted higher stress ratings by listeners. However, the influence of other vocal parameters on stress ratings was less consistent and varied by context. Taken together, these results suggest that listeners can, to some extent, judge a speaker’s physiological stress response from their voice.
Critically, as listeners’ stress ratings were, on average, not predicted by the presence or absence of a stressful context (i.e., oral examination), our results further underscore the importance of controlling for individual differences in speakers’ biological stress responses when conducting research on stressed speech (Pisanski, Nowak, et al., 2016). Indeed, not all speakers in a ‘stressful’ condition will show a physiological stress response nor corresponding vocal changes, and thus, listeners’ perceptions of stress should not be expected to vary as a mere function of context, but rather, as our results suggest, may vary instead with individual differences in biological and behavioral markers of stress, regardless of context.
With the exception of voice pitch (F0), which consistently predicted stress ratings, different combinations of vocal parameters predicted listeners’ stress ratings of speech recorded in baseline and stress contexts, suggesting that listeners utilize different perceptual strategies when assessing calm versus stressed speech, as recently demonstrated for listeners’ assessments of different pain intensities from nonverbal vocalizations (Raine et al., 2018). Listeners’ ratings of stress in stressed speech were also predicted by vocal perturbation and noise parameters, despite a lack of strong evidence that these parameters changed under stress in our sample of speakers. Controlled psychoacoustic experiments on a larger and more diverse sample of voice stimuli and listeners are now needed to systematically test and model the relationships among specific vocal parameters and listeners’ judgments of stress, mapping both onto the physiological stress responses of speakers, to better inform multimodal models of stress detection.
Public speaking and social evaluation evoke some of the strongest stress responses in humans (Dickerson & Kemeny, 2004), often manifesting both physiologically and behaviorally, with some limited evidence of response coherence (covariation) in psychophysical responses to stress (Andrews et al., 2013). The capacity to measure an underlying physiological or endocrinological stress response from a noninvasive behavioral marker such as the voice could improve the efficacy of real-time stress detection, yet has posed a challenge for researchers (Alberdi et al., 2016; Andrews et al., 2013). Our results indicate that human listeners can successfully utilize nonverbal parameters of the voice to gauge stress, and that they may do so by exploiting correspondences between vocal and biological markers of stress (e.g., pitch and cortisol: Pisanski, Nowak, et al., 2016). This supports the hypothesis that vocal communication in humans, like that of many other mammals (Morton, 1977), functions to communicate ecologically and socially relevant information about speaker traits and motivational or emotional states (Pisanski & Bryant, 2019). While humans have a unique capacity to volitionally modulate nonverbal parameters of the voice (e.g., to feign or conceal emotional states, Pisanski, Cartei, et al., 2016), the current findings suggest that the human voice conveys honest cues to stress that may be difficult to conceal, most likely due to the relationship between the physiological stress response and vocal anatomy. At the same time, our results show that only 7% of the variance in listeners’ stress ratings was explained by speakers’ cortisol levels, indicating that the capacity to accurately assess stress from the voice in humans is far from perfect.
Supplemental Material
sj-xlsx-1-pec-10.1177_0301006620978378 and sj-pdf-2-pec-10.1177_0301006620978378 - Supplemental material for Human Stress Detection: Cortisol Levels in Stressed Speakers Predict Voice-Based Judgments of Stress by Katarzyna Pisanski and Piotr Sorokowski in Perception
Footnotes
Acknowledgements
We thank Judyta Nowak (Institute of Genetics and Microbiology, Faculty of Biological Sciences, University of Wroclaw) for conducting hormone analysis.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by a research grant from the National Science Centre to K.P. (2014/13/B/HS6/02636).
Supplemental Material
Supplemental material for this article is available online.
