Abstract
Listeners exhibit orienting responses to voice changes in audio messages. However, the impact of pitch similarity between voices on the nature of the OR has not been explored. We conducted a 3 (Vocal Pitch) × 2 (Location of Change in Message) × 2 (Repetition) within-subjects experiment to address this question. Four non-professional announcers were selected based on differences in vocal pitch. Twelve radio commercials were produced using these announcers to include a single voice change—with either Low-, Medium-, or High-Pitch Differences. The voice changes occurred either within the first or last 20 seconds of the commercial. Heart rate and recognition memory data were collected from 41 subjects. Results show that vocal-pitch difference between speakers impacts automatic attention allocation via the orienting response, and recognition memory for the message is thereby affected. Furthermore, results suggest that having voice changes occur early in an audio message produces the best attentional result.
For some time, media psychology researchers have recognized that message processing is directed not only by the interests and motivations of individual audience members but also by structural aspects of the messages themselves. In auditory messages, one of the most common formal features is the voice change, when one speaker’s voice is instantaneously replaced by that of another (Potter, 2000; Potter, Lang, & Bolls, 2008). However, very little is known about how variations in the overall sound of these voices, which Kreiman and Sidtis (2011) call indexical properties, impact how a voice change is processed by the human cognitive system. This study focuses on one of these properties: the vocal pitch of the speaker. The goal of the study is to test predictions about how differences in vocal pitch between speakers impact information processing in order to build upon a theory of limited cognitive capacity in media audiences. A second aim is to test, using psychophysiological measures, the prediction that listener processing strategy changes over the duration of an audio message from one of controlled attention to semantic content to one of simply monitoring sounds. Finally, this study attempts to use the results obtained to better understand how vocal pitch can be harnessed by media producers to optimize message effectiveness.
This study uses a limited cognitive capacity model of mediated message processing (Lang, 2006a, 2006b; Potter & Bolls, 2011) that conceptualizes individuals as cognitive processors with a limited pool of resources to allocate to encoding information from an audio message. During processing, listeners also make sense of the message by conducting concurrent retrieval of stored knowledge from long-term memory networks. This past knowledge is then synthesized with the information most recently encoded and stored as new long-term memories. Most of the resources allocated to message processing are the result of a controlled process, one dictated by listener interests, goals, and desires. However, some resource allocation to information encoding and storage happens independent of the person’s conscious control. For example, the orienting response (OR) is an automatic allocation of cognitive resources toward encoding new information in the environment (Lang, 2006a; Öhman, 1979; Sokolov, 1963). The OR has been naturally selected for as a way to identify novelty as a source of threat or benefit. Although the most obvious indication of an OR is the turning of the head toward the source (Lynn, 1966; Pavlov, 1927; Singer, 1980), there are also physiological markers of the orienting response. One of these is the decrease in signal power of oscillations in the alpha frequency range in electroencephalographic (EEG) data. Reeves et al. (1985) demonstrated such a decrease following television camera changes and onscreen motion, both of which were hypothesized to introduce enough novelty into the visual field of the television viewer to elicit the automatic allocation of resources to encoding.
In their efforts to find a more convenient way than EEG to identify the OR, Lang and colleagues (Lang, 1990; Lang, Geiger, Strickwerda, & Summer, 1993; Thorson & Lang, 1992) relied on the phasic/short-term activity of the heart, which had been identified as an indicator of orienting (Graham, 1979; Lacey & Lacey, 1974). Congruent with an initial quieting of the body for information intake, an OR elicits a decelerating pattern in heart rate that results in a statistically significant main effect of time over the seconds following the environmental novelty eliciting the response, coupled with either a significant quadratic or cubic trend in the shape of the cardiac response curve (CRC; Potter & Bolls, 2011). Lang and colleagues used the same stimuli as Reeves et al. (1985) and demonstrated cardiac orienting in response to novelty-introducing visual structural features. The onset of visual novelty in websites also results in phasic cardiac reactions (Diao & Sundar, 2004; Wise, Kim, & Kim, 2009), although the finding seems to be much less robust (Lang, Potter, & Bolls, 2009).
People also orient to auditory novelty. Potter et al. (2008) established a list of structural features that elicit cardiac orienting, finding indications of automatic allocation of cognitive resources in radio listeners following the onset of production effects, sound effects, funny voices, and music. Words that are themselves inherently emotional have also been found to cause cardiac orienting responses within the real-time dynamic delivery of audio messages (Lee & Potter, 2006).
Potter (2000) argued that the voice change—defined as the instantaneous replacement of one voice by another in the auditory stream—provided the best audio analog to the cut in video messages because it is frequently used in a similar way: to introduce new information, to change topics, or to move a narrative forward. In an experiment where voice changes occurred either at a fast or slow pace over the course of a 2-minute message, data showed that the voice change introduced sufficient novelty to result in an orienting response (Potter, 2000). Specifically, the cardiac deceleration was in a cubic (S-shaped) pattern associated with processing of speech rather than a quadratic U-shape often found in response to novel noises without semantic content (Chase & Graham, 1967).
Based on the past work of Potter (2000) and Potter et al. (2008), our first hypothesis predicts a replication of data showing that the voice change is a formal feature of audio messages that elicits an orienting response:
The primary focus of the current experiment, however, is whether or not the pitch of the voices that constitute the voice change has an impact on the automatic attention that listeners pay to the message. In the current study, differences in voices were operationalized as differences in the fundamental frequency in the announcer voices. These pitch differences were established both perceptually through self-reported pretest measures and by using spectral analysis to quantify the frequency of the voices used, including the fundamental frequency and the higher overtones that combine to create the overall sound of a voice (Klatt & Klatt, 1990).
The impact of voice changes on processing has been explored previously in the work of Vitevitch (2003), who presented a series of two spoken word lists to subjects, with a break between each presentation. Subjects were instructed to repeat each word immediately after it was spoken. The word lists presented before and after the break were either spoken by the same voice or by a different voice, although the pitch qualities of the different voices were not described. Results of open-ended questions show that nearly half of the subjects presented with a different voice failed to recognize the change (Vitevitch, 2003). Furthermore, accuracy was greater and response time quicker for the subjects who heard the same speaker compared with the subjects who were presented the series of words with two different speakers and reported noticing the change. Using the degree of difference between the voices as a multi-level independent variable will add depth to this line of inquiry by more accurately describing whether relatively small pitch differences are cognitively processed similarly to larger differences.
The findings of Vitevitch (2003) are similar to results found by Cole, Coltheart, and Allard (1974), who delivered a series of aurally presented letters to subjects who were instructed to report whether the same letter was spoken consecutively in the list. Response times were quicker to repeated letters if the same speaker was used for both letters compared with when different speakers were used for the two letters, presumably because the change in voice interrupted the controlled processing task of letter identification. Gregg and Samuel (2009) found that acoustic properties provide a weaker cue than the semantic meaning of a message for both change detection and encoding of information. The researchers found that unless it is useful for a given task, indexical properties of a voice (such as pitch) will be less efficiently processed in favor of obtaining the meaning of an auditory message (Gregg & Samuel, 2009). This offers an empirical explanation for why failure to hear a voice change in an auditory message, which Vitevitch (2003) calls “change deafness,” might occur. Similarly, Eramudugolla, Irvine, McAnally, Martin, and Mattingley (2005) found that subjects completed tasks more accurately when their attention was directed to specific locations in an auditory series where a difference in voice would be expected as opposed to when subjects were given no indication where the change would occur in a sequence. The overall findings of Eramudugolla et al. (2005) support the conclusion that knowing the voice change locations in advance results in more controlled processing because the changes become more salient in the task. This serves to override the default hierarchy of information processing outlined by Gregg and Samuel (2009).
As a result of these studies, it seems that there are times when people are unable to recognize, at least to the extent that they can self-report it, the change between two speakers (Eramudugolla et al., 2005; Vitevitch, 2003). However, behavioral measures suggest that there are times when voice changes nevertheless interfere with the cognitive processing of semantic information being presented (Pilotti, Antrobus, & Duff, 1997; Vitevitch, 2003). Because the orienting response is, by definition, an automatic reaction to novelty in the environment, and because past research suggests that changes in speakers sometimes do not rise to the level of constituting consciously observed environmental novelty, it is interesting to consider to what extent differences in vocal pitch impact the robustness of the orienting response to a voice change. It is reasonable to assume that if two voices are very similar in pitch, it is more likely that listeners will fail to orient to a replacement of one voice by the other. Therefore, the following hypothesis is made:
There is evidence in both television (Lang, 1991, 1996; Thorson & Lang, 1992) and radio processing (Potter, 2000) that although the functional purpose of the OR is to deliver cognitive resources to encoding, the ability of the cognitive system to utilize those resources to process information does not happen immediately after the OR occurs in response to formal features of media. Potter (2000) found that there is approximately a 3-second window following an auditory structural feature during which recognition memory for information in the message is significantly dampened compared to immediately prior to feature onset. Lang (1996) suggests that this decrease may be due to momentary overload of the limited capacity system while the automatically allocated resources come online following orienting. If it is the case and, as we hypothesize, the strength of the OR following a voice change is dependent on pitch similarity between the voices, then recognition memory for information delivered immediately following a voice change could be expected to vary in a pattern similar to the magnitude of the OR elicited. If more drastic differences in pitch lead to a greater degree of orienting as predicted in H2, then we expect a similar impact on recognition memory such that,
Also of interest to this study, Potter (2000) showed that orienting to the voice change did not habituate over the course of 2 minutes. In other words, not only did voice changes occurring early in a radio dialog cause listeners to automatically apply cognitive resources to message encoding, the same automatic reaction occurred later in the message when the ongoing dialog was well established, and the pitch of both voices was also less novel than early in the message. However, the visual representation of the cardiac orienting shows a fairly consistent dampening of the OR over time (Potter, 2000, Figure 2). So, at the end of the 2-minute message, although the significant main effect for time and trend pattern in the data indicates an orienting response, the depth of cardiac deceleration diminishes and the overall shape of the curve seems to follow a quadratic rather than cubic pattern. This suggests that perhaps as an audio message unfolds, listeners shift their processing strategies from one of primarily controlled processing to one where automatic processing dominates. This experiment allows for a test of that hypothesis. When processing sounds with semantic meaning (e.g., words), the heart responds in a cubic S-shaped pattern when orienting occurs. When semantic meaning is absent, however, as when the orienting is a reaction to a change in simple audio stimuli like a change in frequency of tone pips, the cardiac response is a quadratic or U-shape (Chase & Graham, 1967; Potter et al., 2008). Therefore, if radio listeners begin processing a new message by applying a comparatively high level of controlled resources to the semantic content of the message, but then shift away from controlled processing after feeling as if they understand the overall gist of the message, we would expect the pattern of cardiac orienting responses to reflect this. Therefore,
Method
Design
This study employed a 3 (Pitch Difference) × 2 (Voice Change Location) × 2 (Repetition) within-subjects factorial design. The first factor, Pitch Difference, had three levels representing the difference in fundamental frequency between the voices constituting the voice change within each message. Operationally, each stimulus contained a single voice change occurring between two speakers. One speaker’s voice—referred to as the reference voice—occurred in each stimulus and was matched by one of three other voices to constitute the voice change. In the Low Difference level of the factor, the reference voice was paired with another male voice sounding very similar in pitch to comprise the voice change. For the Medium Difference level, the reference voice was paired with a different male announcer’s voice, one with subjectively different vocal pitch. Finally, the High Difference level represented a change from the reference voice to a female voice, selected specifically to have a subjectively extreme difference in pitch and thereby provide the most drastic change in vocal pitch across the voice change.
The second factor, Voice Change Location, had two levels: early or late. Voice changes that occurred early happened in the first 20 seconds of the 60-second message; those that occurred late happened in the last 20 seconds of the message.
Pitch Difference and Voice Change Location were fully crossed. The reference voice served as the first voice in the pair in half the messages and as the second voice in the other half. The Repetition factor provided two examples of each combination of change type and location. Stimulus presentation was randomized by Media Lab software (Jarvis, 2010b) to control for order effects.
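As a rough illustration of the fully crossed design, the 12 cells can be enumerated and shuffled for each participant. This is a minimal sketch, not the Media Lab configuration used in the study; the level labels, the function names, and the per-participant seed are assumptions introduced only for illustration.

```python
import itertools
import random

# Factor levels from the Design section; the string labels are illustrative.
PITCH_DIFFERENCE = ["low", "medium", "high"]
CHANGE_LOCATION = ["early", "late"]
REPETITION = [1, 2]


def build_conditions():
    """Return all 3 x 2 x 2 = 12 cells of the fully crossed design."""
    return list(itertools.product(PITCH_DIFFERENCE, CHANGE_LOCATION, REPETITION))


def randomized_order(seed):
    """Return one shuffled presentation order.

    The seed is a hypothetical per-participant value; the article does not
    describe how Media Lab seeds its randomization.
    """
    order = build_conditions()
    random.Random(seed).shuffle(order)
    return order
```

Each of the 12 cells corresponds to one commercial, so every participant hears every combination exactly once, in an order that differs across participants.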
Stimulus Materials
In order to increase the likelihood that only the independent variable of interest, pitch difference, was explored, non-professional speakers were used for stimuli creation to reduce the influence of prosodic elements and vocal emphasis associated with the dramatic interpretation of scripts expected of practiced announcers. The final stimuli used three males and one female as announcers. 1 Each of the speakers was Midwestern American with English as their first language and had no noticeable accents. Selection of the announcers began by identifying a male speaker as the reference voice and then choosing three others whose vocal pitch seemed to the authors to vary enough to operationalize the three levels of the Pitch Difference factor. To confirm this subjective evaluation, we recorded each speaker reading an identical paragraph of text, one different from the scripts used as experimental stimuli. These pre-test recordings were then digitally edited into six combinations such that each half of the sample paragraph was read by a different speaker. Therefore, the following six combinations of interest were produced: Reference voice/High Difference Female voice, High Difference Female/Reference, Reference/Medium Difference Male, Medium Difference Male/Reference, Reference/Low Difference Male, and Low Difference Male/Reference. These combinations were played for pre-test subjects (n = 16). Immediately after hearing each combination, subjects were asked “How different did the two voices sound to you?” and responded on a 7-point scale ranging from 1 = not at all similar to 7 = extremely similar. The data were reverse scored to provide a measure of the extent to which the voices sounded different from each other (1 = not at all different/7 = extremely different). Analysis returned a statistically significant main effect of perceived difference across the levels of the Pitch Difference factor, F(2, 48) = 21.86, p < .001. Post hoc analyses show significant differences (ps < .004) in self-reported perceived difference between the three pairings: low (M = 1.63), medium (M = 2.69), and high change (M = 3.94).
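The reverse scoring of the pre-test ratings can be sketched as follows. The function name and the range check are assumptions for illustration; only the 1-to-7 scale endpoints come from the text.

```python
def reverse_score(rating, scale_min=1, scale_max=7):
    """Flip a similarity rating into a difference rating on the same scale.

    On the 7-point scale described above, 1 (not at all similar) becomes
    7 (extremely different), and 7 becomes 1.
    """
    if not scale_min <= rating <= scale_max:
        raise ValueError("rating is outside the scale range")
    return scale_min + scale_max - rating
```

The midpoint of the scale (4) is unchanged by this transformation, so only the direction of the scale is reversed, not its spacing.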
With the subjective perceptual differences between the four speakers confirmed, each then was digitally recorded reading 12 radio scripts. These scripts were transcriptions of actual broadcast radio commercials featuring a single speaker in a straight-read style. The recordings were edited so that voice changes occurred at natural sentence breaks within the script. The breaks were located in the first and last 20 seconds of each recording to create the two levels of the Voice Change Location factor.
Because the initial pre-test of the pitch difference between the four voices used different recordings than the final stimuli, and because that analysis was based on listener perception rather than the pitch of the speakers themselves, a spectral analysis was performed to quantify the fundamental frequency of each speaker. Using PRAAT software (Version 5.3.56; Boersma & Weenink, 2005) to compare the average values of the first and second voices heard on each stimulus recording, the difference in fundamental frequency between the voices was approximately 12 Hz, 22 Hz, and 63 Hz in the low, medium, and high levels of the Pitch Difference factor, respectively. This was at least 6 times the Just Noticeable Difference for pitch changes of 2% (Puts, Hodges, Cárdenas, & Gaulin, 2007), further confirming the differences between levels of the Pitch Difference factor.
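The comparison against the Just Noticeable Difference can be expressed as a simple ratio. This is a sketch under a stated assumption: the article does not report the reference voice's fundamental frequency, so the 100 Hz value used in the usage note is a placeholder, not a measured value.

```python
def jnd_multiples(difference_hz, reference_f0_hz, jnd_fraction=0.02):
    """Express a pitch difference as a multiple of the Just Noticeable Difference.

    jnd_fraction=0.02 is the 2% JND cited in the text. reference_f0_hz must be
    supplied by the caller because the article does not report it.
    """
    if reference_f0_hz <= 0:
        raise ValueError("reference_f0_hz must be positive")
    jnd_hz = jnd_fraction * reference_f0_hz
    return difference_hz / jnd_hz
```

With a placeholder reference of 100 Hz, the JND is 2 Hz, so `jnd_multiples(12, 100)` expresses the 12 Hz Low Difference gap as six JNDs. A different reference frequency would change these multiples proportionally.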
Dependent Variables
Orienting responses
The orienting response is an evolved cognitive reaction to novelty or signal stimuli in the immediate environment (Graham, 1979; Lang, 1990; Sokolov, 1963), which in this study was identified using cardiac data. When a structural feature of media causes an orienting response, it is identified in aggregate heart rate data through a significant main effect of time and cubic or quadratic pattern in the response during the brief period following the feature onset (Graham, 1979; Potter et al., 2008). Operationally in this study, average heart rate in beats-per-minute/second was calculated for the 5 seconds prior to each voice change as a baseline to control for incidental variations that may have been present in the one inter-beat interval (IBI) occurring just prior to the voice change. Change scores were calculated from this average baseline for the 12 seconds following voice change onset.
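The quadratic and cubic patterns mentioned above are typically assessed with orthogonal polynomial contrasts applied to the cardiac response curve. The sketch below shows one way to compute such trend scores for a single curve in plain Python. It is an illustration only, not the analysis routine used in the study, and the function names are hypothetical.

```python
def orthogonal_polynomial_contrasts(n_points, max_degree=3):
    """Build unit-length linear, quadratic, and cubic contrasts.

    Uses Gram-Schmidt orthogonalization on the powers of equally spaced
    time points 0, 1, ..., n_points - 1.
    """
    xs = [float(i) for i in range(n_points)]
    basis = []
    for degree in range(max_degree + 1):
        vec = [x ** degree for x in xs]
        for prev in basis:
            # prev has unit length, so the dot product is the projection size.
            proj = sum(v * p for v, p in zip(vec, prev))
            vec = [v - proj * p for v, p in zip(vec, prev)]
        norm = sum(v * v for v in vec) ** 0.5
        basis.append([v / norm for v in vec])
    return basis[1:]  # drop the constant term


def trend_scores(curve, max_degree=3):
    """Return the linear, quadratic, and cubic trend scores for one curve."""
    contrasts = orthogonal_polynomial_contrasts(len(curve), max_degree)
    return [sum(c * y for c, y in zip(contrast, curve)) for contrast in contrasts]
```

In a repeated-measures setting, one such score per participant would be tested against zero for each trend. A U-shaped curve loads mainly on the quadratic contrast, whereas an S-shaped curve loads on the cubic contrast.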
Recognition memory
After listening to the 12 radio ads and completing a distraction task, subjects were asked to listen to a series of sixty 3-second sound bites taken from the recordings of the 12 scripts made by each of the four voices. The 3 seconds of audio were always portions of the script that occurred immediately following the voice change. Subjects were instructed that the memory probes would be the same words they had heard before in the radio stimuli. Their task, however, was to determine whether the voice used in the memory probe was the same voice that they had heard speak those words earlier in the experimental session. They were instructed to indicate their determination as quickly as possible using computer keys labeled “Yes” and “No.” Recognition memory was operationalized as percentage of correct classifications: pushing “Yes” when the words were spoken by the same voice and “No” when the words were spoken by a different voice.
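The recognition score described above amounts to a percentage of correct Yes/No classifications. A minimal sketch follows; the trial representation as pairs of booleans is an assumption made for illustration and is not the Direct RT output format.

```python
def recognition_accuracy(trials):
    """Percentage of probes classified correctly.

    Each trial is a pair (same_voice, responded_yes). A response is correct
    when "Yes" is given to a same-voice probe or "No" to a different-voice probe.
    """
    trials = list(trials)
    if not trials:
        raise ValueError("at least one trial is required")
    correct = sum(1 for same_voice, responded_yes in trials
                  if same_voice == responded_yes)
    return 100.0 * correct / len(trials)
```

For example, a participant who answers two of four probes correctly receives a score of 50, which is also the level expected from guessing on a two-alternative task.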
Apparatus
Heart rate data were collected by three 8 mm Ag/AgCl electrodes filled with conductive gel placed on the subject in the lead I configuration. The bioelectric signal was sent to a Coulbourn Instruments Lab Linc V Physiological Data Acquisition System. This equipment includes a bioamplifier with a bandpass filter that was set to attenuate noise from frequencies below 8 Hz and above 150 Hz. This analog signal was sent to a window comparator module, which registered each R spike of the cardiac polarization cycle in real time, resetting a Schmitt trigger that generated digital data in the form of the number of milliseconds between each R spike, a value known as the interbeat interval (IBI).
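Because an IBI is measured in milliseconds, the corresponding heart rate is 60,000 divided by the IBI. The sketch below converts a sequence of IBIs into a time-weighted heart rate for each whole second. It illustrates one common approach and is not a description of the VPM software's algorithm; the function name and the assumption that the first interval starts at time zero are hypothetical.

```python
def ibis_to_bpm_per_second(ibis_ms, n_seconds):
    """Return time-weighted heart rate (beats per minute) for each second.

    Each interbeat interval contributes its rate (60000 / IBI) in proportion
    to the part of the second it overlaps. Seconds not covered by any
    interval are returned as NaN.
    """
    intervals = []
    start = 0.0
    for ibi in ibis_ms:
        intervals.append((start, start + ibi, 60000.0 / ibi))
        start += ibi

    rates = []
    for second in range(n_seconds):
        win_start, win_end = second * 1000.0, (second + 1) * 1000.0
        weighted, covered = 0.0, 0.0
        for beat_start, beat_end, bpm in intervals:
            overlap = min(beat_end, win_end) - max(beat_start, win_start)
            if overlap > 0:
                weighted += bpm * overlap
                covered += overlap
        rates.append(weighted / covered if covered else float("nan"))
    return rates
```

Weighting by overlap keeps a single long or short beat from dominating a second that it only partly occupies.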
Subjects also had skin conductance and facial electromyographic data recorded during exposure to the audio messages. However, because these measures are not related to our hypotheses, they are not reported here.
The physiology data collection was controlled by a 386 computer running VPM software (Cook, 2010). The stimulus presentation was controlled by Media Lab software (Jarvis, 2010b) running on a desktop computer using Windows Vista OS. Subjects listened to the audio messages, presented randomly, using a set of Sennheiser over-ear headphones at a volume level determined by the researchers and controlled across subjects.
The recognition memory test was taken via the Direct RT software program (Jarvis, 2010a) on a laptop computer brought into the testing room after the distraction task.
Participants
Participants (n = 41) were recruited from undergraduate telecommunications courses at a large Midwest university and received course credit for their participation. This study was conducted in accordance with approval given by the university Institutional Review Board (Study 1103004944).
Procedure
Upon entering the lab, participants signed an informed consent statement. They were then led into the experiment room, where the experimenter prepared the participant’s skin for the electrodes. For measuring heart rate, an alcohol pad was first used to clean the skin. After filling the three electrodes with conductive gel, one was placed on the left wrist as a ground, and the remaining two were placed on the inner side of each forearm.
After ensuring that all of the physiological measurement systems were operating correctly, the experimenter instructed the participants on their task and exited the room. The first task consisted of randomized presentation of the 12 radio commercials. Physiological measures were recorded, time-locked to stimulus exposure. Following each stimulus clip, self-report data were collected assessing the level of attention and arousal experienced, as well as how negatively or positively participants perceived the commercial to be.
After the 12 ads were heard, the headphones were removed and the subject completed one of two brief distraction tasks. Then the experimenter removed the electrodes from the participant. Headphones were once again put on, and brief clips from the beginning of each commercial were replayed for the participants to cue them to a particular ad. The participants then completed a funnel-style memory task in which they described anything odd they remembered about that particular commercial. This open-ended portion was followed by a question asking how many voices the participant remembered hearing in the commercial. These questions were asked to determine whether the voice changes were overtly recognized as extraordinary. As the responses were unremarkable, they are not reported here.
After the participant completed these open-ended memory tasks, the experimenter brought in a laptop computer and administered the forced-choice recognition task. Upon completion of this task, the experimenter returned to remove the laptop and headphones, then thanked and dismissed the participant.
Data Cleaning, Reduction, and Analysis
Heart rate data were cleaned and beats per minute/second calculated using VPM software (Cook, 2010). Change scores were calculated by subtracting the average heart rate during the 5 seconds prior to the voice change from each of the subsequent 11 seconds. To test for the presence of an orienting response, a 3 (Pitch Difference) × 2 (Voice Change Location) × 12 (Time) × 2 (Repetition) repeated-measures ANOVA was conducted. The 12 levels of the repeated measures factor for Time represented a zero value of onset followed by each subsequent change score for 11 seconds post voice change. Due to expected violations of sphericity in the cardiac data, results are reported using the Huynh-Feldt adjusted F values and p values with the unadjusted degrees of freedom.
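The change-score computation described above can be sketched as follows. This is an illustration under stated assumptions, not the VPM routine: the function name is hypothetical, and it assumes that element `i` of the input holds the mean heart rate during second `i`, with the voice change falling at the start of second `onset_index`.

```python
def change_scores(hr_per_second, onset_index, baseline_seconds=5, post_seconds=11):
    """Return the 12-level Time series for one voice change.

    The baseline is the mean heart rate over the 5 seconds before onset.
    The result starts with a zero value at onset, followed by one
    baseline-subtracted score for each of the 11 seconds after the change.
    """
    if onset_index < baseline_seconds:
        raise ValueError("not enough pre-onset data for the baseline")
    if onset_index + post_seconds > len(hr_per_second):
        raise ValueError("not enough post-onset data")
    baseline_window = hr_per_second[onset_index - baseline_seconds:onset_index]
    baseline = sum(baseline_window) / baseline_seconds
    post = hr_per_second[onset_index:onset_index + post_seconds]
    return [0.0] + [value - baseline for value in post]
```

Negative scores indicate deceleration relative to baseline, which is the direction expected during an orienting response; positive scores indicate acceleration above baseline.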
Recognition results were submitted to a 3 (Pitch Difference) × 2 (Voice Change Location) × 2 (Repetition) repeated measures ANOVA.
Results
Orienting to Voice Changes
The main effect of Time on the CRC following the voice changes collapsed across all other factors was significant, F(11, 240) = 2.556, p = .034, ηp2 = .06. The quadratic trend in the CRC data was significant, F(1, 40) = 9.061, p = .005, ηp2 = .185, as was the cubic trend, F(1, 40) = 7.596, p = .009, ηp2 = .160, as can be seen in Figure 1. The shape of this CRC is in line with previous findings showing orienting to voice changes in professionally produced radio messages (Potter, 2000; Potter et al., 2008). Hypothesis 1 was supported.

Figure 1. Cardiac response curve showing orienting to voice changes collapsed across pitch difference and voice change location.
Effect of Pitch Difference on Orienting to Voice Changes
It was hypothesized that the pitch differences between the two voices involved in a voice change would affect the magnitude of the cardiac orienting response in reaction, with greater pitch differences leading to deeper cardiac deceleration. The Pitch Difference × Time interaction was significant, F(22, 880) = 1.886, p = .040, ηp2 = .045, and can be seen in Figure 2. As predicted, the Low Difference voice change—where the male reference voice was replaced by a male voice of similar pitch—resulted in the shallowest cardiac orienting response. This was followed by the Medium Difference voice change, also a male-to-male change but with much more difference in pitch than the Low Difference stimuli. Visual examination of the High Difference change presented a somewhat unexpected result. Although the initial cardiac deceleration was as predicted—more rapid and steep than either of the other two pitch conditions—a somewhat dramatic acceleration of heart rate began after about 1.5 seconds post voice-change onset. In fact, after initial deceleration heart rate increased to above baseline levels in response to the High Difference voice change.

Figure 2. Pitch difference × Time interaction on cardiac response curves.
Analysis of the CRCs obtained for each level of the Pitch Difference factor shows that the main effect of Time was not significant for the low messages (p = .425); however, the trend followed a significant quadratic shape, F(1, 40) = 4.778, p = .035, ηp2 = .045. There was a significant effect of Time on the Medium difference messages, F(11, 440) = 2.588, p = .032, ηp2 = .061, with a significant quadratic trend in the data, F(1, 40) = 6.574, p = .014, ηp2 = .141. The main effect for Time was also significant for the High Vocal Difference messages, F(11, 440) = 2.790, p = .016, ηp2 = .065. However, the quadratic trend was now not significant (F < 1), whereas the cubic trend showed the greatest effect across all CRCs, F(1, 40) = 9.313, p = .004, ηp2 = .189. The result of these analyses was partial support for Hypothesis 2. While pitch differences between speakers certainly did result in differences in magnitude of cardiac orienting, there was the unexpected finding that after initial cardiac deceleration, the High-Pitch Difference produced cardiac acceleration above baseline levels. Inductively, this can be attributed to the dual innervation of the heart by the sympathetic (SNS) and parasympathetic (PNS) nervous system (Potter & Bolls, 2011). The drastic difference in pitch—and perhaps the unexpected nature of a change in announcer gender during a single advertising appeal—appears to have led to activation in the SNS to such an extent that it overcame the initial PNS-driven deceleration, increasing the heart rate above baseline levels.
Effect of Pitch Difference on Recognition Memory
Hypothesis 3 predicted that the stronger the cardiac orienting, the less efficient the encoding of the information immediately following the voice change due to the amount of time needed for the automatically allocated cognitive resources to be utilized (Lang, 1996; Potter, 2000). As predicted, there was a significant main effect of Pitch Difference on recognition memory for information occurring immediately following a voice change, F(2, 128) = 4.19, p = .017, ηp2 = .157. Post hoc analyses show that the recognition memory was significantly better for information following the most shallow cardiac orienting responses elicited by the Low Pitch Difference voice changes (M = 62.8, SE = .03) compared with both the information following Medium Pitch Difference voice changes (M = 49.4, SE = .03) and that following High-Pitch Difference voice changes (M = 45.1, SE = .04). Although the means were in the predicted direction for the latter two recognition values, the differences were not statistically significant.
Effect of Pitch Change Location on Cardiac Orienting
The final hypothesis tested an interpretation of earlier work that suggested listeners shift their mode of processing over the course of a long-form audio message from one with a higher level of controlled resource allocation to one where automatic resource allocation predominates. Chase and Graham (1967) found that words—sounds that contain semantic meaning—produce a different cardiac orienting response in listeners than non-word sounds, suggesting a duality of underlying processes. Given this finding, it can be said that when hearing a radio commercial, the content can be processed in one of two distinct ways. The present study addresses the possibility that over time, listeners can shift from a more cognitive resource-intensive semantic meaning process to a less demanding general sound processing strategy. Another way to describe this is that listeners are processing the audio earlier in a radio message as words, whereas later in the message, arguably when they have received what they think the gist of the message is, they process the audio stream as merely a sequence of sounds. If that is the case, we predicted that when collapsing the voice changes across the three levels of the Vocal Difference factor, we would see different primary shapes in the CRCs formed following early and late voice changes.
This hypothesis was supported with a significant Voice Change Location × Time interaction on the cardiac change scores, F(11, 528) = 2.694, p = .023, and is shown in Figure 3. Subsequent analyses of each individual CRC show that orienting to the voice changes at the beginning of the message resulted in a significant main effect of Time, F(11, 528) = 3.301, p = .031, and had a significant cubic trend, F(1, 48) = 14.44, p < .001. However, as predicted, the CRC following voice changes later in the message was qualitatively different. Not only was the overall main effect of Time not statistically significant, F(11, 528) = 1.45, p = .20, but the shape of the curve was not cubic—F(1, 48) = 1.009, p = .32—and even the quadratic trend was only approaching significance, F(1, 48) = 3.00, p = .09.

Figure 3. Change location × Time on cardiac response curves.
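The trend analysis reported above asks how much of a cardiac response curve (CRC) follows a cubic (S-shaped) versus a quadratic (U-shaped) pattern over the 12 time points. The sketch below illustrates that idea descriptively with orthogonal polynomial contrasts. It is not the repeated-measures trend test used in the study: the function names are hypothetical, the example curve is made-up data, and the output is only the share of a single curve's variation captured by each trend component.

```python
def orthogonal_poly_contrasts(n_points, max_degree=3):
    """Orthonormal polynomial contrasts (degree 1..max_degree) for equally
    spaced time points, built by Gram-Schmidt on powers of centered time."""
    center = (n_points - 1) / 2
    t = [i - center for i in range(n_points)]
    basis = [[1.0] * n_points]  # constant term, used only for orthogonalization
    for degree in range(1, max_degree + 1):
        v = [x ** degree for x in t]
        for b in basis:
            proj = sum(vi * bi for vi, bi in zip(v, b)) / sum(bi * bi for bi in b)
            v = [vi - proj * bi for vi, bi in zip(v, b)]
        basis.append(v)
    contrasts = []
    for v in basis[1:]:
        norm = sum(x * x for x in v) ** 0.5
        contrasts.append([x / norm for x in v])
    return contrasts


def trend_shares(curve):
    """Share of a curve's variation around its mean captured by the
    linear, quadratic, and cubic trend components."""
    n = len(curve)
    mean = sum(curve) / n
    total_ss = sum((y - mean) ** 2 for y in curve)
    names = ["linear", "quadratic", "cubic"]
    if total_ss == 0:  # a perfectly flat curve has no trend to partition
        return {name: 0.0 for name in names}
    shares = {}
    for name, contrast in zip(names, orthogonal_poly_contrasts(n)):
        score = sum(c * y for c, y in zip(contrast, curve))
        shares[name] = score ** 2 / total_ss
    return shares


# Hypothetical heart rate change scores (not data from the study):
# a deceleration followed by a return toward baseline, i.e. a U shape.
hypothetical_crc = [0.0, -0.6, -1.1, -1.5, -1.7, -1.8,
                    -1.8, -1.6, -1.3, -0.9, -0.5, -0.1]
print(trend_shares(hypothetical_crc))
```

Under these assumptions, a curve dominated by the quadratic share resembles the late-message response, whereas a large cubic share resembles the early-message response. Inferential claims still require the F tests reported above.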
Discussion
The primary purpose of this study was to investigate the impact that one element of the human voice, the fundamental frequency or pitch, has on the cognitive processing of voice changes in radio messages. Voice changes are the point in a message when one speaker’s voice is replaced by another in the auditory stream. Potter’s work (Potter, 2000; Potter et al., 2008) has found that voice changes are auditory structural features capable of eliciting orienting responses. However, previous work has not investigated how indexical variables of the voices may impact the strength of the orienting response produced by the voice changes. Our results suggest that investigating the effects of these variables may be important, as we found that the vocal pitch of the speaker affects automatic attention to voice changes. When the voice change involves two male voices, the robustness of the orienting response, as indexed by cardiac deceleration, is dependent upon how similar the two voices are to each other in pitch. When they are more similar, cardiac orienting is dampened compared with when the male voices are more distinct from each other in pitch. The greatest impact of pitch on orienting was when the male reference voice was matched with a female voice to comprise a voice change. Although the resulting CRC was not in the shape we predicted, this was arguably because the impact of pairing a male and a female speaker together was even more effective at attracting automatic attention than we expected. We predicted that the large pitch difference would result in the deepest cardiac deceleration. However, although the deceleration was initially the steepest in slope, the heart rate response to the High-Pitch level of the factor quickly accelerated as the SNS responded. Our recognition memory data support this interpretation.
Lang (1996) and Potter (2000) have suggested that part of the temporal dynamics of the orienting response to auditory signals includes a “dead-zone” in the 3 seconds immediately following the orienting-eliciting structural feature. This is conceptualized as a momentary period of cognitive overload while the automatically allocated resources come online for the encoding subprocess to utilize. So, we expected that recognition memory would be worse immediately following the onset of voices with the greatest differences in vocal pitch compared with voice changes composed of voices that were similar in fundamental frequency. In this study, we show statistically significant impairment in forced-choice recognition memory for sound bites immediately following High and Medium Pitch Differences compared with Low Difference Voice Changes.
Taken together, these data suggest that not all voice changes are created equal when it comes to eliciting orienting responses. Voices with greater perceived differences in pitch seem to be more effective at causing automatic resource allocation. This has implications for practice across a wide array of media categories. Consider emergency communications—such as crisis response teams using radio communications to respond to catastrophes. Radio dispatchers could be selected, in part, based on the fundamental frequency of their voices in an attempt to have them be of comparatively high contrast to those communicating from the field. Perhaps even more effective would be audio signal processing software designed to vary the vocal pitch of each speaker involved in first-responder communications to optimize the frequencies being heard and increase the probability that orienting responses would be elicited by voice changes.
Our pitch-difference findings also have implications for podcasters and producers of radio journalism. If sound bites are obtained from newsmakers or guests, the pitch of their voices should be considered, if possible, when determining which reporters or hosts to assign to produce the audio surrounding them. Our results can similarly be applied by advertising producers to optimize the persuasiveness of their messages. Multiple voices in radio ads can help increase automatic attention allocation by audiences, especially if the pitch of the voices differs substantially. In fact, our finding that the most robust orienting occurred when male and female voices were combined in ads provides added support to Rodero, Larrea, and Vázquez (2013), who argue that empirical data do not support the standard industry opinion that males are perceived as more convincing and effective as voiceover talent. Our results extend the argument, suggesting that marketers consider employing female announcers in their radio ads in order to attract automatic attention to their persuasive messages. And, while our stimuli created the male/female contrast by pairing two announcers within a single ad, the predominance of male voices in the world’s radio advertising makes it easy to conceptualize an advertising message voiced solely by a female as having a high probability of resulting in a male-to-female voice change at the onset of the ad itself.
A final practical recommendation arises from the recognition results, which, as mentioned above, once again demonstrate that memory for information occurring immediately after a voice change is impaired due to the time it takes for the allocated cognitive resources to come online. So, although increases in pitch differential led to an increased probability that automatic allocation would occur after the voice change, it is important to understand how to harness this to maximize benefit. If audio producers have control over when important content occurs relative to the voice change, writing copy where the most important information happens immediately following a voice change would be ineffective. For example, rather than placing a line such as “Shanahan Auto Body is the place to take your car for quality work after a collision” right after the voice change, the most important part of the informational clause, the client’s name, should be spoken as far away from the voice change as possible. This takes advantage of the automatic resource allocation while giving that added attention time to be applied to the content.
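As a rough illustration of this recommendation, a producer could check a script's timing against the 3-second window described by Lang (1996) and Potter (2000). The sketch below is a hypothetical helper, not a tool used in the study. The function name, the timing values, and the phrase labels are all assumptions made for the example.

```python
DEAD_ZONE_SECONDS = 3.0  # window after a voice change, per Lang (1996) and Potter (2000)


def phrases_in_dead_zone(voice_change_times, phrase_onsets, window=DEAD_ZONE_SECONDS):
    """Return labels of key phrases whose onset (in seconds) falls within
    `window` seconds after any voice change."""
    flagged = []
    for label, onset in phrase_onsets.items():
        if any(change <= onset < change + window for change in voice_change_times):
            flagged.append(label)
    return flagged


# Hypothetical timings for a 60-second spot with one voice change at 12 s.
voice_changes = [12.0]
key_phrases = {"client name": 13.0, "phone number": 20.0}
print(phrases_in_dead_zone(voice_changes, key_phrases))  # ['client name']
```

Any flagged phrase is a candidate for moving later in the copy, so that it arrives after the allocated resources have come online.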
Our result showing that pitch influences the robustness of orienting to voice changes also has implications for theory development in media psychology. As mentioned, the finding that one indexical property, namely pitch, impacts the extent to which voice changes can trigger orienting suggests that further research should investigate other such properties to see how variance in them affects automatic resource allocation. Changes in accent, vocal prosody, volume, and speaking rate are a few such properties of voice that we hope future research will explore. Also, although novel auditory onsets such as sound effects and production effects have been shown to cause cardiac orienting responses (Potter, 2000, 2006; Potter et al., 2008), it seems reasonable to suggest that the amount of novelty in these auditory structural features is variable. If that is the case, it would have implications for the development of recent measures attempting to quantify the interactions between resources allocated to encoding as a result of structural feature onset and the amount of resources required to process the information introduced (Lang, Bradley, Park, Shin, & Chung, 2006; Lang et al., 2015).
Also of theoretical and practical importance are our results suggesting that radio listeners employ a different processing strategy when first listening to a message than when they have been exposed to it for more than 40 seconds. Regardless of the pitch characteristics of the announcers involved in the voice change, early cardiac responses showed an S-shaped curve, whereas later ones were U-shaped. In their seminal work on the orienting response to audio, Chase and Graham (1967) demonstrated that S-shaped cardiac responses occur when changes involve the processing of semantic information, whereas U-shaped responses are exhibited when novelty is associated with sound rather than content. So, our results can be interpreted to indicate that early in the message listeners are actively processing the content for information so they can understand the gist of what is being communicated. Changing the voice at this time results in a cubic trend in the cardiac response. Later, when listeners have shifted from controlled attentional processing to more environmental monitoring, voice changes are processed like a change in the auditory atmosphere and the quadratic curve results. Producers should think carefully about these results, which suggest that processing strategy differs between the early and late portions of a message. Although this result will need to be replicated in future studies, the suggestion is that building an argument or creating a narrative with a “pay-off” may not always be a good strategy, considering that listeners seem to shift from a controlled processing mode to one where the audio message just becomes sound in the background once a general understanding of the message is obtained.
Footnotes
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
