Abstract
This article applies Bakhtin’s ideas of “dialogics” and “speech genres” and Bourdieu’s approach to the “linguistic field” in a study of the “instrumentalization” of voice in interactive voice response (IVR) telephony. The article examines two dimensions of voice as it appears in IVR services: technician-led installations and voice branding “on-hold messaging.” It is argued that technician-led installations of IVR are essentially monologic in character. In contrast, voice branding agencies’ IVR reveals a more dialogic form of voice. The article is based on data collected from telephone and face-to-face interviews with staff on both sides of IVR provision and from an examination of six IVR services of major UK companies. The article assesses the social indexing of recorded voice in IVR by drawing out how particular “speech genre” functions of IVR prompts coincide with accent, gender, and age.
Social and Technical Dimensions of Recorded Voice
In technical applications of voice in telephony interactive voice response (IVR) services, the term “voice” usually refers to speech that has been recorded in order to fulfill specific, repetitive answering functions for businesses and other organizations. IVR platforms have both software and hardware dimensions in which computer voice files (usually of the “.wav” type) are activated by software that responds either to Dual Tone Multi-Frequency (DTMF) “touch tone” keypad inputs or, in the case of voice recognition systems, to “open dialogue” prompts from the speech of the caller (Attwater, McGrail, & Sargent, 2000). Most IVR continues to be based on DTMF keypad inputs to initiate the menu choice the caller makes after the “welcoming prompt” at the start of the call. DTMF IVR voice prompts are very short: the welcoming prompt, where the various menu choices are introduced, lasts only about 15 seconds, and each of the options that deliver information takes around another 10 seconds. For example, the following is the opening command prompt of a DIY service:
Thank you for calling B&Q. You now have three options: for store opening hours, please press 1. For directions, press 2. To speak to a member of staff, press 3. If you do not have a touch-tone telephone, please hold to speak to a member of staff. (B&Q London Call Line accessed 12.6.2012)
Selecting Option 1 brings the following message:
Our normal opening hours are as follows: Monday to Friday 7 a.m. to 9 p.m. Saturday 7 a.m. to 8 p.m. Sunday 10 to 4 p.m. We are open every bank holiday except Easter Sunday and Christmas Day. (B&Q London Call Line accessed 12.6.2012)
This message then terminates with the return option: “To hear these options again, please press star.” Following all the prompts for any one service menu therefore produced around 3 to 4 minutes of recorded voice data.
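The menu logic behind a DTMF service of this kind can be pictured as a small tree of recorded prompts keyed by keypad digits. The following is a minimal sketch under stated assumptions, not a description of how any company's platform is actually built: the `Menu` class, the `navigate` function, the invalid-key message, and the bracketed placeholder prompts are all hypothetical, and only the gateway and opening-hours wording (abridged here) comes from the B&Q example.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Menu:
    """One recorded prompt plus the keypad options that lead on from it."""
    prompt: str
    options: Dict[str, "Menu"] = field(default_factory=dict)


# Wording for WELCOME and HOURS is abridged from the B&Q example above.
# INVALID and the bracketed prompts are hypothetical placeholders.
WELCOME = ("Thank you for calling B&Q. You now have three options: "
           "for store opening hours, please press 1. For directions, press 2. "
           "To speak to a member of staff, press 3.")
HOURS = ("Our normal opening hours are as follows: Monday to Friday "
         "7 a.m. to 9 p.m. Saturday 7 a.m. to 8 p.m. Sunday 10 to 4 p.m. "
         "To hear these options again, please press star.")
INVALID = "[placeholder: invalid key message]"

ROOT = Menu(WELCOME, {
    "1": Menu(HOURS),
    "2": Menu("[placeholder: directions message]"),
    "3": Menu("[placeholder: transfer to a member of staff]"),
})


def navigate(root: Menu, keys: str) -> List[str]:
    """Return, in order, the prompts a caller hears for a sequence of key presses."""
    heard = [root.prompt]          # the welcoming prompt always plays first
    current = root
    for key in keys:
        if key == "*":             # star returns the caller to the gateway menu
            current = root
        elif key in current.options:
            current = current.options[key]
        else:                      # key not offered at this point in the menu
            heard.append(INVALID)
            continue
        heard.append(current.prompt)
    return heard
```

Calling `navigate(ROOT, "1*")` returns the welcoming prompt, the opening-hours message, and the welcoming prompt again, which mirrors the loop a caller follows when pressing star. Working through every branch in this way is what yields the full set of recorded prompts for a service.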
Recorded telephonic voices of this type are largely overlooked in the study of “secondary mediations” of speech in communication studies of language and the media. Once recorded, the human voice is generally seen as divested of its sociocultural codings (i.e., in traditional sociolinguistic terms of accent, dialect, etc.) when used in telephony and other industrial functions. Thus, for Weidman (2010, pp. 318-319) the forms of recorded voice we hear in media and computing put ideas of the authenticity of voice into question. For Jeremijenko (2005) recorded voices are subject to “social-technical systems” in which, as the result of technical processes of sound editing, they lose much of their human quality (such as hesitancy, pitch variation, modulation in tone, and “breathiness”). Jeremijenko argues that one of the key affordances of voice chip technologies is that they do not make any claims on us in terms of social identity:
Our exposure to voices (and other communicative sounds) that emanate from inanimate objects has become a significant part of our daily interactions: from radios to the more recent talking elevators, answering machine messages and pre-recorded music, television, automated phone menus, automatic teller machines, alarms and alerts, each of which . . . speaks in a language or dialect that makes little distinction between music, sound effects and articulated words, and privileges the situational function of language over the semantic and interactive. (Jeremijenko, 2005)
Recorded voices are here seen as radically detached from their social origins as they become articulated in the “techno-social landscape” (Jeremijenko, 2005). Indeed, one of the key rules in technician-designed IVR is the need to “constrain” the impetus of callers to respond to “natural language” patterns or colloquial rhythms (Stentiford & Popay, 1999, p. 28) so that the system avoids stimulating expectations of normal conversation.
But even if IVR platforms technically manipulate and instrumentalize recorded voices to serve as “avatars” for company or organizational identities, it is still possible to examine how the process of technicizing of voice is itself subject to sociocultural determinations. Jeremijenko herself states that we cannot make a complete dichotomy between machine speech and social speech—that machine and human are coproducers of these systems (Jeremijenko, 2005). Similarly, Suchman (2004) argues that we need to think about the social indexing of “talking things”:
Voice chips, like other signs, rely very immediately upon and invoke their embedded locations for their intelligibility. Rather than things made in the image of the monadic, rational actor, we need reimagining of humans as contingently divisible participants in socio-material collectives who live out their particular histories in uniquely inflected ways. (pp. 264-265)
Similarly, Weidman notes how the soundscape of the media “creates new sonic encounters” (Weidman, 2010, p. 295) through which we might imagine social differences. Other writers note that the “residue of the social is part of the ‘uncanny’ feeling we experience from encountering recorded voices” (Neumark, Gibson, & Van Leeuwen, 2010). Writers concerned with the history of voice in the media such as Taylor or Hastings and Manning (Hastings & Manning, 2004; Taylor, 2009) also argue that we need to see the media and social forces as “coproducers” of voice. Hastings and Manning draw on Goffman’s ideas of “authors” and “animators” (2004, p. 309; Taylor, 2009) to underline the social roles present in producing speech in print and other media:
[Goffman] decomposes the speaker role into a production format of animator (the actual performer of the utterance), author (the one responsible for the text that it performed), principal (the one whose position is expressed), and figure. This final component—a kind of “embodied voice” which condenses a whole series of semiotic and social characterological features—is at once the least used and perhaps most useful for the analysis of the kind of realignments of voice, speaker and speech that we are interested in. (Hastings & Manning, 2004, p. 302)
The writers note that “natural figure” voices attributed or cited in, for example, parody become animated by other agents and “shot through with alterity” (Hastings & Manning, 2004, p. 300). Indeed, most of the voices we hear are not in embodied form but are often staged or printed (e.g., in novels) and need to be evaluated in terms of mediated contexts.
Bakhtin, Dialogicism, and Voice
Seeing recorded voices as constructed, mediated but nevertheless inherently subject to sociolinguistic processes is taken up in this article by adopting Bakhtin’s ideas about “speech genres” (Bakhtin, 1986) and “dialogicism” (Bakhtin, 1984). Bakhtin notes how much of our spoken interaction is based on “pre-scription,” that even our mental concepts are linguistic in nature and our minds replete with others’ words (Bakhtin, 1986, p. 88). For Bakhtin all words are subject to “accentuation” (Morson & Emerson, 1990, p. 139), giving rise to a pervasive mimicry and echoing of others’ words, or inflexions of them in our speech (Bakhtin, 1984, p. 88). Bakhtin also sees relationality as fundamental to voice, that the utterance is “finalized” when it evokes a reply. However, intonation has the function of pointing the utterance beyond its immediate context (Bakhtin, 1986), endowing the utterance with social “addressivity” (Bakhtin, 1986).
Figure 1 from Morson and Emerson summarizes the broad outlines of Bakhtin’s dialogics and also has echoes of Goffman’s approach noted above. Direct discourse is often of the type of professional technical language articulated with a monological picture of audience in mind. Double-voiced words relate to direct narrative voicing, for example, putting words into a character’s mouth in a story. In “passive” voicing the “author/speaker” imitates someone else’s voice by “semantically inflecting” it (Morson & Emerson, 1990, p. 150) for a different purpose. Double voicing also occurs in parody, which is seen as an “arena of battle between two voices” (p. 193). But an original (“animating”) voice may sometimes spring back in the third type of active dialogics as it starts to undermine the parodic one and put it into a different light.

Figure 1. Morson and Emerson (1990) on Bakhtin’s discourse types.
Bourdieu, Linguistic Field, and Voice
As in Bakhtin, Bourdieu’s (1991) approach to language also focuses on voice in the sense of parole (Hasan, 1999), a critical view of voice identified in terms of class and accent, prosody, and other embodied (“hexis”) aspects of speech. But Bourdieu’s idea of the “linguistic field” also helps us to take stock of the invariably hierarchical relationships between the types of voices that may become stereotypically associated with primary or secondary speech genres. Bourdieu’s idea of symbolic violence in language, that the predominant or prestige dialect has “enhanced voice and control over discourse” (Bartlett, 2004, p. 141), acts as a key polarity in the system of linguistic differences. Bakhtin’s dialogics needs Bourdieu’s more critical approach to infuse it with this awareness of the differences in power or status between speakers resulting from the linguistic market:
In the case of symbolic production, the constraint exercised by the market via the anticipation of possible profit naturally takes the form of an anticipated censorship, of a self-censorship which determines not only the manner of saying, that is, the choice of language—“code switching” in situations of bilingualism—or the “level” of language, but also what it will be possible or not possible to say. (Bourdieu, 1991, p. 77)
Bourdieu enables us to recognize that the associating by IVR services’ staff of particular accents with particular speech genres is likely to be shaped by the codes of speech established in the linguistic field.
Monologic Voice: Social Indexing of Voice in Technician-Designed IVR
This section outlines the nature of the instrumentalizing of voice in technician-designed IVR services. Voice services technicians are telephonic engineers who deploy intuitive understandings of what they consider to be “appropriate” voices for the tasks or typical speech genres of IVR (such as “greeting,” “advising,” “directing,” “call finishing”). The IVR services telecoms industry is extensive and has been driven by companies wishing to avoid the high contract fees demanded by call-answering centers or to avoid employing large numbers of telephony staff. However, setting up these services is quite expensive because of the sound equipment, specialist microphones, and recording suites required, and this means that most companies need to use voice services contractors to set up their IVRs. Although various dialogue standards exist on the technical side of the industry (e.g., “Dialogue 2000,” “Quickfuse”; Voxeo, 2012) and voice application editor software packages (e.g., “Microsoft Speech App SDK”) can be used to set general parameters for an IVR service, the work remains technically demanding, and voice technicians are usually the “authors” of the voices selected for these systems. Although the following quote from one independent voice contracting company seems to put the onus for selection on the client, it is clear that design and operation solutions rest largely with the contractor:
Telecoms companies generally wait for the client to define the issue that needs resolving. Maybe at sales presentations they will go into the capacity of their particular system. But generally they will wait for the client to say, “We want three menu systems, one for in the office, one for when we are away from the office etc.” They will then come up with a solution to it—often, and then the client will not know about what else they could get. (Voice technician)
But in providing “solutions” technician-led voice services have to make a number of generally intuitive or experience-based choices about the selecting of voice, including the scripting of dialogue prompts/wording, and the type and range of “navigational aids” (Stentiford & Popay, 1999) required in the system.
Relational Weighting of Recorded Voices
In order to examine the choices of voices to be recorded and then associated with particular speech genres in IVR, six company IVR services were phoned up and recorded. Three of these were large national companies associated with home improvement and DIY, and three were large national bank/building societies. The “voice architecture” of each particular IVR platform was navigated as far as possible before one-to-one conversation with a human operator took over. This generally meant that there was a need to call the company quite a number of times depending on the complexity of the IVR menu structure so that variations in combinations of voices according to the range of caller queries could be explored.
When Bakhtin was writing in the 1950s he complained about the rudimentary nature of the available “taxonomies” of speech styles (Bakhtin, 1986, p. 64). However, there are now many elaborate taxonomies in “affective computing” (Nass & Brave, 2005; Scherer, Banziger, & Roesch, 2010), sociolinguistics (Crystal, 1975), prosodics (Bolinger, 1989), phenomenological (Ihde, 2007), and media studies (Haiman, 1985; Karpf, 2006).
The coding of voice in Table 1 below was adapted from the work of Abercrombie (1967), Aronovitch (1976), and Eriksson (2007), who characterize voice on the basis of “voice cues” and characteristic “personality stereotypes” (Aronovitch, 1976, p. 208) or “conversational indexicals” in voice (Eriksson, 2007). Aronovitch gives a key role to pitch and loudness in categorizing voice (Aronovitch, 1976, p. 218) and refers to voice personalities such as “extrovert-introvert,” “kind-cruel,” and “bold-cautious.” Any perceived differences in degrees of pitch, tempo, and pause in IVR messages are derived from key prosodic indicators of voice in natural speech, those having a paralinguistic function of establishing “semantic stability between speakers’ meaning and intention” (Kennedy, 1991, pp. 8-9). Thus, although IVR is not naturally occurring speech, its “prosodic units” or baseline variations in pitch, tempo, and pause still serve to engage the caller (Eriksson, 2007), and these recorded voices still suggest “personality stereotypes” (Aronovitch, 1976).
Table 1. Summary of the Relational Weightings of Voice Indicators in IVR Answering Services.
Note: LE – London English.
Therefore, the values of “high,” “medium,” and “low” in Table 1 make no more claim than to be impressionistic, based on relational weightings of the various indices of voice. This means the focus was on the polarities perceived within the terms of the recorded IVR prompts alone, rather than exterior values based, for example, on voice spectrography or sociolinguistic phonetic markup. Recent ideas that voice belongs more to phonology than phonetics (Cavarero, 2005), that the significatory materiality of voice is “supra-segmental” and not accounted for in traditional linguistics but that it nevertheless has “performativity” (Dolar, 2006) support this idea of voice. Similarly, Bourdieu advises that we should avoid looking for distinctions in language that have little significance in light of the more sociologically significant fault lines occurring in the linguistic field (Bourdieu, 1991, p. 89). Following this approach avoids seeing intonation, for example, in naturalistic terms as an established variable rather than a constructed or relational one (in contrast, Crystal, for example, refuses to discuss anything about intonation that is not observable; Crystal, 1975, p. 30).
Main Contrasts in Voices Associated With Speech Genres in DIY and Banking IVR
DIY Services
The IVR voice architecture of Homebase, one of the three national DIY companies whose IVRs figure in this research, has three menu options. The gateway menu is “welcoming,” and once an option is taken the same voice then states blandly the numerical options for one or other of the various departments, a mildly “directing” speech genre. Once the required department or section is chosen a second voice states the routine reminder that the call may be recorded “for security and training purposes.” The third option then occurs where another voice advises the caller they are now in a queue, this message being repeated every 20 seconds and interspersed with light soothing music. All three voices are female, the first and third sounding mature while the second sounds more youthful. The gateway voice is generally low in pitch but nevertheless has an imperative tone, associated with a commanding and directing speech genre. This voice is also not regionally accented and is moderate in tempo, giving the caller an impression of a personality that is alert, efficient, and capable.
In contrast, the second female voice sounds more youthful, definitely hesitant at points, and also regionally accented. Due to its place as the “subset” to the gateway in the system this voice is structurally subordinate—an underling voice for the genre of general routine advice. The third voice articulates an advisory speech genre telling the caller that they are in a queue. This is much quieter than the first voice, more therapeutic—the type of voice commonly heard on relaxation CDs. This voice, also, was used for another, interjectory, on-hold message that suggests that the caller might “perhaps” wish to call back at another time.
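The relational coding applied to these voices can be sketched as a simple data structure. This is an illustrative sketch only and does not reproduce Table 1: the `Level` and `VoiceCoding` names, the field layout, and the `contrast` helper are hypothetical, “moderate” tempo is read here as “medium,” and any indicator that the description of Homebase above does not state is left uncoded (`None`) rather than guessed.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Dict, Optional, Tuple


class Level(IntEnum):
    """Impressionistic, relational weighting; not an acoustic measurement."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass(frozen=True)
class VoiceCoding:
    service: str
    position: str                      # place of the voice in the menu structure
    genre: str                         # speech genre the voice carries
    gender: str
    age: str
    regional_accent: Optional[bool] = None
    pitch: Optional[Level] = None
    tempo: Optional[Level] = None


# Values are taken only from the prose description of Homebase's three voices.
HOMEBASE = [
    VoiceCoding("Homebase", "gateway", "welcoming/directing", "female", "mature",
                regional_accent=False, pitch=Level.LOW, tempo=Level.MEDIUM),
    VoiceCoding("Homebase", "second", "routine advice", "female", "youthful",
                regional_accent=True),
    VoiceCoding("Homebase", "third", "advising (queue)", "female", "mature"),
]

INDICATORS = ("gender", "age", "regional_accent", "pitch", "tempo")


def contrast(a: VoiceCoding, b: VoiceCoding) -> Dict[str, Tuple]:
    """Return the indicators on which two voices differ, skipping uncoded ones."""
    differences = {}
    for name in INDICATORS:
        left, right = getattr(a, name), getattr(b, name)
        if left is None or right is None:
            continue                   # no comparison without a coding on both sides
        if left != right:
            differences[name] = (left, right)
    return differences
```

Here `contrast(HOMEBASE[0], HOMEBASE[1])` picks out age and regional accent as the indicators separating the gateway voice from the subordinate one, which is the contrast drawn in the prose. Because the weightings are relational, the ordering of `Level` carries meaning only within a single set of recorded prompts.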
Wilkinson’s DIY has just two menus. The first gateway menu command prompt is in a strident voice of the type often associated with the continuous loop TV video promotions of home improvement products. This voice is declamatory, importunate, and very dynamic. Its main role is the routine advisory warning about “recording of calls for training purposes” but moves seamlessly into outlining the submenus. Once an option is chosen a second, younger-sounding female voice cuts in to present further choices. As with Homebase, this voice is much more hesitant, lower in pitch but speeds up after finishing the six-option menu when it advises there is a queue and the caller might wish to go to the company’s website, the address of which is delivered at rapid-fire speed. This subordinate voice in Wilkinson’s IVR was not markedly accented/regional but distinctly youthful.
The third DIY voice answering service examined was B & Q’s. This also has just two levels and one voice, although most of the prompts are at the gateway command point menu. This gives six menu choices and is delivered by the voice of a young-sounding Scottish woman. The delivery of the options in this recorded voice is moderately paced, characterized by a micro pause where a barely audible inhalation (allowed/overlooked or otherwise not edited out) occurs as she moves from one item to another. The voice pitch is soft and regular in tempo but, as Scottish, is a markedly regional accent. This voice continues at the second level because there are actually no more options, and the change of function is simply the routine warning before the caller is put through.
Banking Services
In key terms of social and prosodic indices of voice there were some significant contrasts between DIY and bank telephone answering services. Three banks were called: Santander, HSBC, and Nationwide Building Society, the existing bank account holders’ line being navigated generally to the point where, after failing to enter an account number, the service breaks off. In contrast to voices in DIY companies’ IVR there is more or less complete uniformity of voice across all three banks. The key indexicals of all three voices were that they sounded mature and unaccented and were all female. In terms of prosodic indicators, all were also very similar in levels of stress, pitch, and loudness. Another contrast with DIY IVRs is that a stereotypically clear and “clipped” “Modern London” form of the “prestige dialect” Received Pronunciation was evident. Thus, the menu option prompts were spoken with sharp resonance, underlined by a moderate continuous pace with a pitch that was generally high. These personality-type aspects of banking IVR voice give a sense of an “efficient,” “trustworthy” person, coinciding with a mainly command speech genre.
It follows that no important changes in voice were encountered during the transitions from one menu to another in any of these services. There were, however, slight but significant variations of voice between the three banks’ systems. First, there were far fewer menus at the HSBC service. Santander’s voice also had more significant pauses and audible breaths between articulating sub-menu numbers. And there were considerably more genres of “help and advice” characterizing Santander’s service than in the other two banks. However, at one point in the Santander service a slight hint of irritation was allowed to come through, heard in a brief exclamatory prompt “You have not entered anything!” when the “mistake” was made of entering the nonoption “#” on the touch-tone keypad.
The dominance of a prestige dialect–accented voice in banking IVR suggests a field effect of higher social status associated with banking services. Banking IVR voice is more consistently in a command speech genre even though, across DIY and banks’ IVR, all voices were female, confirming a female “gender standard” in technician IVR:
Gender selection is also important, and, more often than not, a good, solid female voice works best, Bryan suggests. “In voice prompts, females far out-number males, probably because they have more of a calming tone,” Smith says. “Even in some companies that are heavily male-dominated, a female voice is usually requested.” (SpeechTech, 2012)
But while there were youthful-sounding female voices in DIY services accompanying subordinate speech genres, mature-sounding voices dominated banking IVR. This contrast suggests that the intuitive classifications of voice made by voice engineers, company PR managers, or whoever else was involved in the process, are still influenced by social codings of voice.
Dialogic Voice: Voice Branding Agencies
Voice branding agencies provide copy for “on-hold marketing” in which voice is deployed in much more consciously “creative” (Kotelly, 2003) ways. Voice branding is one of the key “intermediary” (Gress & Bagchi-Sen, 2007, p. 565) institutions in broader processes in the instrumentalizing of the human voice. Gress and Bagchi-Sen’s study of these agencies in South Korea notes that they act as “quasi-facilitators,” coordinating vast networks of “voice actors” involved in “local sound production industries” (p. 567). In terms of the significance of this industry, one of the companies interviewed in the research noted that they produce around 700 voice samples a month in response to inquiries from potential clients. The interviews with people in this industry show how the generally “monologic” voices selected for the speech genres in technician-led IVR are often objects of ridicule. Voice branding stresses the need to “maintain caller interest” and to use the waiting time to inform customers of more than just basic information. But just as in technician-developed IVR, voice branding agencies also tend to rely on social stereotypes of voice.
Agents interviewed in these services were generally critical and dismissive of technician-led IVR services:
(In) audio branding we want to give clients’ callers an incredible sense of the company in contrast to when the caller comes from the website to make the call and it’s Fred answering. (Audio branding script writer)
The derogatory reference to choosing in-house staff (“Fred”) is common in branding IVR, usually dismissed as the voice of “Sarah from Accounts.” In contrast, audio branding extols the virtue of only using “real voices,” heavily engineered voice being seen as robotic and unlikely to fit the brand image of most companies:
Does the woman who does the London Underground really exist? We deal only with real speech. (Voice branding agency manager)
This does not, however, mean using nonprofessional speech. Branding agencies offer a wide range of voices as part of their premium services:
To be honest one, maybe two, voice over artists could supply all of what we wanted if that’s what we wanted—but every client is different so we don’t use the same voice-over artists a hundred times. We try to vary it. So we might do male and female prompts alternating and we really try to follow this. So the voice over artists record it at home and send us the file and we can turn things over within hours, we do it regularly and we’ve so many voice over artists we never have to wait for one to be free. (Voice branding agency manager)
As well as being dismissive of in-house voices audio branders deride technicians’ resort to scripting platitudes such as “Your call is important to us.”
So voice branding agencies reflect more on the type, nature and character of a voice that might be suitable for any particular speech genre. They are thus more interested in the wider role of prosody and sound:
We also deal with literary voice. We think a lot about how we sound, whether being read or listened to. When we look at our clients’ websites we imagine how they would sound. (Sound studio production manager)
There is an appreciation of the production values of subtleties in tempo and pitch—that a conversational style should be at normal conversational level:
In a practical sense we don’t want anything too slow because the ideas in each prompt is seconds long and our clients don’t keep their callers on hold for hours on end so that they can listen to a very slow prompt. There has to be a balance between [being] audible and you’ve got to be able to understand it, and keeping it at a normal conversational level. I think that is the main guidelines—keep it at a conversational speed. We occasionally send things back because we think they are too slow, but sometimes it’s the voice over artists’ interpretation of the brief [that is wrong]. (Audio branding studio technician)
Most voice branding services’ websites carry sound files giving an extensive range of “voice talents,” often classified in prosodic and personality-type terms (e.g., “mellow Scots,” “sporty female”). Needless to say, these natural-sounding voices are rarely untutored or completely unengineered.
But voice branding IVR is promoted as dialogic and creatively scripted in this way. Rather than following technicians’ standardized formats for menus, voice branders “author” voices dialogically to create entertaining messages aimed at maintaining callers’ interest:
Our data is heavily cleansed. We really target and research companies before we call. Six copy writers will research the client and will write the samples or full scripts. Then a short brief to the voice artists with notes on intonation and emphasis perhaps on certain parts. And copy have real ideas of how it will sound—it’s written not to be read but spoken. (Production manager, North West England Sound Studio)
In Bakhtin’s terms, the key function of voice branding services is to overcome the fundamentally monoglossic nature of the experience of being “on-hold” and to turn the experience into a more dialogic one that might even be pleasurable for the caller (as well as being a market opportunity for the company):
You can still talk to people even though they are on hold. (Voice branding company director)
However, on-hold services face problems in creating the impression of dialogue for the caller. They must avoid the natural tendency of the caller to complete or “finalize” (in Bakhtin’s terms) the on-hold utterance. Thus, on-hold messages rarely have significant pauses or falling off of stress, which might create an expectation in the caller that they can reply. The pace of on-hold messages is generally conversational, but occasionally, micro pauses are used to create the impression of dialogue, as this interviewee notes:
We sometimes give short, very short, silences in which the caller can mentally comment—so the caller is trying to answer the questions. (Copywriter)
By mimicking conversational style, voice branding agencies thus avoid finalization of an utterance. For example, on one on-hold company website the voice of the well-known British and Hollywood actor Brian Blessed is employed to illustrate how an essentially monologic service can be “double voiced” in on-hold:
Is that it? [Incredulous] Really? Leaving me here on my own to listen to this . . . er . . . er nothing. Nothing? All this – silence. It reminds me of a girl I once knew—bored me to tears, [patronizing:] the poor dear. [sound effect—double bleep of a call waiting tone] Ah, there you
At the start, Blessed’s voice suggests he is in a mellow, reflective mood, and he muses on the problem of being kept on hold; this is humorous because the actor is famous for his stentorian voice. But this example shows the type of dialogics characterizing voice branding IVR services. The intonation and other prosodic cues alongside the “conversational” style of the voice are consciously dialogicized by voice branders to create the impression of a number of voices.
Dialogics and Parodic Voice: Estuary English
However, the dialogics of voice branding is also highly dependent on social coding, particularly seen in relation to accentuation:
If you go into the client area of our website. We have different examples of voices and I think we have a way we categorize voices. It’s Northern Male, Southern Male, Older, Younger. And we also categorize voices by emotion, not the right word! But Upbeat Male, Sombre Male. These are subjective but in general we try to categorize them as younger voice, older voice, north south, a loose categorization of them. (Audio branding company manager)
Voice stereotypes thus circulate and are fundamental to the service. Highly sales-directed companies usually choose a functional, rapid-fire voice that simply “pushes the product.” In this type of voice, accent and personality are less important than the ability to speak quickly (or a delivery amenable to “acceleration” at the editing desk). Sometimes identifiable regional accents are avoided because it is assumed that large national and international brands require a “universal voice,” that is, a “prestige dialect” such as Received Pronunciation. In contrast, other, usually regional, businesses are seen to have “stronger brand guidelines” suitable for regional accents:
I guess you could divide it into the top box, which is the national companies, Samsung for example. And they want a standardized voice—they don’t want a dialect or regional accent—they want to focus on the entire market. A company which has a strong regional identity, like Gymbox, and want to stand out from their competitors and want quirky style/message. And then there are the small companies without a strong brand who will go for the safe option, family-run businesses will go with a soft option and similarly for Jaguar and Hyundai they go for the safe option. (Audio branding company manager)
Gymbox’s IVR is interesting in revealing how accentuation is important in branding agency dialogics. As a London area health and fitness club aimed at professional 25- to 35-year-olds, the company decided to use an “Estuary English” accent for its IVR. Estuary English is sometimes called “Mockney,” a term that usefully connotes the parodic nature of the dialogics involved here (in relation to Figure 1 above), because the accent is usually affectedly “code-switched” or adopted by the new middle classes in particular situations. Tony Blair, for example, was consummate in sprinkling his speeches with this “demotic” touch (Fairclough, 2000). In essence, Estuary English melds London “cockney”/working-class vowel sounds with London English. As Gymbox is noted for its “quirky” and “risqué” forms of exercise, Estuary English was seen as an appropriate voice to attract its clientele:
Gymbox has a funky upbeat brand in London. They don’t have a lot of speech on their website. But when we produced the sample we didn’t fit the voice, it was an older man’s voice which didn’t fit the brand. So they liked the idea or concept of sort of colloquial voice which fitted with what they wanted to say and the people who were joining. It was due to the regionality of the voice and the way the words were structured in the sentences [spoken]—a lot more slang was used that was fitting to the east London style. (Voice branding director)
But the use of Estuary English as articulated or authored in this way has been criticized recently in Owen Jones’ study of social issues around the term “chav” as designating the contemporary British white working class (Jones, 2011). Gymbox adopted this voice to suggest “grit”: that its exercises would enable clients to defend themselves against violent or aggressive behavior by (lower working class) “chavs” or “Hoodies,” as the Gymbox copy quoted by Jones suggests:
Forget stealing candy from a baby. We will teach you how to take a Bacardi off a Hoodie and turn a grunt into a whine. Welcome to chav fighting, a place where punch bags gather dust and the world is put to rights. (Jones, 2011, p. 3)
Jones uses this example of IVR branding to indicate how working-class people in general are viewed as a physical threat by the middle classes. But the speech genre articulated in this copy (of savviness and sophistication, with a literal “punch line”) is matched with its “passive”/parodic double voicing: Estuary English as a middle-class articulation of elements of a working-class accent. In Bakhtin’s terms, the Gymbox voice is made passive, a parodic rendition of working-class speech.
Conclusion: Social Indexing of Voice in IVR
The above analysis has sought to show how both technician and voice branding IVR seek to overcome the conversational vagaries that arise from voice as a form of naturally occurring speech: effectively, callers’ utterances need to be subordinated to the IVR’s “speech plan.” A caller’s predisposition to “expressive intonation,” based on instinctive principles of naturally occurring language, is suppressed by the underlying monologic principles of both technician and branding agency IVR voice prompts. In this way, both types of IVR services seek to avoid “typical compositional and generic forms of finalization” (Bakhtin, 1986, pp. 76-77) that mark everyday dialogue. IVR menus thus prepare us as listeners, and position us, to mentally accept the subordination of our voices to those of IVR. Nevertheless, technician-directed IVR revealed primary speech generic structures such as advising, commanding, and placating therapy but also social referencing or “addressivity” of voice in accentual and gender/age codes. However, branding agency IVR was found to be more consciously dialogical and aesthetically stylized, distancing itself from clichéd copy by drawing on more literary and secondary speech genres.
Although Bakhtin’s idea of speech genres separates primary and secondary forms, his dialogics also suggests that poetic and everyday forms of voice are interrelated (Bakhtin, 1986, p. 65). Accordingly, this article’s adoption of his approach to understand the instrumentalization of recorded voice has shown how “authors” such as voice technicians and branders act to different degrees to dialogicize IVR. Bakhtin sees secondary literary speech genres as being more stylized whereas plainer styles mark “business documents, military commands, verbal signals in industry” (p. 63), a contrast also found in different approaches to the instrumentalization of voice in branding and technician IVR.
As Ahearn notes, Bakhtin’s dialogics tends to see all texts as “multivocal and egalitarian rather than univocal and authoritarian” and to hold that “structure emerges through situated action which gives a sense of democratic play in language” (Ahearn, 2001, p. 128). Bakhtin allows us to recognize how different voices may articulate different speech genres in a dialogical and relational way, such that “in essence, meaning belongs to a word in the position between speakers” (Maybin, 2003, p. 69). But it is by adding Bourdieu’s approach to Bakhtin’s that a more critical assessment of the relation of particular speech genres with socially coded voices can be made. The bringing together of Bourdieu’s and Bakhtin’s ideas about language has largely been overlooked (but see Hanks, 1987), although both share a primary focus on language at the level of parole. And while there is no clear formulation of the idea of field in Bakhtin, he did see speech genres as having an existence “beyond context” (1986, pp. 85-86); it was just in particular articulations that they became “accentuated.” Uniting these two approaches, then, has some epistemological validity and, in practice, has allowed this article to critically engage with the process of instrumentalizing voice in the contemporary voice industry.
Acknowledgements
Thanks to Erin O’Gara the editor and my anonymous reviewers for their very helpful comments on drafts of this paper—the responsibility for the final content being my own; the School of Humanities Research Committee at UEL for awarding me a sabbatical to undertake this research; my interviewees and ‘Sarah from accounts’.
Declaration of Conflicting Interests
The author declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author received no financial support for the research, authorship, and/or publication of this article.
