Abstract
The question of “staccato” rhythm in Stockholm’s multiethnolect is investigated by comparing nPVIV measurements of the speech of 36 adult male speakers. The men, ages 24–43, come from a stratified sample of social classes and racial groups. Three contextual styles were recorded and analyzed: informal, formal, and very formal. The distribution of nPVIV values in informal speech across class and racial group indicates that speech rhythm splits three ways: low-alternation “staccato” rhythm among the racialized lower-class men, high-alternation rhythm among the white lower-class men, and an intermediate level of rhythm among higher-class men, regardless of racialized category. The “staccato” low-alternation feature is also less stylistically sensitive than the high-alternation feature, implying that the latter is a more established feature than the former. Further, the “staccato” feature is more stylistically sensitive among younger speakers than older speakers, implying an ongoing change from indicator to marker status. For all speakers, age has a stable main effect, which means that younger speakers, independent of racial group and class, have lower alternation than older speakers. Implied here is that low-alternation is a change from below that originates within the racialized working class. While it may be incrementally transmitting into the wider speech community, the white working class is the most resistant to its incursion.
1 Introduction
This study sets out to investigate the “staccato” variant in the prosody of male speakers of Stockholm’s multiethnolect, and its results extend beyond this single variant. While supporting the existence of said staccato feature, the findings also show that the social stratification of rhythm is horizontal rather than top-to-bottom. That is, white working-class men have the highest rhythmic alternation in their prosody, and racialized working-class men have the lowest. Speech rhythm among the upper and middle classes falls between the two.
These findings join two separate discursive threads that have been running simultaneously for some time. One thread is grounded in local descriptive work, and the other in global linguistic theory. Locally, and ever since Kotsinas’ (1988a) first survey of Rinkeby Swedish, scholars and laymen alike have been commenting on the staccato impression of the new youth vernacular in Stockholm (as well as in Malmö and Gothenburg). In addition to these accumulating impressionistic accounts, two small investigations of rhythm have been conducted (Bodén, 2007; Young, 2018b), but neither has been able to fully substantiate the staccato claim.
As it pertains to linguistic theory, since the early 1980s, a global literature on speech rhythm has been growing and maturing. This literature has taken aim at rhythmic typologies for the world’s major languages as well as the contact varieties associated with many of those languages. What has not been seen, however, is an investigation of rhythm as a sociolinguistic variable in the Labovian sense of the word. Specifically, the rhythmic variants in contact scenarios have not been examined as changes from below whose approximate salience and age can be assessed by means of stylistic sensitivity. This theoretical paradigm is a main point of departure for the present investigation.
The analysis will show that the rhythmic alternation of the white working class is more stylistically sensitive than that of the racialized working class, which aligns with the interpretation that the former is a legacy feature of Stockholm’s working-class Södersnack and the latter a feature of the newer multiethnolect. For all social groups investigated, low-alternation rhythm appears to be a feature of younger men, and high-alternation rhythm a feature of older men, which implies a systematic change in apparent time toward lower-alternation prosody all around. This main effect for age, however, is complicated by the aforementioned horizontal stratification of the race and class nexus.
The research questions addressed by this study are (a) whether the speech of Stockholm’s racialized working class has lower rhythmic alternation (staccato) than the speech of other male groups in the city, (b) to what degree the speech-rhythm variants are stylistically sensitive, and (c) what conclusions can be made about the trajectory of rhythm in the city as a whole, including in other varieties besides the urban vernacular.
2 Background
It became apparent during the ethnographic portion of this study that three distinct social groups operate in Stockholm today: (a) a new racialized working class, (b) the established white “Swedish” working class, and (c) a diverse middle and upper-middle class whose cosmopolitan aspirations somewhat serve to neutralize racial and ethnic boundaries. This study investigates speech rhythm among male speakers within each of these three groups. The first group generally speaks varying “intensities” of Stockholm’s multiethnolect; the second group generally uses linguistic features from Stockholm’s industrial-era Södersnack; the third group speaks standard Central Swedish with varying degrees of upper-class feature bricolage. This background review will focus on the two working-class groups, their history, and their speech forms, since they are the predominant locus of the stratification of rhythm in Stockholm.
2.1 Multiethnolects, racialization, and late modernity
As I propose above, Stockholm has a working class that has divided itself along racialized lines. A key social category used in this analysis is whether participants self-identify as svensk or invandrare, two terms that literally translate as “Swedish” and “immigrant,” respectively. Although the terms ostensibly refer to properties of national and migrant origin, they are in fact used in Sweden (in the 2010s) as terms of race and racialization and translate into “white” and “non-white,” respectively. When one wishes to colloquially refer to actual immigrants or speakers of late-acquired L2 Swedish, the term import is used (no speaker in this study would qualify as an import). Throughout this paper, svensk and invandrare are used emically to refer to their colloquial racial connotations instead of actual national or migration background.
In order to understand how these two identities figure into the analysis of rhythm, it is necessary to provide a review that builds a sociological argument for their importance. Unlike, for example, social class, the concept of “race” in the European context is often questioned. Furthermore, the term “immigrant” can lead readers to forget that these speakers are early acquirers of Swedish, with all of them reporting Swedish to be their strongest language and most of them having an age of onset between 1 and 2 years (see Section 3.1).
The linguistic development in Stockholm, as in many cities that have witnessed significant post-War migration, is a uniquely late-modern phenomenon. Late modernity, itself a term often left underdefined, is characterized by Wacquant (2008) as the current epoch in which both manufacturing and the welfare state have starkly weakened. The Poststructuralist school has referred to the era as “post-Fordist,” and Bauman (2000) has famously referred to it as “liquid modernity.” One key way in which late modernity is tightly connected to the present linguistic development is via the liberalization of schools (Rampton, 2006). Multiethnolects did not just arise from the banal confluence of contact scenarios; rather, the establishment of independent charter-school markets in numerous European countries like the UK and Sweden has led to the hyper-concentration of minority pupils in some schools and majority pupils in others (see Forsberg, 2018, for an excellent analysis). Growing income stratification and residential segregation have worked in tandem with decades-long school segregation to racialize the social-class hierarchy, something that is now said to be a signature feature of the European strain of late-modernity (Hesse, 2007; Lee, 2010; Lentin, 2008; Lentin & Titley, 2011).
In Stockholm, this demographic subgroup has developed its own linguistic variety after more than 40 years of school segregation, social exclusion, and relegated suburban enclosure. Critical Race theorists like Hübinette, Hörnfeldt, Farahani, and Rosales (2012) have therefore also proposed that “invandrare” is in fact a racialized euphemism for what Mulinari and Neergaard (2004) have referred to as Sweden’s racialized working class. The term Swedish multiethnolect, while used in this paper, is actually inadequate because it neither indexes the variety’s raced nor classed attributes (Rampton, 2011). Rather than being the lect of multiple ethnicities, the lect is the variety of an underclass for whom ethnic differences have been erased amid the engrossing salience of their phenotypic markedness within the majority population.
The above discussion is not just important for understanding the current racialized system that has emerged in Stockholm. It is also important for framing the research of multiethnolects in a more general sense. Scholars too often assume there to be L1 transfer on L2-Swedish while forgetting two important points. First, the speakers of these varieties are early acquirers of the majority language, well within Lenneberg’s (1967) Critical Period, and many have the majority language as their home language. Second, working-class youth—monolingual or not—have always been the main innovators and drivers of language change (Labov, 2001). Sometimes these innovations are appropriated from a more endogenous feature pool like t-glottaling in Norwich (Trudgill, 1988) or PRICE-centering in Martha’s Vineyard (Labov, 1963), and other times they are appropriated from a more exogenous feature pool like KIT-raising in Los Angeles (Mendoza-Denton, 2008, pp. 230–264). Despite these two points, the mechanical effects of bilingualism are sometimes seen as the default explanation for variation when L2-speakers enter the scene, while social factors are sidelined (Bodén, 2007; Kotsinas, 1988a).
2.2 Stockholm’s traditional working-class Södersnack
Stockholm’s multiethnolect, the variety of the city’s newer working class, sits in juxtaposition to Södersnack, the variety of the city’s established working class. The relationship between the two varieties is complex. On one hand, multiethnolect has absorbed a large number of slang terms from Södersnack (e.g., beckna “to sell drugs,” Young, 2018a). On the other hand, its surface phonetics give the impression of a dramatic departure.
Linguistic contact was also rife during Stockholm’s Industrial Revolution, and the massive relocation of rural migrants resulted in intense koinèisation within the speech of their children (Kotsinas, 1988b). Just as is the case for late modernity, the Industrial Revolution was an epoch defined by intense social change and stratification. Aside from the relatively short “Golden Era” of Swedish social democracy (1930s–1980s, Therborn, 1998), Stockholm has long been a hierarchical and stratified place. When industrialization began to partition the citizenry according to their relationship to production, Stockholm’s social classes began settling on separate islands within the city’s dense archipelago. Figure 1 contains a map of the city in 1841. The central island is the original medieval city Gamla Stan, and Södermalm to the south is where the new industrial working class was confined. The growing bourgeoisie spread to Norrmalm and Östermalm towards the North. Since then, the city’s population has grown to fill the full map, but the social classes today continue to be geographically separated. These symbolic and physical enclosures contributed to the emergence and maintenance of Södersnack then (Kotsinas, 1988b; Thesleff, 1912) and of Swedish multiethnolect today.

Figure 1. Stockholm in 1841 (Topografiska Corpsen, 1861). Licensed for open use by Stockholm City.
Stockholm’s traditional industrial working class has undergone a displacement that resembles the Cockney march into Essex (Watt et al., 2014). Starting with the Million Homes Program from 1965 to 1974 (Hall & Vidén, 2005) and accelerated by gentrification on Södermalm, this community has migrated out to the city’s periphery. According to popular knowledge and according to fan-tracking for Södermalm’s football team Hammarby (Young, 2019, p. 29), their migration has appeared to target the suburbs to the south and southeast (see map in Figure 2).

Figure 2. Participants matched to their respective neighborhood. Labels provide pseudonym, age, svensk/invandrare identification, and occupation. (Map licensed under the Creative Commons and GNU Free Documentation License, Holmér, 2014).
Very little research—sociological or sociolinguistic—has been conducted on this population. Prosody in Södersnack has never been studied, but Öqvist (2010) has proposed that it may have “steeper intonational contours” and a “nasal quality” (p. 256). Three of its long vowels are also reported to be more diphthongal than those in standard Stockholmian (Kotsinas, 1994; Öqvist, 2010), which could translate into more rhythmic contrast than in multiethnolect and the received standard. The vowel in
Södersnack vowel trends (adapted from Öqvist 2010, p. 254).
2.3 Stockholm’s multiethnolect: Europe’s “first”
Stockholm’s racialized working class has been popularly ascribed a speech variety known as Rinkeby Swedish (Kotsinas, 1988a), Europe’s earliest-known and Scandinavia’s first multiethnolect (Quist, 2012, p. 2). This term multiethnolect has been advocated by Clyne (2000) and Quist (2008) to refer to the new urban repertoires spoken by “second generation immigrants” in countries that have experienced significant postwar non-Western migration. Research on these varieties has typically investigated teenage speakers and their construction of distinct and oftentimes oppositional identities (Cheshire, 2013; Cornips & de Rooij, 2013; Doran, 2001; Keim & Knöbl, 2007; Lehtonen, 2011; Pharao et al., 2014; Quist, 2008; Rampton, 2006; Wiese, 2012).
Adult speakers are less commonly discussed in the literature, which is a serious gap. Prior to the present study, I reviewed media discourses on Swedish multiethnolect and found that they extend back as far as 1979 in reference to “semilingual” teens in Stockholm’s suburbs (Sveriges Television, 1979). The date of this first report implies that the variety may have been circulating for at least 40 years, so one would expect substantial linguistic focusing in Stockholm by now. Elsewhere in Europe, recent work has started to address the question of adult speakers and to investigate whether these styles now have become full-fledged sociolects (Keim, 2007; Rampton, 2011; Sharma & Rampton, 2015; Sharma & Sankaran, 2011; Young, 2018b). Research on adults was one motivation for Rampton (2011) to disfavor terms like “youth language” while advocating for the designation contemporary urban vernaculars. He claims that across Europe, these later-stage linguistic developments share the following three properties—three properties, I would add, that are also very much applicable to Stockholm: (a) they emerged in urban neighborhoods shaped by immigration and class stratification; (b) they are “connected-but-distinct” from migrant languages, the traditional working-class variety, and the standard variety; (c) they are widely known and represented in media and popular culture (p. 291).
The present study seeks to add to this growing literature by investigating the speech of adult male speakers in Stockholm, Sweden. The study focuses on Rampton’s (2011) criterion (b) by investigating whether the speech rhythm of Stockholm’s multiethnolect is indeed “connected-but-distinct” from both the traditional working-class variety and the standard variety. Additionally, as it pertains to the exceptional longevity of Stockholm’s multiethnolect, also discussed above, a second linguistic implication may be at play: a number of features will have already reached marker or even stereotype status (Labov, 1972, 2001). This study will investigate whether staccato rhythm might be one such feature.
2.4 Multiethnolects and staccato rhythm
Rhythm, the variable of interest to this paper, is a popular topic when the surface phonetics of multiethnolects are discussed. But only a handful of phonetic studies actually circulate on Swedish multiethnolect. Many more studies have remarked that the variety sounds “jerky” or “staccato,” even though most of these works are not phonetic (Bijvoet & Fraurud, 2008, p. 22; Bijvoet & Fraurud, 2011, p. 16; Bijvoet & Fraurud, 2016, p. 22; Fraurud, 2003, p. 88; Kotsinas, 1988a, p. 268; Kotsinas, 1990, p. 257; Milani & Jonsson, 2012, p. 54; Nordenstam & Wallin, 2002, p. 257).
Kotsinas offered early impressionistic descriptions of the phonology of Rinkeby Swedish and often referred to its prosody as “jerky” (stötigt) and speculated that it might be due to a reduction in the difference between long and short syllables (Kotsinas, 1988a, p. 268; Kotsinas, 1990, p. 257). This, however, was never tested. Bodén (2007) compared vocalic duration between the received Malmö standard and the local multiethnolect (Rosengård Swedish) in a small random sample of her corpus of secondary-school pupils. No significant difference was found in the sample. That is to say, phonologically-long vowels were not shorter and phonologically-short vowels were not longer than those within the standard Malmö Swedish samples (p. 29). She hypothesized that the absence of elisions and reductions might instead have been responsible for the staccato effect that she was perceiving (p. 33).
In the pilot that prefaced this study, I set out to investigate which phonetic features correlated with listener perceptions of “rough” and “non-Swedish” among native speakers of Stockholm’s vernacular (Young, 2018b). Eight short speech samples, all with grammatical and lexical variation removed, were assessed by a panel of 27 native Stockholmers. I then tested which phonetic variants correlated most strongly with their assessments. Low et al.’s (2000) nPVIV of duration (reviewed in Section 4.1) showed the strongest correlation: the low-nPVIV stimuli garnered assessments of “rough” and “non-Swedish,” and the high-nPVIV stimuli garnered assessments of “refined” and “Swedish.” The stimuli, however, were not matched guises, so it was not possible to credibly tie these evaluations to rhythm specifically.
Staccato rhythm has also been observed and investigated in other European multiethnolects. Hansen and Pharao (2010) concluded that the staccato impression of Copenhagen’s multiethnolect was due to transformations of phonologically long and short vowels. Twelve speakers of multiethnolect and 12 speakers of standard Copenhagen Danish participated in a map task that targeted 33 test words. The authors found that speakers of multiethnolect generally had “equal duration of long and short vowels before syllables containing a full vowel” (p. 93) and that this was generally “due to shortening of long vowels rather than lengthening of short vowels” (p. 91).
Torgersen and Szakay (2012) compared nPVIV of duration (reviewed in Section 4.1) in London’s multiethnolect (Multicultural London English, MLE) with that of the white working-class varieties in Havering. They found intervocalic durational contrast to be significantly lower for the former than for the latter. Like Hansen and Pharao (2010), they proposed that the difference might be phonotactic and due to, for example, specific segmental transformations like the monophthongization of the
Fagyal (2010), motivated by earlier observations of staccato rhythm in Parisian Verlan (Cerquiglini, 2001; Duez & Casanova, 1997), investigated the speech rhythm of young working-class boys in a Parisian suburb. She divided her participants by whether they had Algerian heritage or French heritage, yet found no significant rhythmic differences for the three rhythmic measurements taken: (a) percentage of total vocalic duration over total segmental duration, (b) standard deviation of vowel durations, and (c) standard deviation of consonantal durations. One explanation could have been the focus on speaker heritage and L2 effects at the expense of sociolinguistic factors such as self-identification or the possibility that staccato was a shared youth-vernacular feature that spanned all heritage groups (something that I attempt to address in the current study). Another explanation could be that she relied on global rather than local metrics, a topic covered in Section 4.1.
Turning to the segmental phonology of Stockholm’s multiethnolect, no conclusive studies have been conducted to date. A number of observations do circulate, however, and I myself have embarked on some earlier work that I will briefly review here. Just as Torgersen and Szakay (2012) proposed for MLE, I have suspected that the long vowels in Stockholm’s multiethnolect are more monophthongal. In preliminary work (not peer-reviewed), I have found this to be the case for the vowels in
2.5 Staccato rhythm and social salience
Fraurud (2003) proposed that, alongside lexicon, prosody appeared to be the most defining feature of Stockholm’s multiethnolect (p. 87). Bijvoet and Fraurud (2008, 2011) found that laypeople were making similar metapragmatic evaluations. Two listener groups assessed a series of speech stimuli, one of which contained a sample of multiethnic youth language that prompted starkly split assessments. One listener group consisted of monolingual ethnic-Swedish university students; the other listener group consisted of multiethnic secondary-school students from a working-class high school. Whereas most listeners in the first group designated the speech sample as “Rinkeby Swedish” and even pointed to its “staccato-like rhythm,” the second group overwhelmingly saw the speech sample as unmarked “good” Stockholm Swedish. The authors note that,
This speech sample was recorded in a relatively formal situation (a presentation in front of the class). It contains neither slang words nor grammatical deviations and the pronunciation cannot, according to a panel of linguists from Stockholm University, be traced to any particular first language. It is only on the prosodic level that this speech sample diverges from the dominating regional norm—with its light touch of the “staccato intonation” often mentioned in descriptions of multiethnic youth language. (Bijvoet & Fraurud, 2011, p. 16)
Implied here is that “staccato” is salient for outsiders but still not too salient for insiders, which in turn could mean it is in the intermediate stage of evolution, somewhere between what Labov (2001) refers to as indicator and marker (p. 196).
In a later examination of their data, Bijvoet and Fraurud (2016) found that the impression of “staccato” rhythm for speaker Eleni was sufficient to make listeners think they had heard grammatical errors that were not actually there. They also note that even though Eleni sheds a number of “suburban” features when she style-shifts, she does not quite have access to her prosody.
Evidently, a single prosodic feature associated with a low-status variety was enough to make listeners also hear what was in reality not there. (Bijvoet & Fraurud, 2016, p. 22)
Prosody and style-shifting emerge as a more explicit topic in Milani and Jonsson’s (2012) ethnographic study of youth in a multiethnic suburb of Stockholm. Participant Emre relays an encounter with a police officer whereby he shifts out of his vernacular register into the normative Stockholmian style, shedding his so-called staccato rhythm.
On the one hand, the performance of “the policeman” is rendered with a lower pitch and an easily recognizable southern inner city Stockholm accent. On the other hand, his own answers are recounted with the “staccato-like” rhythm associated with Rinkeby Swedish. (Milani & Jonsson, 2012, p. 54)
The above accounts imply that (a) some sort of variable is operating in Stockholm’s multiethnolect to give the impression of “jerkiness” or “staccato” and (b) that there is some degree of social salience. Testing these two implications constitutes the main purpose of this study. A third aim of this study is to examine the age distribution alongside the style-shifting data to make an apparent-time assessment about the trajectory of speech rhythm in Stockholm.
3 Data collection
3.1 Participants
Thirty-six men, ages 24–43 in 2017, were interviewed for this study between 2015 and 2018. They hail from a stratified sample of social classes and ethnicities that itself constitutes a subsample of a larger ongoing project on the speech of Stockholmers. In order to provide a tangible portrait of the speaker population, they are listed in Figure 2 by pseudonym, age, racialization, and occupation. They are superimposed over a map of Stockholm and visually linked to their home neighborhood. Stockholm’s metro line is also included in the map because this is the spatial framework with which the city’s residents often associate its social dialects (Bijvoet & Fraurud, 2012). The majority of the working and lower middle-class invandrare hail from the northwestern and southwestern suburbs, the majority of the working and lower middle-class svensk hail from the southern and southeastern suburbs, and most of the upper middle-class speakers hail from the central four boroughs and the eastern and northern suburbs.
I recruited participants through my own personal and professional networks, and I used the snowball method to seek out new participants through an existing participant’s network. Participants received 100 Swedish kronor ($14) per interview; funding was provided by the British Economic and Social Research Council (ESRC).
All participants either have Swedish as a first language or began acquiring it well within the early stages of Lenneberg’s (1967) Critical Period. All were born in Sweden except for Antonio (2 yrs), Kevin (4 yrs), and Sohrab (6 yrs), whose ages of arrival are in parentheses; they entered Swedish preschool at age 6. The remaining 33 speakers have a Swedish age of onset, by means of preschool entry or home language, of 0 (
Figure 3 contains a population pyramid for Stockholm in 2017 broken down by age and sex. It shows that the age range that I sampled for this study, 24–43 in 2017, is part of the largest two cohorts within the city’s population. It also shows that the focus on male speech excludes women, who constitute a full half of that cohort. Although data collection for women is ongoing and rendering substantial speech data, it remains incomplete. Therefore, this paper will only examine the male speech data collected thus far. As a result of this, it can and will only make claims regarding the portion of Stockholm’s population shaded in black in Figure 3.

Figure 3. Population pyramid of Stockholm according to age and sex for year 2017 (Statistics Sweden).
3.2 Elicitation and interviews
Speakers participated in an adapted sociolinguistic interview (Labov, 1972) that consisted of elicited narratives followed by a formal questionnaire followed by two reading tasks. Some speakers also permitted me to record them with their peers; in those cases, I used this data for
Morris and Zetterman’s (2011) The Circus was selected over Engstrand’s (1990) The North Wind and Sun because its style is less archaic, and I did not want the passage to “key” the voicing of an older speaker. The Circus also contains more syllables than The North Wind and Sun—367 contra 155—and has multiple occurrences of every Swedish phoneme, whereas The North Wind and Sun only has single occurrences. The aim of this elicitation approach was to harvest speech data along a formality cline:
The original sample contained 38 speakers, but two were excluded from the analysis due to poor literacy. The remaining 36 speakers had between average and above-average literacy. Extensive social data was also collected in the interviews, which allowed me to place the participants along a numerical social-class scale that is detailed in Section 4.2 and within one of the two racial categories discussed in Section 2.1.
3.3 Recording, transcription, and segmentation
Recordings were made on individual Zoom H1 recorders with self-powered Audio-Technica lavalier microphones. They are in wav format, mono, with a sample rate of 16,000 Hertz. The speech data was transcribed by native-language transcribers, which was financed by a grant from the Sven och Dagmar Salén Foundation. The transcriptions were then checked by me and subsequently phonetically time-aligned using SweFA (Young & McGarrah, 2017). I then manually corrected the segmentations in accordance with standard segmentation protocol and the guidelines provided in Engstrand et al. (2000). Segmental metrics were extracted using a customized adaptation of Brato’s (2015) script for Praat (Boersma & Weenink, 2017).
The final foot before a pause was included in the calculation (Torgersen & Szakay, 2012; Fuchs, 2016; White & Mattys, 2007; Low et al., 2000; though cf. Thomas & Carter, 2006; Sarmah et al., 2011). I delineated breath groups by pauses that exceeded 150 milliseconds. This is in line with Fuchs (2016, p. 107) but contradicts Thomas and Carter’s (2006) recommendation of 70 milliseconds.
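The pause criterion above can be illustrated with a minimal sketch. It is not the Praat script used in this study; the tuple layout, the empty-string convention for silent intervals, and the function name are all assumptions made for illustration. Only the 150-millisecond threshold comes from the text.

```python
from typing import List, Tuple

# Hypothetical interval layout: (label, start_s, end_s).
# Assumption: silent intervals carry an empty label.
Interval = Tuple[str, float, float]

PAUSE_THRESHOLD_S = 0.150  # pauses longer than 150 ms close a breath group


def split_breath_groups(
    intervals: List[Interval], threshold: float = PAUSE_THRESHOLD_S
) -> List[List[Interval]]:
    """Group non-silent intervals into breath groups bounded by long pauses."""
    groups: List[List[Interval]] = []
    current: List[Interval] = []
    for label, start, end in intervals:
        is_pause = label == ""
        if is_pause and (end - start) > threshold:
            # A long pause ends the current breath group, if it has content.
            if current:
                groups.append(current)
            current = []
        elif not is_pause:
            current.append((label, start, end))
        # Short pauses are skipped without closing the group.
    if current:
        groups.append(current)
    return groups
```

Under these assumptions, a 50-millisecond silence leaves the surrounding segments in one breath group, whereas a 400-millisecond silence starts a new one. Lowering `threshold` to 0.070 would approximate Thomas and Carter’s (2006) recommendation.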
In Stockholm Swedish, coda /r/ typically coalesces by means of a sandhi process by which the subsequent /d, n, s, t/ become [ɖ, ɳ, ʂ, ʈ], respectively (Riad, 2014). In front of other consonants, it usually occurs as an approximant or is partially or fully elided. Syllable-final /r/ was included as part of the vowel because the boundaries V+[ɻ] are highly subjective (Thomas & Carter, 2006, p. 341). Syllable-final /ʝ/ was included as part of the preceding vowel for the same reason. On the other hand, intersyllabic /r/ and /ʝ/ (V+/r ʝ/+V), including the external-sandhi effect of coda /ʝ, r/ + onset V (e.g., han är ung → han ä
Hesitation markers and hesitation lengthening were manually removed as I encountered them. Disfluencies were removed as well. The reading passage contained two words that not all readers knew: “manegen” (circus ring) and “trippelsaltomortal” (triple somersault). These two words were removed from the analysis for all speakers.
The final dataset rendered between 295 and 1517 vocalic intervals per speaker per contextual style, totaling 40,277. The
4 Method
The current study operationalizes rhythm by calculating Low et al.’s (2000) nPVIV with Mishra et al.’s (2012) energy-F0 integral (EFI) instead of just duration. In the immediate subsections I offer a review of rhythmic analysis and a basis for why this multidimensional approach was taken.
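For orientation, the normalized pairwise variability index is commonly written as the mean absolute difference between successive intervals, each difference normalized by the mean of its pair and the result scaled by 100. The sketch below implements that commonly cited form for any positive per-interval value. It is a hedged illustration rather than the extraction script used here: the function name is hypothetical, and the computation of the energy-F0 integral itself is not shown, so the input is simply assumed to be one value per vocalic interval (a duration or an EFI value).

```python
from typing import Sequence


def npvi(values: Sequence[float]) -> float:
    """Normalized pairwise variability index over successive intervals.

    `values` holds one positive number per vocalic interval, in order.
    Assumption: values are strictly positive, so pair means are nonzero.
    """
    if len(values) < 2:
        raise ValueError("nPVI needs at least two intervals")
    total = 0.0
    for a, b in zip(values, values[1:]):
        # Absolute difference of the pair, normalized by the pair mean.
        total += abs(a - b) / ((a + b) / 2)
    return 100 * total / (len(values) - 1)
```

Because each pair is normalized by its own mean, uniformly scaling every interval leaves the index unchanged, and a sequence of identical intervals yields zero.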
Following the below review, the analysis makes use of a novel approach that first statistically models how internal factors affect individual rhythmic pairs—without taking into account any social factors (similar to Clopper & Smiljanic, 2015). Additional space is dedicated to detailing how the predictors are coded because the approach is somewhat new, and one goal of the article is to provide a template that is easy to duplicate. Subsequent to modeling the internal predictors, the analysis adds in the social factors to assess their influence on the response variable.
4.1 Contemporary approaches to rhythm
4.1.1 Defining rhythm
Surprisingly few linguistic studies on rhythm are explicit in how they define rhythm. White and Mattys (2007) argue that rhythm “derives from the repetition of elements perceived as similar” (p. 501). Arvaniti (2009) reminds us that White and Mattys’ (2007) definition is derived from psychological research (Fraisse, 1963, 1982; Woodrow, 1951) and notes that the grouping of stimuli relies not just on duration but on a host of factors, including relative intensity, relative and absolute duration and the temporal spacing of elements (Fraisse, 1963, 1982; Woodrow, 1951). This definition of rhythm implies the presence of meter, which is distinguished from grouping itself: while grouping deals with phenomena that extend over time, meter is an abstract representation that relies on the alternation of strong and weak elements, not on absolute or relative durations (Lerdahl & Jackendoff, 1983). (Arvaniti, 2009, p. 57)
Lerdahl and Jackendoff’s (1983) definition of meter, cited above, is especially appealing because it pivots on the notion of local salience. Rather than specifying an internal property, such as long versus short or loud versus soft, it focuses on the relativity component: strong and weak. White et al. (2012) adopt a similar definition for what they call contrastive rhythm, a feature that is “evident in any string of sounds in which there is an alternation of strong and weak elements” (p. 665).
In this study, I treat meter and contrastive rhythm as the same construct and refer to them simply as rhythm, defined as the alternation of strong and weak elements. Although these elements have traditionally been conceptualized as purely durational in the literature, a growing and credible body of work advocates for a multidimensional approach.
4.1.2 Global and local metrics
A common methodological approach that this study avoids is the deployment of multiple rhythmic metrics. As I outline below, the rhythm literature has only deployed two algorithms that directly measure the alternation of strong and weak elements. The remaining algorithms are proxy measures that sometimes capture rhythmic effects, albeit extraneously.
According to Fuchs (2016, pp. 35–69), contemporary approaches to measuring rhythm can be divided into two categories: global and local. The most common global metrics originate from Ramus et al. (1999) and are the sum of vocalic intervals divided by the total duration of the sentence (%V), the standard deviation of consonantal durations (ΔC), and the standard deviation of vocalic durations (ΔV). These metrics are called global because they assign numerical representations to the overall distribution and variation of segmental matter without specifically modeling sequential contrast. Ramus et al.’s (1999) metrics were the beginning of a distinct era in rhythm studies. Many subsequent studies incorporated either these metrics or iterations of them.
The problem with global metrics is that they serve as proxies; the internal properties of the algorithm were not forged to mathematically simulate rhythmic contrast. So, as proxies, they have a number of inherent weaknesses. Standard deviations of rapid speech are smaller than standard deviations of slow speech, regardless of actual contrastive variation. Consider the standard deviation of sequence [100, 50, 100, 50, 100, 50] versus sequence [50, 25, 50, 25, 50, 25]; both have arguably very similar contrasts with very different standard deviations (
Local metrics provide a resolution to this problem because their internal properties mathematically model sequential contrast. Two are currently in circulation: (a) the rhythmic irregularity measure (RIM, Scott et al., 1986) and (b) the normalized pairwise variability index (nPVI, Low et al., 2000) and its alternate variant the rhythm ratio (RR, Gibbon & Gut, 2001). For most vowel sequences, the RIM and nPVI/RR produce very similar measurements. This study will therefore use only nPVI, and because it takes its measurements from vowels, I henceforth refer to it as the nPVIV (formula provided in Section 4.2).
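The difference between a global proxy and a local metric can be sketched with the two example sequences given above. This is a minimal illustration, assuming the values are vowel durations in milliseconds; the function name `npvi` and the use of the population standard deviation are choices made for this sketch and are not taken from the cited studies.

```python
from statistics import pstdev


def npvi(values):
    """Mean normalized pairwise variability (scaled by 100) over successive pairs."""
    pairs = zip(values, values[1:])
    ratios = [abs(a - b) / ((a + b) / 2) for a, b in pairs]
    return 100 * sum(ratios) / len(ratios)


# The two sequences from the text: same alternation pattern, different tempo.
slow = [100, 50, 100, 50, 100, 50]
fast = [50, 25, 50, 25, 50, 25]

# Global proxy: the standard deviation halves when every interval is halved.
sd_slow, sd_fast = pstdev(slow), pstdev(fast)

# Local metric: each pair is normalized by its own mean, so tempo cancels out.
npvi_slow, npvi_fast = npvi(slow), npvi(fast)
```

Under these assumptions the standard deviation differs between the two sequences while the nPVI does not, which is the property that motivates the choice of a local metric.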
The nPVIV algorithm (shown in Section 4.2) is simply a percentage-change calculation with an agnostic denominator. Whereas the percentage change of, say, a temperature rise from 30 degrees to 60 degrees would be
As it pertains to rhythm in European multiethnolects, Torgersen and Szakay (2012) investigated contemporary London varieties on a large scale with a particular focus on Multicultural London English (MLE). They found that young multiethnolectal speakers from Hackney had an overall lower nPVIV, followed by older speakers from Hackney. The highest nPVIV values were produced by the remaining white speakers from both Hackney and Havering (p. 829). The latter point becomes important for the present study’s findings on svensk working-class speakers (discussed in Section 7.1). In an earlier perception study of Stockholm Swedish (Young, 2018b), I ran a series of regression models that tested the correlation between several phonetic variants and listener assessments of neighborhood and affect. Of all variants tested, speech rhythm as measured by nPVIV showed the strongest statistical correlation to both assessments. These findings motivated the current study and its examination of rhythm in the context of speech production.
There have, however, been some savvy critics of the current rhythmic algorithms. Gibbon (2003) finds the mathematical premise of nPVIV to be problematic. The algorithm assumes a strictly binary interpretation of rhythm and does not have the ability to model unary, ternary dactylic, or anapæstic rhythms. This can therefore result in an averaging out of otherwise important differences. Wiget et al. (2010) caution against over-reliance on any metric because metrics merely provide approximate indications of stress contrast that often are influenced by extraneous factors (p. 1567). Arvaniti (2009) has expressed similar concern and takes particular exception to studies that apply every algorithm to the data in blanket fashion and then focus on results that fit a priori assumptions. To illustrate her point, she applied such a blanket technique to English, German, Greek, Italian, Korean, and Spanish spoken by different speakers and in different speech styles. For most of the algorithms tested, interspeaker and intraspeaker variation was as high within each language as across. Arvaniti’s (2009) work serves as a reminder that algorithms should only be used if their internal mathematical properties closely align with the natural phenomenon one wishes to model. It also reminds us that measures need to be put in place to control for style and speaker effects, something the present study attempts to do.
4.1.3 Moving beyond the durational paradigm
Durational operationalizations of rhythm have dominated the literature for some time, but a growing body of literature has offered a more multidimensional perspective. Low (1998) incorporated integrals into her early calculations of nPVIV, that is, combined calculations of more than one vowel property. She calculated a duration-F0 integral (duration ⋅ mean F0) and an amplitude integral (duration ⋅ mean intensity) in her analysis of Singapore English. Fuchs (2016) also incorporated measurements beyond duration in his calculations of nPVIV, including nPVIV of sonority, voicing, F0, the duration-F0 integral, and the duration-amplitude integral in his analysis of Educated Indian English.
Galves et al. (2002) proposed a novel way of investigating rhythm by calculating mean sonority and pairwise sonority contrasts for 25-millisecond intervals of speech. The authors constructed a function that assigned values close to 1 for the most sonorous intervals and values close to 0 for the most obstruent (p. 324) and applied the function to read-aloud sentences of English, Polish, Dutch, Catalan, Spanish, Italian, French, and Japanese. Results produced measurements for each of the languages that closely reflected their intuitive placement in the “syllable-timed” and “stress-timed” continuum (p. 326).
In her dissertation, Cumming (2010) found that listeners perceive non-speech sounds and isolated monosyllables with a dynamic F0 as longer than those without a dynamic F0. She also found that listeners were more likely to assess stimuli as rhythmic when they had concordance between F0 excursion and duration (2010, p. 191).
Similarly, Fuchs (2014b) conducted a perception experiment whereby listeners assessed stimuli that consisted of a single syllable followed by a second syllable in which the duration of the vowel was varied in 18 steps from 40 to 300 milliseconds and F0 was varied in both syllables at 85, 115, and 145 Hz. Results showed that “for every 60 Hz in F0 difference, there is a 4% increase in perceived duration in the second syllable/vocalic interval.” (p. 1951). Fuchs (2014b) then applied this adjustment to the durational measurements taken from each vowel before calculating the nPVIV of duration in his comparative analysis of British and Indian English (p. 1951). He referred to this adjusted algorithm as “perceived variation” and found it to render a difference between the two varieties that was 6.6% higher than the difference that was measured from just duration.
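The adjustment described above can be sketched as a simple scaling of measured duration. This is only an illustration of the quoted 4%-per-60-Hz figure: the linear interpolation between F0 steps, the use of a signed difference relative to the preceding vowel, and the function and parameter names are all assumptions of this sketch, not details confirmed by Fuchs (2014b).

```python
def perceived_duration(duration_ms, f0_hz, prev_f0_hz, pct_per_60hz=4.0):
    """Scale a vowel's duration by 4% for every 60 Hz its F0 lies above the previous vowel.

    ASSUMPTION: the effect is treated as linear and signed, so a lower F0
    shortens the perceived duration by the same proportion.
    """
    f0_diff = f0_hz - prev_f0_hz
    return duration_ms * (1 + (pct_per_60hz / 100) * (f0_diff / 60))


# A 100 ms vowel at 145 Hz following a vowel at 85 Hz (a 60 Hz difference).
adjusted = perceived_duration(100, 145, 85)

# No F0 difference leaves the measured duration unchanged.
unchanged = perceived_duration(100, 115, 115)
```

The adjusted durations, rather than the raw ones, would then be passed to the pairwise calculation.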
As mentioned above, Low (1998) examined amplitude in her dissertation that prefaced Low et al.’s (2000) seminal study. She calculated the nPVI for root mean square (RMS) amplitude and found higher RMS amplitude nPVI values for British English than for Singapore English (1998, pp. 52–53). He (2012) similarly proposed using amplitude instead of duration for three established measurements—the standard deviation, the variance coefficient, and the nPVIV. This was motivated by a qualitative observation of intensity graphs for L1 English, L1 Mandarin, and L2 English spoken by an L1-Mandarin speaker. All three measurements aligned with the preconceived notion that English was higher-variation than Mandarin and that L2 English spoken by an L1-Mandarin speaker fell in between. Fuchs (2014a) proposed using both average amplitude and duration together in a single calculation by adding individual nPVIV(duration) values to individual nPVIV(amplitude) values and then taking the average for each speaker.
Crucially, the difference in simultaneous variability in duration and loudness between IndE and BrE was higher than either the difference in variability in loudness or duration. This result shows that a measure of simultaneous variability in duration and loudness, the nPVI-V(dur+avgLoud) suggested in this paper, captures an important aspect of variability in prominence between successive vocalic intervals. (Fuchs, 2014a, p. 292)
This is to say that the integral approach produced a result that was distinct from the separate results on duration and the separate results on amplitude.
The integrated measurements reviewed above have often offered a picture that is much more in alignment with perceptions than durational measurements. Not only do these studies imply that rhythmic contrast is multifaceted, they also show that individual segmental elements often work in synergy such that the presence of one can make listeners think they have heard the other.
4.1.4 Prominence in Swedish
If we return to Lerdahl and Jackendoff’s (1983) notion of “strong” and “weak,” it becomes necessary to discuss and define prominence. A separate body of literature circulates on prominence alone, unconcerned about questions on rhythm. There is general agreement in this literature that prominence consists of the following three cues: duration, F0, and intensity (Breen et al., 2010; Fry, 1955, 1958; Kochanski et al., 2005; Lieberman, 1960; Turk & Sawusch, 1996; Wagner & Watson, 2010). Researchers, however, disagree on the extent to which each cue contributes to the construction, and this is complicated by the fact that it may differ by language (Wagner & Watson, 2010, p. 925).
As it pertains to Swedish, Fant and Kruckenberg (1994) found duration to be the most important correlate and note that this is unsurprising since Swedish is a quantity language. However, they note also that F0 is of near-equal importance. Fant and Kruckenberg (1994) also observed that intensity typically correlates highly with F0 except at the very peak of an F0 swing, whereby there often appears to be an inverse relationship between energy and F0 (pp. 137–141). In later work, they expanded their study to perception and tested the correlation between assessments of prominence with duration, F0, energy, and subglottal pressure (Fant et al., 2000). While the authors found that all four parameters correlated with prominence in Swedish, they were unable to rank the contributing weight of each (p. 81).
Strangert and Heldner (1995) conducted a perception study that found that “the greater the F0-rise, the stronger the agreement on focus accent. That is, the size of the focus accent cues the degree of prominence” (p. 59). They note, however, that it only partially explains the variation among listener assessments. Heldner and Strangert (1997) later found that “F0-rise is neither necessary nor sufficient for the perception of focus” (p. 55). In subsequent work, Heldner (2003) found that overall intensity and spectral emphasis were strong correlates to focus accents in production.
The literature offers no clear path for prioritizing or assigning weights to duration, intensity, or F0, but what is clear is that they all likely play a key role in constructing prominence in Swedish. Therefore, all three are incorporated in the operationalization of nPVIV, which I detail in the next section.
4.2 Preparing and coding the data for analysis
This study departs from earlier approaches by treating each nPVIV measurement as its own separate observation. Therefore, the analysis will not rely on the correlation of social predictors to means or medians of nPVIV. Rather, a model will be built that takes into account internal influences on rhythmic contrast, which will facilitate a more credible investigation into whether and to what extent social factors can commandeer rhythm to do their work.
Rhythm is obviously influenced by multiple phonological features, so no examination of external social predictors can be conducted without first identifying language-internal constraints. According to Tagliamonte (2006), any investigation of sociolinguistic variation must first factor in internal predictors (pp. 104–205). For example, in Sharma and Sankaran’s (2011) analysis of (t), they included preceding and following segments and word class as predictors. These were modeled alongside gender, age, formality, and so on (p. 412).
To date, only one study has similarly modeled nPVIV at the observation level (Clopper & Smiljanic, 2015), and no study has attempted to account for any additional internal predictors beyond speech rate. The typical procedure is to model social predictors against the speaker’s mean or median nPVIV while including mean speech rate as the sole internal parameter (see, e.g., Torgersen & Szakay, 2012, p. 830). This is problematic since a single speaker’s nPVIV can move from, for example, 100 to 70 to 55 to 120 to 40 for just a single sequence of vowels. The pair-by-pair nPVI calculations are highly varied and governed by vigorous phonological factors that restrict to what extent a single linguistic community can purpose rhythm for social practice.
Tables 2 and 3 offer a visualization of how the present study differs in its analytic approach. Table 2 shows how my dataset would look in a traditional analysis modeled on means/medians. Table 3 shows my dataset in the new approach, the predictors of which will be discussed in the following subsections on variable coding.
Excerpt of the dataset that a
Excerpt of the dataset that
The response variable
Duration, mean F0, and mean intensity were extracted from every vowel. Where F0 could not be measured, the mean value for the speaker for that style was used. Mishra et al.’s (2012) energy-F0 integral (EFI) was then calculated followed by nPVIV for each vowel pair.
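The extraction steps described above can be sketched as follows. The exact form of Mishra et al.’s (2012) EFI is not reproduced here, so the product of the three cues in `efi` is a placeholder assumption. The vowel values are hypothetical, and the function names are illustrative only.

```python
from statistics import mean


def impute_f0(f0_values):
    """Replace unmeasurable F0 values (None) with the mean of the measured ones."""
    measured = [f for f in f0_values if f is not None]
    fallback = mean(measured)
    return [fallback if f is None else f for f in f0_values]


def efi(duration, mean_f0, mean_intensity):
    """ASSUMPTION: EFI approximated as a plain product of the three cues."""
    return duration * mean_f0 * mean_intensity


def pairwise_npvi(values):
    """One nPVI observation (scaled by 100) per successive vowel pair."""
    return [100 * abs(a - b) / ((a + b) / 2) for a, b in zip(values, values[1:])]


# Hypothetical vowels for one speaker in one style:
# (duration in s, mean F0 in Hz or None if unmeasurable, mean intensity in dB)
vowels = [(0.12, 110.0, 70.0), (0.06, None, 65.0), (0.10, 130.0, 72.0)]

f0 = impute_f0([v[1] for v in vowels])
efi_values = [efi(d, f, i) for (d, _, i), f in zip(vowels, f0)]
observations = pairwise_npvi(efi_values)  # one row per vowel pair
```

Each element of `observations` corresponds to one row of an observation-level dataset, in contrast to a single mean or median per speaker.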
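$$\mathrm{nPVI}_V = 100 \times \left| \frac{EFI_k - EFI_{k+1}}{(EFI_k + EFI_{k+1})/2} \right|$$

where $EFI_k$ and $EFI_{k+1}$ are the energy-F0 integrals of the $k$th vowel and of the vowel that follows it.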
Internal predictors
Phonological quantity, pitch accent, or the lack of accent
Swedish is a pitch-accent language with two lexical accents: accent 1 and accent 2. It is also a quantity language with two phonological categories of vowels, long and short, that resemble the English or German lax/tense distinction (see Riad, 2014). These interact to form four possible combinations: long vowels with accent 1 (
Table 3 offers an example for how the coding is distributed among the four accent/quantity combinations. For example, if either vowel in the nPVIV pair is prominent, phonologically long, and part of an accent-1 word, then the predictor
A lack of prominence strips a word of its lexical pitch accent. For example, on line 6 of Table 3, the
Phrase-finality
According to Klatt (1976), English vowels lengthen within a pre-pausal foot. This is why Thomas and Carter (2006) and Sarmah et al. (2011) chose to omit phrase-final vowels from their calculation of nPVIV. On the other hand, Low et al. (2000), White and Mattys (2007), Torgersen and Szakay (2012), and Fuchs (2016) chose to include them. Fuchs (2016) examined the effects of phrase finality on nPVIV and found that its inclusion or exclusion contributed to negligible change for British English. However, he found that including phrase-final syllables in his analysis of Indian English resulted in a slight decrease in nPVIV for read-aloud speech and a slight increase in nPVIV for spontaneous speech (p. 103). White and Mattys (2007) make a strong case for inclusion: first, lengthening does not occur before all pauses, and second, if there is lengthening, it may contribute to the overall perception of rhythm (p. 507). Since preliminary work has identified phrase-final lengthening as a potential feature of Stockholm’s multiethnolect (Young, 2019, pp. 197–198), it is even more important to code for it (or remove it from the analysis). Therefore, every vowel that falls before a pause is coded
Syllable-final /r/
As discussed in Section 3.3, I take the approach of Thomas and Carter (2006) and combine syllable-final /r/ with the preceding vowel, treating them as a single segment. Since, however, this renders larger vocalic measurements, which, in turn, renders larger nPVIV measurements, they must be accounted for in the statistical model. These occurrences are coded
Speech rate
Speech rate has been shown to affect rhythm even in cases where rhythm is operationalized by means of a rate-normalizing algorithm. The literature is consistent in showing that higher rates of speech result in generally less durational contrast (Dellwo, 2010; Deterding, 2001; Torgersen & Szakay, 2012; although cf. Fuchs, 2016, pp. 150–151). In the case of London’s multiethnolect, Torgersen and Szakay (2012) have demonstrated a correlation between higher rates of speech and lower nPVIV (pp. 831–832). This is also the case with the present dataset. The relationship is nonlinear and strongest at higher speech rates, so the natural logarithm is used in the statistical model. Rate is measured as the mean duration of the two syllables that contain each vowel pair (Deterding, 2001; Thomas & Carter, 2006; though cf. Torgersen & Szakay, 2012, p. 828, and Dellwo, 2010, p. 5).
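The rate predictor described above can be sketched as a one-line transformation. The function name and the choice of milliseconds as the unit are assumptions of this sketch.

```python
import math


def log_rate(syllable1_ms, syllable2_ms):
    """Natural log of the mean duration (ms) of the two syllables holding a vowel pair."""
    return math.log((syllable1_ms + syllable2_ms) / 2)


# Two syllables of 100 ms and 200 ms have a mean duration of 150 ms.
example = log_rate(100, 200)
```

Because rate is expressed as a duration, larger values correspond to slower speech. The logarithm compresses differences among long durations and stretches them among short ones, in line with the stronger relationship reported at higher speech rates.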
Lexical frequency
Lexical frequency can affect vowel reduction (Pluymaekers et al., 2005) and fuller realizations of segments (Zhao & Jurafsky, 2009), both of which can affect intervocalic contrast. Naturally, the
External predictors
Contextual style
Each vowel pair is coded under the
Racialization
As reviewed in Section 2.1, speakers colloquially refer to themselves as svensk or invandrare. The two emic designations are coded as binary factors for the
Social class
Inspired by much of the early First Wave variationist work (Fontanella de Weinberg, 1974; Labov, 2001, 2006 [1966]; Trudgill, 1974; Wolfram, 1969), a 100-point index was devised for the
Age
The
Orthogonality of social predictors
Participants were sampled during the fieldwork such that the three social predictors maintained a correlation coefficient of 0.5 or less between one another. Table 4 provides a matrix showing the correlation coefficients between the three predictors. The predictors show a middle range of orthogonality, which implies that variance inflation factors should especially be heeded when the predictors are included in the same statistical models.
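The orthogonality check described above can be sketched with a plain Pearson correlation and the two-predictor form of the variance inflation factor, VIF = 1 / (1 − r²). The function names are hypothetical. The VIFs reported for the full models involve more predictors than this simplified case, so the figure below is only an orientation point.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def max_abs_correlation(predictors):
    """Largest absolute pairwise r among a dict of named predictor columns."""
    return max(
        abs(pearson(predictors[a], predictors[b]))
        for a, b in combinations(predictors, 2)
    )


def vif_two_predictors(r):
    """VIF for a model with exactly two predictors correlated at r."""
    return 1 / (1 - r ** 2)


# At the sampling ceiling of r = 0.5, the two-predictor VIF is about 1.33.
vif_at_threshold = vif_two_predictors(0.5)
```

A sampling procedure like the one described would keep `max_abs_correlation` at or below 0.5 across the three social predictors.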
Correlation coefficients R between the three social predictors.
Random effects
Speaker
For this dataset, speaker is a particularly important random effect because the number of observations per speaker varies. The lowest observation count is
Vowel
Every observation-level nPVIV consists of a vowel pair. These pairs are not unique for every observation; rather, there are 3334 unique pairs within the 40,277 nPVIV calculations.
5 Analysis
Three multiple regression analyses were run. The first model includes just internal predictors, the second includes all internal and external predictors with all possible interactions, and the third is the optimal internal/external model. Each model output is provided in Table 5. For the internal model, the following call was used in R:
Mixed-effects linear regression models with nPVIV as the response variable. Model 1 includes internal predictors only. Model 2 includes all internal and external predictors and all possible interactions. Model 3 is the optimal model for internal and external predictors and interactions. For categorical predictors, the reference category is in italics (e.g., yes). Coefficients are indicated in the center column, standard errors in the parentheses, and variance inflation factors (VIF) to the right.
***p < 0.001, **p < 0.01, *p < 0.05, °p < 0.1.
Since the main research question concerns the speech of young racialized working-class speakers, these three combinations are possible when adding social factors to Model 1:
The general rule of thumb in a regression model is ten participants per parameter, and according to Heo and Leon (2010), a fourfold increase is needed in order to run three-way interactions (
The
Table 5 Model 2 contains the output, but the number of insignificant predictors—as well as the mildly higher VIF for
6 Results
6.1 Internal influences on rhythm
Model 1 in Table 5 predicts that all internal factors have a significant effect on nPVIV. Tables 6 and 7 work through a case study of how each phonological type affects rhythm. Table 6 contains the calculation of a base constant for the rhythm of a vowel pair that has a word frequency of 7000 (a middle-range figure) and an average speech rate of 150 milliseconds (a middle-range rate). From this base constant, estimated nPVIV values are calculated for each of the main vowel combinations in Table 7.
Calculation of a base constant for a “typical” vowel pair, as predicted by Model 1 in Table 5. The assumption is that the word frequency is 7000, speech rate is 150 milliseconds, and that phrase-finality and an /r/ in coda position are not present. This base constant is added to the accent-length coefficient in Table 7 in order to demonstrate the internal effects that accent (or the lack thereof) has on nPVIV in the dataset.
Vowel-pair combinations in dataset rank-ordered by effect on nPVIV (highest to lowest), as predicted by Model 1 in Table 5.
Table 7 aims to show to what extent rhythm is internally governed, absent social effects. In other words, the internally constrained envelope of variation (Labov, 1972) is substantial and lies between approximately 36.4 on the lower end and 71.9 on the upper end. Clear here is that rhythmic variation is dominated by the distribution of prominence and phonological quantity, two features that are highly language-specific. This explains why the internal coefficients dwarf the external coefficients in size and continue to retain both strength and significance in Models 2 and 3.
6.2 External influences on rhythm
6.2.1 Age
For all speech styles, Model 3 predicts that age will have a main effect such that for every year older a speaker is, his 100-point nPVIV will increase by
6.2.2 Racialization and social class
For casual speech, Model 3 predicts that invandrare speakers will have a lower nPVIV than svensk speakers. But this is complicated by social class. Racialization only has a minor effect on nPVIV when social class is high. When social class is low, it has a large effect. An elite invandrare man with a social class index of 100 is predicted to have an nPVIV that is
Important also is that elite speakers, regardless of racialization, are predicted to have an nPVIV between the two lower-class groups. A lower class svensk speaker is predicted to have an nPVIV that is
6.2.3 Contextual style
When comparing the main effects for
Both higher-class svensk and invandrare speakers, however, are predicted to style-shift in the same direction. Higher-class svensk men are predicted to increase their nPVIV by
6.2.4 A systematic overview of the results
While individual coefficients do offer insight, they are by necessity stripped of context and comparative utility. Figure 4 therefore contains a case study of the implications of Table 5, Model 3. Four speakers from opposing ends of the class and race spectrum are modeled: lower-class and higher-class svensk and lower-class and higher-class invandrare. A class index of 1 is used for lower class, and a class index of 100 is used for higher class. A fifth speaker, an invandrare within the lower-middle range of the class distribution, is also provided (class index

Rhythm, class, and race in apparent time for Stockholm, year 2017: A simulation built from the coefficients in Table 5, Model 3 for five hypothetical speakers in three contextual styles from three cohorts (born in 1977, 1987, and 1997). Hypothetical speakers are working-class (WC) svensk, working-class (WC) invandrare, lower-middle-class (LMC) invandrare, upper-middle-class (UMC) svensk, and upper-middle-class (UMC) invandrare. Working class is modeled with a class index of 1; lower-middle class is modeled with 30; upper-middle class is modeled with 100. Contextual styles are informal, formal, and very formal.
Overall, nPVIV decreases in apparent time, independent of other social factors. However, the interaction with race and class enriches this picture. Although the entire system appears to be moving towards low alternation, the stratification between invandrare and svensk remains stable relative to the speech community. Among lower-class invandrare men, staccato rhythm is a consistent variant that becomes more and more staccato from 1977 to 1997 while maintaining its relative difference from the overall system. For lower-class svensk men, the high-alternation variant also appears to be becoming lower and lower, albeit in consistent relationship to the overall movement of the system.
The interaction with style adds another dimension that is of particular importance to the question of rhythm and social salience. As the system moves in apparent time toward an overall lower degree of alternation, speakers continue to move toward a shared rhythmic norm in more formal styles. The
7 Discussion
7.1 Race and class stratification
The findings indicate that rhythm is socially stratified such that it splits three ways in the vernacular of men: low “staccato” alternation within the racialized invandrare working class, high alternation within the svensk working class, and intermediate alternation within the middle classes/elites. The first of these findings supports much of the earlier local work that has described the prosody of Stockholm’s multiethnolect as “staccato.” Beyond local relevance and of interest to sociolinguistic theory is the finding that the main stratification is horizontal rather than the top-to-bottom trend seen in typical class-based sociolectal variation:
This finding is somewhat surprising in light of some proposals that European multiethnolects have close ties to their cities’ respective traditional working-class varieties. For example, Rampton (2011) has found that speakers of London’s contemporary urban vernacular also use “a significant number of traditional London vernacular features” (p. 288) such as TH-dropping and fronting, H-dropping, alveolar ING, and centralized
In light of other work, these results are less surprising. As was discussed in Section 4.1, Torgersen and Szakay (2012) found that while young speakers of MLE had the lowest overall nPVIV, the highest nPVIV values were for the white working-class speakers from Hackney and Havering. Similarly, in an investigation of the
In Copenhagen, a city similar to Stockholm in terms of language and sociocultural context, one variable also seems to have dual working-class variants that straddle an “intermediate” middle-class feature. Strong t-affrication [ʦ] is a traditional Low Copenhagen variant for /t/ (Quist, 2010, p. 638) and is still a widely-known stereotype of the “Amager” working-class persona. On the other hand, mild affrication [ts] is the standard middle-class (and traditional High Copenhagen) variant. In recent years, however, palatalized [tj]—a variant produced at the other end of the articulatory range—has emerged as the multiethnolectal variant (Pharao & Maegaard, 2017) in some of the very neighborhoods where [ʦ] had dominated earlier.
In Stockholm, as is the case in Copenhagen, London, and Paris, the racialized working class lives in much closer proximity to the white working class than it does to the middle classes. Therefore, results like these implicate two possible catalysts—one being “strong mechanical” and the other being “strong identity” in nature. Within the “strong mechanical” hypothesis, the group second language-acquisition effect is so strong that its output remains impermeable to the locally-available variant. So if a substrate variant like low-alternation prosody dominates the feature pool in a multiethnic housing estate and overwhelms the inputs that subsequent generations in that housing estate are exposed to (by virtue of many heritage languages having such prosody), it will emerge despite the fact that high-alternation prosody is the dominant feature in the greater urban area outside of that housing estate. Within the “strong identity” hypothesis, children in the housing estate perceive from a very early age which social types produce which variants, and a norm is established during adolescence that involves an identity move toward one social type (e.g., working-class immigrant) and away from another (e.g., working-class Swede). So if a substrate variant like low-alternation prosody is competing against other variants within the feature pool, its chance for domination and selection into the new vernacular increases if its phonetic “opposite” is already in use by an opposing group.
The present study does not have the data to prefer one hypothesis over the other, but I believe that discussing these results, and similar results elsewhere, within this schematic can be beneficial—especially if we are to understand how class and racialization tug at the mechanics of language contact.
7.2 Rhythm in Södersnack
While the three-way split for rhythm is interesting from a socio-systematic perspective, it is also interesting in light of the paucity of studies on Södersnack. Section 2.2 offered a review of Stockholm’s industrial-era working-class variety Södersnack and proposed that it might have higher rhythmic alternation than the city’s other varieties due to its higher inventory of diphthongal long vowels (Kotsinas, 1994; Öqvist, 2010). A similar link has been proposed by Torgersen and Szakay (2012), albeit in the other direction. The typically diphthongal
Important here also is the result that Södersnack is considerably more rhythmically deviant from received Stockholmian than multiethnolect. Whereas numerous accounts of staccato rhythm circulate for Stockholm’s multiethnolect, I am unaware of any such metalinguistic account for the prosody of Södersnack. Furthermore, there has been an underlying assumption in any discourse on Swedish multiethnolect that the distance between its variants and standard variants would surely be larger than the distance between more “indigenous” varieties and the standard. The present findings disrupt this assumption and typologically place Stockholm’s multiethnolect closer to the received standard than Södersnack as far as rhythm is concerned.
The strong tendency to style-shift among the svensk working class also implies a degree of salience for high-alternation rhythm. But again, the lack of scholarly and popular commentary about its prosody frustrates any meaningful interpretation. So rather than rhythm being a sociolinguistic marker in the epiphenomenal sense, a more likely possibility is that it is a phonotactic result of Södersnack diphthongal segments being replaced with received-standard monophthongal counterparts in more formal speech. This link has neither been tested nor substantiated, but the proposal is not unreasonable given what is known about its vowels (Bergman, 1946; Kotsinas, 1994; Öqvist, 2010; Ståhle, 1975).
7.3 Social salience
Turning to the question of social salience, the respective rhythmic variants of both working-class groups appear to target the rhythmic pattern of higher-class casual speech. The focusing is less uniform in the unsolicited reading task (
As discussed above, the variant that belongs to the svensk working class appears to be much more socially salient than the variant that was of primary interest to this study, namely the staccato variant of the invandrare working class. If one accepts the indicator > marker > stereotype progression proposed by Labov (1972), this would imply that the former is an older, more established feature than the latter, which bolsters the possibility that the former is a legacy feature of the city’s industrial working class.
As it concerns the social salience of staccato rhythm, the mild degree of style-shifting suggests that the feature still has mild social salience for its speakers. If the simulation in Figure 4 is re-examined, lower-middle class invandrare speakers are predicted to style-shift into the intermediate rhythmic pattern produced by higher social groups. The lowest invandrare class is predicted by the model to shift by the same amount, but their lower start point means that full normative rhythm is not achieved. If one considers these findings in light of those of Bijvoet and Fraurud (2011; reviewed in Section 2.5), one could imagine that the mild shift predicted for a lower working-class speaker would be sufficiently prestige-sounding for his peers while remaining deviant-sounding for higher-class speakers. So if one looks at such a shift from 46.8 to 48.8 for a working-class invandrare born in 1997 (Figure 4), this may register as “good Swedish” for a number of multiethnolectal speakers while still sounding like “Rinkeby Swedish” for speakers from other social groups.
Returning to the evolutionary progression proposed by Labov (1972), Table 5 Model 2 offers appealing results because the added interaction of style with age renders an increase in stylistic sensitivity as the age of invandrare speakers decreases. This implies a continued movement from indicator to marker. However, as discussed in Section 5, the age and style interaction had to ultimately be discarded because it weakened the model fit and was not significant. The direction of the coefficients, however, serves as a reminder to test stylistic sensitivity in apparent time in future investigations.
7.4 Rhythm in apparent time
When examining age as an apparent-time proxy for diachronic development (Labov, 2001), the statistical model reveals two important trends: (a) the staccato low-alternation feature of working-class invandrare men is becoming more staccato over time, and (b) the speech of all male groups is becoming more staccato over time while maintaining similar stratifications, illustrated most clearly in Figure 4. To rephrase the last point, the whole of Stockholm appears to be becoming staccato over time, led by the invandrare working-class, while the svensk working class appears to lag the furthest behind.
A higher-class speaker born in 1987 (53.8 to 56.7) is predicted to have speech rhythm that resembles a working-class and lower-middle-class invandrare born in 1977 (54.6 and 56.4, respectively). A higher-class speaker born in 1997 (49.9 to 52.8) is predicted to speak with a rhythm that is actually lower than that of a working-class invandrare speaker born in 1977 (54.6).
Two possible reasons come to mind for this trend. The first possibility is that the contact prosody of Stockholm’s lower classes is an active change from below that is incrementally diffusing into mainstream speech all while younger speakers “hypercorrect from below” (Labov, 1972, p. 178) by moving even further into staccato territory. A similar process was found by Trudgill (1988) for t-glottaling in Norwich. T-glottaling was an exogenous contact feature from the South that first entered Norwich through the working-class vernacular. With time, it climbed the class hierarchy and became more or less socially ubiquitous (p. 45).
A second possibility is that extensive language contact, beyond that of the multiethnic periphery, is rendering Swedish prosody less typologically marked. Sweden is characterized by its many global brands and its robust export economy, and Stockholm lies at the center of this activity. Further, the city has witnessed a recent finance and technology boom that has brought in expatriates from all corners of the globe. When nPVIV of duration for Swedish (54.9, Young, 2018b, p. 50) is compared against the other languages that have been studied (Fuchs, 2014a, p. 81), only British, New Zealand, and Thai English have higher nPVIV values. All other languages tested in the literature have lower nPVIV measurements. Since English and Swedish are the two dominant lingua francas in Stockholm, it is plausible that contact with L1-accented Swedish, L1-accented English, and American English may be driving this incremental downward shift in the rhythmic alternation of Stockholmian prosody.
7.5 Accounting for rhythm phonotactically
As it pertains to phonotactic accounts, these findings lead to interesting questions. Table 7 demonstrated that the internally-governed variation within a single language like Swedish is extremely high for nPVIV. When an accent-1 long vowel is adjacent to an unstressed short vowel, the nPVIV values are predicted to be very high; when an unstressed long vowel is adjacent to an unstressed long vowel, the nPVIV values are predicted to be very low. Although the sociolinguistic analysis has revealed socially-governed variation that is independent of internal factors, it has not investigated where in the phonology the change is occurring. As discussed in Section 2.4, Kotsinas speculated that Rinkeby Swedish might be reducing the difference between long and short syllables (Kotsinas, 1988a, p. 268; Kotsinas, 1990, p. 257), and my preliminary investigation suggests that this may be the case (Young, 2019, p. 213). Are working-class invandrare men reducing their accented vowels durationally? Or are they enacting a reduction in, say, intensity? Or is it rather the case that the staccato effect is due to the enlargement of unstressed vowels, be it in duration, F0, intensity, or all three combined? Or is it the case that the rhythmic grid is epiphenomenal to its segmental components, enacting enlargements or reductions wherever necessary to coerce a specific pattern? These questions are currently under investigation, and the results are still tentative (Young, 2019, pp. 189–198).
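To make concrete why neighboring vowels of unequal size yield high values and neighbors of equal size yield low ones, the following is a minimal sketch of the normalized pairwise variability index as it is conventionally defined in the rhythm literature: the mean of the absolute differences between successive intervals, each divided by the mean of the pair, multiplied by 100. It is assumed here, not confirmed by this section, that the nPVIV measurements in this study follow that conventional definition. The function name npvi and the durations in milliseconds are hypothetical and are not drawn from the study's data.

```python
def npvi(values):
    """Normalized pairwise variability index over successive intervals.

    Conventional definition (assumed, not taken from this study):
    100 * mean(|a - b| / ((a + b) / 2)) over each adjacent pair (a, b).
    """
    if len(values) < 2:
        raise ValueError("need at least two intervals")
    terms = [
        abs(a - b) / ((a + b) / 2)
        for a, b in zip(values, values[1:])
    ]
    return 100 * sum(terms) / len(terms)


# Hypothetical vowel durations in milliseconds (illustrative only).
alternating = [180, 60, 180, 60]  # long vowels next to short vowels
even = [90, 90, 90, 90]           # neighbors of equal duration

print(npvi(alternating))  # 100.0 (strong alternation)
print(npvi(even))         # 0.0 (no alternation, maximally "staccato")
```

Because each pair is normalized by its own mean, the index is insensitive to overall speech rate. The same function could in principle be applied to F0 or intensity values rather than durations, which is one way the questions above could be separated empirically.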
As was reviewed in Section 2.4, a number of other options are possible such as monophthongal long vowels (Young, 2019, p. 209), unstressed vowels qualitatively further from schwa (Young, 2019, p. 212), phrase-final lengthening (Young, 2019, pp. 197–198), and elided or flapped coda rhotics (Young, 2018b, pp. 50–51) among the invandrare working class. None of these findings has gone through the peer-review process, and all are still preliminary in nature. Furthermore, researching and substantiating their presence would only be the first step. The second step would be to test their correlation with the current stratification of rhythmic contrast. The present study, however, has established rhythm as a stratified variable in Stockholm and can hopefully be a point of departure for future studies that might wish to explore related phonotactic phenomena.
Acknowledgements
I wish to thank Devyani Sharma, Erez Levon, Michael McGarrah, Paul Kerswill, Nicolai Pharao, Adam Chong, and Erik Thomas for the helpful and generous feedback that led to this final analysis. Thanks also to Keder Akman and Laura Yu for helping recruit participants and to the participants themselves for being so generous with their time. I am also grateful to the anonymous reviewers who dedicated so much time and effort to ensure that this work might meet the rigor of our discipline. I am of course responsible for all remaining shortcomings.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was funded by The British Economic and Social Research Council and the Sven och Dagmar Salén Foundation.
