Abstract
All languages use individually meaningless, contrastive categories in combination to create distinct words. Despite their central role in communication, these “phoneme” contrasts can be lost over the course of language change. The century-old functional load hypothesis proposes that loss of a phoneme contrast will be inhibited in relation to the work that it does in distinguishing words. In a previous work we showed for the first time that a simple measure of functional load does significantly predict patterns of contrast loss within a diverse set of languages: the more minimal word pairs that a phoneme contrast distinguishes, the less likely those phonemes are to have merged over the course of language change. Here, we examine several lexical properties that are predicted to influence the uncertainty between word pairs in usage. We present evidence that (a) the lemma rather than surface-form count of minimal pairs is more predictive of merger; (b) the count of minimal lemma pairs that share a syntactic category is a stronger predictor of merger than the count of those with divergent syntactic categories, and (c) that the count of minimal lemma pairs with members of similar frequency is a stronger predictor of merger than that of those with more divergent frequencies. These findings support the broad hypothesis that properties of individual utterances influence long-term language change, and are consistent with findings suggesting that phonetic cues are modulated in response to lexical uncertainty within utterances.
1 Introduction
Much theoretical work in linguistics over the last 50 years has held that the important factors accounting for sound patterns in language must be highly general and abstract (cf. the competence/performance distinction, Chomsky, 1965). Consistent with this hypothesis, many details of the pronunciation of a given phoneme token are in fact significantly predicted by its abstract phonological context, rather than by the specific word it occurs in. For example, for any word in American English, a /t/ is virtually certain to be pronounced as a flap if it occurs between two vowels, where the second vowel is unstressed. Standard phonological theories (e.g., Chomsky & Halle, 1968; Kenstowicz, 1994; Prince & Smolensky, 2004 [1993]) assert that all phonological information must be encoded at the most abstract level of representation and that there cannot be any informational redundancy across representational levels: otherwise, individual words would be expected to be able to develop their own idiosyncratic phonetics (reviewed in Wedel, 2007).
This approach has been very fruitful in describing categorical, synchronic phonological patterns within languages, but has been arguably less so in accounting for how those patterns arise and change over time (e.g., Blevins, 2004). Further, a great deal of recent work has established that much of the variation that does exist in the pronunciation of a given phoneme is in fact predicted by a wide variety of more local, less abstract factors such as word identity (Pierrehumbert, 2002), near lexical neighbors (Baese & Goldrick, 2009; Munson, 2007; Munson & Solomon, 2004; Peramunage, Blumstein, Myers, Goldrick, & Baese-Berk, 2011; Scarborough, 2010; Wright, 2004), and most relevantly for this work, the information content of a phoneme (or other sublexical element) relative to context (Aylett & Turk, 2004; Cohen Priva, 2012; Jurafsky, Bell, Gregory, & Raymond, 2001; Kaiser, Li, & Holsinger, 2011; Raymond, Dautricourt, & Hume, 2006; Son & Pols, 2003, among many others). As a result, a number of newer models have been developed over the last two decades that integrate sources of variation at multiple levels into accounts of the development of phonological patterns over time. These models, which we will refer to with the umbrella term “variationist/usage-based/evolutionary,” or VUE, differ from earlier theoretical approaches in their explicit integration of an arbitrarily large set of possible influences on pattern formation in language, including possibly “universal” linguistic biases, domain-general cognitive biases as well as more idiosyncratic influences that may derive from particular structures in a language, or the culture and history of a speech community. 
Beyond their ability to account for a wider range of synchronic behavioral data, these models draw on a wide range of findings in fields such as evolutionary theory, cognitive science, historical linguistics and phonetics to provide new hypotheses about how variation at multiple levels and time-scales can drive the development, propagation and consolidation of phonological patterns over time. 1
In this paper, we present new evidence consistent with the hypothesis that sound change is biased toward selective maintenance of those phonemes that contribute more to distinguishing existing lexical items in usage (Blevins & Wedel, 2009; Wedel, 2012; Wedel, Kaplan, & Jackson, 2013). This finding represents a challenge to classic models of phonology, which cannot easily link sound change to contrast among actual lexical items; indeed, generative phonological formalisms are explicitly designed to exclude actual words from consideration (see section 4 below). In contrast, VUE models contribute testable hypotheses about the relationship between existing lexical forms, their usage, and the development of phonological patterns (e.g., Hay & Maclagan, 2012; Hume et al., 2013; Maclagan & Hay, 2007; Wedel, 2007).
1.1 The functional load hypothesis
Virtually all human languages exhibit what has been termed “duality of patterning” in which the smallest meaningful elements of language, often termed morphemes, are in turn composed from a limited set of contrastive but largely meaningless elements, often termed phonemes (reviewed in Ladd, 2012; see also Sandler, Aronoff, Meir, & Padden, 2011). The phoneme systems of languages are constantly in flux, and over the course of time a phoneme contrast can be lost from a language, either through phoneme merger, or when one member of the contrast erodes away in some context. As an example of phoneme merger, the historically contrastive vowels in the words caught and cot have merged for most English speakers in Canada and in the western part of the United States, such that the phonetic distinction between these two words has been lost. As a result of this homophony, caught and cot can only be distinguished through other sources of information, such as local morphosyntactic and discourse context. Phonemes can also be lost in specific phonological contexts, rather than lost from the entire language. An example of a context-specific phoneme loss is the case of [h] deletion before the high front glide [j] in some east coast dialects of American English, such that the word human is pronounced [jumən] rather than [hjumən] (Labov, Ash, & Boberg, 2006).
For nearly a century, language researchers have explored the intuitively attractive idea that a phoneme contrast might be less likely to be lost from a language if it does more “work” in information transmission in communication (Blevins & Wedel, 2009; Gilliéron, 1918; Hockett, 1967; Kaplan, 2011; Martinet, 1952; Mathesius, 1931; Silverman, 2010; Surendran & Niyogi, 2006; Trubetzkoy, 1939; Wedel et al., 2013). The notion of “work” in this sense has been operationalized and investigated in a number of ways (reviewed in Surendran & Niyogi, 2003), but until recently, clear evidence in favor of this functional load hypothesis has been elusive. In Wedel et al. (2013), we provided the first statistical evidence in support of the functional load hypothesis by showing that within a database drawn from eight languages, phoneme contrasts that distinguish more minimal pairs 2 were significantly less likely to have merged over the course of time. Furthermore, for those phoneme contrasts that distinguish few or no minimal pairs, we found that phoneme frequency was positively correlated with merger probability. As described below, these two findings are consistent with the functional load hypothesis within VUE models of language change. Here, we provide additional evidence consistent with the notion that the functional load effect is based in uncertainty between possible lexical meanings in usage. First, we show that the inverse correlation between minimal pair count and phoneme merger probability is strongest when pairs are counted at the level of the lemma, rather than at the level of surface word form (which often contains additional affixal material). Second, we show that the functional load effect is significantly stronger for lemma pairs of the same lexical category, which are less likely to be strongly distinguished by morpho-syntactic context. 
Finally, we report some evidence for relative lexical token frequency effects, as predicted on information theoretic grounds (Shannon, 1948).
1.2 Background
The VUE family of models has a long history reaching back to the 19th-century work of Baudouin de Courtenay (1895). Modern VUE models draw on findings showing that mental categories maintain some record of experienced variation, rather than consisting solely of abstract generalizations (reviewed in the context of language in Baayen, 2007; Bybee, 2002; Ernestus, 2011; Johnson, 1997; Pierrehumbert, 2002; Pisoni & Levi, 2007). Further, experiencing category token variants has been shown to influence subsequent production and perception of tokens of the same and similar categories (reviewed in Pierrehumbert, 2002, 2003; Wedel, 2012; Goldinger, 2000; see also Nielsen, 2007; Pardo, 2006; Eisner & McQueen, 2005; Kraljic & Samuel, 2005a, 2005b; Norris, McQueen, & Cutler, 2003). This body of findings suggests that a speaker’s lexical category system can be understood as a steadily-updating multi-dimensional network, in which experienced phonetic detail can be represented redundantly at multiple levels of analysis and where generalizations can emerge from and coexist with that detail (see e.g., Beckner et al., 2009; Bybee & McClelland, 2005; Elman, 1995; Hay & Drager, 2007; Walsh et al., 2010; Wedel, 2012). This conception of the lexicon as a densely interconnected network of representations of different granularities provides a pathway by which variants can spread from word to word (e.g., Bybee, 2002; Phillips, 2006; Wang, 1969) and from sound to sound (e.g., Kraljic & Samuel, 2005a; Mielke, 2008) over lifetimes and generations within a speech community. For the present work, the relevant prediction of the general VUE model is that if speakers pronounce words more clearly when they are less likely to be correctly recognized by a listener, both the lexical and phonological systems of the language should evolve to reflect this bias (Piantadosi et al., 2011; Wedel, 2006, 2012).
The problem of linguistic category recognition can be thought of as that of determining the most likely mapping from a noisy speech signal to the particular category intended by the speaker (where “linguistic category” here refers to any category that may be relevant for language processing, e.g., phonemes, lemmas or surface word-forms; Flemming, 2010; Jaeger, n.d.; Jaeger & Ferreira, 2013). Listeners have two kinds of information that interact in this process: bottom-up information from the phonetic signal itself, and top-down expectations derived from experience with the language system: what range of category labels are available at a given level of analysis, how likely they are in the current context, and how likely they are to be realized by the current signal. Bayes’ Rule provides a way to combine the two types of information from the signal and from the system to assess the probability that the speech signal S should be mapped to the category C (equation 1). The term p(C | S, ctxt) is the posterior probability of the category C given the signal S in the current context, p(C | ctxt) is the top-down prior expectation for C given the context, and p(S | C) is the probability of S given the category C. The summation in the denominator gives the probability of that signal realizing some category given the context.
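Equation 1 itself is not reproduced in this text. Under the definitions just given, a standard statement of Bayes’ Rule for this setting would take the following form. This is a reconstruction rather than the original typesetting: the prior is written as p(C | ctxt), and C′ is an assumed index ranging over the categories available in context.

```latex
p(C \mid S, \mathrm{ctxt}) \;=\;
  \frac{p(S \mid C)\, p(C \mid \mathrm{ctxt})}
       {\sum_{C'} p(S \mid C')\, p(C' \mid \mathrm{ctxt})}
\tag{1}
```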
Noise in the channel – that is, in the environment and in the production and perception processes themselves – always introduces uncertainty in the mapping from the signal to a category by the listener. This model predicts that, all else being equal, the listener will be more likely to infer the correct mapping from signal to category when the probability of the perceived signal given the intended category, p(S | C), is higher. The probability that the particular signal S maps to C depends, at least in part, on the number of categories that compete in context, and how closely their signal distributions overlap with that of the intended category. (We will refer to competitor categories that are both relatively probable in context and phonetically similar as near competitors.) As a consequence, to achieve the same effectiveness of communication in a particular context, a category with a near competitor will require a signal with enhanced cues relative to one with no near competitors. Conversely, a category with no near competitors can be produced with a reduced signal and yet maintain a given effectiveness of communication.
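The effect of a near competitor on the posterior can be illustrated with a small numerical sketch. Everything in it is an illustrative assumption rather than part of the study: categories are modeled as one-dimensional Gaussian cue distributions with unit variance, priors are equal, and the names `gaussian_likelihood` and `posterior` are hypothetical.

```python
import math

def gaussian_likelihood(signal, mean, sd=1.0):
    """p(S | C) under an assumed Gaussian cue distribution for category C."""
    z = (signal - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))

def posterior(signal, intended, categories):
    """Posterior p(C | S, ctxt) for the intended category via Bayes' Rule.

    categories maps a category name to (prior in context, cue mean).
    """
    weighted = {
        name: prior * gaussian_likelihood(signal, mean)
        for name, (prior, mean) in categories.items()
    }
    return weighted[intended] / sum(weighted.values())

# Toy contexts: the competitor's cue distribution is either far from
# or near to that of the target (assumed values).
far_context = {"target": (0.5, 0.0), "competitor": (0.5, 4.0)}
near_context = {"target": (0.5, 0.0), "competitor": (0.5, 1.0)}

p_far = posterior(0.0, "target", far_context)
p_near = posterior(0.0, "target", near_context)
# An "enhanced" signal: the cue is shifted away from the near competitor.
p_enhanced = posterior(-1.0, "target", near_context)

print(round(p_far, 3), round(p_near, 3), round(p_enhanced, 3))
```

In this toy setting the same typical signal yields a lower posterior for the target when the competitor is near, and shifting the cue away from the competitor recovers part of the loss, which is the qualitative pattern described above.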
The Smooth Signal Redundancy (SSR) (Aylett & Turk, 2004) and Uniform Information Density (UID) hypotheses (Jaeger, 2006, 2010; Levy & Jaeger, 2007; see also Son & Pols, 2003) propose that channel use by speakers in fact tends to become optimized in this way through smoothing out peaks and valleys in the uncertainty of the mapping between the speech signal and the intended message. These proposals are based in part on a wide range of experimental work showing that phonetic cues to category identity tend to be enhanced/reduced when the category is less/more predictable given the local discourse or sentential context (e.g., Aylett & Turk, 2004; Cohen Priva, 2012; Jurafsky et al., 2001; Kaiser et al., 2011; Raymond et al., 2006; Son & Pols, 2003). If cognitive systems underlying human language are more generally organized to facilitate effective information transfer in this way, then phonetic cues to a category with a near competitor should be relatively enhanced because a near lexical competitor creates additional uncertainty (Hume et al., 2013; Jaeger, n.d.; Wedel, 2012).
VUE models of sound change, in turn, predict that frequent enhancement of a phonetic cue to a lexical category will become reflected in its long-term phonetic representation. In support of this prediction, Baese and Goldrick (2009) and Peramunage et al. (2011) report that a phoneme contrast distinguishing a minimal pair is relatively hyperarticulated in productions of one member of the pair even when the other is not present in the context (see also Cohen Priva, 2012). Finally, the set of observations that phonetic detail spreads from word to word within a community over time (Bybee, 2002; Phillips, 2006; Wang, 1969) provides a mechanism within the general VUE model for phoneme-contrast enhancement in some words, some of the time, to influence the probability of phoneme contrast merger across the entire lexicon (Wedel, 2012).
1.3 Factors tested in the model
Here, we extend the work reported in Wedel et al. (2013) to further investigate predictions of the functional load hypothesis regarding the relationship between phoneme-contrast merger probability and uncertainty between lexical categories.
1.3.1 Lemma versus surface word-form
Previous investigations of the general functional load hypothesis have focused on contrast measured at different structural levels of analysis within a language system, such as the phoneme, the syllable, or the word (reviewed in Surendran & Niyogi, 2003) and their potential relationship to phoneme-contrast loss. The relationships among these structural levels are non-trivial; for example, some phoneme contrasts are distributed across the lexicon of English such that they distinguish many pairs of words (e.g., /ɪ ~ ε/ in minimal pairs like bit ~ bet, pit ~ pet, bitter ~ better, etc.). Other potential contrasts may distinguish no minimal pairs at all, such as /h ~ ŋ/. Thus, an /ɪ ~ ε/ merger in English would result in many new homophones, while an /h ~ ŋ/ merger would not. As a consequence of this partial disassociation between the effects of contrast loss between levels, we can ask at what level of analysis (e.g., phoneme, morpheme, surface-form) a given measure of contrast loss is more closely correlated with phoneme contrast merger.
The functional load hypothesis as originally conceived (e.g., Martinet, 1952; Trubetzkoy, 1939) predicts that phoneme contrasts are maintained in relationship to their support of contrast between meaningful elements (e.g., morphemes). Further, the specific resistance to contrast loss that is often observed within morphological paradigms (Blevins & Wedel, 2009, and references therein) strongly suggests that inhibition of phoneme contrast loss can be driven by the avoidance of phonological contrast loss between semantically similar categories. Finally, if information density is modulated in speech to support transmission of semantic meaning (as opposed to, for example, solely formal meaning such as phoneme category identity), phoneme contrast merger may be best predicted by some measure of functional load based in meaningful lexical elements (Jaeger, 2006, 2010). In support of this notion, in Wedel et al. (2013), we showed that within a dataset of content words from eight languages, phoneme merger is better predicted by a measure of contrast at the surface-form word level (the number of minimal pairs distinguished by the phoneme contrast) than a range of measures of contrast at non-meaningful levels of analysis (including phoneme and biphone type and token frequencies, as well as phoneme and biphone entropy measured over the source corpus).
However, to our knowledge there is no clear prediction concerning precisely what level of “meaningful lexical element” should be most relevant to a functional load effect. On the one hand, surface-forms (e.g., a lemma plus any phonologically associated material such as clitics and affixes) may better represent the basic level that is processed in speech, and thus could better reflect the potential ambiguities that speakers and listeners may encounter. Silverman (2010) argues on this basis that surface-forms, and not lemmas, should be used in evaluating the functional load hypothesis. On the other hand, a growing body of research suggests that lemmas may be more predictive than surface-forms both of fine details of speech (Berlove, Caselli, & Cohen-Goldberg, 2012) and of broader phonological patterns (Albright, 2009, p. 37, and references therein). Here, we use an expanded dataset to further distinguish between contrast measured at two distinct levels of analysis: the lemma (i.e., root) versus the surface word-form, which includes any affixes that are present.
Most of the languages represented in our dataset exhibit nominal, adjectival and/or verbal morphological paradigms, with the result that a given lemma often appears in multiple surface forms that differ only in their affixes. As a result, a phoneme contrast loss that renders two lemmas homophonous will produce more or fewer surface-form homophones depending on size of the paradigm shared by those lemmas. For example, when counting minimal pairs distinguished by the phonemes /ɪ ~ ε/ in English, the verb lemmas pit and pet contribute a single minimal pair. At the surface-form level, however, the pit ~ pet lemma pair contributes four minimal surface-form pairs: pit ~ pet, pits ~ pets, pitting ~ petting, pitted ~ petted. (Entropy change measures are similarly influenced by the choice to represent the lexicon in terms of lemmas or surface forms.) Because the number and type of morphological paradigms that different lemmas participate in can vary widely, the contribution of a phoneme contrast to measures of contrast at the lemma level is correlated with, but distinct from, those measured at the surface-form level. We report below evidence that minimal pair counts at the lemma level are significantly more predictive of phoneme merger probability than counts at the surface-form level.
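The difference between lemma-level and surface-form counts in the pit ~ pet example can be sketched as follows. This is a minimal illustration, not the procedure used to build the database: the function name `count_minimal_pairs` is hypothetical, and the transcriptions are toy ASCII stand-ins (I for /ɪ/, E for /ε/, N for /ŋ/, @ for the suffix vowel).

```python
def count_minimal_pairs(forms, a, b):
    """Count pairs of forms that differ only in one segment, a versus b.

    forms is an iterable of space-separated phoneme strings.
    """
    inventory = {tuple(form.split()) for form in forms}
    count = 0
    for form in inventory:
        for i, segment in enumerate(form):
            if segment == a:
                # Swap a single a for b and check whether the result is a word.
                partner = form[:i] + (b,) + form[i + 1:]
                if partner in inventory:
                    count += 1
    return count

# Toy lexicon for the verbs "pit" and "pet" (assumed transcriptions).
lemmas = ["p I t", "p E t"]
surface_forms = [
    "p I t", "p I t s", "p I t I N", "p I t @ d",
    "p E t", "p E t s", "p E t I N", "p E t @ d",
]

lemma_count = count_minimal_pairs(lemmas, "I", "E")
surface_count = count_minimal_pairs(surface_forms, "I", "E")
print(lemma_count, surface_count)
```

Because substitution is only attempted in one direction (a to b), each unordered pair is counted once.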
1.3.2 Syntactic category
Local morphosyntactic context often contributes strong cues to a word’s syntactic category, e.g., whether it is a noun, verb, or adjective. As an example, although the words caught and cot are homophonous in some varieties of English, they appear in very distinct morphosyntactic contexts because one is a verb and the other a noun. Given rich evidence that context contributes to disambiguation of a wide range of linguistic category types (e.g., Levy & Jaeger, 2007; Norris et al., 2003), we expect that members of a minimal pair that are of different syntactic categories should compete less strongly on average than members of minimal pairs that share a syntactic category. Below, we report that restricting the count of minimal lemma pairs to those that share a syntactic category provides a significantly improved correlation with phoneme merger within the dataset.
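Splitting a minimal pair count by shared syntactic category might look like the sketch below. The function name `split_minimal_pairs`, the part-of-speech labels, and the toy transcriptions (A and O standing in for the cot and caught vowels) are all assumptions made for illustration; the handling of lemmas with more than one category is likewise a choice made here, not one documented in the text.

```python
from collections import defaultdict

def split_minimal_pairs(entries, a, b):
    """Return (within, between) counts of minimal pairs for the contrast a ~ b.

    entries is an iterable of (space-separated phoneme string, category).
    A pair counts as "within" when both members share a syntactic category.
    """
    categories_by_form = defaultdict(set)
    for form, category in entries:
        categories_by_form[tuple(form.split())].add(category)

    within = between = 0
    for form, categories in categories_by_form.items():
        for i, segment in enumerate(form):
            if segment != a:
                continue
            partner = form[:i] + (b,) + form[i + 1:]
            # .get avoids inserting new keys while iterating over the dict.
            for cat_a in categories:
                for cat_b in categories_by_form.get(partner, ()):
                    if cat_a == cat_b:
                        within += 1
                    else:
                        between += 1
    return within, between

# Toy lemma list (assumed transcriptions and categories).
toy_lemmas = [
    ("k A t", "noun"),    # cot
    ("k O t", "verb"),    # caught
    ("s t A k", "noun"),  # stock
    ("s t O k", "noun"),  # stalk
]

within_count, between_count = split_minimal_pairs(toy_lemmas, "A", "O")
print(within_count, between_count)
```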
1.3.3 Relative lexical token frequency
All else being equal, uncertainty in usage is expected to be greater between competitors whose frequencies are similar (Shannon, 1948). As a consequence, minimal pairs whose frequencies are similar should show a greater influence on phoneme merger probability than those whose frequencies are highly divergent. Here, we test a number of operationalizations of this prediction and report weak evidence in favor of it. In the following sections, we describe the construction of the datasets in more detail, followed by analysis and discussion.
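The information-theoretic intuition behind the frequency prediction can be made concrete with the entropy of a two-way choice. The sketch assumes that the listener already knows one of the two members of a minimal pair was intended and ignores context entirely; `pair_entropy` is a hypothetical helper and the frequencies are made up.

```python
import math

def pair_entropy(freq_a, freq_b):
    """Shannon entropy (in bits) of choosing between two words by frequency."""
    total = freq_a + freq_b
    entropy = 0.0
    for freq in (freq_a, freq_b):
        p = freq / total
        if p > 0:
            entropy -= p * math.log2(p)
    return entropy

similar = pair_entropy(100, 100)    # equal frequencies: maximal uncertainty
divergent = pair_entropy(190, 10)   # one member dominates: low uncertainty
print(round(similar, 3), round(divergent, 3))
```

Uncertainty peaks at one bit when the two frequencies are equal and falls toward zero as one member comes to dominate.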
2 Corpus study
2.1 Database
The rate of phoneme merger over the course of language change tends to be low, with the result that often only a small number of historically recent phoneme mergers are attested in related variants of any given language. Consequently, in order to obtain enough data for statistical analysis, we pooled data from multiple languages. The languages represented in the dataset are English (Received Pronunciation and Standard American), German, Dutch, French, Spanish, Turkish, Korean, and Hong Kong Cantonese. Data sources for these languages are summarized in Table 1.
Table 1. Corpora used to construct the database. RP: Received Pronunciation.
The dataset consists of 19 systems of phoneme contrasts from these nine language varieties; these groups are summarized in Table 2. Each system of phoneme contrasts consists of phonemes within a single structural class such as “vowels” or “consonants,” and the set of phoneme contrasts within each group is limited to those differing in only one phonological feature such as voice or place of articulation. Each system contains at least one phoneme contrast that has merged in some dialect of the language, as well as all other phonologically similar phoneme contrasts in that structural class. Each system, then, can be considered as a comparison of a set of phoneme contrasts that have merged and a set of structurally similar phoneme contrasts that have not. In total, the dataset contains 56 phoneme contrasts that have merged, and 524 that have not. Within mergers, 35 are consonant–consonant mergers as opposed to vowel–vowel mergers, and 30 are conditioned by phonological context, as opposed to context-free. A context-free merger eliminates a phoneme contrast from a language altogether, while a context-sensitive one eliminates a contrast only in certain phonological environments. The North American English cot ~ caught merger is an example of a context-free merger, while the pin ~ pen merger, which occurs in southern dialects of American English, merges /ɪ ~ ε/ only before nasals. As a consequence, the words pin and pen are homophonous in these dialects, while pit and pet are not. In addition to phoneme–phoneme category mergers, our dataset contains three phoneme deletions which we model as merger with zero. 3
Table 2. Phoneme pairs and known actual mergers included in the database.
Word and frequency information for each language was obtained from a phonemically-transcribed corpus. 4 No grammatical or function words were included. These corpora differ along a number of dimensions (e.g., size and source genre), as do the languages they represent (e.g., complexity of phoneme inventory, syllable structure, complexity of morphology). As described below, we used hierarchical modeling to model the differences between languages as random effects. Because these models perform “partial pooling” of results across the languages, our results should generalize better than those of simple logistic regression (Gelman & Hill, 2007) and should be fairly robust to the heterogeneity of the languages and corpora.
For all the languages in this study except for Korean and Hong Kong Cantonese, corresponding surface-form and lemmatized versions of each individual corpus were available. For the analysis comparing the role of word-category in the minimal pair count factor (described below), we used a dataset which excluded Korean and Hong Kong Cantonese so that every phoneme contrast in the dataset could be associated with both lemma- and surface form-based minimal pair counts. The Korean corpus was available in a lemmatized form, and although the Hong Kong Cantonese corpus is technically a surface-form corpus, we treated it as a lemmatized corpus because of the isolating morphology of the language. For the subsequent analysis investigating the role of minimal pair member frequency relationships, we used only lemma-based minimal pair counts allowing us to include the data from Korean and Hong Kong Cantonese. 5
2.2 Analysis
In the following analyses, we fit variations of the basic model presented in Wedel et al. (2013), which is represented schematically in equation 2, where FL represents functional load, p(Ph) is the frequency of the phonemes in a potential merger, and FL’ represents a different functional form of FL, which interacts with the effect of phoneme frequency. 6 We take this general model as a given, and explore the extent to which we can refine the functional load variable beyond a simple count of minimal surface-form pairs.
Our primary analytic tool is fitting logistic regression models to predict cases of phoneme merger. More specifically, because the data are structured hierarchically, we will use hierarchical generalized linear models (also known as logistic mixed-effects models), fitted using the lme4 package for the R statistical software (Bates, Maechler, & Bolker, 2011; Gelman & Hill, 2007; Jaeger, 2008; R Development Core Team, 2011). This approach should provide more robust generalization across languages, avoiding both the overfitting that would result from regular logistic regression and the loss of power if models were fit separately for each language in our database. In the models that follow, phoneme system is used as the grouping variable (or random effects variable). Each system (e.g., American English consonant–consonant mergers, American English vowel–vowel mergers before nasals, tone mergers in Hong Kong Cantonese) is modeled with a random intercept, allowing the models to account for the fact that the base rate of merger differs across systems. Random slope models – which additionally allow the effects of our variables of interest to also vary randomly by system – were also fit, but these models were not better-fitting than the simpler models, and did not provide any different conclusions. We therefore only report the simpler random-intercept models.
Due to their underlying similarity, different formulations of functional load are typically highly correlated, which poses a challenge for any analysis which attempts to evaluate which is a better predictor of merger. We pursue two types of analysis, and corroborating evidence is taken as initial support for choosing one predictor over another, even if the predictors are highly correlated. The two analytic techniques are (1) model comparison using model fit statistics and likelihood ratio tests of nested models, and (2) comparisons of residualized versions of the predictors.
In the first step, models with competing sets of predictors are fit, and model fit statistics (AIC and BIC) are compared. Then each competing model is compared, using a likelihood ratio test, with a “superset” model that contains the predictors of both. If these tests show that the superset model is significantly better than the model with the worse predictor, but not significantly better than the model with the better predictor, this, along with the model fit statistics, is taken as evidence that the “better” predictor is significantly better.
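The comparison logic of this first step can be sketched from fitted log-likelihoods alone. The log-likelihood values and parameter counts below are hypothetical placeholders, not values from our models, and the p-value helper covers only the one-degree-of-freedom case, where the chi-square tail probability reduces to a complementary error function.

```python
import math

def aic(loglik, n_params):
    """Akaike information criterion; lower is better."""
    return 2 * n_params - 2 * loglik

def bic(loglik, n_params, n_obs):
    """Bayesian information criterion; lower is better."""
    return n_params * math.log(n_obs) - 2 * loglik

def lrt_statistic(loglik_nested, loglik_superset):
    """Likelihood ratio statistic for a nested model against its superset."""
    return 2 * (loglik_superset - loglik_nested)

def chi2_pvalue_df1(statistic):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2))

# Hypothetical log-likelihoods: models with predictor A only, B only, and both.
loglik_a, loglik_b, loglik_both = -111.6, -113.7, -111.5
n_obs = 482  # number of phoneme contrasts in the lemma/surface-form subset

p_a_vs_both = chi2_pvalue_df1(lrt_statistic(loglik_a, loglik_both))
p_b_vs_both = chi2_pvalue_df1(lrt_statistic(loglik_b, loglik_both))

# A is preferred if adding B to it does not help, while adding A to B does.
a_preferred = p_a_vs_both >= 0.05 and p_b_vs_both < 0.05
print(aic(loglik_a, 3), aic(loglik_b, 3), a_preferred)
```

With several added parameters, the tail probability would instead come from a general chi-square distribution function in a statistics library.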
To complement this analysis, we use a residualization technique to assess the comparative effectiveness of competing predictors. For example, if comparing predictor A with predictor B, we create residualized versions by residualizing A on B and B on A, respectively. Then we fit two models, one with the observed predictor A and the residualized predictor B, and one with the observed predictor B and the residualized predictor A. The first model allows us to see whether the variance in B not predicted by A adds any significant prediction above and beyond A, and the second model allows us to investigate the converse. If the model comparison steps above selected predictor A, and we found evidence that the residualized predictor A added prediction above predictor B, but the residualized B did not add above predictor A, then we would have further evidence that predictor A is superior to predictor B.
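A minimal sketch of the residualization step follows, assuming that “residualizing A on B” means taking the residuals of an ordinary least-squares regression of A on B with an intercept; this reading, the helper name `residualize`, and the toy values are assumptions made for illustration.

```python
def residualize(target, predictor):
    """Residuals of a simple least-squares regression of target on predictor."""
    n = len(target)
    mean_x = sum(predictor) / n
    mean_y = sum(target) / n
    sxx = sum((x - mean_x) ** 2 for x in predictor)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(predictor, target))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [y - (intercept + slope * x) for x, y in zip(predictor, target)]

# Toy, highly correlated predictors (made-up values).
pred_a = [1.0, 2.0, 2.0, 4.0, 5.0, 7.0]
pred_b = [1.0, 1.0, 3.0, 4.0, 6.0, 6.0]

a_resid = residualize(pred_a, pred_b)  # part of A not linearly predicted by B
b_resid = residualize(pred_b, pred_a)  # part of B not linearly predicted by A

# One model would then use (pred_a, b_resid), the other (pred_b, a_resid).
print([round(r, 2) for r in a_resid])
```

By construction each residualized predictor is uncorrelated with the predictor it was regressed on, so its coefficient reflects only the unique variance it contributes.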
3 Results
3.1 Role of lemmas and their syntactic category in merger probability
Wedel et al. (2013) report that the more minimal surface-form pairs distinguished by a phoneme contrast, the less likely that contrast is to merge. Here we compare minimal pair counts 7 based on surface forms (which may include affixal material beyond the root morpheme), to those based on lemmas (i.e., the root morpheme alone). In addition, we ask whether shared syntactic category membership influences the effect of a minimal pair. If lexical uncertainty in usage context plays a causal role in preservation of phoneme contrasts, minimal pairs of the same syntactic category (e.g., noun–noun pairs) should contribute more to the effect because morphosyntactic context contributes less toward their disambiguation. Our results show that compared to the alternatives, the count of within-category, lemma-based minimal pairs is the most effective as a predictor of merger.
We started by taking the subset of data for which we have both lemma-based and surface-form counts of minimal pairs (all languages but Korean and Hong Kong Cantonese; see section 2.1, above). This comprised a total of 35 mergers across 482 phoneme contrasts in seven corpora. We then calculated within- and between-category minimal pair counts, for both lemmas and surface-forms. Models fit to these data revealed an interesting cross-over pattern. Using lemma-based counts, the count of within-category minimal pairs is a significant predictor, β = −3.38, z = −2.4, p = 0.016, while the count of between-category minimal pairs is not, β = 0.92, z = 0.88, p = 0.38. The converse is true when using surface-based counts: the count of within-category pairs is not a significant predictor, β = 0.29, z = 0.6, p = 0.551, but the count of between-category pairs is, β = −2.67, z = −3.06, p = 0.002.
However, fit statistics favor the model using within-category lemma-based counts as the better measure of functional load (AIC = 235.22 vs. 239.4, BIC = 260.29 vs. 264.47). Model comparison tests demonstrate that the superset model with both factors is not significantly better than the model employing within-category lemma-based counts, χ2(1) = 0.15, p = 0.699, but it is significantly better than the model using between-category surface-form counts, χ2(1) = 4.33, p = 0.037. Further, residualization of the two measures shows that residualized within-category lemma-based counts still predict merger significantly above and beyond the information provided by between-category surface-form counts, β = −1.08, z = −2.03, p = 0.043, but that residualized between-category surface-form counts do not add prediction above the information provided by within-category lemma-based counts, β = −0.16, z = −0.33, p = 0.74. In summary, both the model comparison and residualization techniques indicate that within-category lemma-based minimal pair counts are a more effective predictor of phoneme merger than between-category surface-form counts. In the remaining analyses we concentrate only on lemma-based counts of within-category minimal pairs.
3.2 Role of frequency ratio
A further prediction of a VUE model of sound change is that the relative frequencies of the members of a minimal pair should influence the degree of uncertainty between them, and thereby the functional load. Specifically, all else being equal, uncertainty is expected to be greater between minimal pair members whose frequencies are similar (Shannon, 1948). We can conceptualize this influence of relative frequencies as a weight assigned to each minimal pair corresponding to how much functional load is induced by that pair. While there are many ways to operationalize this prediction, one straightforward way is to use the ratio of frequencies as a weighting factor in the minimal pair count. To do this, we calculated, for each minimal pair, the ratio of the frequency of the lower-frequency member to the frequency of the higher-frequency member, and summed these ratios over the minimal pairs for each phoneme contrast. However, in our data, we observed an extremely high correlation of r = 0.985 (N = 580) between this weighted measure and the simple count of within-category, lemma-based minimal pairs. This high correlation is perhaps unsurprising if the distribution of minimal lemma pairs results from processes that mimic two random draws from the set of lemmas with their associated frequencies. This means that while it is relatively easy to conceptualize these weighted and unweighted minimal pair counts as independent, in natural lexicons they are very tightly coupled, and thus difficult to distinguish.
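The ratio-weighted count described above can be sketched as follows. This is a minimal Python illustration, not the authors' code: the function names and the toy frequencies are invented, and the sketch assumes that each minimal pair is represented simply by the two lemma frequencies of its members.

```python
def frequency_ratio(freq_a: float, freq_b: float) -> float:
    """Lower frequency divided by higher frequency; always in (0, 1]."""
    if freq_a <= 0 or freq_b <= 0:
        raise ValueError("frequencies must be positive")
    return min(freq_a, freq_b) / max(freq_a, freq_b)


def ratio_weighted_count(pairs) -> float:
    """Sum of frequency ratios over the minimal pairs of one phoneme contrast."""
    return sum(frequency_ratio(a, b) for a, b in pairs)


# Invented lemma frequencies for three minimal pairs of one hypothetical contrast.
toy_pairs = [(120, 100), (500, 5), (40, 40)]

simple_count = len(toy_pairs)               # unweighted minimal pair count
weighted = ratio_weighted_count(toy_pairs)  # each pair contributes at most 1

print(simple_count, round(weighted, 3))
```

Because every pair contributes a weight between 0 and 1, the weighted measure can never exceed the simple count, which is one intuitive reason the two measures track each other closely.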
Despite this very high correlation, we investigated whether model comparison could tease apart the predictive power of the ratio-weighted minimal pair counts described above from simple minimal pair counts. The model-fit statistics were extremely close (AIC = 318.85 vs. 316.9, BIC = 344.97 vs. 343.01), and neither single-measure model differed significantly from the superset model including both measures (simple minimal pair count: χ2(1) = 1.98, p = 0.16; frequency ratio: χ2(1) = 0.02, p = 0.892). As a consequence, this formulation of the predicted frequency-ratio effect does not provide evidence either way.
However, it is also plausible that the relationship between frequency ratio and functional load of a minimal pair is non-linear. Therefore, we investigated a different formulation, capturing the notion that minimal pairs with relatively similar frequencies should contribute more to functional load than those with dissimilar frequencies. For this measure, we calculated a median minimal-pair frequency ratio independently for each of the 19 phoneme systems in the dataset and then counted the number of minimal pairs for each phoneme pair whose ratio was above or below the corresponding median ratio. This provides a simple way to split the minimal pair count into two bins, one containing minimal pairs with more balanced frequencies (i.e., having a ratio closer to 1) and the other containing minimal pairs with more divergent frequencies (i.e., having a ratio closer to zero). As expected, these two measures, which we term balanced and unbalanced minimal pair counts, are also highly correlated. Nevertheless, our results provide some evidence that the balanced minimal pair count is a significantly better predictor of the probability of merger within our dataset.
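The median split described above can be sketched in Python as follows. The sketch rests on two assumptions that the text does not settle: that the median is taken over all minimal pairs pooled within one phoneme system, and that pairs whose ratio equals the median are counted as unbalanced. The function names, contrast labels, and frequencies are invented for illustration.

```python
from statistics import median


def frequency_ratio(freq_a: float, freq_b: float) -> float:
    """Lower frequency divided by higher frequency; always in (0, 1]."""
    return min(freq_a, freq_b) / max(freq_a, freq_b)


def balanced_unbalanced_counts(system):
    """Split each contrast's minimal pairs at the system-wide median ratio.

    `system` maps a contrast label to a list of (freq_a, freq_b) tuples.
    Assumption: ties with the median are assigned to the unbalanced bin.
    """
    ratios = {
        contrast: [frequency_ratio(a, b) for a, b in pairs]
        for contrast, pairs in system.items()
    }
    pooled = [r for rs in ratios.values() for r in rs]
    if not pooled:
        raise ValueError("the phoneme system has no minimal pairs")
    cutoff = median(pooled)
    return {
        contrast: {
            "balanced": sum(r > cutoff for r in rs),
            "unbalanced": sum(r <= cutoff for r in rs),
        }
        for contrast, rs in ratios.items()
    }


# One hypothetical phoneme system with two contrasts and invented frequencies.
toy_system = {
    "contrast_1": [(100, 90), (50, 5), (30, 20)],
    "contrast_2": [(200, 2), (10, 8)],
}
counts = balanced_unbalanced_counts(toy_system)
print(counts)
```

Computing the cutoff separately for each phoneme system, as the text describes, keeps the split from being dominated by corpora with unusually skewed frequency distributions.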
As in our analyses so far, the first step is model comparison. The model with the balanced minimal pair variable has somewhat better fit statistics than the model with the unbalanced minimal pair variable (AIC = 315.39 vs. 323.39, BIC = 341.5 vs. 349.51). This difference is confirmed by likelihood ratio tests, which show that the superset model with both predictors significantly outperforms the model with the unbalanced minimal pair variable, χ2(1) = 9.66, p = 0.002, but does not significantly outperform the model with the balanced minimal pair variable, χ2(1) = 1.66, p = 0.198. Finally, the comparison of residualized variables leads to the same conclusion. When the balanced minimal pair variable is residualized on the unbalanced minimal pair variable, it still acts as a significant predictor alongside the unbalanced minimal pair variable, β = −0.97, z = −2.83, p = 0.005, but the same is not true of the unbalanced minimal pair variable; when residualized on the balanced minimal pair variable, it no longer adds prediction above the balanced minimal pair variable, β = 0.37, z = 1.3, p = 0.193. In summary, both the model-comparison and residualization techniques indicate that there is enough distinct information in these highly correlated variables to suggest that the count of balanced minimal pairs (i.e., the minimal pairs where the probabilities of the members are relatively close) matters more to the prediction of merger than the unbalanced minimal pairs (i.e., minimal pairs where the members have very different probabilities).
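The residualization step used above can be illustrated with a short Python sketch. It assumes ordinary least-squares regression of one raw count on the other with an intercept; the text does not specify whether counts were transformed first, so this is an illustration of the general technique rather than the authors' procedure. The function name and the toy counts are invented.

```python
from statistics import mean


def residualize(y, x):
    """Residuals of y after a least-squares regression of y on x (with intercept).

    The residuals carry the part of y that is not linearly predictable from x.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have equal length of at least 2")
    mean_x, mean_y = mean(x), mean(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x must not be constant")
    slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    intercept = mean_y - slope * mean_x
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


# Invented, highly correlated counts for six hypothetical phoneme contrasts.
unbalanced = [0, 2, 5, 9, 14, 20]
balanced = [1, 2, 6, 8, 15, 19]

balanced_resid = residualize(balanced, unbalanced)
print([round(r, 3) for r in balanced_resid])
```

The residualized variable is by construction uncorrelated with the variable it was regressed on, so entering both in one model asks whether the balanced count adds information beyond what the unbalanced count already provides.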
While these techniques converge on the same conclusion, we should be cautious for two reasons. First, because the two measures are so highly correlated, their significant difference in predictive power could be due to a small number of data points. Second, this effect is only obtained when we make a fairly arbitrary decision in how to formulate the measure. This suggests that if the observed effect is not an artifact of some aspect of our data, there is some kind of non-linearity in the relationship between minimal pair frequency ratio and functional load. The nature of this possible non-linearity remains to be explored, and it raises additional questions that the data presented here may not be adequate to address.
In order to partially address the first concern, we can display the relationship between the fitted values of the two competing models, to get a sense of how or why the balanced minimal pair variable is outperforming the unbalanced minimal pair variable. The fitted values of each model are the estimated probability of merger, as predicted by that model. Figure 1 displays the differences between the fitted values on the y-axis, plotted as a function of the values predicted by the balanced minimal pair model. Points farther to the right indicate phoneme contrasts predicted to have high merger probability by the balanced model. Greater vertical distance above the line indicating y = 0 indicates that the balanced model predicts greater probability of merger than the unbalanced model, and conversely, greater distance below the line indicates that the balanced model predicts lower probability of merger than the unbalanced model.

Figure 1. Differences of fitted values between two models.
To the extent that the superiority of the balanced model is broadly based in the data, we expect a consistent divergence between the two models particularly where the balanced model makes its most extreme predictions. This appears to be the case, illustrated especially by the highlighted areas in Figure 1. Where the balanced model more strongly predicts non-merger (corresponding to low values along the x-axis), most points represent non-merged phoneme contrasts, and their position below the dotted line indicates that the balanced model predicts non-merger more strongly. Similarly, where the balanced model more strongly predicts merger, most points do represent mergers and most tend to be above the line, indicating that for these phoneme contrasts the balanced model predicts merger more strongly than the unbalanced model. In summary, although the differences in predictive power are subtle, the model that puts special emphasis on “balanced” minimal pairs – those minimal pairs whose words are close in their lexical frequencies – produces consistently better predictions.
We can use these fitted values to find example cases for which the balanced minimal pair model is doing a better job. The shaded areas in Figure 1 show quadrants where the balanced minimal pair model is clearly outperforming the alternative, as described above. These regions cover six cases of merger and eight cases of non-merger, which are given as examples in Table 3. Perhaps the most important point to note is that the model makes these stronger predictions for a range of languages and phoneme classes in the dataset, suggesting that its performance is not an artifact of including any particular language.
Table 3. Selected examples predicted especially well by the balanced minimal pair model.
Am.: Standard American; RP: Received Pronunciation.
3.3 Summary of final model
We have presented arguments for a more specific formulation of the functional load variable, namely a lemma-based count of minimal pairs, where the minimal pair members are of the same syntactic category and have a frequency ratio above the median for the phoneme system. Model comparison and residualization techniques provided statistical evidence that this formulation is a more effective predictor of merger than other reasonable alternatives that we tested. In this section, we provide a statistical and graphical summary of the full model. Following the initial results described by Wedel et al. (2013), the model consists of a variable representing functional load, and a variable representing segmental (token) frequency which interacts with a dichotomized functional load variable. In Table 4 we give the coefficients for the parameters in the final model. The pattern of effects matches that found in Wedel et al. (2013). First, where there are any (within-category) minimal pairs, there is a significant effect of functional load, here operationalized as the number of within-category minimal pairs that have relatively balanced probability ratios (i.e., relatively close to 1), compared to the median frequency ratio value for the phoneme system. This effect is in the expected direction, such that the greater the functional load, the less likely merger is to occur. Second, for phoneme contrasts with minimal pairs, the effect of segment frequency is non-significant. Third, for phoneme contrasts with no minimal pairs, the effect is significantly more positive, such that greater segment frequency leads to greater probability of merger.
Table 4. Fixed-effect parameters for final model.
While this model provides significant prediction, a great many other factors contribute to actual patterns of sound change (see e.g., Blevins, 2004; Hay & Drager, 2007; Labov, 1994, 2001). Nonetheless, inspection of the distribution of predicted values (i.e., fitted values) of merger probability separately for actual mergers and non-mergers shows that despite the extreme simplicity of this model, separation is fairly good (Figure 2). Predicted probabilities of merger somewhere above 0.30 appear to characterize mergers almost exclusively, and the majority of non-mergers are assigned merger probabilities of 0.10 or lower.

Figure 2. Distribution of predicted probabilities for final model, with the position of several American English vowel contrasts marked.
4 Discussion
4.1 Functional load and traditional generative phonology
Within a dataset of attested phoneme contrast mergers and similar, non-merged phoneme pairs, we have shown that the likelihood of merger is significantly, inversely correlated with the number of minimal lemma pairs defined by that contrast. Furthermore, the evidence indicates that this effect cannot be explained solely as a phoneme-level phenomenon (driven, for example, only by phoneme frequency).
This result stands in striking contrast to the approach of traditional generative phonology, which makes a sharp distinction between possible words (knowledge of which is part of a speaker’s phonological grammar) and actual words (which are an accident of the lexicon). Theories that explicitly aim to handle contrast, such as Dispersion Theory (Campos-Astorkiza, 2007; Flemming, 2002; Ní Chiosáin & Padgett, 2010; Padgett, 2009) and various implementations of underspecification (e.g., Hall, 2011; Mester & Ito, 1989), deal with the level of the segment or, at most, idealized possible words.
This focus on possible as opposed to actual words derives from the fact that given the structure of traditional models, giving the phonological grammar access to the lexicon predicts unattested patterns (Padgett, 2003, pp. 78–79). However, our results indicate that the functional load effect depends on contrast between actual lexical items, which is precisely what traditional generative phonology rejects. We emphasize that it would not be possible to capture this effect by making a simple modification to the machinery of a typical generative theory – for example, by “decorating” a language’s phoneme inventory with information about the frequency of each segment.
4.2 Functional load and variationist/usage-based/evolutionary models
In contrast to traditional models, a central feature of VUE models is the assumption of a causal chain linking properties of individual usage events to long-term change in the abstract, linguistic category system of a speech community (Beckner et al., 2009; Blevins, 2004; Blevins & Wedel, 2009; Bybee, 2001; Christiansen & Chater, 2008; Hay & Maclagan, 2012; Kirby, 1999; Labov, 1994; Ohala, 1989; Pierrehumbert, 2001, 2003; Wedel, 2007). This approach predicts that any consistent bias in utterances that can be perceived and reproduced by language users can in principle influence the trajectory of language change. The functional load hypothesis represents a potentially fruitful place to look for evidence for or against such a long-range connection, because it explicitly relates change in a phonological system to actual language usage events. Starting from the finding of Wedel et al. (2013) that a phoneme contrast is more likely to merge if it distinguishes fewer surface-form minimal pairs, we have tested three finer-grained hypotheses here.
4.2.1 Lemma versus surface-form levels
Both VUE models and the UID hypothesis (Jaeger, 2006, 2010; see section 1) are consistent with the idea that a phoneme contrast maintenance effect may show a particularly strong association with the number of lexical (i.e., meaning-associated) categories distinguished by that contrast in usage. To refine our understanding of what level of lexical category may be most important, we compared surface-form minimal pair counts, which are often inflated by association with additional affixal material, and lemmas, which are a more direct measure of the contrast between root morphemes. We found that lemma-based measures were significantly more predictive, consistent with recent findings that fine pronunciation details of roots in English are better predicted by aggregate properties of all surface-forms a root appears in, rather than the properties of individual surface forms alone (Berlove et al., 2012).
4.2.2 Effect of syntactic category
We found that merger is less likely when the phonemes in question are distinguished by minimal lemma pairs of the same syntactic category, while minimal pairs of different syntactic categories are less predictive. As an approximation of one kind of contextual disambiguation, this effect of syntactic category is consistent with the predictions of VUE models: minimal pairs matter to the extent that they are likely to be confused in actual speech events. Because morphosyntactic context contributes disambiguating information to words of different categories, they are inherently less confusable than words of the same syntactic category.
4.2.3 Effect of frequency ratio
The relative frequency of within-category, lemma-based minimal pair members was also found to be a significant (albeit weak) predictor of merger, consistent with prediction. To the extent that predictability of words in context is related to their unigram frequency of occurrence, the uncertainty between minimal pair members should be on average greater when their unigram frequencies are similar. If, as predicted by VUE models and the UID hypothesis, cue-enhancement is related to the degree of uncertainty (Shannon, 1948), minimal pairs for which the members are more similar in frequency should contribute more to the inhibition of merger. Several exemplar-based models of phonetic category merger also show more probable and more rapid merger when phonetically adjacent categories are produced with divergent frequencies (Pierrehumbert, 2001; Wedel, 2012).
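The link between frequency similarity and uncertainty can be made concrete with a small Python sketch of binary Shannon entropy. This is an illustration of the general information-theoretic point only, not a measure used in the analyses above: it treats the two members of a minimal pair as the only candidate words and ignores context, and the function name and frequencies are invented.

```python
import math


def pair_entropy(freq_a: float, freq_b: float) -> float:
    """Shannon entropy (in bits) of choosing between two words in
    proportion to their frequencies; ranges from 0 to 1."""
    if freq_a < 0 or freq_b < 0 or freq_a + freq_b == 0:
        raise ValueError("frequencies must be non-negative and not both zero")
    p = freq_a / (freq_a + freq_b)
    # Terms with zero probability contribute nothing to the entropy.
    return -sum(q * math.log2(q) for q in (p, 1.0 - p) if q > 0)


# Invented frequency pairs, from perfectly balanced to highly divergent.
for freqs in [(50, 50), (60, 40), (90, 10), (99, 1)]:
    print(freqs, round(pair_entropy(*freqs), 3))
```

Entropy is maximal when the two frequencies are equal and falls toward zero as they diverge, which is the sense in which balanced minimal pairs are expected to carry more uncertainty.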
We note that we did not find any measure based on absolute word frequency that was additionally predictive of merger probability within this dataset. There are a number of plausible, non-exclusive explanations for this. First, the dataset is based on modern corpora which will fail in any number of ways to accurately reflect the frequency relationships within the language during the inception and progress of the merger. We expect that the frequency relationships within the corpus are likely to be less reflective of the relevant state of the language than the more categorical measures of word-existence and word-category that are reflected in minimal pair counts. Second, unigram frequency itself may simply be a relatively poor predictor of uncertainty relevant to phonetic cue enhancement processes (see e.g., Piantadosi et al., 2011).
5 Conclusions and new directions
In addition to providing statistical evidence for a link between the probability of phoneme contrast loss and the role of phoneme contrast in information transmission in language use, this study provides some guidance for future work on the topic. Where previous literature has operationalized the notion of functional load in a wide variety of ways, our data suggest that some measures capture the relevant phenomena better than others. In particular, lemma-based counts are more predictive of merger than surface-form-based counts, which include affixal material, and local measures of such minimal pair counts and their relative frequencies are more predictive than more global measures such as system entropy. These findings suggest three clear avenues for further research: (1) enlarging the database to improve our ability to separate highly correlated variables, in particular by including more languages with different properties and from different language families; (2) exploring more direct methods to measure the uncertainty between minimal pair members, for example by comparing the n-gram conditional probabilities of minimal pair members in speech corpora; and (3) expanding the database to include phoneme pairs that show signs of active resistance to merger, through participation in chain-shifts or phoneme splits.
However, despite the simplicity of this model, it may already contribute to our understanding of certain phonological patterns. As a matter of expositional convenience, up to this point we have primarily framed the model in terms of predicting the probability of phoneme contrast loss, but it can just as well be thought of as making predictions about the probability of phoneme contrast preservation. As an example, consider the unusually crowded high/mid/front vowel space of American English, containing the contrastive vowels /i ~ ɪ ~ ε ~ e/. Within a genetically- and areally-balanced sample of 628 languages (Mielke, 2008), only 6% of the languages’ vowel inventories include this set of four vowels. As a comparison, 69% of the languages’ vowel inventories include the more dispersed high/mid set /i ~ e ~ u ~ o/. Given the relative typological rarity of /i ~ ɪ ~ ε ~ e/ as a set, and the observation that they are more confusable with each other than with other vowels in the American English system (Hillenbrand, Getty, Clark, & Wheeler, 1995), we might expect that some of these contrasts would fully merge in some native English-speaking speech communities. Instead, these vowel contrasts often chain-shift with respect to each other in ways that preserve their distinctiveness (see Labov, 1994; Maclagan & Hay, 2007 for examples). In Figure 2 we compare model predictions for mergers and non-mergers, indicating the model predictions for context-free mergers of the /i ~ ɪ/, /ɪ ~ ε/ and /ε ~ e/ vowel contrasts, with the actually merged /ɑ ~ ɔ/ contrast included for comparison. We see that the model makes its strongest possible predictions that the phonetically similar high/mid/front vowel pairs in American English will not merge across the board, while being essentially agnostic about the /ɑ ~ ɔ/ contrast.
This supports the hypothesis that the failure to find context-free mergers of these high/mid/front vowel contrasts in dialects of English is related to their high functional load (e.g., Labov, 1994; Maclagan & Hay, 2007). This is consistent with the model described in Wedel (2012), which proposes a mechanism for a high functional load phoneme contrast to promote chain-shifting or phoneme-splitting over merger. Finally, in section 1 we noted that in addition to biases that are likely to be common among all humans, VUE models predict that idiosyncratic structural properties of languages should also play a significant role in shaping pathways of language change. The evidence we report here is consistent with this prediction, suggesting that the particular distribution of phonemes across the actual lexicon of a language must be taken into account in understanding the evolution of its phoneme inventory.
Acknowledgements
The authors would like to thank T. Florian Jaeger and Jared Linck for useful discussion. Any remaining errors are the sole responsibility of the authors.
Conflict of interest
The authors declare that there is no conflict of interest.
Funding
This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.
