Abstract
Starting from the assumption that grammaticalization is rooted in situated language use, the present study tests the connection between functional reanalysis and formal reduction with a synchronic approach. It investigates a case of potential (but not actuated) grammaticalization in Present-Day English, the use of epistemic phrases of the type it could/might be (that), which can serve an adverbial function and undergo formal reduction in analogy to maybe. These phrases are analyzed in British English (spoken data and informal writing) for their syntactic complementation and for omission of the expletive subject it. The results show that omission rates are overall higher in “critical contexts,” that is, where the item is structurally ambiguous between a clause and an adverbial, though other usage types, such as idioms, may promote it-omission too. The findings suggest that formal reduction (it-omission) is connected to incipient/potential grammaticalization (critical contexts) even in the absence of a diachronic grammaticalization process. Thus, they provide evidence that the oft-observed correspondence between functional and formal changes emerges immediately in synchronic language use. A possible interpretation is that certain linguistic elements have a base potential for being put to more grammatical uses; while these uses need not initiate change, speakers tend to adapt the form to its function.
1. Introduction
This study investigates usage patterns of the epistemic phrases (it) could be and (it) might be within the frame of grammaticalization. It explores the idea that grammaticalization processes emerge from cognitive-communicative preferences, which lead to the connection of semantic, functional, and formal changes. As these preferences are assumed to be of a general nature, they should be observable synchronically, even when no diachronic progress of grammaticalization is attested. Testing this hypothesis, the main research question is whether contexts that promote the reanalysis of an epistemic phrase as a sentence adverb also favor formal reduction as in the omission of the expletive pronoun it (as in 1).
(1) Could be Frankie and Johnny were lovers. (Spoken BNC2014, ST8H 1468)
Grammaticalization has typically been observed in historical retrospect, in tracing back sources and developments of grammatical markers such as determiners, auxiliaries, and inflectional affixes. The outcome is known, and its causes are investigated. This perspective presents grammaticalization as a process of systemic language change, often unfolding over centuries and affecting the grammatical configuration of a language, for example, molding analytic structures into synthetic forms (e.g., Givón 1979; DeLancey 2011; Lehmann [1982]2015:11-12). On the other hand, such changes can only emerge and proceed through individual usage events, which involve individual speakers at a given place and time—speakers who use their language to communicate in the here-and-now, with no consideration for its broader diachronic trends (cf. Fischer 2010a:182). We have accounts, for example, of how ambiguous contexts can invite new interpretations and give rise to semantic or structural reanalysis (e.g., Traugott 1989; Danchev & Kytö 1994; Fischer 1994; Diewald 2002; Breban & Gentens 2016). Thus, Hopper and Traugott (2003:2) state that grammaticalization emerges from “fluid patterns of language use.” Lehmann (2017:34) points out that grammaticalization is not “purely diachronic” but a “process operative in linguistic activity.” These statements align with current usage-based theories of language that see grammar as based on general cognitive principles and constantly shaped by usage (e.g., Diessel 2019), so that “[l]anguage change on the historical scale is the emergent outcome of speaker behavior in the here-and-now” (Hilpert 2019:108).
A “synchronic grammaticalization” approach can compare the status of linguistic items on a grammaticalization scale (Brems 2003; Correia Saavedra 2021), and it can describe how the same item can occur in different contexts with varying degrees of grammaticalization (Brems 2003; Robert 2005; Van Bogaert 2011). It is in the second sense that the present study is concerned with synchronic grammaticalization. More specifically, it pursues the idea of “potential grammaticalization.” If long-term developments are propelled through fluid patterns of language use, then likewise these patterns of language use bear the seed of change. Thus, the potential for change in linguistic conventions is connected to cognitive mechanisms at work in the individual speaker: “As soon as two words can be put together and used in context the potential exists for conventionalization of word order and the automatization, habituation, and categorization that go into creating grammatical morphemes and constructions” (Bybee 2010:119). This potential can be explored in synchronic usage data.
When grammaticalization (as a long-term change) is observed, it follows recurrent pathways and patterns, and appears to be directional (e.g., Bybee, Perkins & Pagliuca 1994; Heine & Kuteva 2002; Haspelmath 2004; Kuteva et al. 2019). Moreover, grammaticalization often combines developments on the semantic, structural, and phonological levels (Lehmann 2004; Wiemer 2014). 1 If these patterns of change are emergent from individual usage events, and if the sum of such events produces the directional pathways observed diachronically, then it should be possible to detect preferences in synchronic language use that are in line with grammaticalization. This leads to the general hypothesis that, where there is potential for grammaticalization, spontaneous innovations that constitute a step in a grammaticalization process are relatively more likely than those that do not. Whether these innovations are further propagated and proceed to become actual cases of language change is outside the scope of the hypothesis.
The present paper explores this idea with the epistemic expressions (it) could be (that) and (it) might be (that). Syntactically, these constitute a matrix clause with an expletive subject it and a that-clause complement. Functionally, they fit into a class of constructions that has been labeled “epistemic phrases” (Thompson & Mulac 1991) and which also includes, for example, I think, I believe, I’m sure, and it seems. These phrases take scope over a proposition which is expressed in the complement clause. On the other hand, they can also take a less compositional role as a syntactically independent “parenthetical clause” (López-Couso & Méndez-Naya 2014a:293-295). Importantly, they can be used “like disjunct adverbials conveying secondary information” (Kaltenböck 2013:287), thus serving the function of epistemic adverbials. Thompson and Mulac (1991) suggest that they undergo a development of increasing syntactic autonomy, from taking a that-clause complement to taking a bare complement clause, and on to filling other positions in the sentence (illustrated in 2-4, from Thompson & Mulac 1991:313).
(2) I think that we’re definitely moving towards being more technological.
(3) I think ∅ exercise is really beneficial, to anybody.
(4) It’s just your point of view you know what you like to do in your spare time I think.
Similar developments have been shown for I’m sure (Kaatari 2018), it seems (López-Couso & Méndez-Naya 2014a), it looks like (López-Couso & Méndez-Naya 2014b), (the) chances/odds are (López-Couso & Méndez-Naya 2021), and to varying extents for “complement taking mental predicates” (I think, I suppose, I believe, etc.; Van Bogaert 2011).
These items undergo steps of grammaticalization when they take on an adverbial function and lose their status as a matrix clause. 2 This structural reanalysis is context-dependent as it requires the item to modify a main clause and take semantic scope over a proposition. Scope expansion opens up more discourse-oriented interpretations, according to Narrog and Heine (2021:102), “as it signifies a move from propositional content, concerned with the description of the event and its participants towards categories which operate on the propositional content, and are deictic of the speaker and the speech situation.” Similarly, Brinton and Traugott (2005:136) classify these items as grammaticalized “phrasal discourse markers” exhibiting decategorialization “from a full complement-taking verb construction to an invariable particle-like form (a shift from major (open) > minor (closed) class)” (Brinton & Traugott 2005:138). Sentence adverbials generally have a modalizing function by referring to the status of a proposition (e.g., as being possible, necessary, evident, etc.; cf. Palmer 2001:1; Depraetere & Reed 2006:269; Hasselgård 2010:50-51). In this sense they clearly have a more grammatical status than, say, a subject-verb clause I think.
In what follows, the phrases (it) could be and (it) might be are analyzed in view of their potential for adverbialization, which is regarded as a subtype of grammaticalization. Like the other items above, they are sometimes used in the way of sentence adverbials (see example 1 above), as will be detailed in the following sections. These cases are described as uses in grammaticalizing contexts, but they do not imply that the forms are grammaticalizing in the sense of undergoing change. Rather, they show the forms’ potential for grammaticalization, in the variation of usage patterns that make them appear as more grammatical or less so.
The following sections present two corpus studies of these items. Section 2 explains the research aim in light of the development of the adverb maybe and the relation to other epistemic phrases. Sections 3 and 4 present a case study on transcripts from spoken British English, including an overview of pronoun omission (section 4.1), a scheme for the classification of tokens (section 4.2), and the results (section 4.3). Examining spoken data is important because spontaneous speech is most open to innovative linguistic forms. In section 5, data from an informal written register (blogs) are drawn on for comparison with a larger database. Being informal and unedited, these texts are relatively close to spoken registers, but reflect the more rigid syntactic structuring of written language. Section 6 provides a broader, quantitative summary of both studies. Section 7 presents a discussion of the findings, and section 8 wraps up the main points.
2. Background
Investigating the usage contexts of could be and might be, this study tests the connection of semantic, functional, and formal aspects. In grammaticalization processes, a (context-dependent) change towards a more procedural meaning tends to evoke a change in syntactic status, coalescence, and erosion of form. Can such a tendency also be detected in synchronic usage, when the changes are incipient at best?
The most striking precedent case for potential changes in could be and might be is that of maybe. Maybe is an “epistemic stance adverbial” expressing a degree of doubt or certainty toward a proposition (Biber, Johansson, Leech, Conrad & Finegan 1999:845, 868). Historically, it has developed out of the epistemic phrase it may be (that). Its process of adverbialization falls within the scope of grammaticalization because it involves the expression of a grammatical category (epistemic modality), structural reanalysis, and formal coalescence. López-Couso and Méndez-Naya (2016) have traced this adverbialization from Middle English to the eighteenth century by observing changes in complementation patterns. Their account runs as follows. In Early Modern English, a matrix clause it may be with a complement clause marked by that or zero, and in which the subject it is expletive, is reanalyzed as a syntactically autonomous “bare parenthetical,” which can occur in non-initial positions. With this change, the sequence it may be becomes fixed and set off from semantically similar alternatives such as it might be or the partially adverbialized, now obsolete may-fall, may chance; viz. the grammaticalization principle of “specialization” (Hopper 1991:25-26). With the expletive subject omitted, may be becomes a sentence adverbial and finally (in the eighteenth century) fuses into the monomorphemic adverb maybe (viz., the principle of “decategorialization”; Hopper 1991:30-31). A possible supporting construction in this last step is “it may be + phrasal constituent” (e.g., it may be two days), which occurs frequently in seventeenth-century data.
Taking the development of maybe as a template, we would expect the following adverbialization cline for could be (and likewise for might be):
It could be that this is correct. > It could be this is correct. > This is correct, it could be. > Could be this is correct / This is correct, could be.
This possible cline makes it seem as though omission of the pronoun it is what characterizes adverbial status. However, in spoken registers, subject pronoun omission is possible at every stage, even when there is a referential subject (see also section 4.1). Therefore, pronoun omission is to be treated as a separate variable. In a grammaticalization account, pronoun omission is “morphological erosion” (Heine & Kuteva 2007:43), producing a reduced and more opaque form. We would therefore predict that it correlates with more adverbial-like complementation patterns. This question is tackled empirically in the next sections.
Another question arises from the analogy between the constructions. As could, might, and may all function as epistemic modals, the epistemic phrases it could be, it might be, and it may be are clearly related to one another, perhaps as members of a superordinate constructional schema in a Construction Grammar frame. This would mean that any concrete instance also activates the schematic representation it
3. Material and Methodology
This corpus study comprises transcript data of spoken British English from the spoken component of the British National Corpus (Spoken BNC1994, BNC Consortium 2007) and the Spoken BNC2014 (Love, Dembry, Hardie, Brezina & McEnery 2017). The Spoken BNC1994 is divided up into the “context-governed” and the “conversation” sections, while the Spoken BNC2014 is sampled from conversations only. With data from the early 1990s and 2012-2016, respectively, the two corpora lend themselves to comparing data from these two points in time, and this will be done here as well. However, the main goal of the study is not to trace or detect a diachronic development but to explore the (synchronic) correlation of syntactic embedding and form.
Initially, all tokens of the strings could be and might be were extracted from the corpora. It was necessary to disregard the surrounding elements in the first, wider search, as especially omitted subjects might otherwise go undetected. In (5), for example, the subject is not six hundred quid but expletive and omitted. Similarly, not all instances of the surface string it could/might be have the pronoun it as subject, as in (6), where the subject is the noun phrase the rest of it.
(5) [. . .] so he reckons about six hundred quid could be worse but like also sucks just to have to give it to her (Spoken BNC2014 SXCB 119)
(6) actually I think the rest of it might be mine if that’s okay (Spoken BNC2014 SM6B 1364)
The resulting sets were then narrowed down to items that match the structure of an epistemic phrase, that is, those in which the subject is either it or omitted (see examples 7-10 below). Table 1 presents an overview of the data. 3 All data annotation was done manually; the quantitative analyses were carried out in R (R Core Team 2019).
Table 1. Token Numbers of Could/Might Be in the Spoken BNC Corpora (Raw Figures)
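The wide first search described above can be illustrated with a minimal sketch. It assumes that each utterance is available as a plain string; the function name `extract_tokens` and the record fields are hypothetical and not part of the study's actual workflow, in which all annotation was done manually. The sketch only retrieves candidate strings and records the preceding word as a cue for the annotator.

```python
import re

# Matches the bare strings "could be" / "might be" regardless of what
# surrounds them, mirroring the wide first search (hypothetical helper).
PATTERN = re.compile(r"\b(could|might)\s+be\b", re.IGNORECASE)


def extract_tokens(utterance):
    """Return one record per hit, with the left context kept for manual coding."""
    hits = []
    for match in PATTERN.finditer(utterance):
        left_context = utterance[:match.start()].split()
        previous_word = left_context[-1].lower() if left_context else ""
        hits.append({
            "modal": match.group(1).lower(),
            # Only a surface cue: "it" before the modal need not be the subject.
            "preceded_by_it": previous_word == "it",
            "left_context": " ".join(left_context[-5:]),
        })
    return hits
```

Applied to example (6), the sketch reports a preceding it although the subject is the rest of it, and for example (5) it cannot tell that the subject is an omitted expletive. This is why a surface cue of this kind can only pre-sort candidates and cannot replace the manual narrowing step.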
4. Analysis of Data From Spoken Corpora
Focusing on the strings it/∅ could/might be, I treat the presence or omission of the subject pronoun it as a dependent variable whose relation to other factors is to be investigated. These are the referentiality of the subject (section 4.1) and, for expletive reference, the complement of the phrase (sections 4.2 and 4.3).
4.1. Subject Omission
As the subject pronoun (it) in an epistemic phrase it could/might be is an expletive, that is, semantically “empty,” the reference of the subject (whether overtly present or omitted) was annotated as
(7) But you’ve got to find the place there it could be at the other end a mile away couldn’t it? (BNC1994 S_conv KP1)
(8) well Norwich should go up really but I got an idea it could be a Norwich Ipswich fi- play off final (Spoken BNC2014 S6YA 1430)
Subject omission occurs both with referential subjects and with expletives (as in 9 and 10), but is more likely in the latter. This holds more clearly for (it) could be than for (it) might be across all three subcorpora, as shown in Figures 1 and 2 (compare the χ² effect sizes V, which are larger for could be throughout). While there is no observable diachronic change, the conversation data (from both 1994 and 2014) show a higher prevalence of omission than the context-governed set but no clearly larger difference between referential and expletive subjects (as read from the χ² heterogeneity tests in Figures 1 and 2). Overall, could be has higher omission rates than might be, but the distributional differences between referential and expletive subjects are similar.
(9) he’s got this mouse somewhere (.) could be under one of the floorboards (Spoken BNC2014 SXDQ 279)
(10) could be that he’s not stupid it’s just it’s just writing quickly (Spoken BNC2014 SEGU 1561)

Figure 1. (It) Could Be by Subject Reference in the Spoken BNC Corpora

Figure 2. (It) Might Be by Subject Reference in the Spoken BNC Corpora
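For readers who wish to see how an effect size of this kind is obtained, the following is a minimal sketch. It assumes that V denotes Cramér's V and that the test is a Pearson χ² test without continuity correction; neither detail is stated in the text, and the original analyses were carried out in R. The function name is hypothetical, and the table is expected as rows of raw counts (e.g., referential versus expletive subjects by overt versus omitted it).

```python
import math


def chi_square_and_cramers_v(table):
    """Pearson chi-square (no continuity correction) and Cramér's V for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    # V scales chi-square by sample size and the smaller table dimension.
    k = min(len(table), len(table[0]))
    v = math.sqrt(chi2 / (n * (k - 1)))
    return chi2, v
```

Because V is normalized for sample size, it allows the strength of the association to be compared across subcorpora of different sizes, which is how the comparison between could be and might be is used here.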
This overview already shows that formal reduction in the form of pronoun omission correlates with the more procedural meaning of non-referential epistemic phrases. However, grammaticalization need not be the causal factor. As omitting a semantically empty “dummy subject” comes with no loss of information, it is arguably more likely to occur than the omission of a referential subject, with or without grammaticalization. The correlation of more grammaticalized uses and reduction can be tested further within the set of items with expletive subject reference. In the next step, this is done by analyzing the complementation patterns of these items.
4.2. Complementation Patterns
Grammaticalization is context-dependent—it proceeds through contexts that invite a new interpretation of a construction’s meaning or structural properties. At the early stage, these are “bridging contexts” (Heine 2002) or “critical contexts” (Diewald 2002, 2006) that create an ambiguity between the “old” lexical/propositional status and a grammaticalized/procedural reading. In the case of adverbialization, a precondition for this is an epistemic meaning with a non-referential subject (which as such may qualify as an “untypical context” in Diewald’s [2002] terms). Critical contexts then allow for or promote the reading of, for example, (it) could be as a sentence adverbial rather than a compositional phrase. In the attested adverbialization of maybe, López-Couso and Méndez-Naya (2016) identify a number of complementation patterns through which the development proceeds in historical written data. Based on their analysis, but geared toward present-day spoken English, tokens of epistemic (it) could/might be are categorized into the following patterns. They are combined into groups from the least to the most grammaticalized usage.
A. Extraposition: (it) could/might be + X + complement clause, where X can be an adjective, noun, or other element complemented by an extraposed clause (Biber, Johansson, Leech, Conrad & Finegan 1999:155). These patterns are illustrated in (11)-(14). 4 While there is a proposition in the complement clause, the structure is equivocal in several ways. The pronoun it can be analyzed as purely expletive or as making cataphoric reference to the complement clause; be has immediate semantic scope only over X (such as possible in 11), and the epistemic meaning does not clearly extend to the proposition of the complement clause.
(11) it could be possible that we’re just a creation of something else that has been created by something or whatever (Spoken BNC2014 S8Q3 149)
(12) (.) could be pretty er (.) interesting to have [S0337:] two dogs [S0339:] and all the family (Spoken BNC2014 S67P 320)
(13) I wonder if it might be better to come out for a walk like at five in the morning or something (Spoken BNC2014 SYU7 580)
(14) I mean er I can see there was so might be useful to know there’s a new member of staff (BNC1994 S_meeting JA9)
A special case within this pattern is the structure (it) could/might be Vpassive that, as in (15) and (16). Here, be is an auxiliary, and the epistemic meaning is compromised by a possible generic deontic reading (e.g., “one has the possibility of saying X” in 15) and may have a hedging function (cf. Larsson 2017:61). These cases are included in pattern (A) based on surface structure, but a subdivision is made where necessary into (A1) for the type in (11)-(14) and (A2) for passive complements, as in (15) and (16).
(15) Now it could be said of course that we don’t offer the same kind of very intense opportunities that are on offer to undergraduates, but [. . .] (BNC1994 S_brdcast_discussn KRH)
(16) And you’re quite right, I mean it might be argued that there are big issues in local government still (BNC1994 S_unclassified F7T)
B. Idiomatic: (it) could/might be worse/better, (it) could/might be me. These are idiomatic, fixed phrases with specific (though transparent) meanings, as in (17)-(19). The complement (worse, better, me) is part of the idiom and the epistemic phrase does not take semantic scope over a proposition.
(17) yeah it could be worse at least you don’t work in a toilet (Spoken BNC2014 SUVQ 7283)
(18) could be worse you could have sand in your vagina (Spoken BNC2014 S8K6 561)
(19) And, that again, might be me but many of you when you’ve heard me say it in a service I’ve ended up with my asking at the end of a sermon asking the congregation to smile (BNC1994 S_conv KB0)
Patterns (A) and (B) have expletive subjects but their scope and embedding show no signs of a shift towards adverbial status. They are therefore classified as non-grammaticalizing contexts.
C. that-clause complement: This pattern shows scope expansion, taking scope over a proposition (as an adverbial would), but syntactically (it) could/might be is a matrix clause and the that-clause is a complement to the existential verb be. This pattern is illustrated in (20)-(23).
(20) you do know quite a lot so it could be that you might come across as slightly challenging (Spoken BNC2014 S8CW 457)
(21) could be that he’s not stupid it’s just it’s just writing quickly (Spoken BNC2014 SEGU 1561)
(22) Er and it might be that er there’s a space there where a caravan looks as though it used to stand. (BNC1994 S_meeting GY4)
(23) not a question really, but might be that some of these other ones are pretty nice as well (BNC1994 S_unclassified S8H)
D. (it) could/might be + phrasal constituent: (it) could/might be precedes a phrase denoting time, distance or quantity, such that it is expletive and an adverb-like reading seems possible; however, the epistemic phrase has narrow scope (over a phrase rather than a proposition) and the phrasal reading is foregrounded, as shown in (24)-(27). López-Couso and Méndez-Naya (2016:172-173) see this type as a “supporting construction” that contributes to the adverbialization of maybe, while the type itself does not instantiate adverbial use.
(24) body can’t take it (.) three litres of vodka a day [. . .] or it could it could be twenty-four hours but still (Spoken BNC2014 SJG5 664)
(25) I mean divers the deep sea divers they’re down there for (.) you know could be for a long time so that’s (.) yeah (Spoken BNC2014 ST57 954)
(26) but I got it cos it was one pound and thought [S0324:] that’s good one pound [S0325:] yeah I think it might be one pound thirty (Spoken BNC2014 SCKW 354)
(27) I’ve got about enough for about two days, might be three (BNC1994 S_conv KBB)
With these characteristics, patterns (C) and (D) mark no clear move toward an adverb reading and are seen as a “pre-stage” of grammaticalization. They do not constitute grammaticalizing contexts but may open the gate toward more adverbial-like uses.
E. Bare clause complement: (it) could/might be takes semantic scope over the proposition of a bare clause, as shown in (28)-(31). This pattern is structurally ambiguous between a matrix clause with the complementizer that omitted and a sentence-initial adverbial. It has been suggested as the major catalyst of adverbialization in the cases of maybe (López-Couso & Méndez-Naya 2016), I think (Thompson & Mulac 1991), and I’m sure (Kaatari 2018).
(28) and it could be somebody is interested in how the heck did you become a u- a lecturer and what’s your background? (Spoken BNC2014 SGSY 248)
(29) could be all photos of her are deleted [S0587:] I think they still remain on Facebook (Spoken BNC2014 SZAK 540)
(30) Or it might be your neighbours have become victims of crime and you suddenly see it could happen to you. (BNC1994 S_unclassified KNF)
(31) should I do it for you? might be there’s a knack to this (Spoken BNC2014 SUAB 95)
F. Non-initial parenthetical: (it) could/might be occurs as a bare phrase in clause-medial or clause-final position. This pattern is illustrated in (32)-(35). Studies of other epistemic phrases (I think, I’m sure) have taken this pattern to be derived from the clause-initial bare phrase, i.e., pattern (E) (Thompson & Mulac 1991; Kaatari & Larsson 2019). Its syntactic mobility then speaks to its adverbial status. On the other hand, structures like this can also emerge by instantaneous “co-optation,” by which the phrase would be interpolated as a disjunct element (Heine, Kaltenböck, Kuteva & Long 2017). Co-optation can apply to any piece of structure and as such is not tied to grammaticalization. Nonetheless, the repeated occurrence of a phrase in a non-canonical position will strengthen its desemanticization and internal decategorialization (Heine, Kaltenböck, Kuteva & Long 2021:36). Thus, this pattern is not unambiguous, but it favors a reading of (it) could/might be as an epistemic adverbial.
(32) did you ever do at school where you had to take a boiled egg in and like decorate it? You know like as something like it could be like Santa Claus or like [. . .] the Easter Bunny or like a person (Spoken BNC2014 SCYY 386)
(33) oh unless it’s on the—ANONplace it’s the other side of the cars could be (Spoken BNC2014 S73U 77)
(34) [. . .] and the server comes in and puts a plate of it might be venison and a plate of (unclear) which is a kind of erm corn and er (unclear) in front of you. (BNC1994 S_unclassified JTE)
(35) oh I didn’t realise how hungry I am might be (Spoken BNC2014 S38V 2356)
G. Isolated phrase: (it) could/might be occurs without an overt complement, typically in response to a question or in response to a prior proposition, as in (36)-(39). An adverbial reading is possible in analogy to other epistemic adverbs that can occur in isolation, such as maybe, perhaps, and probably.
(36) And er, the pubs are open. [PS2MU:] Yeah, well it could be. (BNC1994 S_speech_unscripted HM2)
(37) [. . .] they don’t know when even when Israel was established and why you know and it’s like they have this kind of nice European left wing kind of approach to things [S0519:] but [S0521:] oh could be yeah (Spoken BNC2014 SZP6 552)
(38) cos I’m out of the probationary period [S0592:] they can still fire you [S0597:] I think it might be (Spoken BNC2014 S6Q6 2393)
(39) probably the same guys who erm (.) did the er what’s it called? the one the climbing [S0687:] Go Ape? [S0688:] Go Ape yeah [S0690:] dunno might be (Spoken BNC2014 SLJW 743)
Patterns (E)-(G) produce ambiguity between an adverbial and a phrasal reading. In the respective examples, (it) could/might be is clearly interchangeable with an adverb (e.g., maybe, probably, perhaps) in (E)-(G) but not, or less clearly, in the other patterns. Thus, they are critical contexts for adverbialization, and they represent the most grammaticalized uses of (it) could/might be. This is irrespective of whether adverbialization is actually in progress. The structural ambiguity of these contexts can initiate change, but it may as well constitute stable variation (see Vartiainen [2016] for a discussion of ambiguity as a constraining factor).
It is to be acknowledged that it is not always possible to identify the patterns with absolute certainty. For example, one could often construe a referent for the pronoun it, compromising its expletive character; consider, for example, (40) and (41). In (40), it is ambiguous between an expletive and reference to the poorest area of London, while in (41), “The time when I’m going” could be a possible inferred referent, but the phrase is also interchangeable with maybe.
(40) [. . .] because it’s about the poorest area of London probably [S0368:] it could be (.) (Spoken BNC2014 SXDQ 481)
(41) at the moment I don’t know when I am going it could be Tuesday night could be Wednesday morning [. . .] (Spoken BNC2014 SUVS 1031)
Often, however, there are contextual clues as to which reading is more likely (as in 42 and 43). In (42), there is no possible anaphoric referent for it, and the following maybe suggests that it was understood as expletive. In (43), the expletive reading is ruled out by coordination with the negated couldn’t be shingles, which clearly has an implied referential subject.
(42) I dunno why she um gave them this number though (.) she’s never lived here has she [S0084:] no (.) it might be before mum got her landline [S0086:] yeah maybe (Spoken BNC2014 SJW4 315)
(43) [. . .] What is it? I don’t know, she says. (SP:PS20G) Yeah. (SP:PS20H) It could be and then again couldn’t be shingles. (BNC S_consult H50)
When in doubt, tokens were rather coded as
(44) [. . .] do you reckon ET’s a word? [S0328:] could be (.) there’s a lot of really weird two letter words (Spoken BNC2014 SUWR 1376)
(45) the landlords in Saigon are if you like more commercial, more capitalist, it might be they were better organized than landlords in the north. (BNC1994 S_tutorial JJN)
With the above patterns, we can address the question whether the structures and forms that emerge in synchronic spoken language are in line with the grammaticalization mechanisms of diachronic trajectories. 5 As a formal reduction or “morphological erosion” (Heine & Kuteva 2007:43), omission of expletive it is a characteristic of grammaticalization (as has been shown for maybe). If the connection between functional and formal aspects of grammaticalization holds, we would expect to find more it-omission in the more grammaticalized structures, that is, patterns (E)-(G).
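The prediction can be made concrete with a small sketch of how omission rates might be tallied per context type. The grouping of patterns follows the classification above: (A) and (B) as non-grammaticalizing, (C) and (D) as pre-stage, and (E)-(G) as critical contexts. The input format, a list of manually annotated (pattern, it-omitted) pairs, and all names in the code are assumptions made for illustration only.

```python
from collections import defaultdict

# Grouping of complementation patterns into context types, as described
# in section 4.2 (the dictionary itself is an illustrative construct).
CONTEXT_OF_PATTERN = {
    "A": "non-grammaticalizing", "B": "non-grammaticalizing",
    "C": "pre-stage", "D": "pre-stage",
    "E": "critical", "F": "critical", "G": "critical",
}


def omission_rates(tokens):
    """tokens: iterable of (pattern_letter, it_omitted) pairs from manual annotation."""
    counts = defaultdict(lambda: [0, 0])  # context -> [omitted, total]
    for pattern, it_omitted in tokens:
        context = CONTEXT_OF_PATTERN[pattern]
        counts[context][0] += int(it_omitted)
        counts[context][1] += 1
    return {context: omitted / total
            for context, (omitted, total) in counts.items()}
```

Under the hypothesis, the rate returned for the critical group would be expected to exceed the rates for the other two groups; the sketch itself makes no claim about what the data show.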
4.3. Results
The frequencies of the patterns (A)-(G) in the BNC corpora are shown in Tables 2 and 3.
Table 2. Complementation Patterns of (it) Could Be (Normalization per Million Words)
Table 3. Complementation Patterns of (it) Might Be (Normalization per Million Words)
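Normalization per million words makes frequencies comparable across subcorpora of different sizes. A minimal sketch of the calculation follows; the function name is hypothetical, and the corpus size has to be supplied by the user since no word counts are given in this section.

```python
def per_million(raw_count, corpus_size_in_words):
    """Normalize a raw token count to a frequency per million words."""
    if corpus_size_in_words <= 0:
        raise ValueError("corpus size must be positive")
    # Multiply first so that round counts yield exact results.
    return raw_count * 1_000_000 / corpus_size_in_words
```

With low raw counts, as is the case for several of the patterns here, small absolute differences can translate into seemingly large normalized differences, so the normalized figures are best read alongside the raw ones.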
We see, firstly, that the critical contexts (E)-(G) occur consistently; especially the isolated phrase seems to be a typical pattern. Secondly, the relatively complex patterns of extraposition and that-complementation occur predominantly in the more formal, context-governed samples; but the critical contexts (except isolated phrases) are no less likely in this set than in the conversation data. Thirdly, there is no clear diachronic change in distributions as far as is discernible from the data. So, while the epistemic phrases are available for contexts that promote their adverbialization, there is no clear development of a frequency increase or shift towards these contexts. This does not look like grammaticalization in progress, but rather like a stable base potential of these items to occur in possibly grammaticalizing uses.
Is omission of the expletive pronoun it tied to specific contexts? Not strictly, as Tables 4 and 5 show, but there are tendencies. The numbers are low and have to be inspected with caution (and, at this point, without statistical machinery). Yet, some results point to a “grammaticalization effect”. A straightforward one is that omission is very rare where it was least expected, that is, in extraposition constructions (pattern A), and downright inexistent with passive verbs (A2). Moreover, if we take the suggested adverbialization cline from that-clause complement to bare clause complement to parenthetical (Thompson & Mulac 1991; López-Couso & Méndez-Naya 2016), that is, patterns (C)-(E)-(F), each step does come with an increase in omission rate (at least in the totals across subcorpora). Finally, occurrence as an isolated phrase is not only frequent but also often produced without the pronoun. This context can be seen as a possible source of entrenchment of adverbial items could be/might be. On the other hand, some of the results cannot be explained by grammaticalizing contexts. Pronoun omission in idiomatic uses of could be is relatively frequent, which we might put down to their nature as potentially non-compositional fixed expressions. Pattern (D) also shows high omission rates; note that in these positions, the phrase can have adverbial/modifying function, and sometimes an adverb could fill the same paradigmatic slot (see examples 24-27 and 41 above). This makes it plausible that this structure is influenced by adverbial forms like maybe, and in turn can play a role in adverbialization, as suggested by López-Couso and Méndez-Naya (2016:173).
Table 4. Complementation Patterns and Pronoun Omission Rates of (it) Could Be
Table 5. Complementation Patterns and Pronoun Omission Rates of (it) Might Be
Overall, the results suggest that critical contexts for grammaticalization can affect speakers’ realization of an epistemic phrase (here promoting pronoun omission), but also that pronoun omission may depend on other factors as well, such as frequency and the holistic parsing of idiomatic fixed phrases. As the relevant tokens are sparse, the results come with a degree of uncertainty. To get a better perspective on these tentative findings, they are tested against a data set of informal written language in section 5.
5. Comparison with Informal Writing (Blog Postings)
The data for this analysis are sourced from the “Blog” component of the British section of the Corpus of Global Web-based Englishes (GloWbE-GB-B; Davies 2013), containing data from 2012-2013. Compared to the spoken BNC corpora, this is a much larger resource (131.7 million words); its informal written mode has the advantage (to the analyst) that it is unedited and reflects features of spontaneous language use, but with more easily identifiable structures than speech (as disruptions, re-starts, etc. do not occur). The blogs included in the corpus vary in the casualness of linguistic style, but overall, they represent a more homogeneous and informal genre than the “General” component of GloWbE (Davies & Fuchs 2015).
Due to the large size of the corpus, queries were targeted to the string could/might be with subject it or omitted, with subsequent manual filtering. Query terms were it could/might be and could/might be preceded by punctuation mark, conjunction, or adverbial; from the output, items with expletive subject reference were selected manually; false and double hits were excluded. The method has very high precision but cannot guarantee complete recall, especially of co-opted could/might be in unexpected syntactic positions. A very small number of tokens were uninterpretable, and excluded, such as (46).
(46) [. . .] Mauge would do not take care of this safari, nonetheless it could be became the fingers associated with Individuals, isabel marant shoes and boots [. . .] (GloWbE GB-B).
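The two query variants described above can be approximated with regular expressions. The following is only a minimal sketch and not the corpus interface's own query syntax: the pattern names, the function find_candidates, and the short conjunction list are assumptions made for illustration, and the adverbial contexts mentioned above are left out because they are not enumerated in the text. As in the study, such a search only yields candidates; whether it is expletive still has to be decided manually.

```python
import re

# Hypothetical, deliberately short list of conjunctions (an assumption for
# this sketch); adverbial left contexts are not covered here.
CONJUNCTIONS = ("and", "but", "or", "so", "because")

# Variant 1: overt pronoun, "it could be" / "it might be".
WITH_IT = re.compile(r"\bit\s+(could|might)\s+be\b", re.IGNORECASE)

# Variant 2: "could be" / "might be" at the start of the text, after a
# punctuation mark, or after one of the listed conjunctions.
WITHOUT_IT = re.compile(
    r"(?:^|[.,;:!?()]\s*|\b(?:" + "|".join(CONJUNCTIONS) + r")\s+)"
    r"(could|might)\s+be\b",
    re.IGNORECASE,
)


def find_candidates(text):
    """Return (modal, has_it) pairs in order of appearance.

    The hits are candidates only: referential "it" and other false hits
    must still be filtered out by hand.
    """
    hits = []
    for m in WITH_IT.finditer(text):
        hits.append((m.start(), m.group(1).lower(), True))
    for m in WITHOUT_IT.finditer(text):
        hits.append((m.start(1), m.group(1).lower(), False))
    hits.sort()
    return [(modal, has_it) for _, modal, has_it in hits]


if __name__ == "__main__":
    # Wording taken from example (59) in the text.
    print(find_candidates("You also get 766 3x a year from SFE, might be more."))
```

A sentence with a referential subject, such as "He could be right", produces no hit, whereas any it could be string is returned regardless of what it refers to; this is the reason for the manual selection step.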
GloWbE-GB-B yields 1322 relevant tokens of (it) could be with expletive subject reference, and 1443 of (it) might be. All tokens were manually annotated for the type of complement (patterns A-G, as defined in section 4.2). The results are shown in Table 6.
Table 6. Complementation Patterns and Pronoun Omission Rates of (it) Could/Might Be in GloWbE-GB-B
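For comparison with the spoken data, the token totals reported above can be converted to frequencies per million words, given the stated corpus size of 131.7 million words. The sketch below shows this arithmetic along with a simple omission-rate helper; the function and constant names are illustrative assumptions, and no counts other than those given in the text are used.

```python
GLOWBE_GB_B_WORDS = 131_700_000  # corpus size as stated in the text


def per_million(tokens, corpus_words):
    """Normalize a raw token count to a frequency per million words."""
    return tokens / corpus_words * 1_000_000


def omission_rate(omitted, total):
    """Share of tokens without the pronoun; None if the pattern has no tokens."""
    if total == 0:
        return None
    return omitted / total


if __name__ == "__main__":
    # Token totals for (it) could be and (it) might be as reported above.
    print(round(per_million(1322, GLOWBE_GB_B_WORDS), 2))  # 10.04
    print(round(per_million(1443, GLOWBE_GB_B_WORDS), 2))  # 10.96
```

On these totals, both phrases occur roughly ten to eleven times per million words in the blog data.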
The first thing to note is that structures with extraposition (pattern A) make up a very large part of the data, and in these the pronoun it is rarely omitted. Many of these tokens appear to represent quasi-fixed phrases such as it could be argued (that) (n = 292) or (it) might be worth V-ing (n = 161), as in (47) and (48). Yet, these high frequencies do not promote reduction in the form of it-omission. As in the spoken data, the rare occurrences of it-omission in extraposition are with noun or adjective/adverb complements (A1, as in 49 and 50), and never with passive verbs (argued, suggested, said, etc.), where be is an auxiliary (A2, as in 47). Possibly then, auxiliary be is more strongly attached to the following verb than to the preceding could/might. If what promotes it-omission is an increased autonomy of the chunk could be/might be, then the retention of it with be + Vpassive falls into place.
(47) It could be argued that the fans of Wimbledon had endured a more tumultuous period than FC United [. . .] (GloWbE-GB-B)
(48) Have you got much furniture in the room? It might be worth trying some bass traps for the corners. (GloWbE-GB-B)
(49) Might be an idea to start rotating the squad before we loan out more players [. . .] (GloWbE-GB-B)
(50) hope the dutch rat rots in hell (city) could be quite funny if we get 25 mil for him and he has a injury [. . .] (GloWbE-GB-B)
Another difference from the spoken data is the relative infrequency of pattern (G). Isolated phrases clearly have a dialogic function and are used in written text to represent dialogue or as a rhetorical device (as in 51). The high rates of it-omission in this pattern in writing cannot be ascribed to a reducing effect of frequency within this register, but they parallel the finding from spoken language.
(51) Was it the calibre of contestants? Could be, they can all sing to a passable talent show standard. (GloWbE-GB-B)
The most important result is in the contrast between patterns (C) and (E). These patterns are illustrated in (52)-(57). More clearly than in the spoken data, there is significantly more it-omission with bare clause complements than with that-clauses. The omission rates are strikingly similar for could be and might be, and in both cases they are higher in the critical adverbialization context. This cannot be due to frequency, as pattern (C) is more frequent than (E), and neither is particularly frequent overall. Informativity cannot explain it either, since the pronoun it is expletive (and hence uninformative) in both patterns. The result rather suggests an influence of sentence structures with an initial adverbial; pattern (E) can be likened to such structures, and it seems that the potential reanalysis of it could/might be as an adverb promotes the shortening to could/might be.
(52) Now, it might be that you saw lots of smiling Aussies in Sydney, but that’s because of how mainstream media gets distorted by PR and being lazy about getting a more complete picture. (GloWbE-GB-B)
(53) As Two Gallants head comfortably towards new and heavier ground, it could be that the interplay between this and their old quasi-emo style is wearing thin. (GloWbE-GB-B)
(54) Could be that years of exaggerating the profits could bite them in the arse as the whole country goes hydroponic. (GloWbE-GB-B)
(55) Make sure they have the kit they need: It might be their lives will be easier if they have certain things. (GloWbE-GB-B)
(56) (If) and when this highly overbought dollar goes into reverse will Gold also? Could be the US dollar holds the key to Golds next direction. (GloWbE-GB-B)
(57) I could shift some o’ those sacks around and make a little cot for you. Might be there’s a blanket in the back there. (GloWbE-GB-B)
Moreover, there are a number of tokens with it-omission and phrasal complements (pattern D, shown in 58 and 59). This roughly matches the observation for this pattern in spoken language above, and the same explanation probably applies (i.e., the phrase’s modifying function and—in some cases—interchangeability with an adverb). Yet in the blog data, the rate of it-omission in pattern (D) is consistently lower than in the more adverbial-like patterns (E) and (G).
(58) [. . .] some say there is one hundred years left, some say fifty years left, I tend to think it is nearer the fifty years, could be a lot less? (GloWbE-GB-B)
(59) You also get 766 3x a year from SFE, might be more if you’re in London. (GloWbE-GB-B)
6. Summary
We have seen that omission of the expletive pronoun it in epistemic phrases (it) could/might be varies with the complement of the phrase. Moreover, the potential for a grammaticalized, adverbial reading of the phrase is tied to the syntactic form of its complement. As a coarse synopsis, Figure 3 gives an overview of the results where the complementation patterns are grouped by their status with respect to grammaticalization (see section 4.2): patterns (A)-(B) as non-grammaticalizing (“non-grxizing”), patterns (C)-(D) as “pre-stage,” and patterns (E)-(G) as “critical.”

Figure 3. It-omission Rates by Grouped Contexts in the Spoken BNC Corpora and GloWbE-GB-B (Blog Writing) (“Non-grxizing” = “Non-grammaticalizing”)
Summarized like this, the data can be analyzed statistically. This is done in the—relatively simple—logistic regression models in Tables 7 and 8, with the independent variables
Logistic Regression Model of Effects on it-omission (
Note: Model Specification:
Logistic Regression Model of Effects on it-omission (
Note: Model Specification:
7. Discussion
The findings from spoken language and informal writing converge on a main point: omission of the pronoun it correlates with the degree to which the usage context promotes a grammaticalized, adverbial reading.
Given the occurrences in critical contexts and their correlation with formal reduction, we might be tempted to infer a budding diachronic process. But this is not warranted. In the twenty-year span of the BNC corpora, there is no general increase of occurrence in critical contexts, nor of it-omission (see section 3). The tendencies we see are (probably) not the long-term effects of shifts in usage frequencies (as often observed with grammaticalization, though typically at a delay; Mair 2004). Rather, they are preferences that emerge in synchronic usage events. These preferences are determined in part by the paradigmatic similarity of the epistemic phrase (it) could/might be to a main clause or to an adverbial—if it is more like an adverbial, the dummy subject it is more readily omitted. Thus, the formal reduction of epistemic phrases correlates with their occurrence in critical contexts for grammaticalization, even when other factors (frequency, informativity) are not at play.
However, there is more to it in the detail. We have seen that it-omission is not simply a function of “more grammaticalized” uses but also often occurs in idiomatic phrases like could be worse (pattern B, in spoken language) and with phrasal constituents denoting a quantity, time or distance (pattern D). In idiomatic phrases, the non-compositionality of fixed expressions seems to promote the omission, which would be irrespective of any grammaticalization tendencies. The phrasal complements often produce contexts in which the epistemic phrase is interchangeable with an adverb, albeit with limited scope; an analogy effect of adverbs, especially maybe, is therefore a plausible explanation. Moreover, these structures often express guesswork, estimates, or uncertainty about a number, so the epistemic qualification is foregrounded (as was seen in examples 24-27).
In spoken language, the most prominent type of use in a critical context is it-omission in an isolated phrase could be/might be (pattern G). This form is frequent in dialogic discourse as it adds an epistemic stance to a preceding proposition. In a grammaticalization scenario, this type is a possible catalyst for reanalysis, as it strengthens the representation of could be/might be as a syntactically autonomous element that can be adapted as an adverb.
Comparing the items, we see that could be has higher rates of it-omission than might be, across context types in the spoken data and substantially in critical contexts in blog writing. The share of tokens in critical contexts is also larger with could be (56 percent in speech and 6 percent in writing, compared to 29 percent and 2 percent, respectively, for might be). From this, it seems that could be is more readily adapted to adverbial-like function and form, and hence the more likely candidate for adverbialization. However, the data are not fully conclusive on this matter.
Finally, the adverbial-like uses of could be/might be raise questions about their function. On the one hand they might be “attracted” to this syntactic role by the analogy with maybe; on the other hand, we may expect a differentiation in semantic or pragmatic aspects. It seems that especially could be often expresses a positive acknowledgement of a possibility, rather than casting a proposition into doubt (see example 60). While maybe and perhaps are often used for hedging or attenuating the certainty of a statement (Fraser 2010:22-23), a preferred usage context of could be/might be appears to be in discourse about possibilities or when new scenarios are brought into the conversation (61; see also examples 28, 30, 44, and 51 above). Expressions of epistemic uncertainty are often placed after a proposition as adding an element of doubt (Aijmer 1997:21; Kaltenböck 2013:295). The proposed dispreference of could be/might be for this function would then also explain why they are so rarely co-opted in non-initial positions (pattern F)—unlike other epistemic phrases such as I think and I suppose (which indeed occur clause-finally more often than confirmatory phrases like I’m sure or I believe; see Van Bogaert 2011; Kaatari & Larsson 2019). This differentiation is admittedly impressionistic at this point, but at least in examples like (60) and (62) the epistemic stance is clearly more affirmative than it would be with maybe or perhaps. That said, in other cases they are practically or explicitly interchangeable (as in 63).
(60) [S0376:] she was acting really strange first of all it was jumper she liked cos I think she knows it’s a cat on it [S0104:] yeah yeah could be (Spoken BNC2014 SPHU 367)
(61) So please, think about it and do it, and get that enjoyment out of what we’re offering. Longer term, well I don't know it, it could be you want your bigger house, your detached house, it could be that you want erm, status within the company. (Spoken BNC1994 S_speech_unscripted KM5)
(62) You sure you did not just make that up? Even unconsciously? Perhaps you just proved my point? Might be . . . (GloWbE GB-B)
(63) He will know. Not maybe, or could be. (GloWbE GB-B)
8. Conclusion
This study examined the usage of epistemic phrases with could be and might be in spoken English (Spoken BNC) and informal writing (GloWbE-GB-B). Its aim was to test the correlation of grammaticalization and formal reduction in synchronic language use, that is, the hypothesis that more grammaticalized uses of an item are more susceptible to reduction. This was approached through complementation patterns that allow for an interpretation of (it) could be/might be as sentence adverbials. The step in complementation from that-clause to bare clause can be seen as the “main entrance” to adverbialization (in line with Thompson & Mulac 1991; López-Couso & Méndez-Naya 2016; Kaatari & Larsson 2019). As that-omission does not as such constitute a fundamentally new kind of structural embedding, the (diachronic) process of grammaticalization crucially depends on the frequency of use in this critical context. In this respect, we can say that (it) could be and (it) might be are lingering on the doorstep—they occur regularly with bare clause complements, but less frequently than with that-clauses. This may serve to show that constructions like could be/might be have a kind of base potential for grammaticalization—they can be found in critical contexts which allow for grammatical reanalysis and which would in retrospect be identified as indicators of the “beginning” or “early stage” of a grammaticalization process. More importantly though, aspects of grammaticalization are at work even in the absence of an observable diachronic development. In the present case, the tendency to omit the expletive pronoun it is overall stronger in contexts where (it) could be/might be can be parsed as an adverbial. Thus, the correlation of (more) grammatical function and shorter form emerges in spontaneous usage, that is, at a stage of synchronic variation that may or may not develop a diachronic dynamic.
This finding ties in with the notion that grammaticalization is propelled by individual speech events produced by individual speakers who are by and large unaware of ongoing long-term developments in the language (Fischer 2010a; Petré & Van de Velde 2018). There must then be general cognitive preferences that nudge speakers toward producing (and reproducing) the forms and structures that eventually lead to a grammaticalization process as a “unified” change of (semantic and syntactic) function and (phonetic and morphological) form. Omission of expletive it is plausibly motivated at any time by production economy, but it is constrained by the syntactic embedding of the phrase. In addition, speakers may model the epistemic phrase in analogy to maybe when they perceive it as functionally similar (i.e., adverb-like).
Spoken language provides more opportunities than writing for spontaneous innovation and structural neo-analysis (here seen in the higher token numbers in critical contexts, in particular in non-initial position), and perhaps the flexibility of spoken language and its resulting openness for structural re-interpretation are still underrated given the bias toward written text in many areas of grammatical analysis (cf. Linell 2005; Haselow 2017:3-8). The present study is based on observations in spoken data and a comparison with informal writing. The comparison has shown that the more innovative uses of (it) could be/might be occurring in speech are not well-established in writing, and that pronoun omission is generally more prevalent in the spoken data. This supports the notion that the mechanisms we observe in grammaticalization emerge mainly from spoken language. Still, both data sets support the main finding that contexts which invite grammaticalization promote a reduction of form, even when these contexts are infrequent. The “fluid patterns of language use” (Hopper & Traugott 2003:2) from which grammaticalization emerges already show the traits of unified processes of function and form.
From a usage-based, “emergentist” perspective, speakers do not intentionally innovate new configurations such as could be as an adverb. Rather, they “recontextualize” an existing piece of structure as a “dynamic and flexible transfer of linguistic patterns/means from one utterance-in-context to another” (von Mengden & Kuhle 2020:265). Thus, it-omission in could be/might be is not a reaction that follows an innovation, as in the actualization of a change. Perhaps it rather reflects an intuition or sensitivity on the part of speakers for how (syntactic) context affects function and can make an item more “discursively secondary” (Boye & Harder 2012:7-8). Reduction of form would then be directly related to the item’s syntactic embedding.
Finally, the findings also shed light on the role and timing of erosion in grammaticalization. Erosion has mostly been regarded as a consequence of semantic or functional change; that is, erosion is part of a secondary stage and is not expected to occur at the onset of grammaticalization (Traugott 2002; Norde 2012:83). In the case presented here, a tendency towards morphological erosion correlates with a critical context for functional reanalysis, while no category change is completed or even progressing diachronically. This would mean that there is morphological erosion right from the beginning. While this finding is based on syntactic decategorialization, Dehé and Stathi (2016) found a similar correspondence between semantic change and prosodic prominence. Evidence is needed from more cases and with more gradient aspects of phonetic form such as vowel qualities, duration, and consonant lenition. As it stands, the results of the present study suggest that speakers tend to adapt form to function immediately, with or without on-going change.
Acknowledgements
I would like to thank the journal editors and three anonymous reviewers for their insightful criticism, constructive comments, and keen eye for detail. Their feedback has prompted significant improvements on this paper. I have also profited from discussions with the audiences at BICLCE 8 (Bamberg) and VALP 5 (Copenhagen). All remaining errors and obscurities are entirely my own.
Declaration of Conflicting Interests
The author declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author received no financial support for the research, authorship, and/or publication of this article.
