Abstract

The following interview with Douglas Biber was conducted via email by Bethany Gray during April and May 2013. The text of the interview has been edited only minimally from the original, primarily for organization.
During your career, you have used corpus linguistics to investigate a wide range of language varieties, both synchronically and diachronically, cross-linguistically, and in terms of their use of varying linguistic structures (including grammar, lexico-grammar, phraseology, and discourse structure). But underlying most of your research is the theme of register variation—that variation in language corresponds to differences in the characteristics of the situations in which it is used. How and when did you first become interested in register variation?
Once you’ve asked it, this is an obvious question. But oddly, it’s not a question that I’ve asked myself before. And it’s kind of hard to come up with a definitive answer.
I guess the simplest answer is that my interest in register differences grew out of my early multi-dimensional studies. When I started out on these studies, I was mostly interested in the relationship between speech and writing. But what I ended up finding out was that the whole range of register differences matters. Based on my studies over the years, I have come to believe that the distinction between speech and writing is the most important parameter of register variation for predicting linguistic variation. But at the same time, these studies have consistently shown that all register differences matter; that there will be systematic patterns of linguistic variation associated with any situational parameter of register, including purpose, topic, production circumstances, relation between audience and addressee, etc. So, from this perspective, my interest in register variation emerged from the corpus studies themselves—I was repeatedly confronted with findings that demonstrated the importance of register differences, so I became “interested”!
But when I think more about it, there were certainly earlier influences. When I first began to study linguistics, I developed interests in two areas that were not usually taught together: I enjoyed formal linguistics—analyzing phonological and grammatical structures and patterns—but I was also especially interested in the structure of texts and the analysis of natural discourse. I began my graduate studies at the University of Texas at Austin, where I was exposed to many courses that provided me with a strong background in formal linguistics—phonology, syntax, formal semantics, etc. But no one there was very interested in applying linguistic analysis to the description of natural discourse. When I transferred to the University of Texas at Arlington, I was exposed to the other side of the coin. In particular, I can remember a course taught by Robert Longacre on the grammar of discourse. In the course, we learned how texts from different text types had fundamentally different kinds of textual organization and structure, with different grammatical devices used to signal those organizations. I don’t think Bob ever used the terms “register” or “register variation,” but looking back on the course, I realize many of the same core concepts were there.
And looking back on my early publications, I now realize that this perspective obviously shaped my research interests in those years. For example, I carried out a study of focus markers in Central Somali in the early 1980s. The standard methodology at the time for pragmatic studies of this type was to use controlled question-answer sequences, to show how the focus marker was used to identify the new or contrastive information in the response (e.g., Who did you see last night? - I saw
Much of your research focuses on comparative analyses between registers as a primary way to tell what’s distinctive about a register or discourse context. Register is even a consideration when you look at learner language, when you apply findings to teaching materials as you do with Susan Conrad in Real Grammar (Conrad & Biber 2009), and as you study diachronic change (to name a few examples). What do you see as the major contribution of accounting for register in language studies, and how does the importance of register influence your own research goals?
It’s kind of scary to allow me to get started on this topic. Every time that I’ve compared the patterns of language use across registers, I have observed systematic, interpretable differences. This has led to the very strong belief that register always matters, that it is not possible to fully describe patterns of linguistic variation and change in a language without consideration of register differences.
One reason that I’m so outspoken about this claim is that it is at odds with normal theory and practice in descriptive linguistics, sociolinguistics, and even to some extent, corpus linguistics. Descriptive linguists have traditionally compared the inventory of words/structures across languages/dialects. This is usually not a quantitative enterprise, and there is usually little concern with language use or linguistic variation within a variety; so register differences are typically not considered to be relevant.
The disregard for register differences within quantitative sociolinguistics is more baffling to me. A typical textbook on sociolinguistics starts out with an introduction to the speech community and speech events—similar to a register perspective. But then we turn to the next chapter, and the focus shifts to social dialects (investigated through sociolinguistic interviews); most of the register variation in a community is simply excluded from consideration without comment. The linguistic characteristics discussed in these studies are also constrained to a few “meaning-preserving” linguistic variables (usually phonetic variants); functionally motivated linguistic variation is simply excluded from inquiry. The register perspective differs from these traditional sociolinguistic studies in both regards: it describes patterns of language use in all spoken (and written) registers, and it describes patterns of language use in terms of the full inventory of lexico-grammatical characteristics. In my view, linguistic variation is inherently functional. We tend to use different linguistic features and variants in different registers, because we choose the linguistic features that are best suited functionally to the purposes and situational characteristics of those registers. So it makes no sense to me to artificially constrain the domain of inquiry to include only nonfunctional, meaning-preserving linguistic choices.
Finally, the disregard for register differences within corpus linguistics has a different source. Most corpus linguists would agree that linguistic variation is functional. But there has sometimes been a blind faith in the corpus: that a very large sample of language must surely represent everything of interest. Over the years, we have come to realize that “large” must be much larger than we previously thought, especially to capture the patterns of use for individual words. But the overall message has been that “bigger is better” (kind of like our attitudes toward apples, hamburgers, and phone networks)—with little consideration of the quality or specific composition of that large sample. As a result, corpus researchers often pay little attention to the representation of register distinctions in corpora.
So, because register has often been disregarded by linguists of many persuasions, I have been pretty persistent in arguing for the importance of register. Register differences are at least as important as dialect differences in any community of language users, and the representation of register differences is at least as important as corpus size for any empirical corpus-based study of linguistic variation and use. Quantitative linguistic descriptions that disregard register are always incomplete, and often even misleading about the importance of other factors. I believe that all linguistic descriptions—ranging from sociolinguistic descriptions of variation in a particular community, to collocational studies of particular words, to pedagogical corpus-based teaching materials—must include consideration of register differences as a central organizing parameter, if they hope to achieve an accurate account of the patterns of use.
How did you arrive at corpus linguistics as your approach to studying language variation? Did your interest in corpus linguistics and register variation develop simultaneously, or did one lead to the other?
The short answer is that I was lucky to have mentors who pointed me to the possibilities of corpus analysis and trained me in the skills that I could apply to corpus analysis. When I first began to do register analyses, I had never even heard of corpora. So, for example, when I was doing the pilot study for my dissertation, I spent considerable time counting the occurrence of grammatical features in texts by hand! It was only later that I came to realize that the analysis of corpora provided an ideal research approach for investigating the use of linguistic features across texts and registers.
I was extremely lucky to be in the right place at the right time. As an undergraduate, I had developed a strong background in science (with a degree in geophysics from Penn State University), including courses in Fortran computer programming, and some research experience working on the computer modeling of earthquake fault zones. However, I did not really build on that experience after graduation. Rather, I spent time drawing seismic maps as a geophysicist; then went back to graduate school in theoretical linguistics; then supervised a Somali adult literacy program in northeast Kenya; and eventually ended up in the PhD program in linguistics at the University of Southern California (USC), where I initially focused my research efforts on phonology and historical linguistics. I gradually shifted my interests to issues in sociolinguistics, focusing especially on the analysis of spoken versus written discourse. But I had no intention of building on my background in science and technology. Two mentors at USC helped me to change that.
First, Ed Purcell helped me to develop stronger technical skills. Ed taught me both statistical analysis and advanced computer programming skills. Through Ed’s courses, I learned how to carry out univariate and multivariate statistical analyses, with extensive discussion of how those techniques could be applied to linguistic research questions. And my development in computer programming skills occurred mostly as on-the-job training, when Ed hired me to work in a computer lab on campus. For example, I worked on a project to translate an acoustic analysis software package from Fortran into EDL (a programming language used on IBM Series/1 minicomputers), and in the process, I learned how to write software for linguistic analysis. That job led to a full-time position as a programmer in the university computing center, which eventually placed me in an ideal position for doing my own corpus analyses.
A second Ed—Ed Finegan—was central to my development as a corpus linguist, and as a researcher and writer in general. Ed was my dissertation chair and completely supportive of my general interests in spoken and written discourse. But then one day in 1983, Ed told me that he had read an article about an electronic collection of texts (the Brown Corpus). I had never heard of a “corpus” before, so didn’t really know what it could do for me. But Ed suggested that I could apply my programming skills to corpus analysis, radically changing the methodology that I had intended to apply in my dissertation research on spoken and written discourse. Ed helped me obtain university funding to purchase the Brown Corpus, LOB (Lancaster-Oslo/Bergen) Corpus, and London-Lund Corpus, and so he is the one who really got me into corpus linguistics.
The technique of multi-dimensional analysis that you developed during your graduate work at the University of Southern California has now been applied to general and specialized registers, diachronically and synchronically, and across languages. What led you to pursue this innovative approach in the context of language studies?
It’s hard to really remember what all truly influenced me during this period, versus what has become a kind of revisionist history. There had been a lot of interest in the linguistic comparison of speech and writing during the 1970s and early 1980s, in large part carried out by researchers in psychology, communication, and anthropology. At the same time, some American linguists were breaking away from the dominant preoccupation with formal syntax and bringing a linguistic perspective to the comparison of speech and writing. I know that I found the work of Wally Chafe and Deborah Tannen to be especially influential—especially their emphasis on task (a similar notion to register). Previous research had usually compared a few spoken texts to a few written texts, with no consideration of whether comparable tasks were being compared in speech and writing. Chafe and Tannen addressed this problem, for example by comparing narrative recounts of a movie in speech and writing. This line of research made me aware of the importance of task (register) in addition to the spoken/written modes.
The key linguistic innovation of multi-dimensional (MD) analysis is the comparison of registers with respect to sets of co-occurring linguistic features (the “dimensions”), in contrast to the more traditional approach that considers only one linguistic characteristic at a time. But this idea had actually been floating around for a couple of decades before I developed the MD approach. First, there were theoretical discussions by linguists like Ervin-Tripp (1972), Hymes (1974), and Brown and Fraser (1979) who emphasized the importance of linguistic co-occurrence for the analysis of differences among registers (or “speech styles”). So, for example, Brown and Fraser (1979:38-39) argued that it can be “misleading to concentrate on specific, isolated [linguistic] markers without taking into account systematic variations which involve the co-occurrence of sets of markers.” Chafe (1982) applied this concept to the comparison of speech and writing, proposing two parameters of linguistic variation: “integration/fragmentation” and “detachment/involvement.” Each of these parameters was composed of a set of related linguistic features. For example, the “integration/fragmentation” parameter was composed of features like nominalizations, participles, and attributive adjectives versus clause coordination. Chafe identified these sets of linguistic features on an intuitive basis and did not empirically investigate whether the features actually co-occurred in texts, but the notion that linguistic features work together as related sets was clearly evident in his work.
The distinctive statistical innovation of MD analysis was the application of factor analysis, to empirically identify sets of linguistic features that tend to co-occur in texts. This innovation had its roots in Carroll (1960)—a truly amazing study for its time, although I’m not sure I fully appreciated that fact in the early 1980s. Although the paper provides essentially no information on the methods for the linguistic analysis, we can only assume that it was done entirely by hand: counting the occurrence of 39 linguistic variables in 150 text passages (each 300 words in length). These counts were then subjected to a statistical factor analysis, carried out with “the aid of high-speed electronic computing machines” (Carroll 1960:288)—presumably an early version of a mainframe computer. Regardless of the methodological details, the resulting analysis identified six major “vectors of prose style.” Each of these vectors was composed of subjective, perceptual variables co-occurring with objective, linguistic variables. Conceptually and methodologically, these vectors are very similar to the dimensions in MD analysis. This seems to have been Carroll’s only foray into the domain of linguistic stylistics (he was much more interested in language testing, human cognition, and psychometrics). However, the 1960 paper must have had a huge influence on my own thinking, helping me to realize that statistical factor analysis could be used to empirically identify the linguistic co-occurrence patterns that linguists had been positing on theoretical grounds.
So, there were at least four major influences on my thinking at this time: the need to control for task (register) in analyses of speech versus writing; the availability of corpora that incorporated a large range of different registers; theoretical research emphasizing the importance of linguistic co-occurrence; and the application of factor analysis to textual variation. MD analysis was an attempt to integrate these considerations in a large-scale analysis of linguistic variation among spoken and written registers.
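The co-occurrence idea behind the dimensions can be illustrated with a very small sketch. The code below only computes pairwise Pearson correlations among feature rates across a handful of texts; this is the input to a factor analysis, not the factor analysis itself, and a full MD analysis would go on to extract factors from such a matrix. The feature names, the rates, and the 0.8 threshold are all invented for illustration and are not taken from any study.

```python
from math import sqrt

# Hypothetical rates per 1,000 words for four features in six texts.
# Every value here is invented purely for illustration.
RATES = {
    "pronouns":        [95.0, 88.0, 91.0, 30.0, 25.0, 28.0],
    "present_verbs":   [80.0, 76.0, 85.0, 40.0, 35.0, 38.0],
    "nouns":           [150.0, 160.0, 155.0, 300.0, 310.0, 295.0],
    "attributive_adj": [20.0, 25.0, 22.0, 70.0, 75.0, 68.0],
}


def pearson(xs, ys):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def cooccurring_pairs(rates, threshold=0.8):
    """Return feature pairs whose rates rise and fall together across texts.

    The threshold is an arbitrary illustrative cut-off, not a standard value.
    """
    names = list(rates)
    pairs = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if pearson(rates[first], rates[second]) >= threshold:
                pairs.append((first, second))
    return pairs


if __name__ == "__main__":
    for first, second in cooccurring_pairs(RATES):
        r = pearson(RATES[first], RATES[second])
        print(f"{first} and {second} co-occur (r = {r:.2f})")
```

In this toy data, the features split into two groups that are negatively correlated with each other, which is the kind of pattern a factor analysis would summarize as a single dimension with positive and negative poles.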
Why do you think multi-dimensional analysis has proven so effective as a means of uncovering register variation?
I think this is because the MD approach integrates the theoretical considerations identified above, while also exploiting the incredible potential of large-scale corpus analysis, applied to a theoretically important research issue. And MD analysis does not merely present quantitative findings—rather, the numbers are interpreted in functional terms by direct reference to actual texts and specific linguistic characteristics.
In the 1980s, most corpus-based research was highly exploratory, often without a specific motivating research question. Even the simplest corpus investigations—like generating a frequency-sorted list of words—were highly labor-intensive. As a result, corpus research during this period (and to some extent, up to the present day) was often constrained by considerations of computational feasibility rather than theoretical interest.
MD analysis was developed with a fundamentally different agenda. It started with a hotly debated theoretical question: what are the linguistic similarities and differences between speech and writing? The order of priorities was reversed from most other research during that period: the corpus analysis was motivated by the research question, rather than the research question being motivated by what could reasonably be extracted from a corpus.
This is one reason why MD analysis has remained so popular across the decades. It turned out to be a very powerful methodological approach, designed to address genuine research questions, uncovering important linguistic patterns of use that would not be noticed otherwise.
A related reason for the continuing importance of MD analyses is that they are carried out as linguistic investigations, with an emphasis on the functional interpretation of the linguistic patterns of register variation. This emphasis can be contrasted with many quantitative corpus investigations, which present numbers, quantitative findings, or statistical models as the final product. I believe that corpora should be analyzed in quantitative terms using the best statistical methods available. But I also believe that corpus analyses should fundamentally be linguistic investigations, that quantitative findings must be interpreted and explained in linguistic terms, and that those quantitative patterns should be exemplified in specific texts. MD analyses have always incorporated these steps. I think this is a major reason why the MD approach has persisted in being so useful and why the findings from earlier MD analyses have continued to be so influential.
From your perspective, what has been the most meaningful contribution of these MD studies to the field?
I see two major contributions of MD research—one methodological, and the second contributing to what we know about language use.
The most obvious methodological contribution is simply the development of an approach for describing the complex patterns of variation and use among a set of registers, in terms that can be interpreted both linguistically and functionally. The numerous studies that have applied MD analysis, across discourse domains in English as well as other languages, are testimony to this methodological contribution.
But when we consider the history of corpus linguistics over the past 30 years, I think there have been additional, more general, influences of the MD approach. The first is to provide an early model of corpus-based research designed to address a major theoretical research question (rather than simply exploring the contents of the corpus itself). A second is the application of multivariate statistical analysis to the study of linguistic variation. In the 1980s, VARBRUL was well established as a multivariate statistical technique used to study the choice among variants for a linguistic variable. But beyond that, multivariate statistical analyses were generally not employed for linguistic investigations.
Probably a more important, but less often noticed, methodological contribution is the development of the “text-linguistic” perspective on variation and use, in contrast to the dominant “variationist” perspective. Simply put, variationist research studies investigate the proportional preferences for the different variants of a linguistic variable (e.g., of-genitives versus ’s-genitives), while text-linguistic studies investigate the rates of occurrence of linguistic features in texts. The two perspectives have fundamentally different research goals: the goal of the variationist perspective is to predict when speakers use one or another linguistic variant. In contrast, the goal of the text-linguistic perspective is to provide a comprehensive linguistic description of a register, and to compare the linguistic characteristics of different registers. Variationist studies were well-established by the 1980s, but the early MD analyses were among the first large-scale text-linguistic studies. As such, those studies helped to establish the comprehensive linguistic description of registers as an important domain of corpus inquiry. (Regional and social dialects could also be described and compared from a text-linguistic perspective, and I have previously advocated such research. However, there has been little interest in such research questions within mainstream variationist sociolinguistics.)
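The difference between the two perspectives can be made concrete with a small sketch. The function names, the two texts, and all counts below are hypothetical, chosen only to show that the two measures answer different questions: the variationist measure is a proportion among competing variants, while the text-linguistic measure is a rate of occurrence normalized to text length.

```python
def variant_proportions(variant_counts):
    """Variationist view: share of each variant among all uses of the variable."""
    total = sum(variant_counts.values())
    return {variant: count / total for variant, count in variant_counts.items()}


def rate_per_thousand(feature_count, text_length_in_words):
    """Text-linguistic view: occurrences of a feature per 1,000 words of text."""
    return feature_count * 1000 / text_length_in_words


# Two invented texts of equal length with invented genitive counts.
TEXTS = {
    "text_a": {"words": 2000, "genitives": {"of_genitive": 6, "s_genitive": 4}},
    "text_b": {"words": 2000, "genitives": {"of_genitive": 60, "s_genitive": 40}},
}

if __name__ == "__main__":
    for name, text in TEXTS.items():
        proportions = variant_proportions(text["genitives"])
        rates = {
            variant: rate_per_thousand(count, text["words"])
            for variant, count in text["genitives"].items()
        }
        print(name, "proportions:", proportions, "rates per 1,000 words:", rates)
```

In this toy case the two texts have identical variant proportions but rates that differ by a factor of ten, so a purely proportional analysis would treat them as the same while a rate-based description would not.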
A final general methodological contribution of MD analysis is also usually overlooked: the focus on grammatical characteristics. Most large-scale corpus research in the 1980s was restricted to the study of individual words and word lists: characteristics that could easily be identified using concordancing software and an untagged corpus. Such research was important, especially for developing the notion of collocation. But to some extent, this was also a case of doing the research that was feasible—it was simply not feasible for most linguists to investigate grammatical patterns of use in a large corpus at that time. Early MD studies showed that such analyses were possible. They weren’t easy—they required programming expertise, lots of time and effort for the development of accurate taggers, and lots of time and effort for the editing and hand-correction of tags. But they were feasible. We have seen an increase in the number of large-scale corpus-based grammatical and lexico-grammatical studies over the last two decades, and I like to think in part that this trend has been influenced by the early MD studies.
In addition to the methodological contributions, I believe that there have been major theoretical contributions of MD research. I see that you ask about those in the following questions, so I’ll put that off until then.
You once commented to me that you have been amazed at the universality of the 1988 Dimension 1: involved versus informational discourse (as illustrated, for example, by the similar dimensions that have come out of MD analyses of speech and writing [Biber 2001, 2008; Reppen 2001]; languages other than English [Biber 1995; Biber et al. 2006; Parodi 2007]; language in academic settings [Biber 2006]; learner production [Biber & Gray 2013]; across disciplines within academic writing [Gray 2013]; in specialized discourse settings [Friginal 2008; Egbert 2012], to name a few). First, how would you summarize what these similar findings illustrate about the spoken-written or oral-literate distinction? Second, did you ever imagine that all of these varied studies would reveal such consistent patterns?
The two major extensions of MD analysis since the 1980s have been its application to languages other than English (including many non-European languages), and its application to specialized discourse domains. These have been important for understanding the patterns of register variation within those languages/cultures/domains. Further, the cumulative results of these studies are theoretically important because they provide strong evidence for the existence of universal parameters of register variation, especially the existence of a fundamental oral/literate opposition, and the persistent importance of a narrative dimension across discourse domains.
MD studies of register variation have uncovered both surprising similarities and notable differences in the underlying dimensions of variation. From both theoretical and methodological perspectives, it is not surprising that each MD analysis would uncover specialized dimensions that are peculiar to a given language and/or discourse domain. After all, each of these studies differs with respect to the set of linguistic features included in the analysis, and the set of registers represented in the corpus for analysis. Given those differences, it would be reasonable to expect that the parameters of variation that emerge from each analysis would be fundamentally different.
For that reason, it has been surprising to discover dimensions of variation that occur across languages and discourse domains. Two such dimensions have emerged in nearly all MD studies, making them candidates for universal parameters of register variation: a dimension associated with oral versus literate discourse, and a dimension associated with narrative discourse (see also Biber forthcoming).
The robustness of narrative dimensions across studies indicates that this rhetorical mode is truly basic to human communication. Apparently, whether we’re speaking or writing, for personal or informational purposes, in literate languages/cultures (e.g., English, Spanish) or “oral” languages/cultures (e.g., Tuvaluan, Bagdani), there is always need for narrative discourse that describes past events, contrasted with discourse focused more on the present.
The more surprising finding is the oral/literate opposition, which emerges as the very first dimension in nearly all MD studies. In MD studies based on general corpora of spoken and written registers, this oral/literate dimension clearly distinguishes between speech and writing. However, MD studies of specialized discourse domains show that this is not a simple opposition between speech and writing. In fact, this dimension emerges in studies restricted exclusively to spoken registers (e.g., Biber 2008; Friginal 2008), as well as studies restricted to written registers (e.g., Egbert 2012; Goźdź-Roszkowski 2011; Gray 2013).
In terms of communicative purpose, the oral registers characterized by this dimension focus mostly on personal concerns and the expression of stance. These registers are usually produced in real time, with little or no opportunity for planning, revising, or editing. In contrast, literate registers focus on the presentation of propositional information, with little overt acknowledgement of the audience or the personal feelings of the speaker/writer. These registers usually allow for extensive planning and even editing and revising of the discourse.
Linguistically, this first dimension opposes two discourse styles: an oral style that relies on pronouns, verbs, and adverbs, versus a literate style that relies on nouns and nominal modifiers. The oral style also relies on clauses to construct discourse—including a dense use of dependent clauses. In contrast, the complexity of the literate style is phrasal. Thus, spoken registers (and oral written registers) rely on clausal discourse styles, including a dense use of dependent clauses; written registers (and literate spoken registers) rely on phrasal discourse styles, especially the dense use of phrasal modifiers embedded in noun phrases (see also Biber & Gray 2011; Biber, Gray & Poonpon 2011). This finding, replicated across languages, is especially surprising, because it runs counter to assumptions about syntactic complexity held by many linguists.
This fundamental distinction has been replicated in nearly every MD study, across cultures/languages, and across specialized discourse domains. It suggests a fundamental reorientation to how we should view linguistic complexity: many types of syntactic elaboration—especially the dense use of finite dependent clauses—are typical of normal face-to-face conversation, and so these structures must not be especially complex from a production perspective. In contrast, phrasal modification—multiple layers of embedded phrases without verbs—is the hallmark of complex written discourse, which can only be produced in circumstances that allow planning and manipulation of the text. This finding has major implications for theoretical discussions of linguistic complexity as well as applications in the domains of language learning, teaching, and testing (see also Biber & Gray 2010, 2011; Biber, Gray & Poonpon 2011, 2013).
I certainly did not anticipate the universal nature of the oral-literate (and clausal-phrasal) dimension. In fact, I don’t think I had any concrete hypothesis when I first carried out MD analyses of spoken and written registers in English. Previous scholars were all over the map with their conclusions about the relationship between speech and writing; thus, my 1986 MD article published in Language had the subtitle “Resolving the Contradictory Findings.” When we started to carry out MD studies of other languages, it was against the backdrop of the English MD findings, but I still had no strong reason to expect that we would discover essentially the same oral-literate Dimension 1.
But the findings that have most surprised me have emerged from the studies of restricted discourse domains (e.g., job interviews, conversations, academic research articles—see the survey in Biber forthcoming). At first, I was skeptical of these enterprises, because I didn’t believe there would be sufficient variation within those discourse domains for a stable statistical analysis. However, my doubts were proven wrong: all applications of MD analysis to such domains have resulted in stable and interpretable dimensions of variation. And the even less expected finding is that these studies have all uncovered an oral-literate dimension, usually as the first dimension in the analysis, with a remarkably similar linguistic composition across studies.
Apart from the major methodological contribution of MD analysis, you have also published extensively on core methodological issues in corpus linguistics, including corpus design and representativeness (Biber 1990, 1993), quantifying situational characteristics of registers (Biber 1994), quantitative designs (Biber & Jones 2009), measures of phraseological patterns (Biber 2009), and so on. What do you see as the essential methodological consideration(s) that corpus linguists today need to pay particular attention to?
When I first saw this question, I thought it was going to be easy—How many “essential” methodological considerations could I come up with? But then I started to make my list, and it kept on getting longer and longer . . . although those considerations might all boil down to the need for greater awareness of the foundations of quantitative (social) science research.
Some of the most important recent advances in corpus linguistics involve improvements in technology and the application of more sophisticated research methods. Probably the most important of these is the increasing availability of corpora that represent different varieties, which are also becoming larger. Grammatically tagged corpora are also becoming more widely available. The online corpora developed by Mark Davies at BYU (Brigham Young University) are especially noteworthy in all of these respects (Davies 2004-, 2007-, 2008-, 2010-, 2011-). A second important trend is the availability of corpus-analysis tools that permit meaningful linguistic investigations of those corpora. These tools include the CLAWS grammatical tagger developed at UCREL (University of Lancaster; see Garside & Smith 1997); AntConc, a freeware concordancing package developed by Laurence Anthony (Waseda University; Anthony 2011); and the suite of online tools developed by Davies for analysis of the BYU corpora. And a third recent trend is the increasing use of multivariate statistical techniques by corpus linguists, allowing sophisticated analyses of the ways in which sets of variables interact with one another to predict patterns of linguistic variation. I think corpus linguists would generally agree that all three of these are important methodological developments.
But the essential methodological considerations that came to mind for me are much more basic issues which are often disregarded by both expert and novice corpus linguists. First, my responses to the questions above emphasize my belief that register differences should be an essential component of any investigation of language use. This is not current practice in corpus-based research, so it’s high on my list of essential considerations.
Beyond that, there are five additional basic considerations that I consider essential methodological concerns:
critical evaluation of the corpora that we use
more awareness of research design considerations
more accountability in the reporting of quantitative findings
more accountability in the linguistic interpretation of quantitative findings
the development of new methods for the comparison of type distributions across corpora
Critical Evaluation of the Corpora That We Use
Quantitative social science research begins with research questions asked in relation to some “population.” One of the first steps in the analysis is to collect a sample of observations from the population. That sample is evaluated for the extent to which it represents the population, because in the end, the researcher intends to make generalizations about the population based on analysis of the sample. Corpora are samples. But in corpus linguistic research, researchers often bypass the step of collecting the sample, because they use existing corpora. This obviously facilitates a lot of research that would not be done otherwise—it would not be possible for most individuals to construct a large corpus of their own every time they wanted to ask a research question! But the downside is that most researchers are not really aware of what’s actually included in the corpora that they are analyzing, so they don’t bother to ask what population has actually been represented in that corpus—and most importantly, they don’t ask how corpus findings might be influenced by the nature of the corpus.
This is especially problematic for large corpora that are taken to represent the population of “general English,” such as the British National Corpus (BNC) and the Corpus of Contemporary American English (COCA). So, for example, how should we account for the fact that government is one of the most common nouns in the BNC, or that problem is one of the most common nouns in COCA? Are these meaningful generalizations about general English? Or do they reflect the composition of those particular corpora?
These same kinds of questions can be asked at a more specific level. So, for example, what does it mean that court is one of the most common nouns in the spoken part of COCA? Should that be interpreted as a generalization about spoken language in English? Or about the particular sample of speech included in COCA?
The point here is not to criticize the representativeness of large corpora. These are amazing resources, allowing countless research projects that would not be done otherwise. Rather, the methodological point relates to the end user, who needs to be trained in an awareness of corpus design, composition, and representativeness—providing the foundation for meaningful interpretations of quantitative findings from corpora.
More Awareness of Research Design Considerations
A related consideration is awareness of alternative research designs, and the implications of design for the types of research question that can be investigated. I’ve written about these considerations several times, but many end users continue to be unaware of the issues. For me, the main problem here is that the “texts” are invisible in most corpora. So, because of the availability of preconstructed corpora, and the widespread utilization of concordancing software, most end users do not even consider the possibility of a research design where each text is an observation (the text-linguistic perspective—see above). As a result, the only quantitative findings that most end-users see are frequency counts for the entire corpus (or subcorpus). Because I’ve written about this in other places (see Biber & Jones 2009), I won’t go into a detailed discussion of the implications here.
But in brief, this design choice makes it difficult to incorporate analyses of dispersion (see below), and it ends up favoring research questions about the proportional preferences for a linguistic variant (the variationist perspective) rather than text-linguistic questions that describe texts and varieties in terms of the rates of occurrence for linguistic features. Although it might seem to be arcane, this distinction actually matters. The two design types answer different kinds of research questions. But researchers who are unaware of the distinction have sometimes come to inaccurate conclusions, assuming that their quantitative findings allow them to answer questions from both design types.
For example, ’s-genitives compose c. 32 percent of all genitives in conversation, but they account for only c. 7 percent of all genitives in academic writing (see Biber et al. 1999:302). Researchers have looked at quantitative findings like this and concluded that ’s-genitives are more common in conversation than in academic writing. This is an inappropriate conclusion, because it tries to give a text-linguistic interpretation to data obtained from a variationist perspective. In this case, the conclusion is in fact incorrect.
The percentages presented above represent proportional preference: a large proportion of all genitives (including both of- and ’s-genitives) are ’s-genitives in conversation, while only a small proportion of all genitives are ’s-genitives in academic writing. But despite that fact, it turns out that ’s-genitives are actually more common in academic writing than in conversation! This is because genitives overall (both of- and ’s-genitives) are relatively rare in conversation (only c. 2.5 per 1000 words of text). So, even though 32 percent of all genitives (in raw figures) are ’s-genitives (the variationist perspective), the actual rate of occurrence is quite low at only 0.8 per 1000 words (the text-linguistic perspective).
In contrast, genitives overall are extremely common in academic writing (c. 37 per 1000 words of text). As a result, ’s-genitives have a higher rate of occurrence in academic writing (2.5 per 1000 words), even though they represent only a small proportion of all genitives. The methodological point is that research design matters.
More Accountability in the Reporting of Quantitative Findings
This is a related consideration—in general, we all need a greater understanding of what the numbers mean when we report them in our research. And the first step for that is to understand the research design that we are employing. But there are two additional considerations here.
First of all, we need to report measures of dispersion in addition to central tendencies. Normal practice in corpus research studies is to report simple descriptive statistics—but in general, those include only simple frequencies or rates of occurrence for the entire corpus. This is another consequence of our reliance on concordancing software and preavailable corpora; because the texts are invisible, we tend to assume that the corpus is just a homogenous blob. And so we also assume that a frequency count represents the extent to which a feature is used throughout the corpus. This assumption is completely unwarranted. Rather, a high rate of occurrence for a corpus could reflect extremely frequent use in just a single text, coupled with minimal use in the other texts. Analyzing the extent to which a feature is dispersed across a corpus will help us know the extent to which a quantitative pattern is generalizable. Ideally, this would be done with a research design that recognizes the texts in a corpus, so we’d have a rate of occurrence in each text, and a mean and standard deviation across all texts. But such analyses are difficult with concordancing software.
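Both quantitative points made above can be checked with a short Python sketch. The genitive figures are the approximate ones quoted in this answer (from Biber et al. 1999:302). The two five-text toy corpora, their counts, and all function names are invented for illustration and are not taken from any real corpus or tool.

```python
from statistics import mean, stdev

# Approximate figures quoted in the interview (Biber et al. 1999:302):
# register -> (share of all genitives that are 's-genitives,
#              all genitives per 1,000 words)
genitives = {
    "conversation": (0.32, 2.5),
    "academic writing": (0.07, 37.0),
}

def s_genitive_rate(share, overall_rate):
    """Rate per 1,000 words = proportional share x overall genitive rate."""
    return share * overall_rate

for register, (share, overall) in genitives.items():
    rate = s_genitive_rate(share, overall)
    print(f"{register}: {share:.0%} of genitives, {rate:.1f} per 1,000 words")

def rate_per_1000(count, n_words):
    """Normalized rate of occurrence per 1,000 words."""
    return 1000 * count / n_words

def describe(texts):
    """texts: list of (feature_count, word_count) pairs, one pair per text."""
    per_text = [rate_per_1000(c, n) for c, n in texts]
    return {
        # What a whole-corpus frequency count reports.
        "corpus_rate": rate_per_1000(sum(c for c, _ in texts),
                                     sum(n for _, n in texts)),
        # What a text-based design adds: central tendency and dispersion.
        "mean": mean(per_text),
        "sd": stdev(per_text),
        "texts_with_feature": sum(1 for c, _ in texts if c > 0),
    }

# Hypothetical toy corpora: five 2,000-word texts each, 50 hits in total.
even = [(10, 2000)] * 5
skewed = [(50, 2000), (0, 2000), (0, 2000), (0, 2000), (0, 2000)]

for name, texts in (("even", even), ("skewed", skewed)):
    s = describe(texts)
    print(f"{name}: corpus rate {s['corpus_rate']:.1f}, mean {s['mean']:.1f}, "
          f"SD {s['sd']:.1f}, texts with feature "
          f"{s['texts_with_feature']}/{len(texts)}")
```

With these inputs the ’s-genitive rates come out at roughly 0.8 and 2.6 per 1,000 words (the c. 2.5 reported for academic writing is consistent with this, given the rounded inputs). The two toy corpora share the same corpus-wide rate of 5 per 1,000 words, but only the standard deviation and the count of texts containing the feature reveal that in the second corpus every occurrence comes from a single text.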
So a simple alternative would be to split the corpus into several subcorpora, and then use concordancing software to analyze and compare the rates of occurrence across subcorpora.
The second essential methodological consideration here is the need to employ simple tests of statistical significance and statistical measures of strength. It’s actually hard to believe that corpus linguists have persisted for forty years of technologically sophisticated research coupled with our practice of simply reporting frequencies and rates of occurrence. (I live in a glass house here!—There are hundreds of quantitative findings reported in the Longman Grammar of Spoken and Written English [LGSWE], with no measures of dispersion, no significance testing, and no measures of statistical strength.) There are a few corpus linguists advocating the use of advanced multivariate statistical techniques, and I support that effort. But I think the far more important consideration is to change the normal way we do business, so that descriptive statistics include measures of dispersion as well as measures of central tendency, and at least univariate tests of statistical significance and measures of strength are routinely reported with corpus findings.
More Accountability in the Linguistic Interpretation of Quantitative Findings
In a sense, this point also boils down to the need to bring texts back into corpus analysis. For me, quantitative findings should always be interpreted, explained, and illustrated with examples from actual texts. But I’m surprised by how often that does not happen. In this regard, the studies that employ the most sophisticated statistical analyses are sometimes the most likely to be problematic—reporting the quantitative results of regression analyses or other advanced statistical analyses as the end product.
My main point here is that we are engaging in linguistic description when we do corpus analysis—and so the final step must always be interpretation and illustration in linguistic terms. This point applies to all corpus analyses. If I see a simple frequency-ordered list of the most common words in a corpus, I want to know why those particular words are frequent. And if I see a regression model showing that one factor is especially important in predicting the use of a linguistic variant, I want to know why that’s the case—I want examples of those structures; and I also want to understand the functions of the other interacting factors that were shown to be less important. The bottom line here is that I think corpus analyses should in the end always return to their linguistic roots and consideration of language use in actual texts.
The Development of New Methods for the Comparison of Type Distributions across Corpora
I see this as a huge problem for corpus research—and one that is almost never recognized as a problem. We have developed solid methods for comparing rates of occurrence for linguistic features across corpora. But to date, we do not have adequate methods for comparing type distributions across corpora. There are many of us who are interested in research questions that involve type distributions. These are research studies like listing the 1000 most frequent words in English; or listing the 50 most frequent lexical phrases in academic writing; or comparing the list of most frequent words in student writing to the most frequent words in textbooks; or concluding that a student could understand 90 percent of the discourse in a corpus by knowing a set of 2000 words. Such studies all involve analysis of the set of different “types” in a corpus—different words, phrases, constructions, or whatever. The problem is that we don’t really understand the quantitative properties of type distributions in relation to corpus design parameters.
Corpora obviously differ in their total size (number of words). But they also differ in the number of texts; the lengths of texts; and the variability among texts. As a result, it is difficult to interpret the results when we compare type distributions across two corpora. We are almost certain to find differences in the most frequent types and in the number of different types between any two corpora. But do those differences really represent meaningful differences? Or do they simply represent differences having to do with different corpus designs? (And once again, I live in a glass house here!—just see my research on lexical bundles . . . [e.g., Biber & Barbieri 2007; Biber, Conrad & Cortes 2004; Biber 2009]).
In fact, we don’t even know much about the stability of type distributions across subcorpora sampled from exactly the same register. In addition to everything else, type distributions reflect the particular topics of texts. When we are talking or writing about different topics, we will use different words and phrases. So, we could compare two subcorpora from exactly the same register, and still find differences in type distributions—just because the specific topics discussed in those texts were different (in addition to any differences in the design and composition of the two corpora).
I don’t have proposed solutions to this methodological problem. But at this point, I don’t think we even understand the problem. In my view, we need more basic research here, documenting the ways in which type distributions are affected by corpus design and composition; and we then need a series of experiments testing the adequacy of new methods designed to address those problems.
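The dependence of type distributions on corpus design can be illustrated, though not solved, with a small simulation. The Python sketch below is an assumption-laden toy: it draws word IDs from a single artificial Zipf-like distribution (weights of 1/rank over a 5,000-word vocabulary), so all three "corpora" come from exactly the same population and differ only in size and in sampling. Real texts add topic and text-length effects that this sketch does not model, and every name and parameter in it is hypothetical.

```python
import random
from collections import Counter

def sample_corpus(n_tokens, vocab_size=5000, seed=0):
    """Draw n_tokens word IDs from one fixed Zipf-like distribution."""
    rng = random.Random(seed)
    vocab = list(range(1, vocab_size + 1))
    weights = [1 / rank for rank in vocab]  # artificial 1/rank weighting
    return rng.choices(vocab, weights=weights, k=n_tokens)

def type_profile(tokens, top_n=50):
    """Return (number of distinct types, set of the top_n most frequent types)."""
    counts = Counter(tokens)
    top = {word for word, _ in counts.most_common(top_n)}
    return len(counts), top

# Three samples from the SAME population; only size and seed differ.
small = sample_corpus(2_000, seed=1)
same_size = sample_corpus(2_000, seed=3)
large = sample_corpus(20_000, seed=2)

n_small, top_small = type_profile(small)
n_same, top_same = type_profile(same_size)
n_large, top_large = type_profile(large)

print("types in 2,000 tokens (sample A):", n_small)
print("types in 2,000 tokens (sample B):", n_same)
print("types in 20,000 tokens:          ", n_large)
print("top-50 overlap, A vs. B:         ", len(top_small & top_same))
print("top-50 overlap, A vs. large:     ", len(top_small & top_large))
```

Because the number of types depends on the number of tokens sampled, the larger sample contains more types even though nothing about the underlying "language" has changed, and the top-50 lists need not coincide. Any difference printed here is therefore an artifact of design and sampling alone, which is the kind of baseline a comparison of real corpora would have to be measured against.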
Summary: Raising the Bar for the General Level of Technical Proficiency
Beyond those considerations, I have a general concern that we are developing a sharply stratified community of researchers in corpus linguistics: a few with highly technical expertise in computer programming, database construction, multivariate statistics, etc., versus the large majority who are minimally informed consumers. So, the typical corpus linguist uses a corpus designed and collected by someone else, with little awareness of how the corpus was collected or what kinds of texts are actually included. The typical corpus linguist carries out a corpus search or analysis using an available tool, and so asks the research questions that the tool is able to answer, and believes the results returned by the tool, with little understanding of how the tool was developed, what the tool is actually analyzing, or even whether the results really address the research questions that they care about. Even worse, the typical corpus linguist reads published papers that employ sophisticated computational/statistical analyses, and simply accepts the reported findings and conclusions, with no ability to independently evaluate the methods and resulting claims.
It’s maybe not realistic to argue that all corpus linguists need to develop skills in computer programming and advanced statistical analysis (although it would be good for the field if these skills were more widespread than at present). But I do think the typical corpus linguist should be at least an informed consumer, able to read and critically evaluate research, with a critical awareness of the composition of the corpora that they are exploring, and a critical awareness of how to understand and interpret the quantitative results from corpus analysis tools.
What is your view on how corpus linguistics fits within the broader realm of linguistics (theoretical or applied)? Has this changed over time?
I think this is a personal issue, with no right or wrong response. Some researchers self-identify as “corpus linguists,” and see corpus linguistics as a separate subdiscipline. I have no quarrel with them taking that stance. But I see myself primarily as a linguist. For me, the corpus is a sample, intended to represent some larger population of language use—and it seems a little odd to brand a subfield based on the fact that it analyzes a sample. I’m happy to be labeled as a “descriptive linguist” (as opposed to a “theoretical linguist”) or an “empirical linguist” (as opposed to an “arm-chair linguist”). And I don’t actively dislike being called a corpus linguist—it’s just that it seems odd to me when I really think about it.
I would argue that empirically analyzing a large and representative sample of natural discourse—a corpus—is clearly the best way to describe patterns of language use. I think this issue would not arise in the natural sciences. For example, could you imagine describing the mating behavior of coyotes without empirically observing their behavior in a representative sample? The problem is that linguists did try to accomplish analogous research tasks for decades in the 1960s-1980s, analyzing linguistic behavior (or at least linguistic knowledge) with no actual sample and no empirical data. Corpus linguistics was a reaction to that practice—and thus we have come to emphasize the sample itself. I think in this sense, corpus linguistics has changed the way we do linguistics; it has become more acceptable to analyze patterns of language use within theoretical linguistics, and analyses of corpora have become the standard way of carrying out such analyses. But this also makes me wonder if we no longer need the label corpus linguistics—because investigating language patterns in a corpus is becoming the normal way of just doing linguistics.
On the topic of large-scale projects that have focused on describing language patterns in large samples of natural language, you have led several major collaborative corpus projects. These have included the grammatical analysis of the Longman Corpus of Spoken and Written English (for the LGSWE), and construction and analysis of the TOEFL 2000 Spoken and Written Academic Language (T2K-SWAL) Corpus (Biber et al. 2004). Can you tell us a little about how you got involved with these projects, and what the process of working with a large collaborative team was like?
I like collaborating with a coauthor and working as part of a collaborative team. But surprisingly, I’m not sure why. I don’t think of myself as a social kind of guy, so I don’t crave being around other people. In fact, I really enjoy long days in the mountains, away from any trail, completely by myself.
But I somehow have ended up doing as much collaborative research as individual research. Part of this has been due to my philosophy of teaching and mentoring PhD students—I believe that the best way to help a student become a productive researcher is to involve them firsthand in your own research projects. This typically has involved data coding, writing computer programs for automatic corpus analyses, and mostly hand-editing the results of automatically annotated corpus output. I believe that students should be listed as coauthors on a paper if they have been active in the research process, regardless of their involvement in the actual interpretation of data or writing the paper itself. But in most cases, these have been genuine collaborations. The first author listed on my books and articles is the person who takes the lead in the analysis, and who actually does most of the writing. But all authors are usually fully involved in the analysis process and help with commenting and revising the paper.
Beyond my collaborations with students, I have also collaborated quite a bit with colleagues—most notably as part of the LGSWE project and the T2K-SWAL project. These both resulted in achievements that could not have been accomplished by an individual. I feel really lucky in both cases to have had the opportunity to work with such dedicated and hard-working teams, including in both cases several “expert” colleagues as well as a much larger team of student workers. There are of course difficulties in managing collaborative teams—people work at different speeds, and adjustments need to be made in the distribution of labor to keep the project as a whole moving ahead. Some of my most difficult personal interactions have involved the redistribution of tasks in large collaborative projects—I could never have been a successful corporate manager! But at the same time, I have to say that the results of these large projects have always been much better because of the collaborative input from a team of researchers—it was not always easy, but it turned out to always be worth the effort.
To wrap up this discussion, let’s turn to one of your most noteworthy projects: the LGSWE (Biber et al. 1999). Now that the LGSWE has been out in the world for a decade, how would you position this work in relationship to descriptive grammar studies more generally? What unique contributions does it offer? What are the next steps in corpus-based, descriptive grammars?
I think the LGSWE continues to be a unique contribution to linguistic research, and it is the work that I’m most proud of in my career. I’m obviously biased—but in my view, the LGSWE accomplishes what no other reference work has even attempted: a comprehensive grammar of a language that gives as much attention to the description of language use as it does the description of grammatical structure. There’s tons of stuff in there that you won’t even find in specialist studies of particular grammatical features. We consistently documented the use of grammatical features and variants in each of four major registers, interpreted those quantitative patterns in functional terms, and provided extensive illustrations of all patterns with authentic examples from those registers. The grammar also consistently adopted a lexico-grammatical perspective, which was again approached empirically, identifying the sets of words that occurred most frequently in association with most grammatical features and constructions.
There are obviously ways in which the LGSWE could be extended and improved. For example, there could be more discussion of the semantic classes of nouns, and more analysis of prepositional phrases as modifiers of noun phrases. We could have carried out more corpus investigations of valency patterns for verbs. And we certainly could have included a more systematic comparison of grammatical use in British English versus American English (versus other national varieties).
I would actually love to be involved in a revision of the LGSWE. The student version of the LGSWE (Conrad, Biber & Leech 2002) has also done well as a textbook used in grammar courses around the world, and I’d also really enjoy the opportunity to revise that book. But publishers seem to have become more conservative over the last decade, and less willing to embark on truly innovative reference works and textbooks. I’m sure there will eventually be future bigger-and-better reference grammars that will take the place of the LGSWE—but they will require a visionary publisher to help get them off the ground.
I’m sure all of us would enjoy seeing that next step for the LGSWE. It’s certainly been a valuable work for me and many others. Doug, thank you for taking the time to share your reflections and visions for the field moving forward. I’ve certainly enjoyed them, and I hope that readers will too!
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
