Abstract

Where there is any perception of digital humanities (DH) within the field of English linguistics, it may be seen as a technical practice preoccupied with digitizing texts and producing digital editions. The one occurrence of digital humanities in JEngL’s archival content is a passing reference to digital editions and the role of editors—in an interview rather than a research article (Grant 2014). Within DH, there is a more vigorous conversation about what DH fundamentally is, alongside creative methodological questions about what it can be. That is because DH—from within—is largely viewed as a methodological challenge, driven by meaningful, even urgent research questions originating not only in the humanities but also in the social sciences which can be most effectively addressed via the development of new digital methods and tools. If DH is a methodological practice, it is in the sense of methods and epistemology: asking and debating how it is that we can know what we need to know, and testing the efficacy of selected digital methods in the service of specific research questions. As a corpus linguist based in a DH center, I present here a view of DH and English linguistics from within both disciplines. My discussion begins with a focus on corpus linguistics, but also includes English linguistics more generally, and linguistics as a whole. I argue that we as linguists should care about DH, not only because much of what we do is DH (even if we do not always recognize it as such), but also because collaborations between English linguistics and DH will be fruitful for all of us.
Research questions in DH are wide-ranging; recent major DH projects that encompass humanities and social sciences include:
An inquiry into power structures and recording practices in intellectual and political correspondence from the sixteenth through eighteenth centuries, analyzing nearly half a million European letters and employing digital network graphs to explore patterns of dissemination across space and time; 1
An examination of how countries worldwide categorize and count ethnicities, how this has changed over time, and what factors relate to that change, combining thirty years of digitized census questionnaires and social and political data from over 200 countries; 2
An investigation into lexicalization trends in the history of English, employing the Historical thesaurus of English digital lexical ontology to identify historical points when specific concepts have been expressible using an unusually large range of words and expressions, and seeking social and cultural explanations for those peaks; 3
The development of a theoretically informed approach to emerging technical and ethical challenges in big data, examining case studies of contemporary big data issues interpreted via theoretical frameworks from archival studies. 4
DH tends to approach resources in the humanities or social sciences as unstructured data, and/or develops humanities-based approaches to structuring, curating, managing, preserving, and sustaining that data in ways that are meaningful to humanities or social science scholars who need to observe and analyze the complexities, ambiguities, and vaguenesses of human experience. Unstructured data come in many forms, including those above, as well as literary texts, 5 oral history interviews, 6 or written historical records (such as records of convicts and sentencing). 7 Such data are very often linguistic, and it is linguistic data that I would like to focus on in this column—whether in the form of the individual written text or the comprehensive body of an author’s works; very large, arbitrary (and messy) text archives; or indeed systematically sampled, well-curated language corpora. In fact, methodologically minded corpus linguists will likely already see similarities between DH as I have defined it and corpus linguistics. Corpus linguistics is very often a methodological challenge driven by research questions about the nature of language; it asks how we can discover what we need to know about the nature of language; relies on samples of natural language data as its object of analysis; and frequently develops new digital tools, techniques, and technologies to address its research questions.
This methodological affinity is precisely the reason that most DH practitioners (or DHers) readily include corpus and computational linguistics within the DH umbrella, and at least some linguists already consider themselves DH practitioners: conversations and Q&A sessions at the Digital Humanities Congress certainly affirm this affinity. As the European Association for Digital Humanities points out, when the association was formed, “linguistics was understood as one of the main pillars of humanities computing”; the association argues that an integration of DH and linguistics must continue, for the sake of both disciplines. 8 It is worth noting here that one of the leading journals in DH, Digital Scholarship in the Humanities, was (until 2015) Literary and Linguistic Computing, one of the key journals in corpus linguistics; a perusal of the table of contents from issues then and now shows significant disciplinary overlap and supports the idea of a broad DH umbrella.
As with data science in general, DH is a particularly valuable practice when dealing with large amounts of information—quantities so large that individual human beings struggle to process them in their entirety. The problem of data size is what led Franco Moretti (2013) to coin the term—in jest at first—“distant reading.” Distant reading has become an essential term in DH, a shorthand for any kind of large-scale text analytics. Moretti, however, described it more specifically in a kind of dialectic with established senses of “close reading” in the humanities: for Moretti, close reading is a kind of open-ended, exploratory “listening” to an individual text as a complete and coherent object; it is this “listening” that “produces” interpretation (Moretti 2013:53). Distant reading, in contrast, “allows you to focus on units that are much smaller or much larger than the text” (Moretti 2013:48–49), that is, to define variables at the level of word, phrase, or discourse unit (smaller than the text), and to investigate them across entire text collections (larger than the text). For Moretti, close reading yields interpretations but does not test them. When I teach corpus linguistics to literature students, I put forward a similar perspective, suggesting that literary study traditionally has the privilege of spending most or all of its time on an exploratory stage and hypothesis formation, whereas corpus linguistics necessarily spends more time on the subsequent stage of testing hypotheses. Thus, a literature student might closely read a small selection of 1960s science fiction texts, exploring each individual text as a whole, in an open-ended way, and “listening” for possible interpretations, allowing the student to put forward conclusions about science fiction, or about the 1960s, or science fiction in the 1960s.
For corpus linguists, those conclusions are hypotheses that can be tested by systematically designing a much larger corpus of 1960s science fiction, which might never be practically readable in its entirety by a human reader, identifying and analyzing only the specific variables relevant to the hypotheses, and comparing the corpus to samples of other contemporaneous fiction, or science fiction from other decades, or some other comparator data set. Coming from different backgrounds, DH practitioners and corpus linguists both understand this step from forming hypotheses to testing them.
Objections to DH still often come from the established humanities perspective of exploratory close reading. One such objection is that the digital (and the quantitative) can never accommodate the vagueness of meaning and vagaries of culture that close reading can. Ted Underwood (who prefers distant reading as the name of his own field of inquiry, rather than digital humanities; cf. Dinsman 2016) has described this objection cleverly:
[. . .] it has become folk wisdom that computers can only handle crisp binary logic. If you tell a computer that a novel both is, and is not, an example of detective fiction, the computer is supposed to stammer desperately and emit wisps of smoke. In reality, the whole point of numbers is to handle questions of degree that don’t admit simple yes or no answers. Statistical models are especially well adapted to represent fuzzy boundaries [. . .] (Underwood 2014:9)
For corpus linguists, arguments around distant reading may almost seem like a red herring. We are well trained in defining linguistic variables, and identifying and analyzing examples of them in (potentially very large) data sets. We rarely, if ever, situate our work in relation—or opposition—to exploratory close reading, “listening” to a text in an open-ended way toward whatever interpretation may come. Indeed, it may sometimes seem to corpus linguists that DH practitioners who employ text analytics methods are torturously re-interpreting corpus linguistics tools in response to this apparent red herring. It may be off-putting to corpus linguists to hear DH practitioners explaining purportedly novel methods that are eerily similar to decades-old corpus linguistic practice. But the recycling, re-interpretation, and adaptation of these techniques are not just a response to the sociology or history of close reading in literary studies; they are also in response to a scope of research questions that is broader than—while overlapping with—that of linguistics. Unlike linguists, many DH practitioners are not trying to understand the nature of language, but rather using language data to understand other humanities and social sciences questions around, for example, the nature of the present and the past, the self and the community, or internal and external worlds, among other areas. 
In some ways, there are parallels in the history of corpus linguistics, when we corpus linguists have appropriated and re-interpreted tools that were originally designed in the field of computational linguistics to meet particular end-user needs and complete specific engineering tasks such as information retrieval; if this is torturous (and it sometimes seems so to computational linguists), and if we sometimes re-present these decades-old techniques as novel, it is because we are applying those pre-existing tools to ask questions about the nature of language, and we cannot, with intellectual integrity, simply transfer them into our field in unexamined ways. So, too, for DH practitioners experimenting with the toolboxes of corpus linguistics.
With colleagues from other (sub-)disciplines, I employ an epidemiology metaphor to situate corpus linguistics within language study more broadly, including DH. At one level, when observing the nature of a disease in a human population, there are social workers and carers who engage with a small number of individuals, and know a great deal about their particular circumstances, peculiar symptoms, life histories, and home environments. This is parallel to scholarly humanities expertise built on close reading of an individual text or an author’s oeuvre as a single, coherent whole, and it is indispensable. Moving up a level, there are local or regional public health practitioners, who cannot possibly know every idiosyncrasy of each individual case, but must instead understand general trends and exceptions in the manifestation of a disease across a well-defined population. This is like corpus linguistics with relatively small, systematically sampled corpora, or DH studies with restricted, well-curated data—and it is equally indispensable. Finally, at the highest level, is national or international epidemiological research, which maps only the largest scale patterns, ignoring sometimes surprisingly large-scale exceptions to trends and necessarily oblivious to individual idiosyncrasies; this is akin to corpus and computational linguistics, or DH research, with very large, very messy data (also indispensable). Corpus linguists and DH practitioners are likely to understand this metaphor in similar ways, and to be able to reflectively situate their work along this continuum, even if—unlike social workers, local public health practitioners, or international epidemiologists—corpus linguists and DH practitioners are likely to move back and forth along this continuum over the course of their careers, or even within a single research project.
If Moretti felt compelled to define his work against a frame of exploratory close reading, corpus linguistics, and English linguistics more generally, have historically felt compelled to define themselves in relation to disciplinary spheres of generative linguistics, sociolinguistics, cognitive linguistics, or computational linguistics—but rarely, if ever, in relation to exploratory close reading. And if exploratory close reading feels like a red herring to corpus linguists, the disciplinary debates in linguistics may very well seem a pedantic distraction to many DH practitioners. It is a common observation in linguistics that some of those intra-disciplinary debates are disappearing among younger scholars; that may be happening in parallel for DH within the humanities. And it may be that any disciplinary barriers between DH and linguistics will dissolve as well.
Of course, there is a danger, given funding landscapes, that humanities or social science scholars—including those in English linguistics—will tack on digital elements to otherwise very strong (and already complete) project proposals, or that funders will favor bids with arbitrarily added digital components. I would argue that this is contrary to the nature of DH as I have defined it: DH begins with humanities or social science research questions and then designs research programs that can best address those questions. If a digital component is an unnecessary accessory, rather than an innovative means of solving a real research problem, then it should be dropped—and that principle should guide both scholars and funders. The DH practitioners that I collaborate with adhere to this principle stringently and decline to participate in projects where they would be a superfluous appendage.
Linguists are indispensable to DH because we have a deep understanding of the unique features of language data, which differentiate language data from other data, even within the humanities. Words, grammatical categories, or semantic or pragmatic elements of language, as data points, have features that differ from those of other data points, such as personal information in eighteenth-century convict records or relationships between early modern correspondents. We should not expect DH practitioners to suddenly rush to investigate selection preferences among partial near-synonyms or changing grammatical meaning in progressive constructions. But linguists’ understanding of semantic relations and grammatical meaning can inform the development of more sophisticated DH tools to address broader research questions. For example, the Linguistic DNA project—a collaboration that included DH practitioners and corpus linguists—investigated prominent discourses across tens of thousands of Early Modern English texts by developing a new model of lexical co-occurrence. That model was grounded in corpus linguistic methodology and modes of thought, including an innovative grammatical baseline for ranking sets of three or four non-adjacent co-occurring lemmas, as a means of representing discursive meaning. 8
In addition, other humanities and social sciences data have features that linguists are not adept at handling. DH scholars often discuss the experience of working with data scientists from other specialist areas, who are surprised and fascinated by the complexity of DH data, and by DH expertise in dealing with both extremely fuzzy categories and messy data. For example, the Intoxicants and Early Modernity project interrogated the increase in production, traffic, and consumption of intoxicants in sixteenth- and seventeenth-century England, collecting data on people, places, objects, organizations, terminology, and events from port books, depositions, inventories, acts, orders, and licenses, and created a new data ontology to query this mass of information in pursuit of its research aims. 9 Such DH expertise in extremely complex data sets from multiple non-aligned sources, which reflect human experience, has led to collaborations between DH practitioners and data scientists from, for example, corporate equity, to navigate complex characteristics of tens of thousands of small businesses alongside corresponding complex variables of pandemic rescue financing. 10 And DH practitioners’ experience and expertise in managing the complexities of humanities data, and building software systems that accommodate those complexities to support humanities and social sciences research questions, can help linguists mine the intersections of linguistic and non-linguistic variables such as grammatical meaning and complex human relationships. Such expertise is particularly strong among DH research software engineers (RSEs). Moreover, DH RSEs can assist linguists in analyzing linguistic data alongside other humanities data such as historical photos, for example, in multimodal analysis of texts and images in newspaper archives.
Indeed, DH has been marked in recent years by ever-expanding multi-disciplinary collaborations between humanities scholars from all sub-disciplines, social scientists, computer and data scientists, statisticians, library and information scientists, and RSEs. Linguistics, too, has benefited from such collaborations.
It may be that DH practitioners hold up the term “humanities” as a point of pride precisely because of this contrast between data scientists from other specialist areas and DH RSEs who specialize in messy, fuzzy data rooted in human experience; or it may simply be disciplinary habit. It may also be that, as English linguistics (particularly in the USA) is increasingly situated within the social sciences rather than the humanities, and as linguistics more broadly may be situated anywhere from social sciences to data science to engineering to medicine, and beyond, the “humanities” in “digital humanities” is itself a term that puts off potential collaborators from linguistics. This would be a shame, as such collaborations have been patently beneficial already.
It would also be a shame if DH practitioners and linguists did not strategize to include each other in their collaborations. The way to do that—as is often the case with burgeoning collaborative work, particularly when it is multi-disciplinary—is to lead with our pressing research questions, and to set aside dedicated time and space to workshop effective approaches for addressing those urgent questions. Conferences offer such an opportunity, and we as English linguists might proactively invite DH practitioners to attend our disciplinary conferences, with dedicated workshops or panels exploring collaborative possibilities. Perhaps even more practical is for us to contact our nearest DH practitioners or DH center and invite them to a closed session to explore potential collaborative projects and learn about and from each other’s work. I expect that the benefits would be significant.
Footnotes
Declaration of Conflicting Interests
The author declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author received no financial support for the research, authorship, and/or publication of this article.
