Abstract

Christian Jones and Daniel Waller’s Corpus linguistics for grammar: A guide for research is an introductory textbook designed for undergraduates or postgraduates with little or no experience in corpus linguistics, but with a basic knowledge of English grammar. In particular, the book is presented for use in advanced classes on applied linguistics and Teaching English to Speakers of Other Languages (TESOL), though it would certainly not be limited to those areas. It aims to fill a gap in existing pedagogical resources by guiding students through the first steps of exploring language with the help of corpora. The book’s approach is to analyze lexical and morphosyntactic features of corpus data and then to draw conclusions about semantics, pragmatics, discourse, register, and sociolinguistics. The authors carefully underline their emphasis on corpus applications for TESOL, and for English language and literature, but they also address discourse analysis and even data-driven journalism.
The book is composed of three sections, with three chapters each. Each chapter is neatly accessible, with a useful introduction and conclusion for students, a substantial number of activities with example responses, and a list of further reading. Part 1, “Defining Grammar and Using Corpora,” provides a knowledge base for understanding corpus linguistics and grammar, beginning with chapter 1, which wisely addresses the questions of why we use corpora, introduces key terms in corpus linguistics, and illustrates uses of corpora that match a general emphasis in the book on TESOL. Chapter 2 defines “grammar,” with useful references to Carter and McCarthy (2006) and other prominent researchers. Grammar is approached descriptively, and presented in terms of form and function, morphology and syntax, and its integral connection to lexis. Furthermore, grammar is defined as a system for making meaning. Relating to the last point, the book is generally concerned with the relationships between morphosyntax and meaning. The former is seen as a means to understand the latter; both are included within the scope of grammar.
Chapter 3 introduces specific corpora, corpus software tools (including LexTutor and AntConc), and online statistical calculators. The authors have restricted themselves to open-access corpora, which will be beneficial to many students and instructors. Several of Mark Davies’s corpora are presented (including BYU-BNC, COCA, and COHA; http://corpus.byu.edu), but there is also a strong collection of spoken corpora, including MICASE (http://quod.lib.umich.edu/cgi/c/corpus/corpus), VOICE (https://www.univie.ac.at/voice/), and HKCSE (http://rcpce.engl.polyu.edu.hk/HKCSE/). Altogether, this constitutes a reasonable balance of large and small datasets, spoken and written, from around the world.
Part 2, “Corpus Linguistics for Grammar: Areas of Investigation,” puts forward three areas of investigation in corpus linguistics, with one chapter each: frequency, chunks and colligation, and semantic prosody. Chapter 4 explores frequency, and observed corpus frequencies are appropriately contrasted with intuition and established assumptions. Sample exercises reinforce this contrast. Examples illustrate, in a way that is accessible for beginners, the iterative steps involved in frequency investigations, from basic word searches, to considerations of inflectional forms and parts of speech, to examinations of multi-word combinations. The discussion moves from an exploratory stage of lexis in corpora to the development of research questions and the testing of hypotheses against data. For example, an engaging initial exploration of modal verb frequencies in the Prevention of Terrorism Act (UK) gives rise to an empirical question on the relationship between text types (such as legislative documents) and authorial use of modal verbs. A final exercise on the frequency of verb tenses in reported speech is designed to test students’ intuitions about frequency of use, and constitutes a well-defined introductory corpus study.
Chapter 5 introduces “collocations,” “colligations,” and “chunks,” and, like much of the book, emphasizes their importance in TESOL: frequent chunks are presented primarily as being useful for students in TESOL contexts. Examples are drawn from speech and writing, and more meaningful chunks (such as I was wondering) are contrasted against less meaningful ones (such as a break and go check). Again, an array of simple investigations is illustrated, which moves iteratively from simple to complex corpus searches, and from an exploratory stage (including the use of LexTutor to identify five-word chunks) to the formation of research questions related to meaning and use of particular chunks in particular contexts.
Chapter 6, an entire chapter dedicated to “semantic prosody,” is a rare feature that will appeal to many instructors of introductory courses. Semantic prosody is neatly defined as the positive, negative, or neutral semantic “shadings” or “tone” carried by specific expressions, as analyzable in written texts and speech transcripts (it is not related to phonological prosody). The chapter reflects the book’s general approach to corpus data, and grammar, as tools for studying meaning. It includes strong demonstrations of the difficulties in analyzing meaning in corpora, as, for example, with the identification of potentially negative semantic prosody underlying make in expressions like be made to do something. The discussion thus facilitates the introduction of basic semantic and pragmatic analysis.
In part 3, “Applications of Research,” corpus methods are modeled efficiently via an array of excellent activities that are practically executable as small, introductory investigations, but also expandable into much larger studies. Chapter 7 presents corpus studies in TESOL. Examples effectively reinforce the key concepts and methods already presented, though with expanded scope. For example, one excellent sample exercise compares TESOL textbook descriptions of grammatical meaning (such as the meaning of the past simple tense) to corpus examples of grammatical structures, in order to test and challenge established pedagogical assumptions.
Chapter 8 relates to corpora for discourse analysis and data-driven journalism, maintaining the book’s approach of examining lexis and morphosyntax to draw conclusions about meaning. The chapter includes the book’s first explicit outline of research design: from the formulation of research questions, to the identification of lexical and grammatical evidence, the use of an additional data set as a control, and the interpretation of findings.
Chapter 9 concludes by bringing together the types of exercises presented throughout the book, under the umbrella of a very well-explained principled research process. Method and methodology are summarized here in terms of formulating research questions and designing experiments. The chapter is well organized, and will function as a valuable reference for beginning researchers.
The book is an effective, accessible introductory resource that can be confidently recommended to students encountering corpus linguistics for the first time, and can also serve as a stepping stone toward intermediate level corpus linguistics textbooks. The book is largely directed toward using lexical and morphosyntactic queries to learn about variation in semantics, pragmatics, discourse, register, or sociolinguistics. Indeed, the authors carefully state their definition of grammar as intrinsically linked to those other areas. This deliberate approach is valuable given how often students fail to see how or why they might use grammatical annotation or grammatical features in corpora for anything beyond morphosyntactic questions.
The book’s great strength is its large collection of exercises and activities (twenty-two formal and many more informal ones), arranged in manageable steps that will be accessible even to beginners. Many exercises constitute simple but powerful comparisons between linguistic intuition, prescriptive rules, and corpus data, reinforcing the “why” of corpus linguistics and modeling the modes of thought and theoretical underpinnings of corpus linguistic research. These exercises are particularly important for students who easily find themselves lost amidst a labyrinth of tools and techniques, and who thus lose track of the underlying benefits of corpora as sources of linguistic data. With these exercises and the accessible explanations surrounding them, the book fulfills its aim, and most chapters can be endorsed as exploratory discussions spotlighting the key reasons for using corpora.
Along with the extremely effective exercises, most chapters include an insightful section called “Limitations,” reminding readers that the exercises have only scratched the surface of the complex issues at stake in corpus linguistics. This is a wise addition to an introductory textbook, and reflects a thoughtful balance between easing students into the subject, and acknowledging the scope of complexities in the field.
The arrangement of the book raises the question of whether the best pedagogical approach is to explore tools (corpora) first and later to practice formulating research questions based on those tools, or to formulate linguistically meaningful questions first and then to practice selecting appropriate data such as corpora. These alternatives constitute a reasonable pedagogical debate, and for instructors who prefer the former option, this book will be especially useful.
There are a few key concepts in corpus linguistics that are underemphasized in the book, including “representativeness” and “normalization.” Relatedly, statistical analysis is given only summary attention. Representativeness is not mentioned as such, though the authors regularly use the term “context” to describe the nature of the population from which a corpus was drawn. A clear and explicit discussion would have helped avoid the common problem of treating representativeness in general terms, rather than as a relationship between a specific corpus and a particular, larger body of language in use. Normalization is discussed only in terms of normalizing per million words. This is problematic, but not a problem unique to this book, as such normalization is the default for a vast amount of corpus research; indeed, there may not be a pedagogical resource that adequately describes normalization options. Nonetheless, a book on corpus linguistics for grammar would have been a good opportunity to explain alternative grammatical normalizations, including normalizing features per phrase (e.g., per noun phrase), per clause (e.g., per main clause, per subordinate clause), or per word class (e.g., per noun, per verb). Finally, statistical analysis in the book is limited to three calculations (log likelihood, t-tests, and mutual information), and an online calculator is recommended, but there is no clear method presented for selecting an appropriate test, and fundamental concepts like the notion of “occurrence by chance” are underexamined. That said, the book does not aim to teach statistical analysis, and instructors might choose to employ other resources, such as Oakes (1998), to illustrate statistical analysis as it relates to corpus experimental design.
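To make the normalization point concrete, the following is a minimal sketch in Python; it is not taken from the book. It assumes that the raw count of a feature and the size of the chosen base (words, clauses, noun phrases, or verbs) are already known, and it uses the widely used two-corpus formulation of log likelihood. The function names and all counts are hypothetical and serve only as illustration.

```python
import math


def normalize(count: int, base_size: int, per: int = 1_000_000) -> float:
    """Return the rate of a feature per `per` units of the chosen base.

    The base can be words (the per-million default discussed above) or a
    grammatical unit such as clauses, noun phrases, or verbs.
    """
    if base_size <= 0:
        raise ValueError("base_size must be positive")
    return count / base_size * per


def log_likelihood(freq_a: int, size_a: int, freq_b: int, size_b: int) -> float:
    """Two-corpus log likelihood for a single feature.

    Expected frequencies assume the feature is spread across both corpora
    in proportion to their sizes (the "occurrence by chance" baseline).
    """
    total_freq = freq_a + freq_b
    total_size = size_a + size_b
    expected_a = size_a * total_freq / total_size
    expected_b = size_b * total_freq / total_size

    score = 0.0
    for observed, expected in ((freq_a, expected_a), (freq_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            score += observed * math.log(observed / expected)
    return 2 * score


if __name__ == "__main__":
    # Hypothetical counts for one modal verb in a small legislative corpus
    # and in a larger general reference corpus.
    modal_legal, words_legal, verbs_legal = 120, 40_000, 6_000
    modal_general, words_general, verbs_general = 900, 1_000_000, 170_000

    # The same raw counts, normalized against two different bases.
    print("per million words:", normalize(modal_legal, words_legal))
    print("per 1,000 verbs:  ", normalize(modal_legal, verbs_legal, per=1_000))

    # Comparison of the two corpora, with words as the base.
    print("log likelihood:   ",
          log_likelihood(modal_legal, words_legal, modal_general, words_general))
```

Changing the base changes the question being asked: a rate per million words describes how dense a feature is in running text, whereas a rate per verb describes how often the feature is chosen where a verb occurs at all. For the log likelihood score, values above 3.84 are conventionally read as significant at p < 0.05 (one degree of freedom), although selecting an appropriate test remains a separate matter.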
Despite these shortcomings, Jones and Waller’s book effectively fills a clear gap in the pedagogical literature by presenting in an elementary and engaging way the real value of corpora as data sources, and of corpus research in linguistics, for an audience new to the field. As a pedagogical tool, a textbook can be expected to model ways of thinking and practice in the field, and to inspire students to engage more fully with the subject matter. Jones and Waller’s book does just that.
