Abstract

This Special Issue of Discourse & Communication contains five articles which examine corpus-based approaches to the analysis of media discourse. These articles have been chosen because they represent advances in the growing and somewhat diverse field of corpus assisted (critical) discourse analysis. While each article is based around the analysis of a newsworthy topic or concept (Hurricane Katrina, Edward Snowden, the killing of Lee Rigby, nuclear proliferation, masculinity), these are not merely five ‘by-the-numbers’ analyses which simply aim to tell us how the media represented that topic or concept. Instead, there is a focus on methodological innovation within the field and techniques that could be applied to other studies. Potts et al., for example, examine how grammatical and semantic annotation can be used in order to identify (changing) news values. Along similar lines, MacDonald et al. consider how discursive strategies like normalisation and reification can be identified using corpus methods, while Branum and Charteris-Black show how keywords can be grouped into categories which demonstrate reporting style, referential strategies and personalisation. Baker and Levon’s article is more evaluative of corpus techniques, involving an experiment where a corpus analysis is compared against a qualitative analysis of a down-sampled set of articles from the same corpus. Finally, McEnery et al. discuss how corpus analysis can be carried out on a corpus of tweets, taking into account issues like non-standard spellings and repetition brought about by retweets which may skew frequencies. These articles ought to reach audiences that go beyond those who are only interested in the news stories themselves, as they all have something useful to tell us about how corpus assisted discourse analysis can be carried out effectively.
The application of corpus linguistics to discourse analysis was first mooted in the early 1990s in ground-breaking research by Caldas-Coulthard (1993, 1995) and Hardt-Mautner (1995) among others. Such research involved the application of what are now commonly recognised techniques among corpus linguists: analysis of frequency lists, concordances and collocates. As corpus tools became more sophisticated, the range of analytical techniques expanded. For example, Mike Scott’s WordSmith Tools allowed users to create, among other things, keyword lists – a special kind of frequency list based upon comparing one corpus against another ‘reference’ corpus in order to identify words which were statistically more frequent than expected in the corpus under analysis. The development of software which can automatically attach tags that give grammatical and/or semantic information about words or phrases (e.g. Wmatrix – see the contribution by Potts et al. (pp. 149–172)) has also opened up new analytical avenues. Another piece of software, Sketch Engine, used by McEnery et al. (pp. 237–259), expands on the notion of lexical collocation to consider frequent or salient grammatical relationships between words; for example, when a particular verb collocates with a noun, does it place the noun in subject or object position? Sketch Engine thus has a useful application for considering different representations of agency in corpus data.
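To make the notion of a keyword list concrete, the following is a minimal sketch of how keyness might be computed. It assumes the log-likelihood (G2) statistic, which is one of several measures used for this purpose, and a cut-off of 3.84 (the chi-square critical value for p < 0.05 with one degree of freedom). The function names `log_likelihood` and `keywords` are hypothetical and are not taken from WordSmith Tools or any other package mentioned here; the sketch is illustrative rather than a description of how any particular tool works.

```python
import math
from collections import Counter


def log_likelihood(a, b, c, d):
    """Log-likelihood (G2) for a word occurring `a` times in a study corpus
    of `c` tokens and `b` times in a reference corpus of `d` tokens."""
    # Expected frequencies if the word were equally common in both corpora.
    e1 = c * (a + b) / (c + d)
    e2 = d * (a + b) / (c + d)
    g2 = 0.0
    if a > 0:
        g2 += a * math.log(a / e1)
    if b > 0:
        g2 += b * math.log(b / e2)
    return 2 * g2


def keywords(study_tokens, reference_tokens, min_g2=3.84):
    """Return (word, G2) pairs for words that are relatively more frequent
    in the study corpus, sorted from highest to lowest G2.

    The default cut-off of 3.84 is an assumption; a different cut-off
    will produce a different list.
    """
    study = Counter(study_tokens)
    reference = Counter(reference_tokens)
    n_study = sum(study.values())
    n_reference = sum(reference.values())

    results = []
    for word, a in study.items():
        b = reference.get(word, 0)
        # Skip words that are not over-represented in the study corpus.
        if a / n_study <= b / n_reference:
            continue
        g2 = log_likelihood(a, b, n_study, n_reference)
        if g2 >= min_g2:
            results.append((word, g2))
    return sorted(results, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Toy data only; real corpora would contain many thousands of tokens.
    study = "storm storm storm storm flood the the".split()
    reference = "the the the the the a a storm".split()
    for word, score in keywords(study, reference, min_g2=0.0):
        print(word, round(score, 2))
```

With corpora as small as the toy example, no word would pass the default cut-off, which is why the usage example lowers it to zero. In practice the choice of statistic, cut-off and reference corpus all shape which words are reported as key.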
In 2008, an article by Baker et al. spoke of combining critical discourse analysis (namely, Reisigl and Wodak’s (2001) Discourse Historical Approach) and corpus linguistics to form a ‘useful synergy’. And in 2010, the journal Corpora published a Special Issue comprising a series of articles, led by Alan Partington, that used the Corpus Assisted Discourse Studies (CADS) approach with diachronic media data. The two approaches have much in common in their use of similar corpus-based methods, although they differ somewhat in terms of motivations for conducting research – with CADS placing more emphasis on an exploratory approach that is not necessarily inspired by concern over a social problem, and that taken by Baker et al. more likely to incorporate a critical perspective on discourse throughout. The two approaches indicate that there is more than one mind-set with which to approach a corpus in order to carry out discourse analysis, although corpus linguists would tend to agree that their methods are likely to reduce (although not remove) various analytical biases, hopefully to an acceptable degree.
While it has become fashionable in recent years to characterise any collection of texts (however small and opportunistically collected) as a ‘corpus’, people working within the field of corpus linguistics would maintain that a true corpus (as opposed to a dataset) should be thoughtfully balanced and sampled in order to be representative of the language variety that we wish to be able to make reliable generalisations about. The use of large amounts of data, along with dispassionate computational procedures, helps to address Widdowson’s (2004) concern about ‘cherry-picking’ texts or techniques that are convenient if the analyst wishes to arrive at a pre-chosen destination. For example, a list of corpus-derived keywords can stubbornly occupy the analyst’s screen, quietly demanding that those words which were not expected to be statistically salient be concordanced and explained, driving the analysis in unexpected directions. That is not to say, of course, that a corpus analysis completely removes bias. With a growing range of tools and techniques to choose from, as well as various algorithms and their attendant cut-off points, a beginning corpus linguist may not know where to start and might simply stick with the default settings of the first tool they become aware of. Yet using a different tool, a different technique, different settings and cut-offs may produce different results altogether. We thus talk of reducing (to an acceptable degree) rather than removing bias. We want our research to be an analysis rather than a polemic, although, importantly, it needs to retain a sense of humanity, and human analysts need to reflect on the choices they made and how these may have impacted on their findings.
It could be argued that corpus approaches are ideally suited for discourse analysis of media sources, due to the sheer amount of textual data that are generated by news outlets and social media in particular. Attempting to analyse such data from a qualitative perspective can feel like placing one’s hand in a constantly flowing stream, so a corpus approach can enable researchers to cope with otherwise overwhelmingly large amounts of text – often encompassing millions of words. Much of the media discourse that we consume is now already in electronic format and encountered online, offering opportunities for software that can ‘scrape’ the Internet and compile large media corpora relatively quickly. All five of the articles in this Special Issue contain some analysis of news media from the press, and while it has been argued that people are consuming such ‘traditional’ forms of media (e.g. buying newspapers) less frequently than they used to, we do not believe that this spells the death of news. If anything, consumption is transferring to online sources where stories are published almost immediately, with multiple updates throughout the day, rather than daily, so such sites are usually more up-to-date than their print-based counterparts. Additionally, such sources of news are increasingly interactive – many news sites allow readers to comment on stories, while sites like Twitter offer newer ways of interaction, taking the ‘ownership’ of news out of the hands of a small number of privileged and controlling elites and instead enabling opportunities for mass participation. As McEnery et al. argue, however, the relationship between press and social media is porous, so, for example, many prominent journalists have Twitter accounts and their tweets tend to be frequently shared (retweeted) by other users.
There is also a comparative aspect (within and between media) running through this Special Issue, something which corpus linguistics methods are particularly well suited to carrying out. McEnery et al.’s article compares Twitter and press reaction to the killing of British soldier Lee Rigby, while MacDonald et al. examine discourses around nuclear proliferation across newspaper texts and United Nations Security Council (UNSC) resolutions. Branum and Charteris-Black focus solely on the press, but their comparison is made between reporting strategies around Edward Snowden of three ideologically and stylistically very different newspapers: The Mail, The Sun and The Guardian. Potts et al. also focus on the press, but their dimension of comparison is time, showing how news values in stories about Hurricane Katrina changed during the three months which followed the event. Finally, the article by Baker and Levon views the press as a single entity, but then focuses on a comparison of how different types of men (based on distinctions within social class and ethnicity) are constructed within it.
Two of the articles in this collection are specifically concerned with methods of making sense of large amounts of corpus data, incorporating different techniques. Potts et al. discuss the potential for using grammatical and semantic tagging in order to group frequent words or collocates together, allowing more general patterns to emerge. Baker and Levon discuss an experiment whereby frequency criteria were used to down-sample a small set of texts from a larger corpus; these texts were qualitatively analysed, and the results compared against a collocational analysis conducted on the whole corpus. The two techniques yielded overlapping and distinct findings, and are helpful in indicating what a corpus approach can do well but may also miss.
It is important to acknowledge that we should position a corpus analysis of media discourse within a larger remit that also looks outside the corpus to consider issues of social context. A poorly executed corpus analysis is purely descriptive (e.g. feature x occurs y times in corpus z), and there is a danger, particularly when initially approaching corpus linguistics, that we could mistake a table of numbers for an analysis in itself, rather than a starting point. Even an analysis of a concordance table can be limiting if the analyst simply says what they see. Corpus software cannot interpret or explain the contents of a corpus; it can only identify patterns that humans may or may not have already noticed. It is up to the human analyst to make sense of those tables of numbers and concordance lines, to explain what they actually mean and why they mean what they mean. This can often result in the analyst needing to step away from the corpus and consult other sources of information: opinion polls, rules and guidelines relating to what can and cannot be said or written in a society, viewing figures, the political situation, the historical situation, the legal situation. In other words, a whole range of social contexts needs to be considered in order to take a corpus analysis beyond description and into the more interesting realms of interpretation, explanation and social critique, a point which Branum and Charteris-Black make well at the end of their article.
Therefore, it is certainly not the aim of this Special Issue to encourage qualitative analysts to abandon their existing methods. Instead, it is hoped that they will be encouraged to consider taking up some of the principles of corpus linguistics. These involve working with larger datasets, allowing computer software to perhaps shift their focus away from what they expect (or dare I say it, want) to find in their analysis and instead show them patterns that they would not have thought to look for otherwise, as well as incorporating some statistical rigour into their results sections, all of which are ultimately likely to make their findings more convincing.
Footnotes
Funding
The research presented in this paper was supported by the ESRC Centre for Corpus Approaches to Social Science, ESRC grant reference ES/K002155/1.
