Abstract
When journalists publish work based on data, they often appear to be working with secondary sources, such as leaked internal corporate communications or information derived from publicly available Internet sources. However, they are relying on a source of information that varies greatly from other secondary sources. Among the differences is the process by which the data is verified, particularly given that datasets are often very large and unprocessed. How, for example, does a journalist determine the authenticity of data such as The Paradise Papers, the largest leak in history, where more than 13.4 million files revealed the workings of the tax haven industry? The issue of authenticity is further complicated by the processes journalists use to prepare data for delivery to a wide audience. In this article, the authors describe how the model of critical reflection (Sheridan Burns, 2002, 2013) can be used to develop data literacy in first year journalism students as the first step in developing their sense of efficacy in dealing with the complexities of data journalism. Using a scenario based on a large, easily accessible dataset, the authors provide a model through which students can come to understand working with data as a core journalism skill. The model draws on Schon’s (1983) theory of reflective practice, which posits that professionals think by doing and on what Schon calls ‘the conversations we have with ourselves’.
Introduction
During the ‘post-industrial era’ of journalism (Anderson, Bell, & Shirky, 2014), digital technologies have led to a rise in availability of data and the development of techniques for processing it into meaningful formats in a timely fashion. As a result, journalists who specialize in interpreting data have become increasingly common, and most journalists use some of the techniques of data journalism as part of their everyday practice. According to the American Press Institute, the consensus among journalists and educators in 2016 was that ‘by the time journalism students graduate they should have some experience with data’ (Sunne, 2016). They will need skills in processing and presenting data as a scientist would, as Philip Meyer (2012) explains:
When information was scarce, most of our efforts were devoted to hunting and gathering. Now that information is abundant, processing is more important. We process at two levels: (1) analysis to bring sense and structure out of the never-ending flow of data and (2) presentation to get what’s important and relevant into the consumer’s head. Like science, data journalism discloses its methods and presents its findings in a way that can be verified by replication.
Meyer stresses the importance of the scientific approach, with its emphasis on transparency and replicability, as journalists filter flows of large quantities of data in a reliable, effective manner and share their findings with an audience.
In this way, journalists will be able to quickly ‘produce transparent, credible and exclusive narratives that can have enormous social and political impact’ (Graham, 2017). But these processes also have an ethical function in a world where addressing the sheer scale of available data, and presenting it as reliable, have become central features of serving the ‘public interest’. Fries (2012) argues:
Information asymmetry—not the lack of information, but the inability to take in and process it with the speed and volume that it comes to us—is one of the most significant problems that citizens face in making choices about how to live their lives. Information taken in from print, visual and audio media influence citizen’s choices and actions. Good data journalism helps to combat information asymmetry.
Information asymmetry presents challenges to journalist and citizen alike, but also the exciting potential to reveal hidden things in the public interest. For instance, data journalists can work in multidisciplinary teams to tackle datasets, particularly large amounts of data, as scientists do, in order to achieve a coherent ‘sense and structure’ and bring it to the audience.
High profile examples have certainly added to the momentum that is driving the growth of data journalism and exploring the potential of ‘collaborative journalism’. The Panama Papers project is investigating a leak of over 11.5 million financial and legal records for publication in a standalone website, and various points of publication in mainstream media outlets. The project, which sets out to expose ‘crime, corruption and wrongdoing, hidden by secretive offshore companies’, won a Pulitzer Prize in 2017 for ‘Explanatory Journalism’ (The Panama Papers, 2017). Meanwhile, in August 2016, The Guardian published the ‘Nauru Files’ report based on analysis of 2116 leaked documents that described ‘assaults, sexual abuse, self-harm attempts, child abuse and living conditions endured by asylum seekers held by the Australian government, painting a picture of routine dysfunction and cruelty’ (Farrell, Evershed, Davidson, & Wall, 2016). Then there was the Paradise Papers project, which began to yield publications in 2017 for a number of media outlets. It relied on an international consortium of investigative journalists to process the largest leak in history with more than 13.4 million files revealing the workings of the tax haven industry (ABC, 2017). These cases all involved large datasets, made available by major leaks, that resulted in investigations into corruption and inhumane activity by government, business and individuals.
A growing body of literature maps the state and impact of data journalism as a development of increasing significance (Fink & Anderson, 2014; Knight, 2015; Parasie & Dagiral, 2013). There is also growing consensus around the importance of data journalism to the legitimacy of contemporary journalistic practice and, indeed, to its sustainability in the post-industrial marketplace (Sheridan Burns & Matthews, 2017; Stoneman, 2015; Tong, 2018). It follows that the training and education of journalists must incorporate digital literacy and familiarity with data in general, and assist some journalists toward becoming specialists in the knowledge and skills that public interest data journalism calls upon (Howard, 2014; Larrondo Ureta & Peña Fernández, 2017; Stoneman, 2015).
Journalists traditionally seek out information and verify its reliability through tracking down primary and secondary sources (Sheridan Burns, 2013). The most orthodox means of verifying a secondary source is to check that it has been published in a rigorous form, while primary sources provide first-person substantiation for stories, such as direct comments from witnesses or official representatives of public and private organizations or businesses. These sources are regarded as highly authentic because they can be verified by the journalist.
But the flow of data is both primary and secondary, and neither. When journalists publish work based on data, for a number of reasons they are relying on a source of information that varies greatly from other secondary sources. How, for example, does a journalist determine the authenticity of data such as a cache of thousands of individual ‘official’ reports of abuse and mistreatment created by Government employees, such as those relied upon by The Guardian for the ‘Nauru Files’? Or the 11.5 million financial and legal records that made up the data relied upon by the journalists working on the Panama Papers? The processes by which the data is edited, interpreted, shaped and filtered alter it into a new form that serves as evidence for the claims made by the journalist. New York Times journalist Chase Davis, an expert in the use of data as a source, described this range of processes and uses as follows: ‘[d]ata journalists report and write, craft interactives and visualizations, develop storytelling platforms, run predictive models, build open source software, and much, much more’ (Howard, 2014, p. 99).
In other words, rather than focusing solely on researching information that has been published by reliable sources, data journalists also interpret, synthesize, analyse and evaluate data in a range of creative modes. Students need to learn to work mindfully with this source from the beginning of their studies, just as they do with primary and secondary sources. Furthermore, they need to understand that data journalism is likely to require them to work collaboratively with other journalists, data scientists and information designers. To prepare for these experiences, students need to understand the limitations of data and become comfortable with manipulating it. Working with statistics is a good way to introduce students to the complexities of data and demonstrate that it is not value free. More complex data journalism can be tackled in later years, enabled by students learning to use data visualization through self-paced software tutorials, such as those provided by Lynda.com.
The most important thing for students to understand in first year is that working with data is as core a skill as interviewing or visual literacy. Students often have anxiety about working with numbers and spreadsheets and need to understand that data is a source that can be taken apart like any other.
Teaching Students to Interrogate Data
Students often think that data is simply numerical: large amounts of intractable, unthinking, value free information. New York Times journalist and Yale University lecturer David Brooks coined the term ‘data-ism’ to describe this kind of thinking, writing:
If you asked me to describe the rising philosophy of the day, I’d say it is data-ism. We now have the ability to gather huge amounts of data. This ability seems to carry with it certain cultural assumptions—that everything that can be measured should be measured; that data is a transparent and reliable lens that allows us to filter out emotionalism and ideology; that data will help us do remarkable things—like foretell the future. (Brooks, 2013)
Brooks asserts that our increased ability to gather data appears to have brought with it a growing reliance on certain dangerous assumptions. One is the assumption that measuring things is inherently useful because the resulting data permits an escape from emotion and value-driven bias in our analyses and interpretations of the present. Another is the assumption that data can permit us to overcome complexity and accurately predict what is to come.
In the twenty-first century, the empirical processes used to generate and analyse data have rapidly become more affordable. They appear to provide certain evidence, in the form of plain ‘facts’, to offset the shortcomings of opinions offered by primary sources (individuals whose perspective may be biased). A growing body of literature highlights the problematic assumption of authority and objectivity on behalf of data, particularly ‘Big Data’, which can be biased like any primary human source (Boyd & Crawford, 2012; Howard, 2014; Schrock, 2017). Howard has argued that even when data is interpreted and represented by ethical and well-meaning journalists, it can still mislead or misinform:
Bad data, biased data, and flawed experiments can and will be used ignorantly or cynically to twist the truth, mislead, or misinform, even by journalists who wish to do the opposite. Even good data and solid research may be misrepresented or mistaken, a risk that will grow if journalists are pushed to create data visualizations or analyses without training in information design, statistics, and social science (Howard, 2014, p. 80).
Journalists need to inspect data and identify how it may have been corrupted, biased or derived from flawed processes of experimentation. They know data is not value free. It has been created by people whose processes deserve investigation, verification and interrogation the same as any other journalistic source. Students need to learn to accurately interpret and represent data in summary forms such as visualizations, and to draw on techniques and understandings that fall outside the traditional domain of media studies.
The skillset required to begin interrogating data is wide-ranging and rapidly changing, and data journalism continues to yield new subspecialties. In addition, journalists will need to collaborate with experts who are capable of interrogating data in ways that exceed their expertise. This means journalists are expected to be increasingly familiar with, and to stay abreast of, the language and processes that gathering and representing data rely on. It is not surprising that research indicates there is a ‘data science skills gap in journalism’ and university-level education is struggling to keep up (Howard, 2014, p. 44). Other organizations have sprung up to fill the gap, such as the USA-based National Institute for Computer-Assisted Reporting (NICAR) which has been operated since 1989 by the not-for-profit organization Investigative Reporters and Editors (IRE). NICAR ‘employs journalism students, and trains journalists in the practical skills of getting and analyzing electronic information’ (NICAR, 2017).
First-year students need to understand the basic concepts and processes that relate to data journalism. Thinking about data from the beginning reinforces for students that it is a core skill, just like interviewing or checking facts. Howard (2014) argues the basics can be grouped into five key areas—data collection, cleaning, analysis, presentation and publishing (p. 58). Today, data journalists carry out these processes by working with a range of tools that save labour, such as web-based technologies that allow more rapid and reliable collection, interpretation and representation of data through sophisticated visualization techniques. Students can be introduced to freely available versions of these, such as open source software and free web services that assist in preparing simple data visualizations. For example, MIT’s SIMILE project has led to the creation of open source ‘widgets’ such as ‘Timeline’ (
It is important to remind students that data is not uniform, and there are no universal tools applicable in all contexts. Furthermore, the tools professionals use often demand expert skills and knowledge, such as the creation of bespoke coding solutions using programming languages that respond to the challenges peculiar to individual datasets. Data journalists also work at expert levels of engagement over longer periods of time, and often create multimodal, interactive content in communication with multidisciplinary teams. These teams may include developers, designers, broadcast designers and project managers, and their processes can take long periods of time. The digital context of publication in which such teams work has, for example, inspired a new energy for the publication of long-form journalism, and for the creation of highly sophisticated, resource intensive multimodal journalism that takes months or years to deliver. The New York Times has innovated to become a leader in delivering multimodal long-form content and published the 2013 Pulitzer Prize winning interactive piece ‘Snow Fall: The Avalanche at Tunnel Creek’ (Branch, 2012). So impactful was the piece on the commercial and creative fates of journalism that ‘snowfall’ is now used as a verb ‘by editors who want to create similarly glitzy and high-profile projects’ (Dowling & Vogan, 2014, p. 9).
While these are significant changes, data journalism is still everyday journalism, and students must understand that through these processes they still aim to find out, report and relate stories for an audience. In a sense, working with data is like interviewing a human source. Students must ask questions of the data and invite it to reveal the answers. But just as a source can only give answers about which he or she has information, a dataset can only answer questions for which it has the right records and the proper variables. Students must consider carefully what questions they want to answer even before they acquire their data, then look for data-evidenced statements to make in the story, gathering any information about the data needed to verify those statements.
Guided by these assumptions, students need to think about the data from within a strong critical framework, and to interrogate the data based on questions that are meaningful. Emily Bell of Columbia University argues that working with data is about this process of critical thinking and discovery and concludes ‘that’s something which is actually available to all journalists’ (cited in Howard, 2014, p. 58). In other words, the tools may have changed, but good journalism still relies on the same determination to tell stories ethically and fairly. Students need to ask questions of the data source such as ‘Is this data ethically framed?’, ‘What processes were used?’, ‘What processes am I using?’ and ‘Why?’.
In the end, the data journalist is tackling problems that journalism has always faced. For example, data always comes from questions that are framed somewhere, somehow, as part of the human process of interpretation. Using the model of critical reflection (Sheridan Burns, 2002, 2013), students first need to consider the analytic processes that yield the data.
This model is based on Schon’s 1983 and 1986 books considering the ways various professional groups exercised their professional knowledge. He called this process ‘thinking in action’. Schon (1986) found that it is common for professionals to find it difficult to articulate explicitly what is implicit in their practice. Applied to journalism, the conscious use of critical reflection provides a structure by which decision-making skills are learned along with, and as part of, writing and research skills. Journalism requires active learning, critical and creative thinking. Journalists gather information of significance to the task at hand, assessing its credibility and its validity. In writing a story that is at once ethical, accurate and attractive to the audience, journalists are held to high standards of thinking. Reflection is the bridge between journalism theory and professional practice.
It is through critical self-reflection that journalists develop self-reliance, confidence, problem-solving abilities, cooperation and adaptability while simultaneously gaining knowledge. Reflection is also the process by which journalists learn to recognize their own assumptions and understand their place in the wider social context (Sheridan Burns, 1996, p. 95). The model is well suited to consideration of one of the basic issues in working with data, namely, that it can reflect biases depending on how it is manipulated.
Often journalists start with a summary of data, but it is a good idea to provide students with an accessible set where they can see all the variables and records in the database rather than only the subset that could answer the questions for the immediate story. Having access to the full data allows them to answer new questions that may come up in reporting, and even produce new ideas for follow-up stories. They should also have access to the ‘data dictionary’ that explains any codes being used by particular variables. Below, we illustrate our approach to teaching students to interrogate data, and the model of critical reflection, by introducing an example of data analysis and presentation where a careful process of data collection and cleaning has been executed: the 2016 Australian Population Census.
Excerpt from Population Summary Data—Australia (ABS 2017)

A Model for Working with Data
The 2016 Australian Bureau of Statistics Population Census (ABS, 2017) provides a useful dataset with which to demonstrate working with data. The statistics in Table 1 are drawn from the summary table of a subset of data from the census results published in 2017. These are national statistics relating to the whole of Australia. The ABS is an authoritative source of statistics, but media coverage of the results of the census, released in August 2017, shows how interpretations of data can vary. For example, The Conversation published 30 stories based on different aspects of the national statistics (Ketchell, 2017). A census is an important source of information about trends in society, but there is so much data that not all stories are told, and this makes it a rich source of ideas for students.
A good place for students to begin is by reflecting on the statistics for their own suburb in the Australian census dataset, or its equivalent wherever they are. It reminds them that census statistics are about people like them, and that we can all be reduced to numbers even though everyone has a story to tell. It also reminds them that the average person exists only statistically. Ask them to ask themselves:
What stands out to me? What does the average/mean reveal? Is this what I expected?
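The question of what an average reveals can be made concrete with a short calculation. The following is a minimal Python sketch, assuming a set of invented weekly household incomes (they are not census figures), showing how a single very high value pulls the mean away from the median.

```python
from statistics import mean, median

# Hypothetical weekly household incomes in dollars, invented for
# illustration only. They are NOT taken from the census.
weekly_incomes = [650, 800, 950, 1100, 1300, 1450, 1600, 1900, 9000]

mean_income = mean(weekly_incomes)
median_income = median(weekly_incomes)

# Count how many households actually earn at least the mean.
at_or_above_mean = sum(1 for income in weekly_incomes if income >= mean_income)

print(f"Mean:   {mean_income:.2f}")
print(f"Median: {median_income}")
print(f"Households at or above the mean: {at_or_above_mean} of {len(weekly_incomes)}")
```

In this invented set only one household in nine earns as much as the mean, which is one way of showing students why the 'average' household may describe almost nobody.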
Asking students to create a basic visualization of part of the data, such as a pie chart, provides the opportunity to reflect on whether the visualization reveals something that the numbers did not. Ask them to reflect on their expectations and assumptions about the data. Usually there is a reason why they are drawn to a particular data subset, and documenting these initial thoughts helps students to identify bias and reduce the risk of misinterpreting the data.
The dataset provided in Table 1 contains just 10 items from the summary national data collected in the 2016 Australian Population Census. A dataset of this size allows students to break down and ‘interview’ each of the 10 items. Of course, any summary set from the census could be used to engage students. It is a summary of a summary prepared by the Australian Bureau of Statistics, so it is not nearly the whole story. Students need to understand the basics in reading data, such as the difference between the mean and median. Students need to interview the data if they are to use it as a source of stories. Do the median figures tell the whole story? For example, the median weekly household income is $1438, but how can you work out how much of that is going on the mortgage or rent? First you need a formula to turn the weekly rent into a number that can be compared to the mortgage payments. One way is to multiply the weekly rent by 52.14 (365 days divided by 7 days per week) to give an annual figure, which can then be divided by 12 if the mortgage payments are reported monthly. How does the situation for renters compare with those paying mortgages?
The summary data about marital status in the dataset is a good example of why you do not take statistics at face value, and of the importance of reflecting on them first. This dataset sets the parameter of ‘aged 15 years and over’, which could actually skew your reading of the data. As the minimum age for marriage in Australia is generally 18, it is likely that there are children in the ‘never married’ category. To know more, students would need to sort the data to remove the youngest age group. Manipulating this data in smaller age increments should also give insight into the age by which most people marry, and whether there is an increase in older people who never marry. They would need to return to the full dataset to extract information about the number of people in same sex relationships, who may currently be counted as ‘never married’. A sample set of questions that students can ask themselves in relation to Table 1 is outlined in Figure 1.
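The rent conversion described above can be sketched in a few lines of Python. This is an illustrative sketch only. The weekly income of $1438 comes from the text, but the rent and mortgage values are hypothetical placeholders to be replaced with the figures in Table 1, and the code assumes that mortgage repayments are reported monthly.

```python
WEEKS_PER_YEAR = 365 / 7   # about 52.14, the multiplier given in the text
MONTHS_PER_YEAR = 12


def weekly_to_annual(weekly_amount):
    """Convert a weekly dollar amount to an annual amount."""
    return weekly_amount * WEEKS_PER_YEAR


def weekly_to_monthly(weekly_amount):
    """Convert a weekly dollar amount to an average monthly amount."""
    return weekly_to_annual(weekly_amount) / MONTHS_PER_YEAR


def share_of_income(cost, income):
    """Return cost as a fraction of income (both for the same period)."""
    return cost / income


median_weekly_income = 1438            # from the text
hypothetical_weekly_rent = 350         # placeholder, NOT a census figure
hypothetical_monthly_mortgage = 1800   # placeholder, NOT a census figure

monthly_income = weekly_to_monthly(median_weekly_income)
monthly_rent = weekly_to_monthly(hypothetical_weekly_rent)

rent_share = share_of_income(monthly_rent, monthly_income)
mortgage_share = share_of_income(hypothetical_monthly_mortgage, monthly_income)

print(f"Monthly rent: ${monthly_rent:,.2f}")
print(f"Rent as share of income: {rent_share:.1%}")
print(f"Mortgage as share of income: {mortgage_share:.1%}")
```

Dividing one median by another gives only a rough indication, because the median renter is not necessarily the median earner. That caveat is itself a useful prompt for asking whether the median figures tell the whole story.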
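The effect of removing the youngest age group can be illustrated with a small Python sketch. The age bands and counts below are invented for illustration and are not census figures. The point is only to show how the ‘never married’ share changes once the youngest group is excluded.

```python
# Hypothetical counts per age group, invented for illustration only.
# They are NOT taken from the census.
rows = [
    {"age_group": "15-19", "never_married": 990, "total": 1000},
    {"age_group": "20-34", "never_married": 1400, "total": 2500},
    {"age_group": "35-54", "never_married": 600, "total": 3000},
    {"age_group": "55+", "never_married": 210, "total": 3500},
]


def never_married_share(records, exclude=()):
    """Share of people never married, skipping any excluded age groups."""
    kept = [r for r in records if r["age_group"] not in exclude]
    total_people = sum(r["total"] for r in kept)
    total_never_married = sum(r["never_married"] for r in kept)
    return total_never_married / total_people


share_all = never_married_share(rows)
share_without_youngest = never_married_share(rows, exclude=("15-19",))

print(f"Aged 15 and over:         {share_all:.1%}")
print(f"Youngest group removed:   {share_without_youngest:.1%}")

# Breaking the figure down by age group shows where the 'never married'
# population is concentrated.
for r in rows:
    print(f"{r['age_group']:>6}: {r['never_married'] / r['total']:.1%}")
```

With these invented numbers the headline share falls from 32 per cent to under 25 per cent when the youngest group is removed, which shows students how a parameter set by the data collector can shape the story.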
Introducing Visualizing Data
In order to make sense of data, journalists often need to visualize it. The question is not whether journalists need to visualize data or not, but which kind of visualization may be the most useful in which situation: ‘Visualization is critical to data analysis. It provides a front line of attack, revealing intricate structure in data that cannot be absorbed in any other way. We discover unimagined effects, and we challenge imagined ones’ (Aisch, 2012).
It is said that ‘data is not information, information is not knowledge and knowledge is not understanding’, and tables alone would not give you an overview of a dataset. Tables do not allow us to immediately identify patterns within data. Equally, it is unrealistic to expect that data visualization tools and techniques will provide ready-made stories from datasets. It makes more sense to look for ‘insights’ which can become stories with further investigation. At the introductory level, visualization need not be complex or require advanced software skills. For example, tables are very powerful when you are dealing with a relatively small number of data points. They are useful to demonstrate one-dimensional outliers (such as the top 10), but they are poor when it comes to comparing multiple dimensions at the same time (for instance population per country over time). Charts, in general, allow mapping dimensions in data to visual properties of geometric shapes. Line charts are especially suited for showing temporal evolutions, while bar charts are perfect for comparing categorical data. Chart elements can be stacked on top of each other. Graphs show the interconnections between different data and are very good at illustrating changes over time. Moreover, like other forms of narrative journalism, data visualization can be effective both for breaking news (quickly imparting new information like the location of an accident and the number of casualties) and for feature stories, where it can go deeper into a topic and offer a new perspective. Asking students to visualize a subset of their data using the ‘Tables and Charts’ function in a word-processing program is a good introduction to how visualization can yield different perspectives.
Conclusion
Data is neither a primary source nor a secondary one. It is ‘live’ in the sense that you can ask it questions and arrive at answers, unlike secondary sources, which must be taken at face value. Data is also as significant and as valid as primary and secondary sources. But it is not value free. It has been created by people whose processes deserve investigation, verification and interrogation the same as any other journalistic source. Someone has set parameters on the information collected, which in turn affects what is presented. Data can be a powerful source of research information for journalists who know how to read and interpret it, and how to use the many tools available to visualize it and make it more easily understood.
These are not specialist skills to be learned after grasping the ‘basics’; such skills are as basic a part of the journalist’s tool kit as asking questions. A capacity for effective collaboration with multidisciplinary teams and teams of other journalists, in ways that move outside traditional industrial settings, is also becoming more commonplace. Contexts and formats of publication that are emerging and gaining attention, such as high profile collaborative journalism that includes multiple media outlets and teams of investigative journalists, and multimodal long-form journalism, are bringing a new energy to data journalism. Such practices produce impactful publications that have a genuine role to play in renewing the public interest function of journalism and are sure to play a role in the future life of graduates. Providing a process through which students can begin to think through the issues associated with working with data is an important first step in their developing self-efficacy in their capacity to use data as a basis for reliable, ethical journalism that serves the public interest.
