Abstract
This study explores the level of scrutiny data journalists from national, local, traditional and digital outlets apply to data sets and data categories, and reasons that scrutiny varies. The study applies a sociology of quantification framework that assumes a tendency for data categories to become “black-boxed,” or taken-for-granted and unquestioned. Results of in-depth interviews with 15 data journalists suggested these journalists were more concerned with data accessibility and ease of use than validity of data categories, though this varied across outlet size and level of story complexity.
Most recent studies of data journalism—informational, often graphical accounts of current public affairs based on quantitative data—have focused on the changing normative environment that data work brings to newsrooms, on the phenomenon of big data, and on the behavior of, or input from, online users (Borges-Rey, 2016; Bucher, 2017; Coddington, 2015; Hermida & Young, 2017; Karlsen & Stavelin, 2014; Lewis & Usher, 2014; Lewis & Westlund, 2015; Parasie & Dagiral, 2013). Scholars have paid less attention to the data themselves (Messner & Garrison, 2006, p. 54). An exception is the scholarship on the sociology of quantification, which helps reveal the role that powerful institutions and officials play in the construction of data sets and categories used by journalists, the quiet authority of the numbers themselves and the way statistical measures tend to become unquestioned and naturalized over time. Especially important is the initial construction of data categories, in which discrete, unique events, people and objects are commensurated—in other words, conveyed as similar so they may be aggregated, or calculated together (Espeland & Stevens, 1998). Government agencies do most of this commensuration work for data sets used by newspaper journalists, as they have the necessary human and economic capital. Also, data categories, data measures and data sets are more likely to be produced at the national level because local agencies have fewer resources and less expertise.
It is unclear how often data journalists question how data categories, which are cooked into the data sets, were originally conceptualized and constructed, and why. Lightning-fast processing speed and user-friendly software make it relatively easy for data workers at official source organizations to construct new data indices and categories, and increasingly, data journalists are doing the same. Examples include gross national product and body mass index, but also recent creations such as gun violence costs and government integrity scores. Sociology of quantification scholars note that while such constructs can be effective ways to understand public issue trends if constructed with care, they can also take us further away from real, living incidents and people (Desrosières, 2002; Espeland & Stevens, 1998)—the raw material for narrative reporting. Finally, it is not clear how systematically journalists check the accuracy of data that fill those constructed categories or publicly report on their findings about accuracy.
These dynamics raise a number of questions: To what extent do data journalists question the categories that come pre-built in government data sets? By what processes do journalists vet data sets for quality and report data set limitations? How do powerful institutions and officials shape data journalism content and practice? To what degree, and within what contexts, are data projects based on government data sets, and on national-level data rather than local-level data? This study explores these questions through in-depth interviews with data journalists.
Literature on Data Journalism
Much scholarship on data and computational journalism has explored struggles and negotiation over journalistic control: for example, disagreement between advocates of traditional narrative reporting and advocates of raw data accounts (boyd & Crawford, 2012; Bucher, 2017; Coddington, 2015; Lewis & Westlund, 2015; Parasie & Dagiral, 2013) and how journalists and data experts negotiate authority in news work (Borges-Rey, 2016; Bucher, 2017; dal Zotto, Schenker, & Lugmayr, 2015; Hermida & Young, 2017).
Recently, scholars have turned attention to computational journalism, or the “finding, telling and disseminating of news stories with, by, or about algorithms” (Diakopoulos & Koliska, 2017). Studies have explored the opaque “black box” nature of algorithms—that is, the fact that few understand how algorithms work internally. Findings suggest that a lack of news outlet resources and expertise and the inherent complexity of algorithmic processes make it less likely that data construction processes will be explained to end users. Also, the “value systems” of computational systems emphasize effective “end use” of the system, as opposed to transparency about the system’s processes (Diakopoulos & Koliska, 2017). This means end users are often in the dark, unable to distinguish between human-generated and automated content (Dörr & Hollnbuchner, 2017).
The data and computational journalism literature published over the last few years has paid little attention to the construction of data and data categories (but see Lowrey & Hou, 2018). However, some research in the mid-2000s focused on problems of “poor quality of public sector data” and journalists’ lack of expertise in recognizing and improving this quality (Messner & Garrison, 2006, 2007, p. 55), as well as mathematical errors in reporting (Maier, 2003). These studies noted that while textbooks at the time were beginning to address these challenges, scholars were paying no attention (Messner & Garrison, 2006). Recently, some scholars have called for journalists to question the “naturalized” quality of algorithmic data (Bucher, 2017), and to question assumptions built into data (Gynnild, 2014). Scholars have also explored the impact of limited resources on data quality, as data processes can be expensive (Gynnild, 2014) given diminishing news organization resources (Parasie, 2015), especially at the local level (Veglis & Bratsas, 2017). Larger news outlets and outlets with strong investigative teams tend to have access to more money, time and expertise for data projects (Fink & Anderson, 2015; Uskali & Kuutti, 2015). Smaller local outlets are more likely to choose projects with less depth and projects that rely heavily on official sources (Fink & Anderson, 2015; Knight, 2015; Parasie, 2015). Yet journalists, regardless of outlet, often have had no choice but to use institutionally prepared data (Splendore et al., 2016).
Although journalists, especially at larger outlets, often scrutinize data to see if officials have altered it (Parasie, 2015), journalists are less likely to question the fundamental structure of data sets, such as measurement assumptions underlying categories and indices. Previous data journalism research has found uncritical dependence on official sources who provide these data sets (Cushion, Lewis, & Callaghan, 2017). A recent analysis of data journalism projects in the United States and the United Kingdom found limited reporting of information about potential data set problems and data categories, and a strong preference for government sources and national sources over local sources (Lowrey & Hou, 2018). The relative lack of data expertise and resources in both local news outlets and local source institutions may encourage the use of national and international data sources, to the detriment of local reporting (Fink & Anderson, 2015), and may also contribute to a general lack of dataset scrutiny.
Sociology of Quantification
Historically, the state has been the main producer of large data sets, given the considerable number of required work hours, the high level of expertise and (potentially) expensive technological tools needed to produce them (Porter, 1996). The advent of statistical “commensuration” was a necessary step in the common production and use of statistics. Commensuration, or “the transformation of different [and discrete] qualities into a common metric” that can be aggregated (Espeland & Stevens, 1998), revolutionized administrative labor, allowing efficient production of reports that were relatively easy to read. The original logic behind statistical production is often bound up in political-economic needs, and sociologists of quantification provide numerous historical examples. One such example: The census efforts following the French Revolution derived from the perceived importance of national unity. The government attempted to commensurate unique individuals of varied cultural backgrounds who had never before thought of themselves as similar (Porter, 1986). The nascent French government was then too weak to make these processes work, but in time, these diverse individuals would begin to think of themselves as similar, and national-level political efforts for the sake of producing commensurable, manageable statistics were a driving force behind this conception (Espeland & Stevens, 1998, 2008). Commensuration eases decision-making about the ways resources are to be allotted (Starr, 1987) and allows those in charge to “manage uncertainty, impose control [and] secure legitimacy” (Espeland & Stevens, 1998) and to do this from afar.
However, commensuration can contribute to a skewed tendency to see similarities across people and situations and to overlook differences (Porter, 1986), encouraging generality. Sociologists of quantification note that numerical evidence has symbolic power (Porter, 1986), and over time, the processes by which distinct particulars are commensurated and calculated together tend to become forgotten and opaque. Porter (1996) describes the persistence of commensurated statistical constructs:
Legions of statistical employees collect and process numbers on the presumption that the categories are valid. Newspapers and public officials wanting to discuss the numerical characteristics of a population have very limited ability to rework the numbers . . . Having become official, then, they become increasingly real.
Statistics are born in “the deep recesses of the bureaucracies” and others are not able “to unscramble the categories [and] change the assumptions” (March & Simon, 1993). Assumptions that shaped the original data collection disappear as polished, simplified presentations are created for the easy consumption of higher-ups. Such presentations possess a level of legitimacy that defies contradiction (March & Simon, 1993).
While sociologists of quantification take a critical view of the processes of quantification, they recognize the value of using statistics to identify important trends and patterns. Their point is not to urge avoidance of quantification, which would only “bereave social sciences not only from their analytical power but also from their potential to engage for fairer forms of quantification and coordination” (Diaz-Bone & Didier, 2016). Rather, the point is to encourage those who rely on data to stop seeing statistical composites as “natural,” and to recognize the potential problems of commensuration as well as the political-economic needs that data production serves. These scholars urge caution and awareness about the use of a powerful tool, and the increasing prevalence of data use in journalism makes this advice relevant.
Research Questions
Research suggests that the uncritical adoption of statistical metrics produced by large government agencies tends to go hand-in-hand with a decrease in on-the-ground reporting and local source cultivation, as well as with an adherence to official accounts (Cushion et al., 2017). However, there is no reason to think these tendencies are inevitable or invariable. For example, one recent case study revealed how data journalists reworked map data to be consistent with information developed from traditional “on the ground” narrative reporting (Parasie, 2015). Variability in these tendencies may result from the diverse ways data journalism is defined normatively and practically, and from differences in the level of, and nature of, support for data journalism across news outlets. Uncritical adoption of large, government-produced data sets and categories could be especially likely in news organizations with fewer resources.
The following research questions guide the study:
Method
Over a two-month span, 15 in-depth interviews were conducted with nine working data journalists and six former data journalists in the United States now working for professional organizations or university programs that focus on data journalism. Data journalists from both local and national publications were sampled given the analysis here of local versus national reporting. Also, diversity of region and range in years of experience were priorities in the selection. Both younger data journalists and long-time data journalists—even retired from professional journalism in some cases—were sought. Following Institutional Review Board approval, interviews were conducted by telephone and lasted 30 to 45 min, on average. Interviews were semistructured, guided by pre-planned questions that corresponded with research questions, but allowing for unplanned follow-up prompts and questions on unexpected topics introduced by the respondent.
The nine data journalists interviewed ranged from 3.5 to 41 years in overall journalism experience, and from 3 to 28 years in data journalism experience. All but one had attended more than one data journalism training event, with seven of the nine attending training events annually or nearly annually. Seven were male, and two were female. Four had titles of “reporter,” one was a graphics director, two were data visualists, one was a data/health editor and one was an app developer. Three worked with local news outlets (two newspapers, one TV), three worked with national legacy newspaper outlets and three worked with prominent digital-only news publications. Seven outlets were supported primarily by advertising, and two by foundations. All data journalists worked in the United States, with four from the Northeast, two from the Midwest, two from the Southwest and one from the South.
As mentioned, six additional interviews were conducted with former full-time data journalists who still produced some data journalism work at the time of the interviews. Three of these held full-time positions in professional organizations, and three held full-time positions at university programs. Years of experience in journalism for these six interviewees ranged from 25 to 47 years, and all are active in professional data journalism associations—most have led training sessions multiple times, or regularly. All six worked with newspapers prior to their current positions. Five were male, and one was female. Two worked in the Northeastern United States, two in the Midwest and two in the South. Names of all interviewees and news outlets were kept confidential.
Interviews were digitally recorded and transcribed for coding. A codebook was created, containing pre-determined codes deriving from previous theory on the sociology of quantification and literature on data journalism, and linked with interview questions. (The sociology of quantification approach [Desrosières, 2002; Espeland & Stevens, 1998; Porter, 1996; Starr, 1987] emphasizes the processes involved in constructing datasets and categories, as well as the role of political-economic power [typically government institutions] in this construction. This approach as well as literature on data journalism [e.g., Fink & Anderson, 2015; Knight, 2015] indicates that substantial resources are needed for dataset construction. This suggests the likelihood of a greater emphasis on national-level issues than local issues. Research also suggests the legitimacy of the categories and data sets produced by government sources tends to be taken for granted, thereby limiting scrutiny of the categories and data sets. Codes derived from this literature were used to identify relevant passages from interview transcripts. Codes included the following: Common data sources, Challenges with government data, Influence on data journalism processes from powerful institutions [government, owning corporations, foundations], Local vs. national sources, Journalists’ questioning of data categories, Checking data accuracy, Reporting data accuracy, Processes for checking and reporting data accuracy). Two researchers coded the transcripts. Prior to coding, researchers read through the codebook. For each interview, researchers first read through the entire transcript, for a holistic understanding. Researchers read transcripts a second time, noting (in margin notes) passages and paragraphs that fit with predetermined categories as well as other passages that corresponded with categories that were not predetermined, which emerged from the analysis.
The overall analysis involved “ascending” from the interview details that are grounded in the analytical categories to more general observations and conclusions (McCracken, 1988).
Following Creswell’s (2014) recommendations, two methods were used to encourage trustworthiness of findings (trustworthiness has been used as a qualitative analog to reliability; see Guba, 1981). Two researchers first coded the same three transcripts. They then met to discuss the coding, noting differences and talking through reasons behind the differences. The two researchers then coded the remaining transcripts and discussed differences. As an additional check on coding trustworthiness, a “peer debriefing” was conducted in which a third individual not involved in the coding reviewed a summary of the study’s purpose, read an annotated transcript and provided feedback on the annotated interpretation, leading to minor adjustments in the coding.
Results
It was expected that data journalists would find steeper challenges in using local-level data than federal-level data, and, with some exceptions, this was the case. Many agreed that federal data are more readily available for use than state and local data and that federal agencies are “really good about publishing data and disclosing information about their practices” (News App Developer, personal communication, January 17, 2018), as a journalist with a digital publication said. A data and health editor at a local newspaper noted that the “feds are a lot better about posting things online” and about “letting you know where the data is” (Data and Health Editor, personal communication, January 30, 2018).
Respondents tended to focus on differences in records laws, availability of those records and the challenge of formatting those documents across particular cities, counties and states. As a journalist from a national legacy outlet noted, there are “50 states, 50 different public record laws, sometimes 50 different ways they treat the same data, so it’s definitely more challenging at the state and local level” (Data Journalist, personal communication, January 25, 2018). Another journalist at a national legacy outlet added that “one city government may produce a city government budget that may look entirely different from the next city down the road,” and that “it’s not standardized” (Data Journalist, personal communication, February 1, 2018). A local TV producer agreed: “It’s hard to get any type of aggregate information [from different states] because they’re all different” (Data Journalist, personal communication, February 7, 2018). However, this journalist also added that he sometimes uses differences in state and local data to reveal patterns. A data editor at a local newspaper said state and local governments are worse than the federal government in telling journalists where data can be found, calling these agencies’ responses to his requests “schizophrenic” (Data and Health Editor, personal communication, January 30, 2018). He said he sometimes resorts to submitting or threatening to file Freedom of Information Act requests.
A reporter with an online-only publication said local officials have given her charts but not the raw data because they say they worry about her “messing up the data.” This leads to complications: “We have to copy it by hand . . . instead of them sending it to us [and] it’s possible that we introduce errors along the way” (Data Visualist, personal communication, February 6, 2018).
However, the preference for federal-level data over local data was not unanimous. According to a journalist at a digital-only publication, “there are plenty of examples where the local level has amazing data and it’s actually much easier to get because there’s people there you can contact and there’s fewer steps to go through” (Data Visualist, personal communication, February 6, 2018). She cited an instance of a state worker sending her a data portal with complete data, and offering to walk her through the data. According to another reporter at a digital-only publication, when data are limited to a particular locale, there are fewer ways “that things can go wrong because you’re not worrying about cross-jurisdictional issues . . .” (News App Developer, personal communication, January 17, 2018).
One local reporter stressed that the federal government increasingly limits what data journalists can get their hands on, and that this tendency is nonpartisan: “It started with the Bush Administration, got worse during the Obama Administration, and what happens is that local state and local government take their cue from Washington . . . [and state and local officials] are taking things offline that are normally public” (Data Journalist, personal communication, January 18, 2018). This same reporter added that federal data may be 1 to 2 years old. However, the reporter noted that while local data are often more up to date than federal data, technological and user constraints can restrict the types of local/state data he can get, while federal data are “either already digital or very easily digitized so you can manipulate it” (Data Journalist, personal communication, January 18, 2018); he added that at the local/state level,
I get a lot of PDFs. Then I have to go back and say, “I want it in its original electronic form,” and they say, “It’s a proprietary system and we can’t give it to you that way”—and I say, “It’s all ones and zeros and it can’t be that hard.” That’s the challenge.
Many respondents misinterpreted the question, even after follow-up clarification, suggesting that category construction was not an issue that was much on their minds. Several answered in terms of how forthcoming the source was about data—that is, the completeness of the sample or of data entry rather than the underlying logic of the categories. A local journalist interpreted the question as a problem of NGOs advocating through the use of incomplete data sets—for example, data sets that were misleading because they ignored rival explanations: “Law firms will . . . put out a news release that says [the city] has the most crashes out of any other city in the state” (Data Journalist, personal communication, February 7, 2018) and fail to account for factors like flooding events. One local journalist responded by recounting problems with missing information from government agencies about sexual harassment: “There is a separate file for discipline—they gave it to us, but then they redacted all the names” (Data and Health Editor, personal communication, January 30, 2018).
In only a few cases did responses indicate that data journalists were actually questioning the way data categories were constructed, or scrutinizing “approaches to producing data” (Professional Association Director/Former Data Journalist, personal communication, January 30, 2018). In all cases, these were data journalists with prominent national outlets. A journalist (Data Journalist, personal communication, January 25, 2018) with a national legacy outlet said data journalists at his outlet always questioned the logic behind data categories:
Even if it’s not questionable, we have to know how did you gather this information, whoever you are . . . even if it’s somebody we’re inclined to trust, like the Census Bureau. We ask what was the question that was asked of whom? How was the survey administered?
A journalist with a national-level online publication said data journalists were encouraged to look at documentation for the parameters of data sets, but that this was not always possible: “You can get documentation [for category construction], but that can be out of date” (News App Developer, personal communication, January 17, 2018). An association director said processes for checking data limitations and caveats are commonly taught at their national conventions—conventions that journalists from large news outlets are most likely to attend (Professional Association Director/Former Data Journalist, personal communication, January 12, 2018).
There is some differentiation across outlets for how these journalists check the accuracy of their own work, and this seems to depend on the size and scope of the news organization. According to one local news outlet reporter, “There’s only one editor here that has the same level of skill I do [and], at least over the last couple years, there has been a lot of trust in me here,” adding that, for one complex story, “there was no real critical questioning behind my data” (Data Journalist, personal communication, January 18, 2018). However, all respondents from the large national news outlets, traditional and digital, said they use some version of a team-based approach to check accuracy. As one academic program director said, “team work is crucial” to data journalism projects (Director of University Data Journalism Program/Former Data Journalist, personal communication, January 31, 2018). A national outlet reporter noted that all reporters are responsible for double-checking their work, and that they also peer-review. According to a journalist with a prestigious national news organization, “We do spot checking and confirm source material, . . . and we verify our findings by that to make sure it makes sense” (Graphics Director, personal communication, February 5, 2018). A reporter from a national digital-only organization said they “do internal checks . . . do some data analysis and then . . . other people who are not working directly on the project but who have data analysis skills will attempt to replicate their work” (News App Developer, personal communication, January 17, 2018). One respondent said academic researchers were consulted about the methodology, and were even asked to replicate the findings, as a way of checking data and data processes. A data journalist from a digital outlet said her outlet hired individuals with academic backgrounds: “It feels like almost half the staff comes from academics” (Data Journalist, personal communication, January 17, 2018).
As for reporting data accuracy to readers, a local reporter said they have tended to explain their process only “if it was a fairly complicated or complex process to come up with it . . .” (Data Journalist, personal communication, January 18, 2018). Several others echoed this tendency. A few reporters talked about employing a small “nerd box” outside of the story text that explains methodology for complex projects (Data Journalists, personal communication, January 25 and February 1, 2018). According to one reporter (Data Journalist, personal communication, February 1, 2018) at a national outlet,
In that box, I try to explain to the reader as fundamentally as possible where we got the data from, what source or sources [we used], and what we did in terms of preparing it for public consumption if there was any additional preparation, and essentially what we did in terms of analyzing it.
That same reporter noted that he occasionally receives pushback from academics who question the method and analysis: “You’ve got your cards on the table there and people know what you did and how you did it” (Data Journalist, personal communication, February 1, 2018). Another reporter, from a top national news organization, said he routinely scraps stories when major limitations with the data surface. Another mentioned sharing the processes used for their analyses, so others can replicate them. In both cases, respondents were referring to larger, more complex projects.
Conclusion
Findings indicated that data journalists may peer into the top of the data “black box,” but they are unlikely to look deeply—though this varied somewhat across the type of news outlet, the level of data journalist expertise and resources, and the complexity of the data project.
Data journalist respondents were not hesitant to express grievances about government data, especially about holes in the data, unreliable access to data and relative ease of use of data. However, they were unlikely to question the underlying structure of data sets and the ways that categories of “commensurable” events and individuals were conceived and aggregated in the first place. This varied to some degree by whether the journalist worked at a national-level or local-level publication, as national-level journalists were more likely to question categories. Presumably, a higher level of expertise and greater time and resources help explain this difference.
In general, data journalists found federal-level data to be most readily available and thorough, if sometimes dated, and they also made it clear there were challenges in comparing or aggregating data across local areas and states. They also reported varying levels of expertise and cooperation at local source organizations to be a problem. Again, the primary concerns related to ease of access and use of the data rather than the accuracy or underlying structure of the data set. Typically, the underlying motives of officials who provide data are not deeply questioned, so long as the “feeding” is regular, reliable and accessible. This finding is consistent with prior studies’ findings on constraints of time and workload for understaffed news operations: Constraints likely reshape the hierarchy of needs for data journalism work.
Several data journalists indicated they always scrutinized data, but comments that government data were trusted implicitly were also frequent. Several respondents said they assumed government agencies were data competent—more competent than journalists—and that they thought data validity was therefore not an issue—an alarming conclusion, given that data quality from public sources can be highly inconsistent (Messner & Garrison, 2007). Again, there was a national/local divide for responses about data checking, as national-level journalists for both newspapers and newer digital publications were more likely to have systems in place for checking data, typically involving multiple employees.
Reporting of data problems was common across outlets, though more common at larger outlets, primarily because reporting was seen as necessary only for large, complex projects. Such projects are more common at larger, better resourced outlets. The fact that journalists generally did not think less complex uses of data warranted reporting of limitations contradicts the adage voiced by one prominent former data journalist: “Every dataset has flaws and omissions, and if I don’t find any, I get worried” (Director of University Data Journalism Program/Former Data Journalist, personal communication, January 16, 2018). It should be noted that findings about inconsistent scrutiny of data quality are not new; studies dating back to the mid-2000s suggested similar problems (Messner & Garrison, 2007).
This inconsistency of data scrutiny across project types suggests the news industry needs stronger routines for reporting data quality. Other factors suggest this need as well, including insufficient staff expertise, and pressure from perceived competition with other outlets, especially outlets with lower standards. The practice of including a “nerd box” with information about the data set, sample and analytical method is a promising routine adopted by some organizations. Tips and guidelines provided by professional associations such as Investigative Reporters and Editors and the Global Investigative Journalism Network suggest other practices that may be regularized. Noteworthy were several mentions of academics checking data journalist methods and analysis; transparency of the data information made this possible, and comments from respondents suggested that the norm of transparency was embedded within the work they did, serving as a sort of safety net for data journalists. The data journalist–academic relationship is worth exploring further. Data journalists are time and resource stressed, and aid from outside the data journalism subfield may be beneficial. Open data across digital online platforms could make this aid more likely.
Study findings cannot be generalized beyond the interviews, and follow-up survey analysis of the issues explored here would be helpful. Ethnographic work involving newsroom observation of data journalism work structures, practices and interactions would also be worthwhile. Observation would provide additional trustworthiness for these findings, as well as helpful context for explaining perceptions that emerged in interviews.
Results suggest that data journalists at top national outlets are more likely to look deeply into the black box of data sets and categories. They are somewhat more likely to question the motives of officials, though most often the problems that prompt questions—that is, relative accessibility and ease of use of data—relate to the ability of journalists to get work done. They are also more likely to produce complex data projects, and this complexity triggered journalists’ latent attention to checking and reporting data limitations. However, only a couple of the journalists interviewed seemed to readily grasp the idea that data category construction should be subject to scrutiny. The use of data is gaining increasing prominence in the journalism space across diverse areas of reporting, including local government, crime and public safety, health and the environment, and the increasingly divisive area of election polling. It is important that critical, careful thinking about data and data categories, and the expertise to apply this thinking, keep pace.
