Abstract

The past decades have seen an extraordinary growth in the number of large, open-access portals making data available for cross-national comparative social-science research and a corresponding increase in the use of such data. In particular, international organizations such as The World Bank, the United Nations Office on Drugs and Crime (UNODC), the United Nations Children’s Fund (UNICEF), and the World Health Organization (WHO) provide users with comprehensive indicator systems on, for example, the economy, the environment, education, health, child well-being, gender inequality, and conflict and violence. If adequately used, data retrieved from such portals can help researchers to address important research questions. However, they can also lead to bad science if researchers fail to pay close attention to how the data were generated. We illustrate this risk using a real-life example, namely data on cross-national homicide rates.
Homicide Rates
Homicide rates, the number of recorded intentional killings standardized by population, are the most widely used indicator in large-sample, cross-national comparative research on interpersonal violence and its predictors. Cross-national studies have examined, for example, whether levels and trends in national homicide rates can be predicted by social inequality, trust in fellow citizens, poverty, family instability, ethnic cleavages, climate change, or urbanization (Nivette, 2011; Trent & Pridemore, 2012). Researchers working in this field typically use either of two types of data. The first are national vital-registration statistics that record the number of people killed according to death certificates. They are collected by public-health authorities and compiled internationally by WHO (https://www.who.int/data/gho). The second are criminal justice statistics that report the number of homicides recorded by the police. They are compiled internationally by UNODC (https://dataunodc.un.org/).
A Sudden Growth in Data Coverage
Both data sources have their limitations, but data derived from national vital-registration systems have widely come to be considered the gold standard in comparative homicide research (Andersson & Kazemian, 2018). However, until around 2010, limited sample sizes due to the lack of national cause-of-death data were a major impediment to research. Earlier studies generally included fewer than 70 countries, most of them high-income, which raised concerns about whether the findings were generalizable.
Around 2010, the situation seemingly improved radically. Between 2009 and 2016, at least 10 articles published in peer-reviewed journals reported findings of cross-sectional or panel regression models that included 160 or more countries, meaning that virtually all countries were considered (Kanis et al., 2017). How was this sudden growth in data coverage achieved, and can the results reported in these studies be considered valid and reliable?
A Cautionary Note: The Kanis et al. (2017) Study
In 2017, a group of researchers including one of the authors of this Data Brief examined the origin of the data used in these studies and cautioned researchers about their appropriate use (Kanis et al., 2017). They showed that studies with large sample sizes had relied on data directly retrieved from either the WHO Global Health Estimates data portal (https://www.who.int/data/gho) or the UNODC Global Study on Homicide data portal (UNODC, 2011). The latter was partly based on data from the WHO Global Health Estimates program.
Since 2004, the Global Health Estimates program has generated cause-specific estimates of the severity of health problems for societies across the globe and tracked progress toward global health targets (Mathers, 2020). In particular, the program uses complex statistical modeling techniques to quantify mortality and loss of health in countries where no or limited data are currently available. For homicide rates, the WHO researchers test different prediction models with a large pool of covariates, iteratively optimizing their models until they arrive at the final model. For the 2012 estimates, for example, six predictor variables were included in the final models: the gender-inequality index, alcohol-consumption patterns, the percentage of people residing in urban areas, the male proportion of the population between 15 and 30 years old, the infant mortality rate, and religious fractionalization (i.e., the degree of religious heterogeneity in a society). The prediction model is adapted for each year for which estimates are produced.
As part of their Global Health Estimates program, WHO has released estimate-based homicide rates every 4 years since 2004. Kanis et al. (2017) showed that for a large number of countries, these data are not based on reports by national public-health authorities to WHO but on models computed by scientists at WHO. For the 2012 estimates, for example, national vital-registration data were used for 54 out of 173 countries for which estimates were published (31% of total). For 17 countries, criminal justice data were used (10%), and for 30 countries, national criminal justice data were adjusted by WHO (17%). However, for 72 countries (42%), the data were modeled by WHO researchers in the complete or partial absence of empirical data supplied by national authorities. This includes almost all countries on the African continent as well as a substantial number of countries in Asia and the Middle East (for a list, see Kanis et al., 2017).
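The breakdown of the 2012 estimates by data source can be checked with a few lines of arithmetic. The following is a minimal sketch that uses only the country counts reported above; the dictionary labels are shorthand chosen here for illustration, not WHO terminology.

```python
# Number of countries by data source for the WHO 2012 homicide estimates,
# as reported in the text (Kanis et al., 2017). Labels are illustrative shorthand.
counts = {
    "vital registration": 54,
    "criminal justice": 17,
    "criminal justice, adjusted by WHO": 30,
    "modeled by WHO": 72,
}

total = sum(counts.values())

# Share of each source type, rounded to whole percentages as in the text.
shares = {source: round(100 * n / total) for source, n in counts.items()}

for source, n in counts.items():
    print(f"{source}: {n} of {total} countries ({shares[source]}%)")
```

The four categories add up to the 173 countries for which estimates were published, and the modeled category is the largest single group.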
Kanis et al. (2017) note that the uncritical usage of regression-based estimates as though they were derived from national data-collection efforts can lead to serious problems. First, the predictors used for modeling purposes include variables that are conceptually and empirically related to constructs used by researchers in their regression models. The same information is hence used on both sides of the equation. Second, because WHO adapts estimation parameters in each estimation round, studies that estimate panel models wrongly attribute change in the estimation procedure to true change in violence levels. Third, missingness of homicide data is strongly concentrated geographically, with especially large gaps across the African continent. The estimated data therefore imply that covariates of homicide found in some parts of the world generate unbiased predictions in other regions, an assumption that has been shown to not always hold (Nivette, 2012).
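The first and third problems can be made concrete with a toy simulation. The sketch below is not the WHO estimation model: the two regions, the single covariate, the sample sizes, and all coefficients are invented for illustration. It assumes that homicide rates depend on the covariate in a region with registration data but not in a region without such data, and that rates for the second region are predicted from a regression fitted in the first.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def ols(xs, ys):
    """Intercept and slope of a simple least-squares regression of ys on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return my - slope * mx, slope


rng = random.Random(0)  # fixed seed so the run is reproducible

# Hypothetical region A: registration data exist, and the rate truly
# depends on the covariate (assumed slope of 2).
x_a = [rng.gauss(0, 1) for _ in range(200)]
y_a = [5 + 2 * x + rng.gauss(0, 1) for x in x_a]

# Hypothetical region B: no registration data, and the true rate is
# unrelated to the covariate.
x_b = [rng.gauss(0, 1) for _ in range(200)]
y_b_true = [5 + rng.gauss(0, 1) for _ in x_b]

# Fit the prediction model in region A, then predict rates for region B.
intercept, slope = ols(x_a, y_a)
y_b_est = [intercept + slope * x for x in x_b]

r_true = pearson(x_b, y_b_true)  # association in the (unobserved) true data
r_est = pearson(x_b, y_b_est)    # association in the model-based estimates

print(f"Region B, true rates:      r = {r_true:.2f}")
print(f"Region B, estimated rates: r = {r_est:.2f}")
```

Because the estimated rates are a linear function of the covariate, their correlation with that covariate is perfect by construction, whatever the true relationship in region B. A researcher who regresses such estimates on the same or a closely related covariate would recover the prediction model rather than learn anything about violence in the countries concerned.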
Why Did Researchers Fail to Notice?
WHO describes the estimation procedure and warns against the uncritical use of estimated data (WHO, 2014). Yet at least 10 peer-reviewed articles published between 2009 and 2017 present findings based on an uncritical use of the WHO estimates (Kanis et al., 2017, p. 319). Several other articles have appeared since the publication of Kanis et al.’s study.
Various factors may contribute to this problem. First, the ease of downloading data from large portals may have led researchers to pay too little attention to documentation on how the data were generated. Second, several data portals, including The World Bank’s World Development Indicators (https://databank.worldbank.org/reports.aspx?source=world-development-indicators), the 2011 UNODC Homicide Statistics (UNODC, 2011), and university-led initiatives such as Clio Infra (https://clio-infra.eu), disseminate homicide data generated through WHO homicide-estimation procedures. In their metadata, portals generally recommend that users consult the original source of the data. However, researchers may too frequently assume that downloading data from a respectable portal is sufficient. Third, reviewers may not be aware of the underlying issues and may therefore fail to alert researchers to the challenges.
Conclusions
The past two decades have seen an unprecedented increase in the availability of open-access data that can easily be downloaded from the Internet. Using secondary data sets for social-science research requires careful scrutiny of how the data were generated at the level of the primary data source and whether they are adequate for the research purposes (Atkinson & Brandolini, 2001). Failure to do so can result in poor science based on inadequate data. Several measures can help to reduce such problems. First, the administrators of data portals should consider additional efforts to document their data and alert readers to possible limitations. Second, researchers should always track secondary data downloaded from portals to the primary data source and carefully document any limitations. And, finally, reviewers of manuscripts in which the authors used secondary data should ideally be familiar with the data so that they can assess whether they have been used appropriately.
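One way to put the second recommendation into practice is to store a provenance label with every observation and to rerun analyses on the empirically based subset. The sketch below assumes that such a label has been coded by hand from the primary source's documentation. The record layout, the label values, and the example rows are hypothetical and do not correspond to any portal's actual export format.

```python
from dataclasses import dataclass


@dataclass
class RateRecord:
    country: str
    year: int
    rate: float        # homicide rate as downloaded from the portal
    source_type: str   # hypothetical provenance label, coded by the researcher


# Labels treated as empirically based in this sketch (an assumption).
EMPIRICAL_SOURCES = {"vital_registration", "criminal_justice"}


def split_by_provenance(records):
    """Separate empirically based records from all other records."""
    empirical = [r for r in records if r.source_type in EMPIRICAL_SOURCES]
    other = [r for r in records if r.source_type not in EMPIRICAL_SOURCES]
    return empirical, other


# Invented example rows; the values carry no substantive meaning.
records = [
    RateRecord("Country A", 2012, 1.2, "vital_registration"),
    RateRecord("Country B", 2012, 8.5, "modeled"),
    RateRecord("Country C", 2012, 3.4, "criminal_justice"),
    RateRecord("Country D", 2012, 10.1, "modeled"),
]

empirical, other = split_by_provenance(records)
print(f"{len(empirical)} empirically based records, {len(other)} other records")
```

Reporting results for the full sample and for the empirically based subset side by side lets readers and reviewers see how far the conclusions depend on modeled values.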
Footnotes
Transparency
Action Editor: Patricia J. Bauer
Editor: Patricia J. Bauer
Author Contributions
Both authors contributed equally to this Data Brief. M. Eisner wrote the text, and P. Fearon commented on draft versions of the manuscript. Both authors approved the final manuscript for submission.
