Abstract

A consensus has formed within the federal statistical data community that a proactive approach is needed to make use of new sources of data to supplement data collection through random sample surveys and censuses. Data collected through surveys are challenged by less cooperation from survey respondents, which results in both lower response rates and greater expenditures. Data users want more complex data faster and with greater geographic specificity without sacrificing quality.
However, funding constraints on federal agencies make it unfeasible for the agencies to keep pace with the increasing costs of addressing these issues. Major technology-enabled methodological changes, requiring additional research and development, need to be incorporated quickly into ongoing data collection and production efforts. Under the Paperwork Reduction Act (PRA), the U.S. Office of Management and Budget (OMB) is required to ensure the integrity, objectivity, impartiality, utility, and confidentiality of information collected for statistical purposes. This charge, assigned in the PRA to the chief statistician, is central to the maintenance of high-quality data as the federal statistical system moves to put in place significant and necessary changes to our national data infrastructure in coming years.
Many nonsurvey sources of data could be used to reduce costs and meet the growing needs of data consumers. Although data are widely available from multiple sources on the Internet and through purchase from private sector companies, numerous problems arise when these data are the sole basis for important policy decisions and government and business planning activities. The lack of transparency in how these data are produced can lead to a lack of user understanding when companies change proprietary models or have a vested interest in data results. Because businesses are interested in profitability, their business models may change or companies may change hands, eliminating sources of data and creating uncertainty around long-term availability. Open data scraped from the web may be biased in ways the casual user may not detect. By contrast, federally produced statistics are transparent, reliable, high quality, and objective. Nonetheless, if agencies that produce these valuable statistics cannot respond to today’s challenges, data users may turn to suboptimal solutions to meet their evolving needs.
Data sources that federal statistical agencies can make more use of include administrative records (federal and state records collected for program administration), curated commercial data (such as aerial photos and aggregated credit card records), and limited researched and analyzed open source data (such as consumer goods prices scraped from retail store websites). These data may be combined with survey data, used in lieu of survey data, or used to create new products. Although some statistical data have a long history of coming from multiple sources, the uses of combined data are still being developed.
OMB fulfills its statistical data quality responsibilities under the PRA through the development of policies, principles, standards, and guidelines concerning statistical collection procedures and methods, among other things. To this end, the OMB has issued multiple statistical directives. In particular, Statistical Policy Directive No. 2: Standards and Guidelines for Statistical Surveys (OMB 2006) contains in-depth guidance on how to conduct a rigorous and methodologically sound random sample survey to gather federal statistical data. However, no standards or guidelines have been issued by OMB that address statistical data generated from combining administrative, commercial, and survey data. The lack of a standard leaves producers and users with little consistent information about assessing the quality and utility of future combined federal statistical data. For the OMB to issue such guidelines, additional research is needed to examine the different dimensions of quality around combined data. These are briefly described here.
Transparency: How can an agency openly inform data users about how the data were collected and for what purposes? What are consistent ways of creating and providing such documentation? How can the strengths and limitations of the data be conveyed to the data user? As recommended by the National Academies of Sciences, Engineering, and Medicine (2017), openness also means that a statistical agency should describe how decisions on methods and procedures were made for a data collection program and provide ready access to research results that entered into such decisions. Such transparency is essential for credibility with data users and trust of data providers.
Fitness for various purposes: How can a determination be made about whether various datasets are appropriate for specific uses? Data may be good enough for some purposes but not for others. Should there be a quality rating system developed to guide the user? As we know from survey data, some surveys have too small a sample size to be used to develop estimates for small populations or small geographic areas. Similarly, nonsurvey data may not be suitable for certain uses. For example, aggregated credit card data may be missing commodities that are not normally purchased with a credit card. Depending on the research or policy questions being asked, missing data could make some datasets unsuitable.
Privacy: Data may have been collected for particular program purposes. Are additional permissions needed from providers to combine that information with other data for different purposes? When data are available commercially, does permission for use come from the vendor? How can the government do a better job of making sure that de-identified data collected from individuals and businesses cannot be reidentified once the statistical data are released to the public? Private sector companies collect a lot of data about individuals through mechanisms such as store frequent buyer programs, app terms of service agreements, bank loans, and other interactions with customers. What sort of consent from those customers would be needed if the government were to start buying data from these companies to create new statistical products?
Disclosure avoidance: When data are combined and then used for multiple purposes, what new techniques are needed for protecting privacy and confidentiality, especially when researchers want to replicate research results? It is very important to make sure that statistical data cannot be easily linked to other data that are publicly available to reidentify businesses or individuals whose data have been aggregated to produce the datasets.
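To make the reidentification concern concrete, here is a minimal sketch of one simple screening idea: counting how many records share each combination of potentially identifying attributes and flagging combinations that fall below a threshold k (a k-anonymity style check). This is an illustration only, not a description of any agency's disclosure avoidance method, and such a check alone would not be sufficient protection. The function name, field names, records, and threshold are all hypothetical.

```python
from collections import Counter


def small_cells(records, quasi_identifiers, k):
    """Return attribute combinations shared by fewer than k records.

    records: list of dicts, one per (hypothetical) respondent.
    quasi_identifiers: field names that could be linked to outside data.
    k: minimum number of records that must share a combination.
    """
    counts = Counter(
        tuple(record[field] for field in quasi_identifiers)
        for record in records
    )
    return {combo: n for combo, n in counts.items() if n < k}


# Hypothetical business records, invented for illustration.
records = [
    {"state": "RI", "industry": "retail", "size": "small"},
    {"state": "RI", "industry": "retail", "size": "small"},
    {"state": "RI", "industry": "retail", "size": "small"},
    {"state": "RI", "industry": "mining", "size": "large"},
]

risky = small_cells(records, ["state", "industry", "size"], k=3)
for combo, n in risky.items():
    print(f"{combo}: only {n} record(s); may need suppression or coarsening")
```

In this invented example the single large mining business stands alone in its cell, which is the kind of record that could be matched against publicly available information.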
Microdata: What type of access can be granted to microdata coming from multiple sources, some of which may be proprietary? How are privacy and legal agreements protected? Microdata are the individual records of people and businesses, whether they have been collected through surveys or through program records. Some examples are Social Security payment records, tax filings, registrations for programs such as the Supplemental Nutrition Assistance Program (SNAP) or housing assistance, as well as individual responses to statistical surveys or censuses. The confidentiality of the information contained in these sources needs to be vigorously protected while giving limited access for approved statistical purposes.
Ownership: What rights do original owners, including statistical agencies conducting surveys and governments providing administrative records, retain for future uses? What happens when commercial data are procured under a licensing agreement?
Quality: How does one measure traditional aspects of quality such as accuracy, coherence, comparability, reproducibility, bias, and coverage when combining data from multiple sources with varied collection methods?
Break in series: What is the responsibility of and appropriate methodology for the statistical agency to bridge a break in a longitudinal data series when the sources of data are dramatically changing? For example, monthly retail sales are currently calculated using sample survey data collected each month from businesses. If, instead, the information were to be gathered using consumer spending records, such as aggregated credit card purchases, that would change the retail sales number going forward. The gap in the data series would need to be bridged in an understandable way for researchers and others looking to understand changes that occur over time.
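One common way to think about bridging such a break is ratio splicing: if the old and new sources can be run in parallel for some overlap period, the average ratio between them over that period can be used to put the new series on the old series' level. The sketch below assumes such an overlap exists and that a single constant linking factor is adequate, which may not hold in practice. The function name and all figures are hypothetical and are not real retail sales data.

```python
from statistics import mean


def splice_series(old, new):
    """Link two series (dicts keyed by period) using their overlap.

    Keeps every old value, then appends new-source values for later
    periods after rescaling them to the old series' level.
    Returns (linking_factor, spliced_series).
    """
    overlap = sorted(set(old) & set(new))
    if not overlap:
        raise ValueError("series must share at least one period")
    # Linking factor: average ratio of old to new over the overlap.
    factor = mean(old[p] / new[p] for p in overlap)
    spliced = dict(old)
    for period, value in new.items():
        if period not in old:
            spliced[period] = value * factor
    return factor, dict(sorted(spliced.items()))


# Hypothetical monthly figures, invented for illustration.
survey_based = {"2023-01": 100.0, "2023-02": 102.0, "2023-03": 104.0}
card_based = {"2023-02": 51.0, "2023-03": 52.0, "2023-04": 53.0}

factor, bridged = splice_series(survey_based, card_based)
for period, value in bridged.items():
    print(period, round(value, 1))
```

Publishing the linking factor and the overlap period alongside the bridged series is one way to keep the adjustment understandable to users comparing values across the break.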
Risk: What are the mitigation procedures needed for an agency to reduce the risk of discontinued availability of commercial or other data that it is acquiring but not responsible for collecting?
Postcollection processing: What changes in methodology are needed in production activities such as editing, imputation, weighting, and modeling when data are coming from multiple sources? To what extent do these methods need to be consistently applied across sources?
As research in these areas proceeds, more issues will certainly be identified. They must be addressed if the federal statistical system is to continue as the gold standard for providing high-quality statistical data. As more statistical data are derived from combined data sources, new research-driven OMB standards and guidance will help to ensure that data continue to be available for business and civic purposes, as well as for evidence-driven policymaking arising out of federal and state program evaluation and academic research.
The research on statistical methodologies and quality control moving forward will require extensive collaboration between federal statistical agencies and program agencies, academia, states, and other stakeholders to expeditiously advance our learning on combined statistical data. Such collaboration should take many forms. Intergovernmental projects that bring together city, state, and federal partners can be especially valuable, particularly when universities can assist with training, data analytics, and meaningful insights. Many such pilot projects are currently underway, funded by foundations interested in evidence-based policy at all levels of government. For example, the State of Rhode Island has been working with Brown University to enhance the analytics applied to state data to improve state program operations.
Private sector data providers can work with federal partners in statistical agencies to increase the transparency of commercial data, as well as to identify ways to standardize legal and operational approaches to incorporating commercial data into official statistics.
In addition, the Federal Committee on Statistical Methodology (FCSM) could sponsor multiple workshops on some of these research topics and invite members of the National Academy of Sciences Committee on National Statistics (CNSTAT) and other academics to participate. Academic researchers are also making contributions to the field. These are just a few of the opportunities available for collaboration, and the OMB can play a critical role in coordinating and corralling the learning to develop new, much-needed standards governing combined datasets. Although much work is already underway, much remains to be done. Exciting times indeed.
Notes
Nancy Potok is chief statistician of the United States and chief of Statistical and Science Policy at the U.S. Office of Management and Budget. Previously, she served as the deputy director and chief operating officer of the U.S. Census Bureau. She is also an adjunct professor at The George Washington University.
