Abstract
The Statistics of Income (SOI) Division of the U.S. Internal Revenue Service (IRS) uses administrative data from tax and information returns to collect information for its statistical studies. This paper reviews fundamental “Big Data” issues with respect to tax data and highlights initiatives to better leverage administrative tax data for statistical purposes. Among the topics addressed are the current uses of administrative datasets in developing statistical samples and the benefits of large administrative data sets for producing small-area estimates or public-use files while ensuring statistics satisfy applicable data quality standards. New processes to create geographic statistics and public-use data files are also discussed. Also explored are ways to efficiently access large datasets and overcome limitations of filing requirements on statistical uses of administrative data. Finally, future applications of administrative data to produce tax statistics, as well as efforts to improve metadata in support of longitudinal analyses, are examined.
Introduction
The 2017 report by the Congressionally chartered Commission on Evidence-Based Policymaking (CEP) [1] and guidance from the U.S. Office of Management and Budget (OMB) emphasize the role of high-quality and reliable statistics in providing a foundation for research, program evaluation, and policy design. Data collected to administer government programs are increasingly regarded as “a valuable national resource and a strategic asset, to the Federal Government, its partners, and the American public in promoting important goals and targeting resources toward priorities ranging from economic growth and education to scientific discovery and the very functioning of our democracy” (M-14-06).1 The report and guidance make clear that in some cases, administrative data provide an opportunity to create such statistical information more efficiently, reducing both cost and public burden relative to conducting surveys.2 To maximize the value of administrative data for analytical purposes, OMB has sought to establish “a framework to help institutionalize the principles of effective information management at each stage of the information’s life cycle to promote interoperability and openness,” in order to promote the use of administrative data for statistical purposes, while fully protecting the privacy and confidentiality afforded to the individuals, businesses, and institutions providing the data (M-13-13). This framework requires agencies to (i) foster greater collaboration between program and statistical offices; (ii) develop strong data stewardship policies and practices around the statistical use of administrative data; (iii) require the documentation of quality control measures and key attributes of important administrative datasets; and (iv) require the designation of responsibilities and practices using agreements among these offices (M-14-06). Through this guidance, agencies are encouraged to actively promote the use of administrative data for statistical purposes. 
The CEP report goes even further by recommending a National Secure Data Service to facilitate linking data from multiple agencies while providing increased transparency and privacy protection.
The Statistics of Income (SOI) Division of the U.S. Internal Revenue Service has long been a leader in the Federal statistical community in its use of administrative data as the foundation of its statistical data products. This generally has included using administrative data to create sampling frames for tax filing populations of interest and as a primary source of raw data, as well as for developing sample weights for use in creating unbiased population estimates. As administrative data have become more accessible, SOI’s use of these data has also evolved. This paper will discuss current uses of administrative data in the SOI program, examine some of the inherent challenges of using administrative data for statistical purposes, describe steps SOI is taking to overcome those challenges, and outline potential future uses of administrative data in producing tax statistics.
Background
SOI was established in 1916, shortly after the adoption of the modern Federal income tax in 1913, and charged with the annual preparation of statistics with respect to the operation of the tax law. The first SOI report, based on income tax returns filed by individuals and corporations for Tax Year 1916, was published in 1918. From the very beginning, SOI reports were used extensively for tax research and for estimating revenue, especially by officials in what are now the Office of Tax Analysis of the Department of the Treasury and the Congressional Joint Committee on Taxation. In the 1930s, a third major user of SOI data, the Bureau of Economic Analysis (BEA) in the Department of Commerce, was added. BEA uses SOI data extensively in constructing the National Income and Product Accounts.
As the SOI program and products have expanded, users in other Government agencies, such as the Census Bureau, and many private and academic researchers have come to rely on tax data produced by SOI for evaluating tax policy initiatives (see [3] for a complete history of the SOI program). To fulfill its charge, SOI created a structured mechanism for transforming administrative data into statistical files, using its own data collection systems that operate independently of main IRS tax return processing. SOI currently conducts approximately 110 different projects involving data collection from tax returns and information documents. Data content is developed working closely with data users to ensure both continuity and usefulness. For most studies, data are extracted from stratified random samples of returns as they are filed to ensure timeliness. Specially trained field employees located in IRS submissions processing centers collect the data under the supervision of subject-matter experts (SMEs) from SOI headquarters. These SMEs supply data editing instructions, conduct training classes, and review difficult cases. Field employees enter data into computer databases and check them using embedded tests that verify coded values and key mathematical relationships. In addition, subsamples of edited returns are subjected to field-by-field quality review. SMEs carefully review all files for accuracy before producing statistics for public release.
Statistical uses of administrative data
Administrative records have a long history of use in the production of U.S. Government statistics [4, 5]. In recent years, technological advances have made it easier for statistical agencies to process large datasets, encouraging even greater use of administrative records for research purposes. As a research tool, administrative records have many potential uses, including direct tabulation and indirect estimation of models or other statistics, as well as construction of survey frames [6]. In the best situations, administrative data may have several advantages over traditional samples, including more complete coverage of a population (sufficient for regional statistics), low data collection costs, and reduced respondent burden.
The potential problems of using administrative data for statistical purposes include the stability of a program over time, conceptual issues relative to the population and items collected, and costs of transforming the data into a form useful for research purposes. Large population files pose practical challenges, sometimes associated with the broad concept of “Big Data.” The sheer volume of data can sometimes exceed processing or storage limitations. Even with sufficient processing power, traditional methods of managing data, for example, using commercial off-the-shelf software or queries to manipulate relational databases, may mean that even relatively straightforward analyses can require a great deal of time. Validating and resolving record linkage problems (missing records, duplicate records, incorrect matches) when working with administrative data from different sources can introduce additional challenges.
Sampling frames
SOI has relied on statistical samples of tax returns, drawn from administrative tax data, for producing official statistics on the tax system for decades. In early years, samples for statistical processing were drawn manually from bundles of processed paper returns. In more recent decades, SOI sample selection protocols have been automated so that records are selected for inclusion in SOI processes when the data “post” to IRS administrative data sets.3 Samples use stratified Bernoulli sampling where sample rates are preset based on population projections for each stratum. Sample strata are established to support a broad range of analysis at the national level, and stratifications differ based on the taxpayer population. For example, samples for individual filers are stratified based on size of total gross positive or negative income, taxability, business receipts, and utility for tax policy modeling. Business tax return sample stratifications include tax form type, size of total assets, and, for certain form types, measures of income.4 SOI stratifies other samples, such as those for exempt organizations, gifts, and estates, by measures of economic size, including amount of assets or income, size of gift, and total gross estate, respectively [7]. Each record in the filing population is assigned a “transform number”, which is a pseudo-random number generated based on the filer’s Social Security Number or Employer Identification Number [8]. This transform is compared to the sample rate in the assigned stratum to determine whether a specific return will be selected for inclusion in an SOI program. Sample weights are based on the probability of selection, but can be adjusted to account for returns known to have been missing or misclassified at the time of selection.
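The selection rule described above can be sketched in a few lines of Python. This is a minimal illustration, not SOI's actual procedure: the SHA-256 hash, the stratum names, and the sampling rates are all assumptions made for the example. The text states only that a transform number is derived from the filer's identifier and compared with the sampling rate of the assigned stratum, and that weights are based on the probability of selection.

```python
import hashlib

# Hypothetical strata and sampling rates (illustrative values only).
SAMPLE_RATES = {"low_income": 0.001, "mid_income": 0.01, "high_income": 1.0}


def transform_number(taxpayer_id: str) -> float:
    """Map an identifier to a repeatable pseudo-random value in [0, 1).

    SHA-256 is an assumption for this sketch; it stands in for whatever
    transform is actually applied to the identifier.
    """
    digest = hashlib.sha256(taxpayer_id.encode("utf-8")).digest()
    # Keep the top 53 bits so the quotient is exactly representable
    # as a float and always strictly less than 1.
    return (int.from_bytes(digest[:8], "big") >> 11) / 2**53


def is_selected(taxpayer_id: str, stratum: str) -> bool:
    """Bernoulli selection: keep the return if its transform falls below the rate."""
    return transform_number(taxpayer_id) < SAMPLE_RATES[stratum]


def base_weight(stratum: str) -> float:
    """Base sample weight: the inverse of the selection probability."""
    return 1.0 / SAMPLE_RATES[stratum]


if __name__ == "__main__":
    for tin, stratum in [("000-00-0001", "low_income"), ("000-00-0002", "high_income")]:
        print(tin, stratum, is_selected(tin, stratum), base_weight(stratum))
```

Because the transform depends only on the identifier, the same filer receives the same value every year, which is one way a design of this kind can keep returns in the sample over time. The adjustments for missing or misclassified returns mentioned in the text are not modeled here.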
SOI also uses administrative data as the basis for the development of dynamic samples. For a dynamic sample, a file of administrative data is captured from one sample to develop strata for another sample. For example, to capture information on business activities within the nonprofit sector, SOI has a long-running program that integrates its tax-exempt organization (Forms 990 and 990-EZ) and tax-exempt organization unrelated business tax (Form 990-T) programs. Any charitable organization, exempt under Internal Revenue Code section 501(c)(3), selected for the Forms 990/990-EZ study that has a corresponding 990-T is automatically included in the Form 990-T sample [9]. The administrative data from Form 990 are captured at sampling and held in a “tickler file”, which is then effectively used as a separate stratum to select returns into the Form 990-T sample with certainty. This sampling method allows integrated analyses between charitable organizations appearing in each of the two programs, such that a charitable organization’s exempt financial activities can be linked to its business activities. The respective samples also function independently, allowing SOI to produce separate cross-section estimates of each unique population [10].
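A rough sketch of how a tickler file can act as a certainty stratum follows. The field name `ein`, the function name, and the `bernoulli_select` callback are hypothetical and introduced only for illustration; the source does not describe the implementation.

```python
def select_form_990t(returns_990t, tickler_eins, bernoulli_select):
    """Return (record, reason) pairs for the Form 990-T sample.

    returns_990t     -- iterable of dicts, each with an "ein" key (assumed layout)
    tickler_eins     -- set of EINs captured when the Forms 990/990-EZ sample was drawn
    bernoulli_select -- callable implementing the regular stratified selection rule
    """
    sample = []
    for record in returns_990t:
        if record["ein"] in tickler_eins:
            # Organizations already in the Forms 990/990-EZ sample are taken
            # with certainty so the two files can be linked.
            sample.append((record, "tickler_certainty"))
        elif bernoulli_select(record):
            sample.append((record, "regular_stratum"))
    return sample


if __name__ == "__main__":
    tickler = {"11-1111111"}
    filed = [{"ein": "11-1111111"}, {"ein": "22-2222222"}, {"ein": "33-3333333"}]
    # Stand-in selection rule for the demonstration only.
    picked = select_form_990t(filed, tickler, lambda r: r["ein"].startswith("33"))
    for record, reason in picked:
        print(record["ein"], reason)
```

Recording the reason for selection alongside each record is a design choice of this sketch; it makes it straightforward to weight the tickler-selected returns separately from the regular strata.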
To better evaluate taxpayer behavior and financial changes that impact individual taxpayers longitudinally, SOI uses administrative data to develop embedded panels within its programs. The longest-running of these panels, the 1999 Forward Individual Panel, includes certain primary and secondary taxpayers selected for the Tax Year 1999 individual cross-section sample. SOI selected the 1999 taxpayers based on a subset of Social Security numbers that conformed to the Social Security Administration’s Continuous Work History Sample (CWHS) methodology [11]. SOI selects returns filed by these taxpayers in each of its subsequent annual cross-sectional samples. A second panel, based on another Form 1040-based program, the 2007 Sales of Capital Assets study, is also selected each year based on a similar methodology [12]. Furthermore, SOI develops a tickler file from the administrative data files for taxpayers whose Social Security numbers are included in the full CWHS sample. The file is used to create a stratum of filers from the individual panels and to augment SOI’s Estate Tax (Form 706) sample. When a 1040 panel member dies and a Form 706 is filed for his or her estate, the Form 706 is included in the estate tax sample. This enables examination of both longitudinal income, based on the Form 1040 CWHS taxpayers, and a single-year snapshot of the wealth that generated that income, based on the Form 706 taxpayers. These linked data have contributed to better understanding the investment strategies of the very wealthy and changes in the concentration of wealth and income mobility in the United States.
Direct estimation
Using administrative data to produce direct statistical estimates has perhaps the greatest potential for impacting SOI’s program. There are some limitations in terms of the specific data items available, because not every data item reported on a paper-filed tax form is captured during initial administrative processing. In addition, data definitions are determined by legislation and may change over time based on changes to the tax code, creating challenges when using data for longitudinal studies [6]. Consistency across records can be a problem because there is some allowed variation in the way tax return filers report similar information, and data quality can be an important concern. These inconsistencies can be addressed manually when working with statistical samples, but achieving the same level of curation in population data would require a more automated approach. Finally, the unit of observation for some filing populations may not be ideal. For example, there is variation in the way individuals file income taxes, even among filers with the same marital status, and it is difficult to represent all types of households using these data.5 Complex corporations likewise can choose to file as a consolidated group or each subsidiary may file separately, complicating the production of consistent statistics. Statistical Policy Working Paper 6 [13] also cautions that there can be considerable variation in quality across variables in an administrative record system. Information that may be statistically important, but only marginally relevant to administrative purposes, is often imperfectly reported, validated, and processed. Data items used primarily as background information, for example, an address or occupation, may be of particularly low quality or even incomplete [14].
Despite these challenges, SOI is successfully making increased use of administrative data to create its statistical products. The biggest impact is on SOI’s individual income tax geographic data series – ZIP Code, metro- and micropolitan areas, county, county-to-county migration, and State. Data quality is a particular concern in these products. When producing small-area estimates, the effects of data errors, even those that do not matter in aggregate, can create significant distortions. Errors on high-income returns can be particularly distorting.
To improve the overall quality of the data used for geographic data products, SOI takes a hybrid approach that combines administrative data for most of the filing population with high-quality, edited data on high-income returns from its samples. This process leverages a feature of the SOI sample design that ensures returns with very high incomes or those that report large values in key fields are assigned specific sample codes and selected at a rate of 100 percent into the SOI samples. Mechanically, this involves assigning sample codes to administrative population files and dropping the records whose codes are selected with certainty in the SOI program. SOI data are then used to replace these records in the administrative files that will be used to produce the geographic data estimates. Doing this ensures that records with extreme values for key characteristics are of high quality, reducing the possibility of distorting errors in the final estimates. The remaining administrative data are subjected to automated testing and error correction routines, based on rules developed jointly between tax law experts and SOI’s SMEs, to detect and correct significant errors and outliers. All geographic tables are created from these files. As an additional quality check, the final geographic estimates are compared longitudinally to isolate and evaluate any remaining anomalies. This approach to producing reliable data has allowed SOI to greatly expand both the granularity and the number of data items released in these products, increasing their utility for research.
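The record-replacement step of this hybrid approach might look roughly like the following sketch. The `sample_code` field, the code values, and the record layout are assumptions made for illustration, and the automated testing and error-correction routines applied to the remaining administrative records are not shown.

```python
def build_hybrid_file(admin_records, soi_edited_records, certainty_codes):
    """Combine administrative records with SOI-edited certainty records.

    admin_records      -- population records, each with a "sample_code" key (assumed)
    soi_edited_records -- edited records from the SOI sample
    certainty_codes    -- sample codes selected at a rate of 100 percent
    """
    certainty_codes = set(certainty_codes)
    # Drop administrative records that fall in certainty strata ...
    kept = [r for r in admin_records if r["sample_code"] not in certainty_codes]
    # ... and substitute the edited SOI versions of those same returns.
    replacements = [r for r in soi_edited_records if r["sample_code"] in certainty_codes]
    return kept + replacements


if __name__ == "__main__":
    admin = [
        {"id": "A", "sample_code": 10, "agi": 40_000},
        {"id": "B", "sample_code": 99, "agi": 95_000_000},  # unedited value
    ]
    soi = [{"id": "B", "sample_code": 99, "agi": 9_500_000}]  # edited value
    for row in build_hybrid_file(admin, soi, certainty_codes=[99]):
        print(row)
```

Since certainty records carry a sample weight of one, substituting them does not change the population count; only the quality of the values for the largest returns changes.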
Exempt Organization Financial Extract
Administrative data have also proven to be useful for developing public-use datasets for large populations, where there is value added in making every record in the population available. Forms 990, 990-EZ, and 990-PF, the information returns filed by tax-exempt organizations to report charitable activities, grants and donations received and distributed, and asset and income information, are unique in that the information reported can be made publicly available. SOI has long released public-use microdata files that include all the information collected for its samples of these filing populations. The usefulness of these files for the public is limited by sample size constraints. While the sample includes most large organizations, sampling rates are relatively low for smaller organizations, and the full sample represents less than 10 percent of the filing population.
To develop a more comprehensive public-use dataset, SOI developed the Exempt Organization Financial Extract. The extract includes IRS administrative data from all Forms 990, 990-EZ, and 990-PF filed within a given calendar year, focusing on data elements that experience shows are generally well-reported. To improve the utility of these files for statistical or research use, SOI conducts mainly automated, high-level testing and corrects issues such as arithmetic errors and truncated or added digits. Additional tests include ensuring internal logical consistency, removing duplicate returns, and verifying outliers. The extracts have proven to be a cost-effective way to provide the public with a comprehensive view of key financial items reported by the nonprofit sector, allowing for expanded research opportunities and potentially improving compliance among filers.
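As an illustration of the kind of automated, high-level testing described above, the sketch below checks a reported total against the sum of its components, flags totals that look like they have an added or dropped digit, and removes duplicate returns. The field names, the tolerance, and the factor-of-ten heuristic are assumptions made for this example and are not taken from SOI's actual test rules.

```python
def check_total(record, component_fields, total_field, tolerance=1):
    """Classify a reported total relative to the sum of its components.

    Amounts are assumed to be whole-dollar integers.
    """
    computed = sum(record[f] for f in component_fields)
    reported = record[total_field]
    if abs(reported - computed) <= tolerance:
        return "ok"
    # A total that matches the computed sum after removing (or adding) one
    # trailing digit suggests an added or truncated digit.
    if computed != 0 and (reported // 10 == computed or computed // 10 == reported):
        return "possible_digit_error"
    return "arithmetic_error"


def drop_duplicates(records, key_fields):
    """Keep the first record seen for each combination of key fields."""
    seen = set()
    unique = []
    for record in records:
        key = tuple(record[f] for f in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


if __name__ == "__main__":
    row = {"contributions": 100, "program_revenue": 200, "total_revenue": 3000}
    print(check_total(row, ["contributions", "program_revenue"], "total_revenue"))
```

A flagged record would still need a rule, or a reviewer, to decide which value to keep; the sketch only classifies the discrepancy.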
SOI Data Bank
As noted previously, the unit of observation available in administrative tax data makes certain research difficult or impossible. This is especially true for individual income tax filers – married couples may file returns jointly, but they are also allowed to file separately in cases where marginal tax rates favor treating the two incomes separately; dependent children and others living in a home may also be required to file separate returns to report both earned and unearned income; and unrelated individuals who consider themselves a household may have no connection in the tax data, especially if one or more of them does not earn sufficient income to file a tax return. So, while in many cases the tax-filing unit is analogous to the household, for more complex or multigenerational households there are important differences [15].
To create a consistent unit of observation suitable for many research purposes, SOI has led creation of the IRS Data Bank. Based on the population of income tax returns filed each year from 1996–2015, it includes data from almost 2.5 billion returns. Given its size, it is not practical to include all available administrative data for each person in each year. Instead, each record contains core information, including available demographics, indicators of key characteristics, some income, and other selected financial items from information and tax returns. Significantly, all records represent an individual person, not a tax filing unit, and include indicators to aid linking related individuals – parents to children, for example. The resulting file can be used to support simple analyses. More frequently, the Data Bank serves as a sample frame for more complex projects where users draw a sample of individuals based on characteristics present in the Data Bank and then augment that information with more detailed information by linking to detailed, longitudinal tax and information return data required to address specific research questions.
In addition to helping resolve the unit of observation problem with tax data, the Data Bank represents a major advance over archival annual population files by addressing some of the Big Data issues described above. First, it addresses common record linkage problems by applying documented rules for dealing with duplicate or missing data and by validating record linkages. By creating a common baseline for addressing a wide range of research questions, the Data Bank makes it easier to reproduce results; this would not be possible if all research using the population files were done using ad hoc, user-created datasets. Second, although the number of data items for each record is limited, the Data Bank brings together sufficient foundational data from multiple separate data files to support many projects. To compile such information for each new research project would take a great deal of machine time and human effort. Third, the database is indexed and contains all the fields necessary to link to more detailed administrative information so that users can efficiently add additional data items needed for specific projects, recreate tax filing units if needed, or connect multiple generations with relative ease.
Data on nonfilers
Another limitation of administrative data is the fact that the population covered is defined through legislation, and often this population is truncated in some way, based on specific demographic or economic characteristics. In some cases, individuals may have to take some action to become part of the administrative system (e.g., filing a tax return), and there may be perceived advantages for some individuals to evade, particularly if their circumstances place them at or near a threshold requiring mandatory participation. Income tax filing gap estimates for Tax Year 2000 suggest that as many as 11 million taxpayers, or about 9 percent of the potential income tax filing population, either file returns late or not at all [16]. Even under the most compliant circumstances, the population of Federal income filers includes only those U.S. citizens and resident aliens whose gross income, a concept defined by statute, was above a threshold.6 It is estimated that income tax return filers represent roughly 61 percent of the total U.S. individual population [17].
Information documents, supplied to the IRS by third parties, have potential for providing information on both nonfilers (those who have a filing requirement but do not file timely) and those whose incomes fall below the filing threshold. In 2014, the IRS received 2.3 billion information documents reporting information on wages, mortgage interest, student loan interest, tuition payments, interest and dividend earnings, retirement income, and income distributed by partnerships, small corporations, estates, and trusts [18]. Importantly, this information is available for individuals whose incomes are below the annual income tax filing threshold.
SOI led a collaborative work group made up of researchers from multiple Federal agencies to develop a consistent framework for organizing information return data to optimize their usefulness for a wide range of research purposes. Using individuals as the unit of observation, key data items have been mapped into a tax return framework to ensure that estimates using these data are consistent with the richer data available for return filers. Rules have been created for dealing with issues such as duplicate filings, which can arise when third parties amend originally submitted information. Other issues that have been addressed include validating matches and eliminating errors that arise when documents are submitted with invalid Taxpayer Identification Numbers (TINs) or when multiple people share the same TIN either accidentally or because they lack status to obtain their own. Because these documents are organized in a consistent, well-documented format, researchers have a consistent starting point for their work, which saves time and machine resources and reduces inconsistencies between estimates based on the same source data. SOI is using these data to produce new data products. This framework has also been adapted to the IRS Data Bank.
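One way to picture the duplicate-filing rule and the person-level organization is the following sketch. It assumes, purely for illustration, that a later-received document from the same payer for the same payee and form type supersedes an earlier one; the field names (`payer_tin`, `payee_tin`, `form`, `received`, `amount`) are hypothetical, and the work group's actual rules are not described in the text.

```python
from collections import defaultdict


def latest_documents(docs):
    """Keep one document per (payer, payee, form), preferring the latest received.

    "received" is assumed to be an ISO-format date string, so plain string
    comparison orders the documents chronologically.
    """
    latest = {}
    for doc in docs:
        key = (doc["payer_tin"], doc["payee_tin"], doc["form"])
        if key not in latest or doc["received"] > latest[key]["received"]:
            latest[key] = doc
    return list(latest.values())


def person_totals(docs):
    """Aggregate de-duplicated amounts by person and form type."""
    totals = defaultdict(lambda: defaultdict(int))
    for doc in latest_documents(docs):
        totals[doc["payee_tin"]][doc["form"]] += doc["amount"]
    return {tin: dict(by_form) for tin, by_form in totals.items()}


if __name__ == "__main__":
    documents = [
        {"payer_tin": "E1", "payee_tin": "P1", "form": "W-2",
         "received": "2015-01-31", "amount": 50_000},
        {"payer_tin": "E1", "payee_tin": "P1", "form": "W-2",
         "received": "2015-03-15", "amount": 52_000},  # amended document
        {"payer_tin": "B1", "payee_tin": "P1", "form": "1099-INT",
         "received": "2015-02-02", "amount": 100},
    ]
    print(person_totals(documents))
```

This simple key would wrongly merge two genuinely distinct documents of the same type from one payer, and it does nothing about invalid or shared TINs, so a production rule would need additional identifying fields.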
Future work
As administrative tax data become easier to access and use, they will play an increasingly large role in the production of SOI tax statistics. They will be used frequently to produce timely preliminary estimates or estimates for partial periods. SOI is already doing this with a series of real-time filing season estimates, which, despite containing only a small number of data items, allow economists to produce influential estimates of aggregate economic performance months earlier than in the past.
Additionally, SOI can leverage administrative data to create retrospective panels of various tax-filing populations to better apply longitudinal analyses to answer specific analytical questions. SOI can create a panel of current-year filers and supplement the information with administrative data from previous years. This methodology may improve the accuracy of SOI’s longitudinal analyses by reducing estimation problems related to attrition in the current forward-looking panels.
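The retrospective-panel idea can be sketched as a backward lookup from the current-year filers into prior-year administrative files. The data layout below, a dictionary keyed by (TIN, year), and the field name `agi` are assumptions made only for illustration.

```python
def retrospective_panel(current_year, current_filers, history):
    """Attach prior-year records to each current-year filer.

    current_filers -- iterable of TINs present in the current-year file
    history        -- dict mapping (tin, year) -> administrative record
    Returns a dict: tin -> {year: record or None}, for years before current_year.
    """
    prior_years = sorted({year for (_, year) in history if year < current_year})
    panel = {}
    for tin in current_filers:
        # None marks a year in which no return was found for this person,
        # which is itself informative in a longitudinal analysis.
        panel[tin] = {year: history.get((tin, year)) for year in prior_years}
    return panel


if __name__ == "__main__":
    past = {
        ("A", 2013): {"agi": 10_000},
        ("A", 2014): {"agi": 12_000},
        ("B", 2014): {"agi": 5_000},
    }
    for tin, years in retrospective_panel(2015, ["A", "B"], past).items():
        print(tin, years)
```

Because membership is defined by the current year, such a panel does not lose members going forward, but it does not represent people who filed in earlier years and have since left the filing population.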
Another important use of administrative data will be to augment SOI’s samples to produce more complete information on certain types of economic activity or sectors of the economy. As discussed above, SOI regularly links its sample of Federal estate tax returns, which provide information on a decedent’s balance sheet at death, to income tax return data. The linked data are used to produce insightful tables that connect wealth and income, two important measures of well-being. In the future SOI’s samples of information returns filed by U.S. partnerships will be linked to administrative individual and corporate income tax return data to more fully understand the economic impact of pass-through organizations as policy makers evaluate U.S. business tax policy. Related to this is the potential for linking the SOI corporate tax sample data to administrative tax data for related domestic and international businesses to better understand complex business behavior.
SOI geographic data have become quite popular among researchers and policy makers. In fact, administrative tax data have spawned a whole new movement in econometric analysis that uses variation in policies or economic conditions among small geographic areas to construct quasi-experiments. To be useful, however, better geographic location codes are needed. It is well known that the mailing addresses provided by tax filers may not always represent a primary address. For example, in some cases, a post office box number is supplied, rather than a street address. Others, particularly those with complex incomes, may provide the address of a paid preparer or closely held business, rather than that of their primary residence. Work by the Census Bureau, related to studies of how administrative tax data might be used to reduce the costs of fielding the 2020 Census, will provide some measure of the scope of these alternate addresses and may even suggest methods for improving the data. For example, commercial information or addresses from IRS information documents may provide better location codes. Other ways to improve the utility of administrative data include developing routines for consistently coding occupation and industry for individual tax filers and businesses.
One challenge inherent in future uses of administrative data is ensuring that accompanying metadata be as complete and as descriptive as possible. Often, metadata developed for administrative purposes lack key information necessary for complex analytical work. In administrative tax data, coverage and content can be subject to discontinuities resulting from changes to laws, regulations, administrative practices, or program scope. For example, income tax law revisions in 1981, 1986, 1990, and 1993 all made significant changes to both the components of income subject to taxation and the allowable deductions from income that had significant impact on the statistical uses of tax return data [19]. As the use of administrative tax data increases, SOI will be challenged to develop documentation to make users aware of how such changes may affect longitudinal analyses. In some cases, SOI may be able to provide program code for creating conceptually consistent, longitudinal variables to support some research.
Looking to the future, data quality in administrative tax data will remain an important consideration. As researchers increase their use of administrative tax data, they will naturally create ad hoc processes for detecting and resolving obvious data inconsistencies, computational errors, and large outliers. While such errors, if not introduced randomly, are a worrying source of bias, inconsistent treatment of these errors by individual analysts may create systematic biases that are even worse. Differences in treatment of inconsistencies by two researchers using otherwise similar data may lead to contradictory findings. SOI has a great deal of expertise in detecting and resolving errors and reporting inconsistencies. An important role for SOI in the future will be to help create tools for consistent error detection and resolution in administrative data files and to make these tools and accompanying documentation easily available to users. Such tools will be essential if administrative tax data are to be more widely used in a National Secure Data Service for evidence-based policy research. Techniques, such as machine learning, based on comparisons of SOI cleaned sample data to originally filed information, may help develop models for improving administrative data in the future.
Finally, data reliability may also be affected if the information respondents provide is manipulated to reduce tax liability or to enable a filer to qualify for specific tax expenditure programs. For some types of information, such as wages reported on tax forms, amounts reported by filers are validated against amounts reported by their employers, making the information difficult for respondents to manipulate. Data for other sources of income and most deductions currently rely solely on accurate self-reporting. Underreporting on tax returns, for example, may have resulted in underpayment of as much as $120 billion in income taxes for Tax Year 1998 [16]. In the future, it will be important for SOI to better understand how reporting behavior affects the statistics generated using administrative data (see, for example, [20]).
Footnotes
The Office of Management and Budget (OMB) has in recent years issued four memoranda emphasizing the value of existing data: Sharing Data While Protecting Privacy (M-11-02), Open Data Policy – Managing Information as an Asset (M-13-13), Next Steps in the Evidence and Innovation Agenda (M-13-17), and Guidance for Providing and Using Administrative Data for Statistical Purposes (M-14-06).
Data are considered to have posted when information imported from electronically filed returns or transcribed from paper-filed returns meets minimum acceptance standards and is saved to IRS authoritative data systems.
Different types of businesses are required to file different tax forms, depending on their organizational structure and purpose.
Married couples can file jointly, using one return to report the income of both spouses, or as “married filing separately”, where each spouse files his or her own return. For unmarried individuals living together as a couple, separate returns must be filed. Children earning less than $10,500 in interest and dividend income can report that income on a parent’s return, but those earning more and those with earned income must file separately.
Nonresident aliens are subject to different filing requirements, based on income earned in the U.S.
Acknowledgments
Thanks to former Statistics of Income Directors Fritz Scheuren, Daniel Skelly, Thomas Petska, and Susan Powers for their leadership and vision; to members of the Statistics of Income Division Consultants Panel for their more than three decades of guidance, encouragement, and support; and to Beth Kilss and Camille Swick for helpful suggestions and editorial support.
