Abstract
Previous studies have shown that modeling based on administrative records can be predictive of Nonresponse Followup (NRFU) enumeration outcomes in U.S. Census Bureau Decennial data collection operations. We compare model predictive power when varying training data sources and evaluate the extent to which survey data can be used to reduce enumerator workload when combined with available administrative data. We perform the evaluation using the 2010 Census and the 2014 American Community Survey. Our main finding is that a large survey-based training dataset, such as the American Community Survey, can provide results comparable to Census data. Robustness checks then illustrate that even small sample survey-based training datasets can also yield comparable predictions. We also discuss a broader role for use of existing survey data in NRFU operations of statistical agencies outside of the United States when national Census or administrative data sources have only incomplete coverage of the population.
Introduction
In preparation for the next Decennial Census of Population and Housing (hereafter Decennial Census or Census) in 2020, the U.S. Census Bureau is seeking to reduce the costs associated with conducting the Census. During the 2010 Census, the largest contributor to cost was the Nonresponse Followup (NRFU) operation, which cost over $2 billion. The purpose of the NRFU operation was to obtain responses for those households and individuals who did not self-respond. This operation involved up to six visits by enumerators to each household. One idea for reducing the number of NRFU personal visits in the upcoming 2020 Census is to use administrative record data to assess occupancy, manage enumerator NRFU workload (the number of addresses, or housing units, that enumerators must visit), and enumerate some households. Several recent studies have proposed and evaluated models for such a purpose, including [1, 2, 3]. These studies have relied on data from the previous Decennial Census to estimate, or train, predictive models of NRFU outcomes. Those responses and the linked administrative records are becoming dated as the next Census nears. In this paper, we take a first step toward expanding the source of training data to more current survey-based sources to validate the feasibility of their use in NRFU operations. Our findings are robust across a variety of experiments. Perhaps more strikingly, we also show that even a small sample survey-based training data source can yield estimates of comparable quality to the Census baseline. This is particularly relevant given survey non-response and the cost of administering a large survey.
We use the American Community Survey (ACS), the U.S. Census Bureau’s largest household survey, to train models for NRFU workload removal.2 The ACS is designed to be a nationally representative yearly survey with a sample frame of about 3.5 million addresses. Data are collected monthly, over the course of each year. The annual nature of the survey allows for the collection of more timely data to analyze current demographic change than did the long form of the Decennial Census and it provides small area estimates over rolling, 5-year windows.3
We compare NRFU workload removals generated with a single-year ACS sample to those based on the 2010 Decennial Census, which has served as the benchmark data source for NRFU operations in the upcoming 2020 Census. To analyze the data, two stages of modeling are used as described in [7]; the first removes vacant housing units and the second selects occupied housing units for administrative record enumeration. The occupied modeling stage includes a household composition model and a person-place model. The present paper focuses on results for the person-place model, described later. These models are chosen to be representative of the current methodology actively tested by the U.S. Census Bureau for use in the 2020 Decennial Census.4 In this analysis, we compare the accuracy of predictions for NRFU outcomes generated from models trained, alternately, on the 2010 Census and 2014 ACS. Our evaluation sample is the 2015 Census Test, which was conducted to test Census data collection methodology in Maricopa County, Arizona.
Our main result is that we find comparable match rates in counts and compositions for the ACS and the 2010 Census. The overall conclusion that ACS and 2010 Census training modules perform similarly is robust to variation in the ACS training data by geography, survey month, response mode, and stringency of the removal cutoff. Robustness analyses of various subsamples indicate that even a small sample of the ACS can yield comparable match rates to the baseline model. This outcome suggests that much smaller surveys than the ACS could be used to train the model.
This paper proceeds as follows. We introduce the models and modifications specific to the ACS implementation in Section 2. We discuss the data sources and particularities of the ACS data in Section 3. We discuss the results evaluating 2010 Census and ACS trained models in Section 4. We conclude in Section 5.
Methodology
This analysis uses as a baseline the methodology described in [7, 8], which was implemented for the NRFU operation of the 2016 Census Test. The methodology targets occupied addresses for removal from NRFU after a single visit based on two stages of modeling and the application of a decision rule.5 The models are estimated using household response outcomes as the dependent variable and administrative records (AR) as explanatory variables. We summarize these first and second stage models below, but note that our results cover only occupied addresses because our interest is in the enumeration of occupied addresses and the characteristics of their occupants.
Model
The first stage model is the vacancy (VAC) model, which is used to assign a housing unit, indexed by
We represent each outcome of
The second stage includes the “person-place” (PP) model [9], which explains the agreement of administrative records with responses using personal identifying information. Define
where
where
Covariates for both stages of modeling include block group characteristics from the 2010 Census as well as address and person linked administrative records on the housing unit, the overall household, and persons in the household. These are documented with more detail in [5].
The second stage also includes the “household composition” (HHC) model, which explains the agreement of administrative records on persons and their ages with the combination of adult and child responses at a housing unit. HHC, a multinomial choice outcome, is intended to capture the differential utility of administrative records by housing unit size and presence of children.6 The results for the PP and HHC models are broadly similar, so although we still use the HHC model for some calculations, we only present results for the PP model. The HHC model is detailed in [5].
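The household composition categories listed in the footnotes can be written as a small mapping from adult and child counts to a category code. The following is a minimal sketch: the function name `hhc_category` is hypothetical, the inputs are assumed to be counts already derived from respondent ages, and the age cutoff separating adults from children is not stated here and is therefore left outside the function.

```python
def hhc_category(adults: int, children: int) -> int:
    """Map adult/child counts at a housing unit to the eight composition
    categories given in the footnotes (1 = 0 occupants, ..., 8 = other)."""
    if adults < 0 or children < 0:
        raise ValueError("counts must be non-negative")
    if adults == 0 and children == 0:
        return 1  # 0 occupants
    if 1 <= adults <= 3:
        # 1 adult -> 2 or 3, 2 adults -> 4 or 5, 3 adults -> 6 or 7;
        # the odd code in each pair marks the presence of children.
        return 2 * adults + (1 if children > 0 else 0)
    return 8  # other: four or more adults, or children with no adult


# Illustrative inputs only, not data from the paper.
examples = [hhc_category(0, 0), hhc_category(1, 2),
            hhc_category(2, 0), hhc_category(4, 1)]
```

Under this sketch, a unit with children but no adults falls into the "other" category, which is an interpretation of the category list rather than something the text confirms.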
The ultimate determination of whether to remove a unit from the NRFU workload is based on a “distance function,” with a root, sum of squared-errors formula. For all housing units that are determined to be occupied based on administrative records (from the VAC model), a distance, or
where equal weight is given to
Our evaluation strategy is to compare the outcomes of using the ACS-trained and Census-trained models to select different removal samples of NRFU addresses. For each sample of removed addresses, we compare the counts obtained from AR with the counts reported in NRFU for the 2015 Census Test. This strategy follows the same methodology used by [10]. Given the limitations of ACS for determining the vacant status of an address, we focus on explaining the count of occupied units.8
For both the ACS and 2010 Census trained models, the 3,400 units with the smallest value of the distance function (with a high likelihood of concordance with administrative records) are enumerated using administrative records and evaluated against NRFU fieldwork records from the 2015 Census Test. Our removal count of 3,400 corresponds with the number used in Census Bureau model development and testing, approximately 15% of NRFU addresses (see [10, 8]). Within this sample of addresses, the goal is to evaluate the success of the modeling process in identifying records that can be accurately enumerated via administrative records.
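The removal rule can be sketched as follows. This is an illustration under a stated assumption: the distance is taken to be the Euclidean distance between a unit's predicted PP and HHC agreement probabilities and the ideal point (1, 1), with equal weight on the two terms, which is one reading of the "root, sum of squared-errors" description. The function names and the tuple layout are hypothetical and do not represent the Census Bureau's production code.

```python
import math


def distance(p_pp: float, p_hhc: float) -> float:
    """Assumed form: root sum of squared errors from perfect agreement."""
    return math.sqrt((1.0 - p_pp) ** 2 + (1.0 - p_hhc) ** 2)


def select_removals(units, n_remove=3400):
    """units: iterable of (mafid, p_pp, p_hhc) for AR-occupied units.
    Returns the MAFIDs of the n_remove units with the smallest distance."""
    ranked = sorted(units, key=lambda u: distance(u[1], u[2]))
    return [mafid for mafid, _, _ in ranked[:n_remove]]


# Toy probabilities, not results from the paper.
toy_units = [("A", 0.95, 0.90), ("B", 0.60, 0.70), ("C", 0.99, 0.97)]
selected = select_removals(toy_units, n_remove=2)
```

Fixing the number of removals rather than a distance threshold means the two training datasets are compared on removal samples of equal size.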
Note that since covariates and sample size differ between the ACS and Census-based models, there is not a good direct comparison of model coefficients or goodness-of-fit. Our means of comparison between ACS and Census-based predictions is in the match rates of the addresses removed. We also compare the sensitivity of the removal samples and match rates when the scope of the ACS training data is varied to find a minimum survey size beyond which match quality declines substantially.
Data
We construct two main training datasets and one evaluation dataset. The 2010 Census and 2014 ACS are each used to develop a training set, while 2015 Census Test data are used for evaluation.9 Contemporaneous administrative records as well as neighborhood and address information are used in all three datasets. We summarize these datasets in the following paragraphs.
Administrative records come from several sources. The Internal Revenue Service (IRS) provides administrative records composed of Individual Tax Returns (Form 1040) filed at any time during 2009, 2013, and 2014. We also use records of returns filed in weeks 4–17 of 2010, 2014, and 2015. We also use IRS Information Returns (Form 1099) filed in 2010, 2014, and 2015. In addition, we use Medicare enrollment data from the Centers for Medicare & Medicaid Services (CMS) and the Indian Health Service Patient Database.
We supplement these administrative records from federal government sources with information from the TARGUS database. This is a commercial data source that provides person verification. We also make use of data from the United States Postal Service (USPS) to inform the model with undeliverable-as-addressed (UAA) flag and reason (for 2010 model only). Data from these administrative records sources are matched with person and place observations in the 2010 Census, 2014 ACS, and 2015 Census Test when possible.
We use the 2010 Census as the baseline training dataset to which we make our comparisons. We restrict our use of 2010 Census data to the universe of NRFU cases in Arizona only. The restriction to Arizona cases coincides with the state chosen for the 2015 Census Test, which may improve model fit if associations of outcomes and variables differ across populations and geography. With this restriction, the 2010 Census training dataset has fewer records than our national ACS survey sample defined below (approximately 650,000 compared to 2.4 million, respectively). We use respondent age variables to construct counts and household composition (by age) variables at each address.
The 2014 American Community Survey (ACS) is our alternative training data source for count and composition. The ACS replaced the Decennial long form after 2000 by collecting similar information throughout the decade. Data used in this report is from an intermediate 2014 ACS file that includes both respondents and non-respondents. In particular, the dataset includes non-respondents that were sub-sampled out due to unmailable or non-responding addresses that were not referred to a telephone-based or in-person followup. See [6] for a detailed description of sampling methodology.
Our evaluation data set is the 2015 Census Test. This test took place between April 1, 2015 and August 14, 2015. The purpose of the test was to evaluate methods used to reduce fieldwork and data collection. The test site included several areas within Maricopa County in Arizona. See [10] for more details on the location and methodology. For consistency, we restrict our analysis to Census Test cases that underwent the same followup methodology as used in the 2010 Census. We used respondent age variables to construct counts and household composition (by age) variables at each address.
Population count comparison for resolved true positive AR occupied cases
Note: Comparisons are for the 3,400 housing units, or MAFIDs, selected for removal under each training dataset. Table should be read as AR count relative to the 2015 Census Test NRFU count.
In this section, we first establish a baseline version of the AR model trained using the 2010 Census and evaluated for the 2015 Census Test. Next, we implement the same model estimation and evaluation for 2014 ACS with a sample including all records except those subsampled out of the ACS NRFU. Last, we implement several extensions, considering alternate ACS training samples that vary by time, geography, and sample size. The evaluation framework, in terms of comparing counts and compositions with the 2015 Census Test, is styled after [10].
Comparison of training modules
We first summarize the performance of the model estimated using 2010 Census data. The first row of Table 1 shows the distribution of count discrepancies by the magnitude of the differences in AR relative to NRFU fieldwork records using Census data. Column 2 shows the number of households, while columns 3–7 show the percentage of households disagreeing with NRFU by varying degrees. The last column shows the percentage of households with an unknown fieldwork population count. We find that the household counts from the AR enumeration coincide with NRFU counts for 56.6 percent of addresses. Not only the match rate, but also the distribution of non-matches is important for overall counts.10 Administrative records overcount household population by 2 or more individuals at 6.7 percent of addresses and by one individual at 11.8 percent of addresses. Undercounts follow a roughly similar distribution, as can be seen in columns 6 and 7. The general symmetry in the distribution of over and undercounts suggests that, on an aggregate level, AR enumeration would not substantially contribute to over and undercounting of the population, though we do not examine this in aggregate.11 Overall, administrative record enumerations match household population counts in the Census Test within one individual for 83.4 percent of addresses.
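A tabulation of this kind can be sketched as below. The category labels, the function names, and the choice to keep addresses with an unknown NRFU count in the denominator are assumptions made for illustration; the exact column layout of Table 1 is not reproduced here.

```python
from collections import Counter

CATEGORIES = ["AR over by 2+", "AR over by 1", "match",
              "AR under by 1", "AR under by 2+", "NRFU unknown"]


def classify(ar_count, nrfu_count):
    """Label one address by the AR count relative to the NRFU count."""
    if nrfu_count is None:
        return "NRFU unknown"
    diff = ar_count - nrfu_count
    if diff >= 2:
        return "AR over by 2+"
    if diff == 1:
        return "AR over by 1"
    if diff == 0:
        return "match"
    if diff == -1:
        return "AR under by 1"
    return "AR under by 2+"


def discrepancy_table(pairs):
    """pairs: (ar_count, nrfu_count) per removed address.
    Returns the percentage of addresses in each category."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("no addresses to tabulate")
    counts = Counter(classify(ar, nrfu) for ar, nrfu in pairs)
    return {c: 100.0 * counts[c] / len(pairs) for c in CATEGORIES}


# Toy counts, not data from the 2015 Census Test.
table = discrepancy_table([(2, 2), (3, 2), (1, 2), (4, 1), (2, None)])
within_one = table["AR over by 1"] + table["match"] + table["AR under by 1"]
```

In this sketch the within-one share is the sum of the match, over-by-one, and under-by-one shares.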
A comparison of model predictions for the ACS trained and 2010 Census trained models finds overall similarity in the accuracy of predictions for count and household composition. The baseline and most complete ACS trained model matches the NRFU responses in count with a rate of 58.7 percent compared to 56.6 percent for the 2010 Census model.12 The ACS trained model also has comparable percentages of addresses where the AR count is one person more or fewer than the NRFU count, and of addresses where the NRFU count is unknown, relative to the 2010 Census trained model.
These results suggest that the mixed approach, using the 2010 Census to evaluate vacancy and ACS for the PP and HHC models, minimized the impact of not having detailed UAA codes for ACS. Furthermore, the year-round sampling and smaller sample size do not seem to have resulted in worse, overall accuracy for the ACS trained model. While the ACS trained model achieves a slightly higher agreement rate, we do not regard this difference to be of sufficient magnitude to conclude that the ACS is actually a superior training module. Rather, these results suggest that the ACS would be an appropriate substitute for evaluating and updating the model and incorporating new administrative records.13
Household population count disagreement rates, by predicted error, or distance rank, for the 2010 Census and ACS training datasets. Note: Quantile bins of housing units are ranked ascending in distance function value (see Eq. (4)) for the 2010 Census and 2014 ACS models. Lower ranked bins are removed with higher priority. The first 10 bins constitute the 3,400 units in the removal sample for each model. The disagreement rate is the complement of the match rate for each bin, in percentage terms.
Figure 1 demonstrates the tradeoff of the quantity of records removed and the disagreement rate in population counts. It illustrates this tradeoff for both the 2010 Census and ACS trained models. The horizontal axis lists bins of the distance rank for each module, with 20 bins encompassing the 6,800 records with the lowest distance scores (bins 1 through 10 contain the 3,400 units in the removal sample). The 20th bin includes the records removed with the least confidence. For each bin, the vertical axis gives the disagreement rate, constructed as one minus the agreement, or match rate, in percent terms. The 2010 Census and ACS trained modules appear to have a similar tradeoff across the full range of the distance function presented here.
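The binning behind this figure can be sketched as follows, assuming equal-sized bins (6,800 records over 20 bins gives 340 records per bin, so bins 1 through 10 hold the 3,400 removed units). The function name and input layout are hypothetical, and how addresses with an unknown NRFU count enter the rate is not specified here, so the sketch simply takes a match flag per unit.

```python
def disagreement_by_bin(units, n_bins=20, bin_size=340):
    """units: list of (distance, matched) pairs, matched being True when the
    AR count equals the NRFU count. Returns the disagreement rate in percent
    for each bin, with bins ordered by ascending distance."""
    ranked = sorted(units, key=lambda u: u[0])[: n_bins * bin_size]
    rates = []
    for start in range(0, len(ranked), bin_size):
        chunk = ranked[start:start + bin_size]
        matches = sum(1 for _, matched in chunk if matched)
        # Disagreement rate is one minus the match rate, in percent.
        rates.append(100.0 * (1.0 - matches / len(chunk)))
    return rates


# Toy example with two bins of two units each; the fifth unit is cut off.
toy = [(0.4, False), (0.1, True), (0.9, False), (0.2, True), (0.3, True)]
toy_rates = disagreement_by_bin(toy, n_bins=2, bin_size=2)
```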
Comparison of matches in count
Note: Match % gives the correspondence of person counts at removed addresses to NRFU count (see Table 1). ACS-NRFU training based on ACS MAFIDs where telephone and in-person follow-ups (CATI and CAPI) were conducted due to non-response of the mail-in form. Percentage subsamples of ACS do not precisely correspond with the Baseline sample as the Baseline excludes non-respondents who were sub-sampled out of the ACS frame due to non-response.
In this section, we summarize the overall match rates in counts by various definitions of ACS data used to estimate our predictive models and compare them to our baseline 2010 Census dataset. Resources for training an agreement model may vary, with datasets differing in size, geography, currency, and operational constraints.14 We present matching results for a range of training datasets in Table 2. The first two rows repeat the earlier results. The key feature seen from Table 2 is that changes in the ACS geography (same state as 2015 Census Test), timing (April responses only or survey panels from February to July), and response type (NRFU only) used to estimate the models have negligible effects on the accuracy of counts.
While we find minimal differences in match rates with these variations in definition, a reduction in the estimation sample size alone could also decrease the accuracy of predictions. This consideration may be especially relevant for international applications when a survey data source is available, but a national Census or administrative records infrastructure is not available or is incomplete. Many countries lack administrative records systems (also known as register-based systems) that can be used for enumeration without the use of Census or survey data. The United States lacks a central administrative records infrastructure, which necessitates combining existing administrative records with additional sources [12]. These hybrid data collection methods are common to other countries as well. A description of national Census data collection in other countries can be found in [13, 14].
How do the match rates vary with a reduction in the ACS training dataset? The last five rows of Table 2 detail match rates in counts and compositions with a reduced ACS sample. The composition match rates remain comparable to the baseline ACS estimates even when the ACS sample is reduced to 0.1%. A 0.1% sample of the ACS represents only 3,577 households with valid administrative records, far fewer than in the samples analyzed in previous sections. The results suggest that even a small, albeit well-designed, survey can serve to augment existing administrative records to reduce NRFU workload with little penalty in overall match rates.
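One way to draw such reduced training samples is sketched below. The text does not state how the percentage subsamples were drawn, so simple random sampling without replacement is an assumption here, and the function name and seed are hypothetical.

```python
import random


def subsample(records, fraction, seed=0):
    """Draw a simple random subsample without replacement (assumed design).
    fraction is a share of the records, for example 0.001 for 0.1%."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    records = list(records)
    k = max(1, round(fraction * len(records)))
    # A seeded generator keeps the draw reproducible across runs.
    return random.Random(seed).sample(records, k)


# Toy frame of 10,000 record identifiers, not ACS data.
frame = list(range(10_000))
small = subsample(frame, 0.001, seed=42)
```

Each reduced sample would then be used to re-estimate the models and repeat the removal and match-rate evaluation against the same 2015 Census Test records.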
To provide some perspective on the relative size of these samples, note that the Current Population Survey has an initial housing unit sample size of approximately 60,000 and the Survey of Income and Program Participation (SIPP) has a sample size of approximately 53,000. These are commonly used surveys produced by the Census Bureau. Both sample sizes are greater than 1% of the ACS sample defined in the table, which still yields match rates comparable to those found in the baseline. This suggests that other surveys might also perform well in this application.
Conclusion
This evaluation shows that the ACS performs comparably with the 2010 Census as a source of training data for AR models used for 2015 NRFU, with similar count predictions and a high degree of overlap in the record sets selected for removal. Year-round sampling and a smaller sample size seem to have had minimal effect on the model accuracy. These results indicate that model predictions based on administrative records are not especially sensitive to the differences of Census and ACS fieldwork. Further analysis shows that training the model on even a small percentage of the ACS sample still yields comparable match rates. This limited evaluation suggests the possibility that statistical agencies can employ already existing surveys to supplement administrative records in NRFU operations when Census data is unavailable or unreliable. Further research could examine if this outcome is robust across demographic and tenure groupings.
Footnotes
Extended versions of this paper are available as a CES working paper [4] or in JSM Proceedings [ ].
See presentations in 2017 JSM session number 222, “Administrative Record Research for the 2020 Census”, or presentations in Census Scientific Advisory Committee meetings (for example, “Algorithms for Including Administrative Data to Address NRFU Efforts” in
This paper refers to addresses or housing units, which are represented by a Master Address File ID, or MAFID, a unique record in the Census Bureau’s frame of residences.
We define housing unit composition categories in the training dataset to be 1) 0 occupants, 2) 1 adult, no children, 3) 1 adult, with children, 4) 2 adults, no children, 5) 2 adults, with children, 6) 3 adults, no children, 7) 3 adults, with children, 8) other.
The methodology as implemented using 2010 Census training data cannot be directly applied to 2014 ACS training data due to differences in data collected on mailing outcomes. In particular, the 2010 Decennial collects information from the United States Postal Service (USPS) on the reason for an undeliverable-as-addressed (UAA) return code. UAA reasons include insufficient address (i.e. mail without a number or street), no such number, unclaimed, deceased, and vacant, among others. This information is not available for the ACS. We therefore estimate the VAC model using 2010 Census only. This hybrid approach still allows us to use the more recent information on household counts from the ACS, but takes the best available information from the 2010 Census to inform the VAC model, which is upstream of occupied removal.
See [11] for a summary of the 2010 Census coverage measurement and [ ] for an analysis of AR coverage.
For household composition, the respective rates were 55.0 and 54.0 percent.
ACS fieldwork operations are different from the 2010 Census. In particular, ACS fieldwork begins with a telephone stage one month after the initial mailing and an in-person component that begins two months after the initial mailing. We attempt to gauge whether reference date inconsistency impacts the accuracy of model predictions by varying the timing of ACS training data used to estimate the HHC and PP model.
Acknowledgments
This research was conducted in coordination with the U.S. Census Bureau’s Decennial Statistical Studies Division. We thank Andrew Keller, Scott Konicki, Thomas Mule, Michael Ikeda, Ingrid Kjeldgaard, Darcy Morris, David Raglin, Larry Bates, Andrew Raim, Erika McEntarfer, Shawn Klimek, and Lucia Foster for assistance and comments.
