Abstract
This paper investigates the utility of commercial property tax data from CoreLogic, Inc. (CoreLogic), aggregated from county and township governments across the country, for improving American Community Survey (ACS) estimates of property tax amounts for single-family homes. In particular, the research uses linkages of the CoreLogic file to the 2010 ACS to evaluate the direct use of CoreLogic data to replace survey responses for estimation of property tax amounts, potentially reducing measurement error and respondent burden. I find that the coverage of CoreLogic data varies among geographic areas across the U.S., as does the correspondence between ACS and CoreLogic property taxes. Large differences between CoreLogic and ACS property taxes in some instances may reflect conceptual differences between what is collected in the two data sources for certain counties. This research draws attention to the challenges of using non-survey data sources that are aggregated from many state or local agencies with different practices for data collection and curation.
Introduction
Administrative records and commercial data can be inexpensive data sources for official statistics and offer some strengths that mitigate weaknesses of censuses and surveys. In particular, surveys place burden on respondents, are subject to errors in responses and can have high levels of nonresponse. Administrative records and commercial data, when of sufficient quality, can be less prone to errors in recordkeeping and offer broad coverage of the population. In some cases, they can even eliminate the need for questions on surveys. Yet, quality can vary across different data sources, as administrative records and commercial data are not collected for statistical purposes. Thus, careful evaluations are needed before using administrative records or commercial data for statistical products.
This paper evaluates commercial property tax data available from CoreLogic, Inc. (CoreLogic) for improvement of survey estimates of property tax amounts from the American Community Survey (ACS), examining 2010 data. CoreLogic aggregates property tax records from counties and townships across the country into one dataset. I focus on single-family homes, for which the record linkage is less challenging than for multi-unit structures.
Specifically, I evaluate whether the CoreLogic data are of sufficient quality that the data can be used in place of asking a question about property taxes on the ACS or to substitute for ACS responses. A major concern for the ACS is the respondent burden from the survey length and content. Thus, I consider the possibility of using CoreLogic alone to construct property tax estimates for geographic areas across the U.S. Separate research investigates the utility of CoreLogic data to potentially mitigate the effects of survey nonresponse [1].
I find that the quality of the CoreLogic data varies among counties and townships across the country, both in the coverage of the CoreLogic data and in the correspondence between ACS and CoreLogic property tax values. In some counties, large differences are found between the ACS and CoreLogic records, possibly due to conceptual differences between what is collected in the two sources. This demonstrates the challenge of using data aggregated from many state and local agencies. These findings do not support the use of CoreLogic nationwide in place of asking about property taxes on the ACS. Nonetheless, there may be counties where CoreLogic can be viewed as a “gold standard” for property tax amounts. Further research could work to identify these counties and townships and determine if the CoreLogic data should be used in place of survey responses.
Section 2 discusses background literature related to reporting error for housing statistics and the use of administrative records and commercial data to address survey reporting error. Section 3 provides an overview of the CoreLogic property tax data file, while Section 4 investigates the quality of the CoreLogic data and compares ACS and CoreLogic reports of property taxes. Section 5 concludes by discussing the implications of the research both for using the CoreLogic data for ACS property tax estimates and more broadly for other uses of commercial data for federal statistical products.
Background
The American Community Survey (ACS) is one important source of housing statistics for the U.S. The large sample size of the ACS allows for producing estimates in geographic areas across the U.S., including census block groups for the ACS 5-year estimates. ACS property tax estimates are used for research on housing affordability, for determining formula block grant funds, for mass transportation and metropolitan planning, for determining eligibility for housing assistance, and to inform efforts to plan affordable housing [2, 3].
Reporting error on surveys is a concern for estimates of property tax amounts. A content reinterview survey of the 2012 ACS found a moderate level of disagreement [4], indicating the potential for reporting error. While property tax amount response error has not been thoroughly studied, past research has found reporting error to be a concern when studying the related topic of home value. A comparison of survey responses on the 1979–1991 American Housing Survey metropolitan samples with the sale prices of homes that were sold in the twelve months before the survey interview found that survey responses tended to be higher than selling prices [5]. A separate study compared survey-reported home values from the Health and Retirement Study to sales prices and also found that the survey responses were greater than sales prices [6].
While there are few examples of commercial data being used to address response error for government statistics, research has shown some compelling examples using administrative data. Much of the research in this area has pertained to program receipt. For example, the Census Bureau is using Social Security Administration data linked to the Survey of Income and Program Participation to correct responses about Supplemental Security Income receipt and disability insurance receipt [7]. Medicaid records have been used to adjust Current Population Survey estimates of Medicaid for underreporting [8]. Other studies have examined linking the Current Population Survey with administrative records for the Supplemental Nutrition Assistance Program, Temporary Assistance for Needy Families, General Assistance and housing assistance to improve estimates of program receipt and poverty [9, 10].
CoreLogic data
The CoreLogic, Inc. 2008–2010 property tax file (CoreLogic) aggregates property tax records from counties and townships across the U.S. While the majority of the records on the file are listed as from 2009, there are also records from 2008 and 2010. The full file contains more than 169 million records and includes information on a rich set of housing characteristics, including property value, tax amount, physical and structural characteristics, mortgage, sales and ownership information and geography. The fields available can differ among counties and townships.
Using the geographic and address information from CoreLogic records, the Census Bureau linked the CoreLogic file to the Census Bureau’s Master Address File (MAF), through which CoreLogic records are linked with records from the ACS and other Census Bureau products. The linkage procedure standardizes addresses and uses information on house number, street prefix, directional prefix, street name, street suffix, direction suffix, apartment number/description, and five-digit zip code to conduct the linkage [11]. A two-step probabilistic matching process is conducted.
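To make the idea of address standardization concrete, the following is a minimal sketch and not the Census Bureau's procedure. It replaces the two-step probabilistic match with a simple exact match on a standardized key, and the abbreviation tables, function names and toy records are all hypothetical.

```python
import re

# Hypothetical abbreviation tables; a production standardizer would use
# far larger dictionaries and handle many more address components.
SUFFIXES = {"STREET": "ST", "AVENUE": "AVE", "ROAD": "RD", "DRIVE": "DR"}
DIRECTIONS = {"NORTH": "N", "SOUTH": "S", "EAST": "E", "WEST": "W"}


def standardize(address, zip_code):
    """Reduce a free-text address to a comparable key (illustrative only)."""
    # Drop punctuation, upper-case, and split into tokens.
    tokens = re.sub(r"[^A-Za-z0-9 ]", " ", address).upper().split()
    # Abbreviate street suffixes and directionals to a common form.
    tokens = [SUFFIXES.get(t, DIRECTIONS.get(t, t)) for t in tokens]
    # Keep only the five-digit zip code.
    return (" ".join(tokens), zip_code[:5])


def link(commercial, address_file):
    """Map commercial record ids to address-file ids sharing the same key."""
    index = {standardize(addr, z): file_id
             for file_id, (addr, z) in address_file.items()}
    links = {}
    for rec_id, (addr, z) in commercial.items():
        key = standardize(addr, z)
        if key in index:
            links[rec_id] = index[key]
    return links


# Toy records, invented for illustration.
commercial = {"c1": ("123 North Main Street", "60201-1234"),
              "c2": ("9 Elm Rd.", "60202")}
address_file = {"m1": ("123 N. MAIN ST", "60201"),
                "m2": ("77 Oak Ave", "60203")}
print(link(commercial, address_file))
```

An exact match on a standardized key misses records with misspellings or missing components, which is one reason a probabilistic approach is typically preferred for this kind of linkage.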
Challenges in linking CoreLogic to the MAF are further described in [11]. Overall, 63.4 percent of records are linked to the MAF. In a study of the linkage of CoreLogic to the 2009 American Housing Survey through the MAF, 79.0 percent of single-unit structures were successfully linked, compared with only 14.8 percent of multi-unit structures.
I examine single-family, owner-occupied records from the ACS and CoreLogic, due to the greater availability of linked CoreLogic data for single-unit structures and because the ACS asks only owner-occupied households about their property taxes. Certainly, future research could investigate the quality of CoreLogic information for renter-occupied units.
Previous research conducted by Census Bureau researchers has studied using CoreLogic data for estimates of home values and year that a structure is built. A study of how CoreLogic and 2009 ACS home values compare for single-family homes found that ACS home values tend to be higher than the values from CoreLogic [12]. The difference between ACS and CoreLogic home values tends to increase with the time since the last move, which suggests that recent movers better estimate the value of their homes. An evaluation of CoreLogic data for the year that a structure is built in the 2012 ACS finds that 56.7 percent of single-family, detached homes in the ACS can be linked to CoreLogic records with year built information available [13]. Further, agreement for the time range in which a structure was built between ACS and CoreLogic was found for 78.3 percent of the linked records with reported year built information.
Results
Availability of linked CoreLogic data
In the 2010 ACS file, there are 1,116,568 records for single-family, owner-occupied households. Among these, 69.1 percent were linked to CoreLogic records with property tax information available. When property tax information was not available, it may have been due to one of a few reasons: that no corresponding record was available from CoreLogic, that the CoreLogic record was available but the linkage to the ACS was not successful, or that a CoreLogic record was linked but the record did not contain property tax information. I exclude linkages when the ACS report of the year the structure was built was 2009 or 2010, as the CoreLogic file may not have been updated to reflect the recently built structures.
Percentage of ACS records linked to CoreLogic property tax information by state
Source: 2010 ACS single-family, owner-occupied households linked to 2008–2010 CoreLogic data.
The availability of CoreLogic property tax information varies across states, counties and townships. The match rates for states are presented in Table 1. In Nevada, 89.6 percent of single-family, owner-occupied households in the 2010 ACS are linked to CoreLogic property tax information, while linked CoreLogic tax information is not available in Montana, New Hampshire or Vermont.
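A match rate of this kind is the share of ACS records in a group that carry linked CoreLogic property tax information. The sketch below shows the computation on invented toy records. The function name and data layout are hypothetical, and the calculation is unweighted; whether the reported rates use survey weights is not stated here.

```python
from collections import defaultdict


def match_rates(records):
    """Return {group: percent of records with linked tax information}.

    records is an iterable of (group, has_linked_tax) pairs, where the
    group could be a state, an urban/rural flag, or a poverty indicator.
    """
    totals = defaultdict(int)
    linked = defaultdict(int)
    for group, has_tax in records:
        totals[group] += 1
        linked[group] += bool(has_tax)
    return {g: 100.0 * linked[g] / totals[g] for g in totals}


# Toy records, invented for illustration; not ACS data.
toy = [("NV", True), ("NV", True), ("NV", False), ("NV", True),
       ("VT", False), ("VT", False)]
rates = match_rates(toy)
print(rates)
```

The same function applies to the household characteristics discussed next by changing the grouping variable.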
ACS match rates with CoreLogic property tax information by household characteristics
Source: 2010 ACS single-family, owner-occupied households.
The availability of linked CoreLogic tax information also varies by household characteristics, as presented in Table 2. Notably, 78.5 percent of ACS households in urban areas are linked to CoreLogic tax information, compared with only 53.0 percent of ACS households in rural areas. Households of higher socioeconomic status are also better represented among linked CoreLogic records than are households of lower socioeconomic status, a finding similar to that found in other studies of administrative record linkage to surveys [14]. Of households not in poverty, 69.6 percent have linked CoreLogic information compared with only 60.7 percent of households in poverty. When the householder is a college graduate, 73.7 percent of households have CoreLogic information compared with only 62.5 percent of households where the householder did not graduate high school. In Table 3, which compares characteristics for ACS records with and without linked CoreLogic property tax information, the median household income for records with CoreLogic information is almost $68,000 while the median household income for records without CoreLogic information is about $56,000. These findings demonstrate a strong association between the availability of CoreLogic data and household socioeconomic status and education. Further, these relationships hold within counties, as shown by estimating a multivariate logistic regression model among ACS records with CoreLogic property tax data using county fixed effects to condition on geographic differences [1].
ACS characteristics for records with and without linked CoreLogic property tax information
Source: 2010 ACS single-family, owner-occupied households.
In order to evaluate the CoreLogic data, I compare responses for property taxes in CoreLogic and the 2010 ACS. A major challenge in interpreting the comparisons is that both data sources may be prone to errors. The ACS suffers from respondent error, and CoreLogic data are only as accurate as the property tax records aggregated from counties and townships by CoreLogic. Nonetheless, comparing property taxes from the two data sources can help with evaluating CoreLogic’s usefulness and help better understand errors in ACS responses.
Across the U.S., there is an overall Pearson correlation of 0.724 between ACS and CoreLogic property taxes when both are reported and available. Since ACS and CoreLogic records are linked, considering the percentage difference between ACS and CoreLogic property taxes is useful. The percentage difference is defined to be
\[ \text{Percentage difference} = 100 \times \frac{\text{ACS} - \text{CoreLogic}}{\text{CoreLogic}}, \]
where ACS and CoreLogic are the respective property tax measures from the two sources.
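As an illustration of these two summary measures, the following sketch computes a Pearson correlation and a record-level percentage difference for linked pairs. It assumes the percentage difference is taken relative to the CoreLogic amount, that is, 100 × (ACS − CoreLogic) / CoreLogic, which matches the phrase "percentage difference of ACS property taxes from CoreLogic property taxes". The function names and tax amounts are invented for illustration.

```python
import math


def pct_diff(acs, corelogic):
    """Percentage difference of the ACS amount from the CoreLogic amount.

    Assumes CoreLogic is the base of the comparison and is nonzero.
    """
    return 100.0 * (acs - corelogic) / corelogic


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Toy linked pairs of annual property tax amounts in dollars (invented).
acs = [2000.0, 3500.0, 1200.0, 5000.0]
corelogic = [2000.0, 3000.0, 1500.0, 5200.0]

diffs = [pct_diff(a, c) for a, c in zip(acs, corelogic)]
r = pearson(acs, corelogic)
print(diffs, r)
```

A positive value means the survey response exceeds the CoreLogic amount, and a negative value means it falls below.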
Distribution of percentage difference of ACS property taxes from CoreLogic property taxes by household characteristics
Source: 2010 ACS single-family, owner-occupied households linked to 2008–2010 CoreLogic records.
Table 4 presents quantiles of the percentage difference for linked records by different household characteristics. Overall, the median percentage difference is 0.0 percent. The 5
Interestingly, the interquartile range does not vary as much by the year moved, suggesting that survey recall of property tax amounts differs from patterns for home values found in research [5]. However, while the interquartile range is not as sensitive to the year moved, the 5
Distribution of percentage difference of ACS property taxes from CoreLogic property taxes by county
Source: 2010 ACS single-family, owner-occupied households linked to 2008–2010 CoreLogic records in select large counties.
While comparisons by household characteristics may reflect patterns in ACS response error, comparing ACS and CoreLogic property taxes by geographic area can possibly help with understanding errors in the CoreLogic data. As the property tax data are maintained by different authorities for each county and township, it is not surprising that CoreLogic’s quality and accuracy vary by county. Some patterns emerge by examining statistics for the percentage difference by large county in Table 5 and Fig. 1.
Boxplots of percentage difference of ACS property taxes from linked CoreLogic property taxes by select counties. Whiskers indicate 5
Across counties, the distributions of the percentage difference between ACS and CoreLogic property taxes can differ greatly. Many counties have a median percentage difference near 0.0 percent. However, there are some geographic areas for which ACS and CoreLogic property taxes disagree. For example, four counties in Texas have median percentage differences less than
Examining the interquartile range as a measure of spread of the percentage difference can help with assessing the accuracy of CoreLogic property taxes compared with ACS numbers. Among the smallest interquartile ranges are those of Milwaukee County, WI (5.9 percent) and Wake County, NC (6.8 percent). On the other hand, Dallas County, TX has an interquartile range of 42.6 percent and Harris County, TX has an interquartile range of 79.0 percent. The spread of the percentage difference distribution for a county being much smaller than the distribution for the U.S., as for Milwaukee County and Wake County, may provide a reason to have more confidence in those counties’ CoreLogic data.
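The county comparison rests on a median and an interquartile range computed within each county. The sketch below shows one way to produce such a summary from record-level percentage differences. The county labels and values are invented, the function name is hypothetical, and the quartile convention is the default of Python's `statistics.quantiles`, which may differ from the one used for the reported figures.

```python
from collections import defaultdict
from statistics import median, quantiles


def summarize_by_county(rows):
    """Return {county: {"median": ..., "iqr": ...}} for percentage differences.

    rows is an iterable of (county, percentage_difference) pairs.
    """
    by_county = defaultdict(list)
    for county, diff in rows:
        by_county[county].append(diff)
    summary = {}
    for county, values in by_county.items():
        # quantiles(..., n=4) returns the three quartile cut points.
        q1, _, q3 = quantiles(values, n=4)
        summary[county] = {"median": median(values), "iqr": q3 - q1}
    return summary


# Toy percentage differences for two hypothetical counties.
rows = [("County A", d) for d in (-3.0, -1.0, 0.0, 1.0, 3.0)]
rows += [("County B", d) for d in (-60.0, -35.0, -10.0, 15.0, 40.0)]
summary = summarize_by_county(rows)
print(summary)
```

In this toy example both counties could be screened on the median alone, but only the interquartile range reveals how much more widely the differences are spread in the second county.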
The findings of this paper illustrate some of the major challenges with using commercial data for official statistics. As the CoreLogic property tax data are aggregated from counties and townships around the country, the quality of the data varies across geographic areas and is subject to the practices of each local property tax authority. The amounts recorded on property tax records may not reflect the property taxes that are actually billed. For example, in several counties, particularly in Texas, large differences between the CoreLogic and ACS property tax amounts indicate that the CoreLogic data may reflect a different concept than that measured by the ACS. Even in counties for which CoreLogic property taxes are comparable to ACS reports as measured by the median percentage difference, the spread of the percentage difference can be large, which casts doubt on using CoreLogic data directly for estimates.
Further research can help improve understanding of the challenges with such data sources as CoreLogic and inform future use of commercial data to improve survey-based estimates. First, if counties and townships can be identified where the CoreLogic data are a “gold standard”, then the Census Bureau should consider using CoreLogic data instead of survey responses in these counties. Further work would be needed to identify these counties. Obtaining a third independent data source with property tax information, if one can be found, is one possible way to verify the property tax data. It may also be helpful to hold discussions with local property tax authorities to better understand the data. In addition, even when commercial data sources do not constitute a gold standard, the data may still be valuable to improve estimates by providing auxiliary information, such as for imputation modeling or for small area estimation. Separate research [1] finds that using CoreLogic data modestly improves the predictive power of imputation models for property taxes, while having minimal impact on estimates. Further research can improve understanding of the value of commercial data as auxiliary information to support surveys, including accounting for when data are aggregated from many jurisdictions.
There are some limitations of the methods of this research and of its conclusions for using CoreLogic. First, the research focused on single-family homes and does not consider other kinds of structures. Prior studies have documented the difficulties of using CoreLogic for multi-unit structures in surveys. Future research can study using CoreLogic for ACS multi-unit structure property taxes, although additional challenges would likely emerge. Second, the research does not use a “gold standard” measure of property taxes to verify the CoreLogic records. Without a “gold standard” measure, assessing the accuracy of the CoreLogic data is limited to comparing CoreLogic records to the ACS, which is subject to response error.
Commercial data offer great promise for official statistics and can mitigate some weaknesses of surveys. However, the research demonstrates the set of challenges that can emerge when data are collected and maintained by many authorities throughout the country. As new approaches toward federal statistical products are considered in the future, careful evaluations of these data sources will continue to be needed.
Acknowledgments
This research was supported through a U.S. Census Bureau Dissertation Fellowship during my doctoral studies at Northwestern University through contract YA-1323-15-SE-0097. All views expressed are solely my own. There are many individuals who I want to thank for their support and advice for this research. I thank Tommy Wright and Amy O’Hara for making this research possible and for connecting me with researchers at the Census Bureau. Bruce Spencer, my doctoral advisor, has provided incredible guidance. This research has benefited from conversations with and comments from many individuals, including Trent Alexander, Stephen Ash, Aileen Bennett, Quentin Brummet, Shawn Bucholtz, Bob Callis, George Carter, Tamara Cole, Art Cresce, Diane Cronkite, Craig Cruse, Denise Flanagan Doyle, Larry Hedges, Howard Hogan, Don Jang, Andrew Keller, Ward Kingkade, Arend Kuyper, Tom Louis, Chris Mazur, Chuck Manski, Carla Medalia, Bonnie Moore, Darcy Steeg Morris, Tom Mule, Mary Mulry, Michaela Patton, Steven Pedlow, Tom Petkunas, David Raglin, Michael Ratcliffe, Jerry Reiter, Kristine Roinestad, Joe Schafer, Jacob Schauer, David Sheppard, Eric Slud, Matt Streeter, Lars Vilhuber, Adeline Wilcox, Ellen Wilson, Bill Winkler, and Alan Zaslavsky. Data analyses were conducted at the Chicago Census Research Data Center. I thank Trent Alexander, Stephanie Bailey, Quentin Brummet, Frank Limehouse, Joey Morales, Amy O’Hara, Danielle Sandler and others for their assistance at the Research Data Center.
