Abstract
Small area estimation is critical for a wide range of applications, including urban planning, funding distribution, and policy formulation. Individual-level population data, which typically include each individual’s socio-demographic characteristics and small area location, are a rich source of information for small area estimation. However, individual-level population data are often not made public due to confidentiality concerns. This paper describes the development of a public-use synthetic individual-level population dataset in the United States that can be useful for small area estimation. This dataset contains characteristics of housing type, age, sex, race, and Hispanic or Latino origin for all 308,745,538 individuals in the United States at the census block group level, based on publicly available aggregated data from the 2010 Census. We assess the validity of the synthetic data by comparing it with different data sources, and we show examples of how this dataset can be used in small area estimation.
Introduction
Small area estimates are statistical estimates of subpopulation characteristics in small geographic areas (e.g., counties, census tracts, census block groups) [Rao and Molina, 2015]. Various stakeholders (e.g., policymakers, planners, and analysts) are interested in these estimates as they use this information to understand local communities for purposes of policymaking, regional planning, and business development [Gonzalez and Hoza, 1978; Flowerdew and Goldstein, 1989]. To support small area estimation, statistical agencies regularly collect individual-level demographic, social, and economic data at fine geographic levels, primarily through censuses and administrative records [Brackstone, 1987]. Although these data enable direct estimation for small areas, due to the risk of revealing the identities of data subjects, the public release of such data is often restricted by privacy laws and regulations, such as Title 13 of the United States Code (13 U.S.C. § 9) and the Privacy Act of 1974 (5 U.S.C. § 552a). As a result, instead of releasing individual-level data, statistical agencies commonly disseminate preaggregated tables of data for small areas to the public.
Aggregated data can be useful in many applications, but it has limitations that prevent it from fulfilling the increasing demand for small area estimates driven by many stakeholders. One major limitation is that aggregated data often only contain at most three or four subpopulation characteristics at a time [Williamson et al., 1998]. For example, aggregated data from the United States Census Summary File 1 (SF1) are cross-tabulated for at most three of the four characteristics of age, sex, race, and Hispanic or Latino origin. The utility of publicly available aggregated data is limited for small area estimation involving more characteristics or cross-tabulations of characteristics that are not included in the current aggregated data products (e.g., cross-tabulations for all four characteristics of age, sex, race, and Hispanic or Latino origin).
To address the limitations imposed by aggregated data, various methods and software systems have been proposed to generate synthetic individual-level population data to support fine-grained decision making and micro-level analysis. One commonly utilized method for creating such synthetic population data is the iterative proportional fitting (IPF) approach. IPF employs an iterative process to adjust the weights assigned to each individual within a non-spatial individual-level survey dataset until the desired fit to the target aggregate constraints is achieved [Choupani and Mamdoohi, 2016; Lovelace et al., 2015; Simpson and Tranmer, 2005]. Several population synthesis systems have been developed based on IPF, including SYNTHESIS (Synthetic Spatial Information System) [Birkin and Clarke, 1988], PopGen [Konduri et al., 2016], and SPENSER (Synthetic Population Estimation and Scenario Projection Model) [Spooner et al., 2021]. The advancement of methods and systems for population synthesis has also led to the availability of open datasets in countries such as the United Kingdom [Lomax and Smith, 2017; Smith and Russell, 2018; Wu et al., 2022], Ireland [Farrell et al., 2012; Morrissey et al., 2015], and Canada [Prédhumeau and Manley, 2023].
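The iterative adjustment behind IPF can be illustrated with a minimal sketch of its classic two-dimensional table form. This is not the code of any of the systems cited above: the function name `ipf`, its parameters, and the toy numbers are all assumptions made for illustration. The seed table stands in for a cross-tabulation of a survey sample, and the targets stand in for published marginal totals.

```python
def ipf(seed, row_targets, col_targets, tol=1e-9, max_iter=1000):
    """Scale a 2-D seed table until its row and column sums match the targets.

    Illustrative only: `seed` plays the role of a survey cross-tabulation and
    the targets play the role of aggregate constraints for one small area.
    """
    table = [row[:] for row in seed]  # work on a copy of the seed
    for _ in range(max_iter):
        # Row step: rescale each row so that it sums to its target.
        for i, target in enumerate(row_targets):
            total = sum(table[i])
            if total > 0:
                factor = target / total
                table[i] = [value * factor for value in table[i]]
        # Column step: rescale each column so that it sums to its target.
        for j, target in enumerate(col_targets):
            total = sum(row[j] for row in table)
            if total > 0:
                factor = target / total
                for row in table:
                    row[j] *= factor
        # Column sums now match, so convergence is judged on the row sums.
        error = max(abs(sum(table[i]) - t) for i, t in enumerate(row_targets))
        if error < tol:
            break
    return table


# Toy example (made-up numbers): two age groups by two sexes.
seed = [[1.0, 2.0], [3.0, 4.0]]
fitted = ipf(seed, row_targets=[40.0, 60.0], col_targets=[30.0, 70.0])
```

The fitted table keeps the interaction pattern of the seed while reproducing both sets of margins, which is why the approach depends on having a suitable seed sample in the first place.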
However, the existing methods, software systems, and data have limitations. First, the IPF method typically relies on the availability of a non-spatial individual-level survey dataset for the specific study area. While such datasets are available in certain countries through sources such as IPUMS International (Integrated Public Use Microdata Series, International) [Minnesota Population Center, 2022], they may not be accessible in many other countries, especially those in underdeveloped regions and the Global South. This reliance can limit the general applicability of IPF-based methods and systems. In addition, while open synthetic population data have gained popularity in European countries and Canada, this trend is not mirrored in the United States, which limits the potential applications that can utilize such data.
The purpose of this paper is to describe a synthetic individual-level population dataset in the United States that is open and realistic and can be used to support small area estimation. This dataset is generated using a new method [Lin and Xiao, 2022; Lin and Xiao, 2023b] that solely relies on public aggregated data, which eliminates the need for individual-level survey data and can enhance replicability and generalizability across space and time. Specifically, we generate the synthetic data based on public census tables from the United States Census SF1. An optimization model is used to construct the synthetic data by minimizing the difference between summarized information of the synthetic population and statistics in publicly available census tables. The validity of the synthetic data is assessed by comparing it with published census tables as well as sampled national individual-level data.
Methods
We aim to generate synthetic population data for all 308,745,538 individuals in 220,334 block groups in the United States. Each individual has five socio-demographic characteristics: housing type, age, sex, race, and Hispanic or Latino origin (Table A.1). The following describes the process of synthetic data generation.
Materials
Census tables selected for synthetic data generation.
Sample rows from the H7Z (Hispanic or Latino Origin by Race) table. This is only for illustrative purposes and does not present all columns.
Optimization modeling
An optimization approach [Lin and Xiao, 2022; Lin and Xiao, 2023b; Lin, 2023] is used to construct the synthetic population data. We begin with a matrix representation of the individual-level population data that need to be synthesized. Let n denote the number of block groups covered by the individual-level data (n = 220,334), and d the number of characteristics for each individual in the data (d = 5). A predicate is formed to contain one value from each of the d characteristics. For example, ⟨Household, Under 5 years, Male, White alone, Not Hispanic or Latino⟩ is a predicate. The number of all possible predicates is denoted as m (m = 3 × 23 × 2 × 7 × 2 = 1,932). The individual-level data can then be represented using an m × n matrix
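The predicate space described above can be enumerated in a few lines. The sketch below is illustrative only: the dictionary keys are hypothetical names, integer codes stand in for the category labels listed in Table A.1, and the pairing of each count with a characteristic is inferred from the order in which the characteristics and the factors of m are given. It also assumes that each matrix entry stores the number of individuals matching one predicate in one block group.

```python
from itertools import product

# Number of categories per characteristic, read off m = 3 x 23 x 2 x 7 x 2.
# Key names are hypothetical; integer codes replace the labels in Table A.1.
LEVELS = {
    "housing_type": 3,
    "age": 23,
    "sex": 2,
    "race": 7,
    "hispanic_origin": 2,
}

# Every predicate is one combination of category codes, one per characteristic.
predicates = list(product(*(range(k) for k in LEVELS.values())))
m = len(predicates)

# Map each predicate to its row index in the m x n matrix.
row_of = {predicate: i for i, predicate in enumerate(predicates)}


def empty_count_matrix(n_block_groups):
    """Return an m x n matrix of zeros (rows: predicates, columns: block groups)."""
    return [[0] * n_block_groups for _ in range(m)]


# Toy usage with n = 3 block groups instead of the full 220,334.
counts = empty_count_matrix(3)
counts[row_of[(0, 0, 0, 0, 0)]][1] += 4  # four matching people in block group 1
```

At full scale the matrix has 1,932 × 220,334 entries, so a sparse or chunked representation would likely be preferable to nested lists.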
Data Records and Usage
Sample rows of the synthetic data.

Estimating the percentage of non-Hispanic White females aged 18 to 19 who live in households using synthetic data.
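An estimate of this kind reduces to filtering and counting individual records within each area. The sketch below is a minimal illustration on made-up records: the field names, category labels, and block group identifiers are hypothetical and may not match the released files, and it assumes the denominator is all individuals in the block group.

```python
from collections import defaultdict

# Made-up records; field names and labels are assumptions for illustration.
records = [
    {"block_group": "A", "housing": "Household", "age": "18 and 19 years",
     "sex": "Female", "race": "White alone", "hispanic": "Not Hispanic or Latino"},
    {"block_group": "A", "housing": "Household", "age": "18 and 19 years",
     "sex": "Male", "race": "White alone", "hispanic": "Not Hispanic or Latino"},
    {"block_group": "A", "housing": "Household", "age": "Under 5 years",
     "sex": "Female", "race": "White alone", "hispanic": "Not Hispanic or Latino"},
    {"block_group": "A", "housing": "Household", "age": "18 and 19 years",
     "sex": "Female", "race": "White alone", "hispanic": "Hispanic or Latino"},
    {"block_group": "B", "housing": "Household", "age": "Under 5 years",
     "sex": "Male", "race": "White alone", "hispanic": "Not Hispanic or Latino"},
    {"block_group": "B", "housing": "Household", "age": "18 and 19 years",
     "sex": "Male", "race": "White alone", "hispanic": "Hispanic or Latino"},
]

# The subpopulation of interest, expressed as required field values.
TARGET = {
    "housing": "Household",
    "age": "18 and 19 years",
    "sex": "Female",
    "race": "White alone",
    "hispanic": "Not Hispanic or Latino",
}


def percent_matching(rows, target, area_key="block_group"):
    """Percentage of individuals in each area whose fields all equal `target`."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for row in rows:
        area = row[area_key]
        totals[area] += 1
        if all(row[field] == value for field, value in target.items()):
            hits[area] += 1
    return {area: 100.0 * hits[area] / totals[area] for area in totals}


estimates = percent_matching(records, TARGET)
```

Because every record carries all five characteristics, the same function answers any cross-tabulation by changing `TARGET`, including combinations that the published aggregate tables do not report.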
Technical validation
Internal validation is first performed to assess the validity of the synthetic data, in which the synthetic data are compared to the 12 SF1 tables used in data generation. Specifically, we compute the squared difference between each synthetic census table (
We also conduct external validation to compare the synthetic data with an external data source known as the American Community Survey Public Use Microdata Sample (ACS PUMS), a five percent sample of the national individual-level data. We retrieve the 2010 5-Year ACS PUMS from the Integrated Public Use Microdata Series (IPUMS) USA [Ruggles et al., 2022]. Each individual in the ACS PUMS shares the same five socio-demographic characteristics as the synthetic data. We process the values of these characteristics to match those in the synthetic data shown in Table A.1. The ACS PUMS uses Public Use Microdata Areas (PUMAs) as the smallest geographic unit for each individual, with each PUMA consisting of a group of adjacent block groups. We aggregate the synthetic data to 2351 PUMAs to make comparisons. We represent the synthetic data for each PUMA as an m-length vector
For non-negative vectors, the cosine similarity ranges between 0 and 1, where a value of 1 indicates that the two datasets have the same distribution across predicates and a value of 0 indicates that they have no predicates in common. Figure 2 presents the distribution of cosine similarity across all 2351 PUMAs. All of the cosine similarity values are above 0.6, and the majority (63%) of them are greater than 0.95. This suggests that the synthetic data represent the population in the sampled individual-level data well.
The distribution of cosine similarity for the 2351 PUMAs.
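The PUMA-level comparison can be sketched with a small cosine similarity function. This is a minimal illustration, not the validation code used for the results above: the function name and the short toy vectors are assumptions, and real vectors would have one entry per predicate.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length count vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for an empty vector
    return dot / (norm_u * norm_v)


# Toy predicate counts for one PUMA (made-up numbers).
synthetic_counts = [120, 80, 0, 40]   # full synthetic population
sample_counts = [6, 4, 0, 2]          # a 5% sample with the same proportions

proportional = cosine_similarity(synthetic_counts, sample_counts)
disjoint = cosine_similarity([1, 0], [0, 1])
```

Cosine similarity depends only on the direction of the vectors, not their magnitude, so a full-count vector and a sample vector with the same proportions score approximately 1 even though their totals differ by a factor of twenty.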
Conclusions
This paper presents a synthetic population dataset that contains artificially generated values for housing type, age, sex, race, and Hispanic or Latino origin for all 308,745,538 individuals in the United States as of the 2010 Census. This dataset includes small area locations at the county, tract, and block group levels, with block groups being the finest geographic level chosen for individual privacy preservation. Compared to public aggregated data, the synthetic data offer fine-grained individual-level information that is highly desirable.
In recent years, there has been a notable rise in the development of new and advanced methods for generating realistic synthetic population data. These methods offer improved capabilities in capturing the complexity and diversity of real-world populations. For example, Casati et al. (2015) extend the traditional IPF method by integrating advanced techniques such as Gibbs sampling and generalized raking. Farooq et al. (2013) introduce a Markov Chain Monte Carlo (MCMC) simulation method that draws from the original distribution using partial views of joint attribute distributions to synthesize population data. In addition, deep generative models, such as variational autoencoders (VAEs), have gained attention for synthesizing population data by capturing complex relationships and generating realistic populations [Borysov et al., 2019]. However, these methods can be computationally intensive and are typically more suitable for local-scale applications when computational resources are limited. In contrast, the method employed in this paper has a simpler form and can be effectively implemented at the country scale.
Validating synthetic population data has long been a challenge. In this paper, we conduct both internal and external validation. Internally, we compare the synthetic data with the aggregated data used in its generation. In addition, we perform external validation by comparing the synthetic data with available ground-truth data at the individual level. However, as the actual census individual-level responses are not publicly accessible, the data used for external validation may exhibit inherent spatial and temporal mismatches with the synthetic data, which can impact the robustness of the validation process. Fortunately, there are alternative methods outlined in the literature that have potential to address this limitation. For example, Lovelace et al. (2017) suggest considering new sources of data, such as consumer surveys, commercial data, and even social media data, to provide “sanity checks” on the results when direct external validation is not feasible. This offers potential avenues for further improving the research and enhancing the validation process.
The openly available synthetic data are accompanied by open source code that serves as a framework for other researchers to update the datasets if the coding of the census tables (such as SF1) changes. The feasibility and implications of updating the datasets depend on the nature and extent of the changes made to the census tables. To ensure adaptability, the code is designed in a modular and organized manner, allowing easy identification and modification of components related to census table integration, including clear separation of preprocessing, modeling, and converting steps. Collaboration and future references are facilitated through versioning implemented in the database, readily maintained on Figshare, which enables tracking changes and maintaining different versions. In addition, the data include a “YEAR” column for time integration.
The synthetic population data have potential for advancing various research and practical applications. They encompass person-level socio-demographic information collected by the census at a highly detailed geographic level, enabling stakeholders to conduct tailored analyses of socio-demographic characteristics for subpopulations in specific areas. This can facilitate precise estimation and analysis of population patterns, trends, and changes at the local level [Lomax and Smith, 2017; Wu et al., 2022]. Such data can also be used to empower policymakers and planners to simulate and evaluate the impact of various policies, interventions, or scenarios on individual behavior and movement within urban or regional contexts [He et al., 2020; Lin and Xiao, 2023a; Papyshev and Yarime, 2021; Tanton et al., 2009], thus supporting applications in domains such as public health [Grefenstette et al., 2013; Spooner et al., 2021] and transportation planning [Hörl and Balac, 2021; Zhu and Ferreira, 2014]. In addition, there is existing literature on enhancing synthetic population data by incorporating census variables with external data sources such as health and commercial surveys [Spooner et al., 2021; Morrissey et al., 2015]. This integration enables realistic simulation and analysis, suggesting a potential avenue for future research to enrich the usability of our data. Further research will also explore additional applications of the synthetic dataset to broaden its potential impact across various domains.
Footnotes
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
Data Availability
The fully reproducible code for data generation, usage examples, and technical validations is publicly available on GitHub at
.
