Abstract
This study demonstrates the application of affinity propagation as a data-driven approach to identifying and mapping typologies of place along the urban-rural continuum. The authors characterize Zip Code Tabulation Areas using demographic, economic, land cover, and transportation accessibility attributes, which results in 22 clusters, 13 of which have a major rural component. The spatial pattern of these clusters varies, reflecting the heterogeneity in U.S. rurality. Rural is not a single concept that can be simply defined by population density. By comparing three economic indicators before and after the global financial crisis of 2008 to 2012, the authors find that the degree of economic recovery is captured by rural typologies. They compare both the methodological results and the analysis of socioeconomic resilience to two of the most used threshold-based regional typologies, one developed by the U.S. Department of Agriculture Economic Research Service and one used by the American Communities Project.
Economic development practitioners have long created or applied groups, types, or categories to account for inherent heterogeneity in the world while providing a simplified framework in which to apply economic development. Many of these simplifications are commonly known, such as developed and developing countries or urban and rural communities. However, the methodology for defining and implementing these groups or types in a consistent and meaningful manner remains a challenge. At the same time, researchers and practitioners alike seek representative places in which to test or compare policies. As Heumann et al. (2020) described, the identification of representative places goes back almost 100 years in the United States (e.g., Lynd & Lynd, 1929).
Both geographers and economists have been working to develop national and continental types of rurality or have expanded the concept of urbanity to create regional systems (Regional Urban Systems), thus integrating socioeconomic linkages within discrete subnational regions around urban centers (Boix & Trullén, 2007; Parr, 2014). Identifying the various characteristics determining differences and similarities along the human landscape is pivotal because the dichotomous view of rural/urban often constitutes the backbone around which policies are formulated and evaluated. Seminal works on the topic of socioeconomic typification in the United States include Green et al. (1967), Weiss (1989), Goss (1995), Isserman (2005), Chinni and Gimpel (2010), and Waldorf and Kim (2015), not to mention the wealth of research on regional geography related to cultural and historic factors tracing back to Vidal de la Blache (Andrews, 1986). In the United States, research has often focused on market research, although in the last decade or so attempts have been made to refine the U.S. Department of Agriculture (USDA) proximity and population-based definitions. For example, Isserman (2005) and Isserman et al. (2009) built a richer indicator of rurality-urbanity by combining several threshold-based levels reconciling definitions from two different federal agencies. To overcome the limitations of threshold-based approaches, Waldorf and Kim (2015) built a continuous index based on values contributing to or diminishing rurality and urbanity, although their approach still left drivers unweighted. In the European Union, research has focused on identifying and defining rurality as a tool for economic development (Copus, 2015; Lupi et al., 2017; Van Eupen et al., 2012), and, more recently, for evaluating policies and political behavior (Dijkstra et al., 2020; Gagliardi & Percoco, 2017).
While the concept of rurality may be formulated simplistically, either in terms of population density or the opposite of urbanity, from both a cultural and economic perspective, there can be many types of rurality based on demographics (e.g., age, race, household size, income), economics (e.g., agricultural vs. resource extraction vs. tourism), and environment (e.g., land cover, accessibility to transportation).
Previous regional typology studies have used a wide range of data, methods, and scales of analysis. Most socioeconomic regional typification research in the United States has been conducted at the county level. Although counties are routinely used for socioeconomic research, counties present several challenges for national-scale analysis. First, the size of counties has a strong East–West pattern with smaller counties in the East and much larger counties in the West as a function of historic settlement patterns. Second, population and density vary widely between counties and within counties, especially large western counties. Third, counties as an administrative unit vary by state.
Geographers have long known about the issues of using aggregated data based on administrative units, namely the Modifiable Areal Unit Problem (MAUP) and the ecological fallacy (Openshaw, 1984). Both issues are exacerbated by units that vary in size as a function of historic settlement patterns and current socioeconomic factors and by units that are often internally heterogeneous. Heumann et al. (2020) identified and mapped 11 socioeconomic typologies at the zip code tabulation area level, smaller than the more common county-level scale. Their results illustrated the heterogeneity that occurs within counties and demonstrated the potential for exemplar-driven typification.
Methodologically, typologies often use thresholds (Copus, 2015; Isserman, 2005; Isserman et al., 2009). While thresholds are simple and easy to interpret and can be based on commonly agreed values such as the poverty line, they are often subjectively created and applied. Advances in statistics and data science provide methods that seek to identify patterns within the data to identify boundaries between typologies that minimize heterogeneity within types and maximize the difference between types. Though more complicated than thresholds, this approach provides a more nuanced and less subjective methodology, although some subjectivity remains (e.g., input attributes or model parameters).
This paper seeks to demonstrate a data-driven framework for determining types of rurality based on social, economic, and environmental factors in the United States. In addition, this work applies this new typification approach to assess the resilience of typological hotspots in U.S. counties after the 2008 to 2012 crisis. Finally, we compare our results with two widely used methods for categorizing rural-urban counties in the United States: the U.S. Department of Agriculture Rural-Urban Continuum and the American Communities Project (ACP).
To fulfill the main objective, we apply an exemplar-based clustering algorithm, Affinity Propagation (AP; Frey & Dueck, 2007), for typifying Zip Code Tabulation Areas (ZCTAs), the census equivalent of postal zip codes, across the United States based on a rich data set of sociodemographic, economic, climatic, and land cover/land use data. By using AP, we not only seek to improve the efficiency of data-driven approaches when compared to previous methods (see Methods section and Frey & Dueck, 2007), but also seek to demonstrate a strategy for surpassing the issues associated with threshold-based approaches and dichotomous views widely depicted as problematic by existing literature (Chinni & Gimpel, 2010; Wandl et al., 2014), and to overcome the issues related to weighting identified by Waldorf and Kim (2015) for their index. Moreover, we use ZCTAs as our unit of analysis to reduce the effects of MAUP and of generalization over heterogeneous administrative units, while still acknowledging that any administrative unit is subject to MAUP.
Methods and Data Sources
Our work comprises two stages. First, we utilize AP at the ZCTA level to identify the major typologies of human landscapes across the contiguous United States, exemplars for each typology, and the attributes determining the typification process. Second, we analyze how these typologies have fared from 2008 to 2012; that is, during the global financial crisis (GFC) and its immediate aftermath, and during the recovery years of 2013 to 2018. Each stage is separated into multiple steps and utilizes data from several different sources, which we describe in detail in this section.
AP: A Three-Phase Approach
Building upon three phases of data mining (Witten & Frank, 2005), we partition our approach into Data (representation), analysis, and description.
Data (representation)
While unsupervised classification techniques such as AP do not require predefined classes or even the number of classes, the data used to generate the distances between data points still need to be selected based on expert judgement (see Attribute Data section). Consequently, defining relevant inputs for determining typologies is the single most important step, and, as of now, one that still relies on the operator's judgment. In this work, we do not intend to deliver the best representation possible for human landscape typologies; rather, we want to demonstrate a data-driven approach while acknowledging that some subjectivity remains. Specifically, we build upon Heumann et al. (2020), who mapped socioeconomic typologies for zip codes in the United States, as well as other works on defining and mapping socioeconomic typologies and rurality (Chinni & Gimpel, 2010; Copus, 2015; Dicka et al., 2019; Goss, 1995; Green et al., 1967; Van Eupen et al., 2012; Weiss, 1989). We enrich those works by using a wider range of input attributes and a finer spatial scale of analysis, which will result in a more nuanced definition of rurality. Overall, our work overcomes the limits of threshold-based multidimensional approaches like those listed above, while accounting for both spatial and socioeconomic regional profiles. This approach integrates attributes using research on regional typologies from both the United States (Economic Research Service, 2013; Waldorf, 2006) and the European Union (Dijkstra & Poelman, 2014; Waldorf & Kim, 2015). The rationale is that, although built upon thresholds, these works went beyond the population and proximity/accessibility indices used in their respective studies to account for the economic activities performed in each region.
Furthermore, based on feedback at a workshop related to this special issue, we have expanded the attributes that characterize rurality to include sociodemographic, economic, land cover, and transportation accessibility to map the heterogeneity of rurality in the United States and design a useful tool for economic development. We identified the driving forces of these typifications, thus improving Waldorf and Kim’s (2015) unweighted/unclassified approach.
Analysis
We consider three aspects in the representation of place: (1) spatial unit, (2) local attributes, and (3) proximal attributes. For the spatial unit, we use ZCTAs, which allow us to conduct this analysis at the finest spatial scale that was computationally feasible for us.
Another common geographic unit of analysis is the census tract; however, the input for the AP algorithm is a dissimilarity matrix (i.e., the difference between each place and every other place). Computationally, this means that, without subsetting the data, the amount of RAM required grows quadratically with the number of units. Using census tracts would require more than 20 GB to load the dissimilarity matrix into RAM, and the AP algorithm itself requires additional working memory that scales with the size of the dissimilarity matrix. At the time of analysis, census tracts would have exceeded our computational capacity of 256 GB of RAM without resorting to some sort of sampling scheme. Thus, ZCTAs represent a trade-off between computational limits and the size and consistency of the areal unit, as ZCTAs are smaller than counties and much closer to one another in population. Additionally, ZCTAs are a convenient unit since personal data are often aggregated to zip codes and they do not make any presumption in terms of jurisdictional design. ZCTAs are also convenient for communicating analysis because they are commonly used by the public. It should be noted that, in most cases, ZCTAs are consistent with county boundaries and can be aggregated to the county in situations where that level of analysis is required due to data availability (see section Application of Typologies to (Economic) Rural Resilience).
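The memory argument can be checked with simple arithmetic: a dense pairwise matrix over n units holds n × n values. The sketch below is illustrative only. It assumes 4-byte (single-precision) values, which the text does not state, and the function name is hypothetical. The unit counts are the 2010 census figures reported in this section.

```python
def dense_matrix_bytes(n_units, bytes_per_value=4):
    """Bytes needed to hold a dense n x n pairwise matrix.

    bytes_per_value=4 assumes single precision; this is an assumption
    of the sketch, not a detail reported by the authors.
    """
    return n_units ** 2 * bytes_per_value


# Unit counts reported for the 2010 census.
UNIT_COUNTS = {
    "counties": 3_143,
    "ZCTAs": 32_989,
    "census tracts": 73_057,
    "block groups": 217_740,
}

for name, n in UNIT_COUNTS.items():
    gigabytes = dense_matrix_bytes(n) / 1e9
    print(f"{name:>14}: {gigabytes:10.2f} GB")
```

Under the single-precision assumption, the census-tract matrix alone comes to roughly 21 GB, which is consistent with the "more than 20 GB" figure, and doubling the number of units quadruples the requirement.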
Based on the 2010 census, there were 217,740 block groups, 73,057 census tracts, 32,989 ZCTAs, and 3,143 counties. We restrict our analysis to the contiguous United States. Preliminary analysis by Heumann et al. (2020) revealed that characterizing Hawaii and Alaska is challenging because their geographic location embodies distinct social, cultural, and economic conditions. After removing ZCTAs with data issues and excluding Alaska and Hawaii, our study was conducted using 32,115 ZCTAs.
Attribute data
In this step, we describe ZCTAs in terms of social, economic, and geographic factors that capture differences between places along the urban-rural gradient and differentiate between different types of rurality.
Socioeconomic data are from the American Community Survey (ACS) 5-year estimates (2008–2012) for the 32,989 ZCTAs defined for the 2010 census (United States Census Bureau, 2010, 2011). We compiled and cleaned the data as described in Heumann et al. (2020). We acknowledge that more recent data are available. However, due to funding limitations, we reuse the data prepared for Heumann et al. The specific attributes by category are as follows:
Population demographics: total population, population density, population by race, percentage native place of birth, and population by age.
Economic demographics: average household size, per capita income, percentage below the poverty line, and percentage of households with income of $200k or more.
Occupational data by NAICS: management, business, science, and arts; service; sales and office; natural resource, construction, and maintenance; production, transportation, and material moving.
Educational attainment: less than high school, total high school graduate, associate degree or some college, bachelor's degree, and professional or graduate degree.
Land cover data are calculated using data from the 2011 National Land Cover Database (NLCD; Homer et al., 2015). We aggregate the NLCD classes into four main types: developed (codes 21, 22, 23, and 24), forest (codes 41, 42, and 43), agriculture (codes 81 and 82), and other (all other codes). We then aggregate the raster data as a percentage of the ZCTAs using zonal statistics in ArcGIS Pro 2.3.
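The class aggregation described above can be sketched in a few lines. This is an illustration rather than the authors' workflow, which used zonal statistics in ArcGIS Pro. The function name and the flat list of pixel codes standing in for one ZCTA's raster cells are assumptions; the code groupings are those given in the text.

```python
from collections import Counter

# NLCD code groupings as stated in the text; any other code maps to "other".
NLCD_GROUPS = {
    21: "developed", 22: "developed", 23: "developed", 24: "developed",
    41: "forest", 42: "forest", 43: "forest",
    81: "agriculture", 82: "agriculture",
}
LAND_COVER_TYPES = ("developed", "forest", "agriculture", "other")


def land_cover_percentages(pixel_codes):
    """Percentage of a ZCTA's pixels in each of the four aggregated types."""
    counts = Counter(NLCD_GROUPS.get(code, "other") for code in pixel_codes)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("a ZCTA needs at least one pixel")
    return {t: 100.0 * counts.get(t, 0) / total for t in LAND_COVER_TYPES}


# Hypothetical pixel codes for a single ZCTA (not real data).
toy_pixels = [21, 22, 24, 41, 42, 43, 81, 82, 11, 95]
print(land_cover_percentages(toy_pixels))
```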
The geographic settings data include the following attributes: (1) distance to the nearest urban ZCTA, (2) number of urban ZCTAs within 10 miles, (3) number of urban ZCTAs within 25 miles, (4) distance to the nearest international airport, (5) distance to the nearest limited-access highway, and (6) distance to the nearest railroad. This set of variables helps describe the geographic setting of a ZCTA in terms of its proximity to urban and metropolitan areas, as well as accessibility to transportation infrastructure. Specifically, the distance to urban ZCTAs and the count of nearby urban ZCTAs are an attempt to capture the differences between small cities in rural areas and suburbs or exurbs of major metropolitan areas. All analysis for these variables is conducted using ArcGIS Pro 2.3. For the purposes of identifying the nearest or number of urban ZCTAs, we define urban ZCTAs as any ZCTA with a population density greater than 1,000 people/km² (OECD, 2013). Distance to the nearest urban ZCTA is calculated using the Near function. For computational simplicity, all distances are geodesic distances and not based on transportation networks. The numbers of urban ZCTAs within 10 and 25 miles are calculated using the Summarize Nearby function. For this analysis, all ZCTA polygons are converted to centroid points.
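The proximity attributes for a single ZCTA centroid can be approximated as follows. This is a sketch under stated assumptions, not a reproduction of the ArcGIS Near and Summarize Nearby tools: it uses the haversine great-circle formula on a sphere with an assumed mean Earth radius, whereas a geodesic tool works on an ellipsoid, and the function names and coordinates are hypothetical.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # assumed mean Earth radius (spherical approximation)


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two lat/lon points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def urban_proximity(centroid, urban_centroids):
    """Nearest-urban distance and counts of urban centroids within 10 and 25 miles."""
    distances = [haversine_miles(*centroid, *urban) for urban in urban_centroids]
    return {
        "nearest_miles": min(distances),
        "urban_within_10": sum(d <= 10 for d in distances),
        "urban_within_25": sum(d <= 25 for d in distances),
    }


# Hypothetical (lat, lon) centroids, for illustration only.
toy_centroid = (40.0, -90.0)
toy_urban = [(40.0, -90.1), (40.2, -90.0), (41.0, -90.0)]
print(urban_proximity(toy_centroid, toy_urban))
```

Working from centroids keeps the calculation to point-to-point distances, which matches the conversion of ZCTA polygons to centroid points described above.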
Proximity to transportation infrastructure is also calculated using the Near function. The transportation data are from the U.S. Geological Survey (USGS) National Transportation Dataset, which is part of the USGS National Map (U.S. Geological Survey National Transportation Geodatabase, 2019). International airports are used as a proxy for commercial airports since the FAA airport designations were not included in the data set. Similarly, we used limited-access highways as a proxy for access to the interstate highway system. However, there are isolated limited-access highways, particularly around urban areas in the central United States.
Other data considered for analysis include climate data from BIOCLIM and ethnicity and language data from ACS. However, these attributes are not included in the final analysis as our preliminary analysis found that they either were correlated with existing data (e.g., race), or increased the degree of heterogeneity that made it difficult to generalize the data (e.g., climate). We acknowledge that these factors may potentially influence the final result and are part of the limitations of a data-driven approach.
Analysis—affinity propagation
The analysis methodology for this paper closely follows that used in Heumann et al. (2020). AP is a form of unsupervised classification, in which data points are grouped into classes based on a comparison of differences between data points. Unsupervised classifications are widely used across scientific disciplines with more than 10,000 peer-reviewed papers and review articles on the topic listed on the Web of Science (Web of Science, 2021). According to the Web of Science, the most common fields in which unsupervised classifications are applied are electrical/electronic engineering, computer science/artificial intelligence, remote sensing, imaging science/photographic technology, and computer science/information systems. Unsupervised classification is also used in more than 100 papers in the fields of environmental science, ecology, computational biology, and neuroscience. Unsupervised classifications, in which data are assigned to categories, groups, or clusters based on the distribution and relationship of the data points to each other, are typically described as exploratory in nature. Compared to supervised classification techniques in which data are assigned to clusters or groups based on a set of either predetermined rules or rules derived from training data (example data for each predefined cluster), the clusters in unsupervised classifications are not predefined and thus allow researchers and analysts to examine patterns in the data without defining the types or even necessarily the number of clusters in advance. This is particularly useful in applications in which training data are scarce or difficult to obtain or in which predefining the categories may be difficult (e.g., What defines rural?). 
Commonly used unsupervised classification techniques include k-means, which requires the user to set the number of classes a priori, and a modification of k-means, the Iterative Self-Organizing Data Analysis Technique (ISODATA), which splits and merges clusters and thus allows the number of clusters to be determined based on range and distance criteria. The earliest reference to k-means dates to 1955 (Jain, 2010).
Although unsupervised classification techniques are not commonly used in economics, examples include defining economic regions (Crone, 2005; Mimis & Georgiadis, 2013), and analysis of energy policy (Aker & Aghaei, 2019; Liu et al., 2016; Xu et al., 2020). The approach is highlighted in the paper Machine Learning Methods That Economists Should Know About (Athey & Imbens, 2019).
As detailed by Frey and Dueck (2007), AP has several advantages over other commonly used clustering algorithms such as ISODATA and k-centers: (1) clusters are formed based on exemplars, that is, real data points that are representative of the cluster; (2) all data points are potential exemplars; (3) the results include both the clusters and the exemplars. Unlike ISODATA and k-centers, AP does not rely on stochastic processes (i.e., the results are the same given the same inputs). Frey and Dueck also stated that AP is more efficient than k-centers. As described by Heumann et al. (2020), AP has been applied across a range of fields, such as band selection in remote sensing (Qian et al., 2009), landscape ecology (Cardille & Lambois, 2010), genetics (Kiddle et al., 2010), and engineering (Hassanabadi et al., 2014). Aside from Heumann et al., the most closely related application of AP is Cardille and Lambois, who used AP to define landscape typologies in the United States and found that all typologies include evidence of human modification.
All clustering analysis is conducted with MATLAB 2019a and the AP function (Frey & Dueck, 2007). For a detailed description of how AP works, please refer to Frey and Dueck. The two primary inputs for AP are the dissimilarity matrix and the preference value. The dissimilarity matrix holds a user-defined measure that quantifies the difference between each pair of data points. To create the dissimilarity matrix, we use the following procedure. First, all data are standardized (z-scores) to remove the effect of varying unit scales. Second, Principal Components Analysis is used to reduce multicollinearity and data volume. We reduce the dimensionality to 25 components (from 38 variables) while retaining 93.3% of the data variability. Third, the dissimilarity matrix is calculated using a negative Euclidean distance function in which the relative fraction of the variance explained by each principal component is used as its weight.
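The procedure above (standardize, reduce with PCA, compute a variance-weighted negative Euclidean distance) can be sketched with NumPy. The authors worked in MATLAB, so this is a translation under assumptions: the function name is hypothetical, and the exact form of the weighting (weights normalized over the retained components and applied to squared differences inside the square root) is one reasonable reading of the text, not a confirmed detail.

```python
import numpy as np


def pca_weighted_similarity(X, n_components):
    """Return (negative weighted Euclidean distances, fraction of variance retained)."""
    # Step 1: z-scores remove the effect of differing unit scales.
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    # Step 2: PCA via singular value decomposition of the standardized data.
    _, singular_values, Vt = np.linalg.svd(Z, full_matrices=False)
    explained = singular_values ** 2 / np.sum(singular_values ** 2)
    scores = Z @ Vt[:n_components].T
    # Step 3: weight each retained component by its share of explained variance
    # (assumed normalization; the source does not spell out the formula).
    weights = explained[:n_components] / explained[:n_components].sum()
    diff = scores[:, None, :] - scores[None, :, :]
    distance = np.sqrt((weights * diff ** 2).sum(axis=2))
    return -distance, explained[:n_components].sum()


# Hypothetical toy data: 6 places described by 4 attributes.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(6, 4))
S_toy, retained = pca_weighted_similarity(X_toy, n_components=2)
print(S_toy.shape, float(retained))
```

The pairwise difference array in this sketch uses memory proportional to n² times the number of components, so it is only suitable for small illustrations; a national ZCTA-scale run would need a blocked computation.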
The second input for AP is the preference value. The preference value indicates the relative number of clusters that will be created; smaller preference values result in fewer clusters. The selection of the preference value is subjective. While the FAQ on Dr. Frey's website (now unavailable) recommended using the median or minimum dissimilarity value, Heumann et al. (2020) found that these resulted in far too many clusters to be useful (i.e., hundreds of clusters). While this is a testament to the heterogeneity of the United States, we aim for fewer clusters so that commonalities are easier to recognize. Heumann et al. found an asymptotic relationship between the preference value and the number of clusters. We subjectively aim for ∼20 clusters for this analysis. This is about twice as many as found in Heumann et al., but given the goal of typification of rural areas, we anticipated that many clusters would have stemmed from what previously would have been considered as various types of urbanity. The preference value we use is 24 times the minimum dissimilarity value (this is a negative value). This results in 22 clusters and exemplars. For analytical and cartographic purposes, and to maintain our focus in line with the overarching topic of this special issue, we use the attributes of the exemplars and the spatial pattern of the typologies to exclude from our analysis the typologies that are strongly urban.
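To make the role of the preference value concrete, the following is a minimal NumPy sketch of the message-passing updates described by Frey and Dueck (2007). It is not the MATLAB function used in the analysis: the damping factor, the fixed iteration count, the function name, and the toy one-dimensional data with negative squared distances are all assumptions chosen for a small, tie-free illustration.

```python
import numpy as np


def affinity_propagation(S, preference, damping=0.9, n_iter=300):
    """Run AP on a precomputed similarity matrix; return (exemplar indices, labels)."""
    n = S.shape[0]
    S = np.array(S, dtype=float)       # copy so the caller's matrix is untouched
    np.fill_diagonal(S, preference)    # s(k, k) is the preference of point k
    R = np.zeros((n, n))               # responsibilities r(i, k)
    A = np.zeros((n, n))               # availabilities a(i, k)
    rows = np.arange(n)
    for _ in range(n_iter):
        # r(i,k) = s(i,k) - max over k' != k of [a(i,k') + s(i,k')]
        AS = A + S
        best = AS.argmax(axis=1)
        best_val = AS[rows, best]
        AS[rows, best] = -np.inf
        second_val = AS.max(axis=1)
        R_new = S - best_val[:, None]
        R_new[rows, best] = S[rows, best] - second_val
        R = damping * R + (1 - damping) * R_new
        # a(i,k) = min(0, r(k,k) + sum of positive r(i',k) over i' not in {i, k})
        # a(k,k) = sum of positive r(i',k) over i' != k
        Rp = np.maximum(R, 0)
        np.fill_diagonal(Rp, R.diagonal())
        A_new = Rp.sum(axis=0)[None, :] - Rp
        self_avail = A_new.diagonal().copy()
        A_new = np.minimum(A_new, 0)
        np.fill_diagonal(A_new, self_avail)
        A = damping * A + (1 - damping) * A_new
    exemplars = np.flatnonzero(A.diagonal() + R.diagonal() > 0)
    if exemplars.size == 0:            # no exemplar emerged within n_iter
        return exemplars, np.full(n, -1)
    labels = S[:, exemplars].argmax(axis=1)
    labels[exemplars] = np.arange(exemplars.size)
    return exemplars, labels


# Hypothetical toy data: two well-separated groups on a line.
x = np.array([0.0, 1.0, 2.0, 10.0, 11.0, 13.0])
S_line = -(x[:, None] - x[None, :]) ** 2
off_diagonal = S_line[~np.eye(len(x), dtype=bool)]

ex_median, labels_median = affinity_propagation(S_line, np.median(off_diagonal))
ex_low, labels_low = affinity_propagation(S_line, 24 * off_diagonal.min())
print("median preference:", len(ex_median), "exemplars")
print("24 x minimum preference:", len(ex_low), "exemplars")
```

The second call mirrors the multiplier described above: scaling the (negative) minimum similarity by 24 makes every point less willing to serve as an exemplar, which pushes the solution toward fewer clusters.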
Description—mapping and comparisons
The results of the AP clustering are geographically mapped using ArcGIS Pro 2.3. Two sets of maps were produced. First, a map showing the typology of every rural ZCTA and exemplar was created. For analytical and cartographic reasons, we focus on the analysis and mapping of rural typologies based on our interpretation of the exemplar characteristics and the geographic locations of each cluster. Only the 13 clusters with characteristics generally associated with rural regions (see section Application of Typologies to (Economic) Rural Resilience) were mapped, as it is very difficult to interpret a map with 14 colors (one color for all urban clusters), let alone 22 colors. This is also in keeping with the theme of the special issue focusing on rurality.
Second, a series of panel maps were created for each rural typology. The Generate Tessellation tool was used to create a hexagonal fishnet across the United States, and the Summarize Within tool was used to count the number of ZCTAs for a given typology found in each hexagon. This cartographic approach overcomes the visualization limitations of variable ZCTA size and provides a convenient way to compare the spatial distribution of typologies.
Third, we compare our results with common comparable data sets, namely the USDA’s Rural-Urban Continuum codes (U.S. Department of Agriculture, 2020) and ACP county typologies (American Communities Project, 2020). These data are commonly used for identifying rural counties for economic development and policy. We compare our results with these data using cross-tabulation analysis to calculate how often our AP-derived ZCTA typologies correspond with these existing typologies. It should be noted that neither of these comparable data sets has been published in a peer-reviewed journal and the methodology for the ACP typologies lacks the detail needed for independent replication and verification.
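The cross-tabulation can be illustrated as a table of row shares: for each AP typology, the fraction of its ZCTAs that fall in each category of the comparison data set. The sketch below uses a hypothetical function name and made-up labels and counts; it does not reproduce any value from the study.

```python
from collections import Counter, defaultdict


def crosstab_shares(pairs):
    """Share of each AP typology's ZCTAs falling in each comparison category.

    `pairs` is an iterable of (ap_typology, comparison_category), one per ZCTA.
    """
    counts = defaultdict(Counter)
    for ap_typology, category in pairs:
        counts[ap_typology][category] += 1
    return {
        ap_typology: {cat: n / sum(row.values()) for cat, n in row.items()}
        for ap_typology, row in counts.items()
    }


# Hypothetical ZCTA assignments (illustrative labels, not study results).
toy_pairs = [
    ("type_a", "metro"), ("type_a", "nonmetro"),
    ("type_a", "nonmetro"), ("type_a", "nonmetro"),
    ("type_b", "metro"),
]
print(crosstab_shares(toy_pairs))
```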
Application of Typologies to (Economic) Rural Resilience
The second step of our work is an example of how these new typologies can provide additional information to social scientists. We focus on the socioeconomic changes that have occurred between 2012 and 2018. We chose this time span because it is the only one able to provide consistent data before and after the GFC. With this step we aim to address differences in terms of resilience and performance among counties driven by their human landscape typologies. Please note that we refer to the term resilience in the broadest way possible; we do not analyze the type of rural resilience displayed by these counties, thus combining the two sets of features presented by Scott (2013), which define the shape of rural resilience.
We aggregate our results by county, the unit of analysis for which existing economic data are primarily available to provide a better direct comparison with previous research. This scale is familiar across multiple social sciences. It is routinely used for assessing regional resilience in the United States (Brown & Greenbaum, 2016; Han & Goetz, 2015; Rahe et al., 2019) and for evaluating or formulating policies (Hird, 1993; Jackson et al., 2016; Pierce & Schott, 2020).
We use census data from the ACS 2008 to 2012 and 2014 to 2018 to compare the 5-year averages of per-capita income (PCI), share of population employed (occupation level), and poverty. Per-capita income and poverty describe the dynamics of material well-being in counties across the United States, while the employment share captures changes in the underlying labor markets and the ability of counties to create opportunities for diffuse growth (Giloth, 2000; International Labour Office, 2017).
A standard principal component analysis (PCA) is used to assess the ability of typologies, summed by county, to explain the dynamics in these three variables (note that this is a different PCA process than the one used in the AP clustering). To analyze how the socioeconomic characteristics summarized by the exemplars are driving the change in PCI, occupation, and poverty, we group the counties by prevalent type (i.e., the typology with the largest number of zip codes). Then we compute the change for each variable as the difference in levels between the 5-year average taken in 2018 and 2012 (Figure 1). Finally, we assign each group/exemplar to the category Urban (U) or Rural (R) according to our interpretation of the AP exemplar results (see Table 1 and online supplementary Appendix Table A.1 for a list of exemplars and their county subdivisions).
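The grouping step can be sketched as follows: assign each county its prevalent typology, compute the change in a variable between the two 5-year periods, and average the change within each typology. The record layout, field names, and values are hypothetical; only the logic (largest ZCTA count, later period minus earlier period) comes from the text.

```python
from collections import Counter, defaultdict
from statistics import mean


def prevalent_type(zcta_types):
    """Typology with the most ZCTAs in a county (on ties, the first one seen wins)."""
    return Counter(zcta_types).most_common(1)[0][0]


def mean_change_by_type(counties, variable):
    """Average change in `variable` (later period minus earlier) per prevalent typology."""
    changes = defaultdict(list)
    for county in counties:
        change = county[variable]["2014-2018"] - county[variable]["2008-2012"]
        changes[prevalent_type(county["zcta_types"])].append(change)
    return {typology: mean(values) for typology, values in changes.items()}


# Hypothetical county records (not real data).
toy_counties = [
    {"zcta_types": ["a", "a", "b"], "pci": {"2008-2012": 20000, "2014-2018": 23000}},
    {"zcta_types": ["a", "c", "a"], "pci": {"2008-2012": 25000, "2014-2018": 26000}},
    {"zcta_types": ["b", "b"], "pci": {"2008-2012": 30000, "2014-2018": 29000}},
]
print(mean_change_by_type(toy_counties, "pci"))
```

How ties between equally frequent typologies should be broken is not stated in the text, so the first-seen rule here is only a placeholder.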

Distribution of rural ZCTA typologies and their exemplars in the 48 contiguous United States.
Cluster labels, exemplars, and short descriptors.
The second part of our application of AP clustering in a socioeconomic context focuses on the representativeness of the prevalent typology and its ability to explain the variation in PCI, occupation rate, and poverty rate. This illustrative application follows the common practice in regional science and economic geography, as illustrated by Isserman (2005) and Waldorf and Kim (2015), among others.
For each county, we identify the dominant typology and characterize its representativeness (i.e., its relative frequency in the county with respect to the other exemplars). We ran a PCA aimed at understanding how much of the variability in the change of the three socioeconomic variables before and after the GFC is explained by the grade of the representativeness of the dominant exemplar (Figure 4). Finally, we compared the average percent change across the three selected metrics for each group of counties, and among results from AP, ACP, and the USDA continuum.
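Representativeness, as defined above, is the relative frequency of the dominant typology among a county's ZCTAs. A minimal sketch, with hypothetical function names and toy inputs, alongside the percent-change calculation used in the comparison:

```python
from collections import Counter


def dominant_share(zcta_types):
    """Return (dominant typology, its share of the county's ZCTAs)."""
    typology, count = Counter(zcta_types).most_common(1)[0]
    return typology, count / len(zcta_types)


def percent_change(before, after):
    """Percent change from the earlier to the later 5-year average."""
    return 100.0 * (after - before) / before


# Hypothetical county with four ZCTAs (not real data).
print(dominant_share(["a", "a", "b", "c"]))
print(percent_change(20000, 23000))
```

A share near 1 means the county is almost entirely one typology, while a low share signals a mixed county for which the dominant exemplar is a weaker summary.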

Principal component analysis of the exemplar representativeness. Note: the graph shows the eigenvalues corresponding to the exemplar and the main ZCTAs. An eigenvalue above the value of one for the role of the exemplar and below for all the other dimensions of the database shows that the variability among the counties, as measured by the three dimensions of per capita income, occupation, and poverty, is satisfactorily explained by the exemplar itself.
Results
AP Types—Definitions and Characterization
Table 1 lists the exemplar ZCTAs from the AP clustering analysis. For each exemplar, the zip code, place name, urban/rural, and description are listed along with the z-score for each attribute. The smallest and largest values for each attribute are boxed; z-scores < −1 or > 1 are in bold. We then label each cluster by its distinctive characteristics.
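The rule used to highlight distinctive attributes (z-scores below −1 or above 1) is simple to express in code. The sketch below uses a hypothetical function name and made-up z-scores for one exemplar; it does not reproduce values from Table 1.

```python
def distinctive_attributes(z_scores, threshold=1.0):
    """Attributes whose z-score lies more than `threshold` from zero."""
    return {name: z for name, z in z_scores.items() if abs(z) > threshold}


# Hypothetical exemplar z-scores (illustrative only).
toy_exemplar = {"forest_pct": 2.3, "per_capita_income": -0.4, "pct_over_65": -1.2}
print(distinctive_attributes(toy_exemplar))
```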
The distinctive attributes of the clusters vary among the types of attributes (e.g., socio/economic/land cover/infrastructure) as does the number of distinctive attributes. For example, seven clusters have distinct income attributes, 11 have distinct racial attributes (e.g., African American, Asian, Hispanic, or Native American), and 13 have distinct land cover attributes. These combinations of attributes provide an interesting cross-section of communities outside of major population centers across the United States and illustrate lasting patterns of historic settlement and migration in areas of the Midwest (75416) and South (28441).
Maps and Geographic Distributions
Although we do not name our clusters with subjective elements, several of them are characterized by features often associated with the concept of rurality (American Communities Project, 2020; Nelson et al., 2021; Rowles, 1988). Guided by existing literature, we mapped 13 of the 22 clusters identified and labeled them as rural (Figure 1). Two major spatial patterns emerge. While most clusters show broad regional patterns, such as 28441 (Southeast), 95333 (Southwest), and 38321 (Appalachia and Upper Midwest), there are other clusters that are located primarily near urban centers, such as 13032 and 14174. In areas such as the Upper Peninsula of Michigan and northern Wisconsin, a mix of clusters is found (38321, 49612, and 58640, for example). All three of those are above average for distance from urban areas but differ in other attributes. For example, 38321 has the highest percentage of forests; 49612 has the oldest population (percentage over 65); and 58640 has the largest area and lowest percentage of African American population. The distribution of the exemplars concentrated east of the Rocky Mountains appears to be consistent with the density of ZCTAs across the United States.
To better understand and illustrate the regional patterns of the typologies, the density of each typology is mapped individually (Figure 2a&b).

(a) (b) Prevalence of ZCTA rural typologies in the contiguous United States.
These two panel maps are helpful to draw macro-regions based on typology concentrations. From these maps, we see that most of the typologies show contiguous spatial clumps related to economic and environmental attributes. For example, 38321 illustrates that the rural areas with high levels of forest cover and far from urban areas are concentrated in the Appalachian and Ozark mountains, northern Great Lakes, and Sierra Mountains. In several cases, two or more typologies are interspersed. For example, 61957 and 75416 are both concentrated in the Midwest with high levels of agriculture, but zip codes in the 75416 cluster tend to be further from urban areas. Two typologies are largely defined by demographics and geography. 86039 represents zip codes with high Native American populations that are located far from urban centers, and these occur almost exclusively west of the Mississippi River. These areas also tend to have lower education and income compared to other clusters. 95333 is characterized by high Hispanic populations and occurs primarily in areas associated with migrant agriculture labor: South Florida, Central Valley in California, Eastern Washington, and the Rio Grande Valley.
Comparison to USDA and American Communities Project Typologies
Table 2 shows the results of the cross-tabulation analysis between our AP typologies and the USDA rural-urban continuum codes. The nine urban AP typologies are consistently located in metro counties (as defined by the USDA), with cluster 29410 having the highest percentage of ZCTAs outside metro counties (∼9%). For the AP rural typologies, there is less agreement. AP typologies 75416 and 86039 have the largest shares of ZCTAs in USDA nonmetro counties, with 87% and 78%, respectively. However, AP typologies 13032, 14174, and 95333 have more than two-thirds of their ZCTAs within USDA metro counties. AP typology 27041 is split almost 50–50 between metro and nonmetro counties.
Table 2. Crosswalk comparison of affinity propagation typologies and the USDA rural-urban continuum (share of ZCTAs within each USDA category).
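A cross-tabulation of this kind can be sketched in a few lines. The example below is illustrative only: the ZCTA ids, the assignments, and the helper name `crosstab_shares` are hypothetical, and it assumes that shares are normalized within each AP typology (the way the percentages are quoted in the text).

```python
from collections import Counter, defaultdict

# Hypothetical records: (ZCTA id, AP typology, USDA status of the containing county).
# The ids and assignments are invented for illustration; they are not the study's data.
records = [
    ("z01", "75416", "nonmetro"),
    ("z02", "75416", "nonmetro"),
    ("z03", "75416", "nonmetro"),
    ("z04", "75416", "metro"),
    ("z05", "13032", "metro"),
    ("z06", "13032", "metro"),
    ("z07", "13032", "nonmetro"),
]

def crosstab_shares(rows):
    """Return {typology: {usda_category: share of that typology's ZCTAs}}."""
    counts = defaultdict(Counter)
    for _zcta, typology, usda in rows:
        counts[typology][usda] += 1
    shares = {}
    for typology, by_cat in counts.items():
        total = sum(by_cat.values())
        shares[typology] = {cat: n / total for cat, n in by_cat.items()}
    return shares

shares = crosstab_shares(records)
for typology in sorted(shares):
    print(typology, {cat: round(s, 2) for cat, s in sorted(shares[typology].items())})
```

Normalizing by row makes typologies of very different sizes comparable; normalizing by USDA category instead would answer a different question (which typologies make up metro counties).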
Figure 3 maps the ZCTAs belonging to AP rural typologies that occur within USDA metro counties. The map illustrates both the number and the diversity of AP typologies found within these counties. These results indicate that the USDA typologies omit ZCTAs with rural characteristics when they lie in counties that are adjacent to major metro areas or that contain medium-sized cities, even if the ZCTA itself has a low population density and little developed area. These results also highlight the effect of county size on the USDA's metro typologies: metro areas tend to be larger in the western United States, where counties are larger, and the effect is more noticeable with the inclusion of counties adjacent to urban areas. Western medium-sized cities such as Flagstaff (AZ), and all cities in Idaho, Montana, and eastern Washington, have sprawling metro areas, often larger than those of cities of similar size in Europe. Because of the extent of these metro areas, neighboring counties often contain mostly rural ZCTAs.

Figure 3. Affinity propagation rural typology ZCTAs occurring within USDA rural-urban continuum metro counties.
Table 3 shows the results of the cross-tabulation with the ACP typologies. Overall, agreement between the two typologies is low. Agreement tended to be highest amongst urban AP typologies, corresponding with the ACP's Big Cities and Urban Suburbs typologies. For rural areas, the highest agreement was for 67005 and Rural Middle America (58%), 27410 and African American South (45%), 58640 and Graying America (40%), and 90710 and Native American Lands (44%).
Table 3. Crosswalk comparison of affinity propagation typologies and American Communities Project types (share of ZCTAs within each ACP category).
There are also several ACP types that did not align with any of our types, including Military Posts, College Towns, LDS Enclaves, and Middle Suburbs. These mismatches reflect differences in both input data (ACP included religion based on nongovernmental data) and method (an algorithm-based versus a human-interpretation approach). Finally, one rural AP typology, 95333, is consistently found in Urban Suburbs or Big Cities counties, indicating misalignment perhaps due to spatial scale (i.e., rural ZCTAs in counties with major urban areas).
Comparing Resiliencies
First, we analyze whether our aggregations at the county level can explain the variation among counties for the three metrics we selected: share of employed population, share of households in poverty, and PCI. We then investigate the role of the exemplars in explaining the dynamics of PCI, occupation, and poverty through time. We focus on the two most recent complete successive waves of data available (2012–2018), and we calculate the difference in PCI, occupation, and poverty at the county level. We then run a PCA to assess the percentage of variability explained by the exemplars (in terms of share of county variation explained) and to identify the principal components of the variability in the database. Figure 4 shows the eigenvalues associated with the role of the exemplar and with each of the other three variables (PCI, occupation, and poverty). Eigenvalues greater than 1 designate the principal components of the database, namely the variables that can be focused on to reduce the dimensionality of the analysis while still preserving the main information in the data. In our analysis, exemplars are the only dimension to play the role of a principal component.
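The eigenvalue-greater-than-1 rule (often called the Kaiser criterion) can be illustrated with a short sketch. This is not the study's computation: the data are synthetic, the four column names are stand-ins for the exemplar similarity and the three change variables, and the choice to run the PCA on the correlation matrix is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
n_counties = 300

# Synthetic stand-ins (not the study's data): similarity of each county to its
# dominant exemplar, plus 2012-2018 changes in PCI, occupation, and poverty.
similarity = rng.normal(size=n_counties)
d_pci = 0.7 * similarity + rng.normal(scale=0.7, size=n_counties)
d_occupation = 0.5 * similarity + rng.normal(scale=0.9, size=n_counties)
d_poverty = -0.6 * similarity + rng.normal(scale=0.8, size=n_counties)
X = np.column_stack([similarity, d_pci, d_occupation, d_poverty])

# PCA on standardized variables is an eigendecomposition of the correlation matrix.
corr = np.corrcoef(X, rowvar=False)
eigvals = np.linalg.eigvalsh(corr)[::-1]  # eigvalsh returns ascending order; reverse it

explained = eigvals / eigvals.sum()       # share of total variance per component
n_retained = int((eigvals > 1.0).sum())   # components kept under the eigenvalue > 1 rule

print("eigenvalues:", np.round(eigvals, 3))
print("explained variance shares:", np.round(explained, 3))
print("components retained:", n_retained)
```

Because the eigenvalues of a correlation matrix sum to the number of variables, an eigenvalue above 1 marks a component that carries more variance than any single standardized variable would on its own.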
The overall landscape of rural socioeconomic resilience since the GFC varies dramatically depending on the prevalence of ZCTA types in each county (Figure 5a, 5b, and 5c). Overall, even though counties where one rural typology is prevalent may have seen a slower increase in their PCI (Figure 5a), poverty and occupation levels tell a more complex story. Counties with typologies similar to exemplars 31803, 11102, and 86039 show a very limited increase in poverty levels. 86039 also shows stable levels of occupation. This cluster is characterized by a high Native American population and households with several family members, which constitute a solid social protection network that helps in navigating the turbulence of the GFC.

Figure 5. Change in the level of per capita income (a), occupation (b), and poverty (c).
An interesting dynamic can be observed in the similar changes in poverty and PCI levels: counties with the lowest positive changes from 2012 to 2018 are those where rural clusters predominate. In other words, even though these counties have not increased their PCI as much as more urban-dominated counties, the level of poverty has grown more slowly, while the occupation level shows a rather heterogeneous dynamic.
The results in Figure 5 suggest a smooth dynamic through the business cycle and low sensitivity to positive/negative socioeconomic shocks. On the other hand, their inertia could drive down the ability of rural counties to recover from negative shocks or to take advantage of a positive upturn in the overall economic conditions, as in the case of the 2009 to 2020 expansion, partly captured by our timeframe shown in Figure 6.

Figure 6. Comparison of the percentage change in the level of per capita income, occupation, and poverty across counties with a prevalence of rural and urban exemplars. Source: authors’ calculations on American Community Survey, U.S. Census Bureau.
This lack of dynamism is well embodied by 75416, 58640, and 28441: there, PCI recovered slowly, while occupation levels and poverty increased during the period at a faster pace than in other typologies. In this sense, what we identified as rural shows limited economic resilience similar to that which Lema et al. (2019) found in some communities across the Laurentian Great Lakes, as well as a peculiar dynamic for the job market, mostly driven by unskilled labor.
A special case in the analysis of the differences before and after the crisis must be made for the group of counties represented by the exemplars 38321 and 95333, which both show a drop in PCI. The latter, 95333, characterized by a high percentage of Hispanics, simultaneously shows a considerable increase in occupation and poverty. As in 95333, the exemplar 38321 is characterized by a dominant primary sector (forests), but interestingly, shows a simultaneous drop in the level of poverty and occupation. This dynamic seems to be consistent with an increase in the efficient and effective use of the natural resources available, especially regarding the management of the labor force.
Among the rural typologies, counties with a prevalence of 95333 show increases in their active workforce and a relatively average performance compared to all other typologies in terms of PCI. This typology is in the western United States, mostly concentrated in the interior of California and with a high percentage of foreign-born residents. It is relatively average in terms of sectoral employment, except for a slightly higher share of workers employed in the agricultural and natural resource sector. These results also suggest that the economic vitality of the counties represented by this exemplar may be affected by future policies restricting immigration and/or by negative shocks to higher education institutions. Overall, the exemplars identified by AP are quite representative even when used to characterize a higher spatial level of aggregation. As shown in Figure 5, the degree of similarity between the county and its dominant exemplar explains most of the overall variability in the change of PCI, occupation, and poverty (53%).
In our interpretation, the representativeness of the exemplar can be seen as a mitigating factor useful to those wishing to tailor monetary and fiscal measures in response to positive/negative shocks.
Comparing Resilience Across Definitions
When we compare our results with those obtained by using the USDA and ACP definitions, we find that even though the magnitude of urban-rural resilience differs quite substantially, the overall dynamic is similar (Figure 7).

Figure 7. Comparison of the percentage change in the level of per capita income, occupation, and poverty across counties with a prevalence of rural and urban exemplars through different typology definitions. Source: authors’ calculations on American Community Survey, U.S. Census Bureau.
The dynamic of rural areas is similar between the ACP and our results. Even though rural areas in the ACP lag on income and occupation, their poverty level also grew less. Urban typologies in the ACP display far lower levels of growth for all indicators. Compared to the USDA and the ACP, our results show a larger magnitude and stronger resilience among rural typologies. The differences are particularly interesting when looking at the occupation levels and poverty. In both the ACP and the USDA, these are substantially lower in rural areas than in urban ones. Using AP, this difference is greatly reduced, indicating a lively dynamic across what is often thought of as Rural America, even though the jobs in these rural areas are likely low wage/low productivity ones.
Two lessons can be learned from this comparison: (1) multidimensional measures seem to better capture the resilience of multiple typologies of rurality; and (2) a data-driven approach captures the heterogeneity of rural places while still providing a simplified framework in which to examine economic policies.
Discussion: Defining Typologies—A Data-Driven Approach
Rural/urban categorizations have long been used across social sciences to define the locus of research. Rural places or regions, in particular, are often the center of both the scientific and political debates and carry a powerful imaginative set of underlying characteristics that researchers, policy makers, and the public may carry with them. While a binary dichotomy between rural and urban is too simple to capture the demonstrated heterogeneity along the urban-rural continuum, there is still a need to simplify and generalize rural places. An approach like the one shown in this paper balances the need to capture heterogeneity in space and place while still providing the relatively simplified picture required for comparative analysis and policy making.
In this paper, we compare our data-driven approach to a traditional threshold approach used by the USDA. Our comparison shows that while both methodologies agreed on the location of urban typologies (although our data-driven approach that uses a rich set of social, demographic, economic, and geographic factors highlights diversity within and between urban areas), there are many discrepancies between these approaches for rural areas. These discrepancies are due to multiple factors, including areal units and other factors considered in the typologies.
The Modifiable Areal Unit Problem (MAUP) is a well-established phenomenon in the geospatial sciences in which spatial patterns can be created or obfuscated by the arbitrary aggregation of data to larger areal units (Openshaw, 1984). The MAUP is well documented in analyses using administrative boundaries such as counties or ZCTAs because these boundaries are often disconnected from the underlying spatial pattern and their shape and size vary greatly depending on location. Reducing aggregation is one solution to MAUP. In this study, the comparison of ZCTA-level to county-level typologies clearly demonstrates the impact of MAUP. Both the AP and USDA typologies incorporate aggregated values (e.g., population) and adjacencies (e.g., USDA uses neighboring counties and our typologies use the number of ZCTAs within a given distance). The result is that the USDA metro typology creates much larger metro areas where counties are larger in area, which is more a product of historical settlement patterns than of current demographics or economics. Our comparative analysis demonstrates that many rural areas near medium-sized cities are omitted in the USDA data set, largely due to its larger areal units and their unequal distribution. This difference is particularly evident in the western United States, which has much larger counties. Large portions of California, Arizona, Montana, Idaho, Washington, and other states are metro areas according to the USDA, although these areas contain many of what our study defines as rural typologies. The MAUP for the USDA county-level typologies is further compounded by the use of population rather than population density across areal units of variable size. This has significant policy implications, as rural areas in the western states, which may have more rural characteristics in terms of distance to urban areas, population density, land cover, and economic activity than areas in eastern states, are not correctly classified.
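A toy calculation makes the aggregation effect concrete. Everything in the sketch below is hypothetical: the four ZCTAs, their populations and areas, and both cutoffs are invented for illustration and are not the USDA's or this study's actual rules.

```python
# Hypothetical ZCTAs inside one large county: (population, area in square km).
zctas = [
    (60000, 50.0),    # a compact medium-sized city
    (800, 2000.0),    # sparsely populated surroundings
    (500, 3000.0),
    (1200, 2500.0),
]

POP_THRESHOLD = 50000      # hypothetical county-level "metro" cutoff on total population
DENSITY_THRESHOLD = 100.0  # hypothetical "rural" cutoff, people per square km

# County-level view: aggregate first, then classify the whole unit.
county_pop = sum(pop for pop, _area in zctas)
county_area = sum(area for _pop, area in zctas)
county_is_metro = county_pop >= POP_THRESHOLD
county_density = county_pop / county_area

# ZCTA-level view: classify each smaller unit by its own density.
rural_flags = [pop / area < DENSITY_THRESHOLD for pop, area in zctas]

print("county classified as metro by population:", county_is_metro)
print("county density (people per sq km):", round(county_density, 1))
print("ZCTAs that look rural by density:", sum(rural_flags), "of", len(zctas))
```

Under these assumed numbers, a population threshold labels the entire county metro even though three of its four ZCTAs, and the county's own average density, fall well below the density cutoff.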
The comparison of the data-driven AP typologies with the ACP types also illustrates the differences between spatial units and highlights the importance of input data in defining community typologies. While both our approach and the ACP use demographic data that include population density, age, race, and income, the ACP also includes religion (which we omitted due to a lack of reliable publicly available data at the zip code level) and military service. It should be noted that the ACP lacks a detailed methodology on its website, and neither the methodology nor the results have been published in a peer-reviewed publication. Moreover, the ACP methodology notes that some of the types were created after the analysis, and the algorithm used to create the types is not listed. For several ACP types, our results find multiple comparable types, particularly for Big Cities, Urban Suburbs, Exurbs, Graying America, and Rural Middle America. However, our types do not align with ACP's Military Posts, College Towns, LDS Enclaves, Aging Farmland, and Middle Suburbs. These differences are due to several factors, including differences in spatial units but, more importantly, differences in input data. While the 36 factors in the ACP are not specified, we can assume that the data drawn from the ACS are similar, including the date range, while the non-ACS data differ. While the ACP includes religion, military, and politics, our analysis includes land cover, proximity to urban areas, and proximity to transportation infrastructure. This highlights the need for participation by practitioners in the selection of the input criteria that are used to define community types as well as the need for transparent and peer-reviewed methodologies.
These findings are particularly relevant to policy makers and scientists in three fundamental ways. First, this new landscape offers the opportunity to rethink how initial hypotheses are formulated, for example when looking at issues related to rural poverty, accessibility, or racial inequality. Second, these typologies offer the opportunity to diversify the rural space across understandable subsets of drivers. Finally, real-world exemplars offer the unique opportunity to focus on subsets of highly representative places, where stakeholders can be more easily engaged when looking at nationwide studies.
In an initial application of our results, we focus on the dynamic of three indicators to capture how counties with a prevalence of one rural typology fared through and after the GFC. We find that the exemplars well capture how PCI, occupation, and poverty levels changed in the years 2012 to 2018. Interestingly, counties with a rural prevalence displayed a less-pronounced increase in income and labor participation but have also shown lower increases in poverty levels. The causes of these patterns are complex as there are a variety of narratives and processes taking place based on the attributes of a given place. Factors that may affect these patterns include the attributes we use to characterize these rural typologies: existing poverty, economic sectors, education, land cover, and accessibility. In terms of policy formulations, our focus is not on these latter results, even though they provide a first, limited picture of how counties in the United States behaved through a period of economic expansion. Rather, we focus here on how AP has been able to deliver a more complex landscape of Rural America to identify drivers of this complexity and to elect descriptive exemplars that depict the selected socioeconomic drivers.
When applied to capture a simple set of metrics recording the post-crisis resilience, AP categories show complex dynamics across the American human landscape. The use of ZCTAs allows us to build higher-level typologies (counties), and to briefly look at how the presence of certain ZCTAs (i.e., typologies) can affect the resilience of entire counties.
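One simple way to build county-level typologies from ZCTAs is to assign each county the typology held by the largest number of its ZCTAs. The sketch below assumes that plurality-by-count rule; the county names, the assignments, and the helper `dominant_typology` are hypothetical, and the study may define prevalence differently (for example, weighting by population or area).

```python
from collections import Counter

# Hypothetical (county, AP typology) pairs, one per ZCTA; invented for illustration.
zcta_rows = [
    ("County A", "38321"), ("County A", "38321"), ("County A", "49612"),
    ("County B", "95333"), ("County B", "95333"), ("County B", "95333"),
    ("County B", "13032"),
]

def dominant_typology(rows):
    """Return {county: (most common typology, its share of the county's ZCTAs)}."""
    by_county = {}
    for county, typology in rows:
        by_county.setdefault(county, Counter())[typology] += 1
    result = {}
    for county, counts in by_county.items():
        typology, n = counts.most_common(1)[0]
        result[county] = (typology, n / sum(counts.values()))
    return result

dominant = dominant_typology(zcta_rows)
for county in sorted(dominant):
    typology, share = dominant[county]
    print(county, typology, round(share, 2))
```

Keeping the share alongside the label preserves how strongly a typology dominates a county, which matters when a county is a near-even mix of types.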
The data-driven approach presented in this paper offers many advantages over more traditional threshold-based typologies (USDA) and typologies derived from subjective definitions (ACP). First, a data-driven approach reduces the amount of a priori determination in the typologies. While the selection of the data used to create the typologies will always contain a degree of subjectivity, the actual delineation of the typologies is based on data and can be replicated using the same data. This allows patterns in the data to drive the typologies and produce new or unexpected results that can better inform policy makers of current socioeconomic patterns, especially in heterogeneous and dynamic populations. It also provides improved transparency in community typologies.
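The replicability point can be illustrated with a minimal affinity propagation run. This is a sketch under stated assumptions, not the study's pipeline: it uses scikit-learn's `AffinityPropagation` on synthetic data, and the standardization step, the `damping` value, and the default preference are illustrative choices rather than the authors' settings.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic stand-in for a ZCTA attribute table (rows = areas, columns = attributes).
centers = np.array([[0.0, 0.0, 0.0], [5.0, 5.0, 0.0], [0.0, 5.0, 5.0]])
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(30, 3)) for c in centers])

# Put attributes on a common scale so no single variable dominates the similarities.
X_std = StandardScaler().fit_transform(X)

# A fixed random_state makes repeated runs on the same data give the same result.
ap = AffinityPropagation(damping=0.9, max_iter=1000, random_state=0).fit(X_std)

exemplar_rows = ap.cluster_centers_indices_  # row indices of the exemplar observations
labels = ap.labels_                          # cluster assignment for every row

print("number of clusters:", len(exemplar_rows))
print("exemplar row indices:", exemplar_rows.tolist())
```

Unlike centroid-based methods, each cluster here is represented by an actual observation, which is what allows a typology to be named after a real exemplar place; the number of clusters is not fixed in advance and depends on the preference setting.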
The data-driven approach presented here can also incorporate a greater number of factors that better describe rural communities while still maintaining a relatively small number of generalized typologies. For example, our results identified 14 types of rurality that vary based on land cover, income, race, education, and proximity to urban areas and transportation infrastructure. Our study greatly expands upon the typologies provided by the USDA. And, like the ACP, it increases a researcher's ability to capture nuanced socioeconomic regional profiles while still considering the physical continuum of the continental United States.
A data-driven approach also allows the differences between typologies to potentially vary by region. One of the major disadvantages of a threshold approach is that the threshold for urban or rural may differ based on location or other contextual factors such as land cover or population density. A data-driven approach allows for different dominant factors for each typology, with many factors having geographic and regional patterns. The result of our analysis is a composite of regional typologies that can help policy makers identify regional differences that may require different or tailored approaches to economic development. Finally, our approach provides consistent and transparent results including exemplar locations for each community typology. This serves practitioners of economic development by providing rural typologies created using repeatable methods and providing specific locations for comparisons where data collection or methods require case studies.
There are several potential disadvantages to a data-driven approach for policy makers and economic development practitioners. First, where threshold-based typologies are easy to interpret, data-driven typologies can be complicated, both in creation and definition. Not only do these more complicated typologies require additional interpretation, but there may also be hidden variability within typologies that could affect policy implications. For example, there may be statistically significant variation within a given typology depending on location (e.g., metro vs. nonmetro counties). Since the dominant factors may vary between typologies, it can be difficult to control for any given factor without additional analysis. Finally, with less well-defined boundaries between typologies, connecting the typologies with decision-making tools such as decision trees becomes difficult if the decision tree criteria cross typology boundaries.
Second, while a data-driven approach removes subjective a priori thresholds or boundaries for typologies, it allows for subjectivity in the factors used to generate the typologies. In this study, we incorporate 38 factors covering a wide range of environmental, social, demographic, and economic factors. While this approach attempts to characterize communities and capture heterogeneity, it is a subjective selection. For this paper, we sought to demonstrate the potential of a data-driven approach rather than identify a definitive set of typologies. Further research in consultation with economic development practitioners is needed to refine and improve input parameter selection or develop a series of typologies for different application scenarios.
Third, a data-driven approach is not a panacea for identifying typologies. AP is essentially an unsupervised classification and one of the greatest weaknesses of unsupervised classifications is that the resulting typologies may not align with real-world features. For example, we also considered including climate as a factor since it affects natural resources, tourism, and other types of economic activity. When we included climatic factors in a model—thus increasing the number of attributes and the dimensionality of data space—the results were nonsensical. For example, much of Manhattan in New York City was classified as the same type as agricultural rural Midwest locations.
When applying AP to the ACS data to compare resilience, we highlight two methodological lessons. By combining more complex types of rurality across multiple drivers, we can capture more complex patterns of resilience and synthesize what researchers may have found by using the two existing typification processes. Although illustrative, our results suggest the need to focus on growth policies, rather than poverty-containment measures in rural counties. This is in line with policies enacted, for example, across the Great Lakes region by successful, mixed rural-urban metropolitan statistical areas, as reported by Lema et al. (2019). At the same time, the level of aggregation at the county level also shows the need to address the complex landscape of Rural America, which reflects similar issues emerging in regional sciences (Casellas & Galley, 1999; Dall’erba, 2005; Fiaschetti et al., 2021; Lagendijk, 2003).
The second lesson, the simplicity and replicability of our framework, also has practical implications. Our results derive from aggregating spatially nested areal units (ZCTAs), which delivers a replicable method for analyzing complex jurisdictional arrangements and comparing them at the state level. In other words, although in this paper we typified counties by prevalent ZCTAs, we could equally have compared state performance by aggregating ZCTAs at higher-level scales, depending on the most appropriate level of governance for promoting regional growth (e.g., state level in Rhode Island, or township level in Connecticut). From the perspective of regional development, our results are only an overview of the dynamics of recovery for rural and urban counties in the United States. Future research will have to focus on two major aspects. First, researchers will need to choose the sets of inputs that stakeholders think determine rurality or that matter for regional economic development. Second, researchers will have to work at different levels of aggregation, beyond counties, and consider spatial spillovers and patterns across a human landscape that has become too complex to be reduced to a simple, threshold-based dichotomy.
In conclusion, we demonstrate the potential application of a data-driven methodology to identify typologies that depict urbanity-rurality as a continuum across multiple factors. By applying this data-driven approach to ZCTAs, we demonstrate the diversity of rural communities. Our approach has many advantages over threshold-based county-level typologies. We use our classifications to analyze the post-GFC recovery at a higher areal level (counties), showing how well their dominant typologies describe counties even when repartitioning results are based on the current rural-urban dichotomy. The results we obtain show a slower-paced recovery in rural counties, in line with results from other definitions, but more complex and less negative dynamics among rural-dominated typologies.
While this approach can create more complicated typologies, it has the potential to improve policy making by helping policy makers consider a greater range of socioeconomic factors in rural communities.
Footnotes
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
Supplemental Material
Supplemental material for this article is available online.
