Abstract
In this article, we present a historical dataset of activity spaces, originally based on publicly posted and geotagged social media sent within the United States from 2012 to 2019. The dataset, which contains approximately 2 million users and 1.2 billion data points, is de-identified and spatially aggregated to enable ethical and broad sharing across the research community. By publishing the dataset, we hope to help researchers to quickly access and filter data to study people’s activity spaces across a range of places. In this article, we first describe the construction and characteristics of this dataset and then highlight certain limitations of the data through an illustrative analysis of potential bias—an important consideration when using data not collected through representative sampling. Our goal is to empower researchers to create novel, insightful research projects of their own design based on this dataset.
Introduction
It has been well over 15 years since location-based services (LBS) and other geo-enabled digital platforms (e.g., Facebook and Twitter) became commonplace in both daily life and research. These platforms jumpstarted renewed interest in computational approaches in the social sciences in general (Lazer et al., 2009) and the spatial sciences in particular (Miller, 2010; Singleton and Arribas-Bel, 2019), and researchers have both analyzed their social implications and leveraged their datasets for spatial research. This journal alone has published dozens of articles on, or using, social media data. Although this research continues, particularly during the COVID-19 pandemic (e.g., Terroso-Saenz et al., 2022), change is on the horizon. A combination of increasing privacy and ethical concerns (Zook et al., 2017) and the increasing value of data (Sadowski, 2019) means that the open platforms of the early 2000s have been largely replaced by closed APIs and walled-off data. Facebook restricts access to external researchers (Brown, 2020), and Twitter, once one of the most accessible platforms for researchers, has radically altered its API and terms and conditions after Elon Musk purchased the company in 2022.
This results in the paradoxical situation of ever-more data that at the same time is not readily accessible to the scientific community. Instead of direct API access to specific data and platforms, we increasingly encounter disembedded “mobile application data” derived from people’s interactions with an opaque plethora of mobile apps. This data is generated by individual people through the digital apps they use, sold to a web of data brokers who, after aggregating and combining data from many sources, sell access to the combined data. In short, the shape of, and access to, digital data in geographic research is changing precisely as research increasingly shows the potential of this data to help understand human mobility and social processes more broadly (e.g., Ballantyne et al., 2022; Xu, 2021). Were such data to be more widely available in open and ethical ways, more insightful research on mobility and people’s activity spaces could be conducted (Poom et al., 2020).
With this in mind, we devise a method for more widely sharing historical social media data that we have collectively created over the last decades to offer an open, standardized data source for geographic research. While a myriad of research designs might leverage social media data, we focus our effort on the concept of activity spaces specifically. Activity spaces, encompassing all the activities and locations that an individual might visit during their daily life, are a cornerstone of geographic thought tracing back to Hägerstrand’s time geography (Hägerstrand, 1970). New data sources, including the social media data described here, have enabled an increasing integration of this concept in a wide range of geographic work (Müürisepp et al., 2022). Data on activity spaces can help illuminate broader urban processes, ranging from gentrification and neighborhood change (Poorthuis et al., 2021) to segregation and access to green space (Heikinheimo et al., 2020; Väisänen et al., 2022). However, access to open data is increasingly challenging, potentially leading to a fragmented landscape where studies are difficult to compare, replicate, or even start if researchers cannot access the requisite source data.
In this article, we present a historical dataset of activity spaces, originally based on publicly posted and geotagged Twitter posts across the United States from 2012 to 2019. The dataset, which contains approximately 2 million users and 1.2 billion data points, is de-identified and spatially aggregated to enable ethical and broad sharing across the research community. By publishing the dataset, we hope to help researchers quickly access and filter data to study people’s activity spaces across a range of places, from downtown Chicago to rural Montana, or conversely to support the analysis of the origin of visitors to the nation’s national parks or to one specific neighborhood in Austin, Texas. In this article, we first describe the construction and characteristics of this dataset and then highlight certain limitations of the data through an illustrative analysis of potential bias, a perennial concern when using data not collected through representative sampling (Longley et al., 2015; McNeill et al., 2017). From this basis, researchers can be empowered to create novel, insightful research projects of their own design.
Data
We started the data collection for the underlying repository at the University of Kentucky in 2012 via Twitter’s then-open Streaming API. The collection was not limited to specific search terms but included all geotagged tweets within the boundary of the United States. As some of the co-authors moved institutions, the collection continued at Singapore University of Technology and Design (2016–2020) and KU Leuven (2020–2023) and terminated when Twitter retired public access to its v1 API in early 2023. The exact collection approach and technology are described in Poorthuis and Zook (2017).
By design, and as required by the platform’s terms and conditions, the original data is not shared here. The landscape of the platform’s API and terms and conditions has changed radically over the years. For example, it used to be possible to share the raw tweet ID so that other researchers could “rehydrate” the additional information from Twitter’s API, but this method is no longer feasible at the time of writing this article. Therefore, to safeguard the longevity of this dataset, meet our ethical obligations, and clearly avoid breaching the terms and conditions under which the data was collected, we have elected not to share any of the actual or original data. Rather, we offer a new aggregated and de-identified dataset, tailored for geographic research on human mobility, with transparent details on our method. We hope that this might be the first of multiple such datasets for different regions and topics.
We start by selecting data points between 2012 and 2019 (excluding the pandemic, given its likely effect on activity spaces) created by users with 15 (our baseline for useful information on activity spaces) to 10,000 observations (to minimize bias from bots or power users). As geotagged tweets might have either a precise coordinate location (e.g., derived from GPS) or a more general “place” (e.g., “New York, NY”), we consider only the former. Given our focus on activity spaces, we retain only the timestamp and the location field, discarding other data such as user name or post text, yielding an initial set of around 10.25 million users.
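The selection rules above can be illustrated with a minimal sketch. It assumes that observations are counted after discarding “place”-only geotags (the text does not state the order), and the record keys (`user`, `timestamp`, `lat`, `lon`, `text`) are hypothetical names chosen for illustration.

```python
from collections import Counter
from datetime import datetime

MIN_OBS, MAX_OBS = 15, 10_000      # per-user thresholds stated in the text
FIRST_YEAR, LAST_YEAR = 2012, 2019  # study period stated in the text


def filter_points(points):
    """Keep precisely geotagged points from users with 15 to 10,000 observations.

    `points` is a list of dicts with hypothetical keys: user, timestamp, lat, lon.
    lat/lon are None when a post only carries a general "place" geotag.
    """
    precise = [
        p for p in points
        if p["lat"] is not None and p["lon"] is not None
        and FIRST_YEAR <= p["timestamp"].year <= LAST_YEAR
    ]
    per_user = Counter(p["user"] for p in precise)
    keep = {u for u, n in per_user.items() if MIN_OBS <= n <= MAX_OBS}
    # Retain only timestamp and location (plus the user key needed for grouping).
    return [
        {"user": p["user"], "timestamp": p["timestamp"], "lat": p["lat"], "lon": p["lon"]}
        for p in precise if p["user"] in keep
    ]


# Toy data: user "a" has 20 precise points, "b" only 5, "c" only place-level geotags.
demo = (
    [{"user": "a", "timestamp": datetime(2015, 1, 1, h), "lat": 38.0, "lon": -85.0, "text": "x"} for h in range(20)]
    + [{"user": "b", "timestamp": datetime(2015, 1, 2, h), "lat": 38.1, "lon": -85.1, "text": "y"} for h in range(5)]
    + [{"user": "c", "timestamp": datetime(2015, 1, 3, h), "lat": None, "lon": None, "text": "z"} for h in range(20)]
)
kept = filter_points(demo)
```

In this toy example only user "a" survives, and the post text is dropped from the retained records.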
To convert this raw data into activity spaces, we infer the most likely home base of each user. We use the `homelocator` R package to infer home bases at resolution 10 of the H3 grid (Uber Technologies, 2023). We use the proposed ensemble approach (i.e., requiring a three-algorithm agreement before a positive match is made) detailed in Chen and Poorthuis (2021), who found an 81.5% precision rate compared to a hand-labeled ground-truth dataset. The ensemble approach is relatively strict to prevent false positives and provides the most likely home base for only 19.18% of users, resulting in a final, raw dataset consisting of approximately 2 million users and 1.2 billion data points.
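The idea of requiring agreement between algorithms can be sketched as a simple vote. This is not the `homelocator` API, which is an R package; the algorithm labels and cell identifiers below are hypothetical placeholders, and reading “three-algorithm agreement” as “at least three candidate cells coincide” is an assumption.

```python
from collections import Counter


def ensemble_home(candidates, min_agree=3):
    """Return the cell that at least `min_agree` algorithms agree on, else None.

    `candidates` maps a (hypothetical) algorithm label to the home cell it
    inferred for one user, or None if that algorithm found no home.
    """
    votes = Counter(cell for cell in candidates.values() if cell is not None)
    if not votes:
        return None
    cell, n_votes = votes.most_common(1)[0]
    return cell if n_votes >= min_agree else None


# Three of four placeholder algorithms agree: a home base is assigned.
agreed = ensemble_home({"algo_a": "cell_1", "algo_b": "cell_1", "algo_c": "cell_1", "algo_d": "cell_2"})
# Only two agree: the user is left without a home base and drops out.
undecided = ensemble_home({"algo_a": "cell_1", "algo_b": "cell_1", "algo_c": "cell_2", "algo_d": None})
```

A strict threshold of this kind trades coverage for precision, which is consistent with only a minority of users receiving a home base.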
Although this dataset does not contain any obvious identifying characteristics, location itself is sensitive data. To enable ethical sharing of data, we create a new aggregated, de-identified dataset from this input data through a combination of filtering, aggregation, and perturbation in line with Chen and Poorthuis (2021). The process of collecting the initial data and then creating the new de-identified dataset is summarized in Figure 1. We should note that the guaranteed maintenance of privacy and prevention of identification in such data is an open research problem (Fiore et al., 2020) and very challenging if the data is to retain value for mobility research (Lestyán et al., 2022). With this in mind, we believe our design allows us to ethically share this dataset as the data was originally posted in a public forum and both the likelihood of re-identification and potential impact (cf. Solymosi et al., 2023) are low. Specific design elements encompass: (1) including only timestamp and location, excluding more sensitive or identifying information; (2) transforming and aggregating these two variables: timestamp rounded to the nearest hour and location to a 500–1500-m grid cell; (3) adding random perturbation to both timestamps and locations, which provides a degree of plausible deniability; and (4) removing locations with very few observations (i.e., five or fewer observations).
Figure 1. An overview of the data collection, aggregation, and de-identification workflow.
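Parts of the design elements listed above can be sketched in a few lines. The sketch below covers rounding to the nearest hour, a simple timestamp jitter, and the removal of rarely observed locations. The jitter of up to one hour is an assumption (the perturbation parameters are not given here), cell identifiers are placeholders, and location perturbation is omitted because it depends on the grid library used.

```python
import random
from collections import Counter
from datetime import datetime, timedelta

MIN_CELL_OBS = 5  # cells with five or fewer observations are removed (from the text)


def round_to_hour(ts):
    """Round a datetime to the nearest full hour (half hours round up)."""
    base = ts.replace(minute=0, second=0, microsecond=0)
    return base + timedelta(hours=1) if ts - base >= timedelta(minutes=30) else base


def deidentify(points, rng, max_shift_hours=1):
    """Round and jitter timestamps, then drop rarely observed cells.

    `points` is a list of (timestamp, cell) tuples where `cell` is an already
    aggregated grid cell ID. `max_shift_hours` is an illustrative assumption.
    """
    perturbed = []
    for ts, cell in points:
        shift = timedelta(hours=rng.randint(-max_shift_hours, max_shift_hours))
        perturbed.append((round_to_hour(ts) + shift, cell))
    counts = Counter(cell for _, cell in perturbed)
    return [(ts, cell) for ts, cell in perturbed if counts[cell] > MIN_CELL_OBS]


# Toy data: six observations in one cell, five in another.
demo_points = (
    [(datetime(2015, 3, 1, 9, 10 * i), "cell_busy") for i in range(6)]
    + [(datetime(2015, 3, 2, 9, 10 * i), "cell_rare") for i in range(5)]
)
shared = deidentify(demo_points, random.Random(0))
```

In this toy run the cell with five observations is removed entirely, while the remaining timestamps all fall on whole hours.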
The resulting dataset, amounting to 13 GB, is hosted in a series of Parquet files accessible at https://doi.org/10.48804/MBT32W. The dataset contains the following fields:
• “id”: a randomly generated identifier for each data point.
• “user_id”: a randomly generated identifier for each user.
• “timestamp”: a UTC date-time stamp, rounded to the nearest hour.
• “timezone”: the time zone corresponding to the location of the data point.
• “loc_h3_7”: the geographic location of the data point, at H3 resolution 7.
• “loc_h3_8”: the geographic location of the data point, at H3 resolution 8 (only included for urban locations).
• “home_loc_h3_7”: the most likely home location, at H3 resolution 7.
• “home_loc_h3_8”: the most likely home location, at H3 resolution 8 (only included for urban locations).
Parquet files can be read by most analytical environments (e.g., Python, R) and support efficient and partial download over HTTP, allowing access to a subset of the data from the URL. The online documentation of the data has various code examples that illustrate how to read, visualize, and process the data further.
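As an illustration of how these fields combine, the following sketch counts the home locations of visitors to one cell. It assumes the rows have already been read from the Parquet files into dictionaries keyed by the field names listed above; the reading step is not shown, and the cell identifiers are placeholders rather than real H3 indexes.

```python
from collections import defaultdict


def visitor_origins(rows, target_cell):
    """Count distinct users observed in `target_cell`, grouped by home cell."""
    users_by_home = defaultdict(set)
    for row in rows:
        if row["loc_h3_7"] == target_cell:
            users_by_home[row["home_loc_h3_7"]].add(row["user_id"])
    return {home: len(users) for home, users in users_by_home.items()}


# Toy rows following the published schema (values are placeholders).
demo_rows = [
    {"user_id": "u1", "loc_h3_7": "cell_park", "home_loc_h3_7": "cell_A"},
    {"user_id": "u1", "loc_h3_7": "cell_park", "home_loc_h3_7": "cell_A"},
    {"user_id": "u2", "loc_h3_7": "cell_park", "home_loc_h3_7": "cell_A"},
    {"user_id": "u3", "loc_h3_7": "cell_park", "home_loc_h3_7": "cell_B"},
    {"user_id": "u4", "loc_h3_7": "cell_mall", "home_loc_h3_7": "cell_B"},
]
origins = visitor_origins(demo_rows, "cell_park")
```

Counting distinct users rather than data points avoids giving more weight to users who post frequently from the same location.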
Bias
Social media data has often been used to study mobility and mobility-related processes. However, the inherent bias within this data, and differences with representative sampling, are frequently raised as issues hindering further adoption (e.g., Longley et al., 2015; McNeill et al., 2017). Comprehensive analysis of this bias at a fine spatial resolution remains rare, and discussion is often limited to noting that Twitter users are younger and richer than the average American (Pew Research Center, 2019). Significantly, this understanding of user characteristics often comes from nationwide surveys by the Pew Research Center and does not offer any insights into the potential variation of this bias across different parts of the country. To address this gap, we examine the potential bias within this dataset by comparing the density of home locations to population data from the US Census.
Figure 2 compares the standardized census population against the user population in the dataset. What is clear is that the reliability of the data depends on the spatial scale of analysis. The correspondence at the state and county level is very strong (Pearson’s r = 0.99 and 0.98, respectively), but the correlation at the census tract level is much lower (Pearson’s r = 0.34), revealing considerable over- or underrepresentation in specific census tracts.
Figure 2. Spatial distribution of standardized users and its correlation to census population.
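The scale dependence of this correlation can be reproduced on toy data. The sketch below uses invented numbers (not values from the dataset) in which user counts track population perfectly once summed to counties but vary widely between tracts within a county. Note that standardizing both variables (z-scores) does not change Pearson’s r.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def aggregate(values, groups):
    """Sum `values` per group label and return the totals in sorted group order."""
    totals = {}
    for value, group in zip(values, groups):
        totals[group] = totals.get(group, 0) + value
    return [totals[g] for g in sorted(totals)]


# Invented tract-level values for three counties with three tracts each.
county = ["A", "A", "A", "B", "B", "B", "C", "C", "C"]
tract_pop = [10, 10, 10, 13, 13, 14, 16, 17, 17]
tract_users = [1, 5, 3, 7, 2, 3, 2, 9, 4]

r_tract = pearson_r(tract_pop, tract_users)
r_county = pearson_r(aggregate(tract_pop, county), aggregate(tract_users, county))
```

Aggregation averages out local over- and underrepresentation, which is why a strong county-level correlation says little about how well individual tracts are represented.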
To better illustrate the spatial variation in this misrepresentation of users, we conduct a geographically-weighted regression (GWR) with the number of user home locations as the dependent variable and the following independent variables: census population; median household income; median age; and percentage of white people. Since the number of users represents count data, we use a generalized linear model (GLM) to estimate a Poisson regression.
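Written out, a model of this kind could take the following form. This is a sketch under stated assumptions: a log link, census population entering in logged form as in Table A1, and notation introduced here for illustration, where y_i is the number of inferred home locations in area i and (u_i, v_i) are the coordinates of that area.

```latex
y_i \sim \mathrm{Poisson}(\mu_i), \qquad
\log \mu_i = \beta_0(u_i, v_i)
  + \beta_1(u_i, v_i)\,\log(\mathrm{population}_i)
  + \beta_2(u_i, v_i)\,\mathrm{income}_i
  + \beta_3(u_i, v_i)\,\mathrm{age}_i
  + \beta_4(u_i, v_i)\,\mathrm{white}_i
```

The global model is the special case in which every coefficient is constant across space. With a log link, exp(β) gives the multiplicative change in the expected count per unit increase of a variable. For example, the county-level median age coefficient of −0.0316 in Table A1 corresponds to exp(−0.0316) ≈ 0.969, or roughly 3% fewer expected users for each additional year of median age.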
Figures 3 and 4 show the estimates for each of the independent variables at the county and tract level, respectively. The global regression results are included for reference in Table A1.
Figure 3. The spatial distribution of coefficients for different independent variables at county level. (a) census population; (b) median household income; (c) median age; (d) percentage of white. The vertical dashed line on the legend indicates the coefficients from the global model.
Figure 4. The spatial distribution of coefficients by different independent variables at tract level. (a) Jefferson, KY; (b) Los Angeles, CA; (c) New York, NY.

At the county level, a few things stand out. First, the median age in the global model has a negative effect on the number of users, corresponding to the general relationship found in the Pew Research Center data. However, this effect is significantly diminished in the GWR local coefficients in parts of the West Coast and much stronger than the national average in a north-south corridor in the middle of the country. Even more extreme, the effect of the percentage of white people in this middle corridor is positive overall, as it is in areas such as the Pacific Northwest (i.e., a higher percentage of white people generally means a higher number of users within the dataset). The direction of this effect, however, flips completely in several areas, including California, Colorado, and the Gulf Coast region.
Shifting to the tract level highlights further localized differences (see Figure 4). For example, the effect for percentage white is consistently positive in Manhattan, including in Harlem, suggesting a relative underrepresentation of non-white people in these areas. In contrast, Los Angeles has areas with both positive and negative effects, with the neighborhood around Compton, which is home to a significant proportion of African-Americans, showing a negative effect. In short, the relationship between race and the density of users in the dataset might very well vary between these two major cities.
Given this variegated (bias in) density of users within the dataset, general and generalizing use of this data should be done cautiously. Nonetheless, the broad scope of this dataset also opens the door for specific approaches that address bias. Depending on the research question and design, users from different neighborhoods could be weighted differently so that the final dataset more closely resembles the population of interest. Alternatively, if the social or spatial group of interest is relatively small (or does not have a large presence in the dataset), these neighborhoods can be oversampled deliberately (and others undersampled). Finally, researchers might decide to focus only on users from one or more specific neighborhoods of interest.
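One simple way to implement such weighting is to give each user a weight equal to the population share of their home area divided by that area's share of users. The sketch below uses invented counts and placeholder tract labels. It is only one of several possible weighting schemes, and it ignores areas without any users.

```python
def area_weights(users_by_area, population_by_area):
    """Return a per-user weight for each area: population share / user share.

    Areas with zero users are skipped because no weight can compensate for them.
    """
    areas = [a for a, n in users_by_area.items() if n > 0]
    total_users = sum(users_by_area[a] for a in areas)
    total_pop = sum(population_by_area[a] for a in areas)
    return {
        a: (population_by_area[a] / total_pop) / (users_by_area[a] / total_users)
        for a in areas
    }


# Invented example: two equally populated tracts, one with four times the users.
demo_users = {"tract_1": 80, "tract_2": 20}
demo_population = {"tract_1": 5000, "tract_2": 5000}
weights = area_weights(demo_users, demo_population)
weighted_total = sum(demo_users[a] * w for a, w in weights.items())
```

Users from the overrepresented tract receive a weight below one and users from the underrepresented tract a weight above one, while the weighted total number of users is unchanged.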
Conclusion
In this paper, we present a de-identified, historical dataset of activity spaces, based on the public social media activity of approximately 2 million users across the United States. We share this data publicly against a backdrop of a changing social media data landscape. Digital platform data is no longer as generally or publicly accessible to researchers as it was in the first decades of this millennium. Instead, human mobility research increasingly relies on cell phone or mobile app data, which can require significant resources to procure and for which ethical and transparent approaches of working and sharing need to be developed further.
In the meantime, we hope that the current dataset might help leverage the last two decades of human activity on social media platforms and perhaps enable a larger group of researchers to incorporate human mobility data in their analyses. This, of course, needs to be done in a critical and cautious manner as this data is not created by a representative sample of the population. Nonetheless, the scale of the data makes it possible to consider and address potential bias at the research design phase.
Footnotes
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the KU Leuven Internal Funds (STG/20/2021).
Appendix
Table A1. An overview of the results of the GLM at both county and tract levels. Dependent variable: inferred home locations.

| | County level | Tract level |
|---|---|---|
| Logged census population | 1.0886*** (1.0873, 1.0899) | 0.7346*** (0.7315, 0.7377) |
| Median household income | 5.23e-8 (−3.47e-8, 1.39e-7) | 3.50e-6*** (3.45e-6, 3.54e-6) |
| Median age | −0.0316*** (−0.0320, −0.0312) | −0.0359*** (−0.0361, −0.0357) |
| Percentage of white | 0.0038*** (0.0037, 0.0039) | 0.0005*** (0.0005, 0.0006) |
| Constant | −5.2877*** (−5.3114, −5.2640) | −1.7124*** (−1.7411, −1.6836) |
| Observations | 3,087 | 67,490 |
| Log likelihood | −92,531.39 | −608,486.70 |
| Akaike Inf. Crit. | 185,072.80 | 1,216,983.00 |

Note: *p < .1; **p < .05; ***p < .01.
Sensitivity test for different bandwidths of the geographically-weighted regression.
An example of using the activity spaces of African Americans in Jefferson County, KY for specific time periods (a) Wednesday between 8a.m.-6p.m. (b) Sunday after 7p.m.
The lines indicate the connections between home locations and visited places. The choropleth map indicates the percentage of African-Americans relative to all visitors in each location. The vignette in the documentation of the dataset has more details on how the data for one specific county was extracted from the data and how activity spaces were linked to data from the Census.
The spatial distribution of African Americans in Jefferson County, KY according to different (residential vs. activity space) perspectives. The top-left includes two specific segregation metrics: the dissimilarity index (Duncan and Duncan, 1955) and the exposure index (Wong and Shaw, 2010). Note how the residential segregation patterns between the Census and the de-identified social media dataset are very similar but how the activity space perspective shows increased segregation on Sunday evenings when compared to a regular weekday. (a) Percentage of African American derived from Census data (b) Percentage of African American inferred from de-identified geotagged data (c) Percentage of African American at visiting places on Wednesday between 8a.m. and 6p.m. (d) Percentage of African American at visiting places on Sunday after 7p.m.
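For reference, the dissimilarity index mentioned in this caption can be computed with the standard two-group formula, D = 0.5 × Σ |a_i / A − b_i / B|, where a_i and b_i are the counts of the two groups in spatial unit i and A and B are their totals. The sketch below applies this formula to invented counts; it is not the computation used to produce the figure, and the exposure index is not covered.

```python
def dissimilarity_index(group_counts, other_counts):
    """Two-group dissimilarity index over spatial units (0 = even, 1 = fully segregated)."""
    total_group = sum(group_counts)
    total_other = sum(other_counts)
    return 0.5 * sum(
        abs(g / total_group - o / total_other)
        for g, o in zip(group_counts, other_counts)
    )


# Invented counts for two spatial units.
d_even = dissimilarity_index([5, 5], [5, 5])          # both groups spread identically
d_mixed = dissimilarity_index([8, 2], [2, 8])         # partial separation
d_segregated = dissimilarity_index([10, 0], [0, 10])  # no shared units
```

The same function can be applied to residential counts or to counts of visitors per location for a given time window, which is what allows the residential and activity space perspectives to be compared.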
