Abstract
Social media platforms play a crucial role in providing valuable information during crises, such as pandemics. The COVID-19 pandemic has created a global public health crisis, and vaccines are the key preventive measure for achieving herd immunity. However, some individuals use social media to oppose vaccines, undermining government efforts to eliminate the virus. This study introduces the “GeoCovaxTweets” dataset, consisting of 1.8 million geotagged tweets related to COVID-19 vaccines from January 2020 to November 2022, originating from 233 countries and territories. Each tweet includes state and country information, enabling researchers to analyze global spatial and temporal patterns. An extensive set of analyses is performed on the dataset to identify prominent topic clusters and explore public opinions across different vaccines and vaccination contexts. The study outlines the dataset curation methodology and provides instructions for local reproduction. We anticipate that the dataset will be valuable for crisis computing researchers, facilitating the exploration of Twitter conversations surrounding COVID-19 vaccines and vaccination, including trends, opinion shifts, misinformation, and anti-vaccination campaigns.
Introduction
The COVID-19 outbreak in December 2019 resulted in more than six million deaths and 600 million confirmed cases globally [36]. The respiratory illness caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) was declared a public health emergency of international concern on January 30, 2020, and a pandemic on March 11, 2020, by the World Health Organization. Vaccines such as Pfizer, Oxford-AstraZeneca, Johnson and Johnson, and Moderna were rolled out in late 2020 to control the spread [35]. To achieve the herd immunity needed to end the pandemic, it has been estimated that around 60-70% of the world population must get vaccinated [1]. However, one of the biggest hindrances to vaccination is hesitancy, which stems from numerous reasons, most importantly perceived fears. Myths and fears around vaccines are not new, and the COVID-19 pandemic has brought vaccine hesitancy back into the limelight. Social media, which provides real-time access to people’s opinions and beliefs across demographics, has frequently been identified as a hotbed of activity for anti-vaccination activists [18]. These activists may turn to social media platforms to oppose vaccinations with claims lacking scientific support. Arguments against vaccination often revolve around negative phrases such as “no forced vaccines” and “say no to vaccines”. A study [20] reported that during February and March 2021, 40–47% of American adults were hesitant towards COVID-19 vaccinations. Previous studies have linked social media and the anti-vaccination movement to vaccine hesitancy [4, 15]. The COVID-19 vaccines and vaccination-specific discussions on social media platforms have influenced people’s decisions to accept or reject vaccination. Therefore, understanding the nature of vaccine hesitancy through publicly available social media discussions for different geographical regions can open interesting research avenues in the crisis computing domain.
Numerous COVID-19-specific datasets have been released to study the nature of COVID-19 and its vaccines and vaccination contexts. [3, 6, 12, 21, 25] are some of the publicly available large-scale COVID-19 datasets that have hundreds of millions of tweets. However, these datasets focus broadly on general COVID-19 discussions, as their keyword sets are not specifically targeted at vaccines and vaccination-specific discourse. More closely focused datasets have also been introduced [7, 28, 30]; however, such datasets are either focused on a particular geographical region or have limited temporal coverage. To bridge these gaps, we present a large-scale geotagged tweets dataset, GeoCovaxTweets, containing vaccines and vaccination-related Twitter conversations with geolocation data. This dataset covers various perspectives on COVID-19 vaccines and vaccination, as our set of keywords and hashtags captures the conversational dynamics of support, criticism, and hesitance towards the COVID-19 vaccines and vaccination. Besides introducing this novel dataset, we also explore the dataset through topical and sentimental dimensions to generate an outlook on the COVID-19 vaccines and vaccinations over a period of nearly 34 months, i.e., January 2020–November 2022. In general, our contributions to the existing crisis informatics literature are summarized below:
First, we present GeoCovaxTweets, a large-scale global geotagged tweets dataset covering a 34-month period from the first COVID-19 vaccine announcement to the availability of booster doses. This dataset captures the conversational dynamics of support, criticism, and hesitance towards COVID-19 vaccines and vaccination. Additionally, we share normalized state names and country information for each collected tweet, ensuring precise geolocation information. The dataset can be accessed from Harvard Dataverse.
Second, we conduct topic modelling on the GeoCovaxTweets dataset to identify clusters of topics associated with vaccines and vaccination discussions. We assign names to these topic clusters based on the content of the tweets within each cluster. Our analysis reveals the existence of 142 topics at the global level, with each topic being discussed by a minimum of 1,000 geotagged tweets. To our knowledge, this comprehensive analysis represents the first instance of such an endeavour. We publicly release the trained topic model to enable further research.
Third, we analyze the tweets globally and across the 6 countries that appear most frequently in the discourse to identify prominent global and country-specific discussion clusters.
Several researchers have collected and shared large-scale COVID-19 tweets datasets. [3, 6, 21, 25, 32] provide large-scale tweet collections with different data distributions, collection periods, and sets of keywords. [3, 6] present multilingual datasets with 2.5 billion and 1.12 billion tweets, respectively. Similarly, [12] has 2 billion multilingual tweets, with the last release in March 2021. [21] maintains an English-only collection with more than 2.2 billion tweets; its re-hydrated version is also available [25]. The keyword sets differ among the datasets: [6] uses 80 keywords, [3] uses 10 keywords, [21] uses 90+ keywords, and [12] uses 800+ multilingual keywords. The common issue with these billion-scale datasets is that they are curated to capture the conversational dynamics of COVID-19 across every thematic area. In doing so, due to limits (per day upper limit and sample size) set by Twitter on its filtered stream endpoint, these datasets contain proportionately fewer theme-specific tweets compared to theme-dedicated collections.
Vaccines and vaccination-dedicated collections have also been released. [30] presented a dataset of 137 million tweets with anti-vaccine discussions. The dataset includes tweets fetched through two complementary collections: first, using 60 keywords through the streaming API, and second, collecting historical tweets from 70k accounts that allegedly propagated anti-vaccine tweets. [7] introduced a dataset of 4.7 million tweets that includes one week of vaccine-related tweets and designed a dashboard to track and quantify credible information and misinformation records along with their sources and keywords. Similarly, [10] presented spatio-temporal trends from March 01, 2022, to February 08, 2022, toward COVID-19 vaccines in the United States.
Several studies have utilized topic modelling and sentiment analysis to investigate public perceptions and identify prevalent concerns in COVID-19 vaccines and vaccination-specific discussions on Twitter. [31] performed topic modelling using vanilla LDA and identified 12 topics/themes from pro-vaccines and anti-vaccine group tweets. Similarly, [27] analyzed COVID-19 vaccine-related tweets and identified 16 topics grouped into five themes, with opinions about vaccination being the most tweeted topic. [38] extracted prevalent words in location-based tweets discussing vaccine brands, analyzed sentiment using VADER, and employed LDA to identify topics in positive and negative tweets. [11] used sentiment-based topic modelling, utilizing compound scores for sentiment polarity and coherence scores to identify the optimal topic number for various sentiment polarity categories. [17] analyzed tweet sentiment using the Brandwatch tool, measured the frequency of negative and non-negative tweets, and extracted 26 topics related to COVID-19 vaccine discussions. [29] employed nonnegative matrix factorization for topic identification and VADER for sentiment analysis, revealing concerns about vaccine administration and access, with fear being the predominant emotion followed by joy. [27] developed a three-tier model, encompassing sentiment-aware topic extraction, emotion-aware analysis of user reactions to topics, and a comprehensive three-perspective topic relevance analysis, within the context of a COVID-19 tweet dataset. [27] used VADER and the sentiment reasoner tool to calculate compound scores and classify tweet sentiments as positive, neutral, or negative. They conducted temporal and geographic analyses to identify trends and variations in sentiment across different locations. [2] discussed a simple attention-based neural network model following the transformer architecture, applying it to the task of analyzing sentiment in tweets. 
[34] employed artificial intelligence and geospatial methods to analyze tweets, determine sentiment polarity using the BERT model, and visualize sentiments on a world map. Hotspot analysis and kernel density estimation were applied to identify regions with positive, negative, or neutral sentiments.
However, these studies have limitations in terms of duration and focus mainly on the early stages of the pandemic. Thus, there is a need for research that explores the evolving nature of public perception of COVID-19 vaccines over an extended period.
To address the aforementioned gaps, we present GeoCovaxTweets, a large-scale, cross-national dataset of geotagged tweets related to COVID-19 vaccines and vaccination that covers a period of approximately 34 months, from January 2020 to November 2022. We used Twitter’s Full-archive search endpoint to collect historical tweets, which returns 100% of Twitter data for a given query condition, unlike the filtered and standard search endpoints used in previous studies, which only return 1% of Twitter data at a particular time. We also performed topic clustering and sentiment analysis on GeoCovaxTweets to identify (pro-anti-neutral) stances and prominent clusters that emerge from discussions related to COVID-19 vaccines and vaccination. Overall, we believe that the released dataset will assist researchers in the crisis computing domain in analyzing Twitter conversations to explore global spatiotemporal shifts and trends related to the COVID-19 vaccine narrative.
Methods
Data collection
We used the twarc Python library to access Twitter’s Full-archive search endpoint. The endpoint returns all historical tweets, unlike the standard search endpoint, which returns <1% of tweets at a particular time. To collect tweets relevant to COVID-19 vaccines and vaccination, we used meta-seed keywords having pro and anti-vaccine stances through a snowball sampling technique [37]. We started with the following seed keywords: CovidVaccineScam, NoForcedVaccines, covidvaccineVictims, BanCovidVaccine, getvaccinated, covishield affect, Covaxin affect, vaccineinjuries, NoForcedVaccination, vaccine, and vaccination. We then analyzed N-grams in the collected corpus every 10 minutes to track emerging relevant keywords, similar to the strategy applied in [21]. Furthermore, we refined the keyword list by identifying potential keywords that frequently co-occur with the seed keywords [7]. We provide the complete set of keywords and hashtags used for curating GeoCovaxTweets in Table 1. Finally, we applied the has:geo and lang:en conditional operators to collect only English-language tweets that are geotagged. The complete data curation process is illustrated in Fig. 1.
Keywords and hashtags used in curating GeoCovaxTweets

The data curation process.
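As an illustration of how the seed keywords and the conditional operators combine, the following minimal sketch assembles a search query string. The helper name build_query and the choice to quote multi-word keywords as exact phrases are assumptions made for this example; the sketch only builds the string and does not call the Twitter API.

```python
# Seed keywords as listed in the data collection description.
SEED_KEYWORDS = [
    "CovidVaccineScam", "NoForcedVaccines", "covidvaccineVictims",
    "BanCovidVaccine", "getvaccinated", "covishield affect",
    "Covaxin affect", "vaccineinjuries", "NoForcedVaccination",
    "vaccine", "vaccination",
]


def build_query(keywords):
    """Join keywords with OR and append the has:geo and lang:en operators.

    ASSUMPTION: multi-word keywords are wrapped in double quotes so that
    they are matched as phrases; the source does not state how they were
    encoded in the actual query.
    """
    terms = ['"{}"'.format(k) if " " in k else k for k in keywords]
    joined = " OR ".join(terms)
    return "({}) has:geo lang:en".format(joined)


if __name__ == "__main__":
    print(build_query(SEED_KEYWORDS))
```

Grouping the keywords in parentheses keeps the two operators applied to every keyword rather than only to the last one.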
For each collected tweet, we also provide state information besides country information. Although the geo.full_name tweet object contains state information, some tweets had [city, state] or [state, country] information (for instance, [Mumbai, Maharashtra] and [Maharashtra, India]), and therefore normalization was required for consistency. We set up a planet-level geocoding endpoint powered by OpenStreetMap data. This endpoint allows us to obtain geolocation information on a global scale. The endpoint was backed by a search index of 61 gigabytes (when compressed) on a virtual machine with 24 vCPUs and 216 gigabytes of memory. The endpoint returned state information for each [city, state] or [state, country] request.
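The normalization step can be sketched as follows. The dictionary TOY_INDEX and the function toy_geocode are hypothetical stand-ins for the geocoding endpoint, whose actual interface is not described here; only the idea of mapping both [city, state] and [state, country] strings to a single state name is taken from the text.

```python
# HYPOTHETICAL stand-in for the geocoding endpoint: a tiny lookup table.
TOY_INDEX = {
    "mumbai, maharashtra": {"state": "Maharashtra", "country": "India"},
    "maharashtra, india": {"state": "Maharashtra", "country": "India"},
}


def toy_geocode(place):
    """Return a dict with 'state' and 'country', or None if unknown."""
    return TOY_INDEX.get(place.strip().lower())


def normalize_state(full_name, geocode=toy_geocode):
    """Map a geo.full_name value to a normalized state name.

    Both "[city, state]" and "[state, country]" style inputs are resolved
    through the geocoder, so they end up with the same state label.
    """
    hit = geocode(full_name)
    return hit["state"] if hit else None


if __name__ == "__main__":
    for name in ("Mumbai, Maharashtra", "Maharashtra, India"):
        print(name, "->", normalize_state(name))
```

Passing the geocoder as a parameter keeps the normalization logic independent of whichever geocoding service is actually used.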
Topic cluster and sentiment analysis
We utilized BERTopic [9], leveraging contextual embeddings derived from sentence transformers [33] to cluster topics through a class-based term frequency-inverse document frequency approach [9]. We experimented with multiple minimum cluster sizes between 10 and 1000. At a minimum cluster size of 1000, we observed that the cosine similarities between the embeddings of the generated topics were minimal. Increasing the minimum cluster size further would reduce the number of topics; therefore, based on the quality of the generated topics and the similarity matrices (such as the one shown in Fig. 4) across different settings, we set the minimum cluster size to 1000. Algorithm 1 details our approach for topic generation using BERTopic.
For sentiment analysis, we used BERTsent [24], a RoBERTa-based sentiment classifier fine-tuned for English-language tweets and trained on the SemEval 2017 corpus. After running BERTsent over GeoCovaxTweets, each tweet was classified as negative, positive, or neutral.
Algorithm 1 BERTopic Topic Clustering
4: Tokenize d_i to obtain Tokenized d_i.
5: Convert Tokenized d_i to lowercase to obtain Lowercased d_i.
6: Remove URLs, numbers, and special characters from Lowercased d_i to obtain Cleaned d_i.
10: Apply BERT to Cleaned d_i to obtain the document embedding Embedding d_i.
13: Use the HDBSCAN clustering algorithm on the document embeddings {Embedding d_1, Embedding d_2, …, Embedding d_n} to group similar documents into clusters.
14: Assign each document to a cluster, resulting in Cluster_Assignment_i.
17: Group all documents belonging to the cluster as a single document.
19: Compute the TF-IDF score tf_idf(w, d_i) = tf(w, d_i) × idf(w), where:
20: tf(w, d_i) is the number of times the word w appears in document d_i (Term Frequency).
21: idf(w) is the Inverse Document Frequency of the word w across all documents.
23: Select the top words with the highest TF-IDF scores as representative words for the cluster.
26: Represent each cluster with its assigned documents to form topics.
27: Store the topics in a dictionary called Topics, where each key-value pair represents a topic: {Topic1: [d_1, d_5, d_7], Topic2: [d_2, d_3, d_6], …}.
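The cleaning and TF-IDF steps of Algorithm 1 can be sketched in plain Python as follows. This is a simplified illustration, not the BERTopic implementation: the cluster assignments are taken as given (standing in for the embedding and HDBSCAN steps), and the smoothed form idf(w) = log(1 + N / df(w)), with N the number of clusters, is an assumption chosen for this example.

```python
import math
import re
from collections import Counter, defaultdict


def clean(text):
    """Lowercase, strip URLs, numbers and special characters, then tokenize."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)
    text = re.sub(r"[^a-z\s]", " ", text)
    return text.split()


def top_words_per_cluster(docs, assignments, k=3):
    """Merge each cluster into one document and rank its words by tf * idf."""
    merged = defaultdict(Counter)
    for doc, cluster in zip(docs, assignments):
        merged[cluster].update(clean(doc))

    n_clusters = len(merged)
    df = Counter()  # number of merged cluster documents containing each word
    for counts in merged.values():
        df.update(counts.keys())

    topics = {}
    for cluster, counts in merged.items():
        # ASSUMPTION: smoothed idf; the source does not give the exact form.
        scores = {w: tf * math.log(1 + n_clusters / df[w])
                  for w, tf in counts.items()}
        # Ties are broken alphabetically to keep the output deterministic.
        topics[cluster] = sorted(scores, key=lambda w: (-scores[w], w))[:k]
    return topics


if __name__ == "__main__":
    docs = [
        "Booked my vaccine appointment https://t.co/x",
        "vaccine appointment slots open",
        "vaccine mandate protest",
        "mandate protest downtown vaccine",
    ]
    print(top_words_per_cluster(docs, [0, 0, 1, 1], k=2))
```

Words that occur in every cluster, such as "vaccine" in the toy input, receive a lower idf weight and therefore rank below cluster-specific words of equal frequency.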
Discussions
Data description
The GeoCovaxTweets dataset has a total of 1,818,253 geotagged tweets created by 464.9k users between January 01, 2020, and November 25, 2022. Figure 2 presents the daily distribution of the tweets and the cumulative number of people vaccinated per hundred globally. The source of the vaccination data is Our World in Data. Out of 1.8 million tweets, 899,592 (49.5%) are from the United States, 263,765 (14.5%) from the United Kingdom, 160,913 (8.8%) from India, 136,389 (7.5%) from Canada, and 65,951 (3.6%) from Australia, as listed in Table 2. The top 6 countries contributing the highest number of English-language tweets in GeoCovaxTweets are the United States, the United Kingdom, India, Canada, Australia, and South Africa. Consequently, we conducted a detailed study focusing on these countries. Figure 3 presents the daily distribution of tweets in these 6 countries, along with the corresponding cumulative vaccination per hundred data. Figure 3 shows that the number of daily tweets related to vaccines and vaccination was lower before the vaccination campaigns commenced. In each country, the number of tweets began to increase a few weeks prior to the official vaccination launch date, and once the vaccination process commenced, there was a significant surge in tweets across all countries.

The daily global distribution of the tweets and the cumulative number of people vaccinated per hundred.
The top 15 countries, states, and tweet sources in GeoCovaxTweets. Note: *
OpenStreetMap data returned

The daily distributions of tweets in the top 6 countries in the discourse along with their respective cumulative vaccination data. The non-uniformity in the vaccination data for some countries is due to irregular updates.
Regarding the statewide distribution, England, California, New York, Ontario, and Texas contribute the highest number of tweets. We also explored the different tweet sources in GeoCovaxTweets. Twitter’s native applications for iPhone, Android, and iPad are the top 3 sources, together contributing more than 1.7 million tweets. The top 5 also includes Instagram and dlvr.it, each generating more than 10k tweets. We provide the top 15 countries, states, and tweet sources with their respective frequencies in Table 2.
We explored the top n-grams in the vaccines and vaccination-related discourse. We removed noisy data such as URLs, retweet information, special characters, paragraph breaks, and stop words before extracting bigrams and trigrams. Table 3 lists the top 15 bigrams and trigrams in GeoCovaxTweets. Similarly, we extracted the 15 top-tweeted hashtags, which are as follows: #covid19, #vaccine, #getvaccinated, #vaccination, #covid, #covidvaccine, #vaccines, #vaccineswork, #coronavirus, #covid_19, #vaccinated, #pfizer, #wearamask, #covid19vaccine, and #astrazeneca.
The top 15 bigrams and trigrams in GeoCovaxTweets
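The n-gram extraction described above can be sketched as follows. The stop-word list is a small illustrative subset and the cleaning rules are assumptions made for this example; the source does not specify the exact list or patterns used.

```python
import re
from collections import Counter

# ASSUMPTION: a tiny illustrative stop-word list, not the one used in the study.
STOP_WORDS = {"the", "a", "an", "to", "of", "and", "in", "is", "my", "for", "i"}


def tokens(tweet):
    """Remove URLs, retweet markers, mentions, special characters and stop words."""
    text = tweet.lower()
    text = re.sub(r"https?://\S+", " ", text)
    text = re.sub(r"\brt\b|@\w+", " ", text)
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return [t for t in text.split() if t not in STOP_WORDS]


def top_ngrams(tweets, n, k):
    """Count n-grams within each tweet and return the k most frequent."""
    counts = Counter()
    for tweet in tweets:
        toks = tokens(tweet)
        counts.update(zip(*(toks[i:] for i in range(n))))
    return counts.most_common(k)


if __name__ == "__main__":
    sample = [
        "RT @user: Got my second dose today! https://t.co/abc",
        "Second dose done, feeling fine",
        "Booked the second dose for Friday",
    ]
    print(top_ngrams(sample, n=2, k=1))
```

N-grams are counted per tweet, so a bigram never spans the boundary between two different tweets.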
We identified 142 topics, each with at least 1000 tweets, and listed them in Table 4. The listed topics include clusters that discuss various contextual aspects of COVID-19 vaccines and vaccinations, such as vaccine mandates (ethics, politics, personal autonomy), COVID-19 vaccination (effectiveness, side effects, public perception), vaccine distribution (challenges, disparities, global impact), vaccine misinformation (campaigns, conspiracy theories, social media), vaccine development (progress, challenges, breakthrough), COVID-19 variants (vaccination strategies, effectiveness, global response), vaccine passports (implementation, privacy concerns, public acceptance), vaccination for children (efficacy, safety, ethical considerations), COVID-19 and public health (lockdowns, testing, vaccine policies), and vaccine cost and affordability (global access, equity, and financing).
Most discussed topic clusters in GeoCovaxTweets. Topics are sorted based on the volume of tweets
Certain topics are highly similar to one another. To explore this relationship, we computed the cosine similarity between topic embeddings and generated a similarity matrix (Fig. 4). We found dense-colored blocks in the similarity matrix for topics 0-9, indicating a strong similarity among these clusters. These topics include “Vaccines and Public Perception”, “COVID-19 Vaccination Experiences”, “Modi and COVID-19 Vaccination”, “Vaccine Mandates”, “Vaccine and Mask Hesitancy”, “Vaccinated vs. Unvaccinated”, “Canadian vaccine distribution”, “Vaccines: Cost and Affordability”, “Education during the pandemic”, “Vaccine Experience”, and “COVID vaccine skepticism”. Conversely, topics 77-141 demonstrate a different pattern. The similarity scores within this range are <0.4, indicating minimal similarity among the topic clusters (as depicted by light-coloured blocks in the similarity matrix). However, a few pairs show a strong relationship despite the lower overall similarity scores. Notable pairs of strongly related topics include “Preparing for vaccine side effects” and “Pfizer Vaccine Side Effects”, “Vaccine Misinformation Campaigns” and “Anti-vaccine Activism”, “Vaccine passport debate” and “vaccine passports for international travel”, and “Appointment booking & availability” and “COVID-19 Vaccination Decision: Taking the Shot or Not?”.
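For reference, the similarity matrix can be computed from topic embeddings as in the minimal sketch below. The two-dimensional vectors are toy values chosen for illustration, and the function names are hypothetical; real topic embeddings have many more dimensions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_matrix(embeddings):
    """Pairwise cosine similarities between all topic embeddings."""
    return [[cosine(u, v) for v in embeddings] for u in embeddings]


def related_pairs(embeddings, threshold):
    """Topic index pairs (i < j) whose similarity reaches the threshold."""
    sim = similarity_matrix(embeddings)
    n = len(embeddings)
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if sim[i][j] >= threshold]


if __name__ == "__main__":
    toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]  # toy values
    print(related_pairs(toy_embeddings, threshold=0.4))
```

With the 0.4 threshold mentioned above, only the first two toy vectors form a related pair, since the third is orthogonal to the first and nearly orthogonal to the second.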

Cosine similarity matrix illustrating the relationship between generated topic clusters.
Next, we examined the topic distribution in the top 6 countries, depicted in Fig. 5. Each graph represents tweet distribution per topic per country. Notably, the United States had a presence in 81 topic clusters, followed by the United Kingdom (37), Canada (15), India (14), Australia (12), and South Africa (5). We also argue that a minimum topic size of 1000 is significant [26] in capturing topics as <1% of tweets are geotagged [21, 22, 32]. A comparative analysis in [24] showed that full-volume tweets and geotagged tweets exhibit similar volumetric patterns, implying that the identified topics are near-true representations of global events concerning COVID-19 vaccines and vaccination.

The distributions of tweets across 142 topics over the top 6 countries in the discourse.
Further, our topical analysis reveals the presence of multiple (contextually) regional topic clusters that were discussed worldwide; such topic clusters include “Modi and COVID-19 vaccination”, “Canadian vaccine distribution”, “Trump Vaccine Endorsement”, and “Biden’s Vaccine Mandate”. For instance, the discussion surrounding “Modi and COVID-19 vaccination” (topic 2) gained attention worldwide and is the only (contextually) regional topic cluster (centred around India) that is present in the top 5 topics discussed worldwide. Although this topic is in the Indian context, the global interest may be due to the influence of the Indian diaspora spread worldwide. As a result, discussions related to COVID-19 vaccines and vaccination efforts gain traction and visibility across global communities, consequently expanding the topic’s reach beyond regional boundaries [14].
The discussion on COVID-19 vaccination before and after the vaccination rollout, together with the rise in confirmed cases and deaths between 2020 and 2022, affected people’s physical and mental health [8]. Studies related to the pandemic and its sentiments have reported a rise in positive and negative sentiments [13, 19] in the early phases of the pandemic. Following up on these earlier studies, we used the GeoCovaxTweets dataset to explore the early 2020 to late 2022 timeline of the COVID-19 pandemic through topic-involved sentiment analysis, both globally and across the top 10 topics in the discourse. Results show that neutral sentiments dominated the discourse globally (shown in Fig. 6).

The class-wise global daily distribution of the sentiments of the tweets.
Figure 7 explores the distributions of tweets across the top 10 topics discussed between January 2020 and November 2022. We only considered global topics for the sentiment analysis while ignoring regional topic clusters like “Modi and COVID-19 vaccination” and “Canadian vaccine distribution”. The top 10 topics include “Vaccines and Public Perception”, “COVID-19 vaccination experiences”, “Vaccine Mandates”, “Vaccine and mask hesitancy”, “Vaccinated vs unvaccinated”, “Vaccines: Cost and Affordability”, “Education during pandemic”, “Vaccine Experience”, “COVID vaccine skepticism”, and “Pfizer and Moderna”. Our volumetric analysis reveals an increase in tweets regarding vaccines and vaccination between early 2021 and late 2021 for the top 10 topics. Next, we explored (in Table 5) the overall and monthly sentiment proportions (positive/negative ratios) for that period across the 10 most discussed topics. We observed the highest average sentiment proportions for the topic “Pfizer and Moderna” (0.43), followed by “Vaccines and Public Perception” (0.40), “Vaccine Mandates” (0.36), “Education during pandemic” (0.36), “COVID vaccine skepticism” (0.36), and “COVID-19 vaccination experiences” (0.33).

The distributions of sentiments of top 10 topics between January 2020 and November 2022.
Per-month basis sentiment proportions for the top 10 topics in the discourse
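A per-month sentiment proportion of this kind can be computed as in the sketch below. It assumes that the proportion is the number of positive tweets divided by the number of negative tweets for a given topic and month; this reading of “positive/negative ratios” is an interpretation, and the record layout and function name are hypothetical.

```python
from collections import Counter, defaultdict


def monthly_sentiment_proportion(records):
    """Compute positive/negative ratios per (topic, month).

    records: iterable of (month, topic, label) tuples, where label is
    'positive', 'negative' or 'neutral'.
    ASSUMPTION: proportion = positive count / negative count. Groups with
    no negative tweets are skipped to avoid division by zero.
    """
    counts = defaultdict(Counter)
    for month, topic, label in records:
        counts[(topic, month)][label] += 1

    proportions = {}
    for key, c in counts.items():
        if c["negative"] > 0:
            proportions[key] = c["positive"] / c["negative"]
    return proportions


if __name__ == "__main__":
    toy = (
        [("2021-03", "Vaccine Mandates", "positive")] * 2
        + [("2021-03", "Vaccine Mandates", "negative")] * 4
        + [("2021-03", "Vaccine Mandates", "neutral")] * 3
    )
    print(monthly_sentiment_proportion(toy))
```

Neutral tweets do not enter the ratio under this reading, which is consistent with neutral sentiment dominating the discourse while the reported proportions remain below 1.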
Regarding “Vaccines and Public Perception,” the sentiment proportion ranges from 0.18 to 0.65. There’s an upward trend from January to March, followed by a decrease until May and a gradual increase until November. Concerning “COVID-19 vaccination experiences,” the proportion ranges from 0.13 to 0.57. There’s a dip in proportion from April to June, followed by a slight recovery in July and August. Regarding “Vaccine Mandates,” the proportion ranges from 0.16 to 0.62. There’s a peak in proportion in February, followed by a decline until May and a slight increase until November. Regarding “Vaccine and mask hesitancy,” the proportion ranges from 0.07 to 0.56. There’s a general increase in sentiment from January to March, followed by a decrease until June, and then a gradual decline until November. For “Vaccinated vs unvaccinated,” the proportion ranges from 0.03 to 0.69. There’s a significant peak in proportion in February, followed by a sharp decline until December. Regarding “Vaccines: Cost and Affordability,” the proportion ranges from 0.04 to 0.94. There’s a significant increase in proportion in March, followed by a decline until May, and then a relatively stable proportion until December. This topic exhibits a high variability range. Regarding “Education during the pandemic,” the proportion ranges from 0.09 to 0.71. There’s a peak in proportion in May, followed by a gradual decline until October, and a slight increase until December. Concerning “Vaccine Experience,” the proportion ranges from 0.07 to 0.68. There’s an initial increase in proportion from January to March, followed by a decrease until May, and then fluctuation with no clear trend until December. Regarding “COVID vaccine skepticism”, the proportion ranges from 0.14 to 0.59. There’s a peak in proportion in June, followed by a gradual decline until September, and then an increase until November. Regarding “Pfizer and Moderna,” the proportion ranges from 0.15 to 0.72. 
There’s a significant increase in proportion in March, followed by a decline until May, and then relatively stable sentiment until November.
In summary, the sentiment analysis shows that the sentiment proportion fluctuated throughout the year, suggesting that public perception and sentiment regarding these topics were dynamic and subject to change. Some topics exhibit consistent positive sentiment (e.g., “Pfizer and Moderna”, “Vaccines and Public Perception”), while others display more variability (e.g., “vaccinated vs unvaccinated”). Interestingly, it is observed that in the month of March, the sentiment proportions across multiple topics are the highest, indicating the presence of positive public perception during the early phase of vaccine distribution. Conversely, the discourse surrounding vaccines and vaccination during late 2021 carried more negative sentiments.
Next, we explored the top 10 topic-sentiment frequencies for the top 6 countries in the discourse. The results are summarized in Table 6. The frequency of topic-sentiment patterns differs among countries, indicating variations in public sentiment or discourse intensity. For instance, the United States generally exhibits higher frequencies than other countries, suggesting a potentially more active or extensive discourse on the given topics. Several sentiment patterns consistently emerge across multiple countries. Topics such as “Vaccines and Public Perception”, “COVID-19 vaccination experience”, “Vaccine Mandates”, “Vaccine and mask hesitancy”, “Vaccinated vs unvaccinated”, “Vaccines: Cost and Affordability”, “Vaccine Experience” tend to have neutral sentiment as the dominant frequency across all countries, indicating some commonality in the sentiment distribution for these topics globally. While neutral sentiment prevails, there are instances of negative sentiment in the dataset. Negative sentiment may arise from criticism, dissatisfaction, or unfavourable opinions associated with specific topics. Although less frequent than neutral sentiment, these instances provide insights into the presence of dissent or negative sentiment within the discussions. The differences in frequencies and sentiments across countries may reflect cultural, regional, or contextual variations. Public sentiment is influenced by various factors such as cultural norms, political climate, socioeconomic conditions, and media influence. Therefore, the patterns observed in the table can potentially reflect these contextual differences.
Top 10 topic-sentiment frequencies for the top 6 countries in the discourse
The GeoCovaxTweets dataset will be made publicly available from Harvard Dataverse after peer review. We release the following tweet objects in the dataset: id, created_at, author.id, author.location, source, entities.hashtags, geo.country_code, author.public_metrics.followers_count, and state. Twitter’s data re-distribution policy restricts the sharing of raw Twitter data with third parties. Therefore, tweet identifiers (i.e., the id tweet object) need to be hydrated to re-create the dataset locally. Hydration of tweet identifiers refers to the process of fetching raw tweet data using Twitter’s tweet lookup endpoint. The lookup endpoint has a limit of 900 requests per 15-minute window; therefore, GeoCovaxTweets can be re-created locally within a few hours.
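The “few hours” estimate can be checked with simple arithmetic. The sketch below assumes that each lookup request carries up to 100 tweet identifiers; this batch size is not stated above and is treated as an assumption, while the tweet count and the rate limit are taken from the text.

```python
import math

TOTAL_IDS = 1_818_253        # tweets in GeoCovaxTweets (from the text)
REQUESTS_PER_WINDOW = 900    # lookup limit per window (from the text)
WINDOW_MINUTES = 15
IDS_PER_REQUEST = 100        # ASSUMPTION: identifiers per lookup request


def hydration_estimate(total_ids, ids_per_request=IDS_PER_REQUEST):
    """Return (requests, rate-limit windows, upper-bound hours)."""
    requests = math.ceil(total_ids / ids_per_request)
    windows = math.ceil(requests / REQUESTS_PER_WINDOW)
    hours = windows * WINDOW_MINUTES / 60
    return requests, windows, hours


if __name__ == "__main__":
    print(hydration_estimate(TOTAL_IDS))  # (18183, 21, 5.25)
```

Under these assumptions, hydrating the full dataset takes about 21 rate-limit windows, i.e., roughly five hours, and proportionally less for a subset.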
Tools such as Hydrator (desktop application) and twarc (Python library) can be used to hydrate the tweet identifiers present in GeoCovaxTweets. In both use cases—hydrating all the tweet identifiers in GeoCovaxTweets or a subset—tweet identifiers should be in a file (e.g., TXT/CSV) with each identifier on a different line without any header and quotes. With the Hydrator app, we need to link a Twitter account for authorization and load the file with tweet identifiers for starting hydration. In the case of twarc, we can use the following command with valid Twitter API credentials:
The above command hydrates the tweet identifiers in
Dataset usages
GeoCovaxTweets covers a comprehensive conversational dynamics of support, criticism, and hesitance towards the COVID-19 vaccines and vaccination on Twitter. There are several applications and potential usages of GeoCovaxTweets. We discuss some of them below:
First, Twitter demographics are biased toward the younger generation and tech-aware users. Second, the dataset only includes English-language COVID-19 vaccine discourse missing the opinion of minority dialects and multilingual speakers [16]. Third, geotagged tweets are required to extract situational awareness as they involve spatial analyses; however, today, less than 1% of tweets are geotagged [21]. Hence, our dataset may not fully reflect the Twitter users’ conversations around the COVID-19 vaccines and vaccination.
Disclaimers
This dataset should be used only for non-commercial purposes while strictly adhering to Twitter’s policies. Note that the number of tweets after hydration can be less than reported, as tweets can be deleted or made private. GeoCovaxTweets has not been scrutinized for misinformative and propaganda tweets; the dataset contains all the tweets returned by Twitter’s Full-archive search endpoint for the applied query and conditions. Any location information present in the dataset and the geoinformation generated after hydration should be used while maintaining the privacy of the individuals.
Conclusion
This paper introduces the GeoCovaxTweets dataset, a comprehensive collection of geotagged tweets worldwide focused on COVID-19 vaccines and vaccination. The dataset consists of over 1.8 million English tweets from 464.9k users across 233 countries and territories, spanning from January 01, 2020, to November 25, 2022. It captures diverse conversations on COVID-19 vaccines across multiple thematic areas.
Using topic clustering, we identified 142 topics with a minimum of 1000 tweets each. These topics cover various subjects, including vaccines and public perception, COVID-19 vaccination experiences, and vaccine mandates. Analyzing topic volumes in the top 6 represented countries, we observed the United States as the leader in topic discussions, followed by the United Kingdom, Canada, India, Australia, and South Africa. This highlights the need to address vaccine hesitancy, equity, access, and mandates globally, given the complex and diverse viewpoints surrounding COVID-19 vaccines. Sentiment analysis revealed a predominance of neutral sentiments in the global discourse on COVID-19 vaccination, with positive and negative sentiments also present. Topics such as “Vaccines and Public Perception” and “COVID-19 Vaccination Experiences” displayed a higher occurrence of positive, negative, and neutral sentiments compared to others. Overall, the GeoCovaxTweets dataset offers valuable insights for crisis computing researchers, enabling exploration of the conversational dynamics of COVID-19 vaccines across various spatial and temporal dimensions. The dataset’s extensive coverage and geotagged information facilitate trend analysis, opinion shifts, identification of misinformation, and monitoring of anti-vaccination campaigns.
