Abstract
Social media has become a powerful conduit for misinformation during major public events. As a result, an extensive body of research has emerged on misinformation and its diffusion. However, this research is fragmented and has mainly focused on understanding the content of misinformation messages. Little attention has been paid to the production and consumption of misinformation. This study presents the results of a detailed comparative analysis of the production, consumption, and diffusion of misinformation and authentic information. Our findings, based on extensive computational linguistic analyses of COVID-19 pandemic-related messages on the Twitter platform, reveal that misinformation and authentic information exhibit very different characteristics in terms of their content, production, diffusion, and ultimate consumption. To support our study, we carefully selected from the Twitter platform a sample of 500 widely propagated messages about pandemic-related topics that fact-checking websites had confirmed as misinformation or authentic information. Detailed computational linguistic analyses were performed on these messages and their replies (N = 198,750). Additionally, we analyzed approximately 1.2 million Twitter user accounts responsible for producing, forwarding, or replying to these messages. Our findings were used to develop and propose a theoretical framework for understanding the diffusion of misinformation on social media. Our study offers insights for social media platforms, researchers, policymakers, and online information consumers about how misinformation spreads over social media platforms.
Keywords
Introduction
During the past few decades, advances in communication and networking technologies have paved the way for unprecedented, large-scale global information sharing. Social networking platforms such as Twitter, Facebook, and Instagram have played a vital role in enabling this information sharing. With the expansive reach and extensive information-sharing capabilities that these platforms provide, anyone using them can share information far and wide, irrespective of the veracity of the information shared. The prevalence and ubiquity of these platforms have raised concerns regarding the nature and accuracy of the information being shared on them. Inaccurate, false, misleading, or malicious information shared widely has the power to drastically alter the psyche, actions, and behaviors of the users exposed to such misinformation (Allcott and Gentzkow, 2017; Allcott et al., 2019; Jensen and Potts, 2004; Rains et al., 2022; Volkova et al., 2017).
Social networking platforms have become a preferred medium for spreading misinformation on the Internet (Allcott et al., 2019; Bode and Vraga, 2015), and recent instances of misinformation spreading through social media have demonstrated its potentially harmful impacts on societies worldwide, demanding a comprehensive examination and understanding of this phenomenon (Freiling et al., 2023). Stakeholders from various domains increasingly recognize the importance of understanding misinformation on social media, its nature, manner of spread, and its impact on the consumer (i.e., someone who reads misinformation) (Escolà-Gascón et al., 2023). Consequently, it has also been a focus of intense research inquiry lately in the information systems (IS) domain (e.g., Kim and Dennis, 2019; Kim et al., 2019; Moravec et al., 2019).
Misinformation refers to the false stories circulated on social media that mislead consumers or society (Allcott and Gentzkow, 2017; Jang and Kim, 2018). Current research shows that misinformation is widely shared on social media, even more frequently than real news (Silverman, 2016, p. 201; Vosoughi et al., 2018). This results in a swift, large-scale propagation of misinformation across a vast consumer base. Research also shows that such a proliferation of misinformation on social media impacts socio-political sentiments and has economic ramifications (Allcott et al., 2019). For example, misinformation gained prominence during the 2016 U.S. presidential election, when its role in influencing the results was suggested (Allcott et al., 2019). Similarly, false rumors about Barack Obama’s injury in an explosion resulted in a loss of about $130 billion in stock value (Rapoza, 2018). Unfortunately, the circulation of such misinformation in the media is expected to continue its upward trend (Pan et al., 2017), undermining people’s trust in serious journalism.
Given the predominance of misinformation on social media and its social, political, and economic implications, researchers have been motivated to conduct in-depth examinations to design appropriate interventions to counter its spread (Khan et al., 2022). These research initiatives have revolved around various aspects of misinformation from its generation to its consumption and impact. These include the producers of misinformation, the content of messages, the consumers who respond to them, the resulting diffusion patterns, and finally, their societal impacts. Previous studies have examined the effects of the digital persona of misinformation producers (Ong and Cabañes, 2018), the source ratings (Kim et al., 2019), and the presentation format (Kim and Dennis, 2019), on the diffusion of false information. Researchers have also compared the spread of misinformation with true news (Vosoughi et al., 2018). Some researchers contend that misinformation travels faster, farther, deeper, and more broadly compared to true news (Kalichman et al., 2022; Rapoza, 2018). Others report that real news shows a wider breadth (Jang and Kim, 2018). This existing body of research in the IS domain provides a valuable snapshot of the proliferation of misinformation on social media (Kim et al., 2019; Kim and Dennis, 2019), albeit piecemeal.
While past studies have often examined individual components of the misinformation ecosystem, such as messages, consumers, or producers, in isolation (see Table 1), such a piecemeal approach may provide only a partial picture of their impact on the spread of misinformation or information. During the process of misinformation/information diffusion, a continuous interplay unfolds among producers, messages, and consumers. For example, misinformation often spreads through a network, and by studying the dynamics between producers and consumers, we can gain a better understanding of their behavior, their motivations, and the network effects that contribute to the spread of information. Moreover, not all components have received attention in the literature. For instance, while previous studies have examined sources of misinformation, the role of producers’ or consumers’ personas in misinformation diffusion has received little attention. Therefore, a more thorough and holistic understanding of the entire misinformation ecosystem is needed. To address this gap, the current study adopts an integrative approach based on Ackoff’s systems thinking (Ackoff, 1971): we consider the interplay of multiple elements, such as producers, consumers, and messages, to analyze the dissemination of misinformation. We develop an integrative framework to study the system as a whole rather than focusing solely on individual elements. This approach can enhance our understanding of the misinformation ecosystem on social media.
Table 1. Literature related to drivers of misinformation/information origination, transmission, and diffusion.
Additionally, consumers engage with and spread information that resonates with them or exploits emotional responses. By understanding the relationship between message content and producers and consumers of the information, we can identify patterns of amplification of specific messages. Therefore, we further incorporate the neuroscience perspective (Bechara, 2004) to understand the role of emotions in different aspects of misinformation diffusion. Specifically, we use the neuroanatomical framework of decision-making (Bechara, 2004) to explain how emotions impact the spread of misinformation. Consumers tend to interact with messages that are emotionally charged. These messages can influence decision-making and behavioral responses through memory encoding and recall. By assessing the emotional elements of original messages and immediate replies, we seek to understand the role of emotions in the transmission and diffusion of misinformation.
Therefore, the current study advances the literature by proposing a comprehensive framework for misinformation origination, transmission, and diffusion. Drawing upon existing literature and relevant theories from the communication, persuasion, online content generation motivation, systems theory, and neuroscience domains, we first identify critical factors associated with information and misinformation diffusion on social media. Previous studies have reported producer-, consumer-, and message-related factors as important in misinformation diffusion (Kim et al., 2019; Silverman and Singer-Vine, 2016; Wu and Liu, 2018). Further, through exploratory analysis of a multi-level data set from Twitter, we provide quantitative and qualitative insights to understand the differential anatomy of misinformation and information spread on social media. Lastly, a framework of misinformation diffusion is proposed, and specific propositions are offered for future researchers. Thus, the two main research objectives (RO) of this study are:
For this study, data were collected from Twitter, which is one of the most popular micro-blogging and social networking services. It is also one of the most used sites for the dissemination of ideas, messages, and news and has grown rapidly in popularity in recent years. ‘It is often observed that news stories are first broken in Twitter space and then the electronic and print media take them up’ (Jain et al., 2016). At the same time, the lack of moderation and vetting of information, ease of use, and the competitive race to post news items on Twitter have resulted in problems of credibility and misinformation (S. Zhao et al., 2023). This makes Twitter an apt source of data for researchers examining misinformation (Aswani et al., 2019; Rosenberg et al., 2020; Shin et al., 2018). This study aims to highlight the complexities of misinformation dissemination on social networking platforms and offer valuable propositions for future research.
Theoretical foundation and relevant literature
Misinformation on social media
Although several definitions of misinformation exist, most of them include two key elements, namely, authenticity and intent. That is, misinformation is verifiably false and is created to deceive and mislead (Shu et al., 2017). In this study, misinformation refers to false information spread intentionally to deceive for ideological, political, or financial gains (Allcott and Gentzkow, 2017; Jang and Kim, 2018; Subramanian, 2017).
Businesses have long used misinformation, such as misleading advertisements (He et al., 2023) and greenwashing, for profit motives (Lyon and Montgomery, 2015). However, the spreading of misinformation for ideological and political gains has become more pronounced recently. Previous research attests to the existence of websites that regularly create and promote misinformation (C. Shao et al., 2017). On social media such as Twitter, misinformation is created by ordinary social media users (Jang and Kim, 2018) with no journalistic training and minimal concern for the authenticity of the news (Kim et al., 2019). When displayed as news, misinformation can profoundly impact receivers’ attitudes, emotions, and behaviors. Current research also shows that misinformation is more widely and frequently shared on social media than authentic information (Silverman, 2016; Vosoughi et al., 2018), with a significant impact on social sentiment and behavior. Numerous instances of widespread proliferation of misinformation with considerable socio-political fallout have come to light in recent years.
Public response to the COVID-19 crisis has also been influenced by misinformation (Krause et al., 2020; Mian and Khan, 2020). For example, following the widespread misinformation regarding the potential effectiveness of chloroquine against COVID-19, there have been cases of individuals overdosing on the drug (Mukhtar, 2021). During the initial stages of COVID-19 vaccination, a prevalent myth emerged that falsely claimed Bill Gates intended to utilize vaccines as a means to implant microchips in individuals for surveillance purposes. This misinformation led to vaccine hesitancy among the public (Muhammed and Mathew, 2022). Furthermore, studies have shown that individuals who endorse misinformation are less likely to adhere to public health guidelines, which poses a challenge to public health efforts in controlling epidemics/pandemics (Mukhtar, 2021).
The rapid proliferation of misinformation is largely attributed to advances in information and communications technology (ICT) and social media (Allcott and Gentzkow, 2017; Lazer et al., 2018). According to a report by Pew Research Center, as of 2018, 68% of Americans were occasionally accessing news on social media, and around 20% of Americans relied on social media as their primary source of news (Matsa and Shearer, 2018). Users’ access to and reliance on social media as a source of news has created new challenges (Allcott and Gentzkow, 2017). Digital social media platforms such as Facebook, Twitter, and Snapchat enable rapid creation, dissemination, and exchange of information and misinformation among large communities of users. Given that human beings are not very good at detecting deception (Rubin, 2010; Twitchell et al., 2004), discerning misinformation from information on social media is challenging for most users. Automated detection of deceptive messages on social media can outperform human judgment. Additionally, users’ exposure to social media information is typically constrained by algorithms based on their social networks, preferences, and prior online behaviors (Bakshy et al., 2015). The algorithms controlling the flow of information tend to foster ‘ideological polarization’ and ‘echo chambers’ when like-minded users readily accept misinformation aligned with their perspectives as accurate and share it quickly with others (Silverman and Singer-Vine, 2016; Spohr, 2017).
As misinformation passes from the producer to the consumer on social media, various factors play a role in dispersing and impacting users’ ideologies, beliefs, and responses (Jenke, 2023). It is crucial to consider all these factors as a system in order to comprehend how they interact and impact the polarization and diffusion of information. However, as presented in Table 1, many studies have investigated the proliferation of misinformation considering only one or a few factors, such as consumer characteristics (Buchanan, 2020), message characteristics (Kim and Dennis, 2019), and producer/source characteristics (Kim et al., 2019). Drawing upon Ackoff’s systems thinking approach, we assert that, due to the intricate interplay of multiple elements like producers, consumers, and messages, it is prudent to analyze the dissemination of misinformation from the system perspective for a comprehensive understanding of this phenomenon. We also bring in the neuroscience perspective to explain how emotional factors play a significant role in the dissemination of misinformation. Emotionally charged messages tend to capture attention and have a stronger impact on memory encoding, recall, individual decision-making, and behavioral responses, regardless of their veracity. False information that elicits strong emotions can be more easily remembered and shared, contributing to its spread.
Therefore, we apply the system (Ackoff, 1971) and neuroscience (Bechara, 2004) perspectives to understand the spread of information and misinformation on social media. Specifically, we integrate these perspectives to understand producer and consumer personas, message characteristics, and the role of emotion in consumers’ responses to the message and message diffusion (Auger, 2014; H. Han et al., 2019; Katz et al., 1973; Leung, 2013; McCornack, 1992; Osmundsen et al., 2020; Otterbacher, 2011; Ramage et al., 2015; G. Shao, 2009; Xun and Reynolds, 2010).
In the next section, we discuss the neuroscience perspective and system perspective in detail.
Role of emotion: Neuroscience perspective
In the information management and economics literature, it is often assumed that decisions are derived through cost and benefit analysis by a rational agent to maximize utility (Kahneman and Tversky, 2013). However, studies have shown that emotions play a crucial role in our decision-making abilities. Evidence from neuroscience research shows that people who cannot process emotions (due to damage to the prefrontal cortex) also show impairment of the decision-making process. This lack of ability to process emotional information can compromise the quality of decisions in their everyday life (Bechara, 2004). The somatic marker hypothesis (SMH) suggests that signals based on emotional processes are integrated into the brain. These signals control behavior and decision-making, especially in complex situations. When making decisions, emotional processing serves as an indicator of the value of each option presented. The emotional signals trigger attention, activate memory, and guide individual behavior in response to a stimulus event through an evaluation of rewards and punishments associated with each choice (A. R. Damasio, 1994, 1996). The role of emotional signals in decision-making increases when it is difficult to interpret a stimulus (Rogers et al., 1999). Decision-making under ambiguity and uncertainty of outcomes activates an emotional state that enables a person’s reaction. In the next section, we discuss how a written description of a message acts as a stimulus event and plays a role in consumers’ response to the message.
Critical factors in misinformation and information diffusion on social media – A system perspective
A system is defined as ‘a set of interrelated elements’ (Ackoff, 1971). The system approach focuses on studying the system as a whole. Ackoff (1971) further elaborated the system approach as an approach that is ‘concerned with total-system performance even when a change in only one or a few of its parts is contemplated because there are some properties of systems that can only be treated adequately from a holistic point of view. These properties derive from the relationships between parts of systems: how the parts interact and fit together’ (p. 661). Three key components, together with their interrelationships, contribute to the diffusion of a message: the message producer, the message, and the message consumers. Therefore, a reductionist approach investigating these components and their influence in isolation may fail to capture the entire system effect. First, we use the system approach to investigate the role of producer persona, consumer persona, and message characteristics in the diffusion of information and misinformation and propose a conceptual framework. Second, we analyze our data to understand how and why misinformation diffusion differs from the diffusion of information and present a research framework with testable propositions.
Role of producer and consumer digital personas
On social media, a user’s role is often blurred. Users often play the roles of producer and consumer simultaneously. For example, on social media sites such as Twitter, a user may read a misinformation or information article from another source as a consumer and then post a tweet about it as a producer (Whitehead et al., 2023). Those who read this tweet are consumers. On reading a tweet, a consumer may forward it or comment on it in the consumer role, or start a new tweet, assuming the role of producer. Due to this overlapping nature of the roles, it is not easy to separate the two. In this study, we denote the original messages that carry misinformation or authentic information as root messages or, on the Twitter platform specifically, root tweets. This study distinguishes between the two roles based on each root tweet examined and its cascade of information. We define a producer as the person who is credited with creating and posting the original tweet on social media to which others have responded. A consumer, on the other hand, is defined more narrowly than someone who merely receives the message: the consumer is an active participant in the life cycle of a root message who reads the original tweet and likes it, comments on it, revises it, or shares it. Social bots are not regarded as consumers (Vosoughi et al., 2018).
Several personal factors influence users’ acceptance of and engagement with information or misinformation on social media. For example, users typically spend time on social media platforms with a hedonistic mindset (e.g., for pleasure or entertainment), as opposed to a utilitarian mindset (for facts or useful information) (Hirschman and Holbrook, 1982; Z. Zhou et al., 2011). In a hedonistic mindset, users are less interested in fact-checking and are inclined to believe the information that aligns with their interests and pre-existing beliefs, reflecting confirmation bias (Hirschman and Holbrook, 1982; Nickerson, 1998).
Research suggests that the characteristics of producers or consumers of a message can have a significant influence on its spread and overall impact (Jang and Kim, 2018; Vosoughi et al., 2018). In the case of producers, the underlying motivations for content generation and the persuasion strategies used by the producer would impact message content, consumer response, and misinformation diffusion (Y. Han, 2021). Consumer response to a message (information or misinformation) is influenced by the messenger (source) they encounter (Paletz et al., 2019). In other words, the consumer persona is influenced by the producer persona. Similarly, consumers’ emotions (neuroscience perspective) and motivations would also impact diffusion. Other factors, such as misinformation producers’ and consumers’ partisanship levels and their profile characteristics (such as the number of followers, follower-followee ratio, online status, account age, whether an account is verified or not, account language, geolocation of the account, engagement and temporal patterns of an account, and so on), may also impact the diffusion of misinformation on social media (Aggarwal et al., 2012; Castillo et al., 2011; Chu et al., 2012; Murthy et al., 2016).
Han (2021) examined the mutual impact of the characteristics of misinformation producers, consumers, and messages on Twitter during the COVID-19 pandemic. Chu et al. (2012) analyzed the patterns that distinguish humans, bots, and cyborgs on Twitter based on their profile characteristics, their engagement behavior, and the content they post on social media. Another study used features of Twitter users like account age, number of tweets, and the follower-followee ratio to detect phishing on Twitter (Aggarwal et al., 2012). Further, Castillo et al. (2011) assessed the credibility of tweets considering the features of the news creator and consumers, including ‘registration age, number of followers, number of followees, and the number of tweets the user has authored in the past’ (p. 678). Various other studies show that analyzing the behavior of misinformation users (producers and consumers) on social media is among the critical aspects of detecting suspicious social accounts (J. Zhao et al., 2014) or social bots designed for distributing misinformation (Ferrara et al., 2016). User profile analysis includes characteristics such as the account language, the geographic location of an account, account age, whether an account is verified or not, number of tweets by an account, and user engagement. Murthy et al. (2016) studied the interaction of bots with human users on Twitter during the 2015 UK general election. This study utilized the engagement and temporal activities of users on Twitter to distinguish social bots (Lazer et al., 2017; Rubin, 2010; J. Zhao et al., 2014; X. Zhou and Zafarani, 2018). The following information is collected in users’ profiles on Twitter: whether an account is verified, the account age, the number of friends and followers an account has, and the user’s activeness on the platform, measured by the number of messages they have posted, the number of likes they have received, and the number of accounts they have listed.
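To make the profile characteristics listed above concrete, the following minimal Python sketch shows one way such attributes could be turned into numeric features. It is an illustration only: the `UserProfile` class, its field names, and the derived `messages_per_day` feature are hypothetical and do not correspond to the Twitter API or to the feature set used in any of the cited studies.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class UserProfile:
    """Hypothetical container for the profile attributes named in the text."""
    verified: bool
    created_on: date       # account creation date
    followers: int
    friends: int           # accounts this user follows (followees)
    messages_posted: int
    likes: int
    lists: int


def profile_features(profile, as_of):
    """Derive simple numeric features from a profile as of a given date."""
    age_days = (as_of - profile.created_on).days
    return {
        "verified": int(profile.verified),
        "account_age_days": age_days,
        # Guard against division by zero for accounts that follow no one.
        "follower_followee_ratio": profile.followers / max(profile.friends, 1),
        # Illustrative activity measure; not taken from the cited studies.
        "messages_per_day": profile.messages_posted / max(age_days, 1),
        "likes": profile.likes,
        "lists": profile.lists,
    }


example = UserProfile(True, date(2020, 1, 1), 200, 100, 50, 30, 2)
features = profile_features(example, as_of=date(2020, 1, 11))
```

Ratios are used here rather than raw counts only because the literature cited above mentions the follower-followee ratio; any real analysis would need to decide which transformations are appropriate for its data.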
In summary, studies have shown that producer characteristics influence message content and the consumer persona, and that both consumers’ and producers’ characteristics contribute to the diffusion of the message (i.e., (producer characteristics, consumer characteristics) → message diffusion; producer characteristics → message characteristics; producer characteristics → consumer characteristics).
Role of message characteristics
The characteristics of a message, such as the content, style, structure, and emotional tone, affect the receiver’s response to the message (Paletz et al., 2019) and, in turn, its diffusion (Steele and Blau, 2023).
For example, previous research has indicated that the framing of a message, whether negative or positive, affects the level of persuasion it stimulates (Smith and Petty, 1996: 19). Stieglitz and Dang-Xuan (2013) focused on the negative and positive sentiments of Twitter messages. They found that ‘emotionally charged Twitter messages tend to be retweeted more often and more quickly compared to neutral ones’ (Stieglitz and Dang-Xuan, 2013: 217). Berger (2011) suggests that arousal, whether positive or negative, increases the diffusion of a message. Y. Han (2021) indicates that the emotional and sentimental properties of misinformation messages influence Twitter users’ decisions on retweeting and replying to the messages.
However, different emotions can lead to different persuasive effects. According to Panagopoulos (2010), emotions like fear, pride, and shame are highly persuasive, and deceptive messages that activate these emotions appear to stimulate extensive behavioral and affective responses. Emotions can also be characterized by the level of arousal they activate. Gross and Levenson (1995) found that high-arousal emotions such as anxiety or amusement have a greater impact on sharing behavior than emotions such as sadness or contentment. The audience’s perception of the information contained in a message can also be assessed by analyzing the linguistic characteristics and emotional content of replies.
With the rising trend of misinformation circulation on social media, users’ ability to differentiate between misinformation and information has become important. Recent research endeavors in this direction have focused on the content or message of misinformation versus real news (Horne and Adali, 2017; Kim and Dennis, 2019). For example, Kim and Dennis (2019) reported that how misinformation content is presented impacts user engagement.
The style, structure, and linguistic cues extracted from the text are vital aspects of deception detection that may differentiate misinformation from information content. Fuller et al. (2009) utilized automated text analysis for deception detection. The study identified eight linguistic-based cues for deception detection through feature selection, namely: third-person pronouns, content word diversity, exclusive terms, lexical diversity, modifiers, sentence quantity, verb quantity, and word quantity. Since then, multiple studies in various domains have focused on the linguistic patterns and content structure of deceptive messages. These linguistic features include quantity, complexity, uncertainty, subjectivity, non-immediacy, diversity, specificity, and readability. There have also been studies that analyze the content style of a message using different Natural Language Processing (NLP) techniques at the lexical, syntactic, semantic, and discourse levels.
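As a rough illustration of what quantity- and diversity-based cues look like in practice, the Python sketch below computes four of the cues named above (word quantity, sentence quantity, lexical diversity, and third-person pronouns) with simple regular expressions. This is a simplified sketch under our own assumptions: the tokenization rules, the pronoun list, and the use of a type-token ratio for lexical diversity are illustrative choices, not the operationalizations used by Fuller et al. (2009).

```python
import re

# Illustrative pronoun list; an assumption, not taken from the cited study.
THIRD_PERSON = {
    "he", "she", "it", "they", "him", "her", "them",
    "his", "hers", "its", "their", "theirs",
}


def linguistic_cues(text):
    """Compute a few simple quantity and diversity cues for a message."""
    # Lowercased alphabetic tokens, allowing one internal apostrophe.
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    # Naive sentence split on terminal punctuation.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_words = len(words)
    return {
        "word_quantity": n_words,
        "sentence_quantity": len(sentences),
        # Type-token ratio: unique words divided by total words.
        "lexical_diversity": len(set(words)) / n_words if n_words else 0.0,
        "third_person_pronouns": sum(w in THIRD_PERSON for w in words),
    }


cues = linguistic_cues("They said it works. It does not!")
```

Cues such as modifiers or verb quantity would require part-of-speech tagging and are therefore left out of this sketch.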
Another critical body of research on misinformation content revolves around the use of affective tone or emotions to persuade the audience by evoking their emotional responses and resonating with their opinions and feelings (Tan and Hsu, 2023). Social media websites such as Facebook flood people with emotional content, whether or not it is true, and deliberate use of affective tone in misinformation articles is widespread (Bakir and McStay, 2018). Misinformation is frequently emotionalized on social media. Given that people tend to be less inhibited online (Suler, 2015) and come with hedonistic intentions, misinformation with emotional content elicits greater reactions from users. Therefore, analysis and understanding of emotions (sentiment analysis) in fake messages are important to avoid potential manipulation of public sentiment via ‘empathically optimized automated fake news’ (Bakir and McStay, 2018).
Sentiment analysis of both the original message and the reactions of consumers has been used to explain the attitudes and emotions expressed in communication through social media (Bond et al., 2017; Hauch et al., 2015; Humpherys et al., 2011; Li et al., 2017; Potthast et al., 2017; Siering et al., 2016; D. Zhang et al., 2016; L. Zhou et al., 2004). For example, X. Zhang and Ghorbani (2020) exploited a variety of verbal and non-verbal behavioral features in developing models for detecting fake online reviews. They specifically included review content features, rating features, reviewer characteristics, and brand features in building online fake review detection models. In another study, Siering et al. (2016) assessed the role of linguistic and content-based cues in detecting fraudulent behavior on crowdfunding platforms. They derived different linguistic and content-based cues based on different theories as input for various fraud detection classifiers. Studies have also explicitly assessed the features of misinformation content specific to Twitter, such as the number of hashtags and mentions, to detect phishing on Twitter (Aggarwal et al., 2012; Castillo et al., 2011). In this study, message characteristics include the analysis of the emotions and sentiment expressed in the root tweets. We also include the topics of the root tweets using structural topic modeling. In summary, studies have shown that characteristics of a message influence the response by the consumers and message diffusion (i.e., message characteristics → consumer persona; message characteristics → diffusion).
Message diffusion
Diffusion refers to the spreading of a message on social media platforms. The diffusion of a message on social media can be understood by analyzing cascades of the message that follow message origin. A cascade is a tree-like information spreading pattern with an unbroken retweet chain originating from a single source (Vosoughi et al., 2018). Topologically, it is a directed tree formed by an original tweet as the root and retweets of the root tweet as its nodes. An edge from node 1 to node 2 indicates that node 1 retweeted node 2. On Twitter, a tweet can be retweeted only once by a specific user. Therefore, there is one and only one edge between two nodes, and each node represents a unique user. Figure 1 illustrates the topology of a sample cascade.
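The topology described above can be illustrated with a short Python sketch that builds a cascade from (retweeter, retweeted) user pairs and checks the stated constraints: each user retweets at most once, and every retweet traces back to the root through an unbroken chain. The function name `build_cascade` and the input format are assumptions made for illustration; they do not describe the data pipeline used in this study.

```python
def build_cascade(root, retweet_edges):
    """Build a cascade tree from (retweeter, retweeted) user pairs.

    Returns a dict mapping each user to the list of users who retweeted
    them. Raises ValueError if the edges do not form a tree rooted at
    `root`.
    """
    parent = {}
    for retweeter, retweeted in retweet_edges:
        if retweeter == root:
            raise ValueError("the root user cannot retweet within its own cascade")
        if retweeter in parent:
            # A user can retweet a given tweet only once.
            raise ValueError(f"user {retweeter!r} retweeted more than once")
        parent[retweeter] = retweeted

    children = {root: []}
    for user in parent:
        children[user] = []

    for user, source in parent.items():
        # Walk up the retweet chain to confirm that it reaches the root.
        node, seen = user, set()
        while node != root:
            if node in seen or node not in parent:
                raise ValueError(f"user {user!r} is not connected to the root")
            seen.add(node)
            node = parent[node]
        children[source].append(user)
    return children


# Users b and c retweet the root user a; user d retweets b.
tree = build_cascade("a", [("b", "a"), ("c", "a"), ("d", "b")])
```

Storing the cascade as a child list per user keeps the one-edge-per-user property explicit and makes level-by-level traversal straightforward.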

Figure 1. Network graph of a sample misinformation cascade.
A message cascade can be characterized by the number of steps (i.e., hops) the message has traveled or by the times at which it was posted. Propagation measurements for hop-based message cascades often include the size (number of users involved), depth (number of retweet hops from the original tweet, where a hop is a retweet by a new user), breadth (the maximum number of users involved in the cascade at any depth), and virality (average distance between all pairs of nodes in a cascade) of the cascade. We use cascade analysis to understand the anatomy of misinformation and information diffusion.
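The four hop-based measures defined above can be computed directly from the retweet edges of a cascade. The Python sketch below is a minimal illustration, assuming that the edges form a valid tree rooted at the original poster and are given as (retweeter, retweeted) pairs; the function name `cascade_metrics` is hypothetical. Virality is computed as the mean pairwise distance using the fact that, in a tree, each edge lies on s × (n − s) shortest paths, where s is the size of the subtree below the edge and n is the cascade size.

```python
from collections import Counter, deque


def cascade_metrics(root, retweet_edges):
    """Compute size, depth, breadth, and virality of a cascade tree.

    Assumes `retweet_edges` is an iterable of (retweeter, retweeted)
    pairs that form a tree rooted at `root`.
    """
    children = {root: []}
    parent = {}
    for retweeter, retweeted in retweet_edges:
        children.setdefault(retweeted, []).append(retweeter)
        children.setdefault(retweeter, [])
        parent[retweeter] = retweeted

    # Breadth-first traversal from the root gives each user's depth.
    depth_of = {root: 0}
    order = [root]
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for child in children[node]:
            if child not in depth_of:
                depth_of[child] = depth_of[node] + 1
                order.append(child)
                queue.append(child)

    size = len(order)
    depth = max(depth_of.values())
    breadth = max(Counter(depth_of.values()).values())

    # Sum of all pairwise distances: each edge contributes s * (size - s).
    subtree = {node: 1 for node in order}
    total_distance = 0
    for node in reversed(order):  # children are visited before parents
        if node == root:
            continue
        total_distance += subtree[node] * (size - subtree[node])
        subtree[parent[node]] += subtree[node]

    pairs = size * (size - 1) // 2
    virality = total_distance / pairs if pairs else 0.0
    return {"size": size, "depth": depth, "breadth": breadth, "virality": virality}


# Users b and c retweet the root user a; user d retweets b.
metrics = cascade_metrics("a", [("b", "a"), ("c", "a"), ("d", "b")])
# size = 4, depth = 2, breadth = 2, virality = 10 / 6
```

In this example the six pairwise distances sum to 10, so the virality is 10/6. A pure broadcast with the same four users (three direct retweets of the root) would instead have depth 1, breadth 3, and virality 1.5.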
We draw upon the theoretical perspectives described above to develop a research design aimed at examining the interrelationships among critical factors of misinformation and information diffusion as identified from a review of relevant existing literature.
Message Diffusion = f (Producer Persona, Consumer Persona, Message Characteristics)
Consumer Persona = g (Producer Persona, Message Characteristics)
Message Characteristics = h (Producer Persona)
Specifically, we integrate these perspectives to understand producers’ motivations and persuasion strategies and the role of emotion in consumers’ responses to the message and message diffusion. Table 2 shows the components involved in message diffusion on social media and their dynamics as described in the literature.
Dynamics of critical factors involved in message diffusion on social media in literature.
In the next steps, we collect data on both misinformation and information to understand the differences in their diffusion patterns.
Research design and data analyses
To conduct the exploratory analyses, we used mixed-design methods, which provide a richer understanding of a given topic and help build credible IS theories (Galliers, 1993; Mingers, 2003). A mixed design entails the use of ‘mixed methods (of data collection) and/or mixed data and/or mixed techniques’ (Walsh, 2015: 540). In this research, we adopted mixed design at the level of both data collection and data analysis (Sandelowski, 2000). Specifically, we collected both qualitative and quantitative data regarding misinformation and information on Twitter and performed multi-level (cascade-level and consumer-level) qualitative and quantitative analyses.
Literature over the last few years indicated that a complex portfolio of issues pertaining to the producer of misinformation, message, and the consumer play a key role in misinformation creation and diffusion on social media. A better understanding of these issues or ‘themes’ is necessary to develop a theoretical framework for misinformation diffusion. Given the complexity involved in uncovering these themes, neither qualitative nor quantitative data were considered sufficient in themselves (Sandelowski, 2000). Combining both qualitative and quantitative treatments allowed us to extract richer insights from the data (Walsh, 2015). Details of the data collection and data treatments are discussed in the next section.
Data selection and collection
To further explore the potential factors influencing the diffusion of authentic information and misinformation, we collected an original data set on COVID-19 pandemic from the Twitter platform (Table 3). Our data collection process is illustrated in Figure 2, which consists of three main steps.
Sample root tweets.
Some components of root tweet samples are replaced with indicators in the form of <. . .> in order to exclude personal-identifiable information in tweets.

Data collection and selection.
Step 1: Collecting fact-checking articles (manual)
We did not judge the authenticity of root tweets (i.e., determining if they are authentic information or misinformation) by ourselves. Instead, we followed prior misinformation studies (e.g., Shu et al., 2017; W. Y. Wang, 2017) to employ fact-checking websites’ judgment, which is made by professional fact-checkers of news stories. Following prior studies, we manually located articles published on at least one of the four popular fact-checking websites, including snopes.com, politifact.com, factcheck.org, and truthorfiction.com, which fact-checked COVID-related news stories. Each of these websites had their COVID-related fact-checking articles arranged into a special class or labeled with specific tags. In particular, politifact.com and factcheck.org arranged all COVID-related articles into a category named ‘coronavirus’. snopes.com labels COVID-related articles with the tag ‘COVID-19’. truthorfiction.com labels COVID-related articles with multiple tags which carry the term ‘covid’ or ‘coronavirus’, such as #covid-19, #covid-19-memes, #coronavirus, and #coronavirus-memes. We first included all fact-checking articles in these COVID-related categories or having such COVID-related tags. In addition, we also reviewed these websites frequently and included other articles that carried COVID-related keywords and any of their variants in title or main body. The keywords we used include ‘covid’, ‘coronavirus’, ‘mask’, ‘vaccine’, ‘stay-at-home’, ‘lockdown’, ‘test’, ‘cases’, ‘death rate’, ‘cure’, ‘symptom’, ‘CDC’, and ‘compensation’. Finally, we carefully reviewed all the articles included so far and only kept those which were actually discussing COVID-related news stories and performing fact-checking.
Step 2: Determining veracity of news stories (manual)
Each fact-checking article typically gives the news story it discusses a single-phrase rating on its authenticity. To simplify data analysis, we classified the news stories discussed in the fact-checking articles selected in step 1 into two veracities, authentic information and misinformation, based on the ratings they received from the fact-checking articles. Following this setting, we classified news stories receiving the following ratings as ‘authentic information’: from www.politifact.com – ‘true’, ‘mostly true’; from www.snopes.com – ‘true’, ‘correct attribution’, ‘mostly true’; from www.truthorfiction.com – ‘true’, ‘mostly true’; from www.factcheck.org – this website typically did not discuss authentic news stories. Meanwhile, we classified news stories receiving the following ratings as ‘misinformation’: from www.politifact.com – ‘pants on fire’, ‘false’, ‘mostly false’, and ‘half true’; from www.snopes.com – ‘false’, ‘mostly false’, ‘mixture’, ‘miscaptioned’ (image appearing in news story received misleading caption), ‘misattributed’ (actual statement attributed to wrong source), ‘outdated’ (outdated information included in order to mislead audience), ‘scam’; from www.truthorfiction.com – ‘not true’, ‘decontextualized’ (information included misleadingly without considering its original context), ‘misattributed’, and ‘mixed’; from www.factcheck.org – ‘false’, ‘false stories’, and ‘false claim’.
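The rating-to-veracity rules in Step 2 amount to a lookup table. The sketch below encodes them in Python purely as an illustration; the names `RATING_TO_VERACITY` and `classify_story` are hypothetical, and ratings not listed in the text are deliberately left unclassified (returned as `None`) rather than guessed.

```python
AUTHENTIC = "authentic information"
MISINFO = "misinformation"

# Ratings exactly as listed in Step 2, keyed by fact-checking website.
# Snopes spells its misleading-caption rating "Miscaptioned".
RATING_TO_VERACITY = {
    "politifact.com": {
        "true": AUTHENTIC, "mostly true": AUTHENTIC,
        "pants on fire": MISINFO, "false": MISINFO,
        "mostly false": MISINFO, "half true": MISINFO,
    },
    "snopes.com": {
        "true": AUTHENTIC, "correct attribution": AUTHENTIC,
        "mostly true": AUTHENTIC,
        "false": MISINFO, "mostly false": MISINFO, "mixture": MISINFO,
        "miscaptioned": MISINFO, "misattributed": MISINFO,
        "outdated": MISINFO, "scam": MISINFO,
    },
    "truthorfiction.com": {
        "true": AUTHENTIC, "mostly true": AUTHENTIC,
        "not true": MISINFO, "decontextualized": MISINFO,
        "misattributed": MISINFO, "mixed": MISINFO,
    },
    # factcheck.org typically did not discuss authentic news stories.
    "factcheck.org": {
        "false": MISINFO, "false stories": MISINFO, "false claim": MISINFO,
    },
}

def classify_story(site, rating):
    """Return the veracity label for a rating, or None if it is not listed."""
    ratings = RATING_TO_VERACITY.get(site, {})
    return ratings.get(rating.strip().lower())
```

For instance, `classify_story("politifact.com", "Half True")` yields the misinformation label, while an unlisted rating on any site yields `None` so that it can be reviewed manually.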
Step 3: Collecting root tweets and related content (manual and automatic)
We manually located tweets on Twitter which presented (in other words, spread) the news stories discussed in the fact-checking articles we selected in the previous steps. Occasionally, some fact-checking articles were fact-checking news stories discovered on Twitter. In this case, the fact-checking articles would offer links to the tweets that spread the news stories. We simply included the provided tweets. Otherwise, we had to find the tweets which presented the fact-checked news stories. In this case, we picked keywords and key expressions from fact-checking articles and searched for them using the search function on Twitter in order to locate the relevant tweets. In either case, we reviewed the collected tweets carefully and made sure that they were actually presenting the fact-checked news stories. In case there were multiple, identical tweets presenting the same news story, we only kept the tweet posted earliest for that news story as that tweet was more likely to be the initial spreader of the news story on Twitter.
Finally, the remaining tweets were considered root tweets for this study. This process of data collection was conducted until we had collected 250 root tweets with authentic information and 250 with misinformation, posted between March 15, 2020 and January 25, 2021. Further, we collected all the retweets and replies, as well as information on the Twitter users who made those root tweets, retweets, and replies. Table 3 presents some samples of the selected root tweets. In total, we collected and conducted linguistic analysis on (1) 198,750 tweets, including all root tweets and their replies, and (2) 1.2 million Twitter user accounts, including all the producers of the root tweets, and the retweeters and the repliers of the root tweets.
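The de-duplication rule in Step 3 (keep only the earliest tweet for each fact-checked news story) can be sketched as follows. The record fields `story_id` and `created_at` and the function name `pick_root_tweets` are assumptions made for illustration; the selection in the study itself was partly manual.

```python
from datetime import datetime

def pick_root_tweets(candidate_tweets):
    """Keep, for every news story, only the earliest candidate tweet.

    candidate_tweets: iterable of dicts with (hypothetical) keys
    'story_id', 'tweet_id', and 'created_at' (a datetime).
    """
    earliest = {}
    for tweet in candidate_tweets:
        story = tweet["story_id"]
        current = earliest.get(story)
        if current is None or tweet["created_at"] < current["created_at"]:
            earliest[story] = tweet
    return list(earliest.values())

# Toy candidates: two tweets spread story "s1", one spreads story "s2".
candidates = [
    {"story_id": "s1", "tweet_id": 11, "created_at": datetime(2020, 4, 2, 9, 0)},
    {"story_id": "s1", "tweet_id": 12, "created_at": datetime(2020, 4, 1, 18, 30)},
    {"story_id": "s2", "tweet_id": 21, "created_at": datetime(2020, 5, 7, 12, 0)},
]
roots = pick_root_tweets(candidates)
```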
Data description and variable definitions
Table 4 summarizes the data sample used in this study. The total sample consists of 250 misinformation tweets produced by 228 unique producers and 250 information tweets produced by 223 unique producers. The misinformation sample shows a higher mean number of retweets but lower mean number of replies and mean number of likes (namely, favorites) compared to the information sample.
Data description.
The following quantitative variables were extracted for each tweet producer as well as each consumer: (i) follower count; (ii) followee count; (iii) listed count, which shows the number of users who added a specific user to their list of interesting accounts; (iv) favorites count, which shows the number of tweets that a specific user has marked as ‘like’; (v) statuses count, which represents the number of tweets made by a specific user; (vi) account age in days; (vii) verified account, a status granted to authentic, notable, and active accounts; and (viii) engagement, which captures how active a user has been since creating their Twitter account (Vosoughi et al., 2018).
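As an illustration of how these user-level variables might be derived from a raw account record, consider the sketch below. It assumes field names in the style of the Twitter API v1.1 user object (`followers_count`, `friends_count`, `listed_count`, `favourites_count`, `statuses_count`, `verified`) with `created_at` already parsed into a `datetime`. The engagement formula, total statuses plus favorites divided by account age in days, is our assumption of a ratio in the spirit of Vosoughi et al. (2018); the exact formula is not stated in this section.

```python
from datetime import datetime

def user_features(user, observed_at):
    """Derive the user-level variables from one account record.

    user: dict with Twitter-API-style fields (assumed, see lead-in).
    observed_at: datetime at which the account was observed.
    """
    # Guard against division by zero for accounts created the same day.
    age_days = max((observed_at - user["created_at"]).days, 1)
    followers = user["followers_count"]
    followees = user["friends_count"]
    return {
        "followers": followers,
        "followees": followees,
        "listed": user["listed_count"],
        "favorites": user["favourites_count"],
        "statuses": user["statuses_count"],
        "account_age_days": age_days,
        "verified": bool(user["verified"]),
        # Undefined when the account has no followers.
        "followees_per_follower": followees / followers if followers else None,
        # ASSUMPTION: activity per day since account creation.
        "engagement": (user["statuses_count"] + user["favourites_count"]) / age_days,
    }

example_user = {
    "created_at": datetime(2020, 1, 1),
    "followers_count": 10,
    "friends_count": 20,
    "listed_count": 1,
    "favourites_count": 50,
    "statuses_count": 150,
    "verified": False,
}
features = user_features(example_user, observed_at=datetime(2020, 4, 10))
```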
We analyzed cascade size to measure information and misinformation diffusion through the retweet network (Vosoughi et al., 2018). Cascade size represents the total number of unique retweeters (users who post retweets) in a cascade. Cascade size measures how widely the root tweet has been spread.
Data analyses and results
Settings of cluster analysis
In this study, we employed cluster analysis to discover the major classes of root tweets based on different feature sets related to them, such as the producer and consumer persona and linguistic properties of the root tweets. Cluster analysis is one of the classic unsupervised machine learning methods, which groups a set of objects (e.g., text documents) in such a way that objects in the same group (called a cluster) are more similar to each other than to those in other groups (clusters). A clustered data set ideally presents the greatest similarity within the same cluster and the greatest dissimilarity between different clusters (Grira et al., 2004; Sinaga and Yang, 2020). In this study, we used the K-means clustering algorithm in the Python scikit-learn library (version 1.1.3) (Pedregosa et al., 2011), which has been applied to solve problems in many fields, such as public health (Mullick et al., 2021), public policy (Imtyaz et al., 2020) and the movie industry (Ashari et al., 2022). We adopted the library-default settings for the main hyperparameters, such as: (1) specific algorithm: Lloyd’s algorithm (Lloyd, 1982); (2) maximum number of iterations for a single run: 300; (3) method for initialization of cluster centroids: the k-means++ strategy (Arthur and Vassilvitskii, 2007).
One of the primary hyperparameters of the K-means algorithm which has to be set up by the user before execution is the number of clusters to produce (denoted as num_clusters). Choice of the number of clusters can significantly impact the performance of K-means (Kodinariya et al., 2013). In this study, we followed a commonly used process (Kodinariya et al., 2013) for discovering the optimal num_clusters. First, we ran K-means multiple times to build clustering models with the corresponding data set at different values of num_clusters (from 2 to 10). The Silhouette value, a metric of goodness of clustering models, was computed for each model (F. Wang et al., 2017). Finally, the value of num_clusters that produced the highest Silhouette value was taken as the optimal number of clusters to be used in the actual analysis. Table 5 presents the Silhouette values computed for the cluster analyses involved in this study. The boldfaced values are the maximum Silhouette values whose num_clusters values were selected to be the optimal and used in the corresponding cluster analyses. An example to help read the table is as follows: the first table row records the discovery of optimal num_clusters for the cluster analysis presented in Section 3.3.2 – Producer Persona. That cluster analysis was performed on the root tweets of authentic information to inspect the producer persona. To determine the optimal num_clusters to be used in that clustering process, we constructed K-means models with 2–10 clusters and computed their Silhouette values. Since the Silhouette value at num_clusters = 5 is the maximum (0.464), 5 was then chosen as the optimal num_clusters and adopted for the actual clustering process in that subsection.
Silhouette values at different number of clusters.
Boldfaced values represent the maximum Silhouette scores, indicating the optimal number of clusters (num_clusters) selected for use in the corresponding cluster analyses.
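The selection procedure for num_clusters can be sketched with scikit-learn as follows. This is a minimal illustration on synthetic data, not the study's actual script: the helper name `best_num_clusters`, the random seed, and the toy feature matrix are assumptions, and the `algorithm` argument is left at the library default (Lloyd's algorithm in scikit-learn 1.1 and later). Feature scaling is not described in the text and is not applied here.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def best_num_clusters(features, candidates=range(2, 11), seed=0):
    """Fit K-means for each candidate k and return the k with the
    highest Silhouette value, together with all scores."""
    scores = {}
    for k in candidates:
        model = KMeans(
            n_clusters=k,
            init="k-means++",   # centroid initialization strategy
            max_iter=300,       # maximum iterations for a single run
            n_init=10,          # number of restarts (assumed value)
            random_state=seed,  # fixed only to make the sketch repeatable
        )
        labels = model.fit_predict(features)
        scores[k] = silhouette_score(features, labels)
    best_k = max(scores, key=scores.get)
    return best_k, scores

# Synthetic stand-in for a feature matrix: three well-separated groups.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
toy_features = np.vstack(
    [center + rng.normal(scale=0.5, size=(30, 2)) for center in centers]
)
best_k, silhouette_by_k = best_num_clusters(toy_features)
```

The dictionary `silhouette_by_k` plays the role of one row of Table 5: one Silhouette value per candidate num_clusters, with the maximum determining the number of clusters used afterwards.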
Quantitative analyses
The primary goal of quantitative analysis was to develop the overt and covert persona of message producers and consumers to see if there were differences across personas of misinformation producers/consumers and information producers/consumers.
Producer persona
Table 6 shows the key characteristics of misinformation and authentic information producers. The column Ptg. Verified presents the percentage of producers in each veracity group (i.e., misinformation or information) whose identity has been verified by the platform. Each of the other columns presents the mean of a different metric measured over all the producers in a veracity group. In addition, the difference between the means of the two veracity groups, the p-values of mean comparison (t-test), and the p-values of distribution comparisons (Kolmogorov-Smirnov (KS) test) of the two veracity groups are provided.
Producer characteristics.
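Because the comparisons in this section report both a t-test (difference in means) and a KS test (difference in distributions), it may help to see why the two can disagree. The sketch below computes the two test statistics in plain Python on toy data; it assumes Welch's unequal-variance form of the t statistic, which the text does not specify, and it omits p-values. In practice, functions such as `scipy.stats.ttest_ind` and `scipy.stats.ks_2samp` return the statistics together with p-values.

```python
from bisect import bisect_right
from math import sqrt
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """t statistic for the difference in means (Welch form, assumed)."""
    standard_error = sqrt(
        variance(sample_a) / len(sample_a) + variance(sample_b) / len(sample_b)
    )
    return (mean(sample_a) - mean(sample_b)) / standard_error

def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: the largest gap between the two
    empirical cumulative distribution functions."""
    a_sorted, b_sorted = sorted(sample_a), sorted(sample_b)
    largest_gap = 0.0
    for x in a_sorted + b_sorted:
        cdf_a = bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect_right(b_sorted, x) / len(b_sorted)
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap

# Toy groups with identical means but very different shapes.
group_a = [4, 5, 6, 4, 5, 6]      # concentrated around 5
group_b = [0, 1, 9, 10, 0, 10]    # split between low and high values
t_value = welch_t(group_a, group_b)
d_value = ks_statistic(group_a, group_b)
```

Here the t statistic is zero because both groups average 5, yet the KS statistic is large because the values are spread very differently. This mirrors the cases reported below where group means do not differ significantly while the distributions do.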
Results show that, on average, the producers of misinformation compared to producers of information have significantly fewer followers (p-value = 0.012), tweets (p-value = 0), and listings (p-value = 0). Fewer misinformation producers are identity-verified. In addition, on average the producers of misinformation have younger accounts compared to the producers of information (p-value = 0). However, the producers of misinformation show more followees per follower (p-value = 0). Even though the mean favorites count of producers from both groups is not significantly different, the distributions of favorites count seem different (see Figure 3). Compared to producers of information, producers of misinformation are distributed more lightly in the area from 0 to 50k favorites, but more heavily in the area from 50k to 150k favorites. Information producers are distributed more heavily in even higher favorites count ranges such as 500k, 600k, and above 800k.

Distribution of producers’ favorites’ count.
To further examine the characteristics of producers within each veracity group, we conducted a cluster analysis on the producers of misinformation and authentic information separately. Features in Table 7 were included in the clustering process to characterize the producers. As shown in Table 4, there are 223 and 228 unique producers of authentic information and misinformation root tweets, respectively. The number of clusters, as shown in Table 5, was six for misinformation and five for authentic information.
Clusters of producers.
Table 7 first presents the results of cluster analysis on misinformation producers running at six clusters. For simplicity, only the three largest clusters are shown in the table. The identified largest three clusters include: a larger cluster (C0) taking 109 (or 47.81%) of the producers, a smaller cluster (C1) taking 87 (38.16%) of the producers, and a cluster of a small number of outliers (C2) taking 14 (6.14%) of the producers. Producers in the largest cluster (C0) hold older accounts on average and are less engaged on the platform. By contrast, producers in the smaller cluster (C1) hold younger accounts but are more engaged and post fewer statuses. The outliers (C2) post many statuses and are more popular on the platform, having more followees and favorites.
Similarly, producers of authentic information can also be divided into three clusters: a larger one (C0, 139 or 62.33% of the producers), a smaller one (C1, 49 or 21.98% of the producers), and a group of outliers (C2, 17 or 7.62% of producers). Different from the largest group of misinformation producers, C0 of authentic information consists of producers who have more followers and a higher listed count than the other producer groups. Meanwhile, they have fewer followees. This suggests that a majority of the authentic information producers might be more willing or able to attract others’ attention than to listen to others.
In summary, compared to the producers of authentic information, the producers of misinformation are less active on the platform and less able or willing to expose themselves in the social network. Their accounts are younger. They make fewer tweets. They are less likely to verify their identity. They have fewer followers and are less likely to be noticed by other users by being listed. However, they make lots of friends given their exposure to the social network. In addition, the clusters of misinformation producers show different characteristics compared to clusters of information producers. Thus, the findings depict that: F1: There are differences across the characteristics of information and misinformation producers.
Consumer persona – Retweeters
Table 8 shows the key characteristics of the retweeters per root tweet. The retweeters of a root tweet are characterized by a set of mean values measured from these retweeters, such as their mean number of followers, and the mean number of followees. The misinformation and information group means of each of these retweeter means are presented in Table 8.
Retweeter persona (first 48 hours).
In addition, the difference of the veracity group means and the p-values of the t-test comparing the group means, and p-values of the KS-test comparing the two veracity groups are also presented.
Results show that, on average, root tweets of misinformation compared to root tweets of information are retweeted by retweeters who have significantly younger accounts (p-value = 0), a lower number of followees per follower (p-value = 0), fewer favorites received (p-value = 0.011), and fewer listings (p-value = 0). Even though there is no significant difference between means of the number of tweets posted by misinformation retweeters and information retweeters, the distributions of retweeters’ mean tweets count in different veracity groups show some differences (see Figure 4). Specifically, compared to root tweets of information, root tweets of misinformation are more attractive to retweeters with a very low statuses count (<50K) and less attractive to retweeters with statuses count between 50K and 150K.

Distribution of retweeters’ mean tweets count.
Further, a cluster analysis was conducted on the root tweets of misinformation and authentic information separately. Features characterizing these root tweets’ retweeters were included in the clustering process (as shown in Table 9). Since all the root tweets have at least one retweeter, all the root tweets were included in the clustering process (N = 250 for each veracity group). The number of clusters, as shown in Table 5, was two for both veracity groups.
Clusters of root tweets based on retweeter persona.
Root tweets of misinformation can be divided into two substantial clusters based on the retweeter persona: a larger cluster (C0) with 163 (65.2%) of the root tweets and a smaller cluster (C1) with 87 (34.8%) root tweets. Root tweets in the larger cluster (C0) attract retweeters who are less active in expressing themselves and less popular on the platform: on average, these retweeters post fewer statuses, are listed less frequently, have fewer followees and followers, and are less engaged on the platform. By contrast, root tweets in the smaller cluster (C1) are attractive to a different type of retweeter, one who is more socialized on the platform and more active in content generation, with more followers, followees, and statuses. Different from misinformation, almost all the authentic information root tweets (in C0, with 237 or 94.8% of the root tweets) attract less active, less socialized retweeters. Only a few outliers (in C1, with only 13 or 5.2% of the root tweets) are attractive to more active, more engaged, and more socialized retweeters.
In summary, retweeters of misinformation present similarities to the producers of misinformation. Root messages of misinformation can attract such consumers to retweet. Compared to retweeters of information, these retweeters are less active on the platform and less able or willing to expose themselves in the social network. Their accounts are younger. They make fewer tweets and their tweets are less popular. They have fewer followers and are less likely to be noticed by other users by being listed. In addition, we can see differences between clusters of tweets generated based on retweeter personas of information and misinformation groups. Thus: F2: There are differences across the characteristics of information and misinformation retweeters.
Consumer persona – Repliers
Table 10 shows the key characteristics of the repliers (i.e., users who posted replies to the root tweets).
Replier persona (first 48 hours).
Using measures similar to those used in the comparison of information and misinformation retweeters, we calculate the group means of each measure for a sample of repliers. In addition, the table also shows the difference of the group means and the p-values of the t-test comparing the group means. Finally, p-values of the KS-test are also presented. In general, repliers of the two veracity groups are not that different from the retweeters across the two groups. Results show that on average, root tweets of misinformation compared to root tweets of information are replied to by repliers who have a significantly lower number of followees per follower (p-value = 0).
Further, even though there is no significant difference detected in the group means of the number of replies per root tweet, number of followees, number of listings, and engagement, the distributions of these metrics show significant differences between the veracity groups, as supported by the low p-values of the KS-test (p-values of KS-test for mean num. replies: 0, for mean num. followees: 0.03, for mean num. listings: 0.044, and for mean engagement: 0.029). Moreover, some repliers’ metrics are observed to be distributed differently across veracity groups even though this is not supported by the KS-test. For example, Figure 5 indicates that repliers’ mean favorites count of misinformation root tweets compared to information root tweets is distributed more heavily in the low favorite count area (<10,000). Figure 6 suggests that the mean account age of repliers is distributed more heavily in either a low age (<1500 days) or high age area (>2500 days) and less distributed in between.

Distribution of repliers’ mean favorites count.

Distribution of repliers’ mean account age in days.
Further, we conducted a cluster analysis on the root tweets examining their replier persona. There are 219 root tweets of authentic information and 130 root tweets of misinformation which have at least one replier (in other words, at least one reply). Thus, the cluster analysis only included these root tweets. As shown in Table 5, clustering was conducted on each of these sets of root tweets with four clusters for authentic information and three target clusters for misinformation. Features about the repliers shown in Table 11 were included in the clustering processes.
Clusters of root tweets based on replier persona.
Table 11 presents the three largest clusters of each veracity group. The root tweets of misinformation can be divided into two almost equally sized clusters, each taking about half of the root tweets, including C0 (65 or 50% of the root tweets) and C1 (64 or 49.23% of the root tweets). C2 with only one root tweet is ignored. Root tweets in C0 and C1 attract distinct types of repliers. Specifically, root tweets from C0 tend to attract repliers who on average have older accounts and are less popular and less active in generating content on the platform as they have a lower mean number of favorites, statuses, and lower mean user engagement. On the contrary, root tweets from C1 are attractive to repliers who show opposite properties in the above metrics.
Root tweets of authentic information present three significant clusters: a larger cluster (C0) with 142 root tweets (64.8%), a smaller cluster (C1) with 54 root tweets (24.7%), and a small number of outliers (C2) with 21 root tweets (9.6%). The largest cluster (C0) of root tweets can attract repliers who are relatively more socialized on the platform as represented by their high average numbers of followers and followees. In comparison, the smaller cluster (C1) is attractive to the less socialized, less popular, and inactive repliers who, on average, present the lowest numbers of followers, followees, favorites, times of being listed, and statuses. Compared to root tweets in the two major clusters above, the outliers of root tweets (C2) tend to attract repliers who are most popular and active in expressing themselves on the platform, suggested by their high values of mean followers, followees, favorites and statuses.
In summary, the above comparisons of misinformation and information repliers with respect to means across multiple measures, distribution of means, and replier clusters show evidence for differences between these two groups. Thus, our findings show that: F3: There are differences across the characteristics of information and misinformation repliers.
Message diffusion
Table 12 shows the key metrics of the cascade size (i.e., number of retweets) of root tweets in each veracity group.
Cascade size (first 48 hours).
As shown in the table, even though there is no significant difference between the mean cascade sizes of misinformation and information, the cascade size of misinformation is lower than that of information at the 25th and 75th percentiles while higher for the top 7.5% cascades. In addition, as per the KS test, the cascade size distributions of misinformation and information are significantly different (p-value = 0). Thus: F4: Misinformation and information propagate differently (faster or slower) depending on cascade size for the first 48 hours.
Qualitative analyses
Content analysis
Various qualitative analyses were conducted to understand hidden emotions and sentiments expressed in both information and misinformation tweets and their respective replies. Specifically, we performed sentiment analysis of the root tweets and their replies, using the VADER (Valence Aware Dictionary and Sentiment Reasoner) sentiment analysis tool and lexicon (Hutto and Gilbert, 2014). VADER is a suitable tool for analyzing sentiment expressed in microblogs such as tweets and other social media posts, as it reports positive, negative, and neutral sentiment scores. In addition, it reports a compound score. This is a unidimensional weighted composite score, which ranges between −1 (extremely negative) and +1 (extremely positive). VADER requires little data preprocessing because it understands and handles many conventional uses of punctuation (e.g., Great!!), word shape (e.g., ALL CAPS), sentiment-laden slang words, acronyms, and emoticons. 2
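To make the compound score concrete, the sketch below reproduces, as we understand it, the normalization that the reference VADER implementation applies to the summed word valences, x / sqrt(x² + α) with α = 15, together with the conventional ±0.05 cut-offs for labeling a text as positive, negative, or neutral. The function names and the input values are hypothetical; the sketch does not perform VADER's lexicon lookup or its punctuation and capitalization rules.

```python
from math import sqrt

def compound_score(valence_sum, alpha=15.0):
    """Map an unbounded sum of word valences into the range [-1, +1]."""
    score = valence_sum / sqrt(valence_sum * valence_sum + alpha)
    return max(-1.0, min(1.0, score))

def sentiment_label(compound, threshold=0.05):
    """Label a compound score using the conventional +/-0.05 cut-offs."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# Hypothetical valence sums for three texts.
strongly_positive = compound_score(4.0)
strongly_negative = compound_score(-4.0)
no_sentiment = compound_score(0.0)
```

The normalization is monotonic, so a larger summed valence always gives a larger compound score, while extreme sums approach but never exceed the bounds.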
We also identified the emotional tone of root tweets and their replies using the emotion lexicon curated by the National Research Council of Canada (NRC). 3 The NRC lexicon is considered very effective in identifying eight key emotional tones – anger, fear, sadness, disgust, surprise, anticipation, trust, and joy (Plutchik, 1980; Vosoughi et al., 2018). Before running this analysis, messages were pre-processed by converting text to lowercase and removing punctuation, URLs, and frequently occurring stopwords, to reduce inherent noise.
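The preprocessing and lexicon-lookup steps can be sketched as follows. The three-word lexicon and the short stopword list are toy stand-ins invented for illustration, not entries from the NRC lexicon, and scoring each emotion as the share of remaining tokens associated with it is an assumption, since the normalization used in the study is not described here.

```python
import re
import string
from collections import Counter

EMOTIONS = ["anger", "fear", "sadness", "disgust",
            "surprise", "anticipation", "trust", "joy"]

# TOY stand-ins (hypothetical): a real analysis would load the NRC
# lexicon and a full stopword list instead.
TOY_LEXICON = {
    "hoax": {"anger", "disgust"},
    "vaccine": {"trust"},
    "outbreak": {"fear", "sadness"},
}
TOY_STOPWORDS = {"the", "a", "an", "is", "of", "and", "to", "in"}

URL_PATTERN = re.compile(r"https?://\S+")
PUNCTUATION_TABLE = str.maketrans("", "", string.punctuation)

def preprocess(text, stopwords=TOY_STOPWORDS):
    """Lowercase, drop URLs, strip punctuation, and remove stopwords."""
    text = URL_PATTERN.sub(" ", text.lower())  # URLs first, before punctuation
    text = text.translate(PUNCTUATION_TABLE)
    return [token for token in text.split() if token not in stopwords]

def emotion_scores(text, lexicon=TOY_LEXICON):
    """Share of tokens associated with each of the eight emotions."""
    tokens = preprocess(text)
    counts = Counter()
    for token in tokens:
        for emotion in lexicon.get(token, ()):
            counts[emotion] += 1
    if not tokens:
        return {emotion: 0.0 for emotion in EMOTIONS}
    return {emotion: counts[emotion] / len(tokens) for emotion in EMOTIONS}

example_tokens = preprocess("The VACCINE is a HOAX!!! https://example.com/x")
example_scores = emotion_scores("The VACCINE is a HOAX!!! https://example.com/x")
```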
Table 13 summarizes average emotional scores and average sentiment scores of root messages in each veracity group. The table also presents the mean difference between the misinformation and information group average scores along with their p values and the p values of KS tests. Results suggest that on average root messages of misinformation compared to root messages of information present lower overall positive emotion (p-value = 0.047) and higher negative sentiments (p-value = 0.061) but significantly lower sadness (p-value = 0.017).
Emotions and sentiments of root tweets.
As shown in Figures 7 and 8, compared to root messages of information, root messages of misinformation are distributed more heavily on lower negative and lower positive emotion scores and more lightly on higher negative and higher positive scores. Moreover, Figure 9 shows that root messages of misinformation are distributed more heavily on the negative sentiment area and more lightly on the neutral (=0) and positive sentiment area.

Distribution of root messages on negative emotion score.

Distribution of root messages on positive emotion score.

Distribution of root message on compound sentiment score.
Next, we conducted a cluster analysis on the root tweets based on their popularity and linguistic characteristics. Unlike the previous cluster analyses, we did not consider the features of the producers, retweeters, or repliers, but focused only on the features of the root tweets themselves. Cluster analysis was performed on all root tweets in each veracity group (N = 250) over the features in Table 14. As shown in Table 5, the number of clusters was set to five for authentic information and six for misinformation.
Clusters of root tweets.
The largest three clusters of root tweets in each veracity group are shown in Table 14. For misinformation, all of the top three clusters have substantial size: two larger clusters, C0 and C1, take 70 (or 28%) and 67 (or 26.8%) of the root tweets, respectively; a smaller cluster, C2, takes 46 root tweets (18.4%). Root tweets from one of the large clusters (C0) tend to express strong negativity, with high negative sentiment and low positive emotion on average. These root tweets are highly popular, with high retweet, reply, and favorite counts. The other large cluster (C1) of root tweets tends to express higher neutrality, with a high neutral sentiment score. Their popularity is moderate, as indicated by their low or medium retweet, reply, and favorite counts.
By contrast, root tweets of authentic information can be grouped into one larger cluster (C0, size: 81, proportion: 32.4%) and two smaller clusters (C1, size: 65, proportion: 26%; C2, size: 63, proportion: 25.2%). One of the small clusters (C2) expresses strong negativity with its distinctively high negative emotion and negative sentiment. Root tweets of this cluster suffer from low popularity with their low retweet count, reply count and favorite count on average.
In summary, root tweets of misinformation are emotionally more neutral and sentimentally more negative than root tweets of authentic information. In addition, root tweets of different veracities tend to have different clustering effects. Thus, our findings show that: F5: Misinformation and information root messages show different sentiments and emotional tones.
Table 15 shows the key characteristics of the replies per root tweet. The replies of a root tweet are characterized by a set of mean emotion and sentiment scores measured from these replies, such as their mean fear, disgust, and positive sentiment. The group means of each of these mean scores of replies are presented in the table along with the p-values of the t-test and KS-test. Results show that on average the root messages of misinformation compared to root messages of information can attract replies that express significantly less surprise (p-value = 0), less negative (p-value = 0.007), and more neutral (p-value = 0.001) sentiments. In addition, as shown in Figure 10, the distribution of the average compound sentiment scores of misinformation replies and information replies shows some differences (KS-test p-value = 0.06), with misinformation root tweets distributed more on the positive and negative areas (sentiment score < or > 0) and less on the neutral area (i.e., sentiment score close to 0).
Emotions and sentiments of replies.

Distribution of replies’ mean compound sentiment scores.
Finally, we conducted a cluster analysis of the root tweets based on the linguistic features of their replies. There are 219 root tweets in the veracity group of information and 130 root tweets in the veracity group of misinformation that have at least one reply. Accordingly, a clustering algorithm was run on each of these groups of root tweets, with two clusters for authentic information and three clusters for misinformation (see Table 5). The features of the replies shown in Table 16 were included in the clustering process.
Clusters of root tweets based on replies’ characteristics.
Table 16 presents the largest three clusters for each veracity group. Results suggest that the root tweets of misinformation can be divided into two significant clusters, C0 (size: 83, proportion: 63.85%) and C1 (size: 39, proportion: 30%), and an outlier cluster, C2 (size: 8, proportion: 6.2%). Root tweets from the larger cluster (C0) seem to attract replies expressing more negative sentiment on average. In comparison, root tweets from the smaller cluster (C1) can attract replies with more neutral sentiments. The outliers (C2) are most likely to attract replies of more positive sentiment.
Clustering results for authentic root tweets indicate only two clusters, where C0 (size: 189, proportion: 86.3%) far outweighs C1 (size: 30, proportion: 13.7%). Root tweets in C0 tend to attract replies presenting more neutral and negative emotion and sentiment on average. By contrast, root tweets in C1 seem to attract replies showing more positive emotion and sentiment.
Considering all the evidence above, we can say that: F6: Replies of misinformation and information root messages show different sentiment and emotional tones.
Topic modeling
Topic modeling (Wallach, 2006) was conducted to extract and understand the main topics discussed in the root tweets. In topic modeling, a topic is a mixture of words from a corpus of documents that are expected to be related to a common subject (Jelodar et al., 2019). In this study, we are particularly interested in exploring the topics within each veracity group separately in order to investigate the difference between authentic information and misinformation from a topical perspective. Accordingly, we used the Structural Topic Modeling (STM) method, which discovers and compares topics from subsets of documents in a common corpus separated by a selected covariate (Roberts et al., 2013). STM has been applied to topic exploration in many fields, such as social science (Roberts et al., 2013), education (Chen et al., 2020), and business (Hu et al., 2019). In this subsection, all the results were generated using the stm package (version 1.3.6) of R (Roberts et al., 2019), which is one of the most popular programming libraries for STM. Our STM models were constructed using the stm function in the library with library-default settings for the main hyperparameters: (1) maximum number of iterations: 500; (2) method of topic initialization: deterministic initialization using the spectral algorithm (Arora et al., 2014); (3) prior estimation for the content covariate coefficients: L1 prior (Villena et al., 2009). Furthermore, a common series of data cleaning tasks was performed on the root tweets prior to topic modeling, including lowercasing, stemming, and removal of stopwords, numbers, punctuation, and non-alphabetic characters.
The STM algorithm we used relies on the user to choose the number of topics (denoted as num_topics) to be extracted. We followed a widely used process for discovering the optimal num_topics (Vanhala et al., 2020), in which trial STM models are generated at different numbers of topics and the num_topics is chosen at which the trial model maximizes the held-out log likelihood while minimizing the residual (Chang et al., 2009; Lemay et al., 2021). Held-out log likelihood is the logarithm of the ‘probability of held-out documents given a trained model’ (Wallach, 2006). It ‘evaluates how well the information learned from a corpus applies to unseen documents’ (Chang et al., 2009; Lemay et al., 2021). Residual is a measure of the variance in the corpus not explained by the topic model (Taddy, 2012). Large residuals can indicate that the true number of topics is greater than the number of estimated topics (Lemay et al., 2021; Taddy, 2012). In our search for the optimal num_topics, we constructed and evaluated trial STM models at num_topics from 4 to 30 (with num_topics = 1–3 the training algorithm did not converge and thus failed to produce a model). The evaluation result is presented in Figure 11. Based on the result, we picked num_topics = 17 as the optimal number of topics, which produced the maximum held-out log likelihood (−6.807) and a close-to-minimum residual (2.186) at a default held-out rate of 0.5.
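The selection rule described above can be expressed as a short Python sketch. It is a hedged illustration, not the procedure actually run in R: the function name pick_num_topics and the 5% residual tolerance are assumptions introduced here to make "close-to-minimum residual" concrete, and all diagnostic values other than those reported for num_topics = 17 are hypothetical.

```python
def pick_num_topics(diagnostics, residual_tolerance=0.05):
    """Pick the number of topics from per-model diagnostics.

    diagnostics maps num_topics -> (held_out_log_likelihood, residual).
    Models whose residual lies within residual_tolerance (relative) of the
    smallest residual are kept; among those, the model with the highest
    held-out log likelihood wins. The tolerance is an assumed stand-in for
    a visual judgement of a diagnostic plot.
    """
    min_residual = min(residual for _, residual in diagnostics.values())
    cutoff = min_residual * (1 + residual_tolerance)
    candidates = {
        k: values for k, values in diagnostics.items() if values[1] <= cutoff
    }
    return max(candidates, key=lambda k: candidates[k][0])


diagnostics = {
    10: (-6.950, 2.600),  # hypothetical
    17: (-6.807, 2.186),  # values reported in the text
    25: (-6.900, 2.150),  # hypothetical
}
print(pick_num_topics(diagnostics))
```

Filtering on the residual first and then maximizing the held-out log likelihood is only one way to trade off the two criteria; a different tolerance could select a different model.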

Diagnostic values by the number of topics.
Figure 12 presents the prevalence of each topic with respect to the corpus and single veracity groups. Figure 12a shows the expected topic proportion of each topic. The five words with the highest topic assignment are displayed in the figure, where the words have been preprocessed into word stems as required by many text mining algorithms (Kannan et al., 2014). Expected topic proportion (Blei, 2012) of a topic is the proportion of the words in the corpus that belong to the topic. This metric measures the prevalence (in other words, popularity or significance) of the topic in the corpus. Topic assignment (Blei, 2012) of a word in a topic is the ratio between the number of occurrences of the word in the topic and the number of occurrences of the word in the corpus. This metric measures the prevalence (in other words, association or significance) of the word with respect to the topic. Essentially, Figure 12a presents the most prevalent topics in the corpus of root tweets with the five most significant words of each topic.
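The two metrics defined above can be illustrated with a toy computation. The sketch below is a simplification: it assumes each token carries a single hard topic label, whereas a fitted topic model estimates these quantities probabilistically. The tokens and labels are hypothetical and are not drawn from the study corpus.

```python
from collections import Counter


def expected_topic_proportion(tokens):
    """Share of all corpus tokens that belong to each topic."""
    topic_counts = Counter(topic for _, topic in tokens)
    total = len(tokens)
    return {topic: count / total for topic, count in topic_counts.items()}


def topic_assignment(tokens, word, topic):
    """Occurrences of word in topic divided by its occurrences in the corpus."""
    in_topic = sum(1 for w, t in tokens if w == word and t == topic)
    in_corpus = sum(1 for w, _ in tokens if w == word)
    return in_topic / in_corpus if in_corpus else 0.0


# Hypothetical (word stem, topic id) pairs.
tokens = [
    ("mask", 5), ("wear", 5), ("mask", 12), ("fauci", 12),
    ("trump", 15), ("mask", 5), ("wear", 14), ("trump", 15),
]

proportions = expected_topic_proportion(tokens)
print(proportions[5])
print(topic_assignment(tokens, "mask", 5))
```

In this toy corpus, Topic 5 holds three of the eight tokens, and two of the three occurrences of the stem mask fall in Topic 5.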

Prevalence of discovered topics. (a) expected topic proportion of discovered topics; (b) estimated mean difference in root tweets’ topic proportions between misinformation and authentic information.
Table 17 presents selected topics and a sample of representative documents for these topics selected by the programming library. The table also presents our tentative interpretation of the topics. Following the method of topic interpretation by Jacobi et al. (2018), we tried to interpret each topic as a few phrases or an expression that summarizes (or at least is associated with) most of the top five words of the topic. However, our interpretation should not be considered the only correct interpretation of these topics, because interpreting machine-extracted topics is a challenging task whose result can be highly subjective, depending on the interpreter’s personal understanding of the topic words and document samples (Jacobi et al., 2018; Maier et al., 2021).
Topics and sample root tweets.
+Boldfaced values represent highest-probability words of topic appearing in sample.
Some components of root tweet samples are replaced with indicators in the form of <. . .> in order to exclude personal-identifiable information in tweets.
According to Figure 12a, Topic 5 is the most popular topic (expected topic proportion = 0.109) across all the root tweets, and seems to be related to mask-wearing and business (‘busi’ is the word stem of ‘business’). Topic 15, the second most popular topic (expected topic proportion = 0.084), seems to be related to significant public/political figures during the pandemic, such as Mr. Donald Trump, President of the United States (National Archive, 2023), and Dr. Anthony Fauci, Director of the U.S. National Institute of Allergy and Infectious Diseases (NIAID, n.d.). Topic 7, the third most prevalent topic (expected topic proportion = 0.07), seems to address the event in which a convent in Michigan lost 13 sisters to Covid-19 in early 2020 (Convent in Michigan Loses 13 Sisters to Covid-19, 2020; Convent Outside Detroit Lost 13 Nuns to Covid-19 with 12 Dying in One Month, 2020). Overall, subjects such as public figures (e.g., ‘trump’ in Topics 10, 6, 13, and 15 and ‘fauci’ in Topics 15 and 12) and masks (e.g., ‘wear’ in Topics 5 and 14, ‘mask’ in Topics 12 and 14) are more significant in the topics.
Figure 12b plots the estimated mean difference between each topic’s proportions in misinformation root tweets and its proportions in authentic information root tweets (denoted as μ_diff_tp). In other words, each data point is the mean change in a topic’s proportions when shifting from misinformation to authentic information. The confidence intervals of the data points at the 97.5% confidence level are also included in the plot as error bars. A topic with a positive (negative) data point is more discussed by or associated with the authentic information (misinformation) root tweets. If the entire error bar of a topic lies on the positive (negative) side of the horizontal axis, the algorithm is confident that the topic is more associated with or discussed by authentic information (misinformation) root tweets. The distance to point 0 indicates how strongly a topic is associated with its veracity group.
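How a point estimate and its error bar translate into a veracity association can be sketched as follows. This is a simplified stand-in, not the regression-based effect estimation performed by the stm library: it assumes a plain difference of group means (authentic minus misinformation, following the sign convention described above) with a normal-approximation interval, and the per-document topic proportions are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def mean_diff_ci(authentic, misinfo, level=0.975):
    """Difference of means (authentic - misinformation) with a normal CI."""
    diff = mean(authentic) - mean(misinfo)
    standard_error = sqrt(
        variance(authentic) / len(authentic) + variance(misinfo) / len(misinfo)
    )
    z = NormalDist().inv_cdf(0.5 + level / 2)  # two-sided critical value
    return diff, diff - z * standard_error, diff + z * standard_error


def association(lower, upper):
    """Read the error bar: which group, if any, the topic leans towards."""
    if lower > 0:
        return "authentic"
    if upper < 0:
        return "misinformation"
    return "undetermined"


# Hypothetical proportions of one topic in individual root tweets.
authentic = [0.30, 0.35, 0.40, 0.45]
misinfo = [0.05, 0.10, 0.15, 0.20]

diff, lower, upper = mean_diff_ci(authentic, misinfo)
print(diff, lower, upper, association(lower, upper))
```

A topic whose interval straddles zero is labelled undetermined here, which corresponds to the topics described as almost equally associated with both veracity groups.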
Figure 12b suggests that there are four topics discussed significantly more by authentic information root tweets, ranked from the most to the least associated with authentic information: Topic 13 (
Focusing on the above topics with a significant veracity association, we can observe several differences between authentic information and misinformation. The authentic information root messages display more interest in subjects that stress the severity of the COVID-19 pandemic. For example, Topic 1 from the authentic information carries words like ‘test’, ‘posit’ (from ‘positive’) and ‘death’, which seem to relate positive test results to death. Topic 7 from the authentic information seems to address the event in which a convent in Michigan lost 13 sisters to Covid-19 in early 2020 (Convent in Michigan Loses 13 Sisters to Covid-19, 2020; Convent Outside Detroit Lost 13 Nuns to Covid-19 with 12 Dying in One Month, 2020). By contrast, the misinformation root messages express greater interest in subjects that cast countermeasures against COVID-19 as problematic. For example, Topic 8 with ‘vaccin’ (‘vaccine’) and ‘die’ seems to associate vaccination with death. Topic 5 with ‘busi’ (‘business’) and ‘wear’ might associate mask policy with a decline in business.
Furthermore, there are three topics which are almost equally associated with both veracity groups, including: Topic 15 (
Considering all the analysis above, we have: F7: Misinformation and information root messages emphasize different topics within the same context.
Discussion
Key findings and proposed framework
We compared the characteristics of misinformation and information and identified that misinformation is different from information concerning four key dimensions, namely, producers’ persona, message characteristics, consumers’ persona, and message diffusion.
First, as presented in Finding F1, the producers of misinformation differ significantly from the producers of information across multiple measures, such as number of followers, number of tweets, account age, and listed accounts. Second, as indicated in Findings F2 and F3, retweeters and repliers of misinformation differ from retweeters and repliers of information across multiple measures (e.g., number of followees, number of listings, engagement). In addition, replies to misinformation and information root messages show different sentiments and emotional tones (F6). Thus, we see a clear difference between misinformation and information consumer personas.
Third, the characteristics of misinformation and information root messages differ across multiple measures, such as sentiments and emotional tone (F5) and key topics discussed (F7). Finally, as mentioned in F4, our preliminary analysis shows a difference between misinformation and information propagation. Therefore, we expand our initial conceptual framework based on the literature to capture these differences. The natural conclusion of our analysis is that the relationships between these constructs would also differ for misinformation and information (F8–F13). We therefore propose that misinformation diffusion can be explained by misinformation message characteristics, consumer personas, producers’ personas, and their interrelationships. In summary:
Misinformation Message Diffusion = f (Misinformation Producer Persona, Misinformation Consumer Persona, Misinformation Message Characteristics)
Misinformation Consumer Persona = g (Misinformation Producer Persona, Misinformation Message Characteristics)
Misinformation Message Characteristics = h (Misinformation Producer Persona)
Figure 13 shows the proposed framework.

Proposed theoretical framework of misinformation diffusion on social media.
Table 18 shows the updated list of proposals.
Updated list of proposals.
Implications for theory and practice
Our work has key implications for research and practice. First, we used the system approach to develop a holistic framework for evaluating inter-related components of misinformation and information diffusion. The framework embeds specific questions for future research. The conceptual categories identified in the framework are grounded in theories and prior literature and represent the key dynamics involved in information and misinformation diffusion on social media. The propositions that we offer, and additional insights gained from the exploratory data analysis, provide an agenda for future research on misinformation diffusion. These results, although preliminary, offer compelling implications for future research on misinformation diffusion.
We adopt a multi-theory approach to understand and explain misinformation and information diffusion. It is important to understand the misinformation diffusion process since it is a dynamic phenomenon that takes shape as information spreads from one person to another (DiFonzo and Bordia, 2007). This truly emergent nature of misinformation cannot be captured unless we examine ‘the changing communicative patterns that occur during the lifecycle of misinformation on social media’ (Shin et al., 2018: 278). Thus, we treat misinformation as a dynamic phenomenon that has a life cycle and a pattern shaped by multiple factors. We also focus on the largely neglected social and psychological aspects of misinformation in examining how online producers’ and consumers’ characteristics or digital persona impact misinformation diffusion patterns (Shin et al., 2018).
Second, our adoption of a mixed design allows us to develop rich consumer- and cascade-level insights into the propagation of misinformation and its emergent nature. The analyses helped us develop a more nuanced understanding of our key theoretical constructs. For example, at the consumer level, we analyzed retweeters and repliers as separate yet overlapping groups of consumers. This enabled us to understand the differences between retweeters and repliers. The results also confirm the dynamic nature of misinformation diffusion, as evidenced by the changing patterns of the tweets as misinformation spreads. Our results on emotions and sentiments in misinformation diffusion support the socio-psychological dimensions of misinformation diffusion.
Finally, the insights gained from this research can also be generalized to other microblogging platforms, such as Tumblr, Twister, and Reddit. The microblogs generated on these sites are likewise a function of producer, consumer, message, and transmission characteristics, as on Twitter. The diffusion of misinformation on these platforms would therefore likely follow similar patterns.
From a practical perspective, a better understanding of issues and motives related to misinformation origination, transmission, and diffusion makes several contributions to social media platforms, individual users, and policy makers. Table 19 summarizes the possible practical implications of the current study’s findings.
Example of practical implications of the proposed Misinformation Diffusion framework.
At the platform level, a better understanding of the issues and motives behind misinformation origination, transmission, and diffusion could help platforms develop effective intervention capabilities. It would enable platforms to be more proactive and transparent in identifying and reporting misinformation messages and in communicating misinformation identification reports to consumers (Rodrigo et al., 2022). Examining message emotions helps platforms develop algorithms to differentiate misinformation messages from true news. This would improve their credibility and minimize the end-user’s exposure to the harmful effects of misinformation. At the individual level, the current study provides insights for consumers about identifying misinformation messages and producers. Misinformation could harm not only news consumers but also the retweeters’ credibility. The findings provide insight into how consumers can carefully select their inner networks (e.g., follower-following networks) by observing producer characteristics and behaviors, thereby reducing the likelihood of being exposed to misinformation. The original message and the immediate replies to it (e.g., emotional and polarized replies (Y. Han, 2021), repliers’ perceptions of source credibility, message timeliness) play a critical role in building a perception around the original message, which could influence the spread of information or misinformation. Finally, our research presents cohesive evidence for policymakers to consider while framing legal and statutory policies (Hartley and Vu, 2020) to curb the emerging pattern of misinformation generation on social media. These findings confirm and extend the current literature.
Conclusion and future studies
Drawing upon the existing body of research and insights gained from data analysis, we propose a holistic framework to explain the difference between information and misinformation diffusion on social media. Our study has some limitations that future research may address. First, our analysis is based on only Twitter data related to COVID-19. This may raise generalizability issues. However, considering that Twitter is a popular social media platform that is ranked among the top three sites that Americans get their news on, the generalization problem may not be that severe (Matsa and Shearer, 2018). Nevertheless, further research could extend our analysis to other platforms, such as Facebook and Instagram, and other news topics.
Second, we considered only a few measures to capture producer persona, consumer persona, message characteristics, and diffusion. For example, we used only the cascade size as a measure of message diffusion. Future research can also investigate other cascade characteristics, such as cascade depth and structural virality, to measure message diffusion. Also, our basic units of analysis were words, and we did not content analyze pictures and videos since our sample was tweets related to COVID-19. We used sophisticated software capable of analyzing words, punctuation, emoticons, and emotions, which was apt for our data related to COVID-19. However, future research may consider content analyzing pictures, videos, and emojis too when examining other topics.
Third, in addition to the producers’ persona considered in this study, a news source’s intentions (financial, political, for power, for popularity, etc.) (J. Zhao et al., 2014) and its polarization level and political bias (X. Zhou and Zafarani, 2018) are further factors in identifying unreliable sources. Malicious users intentionally produce misinformation to manipulate their audience, while other users unintentionally engage with and spread misinformation on social media (Rubin, 2010). Future research can examine these characteristics to generate more insights about the role of the producer in creating, engaging with, sharing, and spreading deceptive messages on social media and in combating the spread of misinformation (Lazer et al., 2018). In this study, we assumed that the diffusion of misinformation is not driven by platform policies and algorithms. The role of platform policies and algorithms in information vs. misinformation diffusion on social media can be considered in future studies. Further, in this study, we focused only on misinformation diffusion on a single platform. Future research would benefit from examining cross-platform diffusion of misinformation since users frequently communicate and share information across multiple platforms such as Twitter, Facebook, and YouTube (B. Wang and Zhuang, 2018).
The current study identifies major factors in misinformation diffusion and the dynamics between them. The proposed framework is a preliminary attempt to advance the theory on misinformation diffusion and provides thrust for future empirical examination.
Footnotes
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
