Abstract
Aims. This article analyzes the digital childhood vaccination information network for vaccine-hesitant parents. The goal of this study was to explore the structure and influence of vaccine-hesitant content online by generating a database and network analysis of vaccine-relevant content. Method. We used Media Cloud, a searchable big-data platform of over 550 million stories from 50,000 media sources, for quantitative and qualitative study of an online media sample based on keyword selection. We generated a hyperlink network map and measured indegree centrality of the sources and vaccine sentiment for a random sample of 450 stories. Results. 28,122 publications from 4,817 sources met inclusion criteria. Clustered communities formed based on shared hyperlinks; communities tended to link within, not among, each other. The plurality of information was provaccine (46.44%, 95% confidence interval [39.86%, 53.20%]). The most influential sources were in the health community (National Institutes of Health, Centers for Disease Control and Prevention) or mainstream media (New York Times); some user-generated sources also had strong influence and were provaccine (Wikipedia). The vaccine-hesitant community rarely interacted with provaccine content and simultaneously used primary provaccine content within vaccine-hesitant narratives. Conclusion. The sentiment of the overall conversation was consistent with scientific evidence. These findings demonstrate an online environment where scientific evidence drives vaccine information outside of the vaccine-hesitant community but is also prominently used and misused within the robust vaccine-hesitant community. Future communication efforts should take current context into account; more information may not prevent vaccine hesitancy.
Health experts recognize childhood vaccinations as one of the most important public health achievements of the 20th century (Centers for Disease Control and Prevention [CDC], 2013; Sadaf, Richards, Glanz, Salmon, & Omer, 2013). A large majority of parents (88%, according to a Pew Internet survey) believe the benefits outweigh the risks of childhood vaccinations; however, even this rate may not be enough to prevent outbreaks (Funk, Kennedy, & Hefferon, 2017). A subset of parents is refusing some or all vaccines or delaying immunization, a behavior defined as vaccine hesitancy by the World Health Organization Strategic Advisory Group of Experts (SAGE; MacDonald, 2015). The increasing numbers and combinations of vaccines combine with globalized, nonhierarchical communication to make today’s challenges in combating vaccine hesitancy particularly complex (Larson, Jarrett, Eckersberger, Smith, & Paterson, 2014). Increased efforts are under way to provide vaccine-hesitant parents with evidence-based information to support positive vaccination decisions (Larson et al., 2014; Shoup et al., 2015).
In 2012, the SAGE working group was appointed to address vaccine hesitancy. This article focuses on the SAGE working group’s “Confidence” hesitancy group, for whom trust in the evidence supporting vaccinations plays an important role (MacDonald, 2015). These parents engage in thoughtful information-seeking behaviors and rely on their own research to make vaccination decisions (Peretti-Watel, Larson, Ward, Schulz, & Verger, 2015; Reich, 2016). Thus, their media environments are of particular importance (Larson et al., 2014). In 2012, 72% to 90% of adult Internet users reported searching for health information online (Cole, Suman, & Lebo, 2013; Pew Research Center, 2015), and the Internet is a central tool for parent-driven vaccine research (Buis & Carpenter, 2009; Guidry, Carlyle, Messner, & Jin, 2015; Nan & Daily, 2015; Restivo et al., 2015; Rundblad, 2015; Venkatraman, Garg, & Kumar, 2015; Weaver, Thompson, Weaver, & Hopkins, 2009; Weiner, Fisher, Nowack, Basket, & Gellin, 2015; Wheeler & Buttenheim, 2013). Social media platforms are likewise serving as emerging sources for information (Feng & Xie, 2015; Thackeray, Crookston, & West, 2013; Wilson & Keelan, 2013).
The Internet is a network of information and misinformation about vaccines. Parents have refused or delayed vaccinations based on online information (Kata, 2012; LaVail & Kennedy, 2012; Weiner et al., 2015; Wheeler & Buttenheim, 2013; Witteman & Zikmund-Fisher, 2012). One study found that the Internet was a leading factor associated with unvaccinated children (Restivo et al., 2015); another study discovered geographic clustering of vaccine hesitancy correlated with online misinformation (Wilson & Keelan, 2013). Together, these findings suggest that vaccine-hesitant parents find much of their vaccine information online from nontraditional public health sources, that they trust the information from these nontraditional sources more than the information from traditional sources, and that this change in modes of authority is having negative impacts on vaccination rates.
The goal of this study was to better understand the architecture of the network of online publishers of vaccine content and the modes of authority used by the vaccine-hesitant community in that network. We used network analysis to identify the most influential publishers and the pattern of information spread within the vaccine-hesitant community. We were specifically interested in comparing the traditional modes of authority used by scientific academia, by mainstream media, and by crowd platforms within the network.
Previous scholarship on online networking, echo chambers, and behavior has largely focused on social media interaction (Bessi et al., 2015; Del Vicario et al., 2016). This study, however, focuses on the architecture of hyperlinks on the open Web. While we know that social media users create “bubbles” of like-minded users that limit the kinds of information they see, here we explore whether and how those echo chambers exist outside social network platforms. What kind of information will a parent who browses the Web for vaccine information get, and from what perspective? Once misinformation is found, how hard might it be to find better information? To our knowledge, this study and its methodology are the first attempt to apply a large-scale network analysis to the coverage of vaccination on the open Web.
Method
To analyze the content of the online debate over vaccines, we used the open Media Cloud platform (Media Cloud, 2015). Media Cloud provides a searchable archive of over 550 million stories from 50,000 media sources and tools to search and analyze that archive. It has been used to perform network analysis of a variety of politically oriented controversies online (Benkler, Roberts, Faris, Solow-Niederman, & Etling, 2013; Faris, Roberts, Etling, Othman, & Benkler, 2015; Graeff, Stempeck, & Zuckerman, 2014).
Media Cloud organizes its collection into sources and stories. A source is defined as a publishing entity under the editorial control of an institution (e.g., mainstream media outlet such as The New York Times) or an individual (e.g., a personal blog). A story is a single unit of content, for example, a news story, a Wikipedia page, or an academic paper. We searched for stories that included at least one instance of a word beginning with the stem “vaccin.” The search included all stories published from June 1, 2014, to March 1, 2015, from a variety of collections of sources, focusing mostly on regional and national U.S. mainstream media and political blogs. These dates captured important public health current events related to childhood vaccines, such as the 2014/2015 measles outbreak in the United States and the resulting legislation in California, Senate Bill 277, eliminating personal belief and religious exemptions for school-aged children. This law, which went into effect on July 1, 2016, has already been cited for increased vaccination rates for students (Sun, 2017).
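The keyword and date criteria described above can be illustrated with a short sketch. This is not Media Cloud's query implementation; it assumes a simple regular expression for "a word beginning with the stem vaccin" and uses hypothetical story records with `text` and `published` fields.

```python
import re
from datetime import date

# Matches any word that begins with the stem "vaccin" (vaccine, vaccination, ...).
# The \b anchor means words such as "unvaccinated" do not match; whether the
# original search treated such words as matches is not stated in the text.
VACCIN_STEM = re.compile(r"\bvaccin\w*", re.IGNORECASE)

# Study window taken from the text.
START, END = date(2014, 6, 1), date(2015, 3, 1)


def matches_topic(story):
    """Return True if the story falls in the window and mentions the stem."""
    in_window = START <= story["published"] <= END
    return in_window and VACCIN_STEM.search(story["text"]) is not None


# Hypothetical stories, for illustration only.
stories = [
    {"text": "Measles vaccination rates fell.", "published": date(2014, 12, 20)},
    {"text": "Vaccines and autism: the evidence.", "published": date(2013, 1, 5)},
    {"text": "Local election results.", "published": date(2015, 1, 10)},
]
matched = [s for s in stories if matches_topic(s)]
```

Only the first hypothetical story passes both filters: the second is outside the date window and the third never mentions the stem.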
For each story discovered, Media Cloud downloaded the HTML and extracted the substantive text, using the python-readability library to eliminate advertisements, navigation, and other surrounding boilerplate. The Media Cloud topic spider then downloaded each link in the substantive text of each matching story and added to the topic any pages matching the “vaccin” keyword. In this spidering approach, only the links in the substantive text are captured, increasing the relevancy of the links themselves by eliminating the links in the boilerplate that are incidental to the vaccine topic. The spider repeated this iterative process 15 times, by which point it was discovering very few new stories. We determined the influence of each source by measuring its hyperlink indegree centrality (explained below). We then generated network maps based on the link structure between sources and explored the relationship of the structure of that map to the coded sentiment values.
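The iterative spidering step can be sketched as a bounded breadth-first crawl. The code below is a simplified illustration, not the Media Cloud topic spider: `fetch` is a hypothetical callable that returns the substantive text and the links found in that text, and the toy `WEB` dictionary stands in for real pages.

```python
import re

VACCIN_STEM = re.compile(r"\bvaccin\w*", re.IGNORECASE)


def spider(seed_urls, fetch, max_rounds=15):
    """Follow links out of matching stories for up to `max_rounds` rounds.

    `fetch(url)` is a hypothetical callable returning (text, links) for the
    substantive text of a page, or None if the page cannot be loaded.
    A real crawler would cache pages instead of fetching them twice.
    """
    topic = set(seed_urls)       # stories accepted into the topic
    seen = set(seed_urls)        # every URL examined so far
    frontier = list(seed_urls)   # matching stories whose links are unexplored
    for _ in range(max_rounds):
        next_frontier = []
        for url in frontier:
            page = fetch(url)
            if page is None:
                continue
            _text, links = page
            for link in links:
                if link in seen:
                    continue
                seen.add(link)
                linked = fetch(link)
                # Only pages that match the keyword join the topic and are
                # expanded in the next round.
                if linked is not None and VACCIN_STEM.search(linked[0]):
                    topic.add(link)
                    next_frontier.append(link)
        if not next_frontier:    # nothing new discovered: stop early
            break
        frontier = next_frontier
    return topic


# Toy link structure, for illustration only.
WEB = {
    "a": ("vaccine story", ["b", "c"]),
    "b": ("vaccination follow-up", ["d"]),
    "c": ("sports news", ["e"]),
    "d": ("vaccines and measles", ["a"]),
    "e": ("vaccine item linked only from an off-topic page", []),
}
found = spider(["a"], WEB.get)
```

In this toy web, page "e" is never added even though it mentions vaccines, because the only page linking to it does not match the keyword and so is never expanded.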
Hyperlink Indegree Centrality: Hyperlink indegree centrality is defined as a measure of online influence determined by how connected a source is online (Jackson, 2008). It measures the number of Web pages published by other sources that are linking directly to that specific source. We generated this metric by computing the number of inlinks to a given source from stories in our set by other sources. We based this approach on the assumption that a linked site has exerted some influence over the publisher linking to it; linking behavior does not necessarily imply agreement but has been shown to indicate influence on behavior (Hargittai, Gallo, & Kane, 2007).
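As a minimal sketch of this metric, the function below counts, for each source, the distinct stories published by other sources that link to it. The input format (story-to-story link pairs plus a story-to-source lookup) and all identifiers are assumptions made for illustration, not the Media Cloud data model.

```python
from collections import defaultdict


def source_indegree(story_links, story_source):
    """Count, per source, the distinct stories from other sources linking to it.

    story_links: iterable of (from_story, to_story) hyperlink pairs.
    story_source: dict mapping each story id to the source that published it.
    """
    linking_stories = defaultdict(set)
    for from_story, to_story in story_links:
        src = story_source[from_story]
        dst = story_source[to_story]
        if src != dst:  # links within the same source do not count
            linking_stories[dst].add(from_story)
    return {source: len(stories) for source, stories in linking_stories.items()}


# Hypothetical stories and links, for illustration only.
story_source = {
    "nyt/1": "nytimes.com",
    "nyt/2": "nytimes.com",
    "cdc/1": "cdc.gov",
    "blog/1": "exampleblog.com",
}
story_links = [
    ("nyt/1", "cdc/1"),
    ("nyt/2", "cdc/1"),
    ("blog/1", "cdc/1"),
    ("blog/1", "cdc/1"),   # repeated link from the same story counts once
    ("blog/1", "nyt/1"),
    ("nyt/1", "nyt/2"),    # same-source link, ignored
]
indegree = source_indegree(story_links, story_source)
```

Sources that receive no cross-source links are simply absent from the result, which is equivalent to an indegree of zero.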
Sentiment: The researchers coded a random sample of 450 stories for sentiment: provaccine, antivaccine, primary science (peer-reviewed), NA (broken link or not relevant), or none (balanced or no clear sentiment). The sample included only stories that received at least one cross-media hyperlink. This mostly random sampling method was chosen to include a wide spectrum of network-engaged content. The detailed codebook is available in the Harvard Dataverse data set linked at the end of this article. Stories that mentioned vaccines in any positive light without a countervailing negative opinion were coded as provaccine. The authors achieved an intercoder reliability score (Krippendorff’s alpha) of .82. Stata was used for statistical analysis of these communities and the story sample.
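For readers unfamiliar with the reliability statistic, the following sketch computes Krippendorff's alpha for nominal codes from a coincidence matrix. It is a generic textbook formulation, not the authors' Stata procedure, and the labels in the example are invented; it does not reproduce the .82 reported above.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal data.

    units: list of lists; each inner list holds the labels that the coders
    assigned to one story. Stories with fewer than two labels are skipped.
    Raises ZeroDivisionError if every label is identical (alpha undefined).
    """
    coincidences = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue
        # Every ordered pair of labels from different coders adds 1 / (m - 1).
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())

    # Observed and expected disagreement; the shared 1/n factor cancels.
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    return 1.0 - observed / expected


# Invented codes from two coders on five stories, for illustration only.
coded = [
    ["pro", "pro"],
    ["pro", "pro"],
    ["anti", "anti"],
    ["anti", "anti"],
    ["pro", "anti"],
]
alpha = krippendorff_alpha_nominal(coded)
```

With four agreements and one disagreement across two evenly used categories, the observed disagreement is 0.2 and the expected disagreement is about 0.556, giving an alpha of 0.64.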
Link Network Analysis: Last, we analyzed the link network of the 500 sources with the highest hyperlink indegree centrality. We assigned communities to each of those 500 sources using the Louvain community detection algorithm (Blondel, Guillaume, Lambiotte, & Lefebvre, 2008). We generated a visual graph from those sources, with the sources as the nodes and the hyperlinks between sources as unweighted, undirected edges. Using the open-source visualization software Graphviz, we then laid out the graph using the neato algorithm (North, 2004) and colored each node by its Louvain community. We sized each node in the map by its hyperlink indegree centrality value.
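The map construction can be sketched by writing a Graphviz DOT description that neato can lay out. The sketch assumes community assignments and indegree values have already been computed; the function name, the color palette, the node-size scaling, and the example sources are all illustrative choices, not the authors' settings.

```python
def to_dot(edges, community, indegree, palette):
    """Build a DOT string for an undirected, unweighted source graph.

    edges: iterable of (source, target) hyperlink pairs between sources.
    community: dict mapping each source to its community label.
    indegree: dict mapping sources to hyperlink indegree (missing means 0).
    palette: dict mapping community labels to Graphviz color names.
    """
    max_in = max(indegree.values()) or 1
    lines = [
        "graph vaccine_sources {",
        "  layout=neato;",
        "  overlap=false;",
        '  node [shape=circle, style=filled, label=""];',
    ]
    for node in sorted(community):
        # Node width in inches grows linearly with indegree (arbitrary scale).
        width = 0.2 + 1.0 * indegree.get(node, 0) / max_in
        color = palette[community[node]]
        lines.append(
            f'  "{node}" [width={width:.2f}, fixedsize=true, fillcolor="{color}"];'
        )
    # Collapse directed hyperlinks into unique undirected edges.
    undirected = {tuple(sorted(pair)) for pair in edges if pair[0] != pair[1]}
    for a, b in sorted(undirected):
        lines.append(f'  "{a}" -- "{b}";')
    lines.append("}")
    return "\n".join(lines)


# Hypothetical inputs, for illustration only.
palette = {"hesitant": "green", "health": "pink", "provaccine": "blue", "media": "yellow"}
community = {
    "cdc.gov": "health",
    "nytimes.com": "media",
    "nih.gov": "hesitant",
    "exampleblog.com": "hesitant",
}
indegree = {"cdc.gov": 3, "nytimes.com": 1, "nih.gov": 2}
edges = [
    ("nytimes.com", "cdc.gov"),
    ("cdc.gov", "nytimes.com"),
    ("exampleblog.com", "nih.gov"),
]
dot = to_dot(edges, community, indegree, palette)
```

Saving `dot` to a file such as `map.dot` and running `neato -Tpng map.dot -o map.png` would render the layout, assuming Graphviz is installed.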
We generated, for each of the Louvain communities, a list of the 10 sources with the most inlinks from other sources within the same community. We assigned a label to each of the Louvain communities by qualitatively examining the content of each of the sources in each of those top 10 community source lists. We cross-referenced the story sentiment statistics by the Louvain communities to determine vaccine sentiment within each community and to validate the community labels.
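A possible way to build the per-community source lists is sketched below. It assumes source-level link pairs (one pair per linking story) and a precomputed community assignment; the single-letter source names are placeholders.

```python
from collections import Counter, defaultdict


def top_sources_by_community(source_links, community, n=10):
    """Rank sources by inlinks received from other sources in their community.

    source_links: iterable of (from_source, to_source) pairs.
    community: dict mapping each source to its community label.
    Returns a dict of community label -> list of up to n sources.
    """
    within = defaultdict(Counter)
    for src, dst in source_links:
        if src == dst:
            continue
        label = community.get(src)
        # Count the link only when both ends share a known community.
        if label is not None and label == community.get(dst):
            within[label][dst] += 1
    return {
        label: [source for source, _count in counts.most_common(n)]
        for label, counts in within.items()
    }


# Placeholder sources and links, for illustration only.
community = {"a": "x", "b": "x", "c": "x", "d": "y", "e": "y"}
source_links = [
    ("a", "b"), ("c", "b"), ("a", "c"),   # links inside community x
    ("d", "b"),                           # cross-community link, ignored
    ("d", "e"), ("e", "d"), ("d", "e"),   # links inside community y
]
top = top_sources_by_community(source_links, community, n=2)
```

Because cross-community links are excluded, a source can rank highly overall yet low within its own community, which is the distinction the community lists are meant to expose.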
Results
The link network map is shown in Figure 1. Each node in the map is one of the top 500 most inlinked sources and is colored by one of four communities detected by the Louvain algorithm: a vaccine-hesitant community (green) arguing against vaccination, a health and science community (pink) providing scientific evidence supporting vaccination, a provaccine community (blue) directly confronting the arguments of the vaccine-hesitant community, and a mainstream media community (yellow). Note that these labels represent the most prevalent sentiment and valence of the most influential sources within these link-structured communities. They do not reflect every individual story, or source, within each community.

Figure 1. The network map of the top 500 publishers about vaccines by hyperlink indegree centrality, demonstrating social clustering, who is linking to one another, and how frequently: a vaccine-hesitant community (green), a health and science community (pink), a provaccine community (blue), and a mainstream media community (yellow) that used language common to both provaccine lay audiences and science.
The separation between the vaccine-hesitant and mainstream media communities on the map indicates that sources within these communities are only rarely interacting with one another through hyperlinks. Most link interaction between communities happens between the mainstream media and the provaccine and health and science communities, between the provaccine and the health and science communities, and between the vaccine-hesitant and the provaccine and health and science communities.
The Media Cloud search and spidering process discovered nearly 50,000 stories published by 4,817 sources. Of those stories, 28,122 stories received at least one link from a story in a different media source. A single researcher review of 100 random stories from the resulting set found that 98 of those stories were relevant in the sense that they included nonmetaphorical mentions of human vaccines. The 15 most common words (minus stopwords) within the topic stories indicate a strong focus on childhood vaccination: “vaccine,” “children,” “autism,” “virus,” “infection,” “immunization,” “influenza,” “scientific,” “studies,” “flu,” “mmr,” “hospital,” “cancer,” “measles,” “child.” Nine of the 10 stories with the most inlinks were directly about childhood vaccination.
Table 1 shows sentiment scores by community, with the “none” and “not applicable” sentiments omitted. Within the coded sample of 450 stories, 46.44% were provaccine (95% confidence interval [CI; 39.86%, 53.20%]), and only 14.67% were antivaccine (95% CI [6.13%, 23.21%]). Of the 93 coded stories in the mainstream media link community, 66.67% were provaccine (95% CI [54.94%, 78.40%]). Only 15 of the 93 mainstream media stories were coded as either antivaccine or neutral (16 were coded as not applicable, overwhelmingly because the pages would not load at the time of the coding). Of all the antivaccine stories in the coded sample set, 70.21% were within the vaccine-hesitant community (95% CI [54.61%, 85.81%]), even though that community included only 30.48% of the stories in the community detection.
Table 1. The Number and Percentage of Stories Within Each Community by Story Sentiment.
Stories coded as NA or None are not included here because they have no bearing on the analysis.
The three separate measures of network layout, community detection, and sentiment support the validity of the clustering of sources within the above communities. The network layout and community detection rely on the same network structure but use distinct algorithms to find patterns in that structure. Others have found the common use of a combination of layout and community detection algorithms to be potentially troublesome for this reason (Waltman, van Eck, & Noyons, 2010), but an advantage of this algorithmic diversity is that consistent findings of clustering by the two different algorithms (as shown in Figure 1 by the uniform coloring of each map quadrant) provide some orthogonal reinforcement for one another. Moreover, the story sentiment findings are derived separately from the two network algorithms, so the patterns within that data (the prevalence of antivaccine stories within the vaccine-hesitant community, the prevalence of primary research within the health and science community, and the prevalence of provaccine stories within the provaccine community) provide further evidence for the composition, layout, and labeling of those communities.
One important ambiguity in the results is the layout and community identification of scientific authorities. Many of the sources that act as traditional scientific authorities, including the National Institutes of Health (NIH), are located near the border between the vaccine-hesitant community and the health and science community, indicating that these sites participate roughly equally in the link economies of both communities. Some of those sites are placed within the vaccine-hesitant community, indicating not that they publish antivaccine content but rather that they are more highly embedded within the link economy of the vaccine-hesitant community than the health and science community. This ambiguity is the source of the surprisingly high provaccine sentiment in the vaccine-hesitant community. All 14 stories coded as provaccine within the vaccine-hesitant community come from the following sources: Food & Drug Administration, NIH, Oxford Journal, YouTube, Health Impact News, Wiley, gpo.gov, and WebMD.
Table 2 shows the 10 most inlinked sources overall and within each of the detected communities. Of particular note are the locations of four of the most inlinked sources in the overall set, each of which represents a different mode of vaccine authority. Each of these sources is influential overall but disproportionately influential within a specific community on the map—the CDC as public health authority within the health and science community, the NIH as scientific/academic authority within the vaccine-hesitant community, Wikipedia as crowd-sourced authority within the provaccine community, and the New York Times as mainstream media authority within the mainstream media community.
Table 2. The Top 10 Sources From the Network of Publishers on Vaccines Overall and by Each of the Four Social Communities, Based on Hyperlink Indegree Centrality (the 10 Sources Receiving the Most Inlinks).
The CDC is the most prominent public health institution in the United States; it has the most inlinks from all sources in the controversy and ranks third on inlinks in the health and science community. Almost all the CDC inlinks within the health and science community are to secondary articles that summarize public health information about vaccines and related diseases.
The NIH is the largest public funder of public health research in the world (Viergever & Hendriks, 2016); its National Center for Biotechnology Information (NCBI) website houses a series of bibliographic databases that provide open access to abstracts of biomedical research and related literature. NCBI has the second most inlinks within the overall network and is the most linked source within the vaccine-hesitant community. A single-researcher review showed that nearly all its inlinks within the vaccine-hesitant community are direct links from antivaccine stories to abstracts of peer-reviewed scientific papers.
Wikipedia is the most prominent example of crowd-sourced knowledge production on the Internet. It has the third most inlinks overall within the controversy and is the most inlinked source within the provaccine community. A wide variety of Wikipedia articles receive links within the provaccine community. The five most linked pages are “MMR vaccine,” “Andrew Wakefield,” “herd immunity,” “Maurice Hilleman,” and “Katie Couric.” These links generally come from stories within the provaccine community that make wide-ranging arguments for vaccination and use Wikipedia to give scientific, historical, and cultural context.
The New York Times is one of a few media sources that claim the authority of the mainstream medium of record for the country. It has the fourth most inlinks from sources within the mainstream media community and the seventh most inlinks from all communities. The New York Times stories with the most inlinks from the mainstream media community all address specific controversies that gained traction in the media, including the lack of an Ebola vaccine, the measles outbreak at Disneyland, and the Lancet retraction of the widely discredited 1998 Wakefield autism paper.
Discussion
One way to view our method and results is to think of the hyperlink network graph as a map of the content available for a user to navigate as she or he follows a series of links from one Web page to another, having started from some given point in the network. In this sense, the link network that we describe in this article acts as the architecture within which users must make decisions about how to find and make sense of conflicting information about vaccination.
Our results do not point toward any magic bullets that will eliminate the community of vaccine-hesitant parents online, but we think they can help researchers and practitioners better understand the architecture within which vaccine-related content is published and consumed. The biggest lesson from this architecture is that the vaccine-hesitant community is a separate but robust network that is resistant to the influence of other online, provaccine communities.
Another lesson is that simply publishing more provaccine or primary research content online is unlikely to influence the vaccine-hesitant community. Our data show that antivaccine content is already uncommon online, that the large majority of that uncommon antivaccine content is published within the vaccine-hesitant community, that antivaccine content is especially uncommon in mainstream media, that the vaccine-hesitant community shows strongly clustered linking behavior toward itself, and that the vaccine-hesitant community remains robust while commonly linking to primary research content. Together, we think these findings suggest that merely promoting the publication of more provaccine material or more primary research content is unlikely to significantly change the architecture of the vaccine controversy online, with its currently robust vaccine-hesitant component.
Similarly, our findings suggest that encouraging mainstream media to write fewer “he-said-she-said balanced” stories about vaccines is unlikely to significantly change the architecture of the network, despite research suggesting this approach (Dixon & Clarke, 2013). It may be that mainstream media has an influence on vaccine-hesitant behavior by directly influencing its readers. But our research indicates both that mainstream media sources are already overwhelmingly provaccine and that the online vaccine-hesitant community does not interact strongly with the mainstream media, at least through hyperlinks. A more provaccine message from mainstream media is unlikely to shift the fundamental architecture of the network that segregates antivaccine content within the vaccine-hesitant community and minimizes linking between the vaccine-hesitant and mainstream media communities.
Finally, our findings inform our understanding of the different modes of authority used within the four communities we identified. The hierarchical mode of authority in science is based on peer-reviewed publication of highly technical papers that must be accepted by the public based on trust in the scientific community. And, in fact, several researchers attribute the growth of the vaccine-hesitant community to the breakdown of this hierarchical mode of authority online (Kata, 2012; Witteman & Zikmund-Fisher, 2012).
However, our findings show that the hierarchical mode of academic/scientific authority is not helpful in reducing the viability of the vaccine-hesitant community online. In fact, the most effective usage of this mode of authority online is by the vaccine-hesitant community itself to reinforce its vaccine-hesitant narrative. The placement of the NCBI within the vaccine-hesitant community is strong evidence that, at least within the hyperlink economy, primary research about vaccines is more likely to be used within stories arguing for vaccine hesitancy than within stories making provaccine arguments. We conjecture that the abundance of direct links to NCBI abstracts is useful within the vaccine-hesitant community precisely because academic papers are so technical and specific, and therefore opaque to readers not deeply versed in the literature that gives context to individual papers. Additionally, the full papers are often not available, making it easier to include an abstract, minus any study limitations, as a method for backing a vaccine-hesitant narrative.
Wikipedia offers a contrasting mode of crowd-based authority. On Wikipedia, authority is generated through a crowd of laypeople editing each article, and through making content both accessible to the lay reader and useful for providing high-level context to specific issues. We conjecture that the prevalence of links to Wikipedia articles within the provaccine community reflects these aspects of authority-through-the-crowd. Articles that are intelligible to the lay reader and that present their content in and for context are more transparent to the reader and harder to co-opt for a specific agenda (unlike abstract linking). None of the provaccine sources, including Wikipedia, are successful at making inroads within the vaccine-hesitant community, but the Wikipedia stories at least seem more resistant to misuse by the vaccine-hesitant community than individual abstracts on NCBI.
Due to the novelty of this study, limitations exist. The network of vaccine publishers is dynamic and constantly evolving. Further study is needed to identify consistent patterns of network behavior and their implications, beyond just our restricted study period. Although network clustering of vaccine-hesitant behavior occurs geographically (Birnbaum, Jacobs, Ralston-King, & Ernst, 2013; Lieu, Ray, Klein, Chung, & Kulldorff, 2015; Omer et al., 2008; Smith, Chu, & Barker, 2004) and our results demonstrated clustering online as well, further study is needed to determine the strength of association between online network behavior and off-line behavior and health outcomes. Our methods did not assess demographic characteristics of those linking to sources, which would be useful in future studies. Future research could include deeper analysis of individual stories published by each community of sources in our network map. Our study addresses the hyperlink architecture of the network but does not directly address audience consumption of vaccine content by looking at measures like source audience size or social media sharing. And our study uses only a small fraction of the available algorithms and techniques for analyzing network structures. Overall, our study begins to illustrate emerging challenges with online information sources and paves the way for future study to develop and test potential solutions.
Conclusions
Our study indicates that the online community of vaccine-hesitant parents is robust in the face of an overwhelming prevalence of provaccine content online, that more provaccine content in mainstream media is unlikely to be effective in shrinking the size of the vaccine-hesitant community, and that a crowd-based mode of authority that provides easily understandable, contextualized information about vaccines may be more effective at distributing provaccine messages than the current mode of relying on the authority of decontextualized and often pay-walled scientific papers. As health communication and education professionals build from the work of the SAGE, results from this study can serve as an important step in beginning to understand emerging contextual influences within the public health communication and media environment so that we can adapt our communication strategies.
Footnotes
Authors’ Note
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: All phases of this study were supported by the Pershing Square Venture Fund for Research on the Foundations of Human Behavior. Media Cloud was made possible by the generous support of the Ford Foundation, The Open Society Foundation, and the John D. and Catherine T. MacArthur Foundation.
