Abstract
Social media is a rapidly expanding set of technology tools that people use to communicate, learn, interact, document, create, and participate in societies worldwide. It is also transforming how social work, among other professions, conducts qualitative research. This study outlines a field-tested method used to analyze data from Reddit, a major social media platform used by 6% of online adults in the United States. It provides a step-by-step account of a Reddit-based qualitative thematic analysis from a social work heuristic lens on the subject of poverty. To our knowledge, no such account of mining social media big data from Reddit for social work practice exists in the literature. Philosophical, ethical, and practical considerations of this method are discussed.
The digital revolution is producing vast quantities of social, psychological, and organizational data that social workers can harness to address society’s most difficult problems. (Coulton et al., 2015, n.p.)
Substantial reciprocity across major social media platforms. Note: Data retrieved from “Social Media Update 2016,” by the Pew Research Center, 2016.
With so many users of social media, it is not surprising that researchers have begun to realize its potential and utility as an abundant and free source of data, especially about how people learn, experience, create, share, and communicate (Giglietto et al., 2012; Jürgens, 2012). Social media data are considered to be an aspect of “big data,” which is a term that describes data sets of enormous size, collected on and from people who are attached to technology in myriad and overlapping ways (such as through credit or debit card purchases at a store, digital health information, use of online streaming media, or posts to social media websites) for the purposes of making predictions and understanding human behavior (Cukier, 2010). One of the first noted instances of using social media as a research tool followed the 2010 earthquake in Haiti, when both Smith (2010) and Oh et al. (2010) analyzed samples of posts related to this disaster on the social networking site Twitter, where posts are known as “Tweets.” Other studies of social media communication have examined how police, health departments, and nonprofit organizations disseminate important information to the public (Denef et al., 2013; Guo and Saxton, 2014; Neiger et al., 2013). We are also learning that such experiences have profound real-world consequences, as studies on youth violence in social media spaces have demonstrated (Patton et al., 2014).
Technology also affects the way that human services are designed, delivered, and researched, especially the innovative potential, utility, utilization, and limits of social media as a research and intervention tool in countries across the globe (Bredl et al., 2012; Broadhurst, 2016; Chan and Holosko, 2016; Giglietto et al., 2012; Goldkind, 2015; Johnsen et al., 2003; Marziali, 2006; Moylan et al., 2015; Patton et al., 2016; Sunderland et al., 2014). For example, in the field of child welfare and adoption in the United Kingdom, social media has been shown to be helpful to practice, but it also poses some challenges (Greenhow et al., 2017; Simpson, 2013). In this area, content derived from parents, children, and caretakers figures in child welfare court cases (Sage and Sage, 2016), and social workers revealed that clients frequently attempted to connect with them via social media (Breyette and Hill, 2015).
In the area of mental health, social media is used increasingly in mutual support and help-seeking for trauma and various mental health issues. Survivors of sexual assault, specifically, found support in social media forums, and they turned to social media when they reported feeling that they had nowhere else to go to receive support (Rosetta and Webber, 2012). In their study of shame, guilt, depression, and anger, Brennan et al. (2016) found that some sexual assault offenders used the social media site Reddit to process their emotions after committing an assault. In a study of N = 527 U.S. teenage males, nearly half used social media to find specific health information, and those who sought peer support online reportedly had better mental health outcomes (Best et al., 2016). Additionally, in North America, online mental health support groups are growing in popularity faster than the ability to staff them. To offset these workload realities, both researchers and practitioners are creating draft responses to online posters using natural language generation, and they have found that automated, naturalistic responses to online posts about depression and anxiety, while grammatically correct and varied, can be distinguished from actual human-generated responses (Hussain et al., 2015).
From outreach and engagement to service utilization and evaluation, social media provides rich information for social work practitioners and researchers alike. For example, Chan and Holosko (2016) proposed a framework in which social workers used social media to facilitate outreach and engagement with young people in Hong Kong, particularly in terms of the client’s initial search, first encounters with potential clients, ice-breaking, and snowballing. Hutchinson and Jackson (2014) noted that social media is a useful means, outside of institutional bounds, to learn about the public’s experiences with nursing care, and they found that anonymous blog entries contradicted the rather dominant helping narrative that the field of nursing has constructed for itself. These authors asserted:
Social media provides a unique source of commentary that is potentially of great value to nurses and other providers of health care as individuals are able to freely express opinions and their perception of experiences within the health system that may otherwise remain hidden and unheard. (Hutchinson and Jackson, 2014: 82)
Social media posts are of value to qualitative social work researchers who may wish to understand social phenomena in greater depth, especially the intimate accounts of daily lives, lived experiences, personal insights, and opinions that might not be gathered through traditional qualitative methods (del Fresno García and Lopez Peláez, 2014; Patton et al., 2013). A study by del Fresno García and Lopez Peláez (2014) described how generic drugs are framed online in blogs, forums, and comment aggregators and noted that online social contexts allow the researcher access to a bountiful source of individual narratives as people interact with each other. Due to the cloak of anonymity that an online environment allows, this heuristic context may afford researchers better access to sensitive and/or stigmatizing topics than traditional qualitative methods, in topics such as cancer survivorship (Chou et al., 2011); smoking cessation (Struik and Baskerville, 2014); youth violence (Patton et al., 2014); and bullying (Calvin et al., 2015; Hong et al., 2016).
Social media is also transforming the very act of disseminating qualitative research and knowledge. Moylan et al. (2015) suggested that as technology enhances qualitative inquiry with digital tools to collect primary data, it also advances at such a rapid rate that any suggestions may be outdated by the time they appear in the current academic literature. However, although technologies like social media have the capability to quickly and efficiently disseminate findings from qualitative researchers, they may also potentially threaten traditional ways of transmitting information, such as the printed journal (Ruckdeschel and Shaw, 2013).
Rationale and statement of purpose
Today, computer-associated data are so important that the American Academy of Social Work and Social Welfare (AASWSW) identified “Harness technology for social good” as one of the most important challenges facing the fields of social work and social welfare for the 21st century (Coulton et al., 2015). Social media is ubiquitous in our contemporary culture, and the literature about how to mine big data within it is in nascent form. Indeed, qualitative social work researchers need current methodological strategies to reduce and analyze these data in more rigorous ways, to enable our field to fulfill AASWSW’s charge to harness technology for social good. The purpose of this paper is to outline and describe a method used to analyze Reddit, a major social media platform, by providing a step-by-step account of our own Reddit-based qualitative thematic analysis on the subject of poverty. To our knowledge, no such account of mining social media big data for social work practice exists in the literature. This served as the main rationale for this study.
Our journey into the Reddit Labyrinth
Reddit is a social news and content aggregation website and is considered one of the largest in the world. It is free and has 274 million unique users and eight billion monthly page views (Reddit Audience and Demographics, 2017). Reddit, the self-named “front page of the internet,” is an online bulletin board where community members submit content in the forms of links, pictures, and questions. Unlike other social media platforms such as Facebook and Twitter, which feature linear interactive discussions, Reddit’s unique interactive feature of “upvoting” and “downvoting” engagement displays discussions hierarchically, in which more popular posts essentially rise to the top of the page. This distinct and perpetual voting process is central to Reddit’s mission and design, governing not only the content submissions but the comments as well. Additionally, Reddit users can show further appreciation for another user’s particularly insightful comments by giving them symbolic “Reddit gold,” which is paid for by the giver in actual money or Bitcoin as part of an enhanced membership option and therefore adds to the revenue of the organization. A recent inquiry into factors predicting higher scoring comments on Reddit showed that they are frequently written within 90 minutes of the original post, and they most closely retained the subject matter of the original post (Weninger, 2014).
The central activity of Reddit takes place in its almost 10,000 active communities known as subreddits. These subreddits can be created by any registered user and may focus on any topic or shared interest. Each post to a subreddit contains multiple comment threads, with a parent comment replying directly to the poster and a child comment replying to the parent, thus creating an intricate nested system of ongoing comments. This tiered system provides the structure for diverse and multileveled conversations to take place surrounding each post’s topic. Figure 1 shows the anatomy of a subreddit.
A screenshot of AskReddit illustrating the nested nature of the AskReddit social media platform and showing the original question and resultant discussion.
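The nested parent and child comment structure described above can be pictured as a simple tree. The following is a minimal sketch, not a representation of Reddit's actual data model: the `Comment` class, its field names, and the sample authors and texts are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class Comment:
    """One comment in a thread; field names are illustrative assumptions."""
    author: str
    text: str
    replies: List["Comment"] = field(default_factory=list)


def walk(comment: Comment, depth: int = 0) -> Iterator[Tuple[int, Comment]]:
    """Yield (depth, comment) pairs in reading order, parents before children."""
    yield depth, comment
    for reply in comment.replies:
        yield from walk(reply, depth + 1)


# A parent comment replies to the post; a child comment replies to the parent.
# The sample content below is invented placeholder text.
parent = Comment("user_a", "Reply to the original post", [
    Comment("user_b", "Reply to the parent", [
        Comment("user_a", "Reply to the child"),
    ]),
    Comment("user_c", "Another reply to the parent"),
])

# Print the thread with indentation that mirrors the nesting on the page.
for depth, c in walk(parent):
    print("    " * depth + f"{c.author}: {c.text}")
```

Walking the tree depth-first reproduces the order in which a reader encounters a comment thread on screen, which is also the order in which a thread would be copied into a document.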
In 2014, the technology company Embed.ly (n.d.) created a way to visualize Reddit conversations, showing the complex patterns of communication between users (Virdee, n.d.). Figure 2 is a visualization of a typical Reddit discussion, in this instance, a post of a picture of Mark Twain petting a kitten in a park. Each dot represents a unique post, the size of the dot represents the number of upvotes, a shading scheme identifies similar users who post multiple times, and solid black dots are posts made by unique users.
Visualization of a Reddit discussion on “Mark Twain and kitten in NYC in 1907.” This figure illustrates a pattern of comment threads and the popularity of posts (larger circles represent more popular posts). The shade of each dot distinguishes users who post only once (in black) from users who contribute multiple posts, who are assigned their own color (featured in grayscale here). Source: Virdee, n.d.
The comments used for our research came from the subreddit known as AskReddit. AskReddit has approximately 15.4 million registered users, with between 113 and 154 million page views per month (AskReddit, 2017). It is impossible to know the sociodemographics of these registered users, but the most recent figures from a study by the Pew Research Center found that 6% of internet users in the United States are also users of Reddit (Duggan and Smith, 2013). Given that this study is several years old, it is possible that number is even higher, at least in absolute terms. Further, there was no significant difference in users of Reddit across income groups (Duggan and Smith, 2013). This same study also found that men aged 18 to 29 years were most likely to be the users of Reddit. However, this finding does not offer any true indication of the specific demographics of the commenters (Duggan and Smith, 2013).
AskReddit posts, comments, and upvotes.
Data collection
Although the data itself comprised text in the form of numerous posts, there is also a quasi-observational aspect to this research. Participants in this study are akin to people being observed in a public setting, that is, a park or restaurant. Unlike these public venues where characteristics such as images and voices are readily observable and describable, we cannot provide an accurate portrayal of the writers of these posts due to the concealed nature of how they communicate, that is, through written text only. Kitchin (2002), in fact, argues that online material ought to be considered public data to researchers.
Following the lead of other researchers (cf. Fereday and Muir-Cochrane, 2006), we conducted an inductive and deductive qualitative thematic analysis of posts from AskReddit in January 2016. Our initial guiding research question was “what is the lived experience of poverty?” We found a discussion in response to an anonymous Reddit user [unrelated to the current study] who posed the question, “What do insanely poor people buy, that ordinary people know nothing about?” and saw that this was an abundant source of data to answer our guiding question. First, we created a data set by extracting all N = 21,501 comments into Microsoft Word documents. We read each comment and retained only those that were relevant to our guiding research question. This initial screening process was necessary due to the nature of Reddit discussions, namely that posts can iteratively diverge into topic realms that do not relate to the original post. For example, it was not uncommon for comment threads to start off discussing poverty-related issues and then switch to a discussion of favorite movies or the merits of soymilk over cow milk, etc. We independently determined when a thread began to go off topic, thereby losing relevance to our study question. At that point, we truncated the discussion, and irrelevant comments were not copied into the final data set.
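The truncation rule described above (keep a thread's comments until a reader judges that it has gone off topic) can be expressed in a few lines. This is a hedged sketch only: the screening itself was a human judgment, so the `on_topic` flag here stands for a decision recorded by a researcher and is not computed. The `CodedComment` type, the function name, and the placeholder comments are hypothetical.

```python
from typing import List, NamedTuple


class CodedComment(NamedTuple):
    text: str
    on_topic: bool  # judgment recorded by a human reader, not computed


def truncate_thread(comments: List[CodedComment]) -> List[CodedComment]:
    """Keep comments up to, but not including, the first one judged off topic."""
    kept = []
    for comment in comments:
        if not comment.on_topic:
            break  # everything after the first off-topic comment is dropped
        kept.append(comment)
    return kept


# Placeholder thread in reading order; the texts are invented stand-ins.
thread = [
    CodedComment("on-topic comment 1", True),
    CodedComment("on-topic comment 2", True),
    CodedComment("off-topic tangent", False),
    CodedComment("later comment that returns to the topic", True),
]

kept = truncate_thread(thread)
print(len(kept), "of", len(thread), "comments retained")
```

Note the design consequence: a comment that returns to the topic after a tangent is still excluded, because the thread is cut at the first off-topic comment rather than filtered comment by comment.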
The entire discussion comprised N = 21,501 comments and a visualization is shown in Figure 3. The center circle is from the original poster, and the size of any circle reflects the popularity of the comment, with the largest circles having the most upvotes. Black circles are one-time posters, and people who post more than once are assigned a unique color (shown in grayscale here). The pattern in Figure 3 is typical for the AskReddit component of Reddit, as there is a tendency for more upvoting for AskReddit conversations than for general discussions on Reddit (Virdee, n.d.).
Data visualization of the AskReddit question (posted by an online user unrelated to the study), “What do insanely poor people buy, that ordinary people know nothing about?” This figure illustrates comment threads and the popularity of posts (larger circles represent more popular posts). The shade of each dot distinguishes the original poster, users who post only once (in black), and users who contribute multiple posts, who are assigned their own color (featured in grayscale here).
Documents were then organized by comment thread, with each document containing a parent comment and all of its replies [the term “comment thread” is synonymous with “discussion” and as such, will be used interchangeably]. This resulted in a raw data set of N2 = 107 documents representing a commensurate number of discussions. To capture the most popular reflections from this group of discussions, we decided to only examine the top 25 comment threads, determined by net number of upvotes. As highlighted earlier, a central feature of Reddit social media discussions is the ability of users to vote for or against an individual comment, thus expressing group sentiment. Other researchers of social media have made decisions to include only the most popular, current, or “well-liked posts” (Paulus, 2016). Twenty-five discussions containing N3 = 1495 comments were included in the final data set. The total word count in these 25 discussions was N4 = 105,767.
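The selection step above (rank discussions by net upvotes, keep the most popular, then count comments and words) can be sketched as follows. Several things here are assumptions rather than part of the study: the dictionary layout, the idea that a thread's score is the upvotes minus the downvotes on its parent comment, and a word count based on whitespace splitting, which may differ from a word processor's count. The sample numbers are invented.

```python
from typing import Dict, List, Tuple


def net_score(thread: Dict) -> int:
    """Net upvotes for a thread (assumed: upvotes minus downvotes on the parent)."""
    return thread["upvotes"] - thread["downvotes"]


def top_threads(threads: List[Dict], k: int = 25) -> List[Dict]:
    """Return the k threads with the highest net score, best first."""
    return sorted(threads, key=net_score, reverse=True)[:k]


def summarize(threads: List[Dict]) -> Tuple[int, int]:
    """Return (number of comments, number of whitespace-separated words)."""
    n_comments = sum(len(t["comments"]) for t in threads)
    n_words = sum(len(c.split()) for t in threads for c in t["comments"])
    return n_comments, n_words


# Invented toy data: three threads, of which the top two are kept.
threads = [
    {"id": "t1", "upvotes": 120, "downvotes": 20, "comments": ["a b c", "d e"]},
    {"id": "t2", "upvotes": 300, "downvotes": 250, "comments": ["f"]},
    {"id": "t3", "upvotes": 90, "downvotes": 5, "comments": ["g h", "i", "j k l"]},
]

selected = top_threads(threads, k=2)
n_comments, n_words = summarize(selected)
print([t["id"] for t in selected], n_comments, n_words)
```

With `k=25`, the same two functions would yield the counts of threads, comments, and words that describe a final data set.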
We imported these documents into QSR International’s NVivo 10 software for analysis and coding (QSR International, 2012). We used thematic analysis to examine data for themes, following the steps outlined by Braun and Clarke (2006), initially by using inductive coding, allowing for themes and codes to be developed from detailed readings of the data (Thomas, 2006), then by using deductive coding. The first step in analyzing data began by familiarizing ourselves with the data through thorough, independent, and multiple readings. Additionally, we utilized three thematic analysis techniques suggested by Ryan and Bernard (2003), who synthesized decades of work on thematic analyses to describe a range of representative procedures: (1) “repetition,” that is, combing the codes for topics whose expressions recur throughout the data, (2) “similarities and differences” in comparing codes for convergence or divergence, and (3) “cutting and splitting” or manually grouping codes to develop themes (see Lincoln and Guba, 1985 for further information on “cutting and splitting”). Initial codes were generated to capture aspects of the data that were deemed both interesting and germane to our research question, allowing for subsequent organization of these data (Tuckett, 2005).
The initial inductive coding approach was performed by two of the researchers for the purpose of consistency. After this analysis was completed and a codebook created, the other researchers conducted a deductive analysis using the assigned codes, allowing for increased dependability (Braun and Clarke, 2006). We continued to refine our themes, eliminating overlap, in order to express the essence of each identified theme (Braun and Clarke, 2006). The results were written, and the study itself is currently under review for publication.
It is at this point that we would like to provide a candid and simplified outline of the steps we took to mine and analyze our secondary data from the social media website AskReddit. What follows is a nondefinitive linear synopsis of our process for readers; these particular steps are not meant to be a prescriptive blueprint but rather a codifying of our process:
Suggested steps to using Reddit content as secondary data for qualitative analysis
1. Identify a core enabling research question as the cornerstone of your analysis [in our case, “what is the lived experience of poverty?”].
2. Consult with, and if necessary, obtain approval from your institutional review board, independent ethics committee, ethical review board, research ethics board, or other human subjects-protections entity regarding the use of internet-based data.
3. Decide on your epistemological approach.
4. Decide on your methodological approach.
5. Identify an original post that will provide data to answer your question [in our case, we found the question, “What do insanely poor people buy, that ordinary people know nothing about?” posed by a user unrelated to our study].
6. Decide, based on the research question identified in Step 1, how to sort posts. Some options include:
   - Most popular posts, called “Top Posts” in Reddit.
   - Most controversial posts.
   - Newest to oldest posts.
   - Posts made during a specific time range, for example, immediately after an event or during a process, such as national voting.
7. Decide the time frame in which you will collect the data; this may be difficult as people can continue posting long after a thread has been in play.
8. Download comments into a Word or Excel document; we used the “copy and paste” approach; we also created a separate document for each comment thread and only copied and pasted until the thread went off-topic.
9. Clean all data; assign pseudonyms for the screen name of the person posting; even though posts are written under anonymized screen names, the nature of Reddit and other social media sites means that a community often forms and people become “known” by screen name.
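The data-cleaning step above calls for replacing screen names with pseudonyms. One way this could be done consistently, so that a person who posts several times keeps the same pseudonym throughout, is sketched below. This is an illustration under stated assumptions, not the procedure we followed: the function name, the `Poster_001` label format, and the sample screen names are all hypothetical.

```python
from typing import Dict, List, Tuple


def assign_pseudonyms(
    comments: List[Tuple[str, str]],
) -> Tuple[List[Tuple[str, str]], Dict[str, str]]:
    """Replace each screen name with a stable pseudonym.

    `comments` is a list of (screen_name, text) pairs in reading order.
    Returns the cleaned pairs plus the name-to-pseudonym key, which should
    be stored separately from the data set (or destroyed).
    """
    mapping: Dict[str, str] = {}
    cleaned = []
    for screen_name, text in comments:
        if screen_name not in mapping:
            # Number pseudonyms in order of first appearance.
            mapping[screen_name] = f"Poster_{len(mapping) + 1:03d}"
        cleaned.append((mapping[screen_name], text))
    return cleaned, mapping


# Invented placeholder screen names and texts.
raw = [
    ("name_one", "first comment"),
    ("name_two", "second comment"),
    ("name_one", "third comment"),
]

cleaned, key = assign_pseudonyms(raw)
for pseudonym, text in cleaned:
    print(pseudonym, "|", text)
```

This sketch only replaces the author field. Screen names that other users mention inside the body of a comment would still need to be found and masked by hand or by a separate search.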
When treating social media posts as secondary data, a researcher can search Reddit or AskReddit threads until a topic comes up that is worthwhile and contains enough text to be viable. Researchers wishing to ask a different question using the same raw data set could do so. For example, if a researcher was interested in the dynamics between posters, then data could be analyzed as such, and would not be truncated in the same way. Or, if a study was concerned with the examination of the most “controversial” things said about the subject of inquiry then a decision would be made to pull only those with the lowest ratio of “upvotes” to “downvotes”; likewise, a study could be undertaken of the most disliked posts evidenced by highest number of “downvotes.” Finally, researchers may also choose to post questions themselves, in the quest to collect primary data.
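The two alternative selection rules mentioned above (lowest ratio of upvotes to downvotes, and highest number of downvotes) can be written as two small sorting functions. This sketch assumes that separate upvote and downvote counts are available to the researcher, which may not hold for every way of accessing the site; the field names and sample counts are invented for illustration.

```python
import math
from typing import Dict, List


def vote_ratio(comment: Dict) -> float:
    """Ratio of upvotes to downvotes; infinite when there are no downvotes."""
    if comment["downvotes"] == 0:
        return math.inf
    return comment["upvotes"] / comment["downvotes"]


def lowest_ratio(comments: List[Dict], k: int) -> List[Dict]:
    """The k comments with the lowest upvote-to-downvote ratio."""
    return sorted(comments, key=vote_ratio)[:k]


def most_downvoted(comments: List[Dict], k: int) -> List[Dict]:
    """The k comments with the highest raw number of downvotes."""
    return sorted(comments, key=lambda c: c["downvotes"], reverse=True)[:k]


# Invented toy counts.
comments = [
    {"id": "a", "upvotes": 900, "downvotes": 300},  # ratio 3.0
    {"id": "b", "upvotes": 40, "downvotes": 38},    # ratio about 1.05
    {"id": "c", "upvotes": 5, "downvotes": 60},     # ratio about 0.08
    {"id": "d", "upvotes": 12, "downvotes": 0},     # no downvotes
]

print([c["id"] for c in lowest_ratio(comments, 2)])
print([c["id"] for c in most_downvoted(comments, 1)])
```

The toy data show why the two rules are distinct choices: a widely read comment can collect the most downvotes in absolute terms while still having a favorable ratio.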
Limitations
Projects using social media as data have several limitations, a few of which will be mentioned here. Most importantly, due to the anonymity of Reddit users, it is impossible to know and describe user demographic characteristics. Therefore, any analysis that considers race, gender, disability, sexual orientation, immigration status, national origin, or other identities/contexts and their intersectional ties cannot be achieved. This limitation was a defining feature of Holosko’s (2017) social work study using big data. There is also the legitimate concern about issues of social desirability and how they shape social media postings: for example, are responses accurate, made to promote or curtail discussion, or untruthful? There is conflicting evidence in the literature about this phenomenon. For instance, while some research showed that there is a certain socially enforced conformity in perceptions of people posting on the internet (Weisbuch et al., 2009), other reports suggested that people show their true and unique personalities and opinions on social media, not idealized versions of themselves (Back et al., 2011).
There is also the concern of data quality and integrity (Holosko, 2017). In the field of market research, Branthwaite and Patterson (2011) contended that qualitative inquiry using social media is inferior to conducting interviews and focus groups. They asserted that when conclusions are drawn from social media data mined for mentions of brands or sentiment toward a product, as in marketing research, researchers miss subtle or unspoken narratives that are present in the real-time, live, dynamic, interactive conversations that may occur in focus groups. In our study, however, we were not mining for quantifiable information per se but instead were seeking a rich field of material with which to interact, analyze, and construct meaning. A number of methodological limitations may also be present in studies that qualitatively analyze social media posts, including selection bias, information and confirmation bias, confounding, and emotional contagion (Janssens and Kraft, 2012; Kramer et al., 2014). Finally, we acknowledge that this paper was written from our perspective of social work researchers in the United States, responding to American social work and social welfare research calls to attend to technology and big data (Coulton et al., 2015), which may delimit the transferability of our findings.
Discussion and implications
Typically for social work researchers, the extant literature is used to discuss one’s findings in research studies (Holosko, 2006). Given that there is no comparable data to render such a traditional discussion, the present discussion will be oriented toward candid ‘lessons learned’ from creating and using the methodology presented in this paper. This will be discussed in terms of philosophical, ethical, and practical considerations. Hopefully, taken together, these may be used by other researchers who have interest in qualitative methods and social media.
Throughout this knowledge journey, the researchers experienced considerable early angst and uncertainty, as there seemed to be no blueprint, map, or framework to solve our initial data puzzle. There were thousands of pieces of information that could help illuminate collective opinions about our research subject, yet how would we actually verify, obtain, and reduce them for eventual analysis? The countless possible and plausible ways to work with these rather daunting data occasionally pulled us away from our original question, which was about the lived experiences of poverty. To keep us grounded along the way in the ocean of data, we simply held true to a basic tenet of social science: what will answer our anchoring research question? This necessitated an explicit and restated recognition of our ontological and epistemological orientation. Just as any trip into the internet for information is a distracting experience, faced with a myriad of interruptions and amusements [read this ad! look at this meme! answer this email! read this post! look at this picture!], so too was our project that uses big data.
Our study was deemed exempt from human subjects protocols by the University of Georgia Institutional Review Board, but as social media increasingly becomes a source of data, researchers should be aware of potential ethical implications. For example, to what extent is one’s personal privacy violated in analyses of online posts? Because social media users voluntarily register for social media sites, and they have the option to make profile and comment settings “public” or “private,” it would seem that anyone who posts publicly would be aware that they are open to observation, as would any person in a public setting (e.g., at a sporting event or playground). That said, even when social media users apply public settings to their profiles, they may not be fully aware that researchers could analyze and study their posts (Swirsky et al., 2014). In fact, at this writing, institutional review boards may have varying systematic protocols to address research studies that propose using social media as bona fide sources of data. There are, however, only nascent guidelines currently proposed by research ethicists that can guide this process (see Solberg, 2015).
In the early 2000s, qualitative researchers began to realize the utility of internet-based data for research and the ethical questions it poses (Kitchin, 2002; Waruszynski, 2002). Kitchin argued that ethical standards of traditional research do not apply to what is found in “cyberspace.” Following Mann and Sutton (1998 in Kitchin, 2002), she argued that online users are aware that their text will be read by other people whose identities are not known, that some online users believe that the internet is monitored by the police or other officials, and finally that because online users post under a “username” and not their real name, their identities are hidden. Therefore, in her view, online material should be exempted from the Canadian institutional review board approval process for three reasons: online research poses minimal risk, online text is in the public domain, and a person who posts online in such conditions is not necessarily a “human subject” but rather a “cyber person” (Kitchin, 2002: 170–171). By contrast, in an essay on the implications of using the internet to conduct qualitative research, Waruszynski (2002) argues that individual researchers should make their presence known and ask permission of those they study. We would contend that this proves nearly impossible in the asynchronous setting of archived internet posts. Still, she asks “What are the responsibilities of researchers who are conducting research on ethics in cyberspace?” (Waruszynski, 2002: 156) and answers this question by citing earlier researchers (Schrum, 1995; Sharf, 1999). Schrum (1995) proposed a set of 11 guidelines for online researchers, which included the responsibility of the researcher to share results back with the “electronic community.” Sharf (1999) suggested that qualitative researchers ask themselves whether the research will harm or benefit the group and plan research accordingly.
To date, academic studies that have used Reddit have focused primarily on technical issues regarding the so-called “hows” and “whys” of information flow on the actual site (Haralabopoulos et al., 2015). To our knowledge, this is the first study to propose a systematic method to qualitatively analyze social phenomena using content from users on Reddit. Although a considerable amount of research has been conducted examining the way social media influences people (Ellison et al., 2007), our study fell into a unique category of Reddit research that uses the actual content of posts as the data examined through qualitative research methodology.
The research method that we described herein to mine social media posts from Reddit also has practical considerations. As is known, anything related to technology evolves at a rapid pace, and the practical suggestions we presented to download and analyze these data may be outdated as soon as we table them. For example, we suggested downloading the text into a word processing document; there are newer ways to collect text from other social media platforms, such as the qualitative analysis software NVivo’s ability to quickly extract data from select social media platforms including Facebook, Twitter, and YouTube (Bazeley and Jackson, 2013; Paulus et al., 2014; QSR International, 2017). The qualitative social work researcher faces a range of new possibilities and challenges in using qualitative analysis software (see Drisko, 2013), and we found it a useful tool for both the inductive and deductive phases of our analysis.
Social media comment threads are “public, permanent (although editable), well-formed and hierarchical” (Weninger, 2014: 1) and serve as a veritable gold mine of data for social scientists. That said, Eun-Ok and Chee (2006) identified several threats to achieving methodological rigor in using online discussion boards for qualitative inquiry, specifically the difficulty of achieving the theoretical saturation needed in certain methods, as well as a host of practical constraints. In our experience with these data, we did not find this to be true. The inductive/deductive approach to thematic analysis (Fereday and Muir-Cochrane, 2006), plus the sheer number of posts, enabled us to create valuable meanings from the text, in order to add to an understanding of the lived experiences of poverty (Purser and Caplan, 2017).
In response to AASWSW’s assertive professional charge to harness technology for social good, we presented fellow social work researchers a method for answering their questions using social media posts in the forum Reddit. As one of the “Top Ten Trends Driving Science” (American Chemical Society, 2017), computer-assisted inquiry enables any science researcher to reach far beyond traditional tools of research:
Research has always meant data, but never quite like this. Modern research produces experimental data not just from in vitro and in vivo studies, but also from simulation-driven in silico [emphasis added] work. The ability to interpret, compare, and contrast data sets is essential (p. 16).
Acknowledgements
The authors would like to thank Dr. Michael Holosko for his guidance and support in the production of this manuscript, Dr. Peter Kindle for introducing us to the idea of analyzing social media data, Dr. Treena Paulus for intellectual and collegial support, and Dr. James Drisko for his review of the paper and constructive suggestions. To all, we are truly grateful.
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
