Abstract
In this article we present an analysis of the discursive connections between Islamophobia and anti-feminism on a large Internet forum. We argue that the incipient shift from traditional media toward user-driven social media brings with it new media dynamics, relocating the (re)production of societal discourses and power structures and thus bringing about new ways in which discursive power is exercised. This clearly motivates the need to critically engage this field. Our research is based on the analysis of a corpus consisting of over 50 million posts, collected from the forum using custom web crawlers. In order to approach this vast material of unstructured text, we suggest a novel methodological synergy combining critical discourse analysis (CDA) and topic modeling – a type of statistical model for the automated categorization of large quantities of texts developed in computer science. By rendering an overview or ‘content map’ of the corpus, topic modeling provides an enriching complement to CDA, aiding discovery and adding analytical rigor.
Keywords
Introduction
In a Scandinavian context, Islamophobia and anti-feminism have in recent years been central within the public debate. Feminists and Muslims have been the main targets of net-hatred in blogs and in comments fields, and these two groups are often characterized as two sides of the same ideological coin. This tendency is also reflected in the manifesto of the Norwegian extreme-right terrorist Anders Behring Breivik (2011), claiming that contained in the concepts ‘political correctness’ and ‘multi-culture’ lie cultural Marxism, pro-Islamism, and feminism, forming the ‘ideology’ which ‘now looms over western European society like a colossus’ (p. 21).
The focus in this article is to investigate and analyze the discursive connections between Islamophobia and anti-feminism on Flashback, a large Internet forum with a reputation for right-wing views. More specifically, how are Islam and feminism discursively connected on the forum? Are they cast as belonging to the same ideology? The data for the analysis are collected from the forum using custom web crawlers and consist of over 50 million posts. The reason for focusing on an Internet forum is the incipient shift toward social media as an increasingly important source for the (re)production of discursive power in society, but also because it is a unique source for studying everyday discourses outside the scope of mass media. In this sense, this large Internet forum is argued to have a function equivalent to that of traditional newspapers when it comes to producing and spreading societal discourses.
In order to approach this vast material, we suggest a novel methodological synergy between critical discourse analysis (CDA) and topic modeling: a new type of statistical model using hierarchical probabilistic modeling developed in computer science (Blei et al., 2003). By providing an overview, or ‘content map’, of the corpus, topic modeling constitutes an enriching complement to CDA, aiding discovery and adding analytical rigor. In addition, we complement this analysis with tools from social network analysis to illustrate how different discursive fields are connected through the engaged users.
The article begins by discussing the implication of the growth of social media for studying the (re)production of discursive power, arguing that we need to focus on social media in our analysis of societal discourses to a higher extent. Following this, we introduce topic modeling and illustrate how this method can be useful to approach and inductively structure large quantities of text. Then, we discuss some implications of combining CDA with topic modeling, relating to, for example, selection of data and the use of theory. The subsequent empirical analysis focuses on six different topics and topic categories that are manifesting various discursive connections between feminism and Islam. We show that the discursive connections between these issues are perhaps more complex than first expected and are mainly expressed through topics focusing on sexual assault, veils, and discrimination. We also show how gender equality seems to be used as a discursive strategy in order to criticize Islam.
From traditional to social media: a shift in the (re)construction of discursive power?
Traditionally, discourse analysts have often studied societal discourses by focusing on mass media, seen as playing a key role in the reproduction of dominant knowledge and ideologies in society and the main channel through which the elite exercise their power (Van Dijk, 1993a, 2005). Another central reason for this focus is the lack of alternatives: it has traditionally been difficult to penetrate the realm of everyday discourses outside the scope of mass media.
While traditional media undeniably still constitutes a highly significant source of news, its influence is declining as people globally and across the age groups increasingly use Internet and social media as a substitute (Nielsen, 2012). This makes the almost exclusive focus on traditional media increasingly problematic. We here follow Kaplan and Haenlein’s (2010) broad definition of social media as ‘a group of Internet-based applications that build on the ideological and technological foundations of Web 2.0, and that allow the creation and exchange of user-generated content’ (p. 61). Thus, the term obviously includes many rather different types of media and technologies, ranging from micro-blogs and online magazines to crowdsourcing, media sharing, and Internet forums.
New social media clearly shares many traits and functions often ascribed to traditional mass media, not least by framing issues and events and thus shaping people’s perceptions of reality and of social and political issues (Moscovici and Duveen, 2000). Following Van Dijk (1993b), we might say that they involve similar forms of social cognition. Social cognition mediates between the micro and macro levels of society by constituting the link between discourse and action, thus explaining how discourses transform social practice (see also Van Leeuwen, 2009). In this sense, social cognition links dominance and discourse by affecting how individuals and groups interpret the world and act upon this interpretation. This is also the focus of CDA, aiming to critically analyze these forms of texts and talks.
But social media is also in important ways different from traditional media. Social media builds upon interaction, communication, and networked individuals collaboratively sharing their narratives by creating and managing content. In this sense, the increase in social media usage thus also marks a shift from media consumers and passive observers to content creators. Through this shift toward user-driven, participatory information exchange, there are reasons to assume that the growth of new social media may bring with it new media dynamics, one example being online hate or online radicalization – a growing phenomenon that has received much attention over the last few years (Correa and Sureka, 2013). Online hate is generally characterized as content attempting to inflame public opinion against certain groups of people, generally based on race, religion, ethnicity, gender, or sexual orientation.
This apparent shift in the (re)production of societal discourses poses an implicit challenge to CDA: if CDA is indeed – as it claims to be – interested in describing, explaining, and criticizing the ways dominant discourses influence socially shared knowledge, attitudes, and ideologies, we must naturally continuously follow the source of these discourses as they shift in society. This indicates that we need to pay attention not only to top-down relations of dominance, but also acknowledge that power and dominance are not only imposed from the elite using mass media as channels through which they exercise their discursive power. Instead, as Van Dijk (1993b) notes, we need to pay more attention to how power and dominance can be jointly produced, and the aforementioned shift toward social media as an increasingly important societal actor indicates that this tendency will most likely become even more relevant in the future.
While many would agree that more focus needs to be placed on studying online social media, interested qualitatively oriented scholars have so far often been limited by multiple methodological challenges, perhaps foremost being the sheer amount of unstructured textual data often characterizing social media. Even relatively small data sets can be difficult to approach as it is hard to delineate, select, and confine materials of millions of texts, posts, or tweets. Making the matter worse, these texts are often short, lack discursive context, and are spread in complex and highly non-linear ways, making them difficult to extract and to study using established qualitative methodological and analytical approaches.
While these issues are indeed relevant for any discipline studying social media, they are perhaps particularly salient within discourse analysis where two main and interrela-ted criticisms are often raised: the arbitrary selection of texts (e.g. Koller and Mautner, 2004) and the small number of texts (e.g. Stubbs, 1994). The first criticism refers to the risk of ‘cherry-picking’, that is, that the author ‘picks a text to prove a point’, leading to obvious problems relating to representativeness and generalizability (Baker et al., 2008a; Stubbs, 1994, 1997). The second criticism concerns the often small data sets in discourse analytical studies, implying the risk of neglecting linguistic patterns that are less frequent or only cumulatively frequent. As Stubbs (1994) has observed, patterns of language use are often ‘not directly observable, because they are realized across thousands or millions of words of running text, and because they are not categorical but probabilistic’ (p. 204). In other words, many documents often contain only bits and pieces of ideologies, arguments, and discourses – small but systematic patterns and tendencies that may not be visible to the naked eye when restricted to small-n studies (Stubbs, 1994).
In the following section, we will suggest a way of attacking social media data by combining CDA and topic modeling, thus forming a methodological synergy that can address these issues and that can be mutually beneficial to both fields. Topic modeling is similar to some methods applied within corpus linguistics (CL) and constitutes a valuable complement to discourse analysis by providing an overview or ‘content map’ of immense sets of documents, revealing small but systematic patterns and tendencies in the data. CDA can contribute by enabling a more thorough and systematic qualitative analysis that goes beyond superficial explorative findings and increases the ambition by reaching into the realm of understanding and explanation.
Corpus-assisted discourse studies
Combining CL and discourse analysis goes back quite some time, and studies have used various forms of computer-based techniques for handling large volumes of data (Mautner, 2009). This field has recently received increased attention under the name of corpus-assisted discourse studies (CADS), which is basically an umbrella term for approaches that integrate discourse analysis and techniques for corpus enquiries from CL (Cheng, 2013; Hardt-Mautner, 1995; Partington, 2006; Stubbs, 1996; Wodak and Meyer, 2009), either in the form of a methodological synthesis (Baker et al., 2008a) or as separate components combined as a way of triangulation (Baker and Levon, 2015). CL consists of various empirical methods of linguistic analysis using corpora as the primary data and starting point, with the aim of finding ‘probabilities, trends, patterns, co-occurrences of elements, features or groupings of features’ (Teubert and Krishnamurthy, 2007: 6).
While many of these methods are rather simplistic, focusing on, for example, computing frequencies and related statistical significance of certain words, others are more advanced, enabling qualitative examination of the collocational environment of certain words and description of salient semantic patterns (see e.g. Baker, 2006; Baker et al., 2008a). Although applying methods associated with CL is perhaps not yet central within mainstream CDA research, awareness of the potential of these methods seems to be growing and there have been a range of recent CDA studies using methods from CL (Baker et al., 2008b; Baker and McEnery, 2005; Hellsten et al., 2010; Nelson, 2006; Orpin, 2005).
This study takes a novel approach to the CADS perspective by, for the first time (to the best of our knowledge), combining CDA and topic modeling, a technique that was recently developed in computer science but which shares common traits with some of the methods conventionally applied within CADS. Topic modeling inductively finds a number of topics describing the text corpus – recurring clusters of co-occurring words. An important difference from most other techniques within CL is that unsupervised topic models inductively structure the data without using any pre-set keywords. This means that it is corpus-driven, that is, the analysis is driven by whatever patterns are salient in the data itself (Tognini-Bonelli, 2001). While there exists other unsupervised, inductive CL techniques (e.g. cluster analysis, keyness analysis, and word frequency lists), it is more common within CL to search for certain keywords and study them using, for example, collocation analysis (Pollach, 2012). Before we further discuss how we combine topic modeling and CDA in this particular study, we will first elaborate more on how topic modeling works.
Topic modeling
Topic modeling is basically a catchall term for a collection of methods and algorithms that uncover the hidden thematic structure in document collections by revealing recurring clusters of co-occurring words. While there are several different algorithms for performing topic modeling, the most common and also the one we use in this article is Latent Dirichlet allocation (LDA) (Blei et al., 2003).
LDA views each document as a bag-of-words. A topic is defined as a list of words with different assigned probabilities. The goals of topic modeling are that (1) the words from each document occur in as few topics as possible, while (2) each topic has as few words as possible. LDA tries to find an optimal solution for satisfying these two contradictory goals. The output is a pre-set number of topics, where each topic has a list of words with different probabilities, and each document is linked to a list of topics with different probabilities. It is of course also possible to look at this probability distribution in the other direction: the documents that are most strongly linked to a topic are the ones that it best describes.
Let us take a simplifying example – without assigning varying probabilities – to see how this works in practice. Say that we have the three following documents: (1) ‘John eats sausage’, (2) ‘Eating sausage’, and (3) ‘John eats’. We want to find two topics from these. Remember that the solution we want to find is one where each document occurs in as few topics as possible and where each topic has as few words as possible. In this simple case, no complicated algorithms are required as it is trivial to see that the solution are the topics: (1) [‘john’] and (2) [‘sausage’]. (‘Eat’ is part of all documents and will therefore be disregarded, given that we use stemming.) Topic 1 will be connected to documents 1 and 3, and topic 2 will be connected to documents 1 and 2. In more realistic cases, more sophisticated algorithms are required to approximate the optimal topics, each topic has different probabilities for each of its words, and each document has different probabilities for each of its assigned topics – but the basic principle is the same.
The algorithms that LDA uses for these calculations are based on Bayesian statistical theory (Gelman et al., 2014), where the topics and the per-document topic proportions are seen as latent variables in a hierarchical probabilistic model. The conditional distribution of those variables is approximated given an observed collection of documents. When applied to the documents in a corpus, inference produces a set of topics, and for each document an estimate of its topic proportions and to which topic each observed word is assigned. Since the underlying mathematics and algorithms applied in LDA are too technical to further elaborate in detail here and have already been covered in multiple publications, we refer the reader to Blei (2012) and Blei et al. (2003) for a more thorough description.
Generating and interpreting topics
As we have seen, the recurring clusters of co-occurring words that are generated using topic modeling are called topics. While these often resemble what is conventionally referred to as ‘themes’ and ‘topics’ within qualitative text analysis and discourse analyses (Chafe, 2001), these concepts should not be confused since the characteristics of the topics generated in topic modeling vary in relation to the data being analyzed. 1
By automatically searching for word co-occurrences within textual units and identifying semantically coherent or internally homogeneous topics, topic modeling inductively discovers a structure of the corpus, largely unaffected by the researcher’s prior conceptualizations. While this does not eliminate the role of the researcher, it is often claimed that it turns the analytical work on its head by moving the work burden from identifying patterns internal to the text data, to the interpretation and theoretical conceptualization of patterns, and their relation to their social context (Krippendorff, 2004). For instance, as Mohr and Bogdanov (2013) argue, topic modeling
shift[s] the locus of subjectivity within the methodological program – interpretation is still required, but from the perspective of the actual modeling of the data, the more subjective moment of the procedure has been shifted over to the post-modeling phase of the analysis. (p. 560)
While we partly agree with this, it is nonetheless important to be aware that using topic modeling is not a neutral and rigorous process and the resulting topics do not reflect the one and only ‘true’ content of the corpus. Subjectivity not only leaks in through the theoretical assumptions underlying the model, but is also central throughout the entire research process. Topic modeling is often sensitive to the various settings and parameters used in the model, and changing these often yields rather different results. For example, one of the most central parameters is the number of topics specified by the user. While the suitable number of topics may in some cases be calculated statistically (Grimmer and Stewart, 2013), this is difficult when topics are used to identify themes for interpretation, which is usually the case in social scientific studies. 2 In these cases, the parameters are generally evaluated using more qualitative methods, according to whether they generate meaningful and analytically useful topics (Blei and Lafferty, 2007). Thus, similar to the standard procedure when applying most CL tools, this is a matter of experimenting and evaluating different numbers of topics and settings, combined with careful close readings, in order to estimate the quality of the topics. In general, one can say that a large number of topics leads to ‘higher resolution’, in other words more detailed topics, while fewer topics aggregate these into coarser overviews.
Consequently, using and calibrating topic models is, in our view, not an objective, strictly scientific process of optimization. Rather, following DiMaggio et al. (2013: 582), we think of topics rather as lenses for viewing a corpus of documents. As they argue, ‘[f]inding the right lens is different than evaluating a statistical model based on a population sample. The point is not to estimate population parameters correctly, but to identify the lens through which one can see the data most clearly’. Thus, just as different lenses are appropriate for different purposes, the ‘correct’ resolution and level of detail depend on the purpose of the study and the type of data. In the words of the statistician Box (1979), ‘All [models] are wrong; some are useful’ (p. 202).
Topic modeling in the social sciences
Topic modeling has been increasingly applied to various problems in the digital humanities, literature studies, and by historians, but also in a few cases in political science and within sociology (see e.g. Grimmer and Stewart, 2013; Roberts et al., 2014). It has also been the subject of a well-cited special issue in Poetics (2013). Many of these studies use LDA or related Bayesian approaches to infer latent topics in news articles, scientific journals, abstracts, and blog posts, but also in poetry and fiction. While topic modeling was initially intended primarily as a way of indexing and automatically categorizing large data sets, the applications in the humanities and social sciences are generally more theoretically driven. Topics have been claimed to capture various theoretical concepts such as media framing (DiMaggio et al., 2013; Hopkins, 2013), themes or thematic categories (Mohr and Bogdanov, 2013), dramatistic scene (Mohr and Bogdanov, 2013), and discourses, priming, and the relationality of meaning (see e.g. Bail, 2014).
In this study, however, we choose not to collapse topic modeling with any theoretical concept, but rather to use it as it was initially intended: as a tool for inductive empirical categorization. It is important to keep in mind that the technique was not particularly developed to address a certain theoretical issue or with certain theoretical concepts in mind. There is always a risk of overinterpretation when we are dealing with new, powerful methods of which the theoretical implications are yet to be fully understood. This is further exacerbated in this case due to the technical sophistication of the algorithms used and the resulting risks of blackboxing the method.
Following this, the current study uses topic modeling to render a valuable overview of the content, which can help us to categorize very large sets of documents and explore discursive changes over time, revealing small but systematic patterns and tendencies in the data. We will argue that such a methodological combination between CDA and topic modeling can be mutually beneficial for both fields. Before we head over to the analysis, we will first discuss how topic modeling and CDA can be combined.
Combining CDA and topic modeling
We here follow Van Dijk’s (2008) definition of discourses as (re)contextualization of practices. In other words, material events and social practices happening in the ‘material world’ are reformulated in texts and talks. 3 Generally, this means that most CDA studies are theoretically driven, in the sense that they often start with a macro-level focus on institutional power relations, dominance, and hegemonic ideologies, then investigate how these are exercised in the recontextualization of various events and social practices (Van Dijk, 1993b). In other words, focus often first lies on broader hegemonic ideologies or a certain ideological shift, then investigation takes place into how this ideology shapes and is shaped through the reconstruction or recontextualization of specific social practices. This is also apparent in the conventional procedure in CDA to use data selection through representative sampling in order to find ‘typical text’ where these ideological shifts are assumed to occur (Fowler, 1996; Lin, 2013). However, combining CDA with CL methods such as topic modeling turns this procedure around by instead enabling the analysis to start more inductively by focusing directly on the recontextualizations of various social practices and then exploring and identifying common discourses, underlying ideologies and tensions, and modifications within them.
For example, in this particular case, we first use topic modeling to automatically and inductively structure the text to find recurrent patterns, then we analyze the resulting topics more in-depth to find out what they are about, focusing both on the terms most closely associated with each topic and the related documents. In this way, we try to identify the broader discursive fields with the help of the topics. Discursive fields are here defined as the ‘dynamic terrain in which meaning contests occur’ (Steinberg, 1999: 748) and comprise various frames and discourses varying on a continuum ranging from consensus to heated contestation (Snow, 2004, 2008, 2013; Spillman, 1995; Steinberg, 1999). The boundaries of these discursive fields are never entirely fixed or clear, but are rather in constant flux, as the dynamic fields ‘emerge and evolve in the course of discussions and debates about contested issues and events’ and encompass protagonists, antagonists, and bystanders (Snow, 2004). In this sense, these discursive fields represent ‘public battlegrounds’ where different actors compete to ascribe meaning to a certain issue – a battleground that at the same time defines the limits of discussion on a particular issue.
In the next step, we code and analyze the ‘top documents’ relating to each of the relevant topics in order to (1) identify and analyze the dominant discourse(s) within this specific discursive field and (2) focus on the common discourses permeating through all or several of the topics, thus (potentially) representing an underlying ideology. By top documents, we refer to the documents that are most central to each topic, in the sense that they most prominently exhibit the word distribution of the topic (see the ‘Topic modeling’ section earlier). This is in fact also a key benefit of topic modeling: both finding central topics in a corpus and listing the documents that are most strongly associated with these topics. Focusing the analysis on these top documents enables a more systematic approach to the data and marks an important step away from the tendency of ‘cherry-picking’. This also means that the approach suggested here is not to merge topic modeling and CDA into a single synthesis, but rather to use them as complementing steps in the analysis.
Data and procedure
The corpus for this study is extracted from Flashback, which is currently one of the largest web forums in the world. At the time of writing, there are 948,950 registered users and 49,491,019 posts, contained in different subforums and threads treating subjects ranging from computers, drugs, and family to culture, politics, and current crime. The forum is fully open to the public and is moderated by a set of privileged users. The forum is expanding with about 15,000–20,000 new posts per day and has around 2,300,000 unique visitors per week. While the forum includes discussions on many different topics, it has a reputation for leaning toward extreme-right opinions and it is often mentioned in Swedish media as a hub for online hatred and xenophobia.
Due to the size and reach of the forum, Flashback can be considered a highly relevant platform with impact and influence comparable to other media actors. Thus, with the background of the above discussion on the pending shift toward social media as an increasingly important actor, we argue that this forum serves as a channel for discursive power and has an equivalent function in producing and spreading societal discourses just as any traditional newspaper. This clearly motivates the need to critically examine how the discursive connections between Islamophobia and anti-feminism are manifested.
Using web crawlers, we downloaded and anonymized the entire forum in the time period between May 2000 and May 2013 into a local database, comprising over 50 million posts and about 968,289 users.
The analysis consists of two steps. In the first step, focus lies on analyzing the connections between anti-feminist and Islamophobic discourses. In order to do so, we created a subcorpus by selecting all posts containing at least one keyword associated with Islam 4 (‘muslim’, ‘islam’, and ‘arab’), together with a keyword associated with feminism (‘femini’). We then ran an LDA set to 20 topics on this subcorpus, in total comprising 12,796 posts. This enables us to investigate more closely how anti-feminist and Islamophobic discourses are systematically discussed in relation to each other, in other words to reveal eventual discursive connections.
In the second step, we focus on whether these issues are directly connected through engaged users, and if so, which are the topics that connect them? Here, we created a new subcorpus by selecting all posts in all political subforums over all years without using any keywords and ran an LDA set to 80 topics. In total, this subcorpus comprises 6,038,773 posts, aggregated into 576,801 documents. Based on this, we used a network analytical software called Gephi 5 to generate a discursive network, illustrating how the topics are connected through the engaged users (see Graph 2). In other words, this graph shows the interconnections between topics with shared/overlapping users as the connection. The overlap is calculated by finding how many of the N most frequent posters are in common between the topics, divided by what would be statistically expected. In this way, we can investigate whether the same users tend to be active in more than one topic.
These two steps concerning how the corpus is selected should not be confused with the following step of empirical categorization using the LDA model, which is inductive. Furthermore, using the keywords to create the specific subcorpus in the first step, there is obviously a risk of accidentally excluding a number of potentially relevant posts that either discuss the specific topic more indirectly or refer to it through relatively common derogatory terms such as ‘feminazi’, ‘sandnigger’, or ‘culture-enricher’. However, we consider this is a minor problem as most posts cite a previous submission that uses the specific term and therefore will be included in the analysis. A close reading of a selection sample indicates that our selection criteria and keywords seem to capture the intended texts.
In both steps we set the α-parameter of the LDA model to 1.0. This defined the prior Dirichlet distribution the model assumes: the lower the α-parameter, the more concentrated topic distribution the model will assume and generate, and a higher α-parameter means more uniform topic distribution across each article. We have experimented with various topic numbers and α-parameters and found the results to be robust.
For technical reasons, standard LDA generally works best for documents with a size of at least 1000 words. We therefore aggregated all posts in the subcorpus from each individual user in a specific thread within the same time period into chunks of 1000-word documents (for a similar approach, see e.g. Hong and Davison, 2010; Weng et al., 2010). Posts that are significantly longer than 1000 words are split into smaller chunks. In fact, as Zhao et al. (2011) have suggested, this approach can be regarded as an application of the author-topic model (Rosen-Zvi et al., 2010), where each document has a single author. An alternative would be to instead compile a number of subsequent posts from various users in the same thread. However, we do not see any advantages of this. First, to merge various opinions and discourses from different users in the same document would make it more difficult to read. Second, this would impede the following analysis on connections between users, which requires user-specific documents.
Furthermore, as conventional in LDA models, we excluded a number of stopwords from the analysis, that is, very common words like conjunctions and articles, using off-the-shelf lists for Swedish, Norwegian, Danish, and English, further extended with slang and abbreviations often used on the forum. Finally, while lemmatizing the corpus can often be useful to produce better topics, this proved difficult in our case due to the often informal language and the amount of, for example, slang, abbreviations, and misspellings.
In the analysis of the resulting topics, we distinguish between topics and topic categories. While the former refer to single topics (referred to as T1, T2 …), the latter are a group of topics that are interpreted to belong to a common subject area, that is, all topics discussing an arguably similar issue or event, either in the same year or over time. These heuristic topic categories are hence manually constructed through close reading of the constituting topics and their top documents, with the purpose of enabling an overview of the results and to facilitate the analysis. This means that the decision as to how wide and inclusive these topic categories are set to be depends on the level and resolution of the analysis. Thus, a wide topic category consists of a number of topics that in turn can be thematized within smaller, more specific topic categories.
To perform the LDA models, we used Big Text Tool, 6 an online-based application that includes various tools for automated text analysis and graphical illustrations. This application is free, easy-to-use, and customized particularly for social scientific studies using large corpuses.
Result and analysis
We now move to the presentation of the results and the analysis. As we can see in Graph 1, the model presents a number of separate topics focusing explicitly on either feminism or Islam – such as the Assange case (T7), net-hatred against women (T5), and extreme-right parties (T19). Since we are particularly interested in the overlapping topics, we chose to exclude these topics from the analysis. Analyzing the remaining overlapping topics enabled us to study the various discursive fields within which both feminism and Islam are jointly discussed. Focusing on the posts within these topics enabled us to more closely dissect and investigate the character of these intersections and the discursive connections between them. Thus, the main thrust of analysis will be on identifying underlying discourses connecting Islamophobia and anti-feminism, rather than any detailed analysis of each topic.

LDA model with 20 topics of all posts containing the keywords ‘muslim’, ‘islam’, or ‘arab’ and ‘*femini*’, giving a total number of 12,796 posts.
In general, much of the focus in the discussions of these topics lies on (1) the perception of women within Islam, (2) the claimed inherent tension between Islam and feminism, and (3) the alleged hypocrisy among feminists that is claimed to propagate for Islamic immigration. These issues seem to prevail as an underlying discourse penetrating most of the topics, although the topics often focus on more specific issues. We will here briefly discuss some of the fields within which the discursive connections are played out, analyzing quotes from the top documents in the topics. We use topic categories to facilitate the analysis, although we also focus on the separate topics within each category for a more nuanced and detailed analysis.
In the analysis, we identify two main topic categories:
inconsistency between feminism and Islam (T1, T3, T4, T17, and T18);
inconsistency within the political left (T10, T11, and T16).
Besides these, we also identify four separate topics:
children, traditions, family, and gender roles (T12);
race, ethnicity, and sexuality (T2);
religious private schools (T6);
Jews (T9).
First, there is a broad topic category focusing explicitly on the claimed inherent female oppression within Islam, consisting of particularly T1, T3, T4, T17, and T18 (see Graph 1). Posts in which these topics’ terms dominate word assignment thus deal with discussions on the inconsistency between feminism and Islam. They also discuss why feminists are perceived not to protest against male chauvinism and female oppression among immigrants in general and among Muslims in particular. This general tendency is clearly manifested in these posts below which are central in this topic category:
[…] you can’t even begin to see that islam and feminism are inconsistent/opposed? that the increased portion of muslims in Sweden should be a Swedish feminists worst fear? I have probably never seen feminists demonstrating against the patriarchy of muslims (where you really can speak of a patriarchy and gradual [sic] female oppression). It seems almost only important to destroy the Swedish traditions, not the others. A little unfair, I think, as a swede. Why can Ahmed have his traditions preserved, while mine are destroyed?
What is apparent here is an idea of an immanent essence of female oppression within Islam. Muslim countries are presented as strong patriarchates characterized by gender inequality, which is seen as a foundational pillar of Islam, and thus in direct opposition to feminism and gender equality. In the second quote, the user also expresses a criticism that while feminists ‘destroy’ Swedish traditions due to their claimed patriarchal tendencies, they are at the same time ignoring similar, or even worse, tendencies within Islam. This indicates that Muslims are argued to unfairly receive special treatment and enjoy privileges that so-called ethnic Swedes do not, that is, that there is a ‘reverse discrimination’ against ethnic Swedes – an argument which is recurrent in many of the posts. This statement also indicates that there is a claimed inconsistency within feminism by criticizing patriarchal structures in one context while ignoring it in another, something which we will come back to later.
Furthermore, this claimed essence of female oppression in Islam is manifested in different ways, which we can see in the various topics within this topic category. For example, T3 focuses specifically on discussions on whether immigrants are statistically overrepresented in violence against women, sexual assaults, and issues relating to ‘honor violence’. In these posts, Muslims are portrayed as violent and focus often lies on the alleged reasons for this, for example, due to genetic/biological causes or whether it is ‘inherent in their culture/religion’. T17 focuses particularly on discussions on veils and burqas and whether these should be considered as a free, individual choice or as an expression of compulsion. This topic also focuses on female rights according to the Quran. T4 focuses somewhat more explicitly on the essence of Islam: the Quran and Sharia laws. These discussions focus on various parts of the Quran that are claimed to be hostile toward women. Connections are frequently drawn between Islam, female oppression, pedophilia, and child marriage as well as to circumcision and genital mutilation.
Coming from this idea of an inherent and inevitable female oppression within Islam, the claim that many Swedish feminists nonetheless support Islam, or at least do not openly oppose it, is thus argued to illustrate a widespread hypocrisy among feminists and an inconsistency within feminism. This is a recurrent pattern in many posts in this topic category, for instance:
This is the interesting thing with swedish feminists shameless hypocrisy. It’s hard to imagine anything more antifeminist then islam, so it would be as you say logic if feminists spent their time criticizing islam and their anti-equality perspective on women, particularly since this is a growing threat against the swedish gender equality. According to all logic feminists should hate islam, but you never hear any criticism. Why? Because feminists know that islam and feminism will never fit together. They choose to cover up and pretend as if it’s raining.
As we can see, the claimed inherent contradiction between Islam and feminism is thus again raised here by explicitly emphasizing the hypocrisy among feminists to avoid criticizing Islam and Muslims. Islam is portrayed as a ‘threat’ against gender equality, and feminists are claimed to ‘cover up’ this fact. Before analyzing this more in detail, we note that we find a similar argument also in the second topic category, consisting of T10, T11, and T16 and focusing particularly on the claimed lack of consistency within the political left. Similar to feminists, left-wing activists are argued to support Islam or at least avoid criticizing it:
You feminists and hbtq in the left why are you demonstrating together with men with African or Arabic origin? The men that you stand together with and scream think that hbtq persons are sick/suffers from an illness. Most of them think that you don’t have a right to live and should be shot or stoned to death, still you stand their side by side. [T]here is a burnout is the brain of the pc-ist (‘pkitens’) when two pc-topics contradict each other resulting in a tacit status quo.
Thus, as we have seen, there is a dominating tendency in the discussions to focus on Islam in relation to gender inequality, and this is frequently raised as an argument against immigration in general and against Muslims in particular. For instance, in one of the earlier quotes, Islam is argued to constitute a threat against Swedish gender equality. This reveals an interesting ostensible contradiction. Based on a broad overview of the central discussions on the forum, discrimination against women appears generally to be considered a minor or non-existent problem. At the same time, it is raised as a substantial issue when it comes to Muslims and immigrants. This suggests that gender equality is used deliberately as a discursive strategy among certain groups in order to criticize immigration, but also to ‘reveal’ what is seen as hypocrisy and contradiction within what is referred to as the ‘politically correct establishment’ that is alleged to both embrace feminism and immigration in general and Islam and Muslims in particular. In other words, this indicates that feminism and equality in these cases are used for the sake of the attack, rather than for any deeper, actual sympathies with feminism or concerns for gender equality.
Apart from these two topic categories, there are a number of separate topics treating other issues that are discursively and explicitly connecting feminism and Islam. One of these is T12, focusing on children, traditions, family, and traditional gender roles. Posts in which this topic dominates word assignment express criticism against the intention in feminism to ‘dissolve traditional gender roles’. For instance, as some users put it,
It’s not about whether gays are nice or not. Questioning the hetero-normativity (with the related pushing for ‘faggotry’) is a method to destroy the traditional man’s role. Since the gender-pedagogue is pursuing a highly dubious operation, it appears to be better to let believing muslims take care of the children, since they at least don’t intend to break down the gender roles and sexual identity. I am talking about Sweden today. And about that the choice is – hypothetically – between muslim values aiming to conserve/protect traditional gender roles, and feminists that want to get rid of them. It’s a question about how different cultures/religions relate to gender roles. I’d choose feminism all days a week instead of the islamic oppression, women are our allies in the fight against the green pest.
Interestingly, as illustrated in these quotes, we may discern two distinct camps discussing these issues. In the first camp, we find the afore mentioned tendency where certain users claiming to advocate gender equality in Sweden see women as ‘their allies’ against the Islamic oppression and use this approach to criticize Islam. In the second camp, there is widespread criticism against radical feminism and how this is perceived to constitute a threat to ‘the family’ and traditional gender roles as well as ‘sexual identities’. Here, Islam and Muslims are seen as potential collaborators in the battle against ‘faggotry’, gender pedagogue, and feminism in general.
Furthermore, T2 focuses on racial issues and sex and discusses sexual relations between different ethnic groups and different ‘races’:
While in one moment, the swedish woman raves about feminism and tolerance in their fight against male chauvinism, in the other they jump to bed with misogynist arabs and other dominant and sexists men. Just in the same way, they intend to oppose misogyny and oppression, but at the time they consistently and loyally rave to allow Islam political elements to emerge in society.
Again, the claimed inconsistency among feminists is brought to the fore, but this time focusing on sexual relations and ethnicity. Here, this inconsistency is illustrated in that Swedish women, on the one hand, fight male chauvinism, but at the same time go to bed with ‘misogynist Arabs’. The argument that Swedish (feminist) women prefer sexual subordination and dominant men in bed is a recurrent theme in the discussions and clearly intends to both illustrate the hypocrisy of feminists and consolidate traditional gender roles as well as to illustrate some deeper characteristics of ‘women’s nature’ that feminism is claimed to try to cover up.
T6 focuses on religious private schools, with particular focus on Islamic schools and the Pentecost. Here, criticism particularly toward Islam is thus articulated in that these religious schools are perceived as constituting a threat to Swedish values and to scientific perspectives. The connection to feminism is that religious schools are compared to the implementation of gender science in the education system, and these are claimed to constitute similar political regulations of scientific perspectives.
Finally, there is also a topic on Jews (T9). A central reason that this topic appears in the posts containing both feminism and Islam is that naturally Jews are claimed by some to be the root behind both feminism and ‘multiculturalism’. As one user quite straightforwardly puts it in a central post in the topic, ‘the problem is that since the entire PC [politically correct] multicult is a jewish invention it’s hard not to bring up the jews’.
In other words, the causality expressed in the majority of the posts within this topic is that the existing political elite (the ‘PC establishment’) – which is both feminist and pro-Islam – is in fact fundamentally Jewish. Or more correctly, it is connected to cultural Marxism, which in turn is seen as a Jewish invention. In fact, some users argue that cultural Marxism is the actual reason explaining the afore mentioned internal inconsistency among feminists supporting Islam and Muslim immigration.
Topic connections through engaged users
In the analysis so far, we have identified what seems to be an underlying ideology characterizing Islam as inherently misogynistic and feminists as inconsistent since they – despite this – are argued to support Islam and Muslim immigration. This ideology is recontextualized in several different fields and areas, including sexual relations and family and gender roles.
The next step is to investigate whether the same users in the forum are active in both discussions: in other words, whether they explicitly discuss both Islam and feminism as separate issues. This would enable us to see whether these issues are also directly connected through engaged users, and if so, which are the topics that connect them? This would indicate that these recontextualizations are explicitly related to each other.
In order to do this, we used the second subcorpus consisting of all posts in all political subforums over all years without using any keywords and ran an unsupervised LDA set to generate 80 topics. In total, the corpus comprises 6,038,773 posts, aggregated into 576,801 documents. We then investigated the overlap of users between these topics in order to see whether the same users tend to be active in more than one topic.
Since the discussions on Islam, Muslims, and feminism are relatively peripheral compared to all other topics and issues debated on the forum, a general network with the most connected topics 7 would not include these topics in a meaningful way. Therefore, we created a form of ego network focusing explicitly on four topics that explicitly relate to either feminism (T9: women, men, feminism, and feminists) or to discussions that we previously have seen are central in the intersection between Islamophobia and anti-feminism (T30: children, child, parents, and abortion; T41: women, sex, girls, and rape; and T53: law, veil, prohibit, and discrimination). The results are illustrated in Graph 2. To facilitate the analysis, these central topics are larger and colored in red in the graph (i.e. the size of the nodes is not connected to the size of the topics) and the strength of the edges represents proportion of shared users. The alters in the network represent the topics that are directly connected to these focal topics. Noisy background topics are excluded.

A weighted, undirected network of the topics that are central in the intersection between feminism and Islam. The focal topics are T9 (women, men, feminism, and feminists), T30 (children, child, parents, and abortion), T41 (women, sex, girls, and rape), and T53 (law, veil, prohibit, and discrimination). The network is based on an LDA model with 80 topics. The corpus constitutes all posts in the explicitly political subforums without using keywords, in total 6,038,773 posts. The weight threshold is set to 7.
As we can see in Graph 2, T9 that explicitly focuses on feminism is strongly connected with T30 (children, child, parents, and abortion) and T41 (women, sex, girls, and rape), but there are no direct connections to Islam, Muslims, or to any other topics. However, T41, T30, and particularly T53 – topics relating to discussions and issues that we previously have seen are central in the discursive connections between Islam and feminism – are indeed strongly connected to topics on immigrants and Islam.
This indicates that users who were explicitly discussing feminism were, in general, not also inclined to discuss Islam, Muslims, or immigrants. 8 However, users discussing sexual assault and rape (T41), children, family issues, and abortion (T30), and veil/burqa and discrimination (T53) are in fact to some extent also discussing both feminism and immigrants, Muslims and Islam. This indicates that these topics serve as indirect connections, linking feminism and Islam. These results confirm the results from the analysis earlier, highlighting these specific fields as central in the recontextualizations of the Islamophobic and anti-feminist discourses. This also strengthens the previous hypothesis that there seems to be a tendency to use gender equality as a discursive strategy to criticize Islam and Muslims.
Discussion
By focusing on the topics, we have identified certain discursive fields where the recontextualizations of certain ideologies are ‘played out’: they are exercised and negotiated in these contexts. While these recontextualizations indeed concede to the reality of these events and practices, they are nonetheless formulated from a certain point of view and in line with certain interests. We have focused on a number of topics connecting anti-feminist and Islamophobic discussions, through, for example, discussions focusing on the claimed overrepresentation of Muslims in the statistics of violence against women, in the question of religious private schools, the phenomenon of burqas/veils, sexual relations between different ethnic groups, and family and gender roles.
By analyzing these topics and their associated documents, we have identified a common and pervading discourse, declaring a claimed immanent oppression of women within Islam and an alleged contradiction among feminists and the political left to be both in favor of gender equality and at the same time pro-Islam and positive toward Muslim immigration. This is also claimed to constitute some form of reverse discrimination, whereby Swedish (men) are criticized while Muslims are not. Furthermore, we have identified a tendency to use gender equality as a discursive strategy in order to criticize Islam and Muslims. In this sense, feminism seems to be used for the sake of the attack, rather than any deeper conviction. Finally, by analyzing the overlaps between the topics through engaged user, we further strengthen these results, showing that while feminism and Islam seem not to be explicitly and directly connected through users, they are indeed indirectly connected through topics focusing on sexual assault, veil and discrimination, and children and gender roles, serving as recontextualizing fields for the underlying ideologies.
Through this approach, we have also illustrated how topic modeling can serve as a powerful supplement to CDA, allowing the inductive exploration of large quantities of unstructured text data. The corpus analyzed here comprises in total about 50 million posts, with around 4.7 billion words, corresponding to about 10 million normal-spaced pages of text. Just reading this manually would take approximately 3–3.5 years of continuous reading. We have shown how topic modeling provides an alternative to this by rendering a valuable overview or map of the content, thus helping to find interesting discursive patterns that can be more closely investigated and analyzed using discourse analysis.
Such methodological synergy can be mutually beneficial and help in addressing some of the open issues in both fields by combining a valuable overview and structure, which are hard to extract with the naked eye, with sensitivity for linguistic nuances and implicit and symbolic meanings, which may not be visible for the automatic eye. This approach also harvests the strength of topic modeling by being close to what it was designed to do: exploring and categorizing large amounts of text rather than being used as a rigorous, stand-alone scientific method, as has so far often been the case in the social sciences.
In combination with CDA, topic modeling reduces the researcher bias and increases the credibility of the analysis, not least by reducing tendencies of ‘cherry-picking’ the data and guarding against over- and underinterpretations (Baker et al., 2008a; Stubbs, 1994). But it is also important not to get carried away by the bewildering surplus of accessible text and to avoid the temptation to ravenously devour data indiscriminately. This approach should not, and cannot, replace careful consideration of what kinds of texts are necessary in order to answer one’s research questions and a critical awareness of what the corpus can actually be argued to represent. Furthermore, one should not overemphasize the automatic part of automated text analysis or downplay the effort involved in performing such analysis. It requires both time and effort to learn how to use these tools, but also persistent manual labor to make them work with the actual material, since data in these cases, as Baker (2006) puts it, often need to be ‘subtly massaged’ in order to produce the desired results (p. 179). This includes, for instance, various manual calibrations in order to produce coherent and interpretable topics. It is important not to conceal such subjective choices and to be aware of how this affects interpretation and analysis. Here, CDA can contribute with a more elaborated approach to hermeneutic interpretation processes and analytical methods for grasping and interpreting meanings in the selected texts, not least by making these processes more transparent. CDA also enables going beyond what is explicitly written and focusing on what could have been written but was not, or more subtle, coded strategies and symbolic meanings in the text.
In this particular study, we supplemented the analysis with meta-data of the users to produce a network with connections between topics through user activity, enabling us to connect and relate discourses to certain users. This is also an interesting aspect of Internet and social media data since it often includes relational data, thus enabling the study of how discourses and social groups are mutually constituted through interactions. This opens the possibility for future studies to investigate questions such as the following: ‘How do certain discourses form and diffuse within a network?’ ‘What influences do certain key actors and groups have on these processes?’ By focusing on the relation among various actors, we can also investigate how certain symbolically charged concepts and meaning constellations spread throughout and between social groups and the impact of what network analysts call network topologies. This indeed marks a potentially fruitful avenue for future research.
Finally, topic modeling has been criticized multiple times, on the one hand for not being rigorous enough to be used as a quantitative tool, and on the other for not being capable of moving beyond a mere descriptive exploration of the corpus when employed on its own. As we hope to have shown in this article, topic modeling becomes truly useful first when employed together with other methods that can grant both theoretical depth and more elaborated analytical techniques for interpretation. We think that such methodological synergy may contribute to transcending the boundaries between quantitative and qualitative approaches within text analysis.
Footnotes
Declaration of conflicting interests
The author(s) declared the following potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
