Abstract
Hate speech is a burning issue of today’s society that cuts across numerous strategic areas, including human rights protection, refugee protection, and the fight against racism and discrimination. The gravity of the subject is further demonstrated by António Guterres, the United Nations Secretary-General, calling it “a menace to democratic values, social stability, and peace”. One central platform for the spread of hate speech is the Internet and social media in particular. Thus, automatic detection of hateful and offensive content on these platforms is a crucial challenge that would strongly contribute to an equal and sustainable society when overcome. One significant difficulty in meeting this challenge is collecting sufficient labeled data. In our work, we examine how various resources can be leveraged to circumvent this difficulty. We carry out extensive experiments to exploit various data sources using different machine learning models, including state-of-the-art transformers. We have found that using our proposed methods, one can attain state-of-the-art performance detecting hate speech on Twitter (outperforming the winner of both the HASOC 2019 and HASOC 2020 competitions). It is observed that in general, adding more data improves the performance or does not decrease it. Even when using good language models and knowledge transfer mechanisms, the best results were attained using data from one or two additional data sets.
Keywords
Introduction
While there are some differences among countries in the legal system regarding free speech [94], freedom of expression – the freedom to express ideas and opinions – is essential in international law both for individuals and societies [44]. This fact, along with the harm caused by hate speech [34,71,87] (or in other interpretations, the harm that is hate speech [100]), has sparked and maintained a debate about hate speech [10,24,41,64], where the focus was on arguments for and against laws regulating hate speech [33,39,92]. Moreover, there is still no clear agreement on whether the best response to hate speech is legal measures, other methods (e.g., education or counter-speech [14]), or a combination of the two [29]. Regardless of what methods one deploys to handle hate speech, the first step is detecting it. Manual discovery of hateful and offensive content, however, places a burden on those who are tasked with this duty [40]. Moreover, the amount of content generated on social media, and in the online space in general, makes this infeasible. For these reasons, there is a strong demand for methods to automatically detect hateful and offensive content.
In this paper, we share our approach for the detection and classification of hateful or offensive content (using English language social media posts) with a focus on how a small labeled corpus can be supplemented by leveraging external resources, such as other corpora, word representations [35], and pre-trained models. First, we discuss the related literature and the limits of this study in this section. Then, in Section 2, we describe the HASOC competitions [60,62], as well as the additional sources of data we used in our experiments to complement the limited data available in these competitions. This is followed by a short description of the methods applied (Section 3). We then present and discuss our experimental results (Section 4), analyse some major take-aways (Section 5), and conclude our study, outlining possible directions for future work (Section 6).
Related literature
The Internet has fundamentally changed our lives and habits [16], from our interactions with friends to our news consumption strategies [4,25,43]. These changes have positive and negative consequences alike. On the one hand, possibilities for cooperation across continents have spiked, and access to information is easier than ever. Moreover, we can share information and our opinion at a greater speed and with a wider circle than before. On the other hand, when this opportunity is combined with our potential to hide our identity [117], these opportunities can also be (and are) used to spread offensive and hateful content. This has urged many researchers to examine the possibility of automatically detecting and classifying offensive content (for overviews of these works, we refer the reader to [59,113]), the first of these efforts dating back to 1997 [95]. Following this first model based on various rules, a considerable part of the research effort was devoted to similar systems, centering around rules [73], templates (for example, identifying patterns such as I *intensity* *user_intent* *hate target*) [70] or key expressions [37,59].
A more complex approach was pairing classical machine learning models with various feature extraction techniques. For example, Kwok and Wang applied Naïve Bayes (NB) classifiers in conjunction with Bag-of-Words (BoW) features [54]. Others, like Grevy et al., applied the same feature extraction technique and used the results as input for the Support Vector Machine (SVM) algorithm [36]. The propensity of BoW features for producing a high rate of false positives, however, motivated researchers of the field to use different feature extraction methods. Warner and Hirschberg [102], for one, used a template-based strategy to extract features for an SVM classifier, and detect anti-semitic hate speech with an accuracy of up to
Another fruitful research direction in the detection of hateful and offensive content was the application of Deep Learning models. These models achieved considerable success in the early two thousands [108], advancing the area of NLP [78], after already having succeeded in signal and speech processing, as well as computer vision. An important milestone in the headway of Deep Learning in Natural Language Processing was the introduction of word representations, such as Word2Vec [68]. When combined with already established machine learning models, these representations, or embeddings, showed their effectiveness in the task of offensive language detection [66,109], providing better results (Area Under the Curve – AUC) than those attained using the BoW approach, while requiring less memory and training time [22]. The introduction of word representations such as Word2Vec, FastText [13] and GloVe [80] also made the application of other deep learning models easier, namely the use of Convolutional Neural Networks (CNNs) [8,30,79,118], along with the application of Recurrent Neural Networks (RNNs) – such as the Long Short-Term Memory (LSTM) unit [8,20,23,101] – as well as the combination of the two [8,20,23,45,51,101]. Another crucial milestone in the use of deep learning for natural language processing in general (and hate speech detection in particular) was the publication of the transformer model [21]. Its success is exemplified by the fact that in a specific sub-task of the 2019 OffensEval challenge [110], seven of the top ten teams were using transformer models. The above discussed deep learning models, besides providing high classification scores when used in standalone mode, can also be successfully applied in ensembles [72,75].
As a matter of fact, when considering the average performance on every sub-task in a recent challenge with more than fifty competing teams, the team attaining the best performance used the very same approach [93], combining such models as LSTMs, transformers, and OpenAI’s GPT [82] with classical machine learning models, such as Random Forests and SVMs.
Besides the research effort dedicated to the detection and classification of various aspects of hateful and offensive speech, another driving force behind the progress in the area was the organization of shared tasks, challenges and competitions that provided incentives and labeled data to researchers working on these tasks. Some examples of recent competitions in the area include the detection of insults in social commentary [46], aggression [53], as well as the detection of hateful or offensive content in different languages, such as German [105], Spanish [11], Italian [89], and English [60,61,110].
Scarcity of reliably and uniformly labeled data
The task of detecting hateful and offensive language comes with many difficult challenges; the reader can gain an overview of these by reading, for example, the works of MacAvaney et al. [59], Fortuna et al. [26,27] or Kovács et al. [51]. In this paper, however, our main focus is how external sources can be leveraged to complement a small labeled dataset. For this reason, in this section we discuss solely the problem of data scarcity – along with its causes – and do not discuss other issues, such as users intentionally obfuscating their language to avoid detection by automatic algorithms [74], or the context-dependent nature of hateful language [42,81,83,98] (including the differentiation between in-group and out-group language users [12]) and its effect on bias [18,50,91]. These challenges and difficulties are discussed in more detail in the papers listed above.
An important factor contributing to the scarcity of uniformly labeled data is the lack of a universally accepted definition of hate speech [2,3,32,47,76,77,97,98]. As the 2018 report of the Free Word Centre puts it, “‘Hate speech’ is an emotive term which does not have a uniform definition under international human rights law” [29]. There is, however, a commonly applied definition, used by major information technology companies (including Facebook, YouTube, and Microsoft), that is based on a code of conduct [17], which in turn is based on the 2008 framework decision of the European Union [28], which defines illegal hate speech as “the public incitement to violence or hatred directed to groups or individuals on the basis of certain characteristics, including race, colour, religion, descent and national or ethnic origin”.
All the above discussed problems lead to the scarcity of labeled data and, when multiple datasets are used, to the incompatibility of different sets [27]. This causes difficulties not only due to the sometimes unreliable nature of labelling, but also due to the data-hungry nature of leading machine learning (and in particular deep learning) models [1].
In this study, our efforts were concentrated on alleviating the above discussed issues related to hate speech. We did so using two consecutive HASOC challenges (HASOC 2019 [62], and HASOC 2020 [60]) as examples. Given that the number of labeled instances in the training sets of these text classification competitions was limited, we consider them to be a great case study to examine the effect of leveraging external resources, which includes the use of pre-trained classification models, word representations, and additional data. Although the advantage of additional training data in general is already commonly known, there are conflicting cases as well [31,115]. We should note here that this work is an extension of the work of Alonso et al. [5], and Kovács et al. [51]; thus we consider our main contribution to be the extension of their experiments, and showing that their inferences remain true for the HASOC 2020 competition as well. Lastly, we consider it an important result that these techniques of leveraging external resources led to state-of-the-art recognition scores not only on the HASOC 2019 test set, but also on the test set of the HASOC 2020 competition, where on the sub-task we tackle, we increased the performance from 0.51 Macro
External data sources
Here, we introduce the various additional resources we examine in the leveraging of external data. These resources include the various datasets (including the datasets provided for the HASOC competitions), as well as word representations, and pre-trained classifiers.
Datasets
To systematically examine the effect of similarly labelled datasets, we examine three Twitter datasets (with tweets primarily in English) that are all available online (including the text of the tweet, and not only the tweet ids, a criterion that led to the exclusion of some often used data sets from consideration [103,104]). Two of these sets were collected by the same authors at different times (with a one-year gap), while the third was collected by different researchers, but was labelled based on similar principles (as can be seen below, in the detailed description of the data sets). We consider both of these differences worth examining, as we regard leveraging labelled data from different data collections as an especially interesting problem: earlier, a low generalization capability was reported even for similar data sources [7] (though recently successful cross-lingual transfer learning experiments have been carried out on HASOC2019 and OLID [85]). For one, the importance of the collections being carried out in different years is highlighted by the fact that, for example, in the training set of HASOC 2019, the name of Trump occurs 5492 times, while in the training set of HASOC 2020, it occurs only 92 times.
The token type counts in Train and Test sets of Hasoc 2019 dataset
The Hasoc 2019 dataset [62] was developed for the competition to predict various classes of offensive or hateful speech. It is provided with a predefined train and test split. The total number of tweets in the training set is 5852, out of which 2261 are labeled as hate speech (HOF or OFF), and 3591 are labeled as non-hate speech (NOT). The total number of tweets in the test set is 1153, out of which 288 are labeled as HOF/OFF, and 865 are labeled as NOT. The tweets may contain #-tags, @-mentions, URLs, emojis, emoticons, and other mixed character strings. Table 1 shows the number of times each token type appears in the Hasoc 2019 train and test sets.
Hasoc 2020
The Hasoc 2020 dataset [60] was developed for the competition to predict various classes of offensive or hateful speech. It is provided with a predefined train and test split. The total number of tweets in the training set is 3708, out of which 1856 are labeled as hate speech (HOF), and 1852 are labeled as non-hate speech (NOT). The total number of tweets in the test set is 1592, out of which 807 are labeled as HOF, and 785 are labeled as NOT. The tweets may contain #-tags, @-mentions, URLs, emojis, emoticons, and other special characters. Table 2 shows some sample tweets from the Hasoc 2020 dataset with their ground truth labels. Table 3 shows the number of times each token type appears in the Hasoc 2020 train and test sets. The mixed tokens are tokens that combine letters, digits, hyphens, full stops in short forms, and single quote marks without any space character (e.g., C14).
Hasoc 2020 dataset samples of tweets with their ground truth labels
The token type counts in Train and Test sets of Hasoc 2020 dataset
The last dataset of hateful and offensive speech we worked with was that of the OLID2
OLID was also provided with a predefined train and test split. The total number of tweets in the training set is 13240, out of which 4400 are labeled as hate speech (HOF or OFF), and 8840 are labeled as non-hate speech (NOT). The total number of tweets in the test set is 860, out of which 240 are labeled as HOF/OFF, and 620 are labeled as NOT. The tweets may contain #-tags, @-mentions, URLs, emojis, emoticons, and other mixed character strings. Table 4 shows the number of times each token type appears in the Olid-v1 train and test sets.
The token type counts in Train and Test sets of Olid-v1 dataset
Here, we describe the data processing pipeline used in our experiments, including the pre-processing of text (Section 3.1), the cross-validation technique we used for validating our models (and for ensembling of the trained models – Section 3.2), as well as the short description of the algorithms we applied (Section 3.3).
Text pre-processing
The first step of our classification pipeline was that of text pre-processing. This step included the tokenization of tweets, as well as the removal or replacement of certain tokens. For example, extra space characters were removed, that is, long runs of spaces were collapsed into one. Moreover, where necessary, we replaced web addresses with the URL token, and @-mentions with the token @-USER. We also removed hash characters (while retaining the text from hashtags), and emoticons. In addition, following Kovács et al. [51], we also removed emojis from the text. All other special characters (including punctuation) were retained. Moreover, we did not attempt any correction of spelling. Lastly, as we approached the task from a deep learning perspective (largely motivated by the recent success of various deep learning models in different natural language processing tasks3
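The replacement and removal rules described above can be illustrated with a minimal sketch. The regular expressions below are our own assumptions for illustration (in particular, the emoji ranges and the small set of emoticons are rough approximations), and the function name `preprocess` is hypothetical rather than taken from the original pipeline.

```python
import re

# Assumed patterns; the original pipeline may have used different rules.
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
MENTION_RE = re.compile(r"@\w+")
# Rough emoji ranges (an approximation, not the full Unicode emoji set).
EMOJI_RE = re.compile(
    "[\U0001F300-\U0001FAFF\U00002600-\U000027BF\U0001F1E6-\U0001F1FF\uFE0F\u200D]"
)
# A few common western emoticons such as :) ;-) :D :-( (illustrative only).
EMOTICON_RE = re.compile(r"(?<!\w)[:;=][\-o*']?[)\](\[dDpP/\\|]+(?!\w)")


def preprocess(tweet: str) -> str:
    """Apply the replacements and removals described in the text."""
    text = URL_RE.sub("URL", tweet)          # web addresses -> URL token
    text = MENTION_RE.sub("@-USER", text)    # @-mentions -> @-USER token
    text = text.replace("#", "")             # drop hash, keep hashtag text
    text = EMOJI_RE.sub("", text)            # remove emojis
    text = EMOTICON_RE.sub("", text)         # remove emoticons
    return re.sub(r"\s+", " ", text).strip() # collapse runs of whitespace


example = preprocess("@john  check   https://t.co/abc #MAGA :) \U0001F600 ok!")
```

Punctuation and spelling are left untouched, in line with the description above; only the listed token types are altered.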
For the training of each deep learning model, we used a variation of the regular five-fold cross-validation. This was carried out in the following manner. First, the training data in each data set was divided into five non-overlapping partitions, in a way that the class distributions in the resulting partitions were as similar to each other (as well as to the original training set) as possible. We used these partitions as validation sets, for early-stopping and hyper-parameter optimization purposes. For every partition, the instances left out of it were considered the training set corresponding to the given fold. After this partitioning, for each model used, five separate models were trained, each one optimizing its learnable parameters on the training set of a different fold, while the corresponding validation set was used for early-stopping and validation purposes. For each method, the decision to classify a tweet into the inoffensive or the hateful/offensive category was made by averaging the probabilities predicted by the five models trained on the five different folds. Here, it should be noted that, unlike in the 2021 paper of Kovács et al. [51], when incorporating data from multiple corpora, we still adhered to this k-fold partitioning. This means that when combining, for example, the HASOC 2019 and HASOC 2020 data sets, in the first fold the training set from the first fold of HASOC 2019 was combined with the training set from the first fold of HASOC 2020, and the same was true for all five folds. However, the validation set to be used depended only on the data set we were originally examining. That is, if we were examining the effect of adding the HASOC 2020 data to the HASOC 2019 data, then we used only the validation set of the HASOC 2019 fold for early-stopping and evaluation purposes.
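The partitioning scheme described above can be sketched as follows. This is a minimal illustration under our own assumptions: the function names `stratified_folds` and `fold_splits` are hypothetical, and the round-robin assignment is only one simple way of keeping class distributions similar across partitions.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=5, seed=0):
    """Return k disjoint lists of indices with near-equal class proportions."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the instances of each class round-robin over the folds.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


def fold_splits(core_labels, extra_labels, k=5, seed=0):
    """Yield (core_train, extra_train, core_valid) index lists per fold.

    The i-th training portion of the additional corpus is paired with the
    i-th training portion of the core corpus, while validation uses the
    core corpus only.
    """
    core = stratified_folds(core_labels, k, seed)
    extra = stratified_folds(extra_labels, k, seed)
    for i in range(k):
        core_train = [j for f in range(k) if f != i for j in core[f]]
        extra_train = [j for f in range(k) if f != i for j in extra[f]]
        yield core_train, extra_train, core[i]


labels = ["HOF"] * 10 + ["NOT"] * 15
folds = stratified_folds(labels)
splits = list(fold_splits(labels, ["HOF"] * 5 + ["NOT"] * 5))
```

Keeping the validation partition restricted to the core corpus means that early stopping is always driven by the data set under examination, regardless of how much external data is added.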
Deep learning methods
Motivated by the successful application of deep learning for text classification problems [96], as well as for the problem of detecting offensive or hateful language [8,118], including challenges built around this task [62,101,110], we experimented with various deep learning techniques. Similarly to Alonso et al. [5], we applied a deep neural network model that combined convolutional layers with recurrent layers. We also carried out experiments using the FastText classifier. Lastly, we trained a BERT variant, namely RoBERTa, to attain state-of-the-art results on both HASOC data sets. We briefly introduce the aforementioned methods in the sections below.

Architecture of the CNN-BiLSTM model used in our experiments.
When participating in the HASOC 2019 competition, a hybrid Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) network was used by Alonso et al. [5], and later this model was improved by Kovács et al. [51]. In our work we extended their experiments with meta-parameter optimisation. As Fig. 1 shows, the first part of the model is an input layer, where the batch size is defined using the BatchSize parameter. Here, we also define the maximum number of tokens we retain in a tweet as 128. In order to bring tweets to a format that can be used in the embedding layer in a later stage, we first had to apply the
To optimize the value of the meta-parameters discussed, we drew parameter combinations randomly, and with each combination, trained independent networks on the different folds at least three separate times. Based on the resulting macro
Potential values for the meta-parameters of the CNN-LSTM architecture
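The random drawing of meta-parameter combinations, with each combination evaluated several times, can be sketched as below. The search space shown is purely hypothetical (the names and values are placeholders, not the grid used in our experiments), and `evaluate` stands for a user-supplied function that would train the networks on the folds and return a validation score.

```python
import random

# Hypothetical search space; names and values are placeholders only.
SEARCH_SPACE = {
    "batch_size": [16, 32, 64],
    "conv_filters": [64, 128, 256],
    "lstm_units": [32, 64, 128],
    "dropout": [0.1, 0.3, 0.5],
}


def sample_configuration(space, rng):
    """Draw one value per meta-parameter uniformly at random."""
    return {name: rng.choice(values) for name, values in space.items()}


def random_search(space, evaluate, n_trials=20, repeats=3, seed=0):
    """Return the configuration with the best mean score over `repeats` runs."""
    rng = random.Random(seed)
    best_config, best_score = None, float("-inf")
    for _ in range(n_trials):
        config = sample_configuration(space, rng)
        # Each combination is evaluated several independent times.
        scores = [evaluate(config, run) for run in range(repeats)]
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_config, best_score = config, mean_score
    return best_config, best_score


def dummy_evaluate(config, run):
    """Stand-in for training and validating a model (illustration only)."""
    return config["batch_size"]


best, score = random_search(SEARCH_SPACE, dummy_evaluate, n_trials=10)
```

Averaging over repeated runs reduces the chance of selecting a combination that performed well only because of a favourable random initialisation.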
Another text classification method examined in our experiments was that of FastText [49]. For this, we used the implementation available on GitHub. During our experiments using the FastText models, we applied the following scheme for training and validation purposes. We trained each model using the built-in parameter optimization (invoking the autotune-validation option – with the validation set of the fold – and the autotune-duration option – setting it to 600, resulting in a parameter-optimization process limited to ten minutes). The meta-parameter optimization was carried out in a way that the meta-parameters were chosen based on the Besides the parameter controlling the duration of parameter-optimization for each model, we set only one parameter as constant, that is the dimensionality of embeddings. However, we did so only in case we were using the pre-trained WikiNews word embedding6
A collection of a million word vectors, trained using the Wikipedia 2017 corpus, the statmt.org news dataset, and the webbase dataset of UMBC, accessible at:
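To make the training scheme above more concrete, the following sketch formats examples in the `__label__` input format expected by the fastText supervised mode and assembles a command line using the autotune options mentioned in the text. The helper names and all file paths are hypothetical, and the default dimensionality of 300 is an assumption based on the name of the WikiNews-300D vectors.

```python
def to_fasttext_line(label: str, text: str) -> str:
    """Format one example in fastText's supervised input format."""
    return f"__label__{label} {' '.join(text.split())}"


def autotune_command(train_path, valid_path, model_prefix,
                     duration_s=600, pretrained_vectors=None, dim=300):
    """Build a fastText command line with autotuning on a validation file."""
    cmd = ["fasttext", "supervised",
           "-input", train_path,
           "-output", model_prefix,
           "-autotune-validation", valid_path,
           "-autotune-duration", str(duration_s)]
    if pretrained_vectors is not None:
        # The embedding dimensionality is fixed only when pre-trained
        # vectors are supplied, so that it matches their size.
        cmd += ["-pretrainedVectors", pretrained_vectors, "-dim", str(dim)]
    return cmd


line = to_fasttext_line("HOF", "some  offensive\ttweet\n")
# Hypothetical file names, used for illustration only.
cmd = autotune_command("fold1.train", "fold1.valid", "fold1_model",
                       pretrained_vectors="wikinews-300d.vec")
```

The resulting list could be passed to `subprocess.run` on a machine where the fastText binary is installed; one such command would be issued per fold.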
One of the methods utilized in our study was the RoBERTa [57] model, an offspring of the BERT [21] family of models, which in turn is part of the Transformers [99] family tree. Transformers first appeared in Vaswani et al. [99], where they were showcased as evidence that recurrent cells are not necessary for an encoder-decoder architecture [9]. For this purpose, Vaswani et al. [99] demonstrated an improvement in machine translation by utilizing attention mechanisms without relying on recurrent neural networks, paving the way for more transformer architectures, with BERT [21] as the main outcome. This achievement laid the foundation for the appearance of several variants, including DistilBERT [90], ALBERT [55], TinyBERT [48], and RoBERTa [58] (used here). Building on previous research with DistilBERT and RoBERTa for the OffensEval 2020 hate speech detection task [112], we chose to use RoBERTa [57] (as available in the SimpleTransformers library [106]) due to its observed performance on the task.
In our experiments, we fine-tuned the RoBERTa model using the cross-validation method described in Section 3.2. As training data, we used both HASOC data sets. We did so by using the default meta-parameters presented in [106], except for the number of epochs (which was set to a maximum of 20, but as we also applied early stopping using the development set, we never reached this limit), and the learning rate, which was equal to 1e-5 and 4e-5 (HASOC 2020). Results of these experiments are also described in Section 4.
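The deviations from the default settings described above could be collected in an argument dictionary such as the one sketched below. The function name is hypothetical; the keys follow the argument names of the SimpleTransformers library as we understand them, and all other settings are assumed to stay at their defaults.

```python
def roberta_training_args(learning_rate: float, max_epochs: int = 20) -> dict:
    """Build a SimpleTransformers-style argument dictionary."""
    return {
        "num_train_epochs": max_epochs,    # upper limit; early stopping ends sooner
        "learning_rate": learning_rate,
        "evaluate_during_training": True,  # gives early stopping a signal to monitor
        "use_early_stopping": True,
    }


# Example: the learning rate reported for HASOC 2020.
hasoc2020_args = roberta_training_args(learning_rate=4e-5)
# With the library installed, such a dictionary could be passed as
# ClassificationModel("roberta", "roberta-base", args=hasoc2020_args);
# the "roberta-base" checkpoint is an assumption, not stated in the text.
```

One such model would be fine-tuned per fold, with the validation partition of that fold serving as the development set for early stopping.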
Classifier ensemble
Following the approach of the 2019 SemEval winners [93], as well as the approach of Kovács et al. [51], we have also experimented with classifier ensembles. This was carried out similarly to how we combined versions of the same model trained on different folds, that is, by averaging the probabilities predicted by the different models.
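The probability averaging used both across folds and across different classifiers can be sketched as follows. The function names are hypothetical, and the decision threshold of 0.5 as well as the label strings are assumptions made for the sake of illustration.

```python
def average_probabilities(prob_lists):
    """Average per-tweet probabilities predicted by several models."""
    n_models = len(prob_lists)
    return [sum(column) / n_models for column in zip(*prob_lists)]


def classify(prob_lists, threshold=0.5):
    """Label each tweet from the averaged probability of the offensive class."""
    averaged = average_probabilities(prob_lists)
    return ["HOF" if p >= threshold else "NOT" for p in averaged]


# Two models (or two folds), each predicting P(HOF) for two tweets.
fold_probs = [[0.0, 1.0], [0.5, 0.5]]
averaged = average_probabilities(fold_probs)
decisions = classify(fold_probs)
```

For an ensemble of different classifiers, the same function can be applied twice: first over the folds of each classifier, then over the resulting per-classifier averages.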
Results and discussion
Here, the results attained on the two HASOC corpora are listed and discussed. To facilitate the comparison of the performance of various models in different circumstances, the section is organized as follows. In Section 4.1 we discuss the results obtained using the hybrid CNN-BiLSTM model, first on the HASOC 2019 dataset, then on its 2020 counterpart. Then, in Section 4.2 we do the same with the results attained using the FastText classifiers. Lastly, in Section 4.3, we list the results provided by the RoBERTa model.
Each section will be structured in the same manner. First, results (both average macro-
Meta-parameter values chosen for CNN-LSTM architecture on the different HASOC datasets
In this section we discuss the classification scores attained using a simple hybrid CNN-BiLSTM model on the two HASOC corpora. To emphasize the effect of additional data used, in the case of both data sets, we focus on results attained using the meta-parameters that were optimized using the core data set only. These meta-parameters are listed above in Table 6 for both the HASOC 2019, and the HASOC 2020 data sets. As can be seen in Table 6, for both data sets, we ended up with a model of similar (small – below 15 million trainable parameters) size, sharing some meta-parameter values (including the size of the vocabulary, the max pooling parameter, and the number of BiLSTM layers used).
Results for each data set are listed in their respective subsection below. Here, for each meta-parameter setting, the five-fold cross-validation training was carried out five times; thus the results reported in the tables in the subsections below are averages over five independent runs. Training five independent ensembles also allowed us to examine the significance of the improvements attained, which we did using a two-tailed heteroscedastic t-test.
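A heteroscedastic (unequal-variance) t-test of the kind mentioned above is commonly computed as Welch's t-test. The sketch below, with the hypothetical function name `welch_t`, computes only the test statistic and the Welch-Satterthwaite degrees of freedom using the standard library; the sample scores are invented for illustration and are not results from our experiments.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    # Squared standard errors of the two sample means.
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Invented scores of five runs for two hypothetical settings.
t_stat, dof = welch_t([1, 2, 3, 4, 5], [2, 3, 4, 5, 6])
```

The two-tailed p-value follows from the t distribution with the computed degrees of freedom; where SciPy is available, `scipy.stats.ttest_ind(a, b, equal_var=False)` performs the whole test in one call.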
CNN-LSTM results on the HASOC 2019 test set (scores are the average of five models; the best results, and those not significantly different –
– are emphasized in bold)
Results attained on the test set of HASOC 2019 are listed in Table 7. As can be seen in Table 7, the addition of the smaller (HASOC 2020) data set already significantly increased both the macro, and the weighted
CNN-LSTM results on the HASOC 2020 test set (scores are the average of five models; the best results, and those not significantly different –
– are emphasized in bold)
When looking at the results attained on the HASOC 2020 listed in Table 8, one can see that the resulting classification scores are markedly higher, despite the best reported result in the original competition [60] being only 0.51 [69]. Another observation one can make is that the macro and weighted
We can also see in Table 8 that with each data set added, the classification scores improve. The bigger the added data set is (in terms of the number of tweets contained), the bigger the improvement in the classification scores. Lastly, we note that in the case of HASOC2020, when both additional data sets (HASOC2019 and OLID) were added to the training data, the classification scores improved compared to the case where only one of the additional data sets was leveraged. This improvement, however (when the results in the last line are compared to those presented in the preceding line), is not significant. Moreover, the improvement attained by adding the HASOC2019 data to the HASOC2020 data was also not significant. With the addition of OLID data to the training set, however, we attained significant improvements in the resulting classification scores.
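For reference, the macro and weighted averages of per-class scores can be computed as in the sketch below. We assume here that the underlying per-class measure is the F1-score, and the function names and the toy labels are hypothetical.

```python
from collections import Counter


def f1_per_class(y_true, y_pred):
    """Compute the F1-score of each class from true and predicted labels."""
    scores = {}
    for cls in sorted(set(y_true)):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[cls] = 2 * tp / denom if denom else 0.0
    return scores


def macro_and_weighted_f1(y_true, y_pred):
    """Return the unweighted and the support-weighted mean of class F1-scores."""
    per_class = f1_per_class(y_true, y_pred)
    support = Counter(y_true)
    macro = sum(per_class.values()) / len(per_class)
    weighted = sum(per_class[c] * support[c] for c in per_class) / len(y_true)
    return macro, weighted


# Toy example with perfectly balanced classes.
y_true = ["HOF", "HOF", "NOT", "NOT"]
y_pred = ["HOF", "NOT", "NOT", "NOT"]
macro, weighted = macro_and_weighted_f1(y_true, y_pred)
```

When the classes are balanced, as in this toy example, the two averages coincide; they diverge as the class distribution becomes more skewed.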
CNN-LSTM meta-parameter values for the models that performed best on HASOC 2020 (based on the average performance on the validation set), when trained using all three data sets
Thus we decided to exploit the ensembling already present in the 5-fold cross-validation approach applied here, in the following manner. When combining the probabilities from the five models trained on the five different folds, we made sure that the predicted probabilities for each fold came from a model that we trained using the meta-parameter combination which performed best on average for the given fold. Using this method, we were able to avoid entirely dismissing models that did not perform significantly worse than the best performing model. Moreover, we ensured that for the ensembling step, there would be an increased heterogeneity in the models combined.
We have evaluated the resulting ensemble on the test set of the HASOC 2020 corpus. The models on average (based on the average of ten independent runs) attained a macro
In this section we report and discuss our results attained using the FastText classifier. Results here are presented in a similar manner as in the previous section. One difference, however, is that for each data set combination, an additional line is present in the tables. This line represents the results attained when, in conjunction with the FastText classifier, we also use a pretrained Word2Vec word representation.
FastText results on the HASOC 2019 test set (scores are the average of five models; the best results, and those not significantly different –
– are emphasized in bold)
The average classification scores and deviations resulting from our experiments on the HASOC 2019 data set using FastText are reported in Table 10. Here, our discussion is two-fold. First, we discuss the effect of using the WikiNews-300D word embeddings. This is followed by a short discussion of the effect of additional data sets on our results. As can be seen in Table 10, regardless of the data sets used for training, the resulting classification scores on the HASOC 2019 test set are improved by the inclusion of the WikiNews-300D word vector representations. This improvement is significant (with
When discussing the effect of additional data sets, we can see in Table 10 that, regardless of whether we use WikiNews-300D or not, introducing a new data set in any setting improves the classification performance of our models. The difference lies in whether this improvement is significant or not. Over the basic HASOC 2019 data set, the introduction of any additional data results in significant improvements (with
It should be noted here that the models trained using the FastText classifier were much bigger (saved models taking up gigabytes) than those resulting from the CNN-LSTM hybrid classification (models taking up around 100 megabytes). It is hence positive that by incorporating some additional training data (with fewer instances than the original training set), our hybrid model performed on the same level as the FastText classifier using HASOC 2019 data only. Moreover, when including more additional data, our CNN-LSTM models performed markedly better than FastText using HASOC 2019. It should also be noted, however, that the meta-parameter optimization for CNN-LSTM models took significantly longer than the ten minutes used by FastText. To examine how this affects the results, we reran the experiments whose results are reported in the last line of Table 10 with doubled meta-parameter optimization time. The resulting classification scores (an average macro-
FastText results on the HASOC 2020 test set (scores are the average of five models; the best result, and results not significantly different –
– are emphasized in bold)
In this section we present our results similarly to those presented in the previous section. Classification scores of our experiments are listed in Table 11. Regarding the use of WikiNews-300D, when using any additional data sets, the involvement of these pre-trained word vector representations significantly improved the resulting classification scores. When the additional data set was OLID, this improvement was significant with
The effect of additional data sets is best discussed in two parts. When WikiNews-300D is not employed, we only find a significant improvement (with
What is more, on the HASOC 2020 data set, the simple hybrid CNN-LSTM method outperformed not only the best-performing team in the original competition [69], but the FastText classifier as well.
RoBERTa
The cross-validation experiments using RoBERTa on the HASOC 2019 data set have already been carried out by Kovács et al. [51] and by Alonso et al. [6]. For this reason, we do not repeat these experiments here. These researchers found that applying the same five-fold cross-validation method using RoBERTa led to state-of-the-art results on the HASOC 2019 data set, while Alonso et al. [6] further improved these results with the addition of part of the OffensEval [112] training set. This indicates that not only was the use of the pre-trained RoBERTa beneficial, but even in that case, adding further labelled data to the process improved the results.
We did, however, carry out the cross-validation experiments on HASOC 2020. Due to the size of the model (more than one hundred million trainable parameters), we trained the five different folds only once. Our results showed that individual RoBERTa models trained on the different folds attained an average classification score of 0.8994 (both for the macro and weighted
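The difference between macro and weighted averaging of per-class scores can be illustrated with a minimal sketch. The function names, per-class scores, and class sizes below are hypothetical and chosen only for illustration; they are not results from our experiments.

```python
from typing import Sequence


def macro_average(per_class_scores: Sequence[float]) -> float:
    """Unweighted mean of the per-class scores (every class counts equally)."""
    return sum(per_class_scores) / len(per_class_scores)


def weighted_average(per_class_scores: Sequence[float],
                     class_sizes: Sequence[int]) -> float:
    """Mean of the per-class scores, weighted by the number of instances per class."""
    total = sum(class_sizes)
    return sum(s * n for s, n in zip(per_class_scores, class_sizes)) / total


# Hypothetical two-class example with unbalanced classes.
scores = [0.9, 0.7]
sizes = [300, 100]
macro = macro_average(scores)
weighted = weighted_average(scores, sizes)
print(f"macro: {macro:.4f}, weighted: {weighted:.4f}")
```

In this sketch, the two averages coincide whenever the classes are of equal size or all per-class scores are equal.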
Classifier ensemble
The classifier combination experiments described in Section 3.4 were carried out on the HASOC 2020 data set. We did so using the combination of the RoBERTa model (trained on the HASOC 2020 data set alone) with the optimized hybrid CNN-LSTM model (trained on the combination of all three data sets, as discussed in detail in the last paragraphs of Section 4.1.2). As we had ten trained models for the CNN-LSTM hybrid, we did so ten times; the final result reported is thus the average of ten independent results. When evaluating the resulting classifier ensemble on the HASOC 2020 test set, we attained a macro, and weighted
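The averaging over repeated ensembles can be sketched as follows. The combination rule of Section 3.4 is not reproduced here: the sketch assumes, purely for illustration, that the two classifiers are combined by averaging their class probabilities, and it uses accuracy as a placeholder metric. All function names and numbers are hypothetical.

```python
from statistics import mean
from typing import List, Sequence


def average_probabilities(p_a: Sequence[float], p_b: Sequence[float]) -> List[float]:
    """Assumed combination rule: element-wise mean of two class-probability vectors."""
    return [(a + b) / 2 for a, b in zip(p_a, p_b)]


def predict(probabilities: Sequence[float]) -> int:
    """Index of the most probable class."""
    return max(range(len(probabilities)), key=lambda i: probabilities[i])


def ensemble_accuracy(roberta_probs, cnn_lstm_probs, labels) -> float:
    """Accuracy of one RoBERTa + one CNN-LSTM ensemble (placeholder metric)."""
    hits = sum(
        predict(average_probabilities(r, c)) == y
        for r, c, y in zip(roberta_probs, cnn_lstm_probs, labels)
    )
    return hits / len(labels)


def mean_ensemble_score(roberta_probs, cnn_lstm_runs, labels) -> float:
    """Pair the single RoBERTa output with each CNN-LSTM run and average the scores."""
    return mean(ensemble_accuracy(roberta_probs, run, labels) for run in cnn_lstm_runs)


# Toy data: three posts, two classes, two CNN-LSTM runs (hypothetical numbers).
labels = [0, 1, 1]
roberta = [[0.8, 0.2], [0.4, 0.6], [0.6, 0.4]]
runs = [
    [[0.7, 0.3], [0.3, 0.7], [0.2, 0.8]],  # outweighs RoBERTa's error on the third post
    [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]],  # shares RoBERTa's error on the third post
]
score = mean_ensemble_score(roberta, runs, labels)
```

With ten CNN-LSTM runs instead of two, `mean_ensemble_score` would return the average of ten independent ensemble results, mirroring the reporting scheme described above.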
Analysis
In this study, we carried out extensive experiments to examine the effect that leveraging additional resources has on the detection of offensive speech in social media posts. Our aim, however, extends beyond what could be summarized as discussing whether “more data is better”. To this end, it is beneficial to further analyse our experiments and results, which examined leveraging three sources of external data: namely, pre-trained transformer models, pre-trained word representations, and data sets resulting from collection carried out based on similar guiding principles.
Utility of pre-trained transformers
Out of the three sources of additional data leveraged, the use of pre-trained transformers is the one whose analysis we discuss the least, as the use of these models for NLP tasks is well researched [107,108], including the effect of different pre-training objectives [56]. The inclusion of these models in our experiments was mainly motivated by three considerations: first, to contribute to our ability to present state-of-the-art results for the given task; second, to show the comparative performance attained using the CNN-LSTM model (at comparable trainable parameter counts); and lastly, to demonstrate that RoBERTa models can still benefit from the contribution of the aforementioned CNN-LSTM models when those models leverage the right data (see our ensemble results in Section 4.4).
Utility of pre-trained word representations
What we consider an important take-away for pre-trained word representations is that for each combination of data sets used in the FastText experiments, the addition of WikiNews-300D further improved the classification scores; moreover, in 6 out of 8 cases, this improvement was significant (with
WikiNews-300D is, however, only one embedding provided for FastText. Thus, to strengthen our conclusions, we repeated the HASOC 2020 experiments with word vectors trained on the CommonCrawl
FastText results on the HASOC 2020 test set (scores are the average of five models; the best result, and results not significantly different –
FastText results on the HASOC 2019 test set (scores are the average of five models; the best result, and results not significantly different –
Table 13 presents the results attained when repeating the FastText experiments on HASOC 2019 using CommonCrawl. Here, we once again can see that the addition of CommonCrawl in most cases improves the performance, and that the best performance is attained with the inclusion of the three datasets as well as the embeddings from CommonCrawl [67].
When discussing the utility of additional labeled data sets, it is beneficial to discuss FastText and CNN-LSTM separately. The reason is that for FastText, the use of the pre-existing package meant that the meta-parameters of the models were fine-tuned for each combination of data sets, whereas for the CNN-LSTM model, we used for each combination of data sets the same meta-parameter settings that we found optimal for the core data set (be that HASOC 2019 or HASOC 2020) alone.
FastText
Upon reviewing the results attained using FastText on the HASOC 2019 data set (see Section 4.2.1), we find that the addition of each data set improved the results, even if not every improvement was significant. We can also see that the larger the added data set was, the larger the improvement was (e.g. the addition of HASOC 2020 meant an additional
Meta-parameter values chosen on the HASOC 2019 data set for CNN-LSTM architecture using the different data set combinations for training
Results attained on CNN-LSTM models with optimized parameters on the HASOC 2019 test set (scores are the average of five models; the best results, and those not significantly different –
As we saw in Tables 7 and 8, the best results were attained when adding the larger (OLID) data set, and the inclusion of an additional data set on top of that did not significantly improve the results (in Table 7, it even worsened the classification scores, albeit not significantly). In Section 4.1.1 we hypothesised that this could be due to having used the same meta-parameters for all data set combinations. To test this hypothesis, we repeated the experiments on the different data sets, first optimizing the meta-parameters for each data set combination. The meta-parameters found are listed in Table 14, while the resulting
We repeated the same experiments for the HASOC 2020 data set, and found very similar results (see Table 16 for the meta-parameters and Table 17 for the classification scores). Again we find that when the meta-parameters are optimized, the best results are attained when all data sets are combined for the training of the CNN-LSTM networks. Moreover, when using similar sizes of data set combinations, the meta-parameters selected as performing best (during cross-validation) are similar.
Meta-parameter values chosen on the HASOC 2020 data set for CNN-LSTM architecture using the different data set combinations for training
Results attained on CNN-LSTM models with optimized parameters on the HASOC 2020 test set (scores are the average of five models; the best results, and those not significantly different –
Overall, what we can say regarding the utility of additional labeled data sets for the classification of offensive speech is that with the right selection of data sets (based on their descriptions), we improved the performance of the CNN-LSTM model with the inclusion of additional data for each data set combination, while attaining improved or not significantly different classification scores with FastText. Given the circumstances, we can only hypothesize on how to avoid worsening the performance: through the careful selection of data sources and the use of well-selected validation data. To eliminate the need for careful selection, however, our goal is to automate the inclusion of additional data (as is also discussed in Section 6).
To provide a visual aid for the summary above, we have created a plot to visualize the utility of each data set (see Fig. 2). Here, for each data set, we compared the results of experiments where the only difference was the presence or absence of the data set in question. We did so for experiments with CNN-LSTM where meta-parameters were optimized for the given data set combination (Tables 15 and 17), as well as for all FastText experiments (Tables 10, 11, 12 and 13). Figure 2 shows for each data set in what percentage of the cases the addition of the data set in question led to:
significantly improved classification scores (both macro and weighted);
classification scores not significantly different from those achieved without the additional data set.
As can be seen in Fig. 2, in our experiments, the addition of the HASOC 2019 data set significantly improved the performance 37.50% of the time (3 out of 8 cases), while 62.50% of the time there was no significant difference in performance. Similarly, the addition of the HASOC 2020 data set significantly improved the performance 62.50% of the time (5 out of 8 experiments). The addition of the largest (OLID) data set, however, improved the performance significantly 100% of the time (16 times out of the 16 pairs of experiments examined).
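The percentages above follow directly from the reported counts, as the short sketch below shows. Only the counts are taken from the text; the variable and function names are our own illustrative choices.

```python
from typing import Dict, Tuple

# (significant improvements, paired experiments) per added data set,
# as reported in the text above.
counts: Dict[str, Tuple[int, int]] = {
    "HASOC 2019": (3, 8),
    "HASOC 2020": (5, 8),
    "OLID": (16, 16),
}


def percentage(part: int, whole: int) -> float:
    """Share of `part` in `whole`, expressed in percent."""
    return 100.0 * part / whole


shares = {name: percentage(sig, total) for name, (sig, total) in counts.items()}
for name, share in shares.items():
    print(f"{name}: significant improvement in {share:.2f}% of paired experiments")
```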

The utility of each dataset in several experiments where dataset addition improved the performance significantly/insignificantly.
Here, following the work of Alonso et al. [5,6] and Kovács et al. [51], we have examined what effect the use of various additional data sources has on the offensive language classification problem, using the two concluded HASOC competitions. This work, however, goes beyond [5,6,51] by (i) adding new data, (ii) having a more complete search space of datasets, word representations, and machine learning models (while previous work only investigates some of these aspects), and (iii) achieving state-of-the-art results on HASOC 2020. The additional dataset added was HASOC 2020 [60], which was made available after the initial submission of the papers listed above. The addition of this dataset enabled us to examine the effect of using data collected by the same researchers at a different time. It also enabled us to examine a more complete search space by extending the number of possible combinations. Another way for us to extend the search space was by including an additional pre-trained word-vector representation trained on CommonCrawl data [67]. The results of experiments carried out in this search space led to state-of-the-art results on the HASOC 2020 dataset, and demonstrated the benefits of using additional labelled corpora. This was the case both when the data collection was carried out by the same people and when it was carried out by different people, but based on similar principles. In fact, in most cases better results were achieved when data sets collected by different researchers were combined. One potential explanation for this may be the difference in the size of the data sets.
In the future, our work can be extended to additional labelled data sets (since the initial submission of our paper, HASOC 2021 [63] has been organised and concluded, and at the time of the submission, the organisation of HASOC 2022
