XRBi-GAC: A hybrid deep learning framework for multilingual toxicity detection

Abstract

Social media platforms allow people across the globe to share their thoughts and opinions and to communicate conveniently with each other. Despite these advantages, social media is also misused by some users for hate-mongering through toxic and offensive comments. Most earlier toxicity detection methods focus on the English language, and there is a lack of research on low-resource languages and multilingual text data. We propose the XRBi-GAC framework, comprising XLM-RoBERTa, Bi-GRU with self-attention and capsule networks, for multilingual toxic text detection. A loss function is also presented, which fuses the binary cross-entropy loss and focal loss to address the class imbalance problem. We evaluated the proposed framework on two datasets, namely, the Jigsaw Multilingual Toxic Comment dataset and the HASOC 2019 dataset, and achieved F1-scores of 0.865 and 0.829, respectively. The experimental results show that the proposed framework outperforms the state-of-the-art multilingual models XLM-RoBERTa and mBERT on both datasets, which indicates the versatility and robustness of the proposed XRBi-GAC framework.

Keywords

Toxicity, multilingual text, XLM-RoBERTa, Bi-GRU, self-attention, capsule network

1 Introduction

Online social media platforms are undoubtedly among the most significant technological advancements of the 21st century and have had a tremendous cultural impact on people. In this era of social computing, interpersonal communication is intensifying, mainly through social media platforms and chat forums. Microblogging platforms allow individuals worldwide to express and share their opinions instantly and widely in various languages. The advantages of these platforms include the extensive diffusion of content across geographic borders and the facilitation of interactions and exchanges in numerous languages, largely unrestricted by physical barriers other than infrastructure. On the other hand, social media has also turned into a platform for hate-mongering and toxic language, since users feel shielded in some way by their virtual identities. Users might be intimidated on the basis of their political ideas, religious convictions, ethnicity, or colour. Such rhetoric cannot be tolerated on social media since it may lead to anxiety, depression, suicides, and societal disorder. Given the massive volume of information created each day on social media in different languages, it is impossible for humans to manually detect toxic texts containing profanity and vulgarity. Therefore, there is a need to develop a model to detect toxicity in multilingual text. As of mid-2022, approximately 4.70 billion people were using social media around the world, that is, 59% of the world's population. The number of social media users continued to increase over the preceding year, with 227 million new users joining. That is an average of more than seven new users every second, which works out to a growth rate of 5.1% per year [1]. At the beginning of 2021, Google Search supported 149 languages, while Bing supported 40. Twitter currently supports 45 languages, Facebook supports over 100, and Instagram supports 90.
Most research on text has focused on English, which is resource-rich, while much less attention has been given to other languages. This disparity is mainly attributable to the lower quality, quantity, or availability of training datasets and corpora for other languages. Progress for other languages is frequently impeded by a lack of resources or by restricted access to them. Therefore, this study aims to develop a deep learning framework that addresses this multilingual problem for toxic text detection. In our proposed model, we leverage the pre-trained XLM-RoBERTa [2] model introduced by Facebook as a base because it is trained on an enormous multilingual CommonCrawl corpus. It improves cross-lingual transfer and is one of the finest pre-trained models for classification. The proposed XLM-RoBERTa Bidirectional GRU Attention Capsule Network (XRBi-GAC) model exploits cross-lingual transfer learning for toxicity detection and performs better than the state-of-the-art models on multilingual text.

Researchers have worked on toxic comment detection using conventional machine learning models, which require manual feature engineering with features like n-grams, tf-idf, bag-of-words, part-of-speech (POS) tags, etc. [3–6]. These feature-based learning models performed satisfactorily in numerous text classification tasks. However, the reliability of text features heavily depends on the domain knowledge of the model developer, which is typically challenging to obtain. On the other hand, deep learning models automatically extract features from text data without the need for manual feature engineering and perform efficiently. In recent years, the arrival of the Transformer language model [7] led to the path-breaking pre-trained model BERT (Bidirectional Encoder Representations from Transformers) [8], and the era of transfer learning began in natural language processing, which is crucial for several tasks like translation, summarization, classification, chatbots, etc. Many standard models have been developed on the concept of BERT, like DistilBERT [9], RoBERTa [10], ALBERT [11], etc. Recently, the focus has shifted to multilingual text with advances in multilingual language models such as multilingual BERT (mBERT) and XLM-RoBERTa, which are trained on enormous datasets in more than 100 languages and show promising results in cross-lingual transfer as well. Identifying toxicity in multilingual text is a challenging task. Firstly, language differences create a barrier that is difficult to bridge for models developed in a monolingual environment. Secondly, in less commonly used languages, there is often a paucity of annotated training data. The datasets used in this study suffer from such resource limitations.

The following are the primary contributions of this work:

We develop a model named XRBi-GAC for toxicity detection in multilingual text data. It leverages the word representations from the pre-trained layers of XLM-RoBERTa, which are fed to the bidirectional gated recurrent unit for capturing better connotations from both directions.

A self-attention mechanism is incorporated into Bi-GRU to capture vital information, followed by a capsule network to get a more refined semantic representation using a dynamic routing approach, which is then used to classify the comments.

We assess the XRBi-GAC on two publicly available benchmark datasets: the Jigsaw Multilingual Toxic Comment dataset and HASOC 2019 dataset.

The training set of each dataset contains data only in the English language, and the model is evaluated on multilingual text data. So, the cross-lingual learning approach is utilized in the model, which is a type of transductive transfer learning.

We use the fusion of binary cross-entropy loss and focal loss to address the imbalance problem in the dataset.

2 Related work

The ubiquity of toxicity on social media has increased the importance of toxicity detection research in recent years. Prior efforts on toxicity identification used machine learning-based classifiers, but the focus of research has now shifted towards deep learning methods. In [12], an annotated corpus of hate speech with context information is introduced. A Logistic Regression (LR) model with context features and a Long Short Term Memory (LSTM) model with learning components for context are discussed for automatic hate speech identification [13]. In [14], a hybrid model consisting of a Convolutional Neural Network (CNN), a Bidirectional LSTM (Bi-LSTM) and a Bidirectional GRU (Bi-GRU) is discussed that uses data from Wikipedia talk pages to detect several types of toxicity generated on online platforms. An efficient augmentation method is introduced, which integrates unique words and random masking to strengthen the proposed model. In [15], the performance of different models, including Logistic Regression, LSTM, Recurrent Neural Network (RNN), CNN, Bi-LSTM, and GRU with attention and word embedding techniques, is compared for toxicity identification. Among all the classifiers, a Bidirectional GRU network with an attention layer works best. Earlier, most research was done on the English language, but in recent years studies have been conducted on other languages as well. In [16], different neural network architectures are evaluated for identifying offensive language in German, English, and Hindi. The results show that fine-tuning the BERT framework surpasses all other methods. In [17], two methods are discussed for cross-lingual language models: an unsupervised method that relies only on monolingual data, and a supervised method that employs parallel data with a new cross-lingual language model objective.

The focus of research has now shifted from monolingual towards multilingual models because multilingual models are more robust and versatile and can handle a variety of languages. In contrast, the performance of a monolingual model is confined to a single language.

Table 1
Relevant work

Work	Methods Applied	Results
[12]	LR+context features and NN+learning components	AUC- 0.804
[13]	Bi-LSTM and Bi-GRU	F1- 0.6992
[14]	Ensemble of CNN, Bi-LSTM and Bi-GRU	F1- 0.828
[15]	Bi-GRU+attention	F1- 0.791
[16]	GRU+capsule, LSTM+capsule+attention, BERT	F1- 0.8030
[19]	ERNIE and XLM-R	F1- 0.8604
[20]	XLM-R and mBERT	F1- 0.833
[21]	Fine-tuning of XLM-R	F1- 0.822
[22]	LSTM, CNN, MLP	F1- 0.81
[23]	mBERT and XLM-R	F1- 0.8449
[24]	XLM-R and MuRIL	F1- 0.9
[25]	Transformer, CNN, Bi-RNN, Bi-GRU, Bi-LSTM, NCP	AUC- 0.942

In [18], the authors discuss a robust self-learning framework that combines the predictions of mBERT on unlabeled non-English data. This information is then used to fine-tune pre-trained multilingual representation models to excel in Multilingual Document Classification (MLDoc). In [19], a multilingual method using the pre-trained language models XLM-R and ERNIE is discussed. The authors propose a knowledge distillation method that is trained on soft labels generated by several supervised models. The results show that BERT-large performed better than BERT-base.

A method for classifying offensive speech in code-mixed and romanized Dravidian languages by using selective translation and transliteration is discussed in [20]. For better results, fine-tuning and ensembling of XLM-RoBERTa and mBERT have been introduced. In [21], a method for identifying offensive language in tweets in five different languages using cross-lingual inductive transfer learning is discussed. The authors use an ensemble of XLM-R (base) and XLM-R (large) cross-lingual embeddings. The top 10 closely related teams adopted the BERT, RoBERTa, XLM-RoBERTa, CNN and LSTM models for hybridization.

In [22], authors performed preprocessing of data using different machine learning and ensemble algorithms in which LR and XGBoost performed best. They also explored various word embedding techniques, and the results are fed to DNN classifiers. The combination of CNN and BERT outperformed all other methods. In [23], a fusion-based method for detecting toxicity in multilingual text in uneven sample distribution is discussed. The authors used mBERT and XLM-RoBERTa for pretraining, which incorporate vital information in their models. The authors have used six base models and then combined them to produce three fusion models.

In [24], a model is discussed that flags toxicity and provides users with a safer platform by combining the XLM-RoBERTa and MuRIL frameworks. To tackle code-mixed classification challenges, the authors emphasise the use of multilingual transformer-based pre-trained and fine-tuned models. In [25], authors take comments from an online arena for public discourse for Georgian toxicity classification. They employed a novel NCP (Neural Circuit Policies) technique for toxic comment classification, which showed satisfactory results.

3 Dataset and methodology

We discuss the proposed XRBi-GAC model for toxicity identification in multilingual text. It comprises three parts: Datasets, Data Preprocessing, and XLM-RoBERTa Bi-GRU Attention Capsule Network (XRBi-GAC), as discussed below.

3.1 Dataset

Here, we will discuss two datasets, namely, the Jigsaw Multilingual Toxic Comment dataset and HASOC 2019 dataset, as they will be used for experimental purposes.

3.1.1 Jigsaw multilingual toxic comment dataset

This dataset is taken from Kaggle and contains comments collected from Civil Comments and Wikipedia. The training set contains English-only comments having 223549 samples with labels being toxic (0 or 1). 21384 samples, or 9.4%, are labelled as toxic. There are 8000 labelled samples in three languages in the validation set. 15.4% of the samples, or 1230, are labelled as toxic, as shown in Figure 1.

Fig. 1

Class distribution of training dataset (Left) and validation dataset (Right).

There are 3000 Turkish (tr) samples, 2500 Italian (it) samples, and 2500 Spanish (es) samples. The language distribution with class in the validation set is shown in Figure 2.

Fig. 2

Language distribution with class.

There are a total of 63,812 unlabeled samples in the test set, split between six different languages: Spanish (es) has 8,438 comments, French (fr) has 10,920 comments, Italian (it) has 8,494 comments, Portuguese (pt) has 11,012 comments, Russian (ru) has 10,948 comments, and Turkish (tr) has 14,000 comments.

The training data is skewed, with almost 90% of the comments labelled as non-toxic. Consequently, a metric like accuracy cannot properly reflect how well the minority class is detected. To tackle this challenge, a resampling method is employed to upsample the minority class, i.e., the toxic class. The F1-score is used because it balances precision and recall and improves only if the classifier correctly identifies more instances of a given class.
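The upsampling step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the helper name `upsample_minority`, the choice to balance the classes exactly, and sampling with replacement are all assumptions, since the text does not state the target ratio.

```python
import random
from collections import Counter


def upsample_minority(texts, labels, seed=0):
    """Duplicate random minority-class samples until both classes have equal size.

    Assumption: a 1:1 target ratio; the paper does not specify one.
    """
    rng = random.Random(seed)
    counts = Counter(labels)
    minority = min(counts, key=counts.get)
    majority_size = max(counts.values())
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    # Draw the missing number of minority samples with replacement.
    extra = [rng.choice(minority_idx) for _ in range(majority_size - counts[minority])]
    order = list(range(len(labels))) + extra
    rng.shuffle(order)
    return [texts[i] for i in order], [labels[i] for i in order]


# Toy data mimicking the roughly 90/10 skew of the training set.
texts = ["fine comment"] * 9 + ["toxic comment"]
labels = [0] * 9 + [1]
bal_texts, bal_labels = upsample_minority(texts, labels)
print(Counter(bal_labels))
```

Upsampling should be applied to the training split only, so that the validation and test distributions remain untouched.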

3.1.2 HASOC 2019 dataset

This dataset [26] is taken from HASOC (Hate Speech and Offensive Content Identification in Indo-European Languages), a shared task. It contains Twitter and Facebook posts in English, Hindi and German separately. These posts are to be classified into two classes: NOT and HOF. The NOT class is for posts that do not contain hateful content, and the HOF class is for posts that contain hateful and offensive content. We choose the English data for training, which contains 5852 posts. The test data is composed by combining the Hindi test data (1318 posts), the English test data (1153 posts), and the German test data (850 posts). Thus, it contains 3321 posts in Hindi, English, and German. This classification task is similar to the one in the Jigsaw Multilingual Toxic Comment dataset, which is why this dataset is chosen to assess the versatility of the proposed model.

3.2 Data preprocessing

Preprocessing is a critical step that takes the raw comments as input and converts them into a form that preserves the semantics, context and inherent linguistic information of the input comments as much as possible. In this work, preprocessing includes converting each comment to lower case and removing special characters, URLs, and emails; stop words and punctuation are preserved as they may reveal crucial information. The proposed method takes preprocessed comments as input. Table 2 shows an instance of comment preprocessing.

Table 2
An instance of preprocessing

Raw input comment	Those links are dead, but this appears to have the same content: http://www.youtube.com/watch?v=mou73QdF-NU
Preprocessed comment	those links are dead, but this appears to have the same content
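A minimal sketch of such a preprocessing function is given below, using only Python's `re` module. The regular expressions and the exact set of characters treated as "special" are assumptions made for illustration; the paper does not list them.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"\S+@\S+")
# Assumption: "special characters" are anything other than letters, digits,
# whitespace and the basic punctuation marks . , ! ? ' -
SPECIAL_RE = re.compile(r"[^\w\s.,!?'-]")


def preprocess(comment: str) -> str:
    """Lowercase a comment and strip URLs, emails and special characters."""
    text = comment.lower()
    text = URL_RE.sub(" ", text)      # remove URLs first, while they are intact
    text = EMAIL_RE.sub(" ", text)    # then email addresses
    text = SPECIAL_RE.sub(" ", text)  # then remaining special characters
    return " ".join(text.split())     # collapse runs of whitespace


raw = ("Those links are dead, but this appears to have the same content: "
       "http://www.youtube.com/watch?v=mou73QdF-NU")
print(preprocess(raw))
```

Because `\w` matches Unicode letters in Python 3, non-English characters are kept, which matters for the multilingual evaluation data.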

3.3 XLM-RoBERTa Bi-GRU attention capsule network

The proposed XRBi-GAC model, as shown in Figure 3, mainly comprises three components. The first one is the XLM-RoBERTa base model, which is fine-tuned on the training data and generates the contextual encodings of the text. The second component is the Bidirectional GRU, followed by a self-attention mechanism, which helps in capturing finer semantics and important information. The last component is a capsule network, which enhances the learned semantic information and understands it in a more refined manner. Each component is explained as follows.

Fig. 3

The proposed XRBi-GAC framework.

3.3.1 XLM-RoBERTa

XLM-RoBERTa (XLM-R) is a multilingual masked language model built on the Transformer architecture and pre-trained on 2.5 TB of CommonCrawl data that includes about 100 languages. XLM-R has achieved state-of-the-art performance in sequence labelling, cross-lingual classification and question answering, outperforming other contemporary multilingual models like mBERT and XLM as well as various monolingual models. Pre-trained XLM-RoBERTa, as illustrated in Figure 4, is fine-tuned on the labelled English training data starting from the pre-trained weights, so that the task-specific knowledge learned in English can be applied to text in multiple languages. This transfer of learning from one language to another is called cross-lingual transfer. Raw input after preprocessing is tokenized using XLMRobertaTokenizer, which uses the SentencePiece model to tokenize the text; the tokens are then converted into embeddings that combine the embedding vector and the positional encoding of each token in the sequence. The special tokens cls_token and sep_token are placed at the front and rear end of the sequence, respectively. sep_token is also used to separate two sentences in a sequence. The embeddings are then fed to the encoder layers; XLM-R has two variants, XLM-R (base) and XLM-R (large), having 12 and 24 encoder layers, respectively. Each encoder layer performs multi-head self-attention on the input and feeds its output to the next encoder layer. The last hidden state obtained from the pre-trained XLM-R consists of C, which captures the context of the entire sequence, and the time-distributed sequence t₁, t₂, ..., t_n, which is utilized in the proposed method.

Fig. 4

Fine-tuning of pre-trained XLM-RoBERTa.

3.3.2 Bidirectional gated recurrent unit

The RNN [27] model is widely used in NLP, but it suffers from the vanishing gradients problem in which RNN fails to recall initial sequence values. This recalling is critical in text classification because a few terms depend on the words that appear early in the sequence. Both LSTM [28] and GRU [29] address the vanishing gradient problem in RNN. GRU makes use of a smaller number of training parameters, less memory, and requires fewer data to generalise compared to LSTM, although LSTM works better with datasets that have longer sequences. GRU is an ideal choice for encoding social media updates because they are often shorter. GRU’s cell structure is depicted in Figure 5. It has two gates: the update gate and the reset gate. The update gate controls how much prior knowledge must be transmitted forth towards the future, while the reset gate governs how much prior knowledge should be forgotten. The GRU cell state computation formula for time t is as follows:

Fig. 5

GRU Cell Architecture

$u_{t} = \sigma (w_{u} \cdot [h_{t - 1}, i_{t}] + {bi}_{u})$ (1)

$r_{t} = \sigma (w_{r} \cdot [h_{t - 1}, i_{t}] + {bi}_{r})$ (2)

$h_{t}^{'} = \tanh (w_{h} \cdot [r_{t} * h_{t - 1}, i_{t}] + {bi}_{h})$ (3)

$h_{t} = (1 - u_{t}) * h_{t - 1} + u_{t} * h_{t}^{'}$ (4)

where w_u, w_r, w_h are weight matrices, σ is the sigmoid function, · is the dot product, * is element-wise multiplication, and bi_u, bi_r, bi_h are bias parameters. h_t is the hidden state and also the output vector, i_t is the input vector at time t, and r_t and u_t are the reset gate and update gate, respectively. h_{t-1} is the output of the previous cell. The candidate information with which the current unit is to be updated is represented by the symbol $h_{t}^{'}$.

Bidirectional [30] GRU is composed of forward and backward GRU units. It takes the output sequence from XLM-RoBERTa as input and encodes it as internal state vectors. To represent the current time step accurately, a Bi-GRU gathers information from both earlier and later time steps, as given in equations (5) and (6).

$\vec{h_{t}} = GRU (i_{t}, {\vec{h}}_{t - 1})$ (5)

$\overset{\leftarrow}{h_{t}} = GRU (i_{t}, \overset{\leftarrow}{h_{t - 1}})$ (6)

where $\vec{h_{t}}$ and $\overset{\leftarrow}{h_{t}}$ are the hidden states of the forward and backward GRU units, respectively. Compared to a unidirectional GRU, a Bi-GRU provides greater insight into the meaning and context of a statement. The hidden layer output h_t of the Bi-GRU at time t is the concatenation of the two, as given in equation (7).

$h_{t} = [\vec{h_{t}}, \overset{\leftarrow}{h_{t}}]$ (7)
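To make equations (1) to (7) concrete, the following NumPy sketch implements a single GRU step and a Bi-GRU built from two such cells. It is an illustration under stated assumptions rather than the model used in the experiments: `init_gru` is a hypothetical helper, the weights are random, the sizes are toy values, and the backward pass is assumed to read the sequence in reverse order.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def init_gru(input_dim, hidden_dim, rng):
    """Hypothetical helper: random weights w_u, w_r, w_h and zero biases."""
    shape = (hidden_dim, hidden_dim + input_dim)
    return {
        "w_u": rng.normal(0, 0.1, shape), "b_u": np.zeros(hidden_dim),
        "w_r": rng.normal(0, 0.1, shape), "b_r": np.zeros(hidden_dim),
        "w_h": rng.normal(0, 0.1, shape), "b_h": np.zeros(hidden_dim),
    }


def gru_step(p, h_prev, i_t):
    concat = np.concatenate([h_prev, i_t])
    u_t = sigmoid(p["w_u"] @ concat + p["b_u"])               # Eq. (1), update gate
    r_t = sigmoid(p["w_r"] @ concat + p["b_r"])               # Eq. (2), reset gate
    gated = np.concatenate([r_t * h_prev, i_t])
    h_cand = np.tanh(p["w_h"] @ gated + p["b_h"])             # Eq. (3), candidate state
    return (1.0 - u_t) * h_prev + u_t * h_cand                # Eq. (4), new hidden state


def run_gru(p, inputs):
    """Run one GRU over a (seq_len, input_dim) array, starting from a zero state."""
    h = np.zeros(p["b_u"].shape[0])
    states = []
    for i_t in inputs:
        h = gru_step(p, h, i_t)
        states.append(h)
    return np.stack(states)


def bi_gru(p_fwd, p_bwd, inputs):
    forward = run_gru(p_fwd, inputs)                          # Eq. (5)
    backward = run_gru(p_bwd, inputs[::-1])[::-1]             # Eq. (6), re-aligned to time order
    return np.concatenate([forward, backward], axis=-1)       # Eq. (7)


rng = np.random.default_rng(0)
seq_len, input_dim, hidden_dim = 5, 8, 4                      # toy sizes, not the paper's
tokens = rng.normal(size=(seq_len, input_dim))                # stands in for XLM-R outputs
h = bi_gru(init_gru(input_dim, hidden_dim, rng),
           init_gru(input_dim, hidden_dim, rng), tokens)
print(h.shape)
```

Each output row has twice the hidden size because the forward and backward states are concatenated.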

3.3.3 Self attention

The attention mechanism enables the output to focus on the input while generating output. In contrast, inputs can interact with each other in a self-attention paradigm, which figures out the attention of all other inputs relative to a single input. Self-attention [7] is effective in modelling sequence segment dependencies. It uses information about location and observation value, and instead of conditioning on the whole sequence, it uses pairwise comparisons that are shown by vectors for both.

The self-attention mechanism, as illustrated in Figure 6, is viewed as a query to key-value mapping. Each vector has three representations, namely query (Q), key (K), and value vector (V). When a word vector x_n seeks all of the other word vectors’ K-V pairs, including its own, Q is activated and trained. K and V are trained to provide attention values. Attention is defined as "Scaled Dot-Product Attention," which is shown in the following equation with Q, K, and V:

Fig. 6

Self-attention mechanism.

$Attention (Q, K, V) = softmax (\frac{{QK}^{T}}{\sqrt{d}}) V$ (8)

where d is the dimension of the key vectors, and the factor $\sqrt{d}$ is a scaling adjustment that prevents the inner product from becoming too large. The output sequence of Bi-GRU with self-attention is H = (H₁, H₂, ..., H_n).
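Equation (8) can be sketched in a few lines of NumPy. The projection matrices `W_q`, `W_k`, `W_v`, the random inputs and the toy dimensions are assumptions for illustration; in the framework the inputs would be the Bi-GRU hidden states.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def scaled_dot_product_attention(Q, K, V):
    """Eq. (8): softmax(Q K^T / sqrt(d)) V, with d the key dimension."""
    d = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    weights = softmax(scores, axis=-1)        # each row is a distribution over positions
    return weights @ V, weights


rng = np.random.default_rng(0)
n, d_in, d = 5, 8, 4                          # toy sizes
h = rng.normal(size=(n, d_in))                # stands in for the Bi-GRU outputs
W_q, W_k, W_v = (rng.normal(0, 0.1, (d_in, d)) for _ in range(3))
H, attn = scaled_dot_product_attention(h @ W_q, h @ W_k, h @ W_v)
print(H.shape, attn.sum(axis=-1))
```

Every row of `attn` sums to one, so each output H_i is a weighted average of the value vectors.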

3.3.4 Capsule network

A capsule is a group of neurons whose outputs indicate distinct characteristics of the same entity. The activity vector of a capsule, which describes the instantiation parameters of a certain type of entity, has all the relevant information. CapsNet [31] is a capsule network that replaces scalar-output feature detectors with vector-output capsules and max-pooling with routing-by-agreement. The CapsNet comprises two layers: the primary capsule layer, or lower-level layer, and the digit capsule layer, or higher-level layer. Primary capsules route to the digit capsule layer using dynamic routing. The primary capsule layer and the subsequent digit capsule layer receive as input the high-level feature representation output by the preceding Bi-GRU attention layer. The primary capsule layer produces capsules with vector outputs as given in equation (9), where m indexes a capsule in the lower-level primary capsule layer and n indexes a capsule in the higher-level layer.

${\hat{u}}_{n | m} = w_{mn} u_{m}$ (9)

where u_m represents the output of a capsule in the layer below, w_mn represents the weight matrix, and ${\hat{u}}_{n | m}$ represents the prediction vector.

Dynamic routing (Algorithm 1) updates the connection weights during network training, as given in equation (10).

$s_{n} = \sum_{m} c_{mn} {\hat{u}}_{n | m}$ (10)

where c_mn represents the coupling coefficient, and the total input to a capsule, s_n, is the weighted sum of all prediction vectors ${\hat{u}}_{n | m}$ from the capsules in the layer below. c_mn is determined by the dynamic routing procedure.

Algorithm 1

Dynamic Routing Algorithm

Input: ${\hat{u}}_{n | m}, z$
Output: v_n
1: for all primary capsule m and digit capsule n do
2:     b_mn ← 0
3: end for
4: for z iterations do
5:     for all primary capsule m and digit capsule n do
6:         c_mn ← softmax (b_mn)
7:     end for
8:     for all digit capsule n do
9:         s_n ← $\sum_{m} c_{mn} {\hat{u}}_{n | m}$
10:     end for
11:     for all digit capsule n do
12:         v_n ← squash (s_n)
13:     end for
14:     for all primary capsule m and digit capsule n do
15:         b_mn ← $b_{mn} + {\hat{u}}_{n | m} \cdot v_{n}$
16:     end for
17: end for
18: return v_n

As shown in Figure 7, a capsule’s major functions are: encoding significant textual features and their relationships by multiplying the input vectors by weight matrices, dynamic routing for transmitting output from low-level capsules to higher-level capsules, and summation of the weighted input vectors.

Fig. 7

Operations within a capsule.

Also, a capsule has a "squash" function that rescales a vector so that its length lies between 0 and 1 while preserving the vector’s direction, as given in equation (11).

$v_{n} = \frac{| | s_{n} | |^{2}}{1 + | | s_{n} | |^{2}} \frac{s_{n}}{| | s_{n} | |}$ (11)

where v_n is the vector output of capsule n and s_n is its total input.
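The routing procedure of Algorithm 1 together with the squash function of equation (11) can be sketched in NumPy as below. This is an illustrative sketch under assumptions: the softmax is taken over the higher-level capsules n for each primary capsule m, as in the original CapsNet formulation [31]; the prediction vectors are random; and the small `eps` term is added only to avoid division by zero.

```python
import numpy as np


def squash(s, eps=1e-9):
    """Eq. (11): shrink each vector to a length in [0, 1) and keep its direction."""
    sq_norm = np.sum(s ** 2, axis=-1, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)


def dynamic_routing(u_hat, z=3):
    """Algorithm 1. u_hat has shape (num_primary m, num_digit n, dim)."""
    if z < 1:
        raise ValueError("at least one routing iteration is required")
    num_primary, num_digit, _ = u_hat.shape
    b = np.zeros((num_primary, num_digit))                # routing logits b_mn
    for _ in range(z):
        # Coupling coefficients: softmax over digit capsules n (assumption, see lead-in).
        e = np.exp(b - b.max(axis=1, keepdims=True))
        c = e / e.sum(axis=1, keepdims=True)
        s = np.einsum("mn,mnd->nd", c, u_hat)             # Eq. (10), weighted sum
        v = squash(s)                                     # Eq. (11)
        b = b + np.einsum("mnd,nd->mn", u_hat, v)         # agreement update
    return v


rng = np.random.default_rng(0)
u_hat = rng.normal(size=(6, 2, 16))                       # 6 primary, 2 digit capsules, dim 16
v = dynamic_routing(u_hat, z=3)
print(v.shape, np.linalg.norm(v, axis=-1))
```

The length of each output vector can be read as the confidence that the corresponding higher-level entity is present.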

The output H from the self-attention layer is taken by the capsule layer as input. The capsule layer extracts prominent refined information from H, which is required for toxicity detection, and feeds it to the output layer.

Algorithm 2

Pseudocode for XRBi-GAC

Input: c = (c₁, c₂, ..., c_n), where c_i is an input comment
Output: y ∈ {0, 1}, where 0 is non-toxic and 1 is toxic
1: Preprocess the input comments by removing special symbols, URLs and emails.
2: Tokenize the preprocessed comments using XLMRobertaTokenizer and add the special tokens cls_token (<s>) and sep_token (</s>) at the beginning and end of the comment respectively, which converts each comment into the form <s>, tok_1, tok_2, ..., tok_n, </s>.
3: Tokens from step 2 are converted into an embedding sequence of the form e_<s>, e₁, e₂, ..., e_n, e_</s>.
4: The embedding sequence from step 3 is fed to the pre-trained XLM-RoBERTa and the last hidden state is captured as t₁, t₂, ..., t_n.
5: The time-distributed sequence output from step 4 is input to the Bi-GRU layer, which gives the forward hidden layer output $\vec{h} = (\vec{h_{1}}, \vec{h_{2}}, ..., \vec{h_{n}})$ and the backward hidden layer output $\overset{\leftarrow}{h} = (\overset{\leftarrow}{h_{1}}, \overset{\leftarrow}{h_{2}}, ..., \overset{\leftarrow}{h_{n}})$; both are combined to get the hidden layer output of the Bi-GRU, h = (h₁, h₂, ..., h_n).
6: The output from the Bi-GRU is received by the self-attention layer to further capture the context-related information of the comment; its output is H = (H₁, H₂, ..., H_n).
7: H is fed to the capsule network followed by a sigmoid layer to classify whether the comment is toxic.
8: return y

3.3.5 Output layer

The output layer uses a fully-connected layer with sigmoid activation as its final prediction layer to aid in estimating binary classification probabilities. The two classification categories are namely non-toxic and toxic.

The process of toxic text classification using the proposed XRBi-GAC framework is summarized in Algorithm 2.

4 Experimental evaluation

Here, we first discuss the evaluation metrics and experimental setup, followed by the baseline models used in this study.

4.1 Evaluation metrics

We use accuracy, macro-average precision, macro-average recall and macro-average F1-score as the evaluation metrics for the proposed work.

Accuracy measures the fraction of correct predictions. It is informative for balanced data but can be misleading for imbalanced data.

$Accuracy (A) = \frac{T_{n} + T_{p}}{T_{p} + T_{n} + F_{p} + F_{n}}$ (12)

where T_p are the True Positives, T_n are the True Negatives, F_n are the False Negatives, and F_p are the False Positives, all obtained from the confusion matrix.

Precision is the proportion of predicted positives that are truly positive, and recall is the proportion of actual positives that are correctly identified.

$Precision (P) = \frac{T_{p}}{F_{p} + T_{p}}$ (13)

$Recall (R) = \frac{T_{p}}{T_{p} + F_{n}}$ (14)

The F1-score is the harmonic mean of precision and recall. On imbalanced data, it reflects the performance of a model more faithfully than accuracy.

$F 1 = 2 * \frac{P * R}{P + R}$ (15)

Macro F1-score for the evaluation is computed using the arithmetic mean of all the per-class F1-scores.

$P_{mac} = \frac{1}{K} \sum_{i = 1}^{K} P_{i}$ (16)

$R_{mac} = \frac{1}{K} \sum_{i = 1}^{K} R_{i}$ (17)

$F 1_{mac} = \frac{1}{K} \sum_{i = 1}^{K} 2 * \frac{P_{i} * R_{i}}{P_{i} + R_{i}}$ (18)

where P_i and R_i are the precision and recall of category i, respectively. P_mac, R_mac and F1_mac are the macro precision, macro recall and macro F1-score, respectively. K is the number of categories, which is equal to two in this work.
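As a worked example of equations (12) to (18), the sketch below computes accuracy and the macro-averaged metrics in plain Python, following the statement that the macro F1-score is the arithmetic mean of the per-class F1-scores. The function names and the toy labels are illustrative assumptions.

```python
def per_class_counts(y_true, y_pred, cls):
    """True positives, false positives and false negatives when `cls` is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == cls and p == cls for t, p in pairs)
    fp = sum(t != cls and p == cls for t, p in pairs)
    fn = sum(t == cls and p != cls for t, p in pairs)
    return tp, fp, fn


def macro_scores(y_true, y_pred, classes=(0, 1)):
    precisions, recalls, f1s = [], [], []
    for cls in classes:
        tp, fp, fn = per_class_counts(y_true, y_pred, cls)
        p = tp / (tp + fp) if tp + fp else 0.0            # Eq. (13)
        r = tp / (tp + fn) if tp + fn else 0.0            # Eq. (14)
        f1 = 2 * p * r / (p + r) if p + r else 0.0        # Eq. (15)
        precisions.append(p)
        recalls.append(r)
        f1s.append(f1)
    k = len(classes)
    return {
        "accuracy": sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true),  # Eq. (12)
        "precision_macro": sum(precisions) / k,           # Eq. (16)
        "recall_macro": sum(recalls) / k,                 # Eq. (17)
        "f1_macro": sum(f1s) / k,                         # mean of per-class F1-scores
    }


y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]
scores = macro_scores(y_true, y_pred)
print(scores)
```

In this toy case the non-toxic class scores 0.75 and the toxic class 0.5 on each metric, so every macro average is 0.625, while accuracy is 4/6.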

The loss function used in this work is a fusion of two loss functions, namely, binary cross-entropy loss (L_b) and focal loss (L_f). L_b is given in the equation (19) and L_f is given in the equation (20).

$L_{b} = - \sum_{i = 1}^{n} y_{i} \log (p_{i})$ (19)

where y_i denotes the ground truth, p_i is the softmax probability for the i-th class, and n = 2.

$L_{f} = - \sum_{i = 1}^{n} \alpha_{i} (1 - p_{i})^{\gamma} \log (p_{i})$ (20)

where p_i is the estimated probability of the i-th class, and γ is a focusing parameter that controls the shape of the loss curve and directs the model’s attention to the rare class in case of class imbalance. α_i is the balancing factor.

Since detecting toxic comments in this study is a typical two-category problem, L_b is used. In addition, L_f addresses the class imbalance problem, so it is combined with L_b because the training data is highly imbalanced. The loss function is formulated as follows:

$L = \lambda_{1} L_{b} + \lambda_{2} L_{f}$ (21)

where λ₁ and λ₂ are the fusion weights.
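The fused loss of equation (21) can be sketched for a single sample in plain Python. Several points are assumptions rather than details from the paper: the focal term is evaluated at the true class, as in the usual focal loss formulation; the defaults α = 0.25 and γ = 2 are common choices, not reported values; and the weights 1 and 1.2 follow the 1:1.2 ratio given in the experimental setup.

```python
import math


def bce_loss(y, p, eps=1e-7):
    """Binary cross-entropy for label y in {0, 1} and predicted toxic probability p."""
    p = min(max(p, eps), 1 - eps)                 # clip to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def focal_loss(y, p, alpha=0.25, gamma=2.0, eps=1e-7):
    """Focal loss at the true class (assumed form; alpha and gamma are assumed defaults)."""
    p = min(max(p, eps), 1 - eps)
    p_t = p if y == 1 else 1 - p                  # probability assigned to the true class
    alpha_t = alpha if y == 1 else 1 - alpha      # class-dependent balancing factor
    return -alpha_t * (1 - p_t) ** gamma * math.log(p_t)


def fused_loss(y, p, lam1=1.0, lam2=1.2, alpha=0.25, gamma=2.0):
    """Eq. (21): weighted sum of binary cross-entropy and focal loss."""
    return lam1 * bce_loss(y, p) + lam2 * focal_loss(y, p, alpha=alpha, gamma=gamma)


easy = fused_loss(1, 0.9)   # toxic comment predicted confidently
hard = fused_loss(1, 0.1)   # toxic comment mostly missed
print(easy, hard)
```

The factor (1 - p_t)^γ shrinks the focal term for well-classified samples, so misclassified minority-class comments dominate the gradient.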

4.2 Experimental setup

The proposed work is implemented in Python using Scikit-Learn and Keras with TensorFlow as the backend. Experiments are conducted using Kaggle Kernel, a free platform for executing Jupyter notebooks with CPU (default), Nvidia K80 GPU, and TPU v3-8 accelerator. We use the XLM-R (base) transformer in our proposed work, which has 12 encoder layers, a hidden size of 768, 12 attention heads and 270M parameters. XLM-R is fine-tuned on the training data, and hyperparameters such as the maximum sequence length, learning rate, batch size and number of epochs are optimized using random search. The maximum sequence length is set to 256, and the batch size to 16. The dropout rate is set to 0.1 to reduce overfitting. Table 3 shows the hyperparameters used in the proposed model.

Table 3
Hyperparameter setting

Parameters	Value
Maximum sequence length	256
Word dimension	768
GRU unit dimension	92
Capsule vector dimension	16
Optimizer	Adam
Batch size	16
Learning rate	1e-5
Dropout rate	0.1
Activation function	Sigmoid
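The random search mentioned in the setup can be sketched as follows. This is only an illustration: the candidate values in `SEARCH_SPACE`, the number of trials and the `toy_evaluate` objective are hypothetical stand-ins, since the paper reports the selected values (Table 3) but not the search space. In a real run, `evaluate` would fine-tune the model with the sampled configuration and return a validation score.

```python
import random

# Hypothetical search space; the candidate values are illustrative only.
SEARCH_SPACE = {
    "max_seq_length": [128, 192, 256],
    "learning_rate": [1e-5, 2e-5, 3e-5],
    "batch_size": [16, 32],
    "epochs": [2, 3, 4],
}


def sample_config(rng):
    """Draw one configuration uniformly at random from the search space."""
    return {name: rng.choice(values) for name, values in SEARCH_SPACE.items()}


def random_search(evaluate, n_trials=10, seed=0):
    """Return the best (score, config) pair found over n_trials random draws."""
    rng = random.Random(seed)
    best_score, best_config = float("-inf"), None
    for _ in range(n_trials):
        config = sample_config(rng)
        score = evaluate(config)  # e.g. validation F1 after fine-tuning
        if score > best_score:
            best_score, best_config = score, config
    return best_score, best_config


def toy_evaluate(config):
    """Stand-in objective; a real run would fine-tune and return validation F1."""
    return (-abs(config["learning_rate"] - 1e-5)
            - abs(config["max_seq_length"] - 256) / 1000)


best_score, best_config = random_search(toy_evaluate, n_trials=20)
```

Fixing the seed makes the sampled configurations reproducible, which helps when trials are expensive and need to be resumed or compared.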

Early stopping is used as a regularization technique, and the learning rate is set to 1e-5. The sigmoid activation function is used because the task is binary classification. The loss function combines binary cross-entropy loss and focal loss in a 1:1.2 ratio.
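The early stopping rule can be illustrated with a small, framework-independent sketch. The class name, the patience of 2 epochs and the validation losses below are hypothetical; the paper does not state the stopping criterion that was used.

```python
class EarlyStopping:
    """Stop training when the monitored validation loss stops improving."""

    def __init__(self, patience=2, min_delta=0.0):
        self.patience = patience    # hypothetical value, not from the paper
        self.min_delta = min_delta  # minimum decrease that counts as progress
        self.best = float("inf")
        self.bad_epochs = 0

    def should_stop(self, val_loss):
        """Record one epoch's validation loss; return True once patience runs out."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Made-up validation losses: improvement stalls after the third epoch.
val_losses = [0.42, 0.35, 0.33, 0.34, 0.36, 0.31]
stopper = EarlyStopping(patience=2)
stopped_at = None
for epoch, loss in enumerate(val_losses, start=1):
    if stopper.should_stop(loss):
        stopped_at = epoch
        break
```

With these numbers, training halts at epoch 5, after two consecutive epochs without improvement over the best loss of 0.33. In Keras, the same behaviour is available through the built-in `EarlyStopping` callback.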

4.3 Baseline models

We use four baseline models: Logistic Regression (LR) [32] with tf-idf text representation, CNN [33], Bi-LSTM [34] with fastText word embeddings, and Bi-GRU with fastText word embeddings. LR is a traditional supervised machine learning model that requires manual feature engineering. Bi-LSTM and Bi-GRU are RNN-based deep learning (DL) models used with fastText word embeddings. CNN is another DL model used as a baseline.

5 Results and discussion

We first analyze the results of the proposed XRBi-GAC model on the Jigsaw Multilingual Toxic Comment dataset and compare it with recent state-of-the-art (SOTA) methods. The results on this dataset are shown in Table 4. The proposed XRBi-GAC model achieves the best F1-score, improving on the SOTA multilingual models. The first four rows of Table 4 show the performance of the baseline models. LR+tf-idf performs worst, with an F1-score of 0.739 and an accuracy of 0.814. CNN achieves better accuracy than Bi-GRU+fastText but a lower F1 than both Bi-LSTM+fastText and Bi-GRU+fastText. Bi-GRU+fastText performs best among the baseline models with an F1 of 0.808, followed by Bi-LSTM+fastText with a slightly lower F1 of 0.793.

Table 4
Experimental results on the Jigsaw Multilingual Toxic Comment dataset

Model	Accuracy	Precision	Recall	F1
Logistic Regression+tf-idf	0.814	0.730	0.748	0.739
CNN	0.858	0.738	0.842	0.786
Bi-LSTM+fastText	0.887	0.774	0.813	0.793
Bi-GRU+fastText	0.856	0.815	0.801	0.808
BERT	0.893	0.785	0.837	0.811
RoBERTa	0.872	0.862	0.773	0.820
mBERT (SOTA)	0.905	0.791	0.882	0.834
XLM-R (SOTA)	0.928	0.825	0.867	0.845
XRBi-GAC	0.925	0.851	0.879	0.865

The bottom five rows of Table 4 show the performance of the base versions of BERT and RoBERTa, the multilingual SOTA models mBERT and XLM-RoBERTa, and the proposed XRBi-GAC model. BERT achieves an F1 of 0.811, a 0.37% relative improvement over Bi-GRU+fastText, and RoBERTa achieves an F1 of 0.820, improving on BERT. mBERT and XLM-RoBERTa, both SOTA models for multilingual text data, attain F1-scores of 0.834 and 0.845, respectively, outperforming BERT and RoBERTa. The F1-score of mBERT is 2.8% and 1.7% higher than that of BERT and RoBERTa, respectively. XLM-R improves on BERT and RoBERTa by 4.2% and 3% in F1-score, respectively. XLM-R also outperforms mBERT, by 1.3% in F1-score and 2.5% in accuracy. The proposed XRBi-GAC model attains the highest F1-score of all models, 0.865, with a good balance between precision and recall. Compared with the multilingual SOTA models, XRBi-GAC improves the F1-score by 3.7% over mBERT and by 2.3% over XLM-R. RoBERTa achieves the highest precision of 0.862, but with a precision-recall gap of 8.9 percentage points, and mBERT achieves the highest recall of 0.882, with the largest gap among the transformer-based models, 9.1 percentage points. XLM-R achieves the highest accuracy, 92.8%.
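The percentages quoted above are consistent with relative improvements in F1, while the precision-recall gaps are absolute differences. The short sketch below reproduces that arithmetic from values copied from Table 4. The helper names are illustrative, and results computed from the rounded table entries may differ from the reported figures in the last digit.

```python
# F1, precision and recall values copied from Table 4.
f1 = {"Bi-GRU+fastText": 0.808, "BERT": 0.811, "RoBERTa": 0.820,
      "mBERT": 0.834, "XLM-R": 0.845, "XRBi-GAC": 0.865}
precision = {"RoBERTa": 0.862, "mBERT": 0.791, "XRBi-GAC": 0.851}
recall = {"RoBERTa": 0.773, "mBERT": 0.882, "XRBi-GAC": 0.879}


def relative_improvement(new, old):
    """Relative gain of `new` over `old`, in percent."""
    return 100.0 * (new - old) / old


def pr_gap(model):
    """Absolute precision-recall gap, in percentage points."""
    return 100.0 * abs(precision[model] - recall[model])


gain_over_mbert = relative_improvement(f1["XRBi-GAC"], f1["mBERT"])
gain_over_xlmr = relative_improvement(f1["XRBi-GAC"], f1["XLM-R"])
roberta_gap = pr_gap("RoBERTa")
```

For example, `gain_over_mbert` evaluates to (0.865 − 0.834) / 0.834 ≈ 3.7%, and `roberta_gap` to 0.862 − 0.773 = 8.9 percentage points.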

To check its versatility and robustness, we also evaluated the proposed XRBi-GAC model on the HASOC 2019 dataset. The model is trained on English-only data and assessed on a combined dataset containing tweets in English, Hindi and German, a task similar to the one on the Jigsaw Multilingual Toxic Comment dataset.

We evaluated Bi-GRU+fastText, mBERT, XLM-R and the proposed XRBi-GAC model on the HASOC 2019 dataset; their performance is depicted in Figure 8. XRBi-GAC achieves an F1 of 0.829, outperforming all the other models. XLM-R performs second best with an F1-score of 0.801, followed by mBERT with 0.786. Bi-GRU+fastText has the lowest F1-score, 0.765. In terms of F1-score, XRBi-GAC shows an improvement of 3.5% over XLM-R and 5.4% over mBERT. The proposed framework also maintains a balance between precision and recall, with a gap of 1.9%, the smallest among the models evaluated on this dataset, and achieves the highest accuracy of 86.1%. Thus, XRBi-GAC outperforms both XLM-R and mBERT on this dataset.

Fig. 8

Performances of models on HASOC 2019 dataset.

The proposed method captures the semantics of multilingual text better than the existing methods. The Bi-GAC (Bi-GRU with Self-attention Capsule Network) module, built on top of the pre-trained XLM-R, efficiently extracts the information in multilingual text that is crucial for toxicity classification. The overall analysis of the results shows that XRBi-GAC outperforms the multilingual toxic text detection methods it was compared with.

6 Conclusion and future work

In this paper, we have presented XRBi-GAC, a deep learning framework for multilingual toxic text detection. The framework feeds word representations from the pre-trained XLM-R to a Bi-GRU with a self-attention mechanism, which helps capture finer semantics and crucial information. A capsule network is then integrated to further refine the learned semantic information. The loss function fuses binary cross-entropy loss and focal loss, which helps deal with the imbalanced data to an extent, as shown in the experimental results. We assessed the proposed model on two publicly available datasets: the Jigsaw Multilingual Toxic Comment dataset and the HASOC 2019 dataset. On both, the XRBi-GAC framework performs better than the multilingual SOTA models mBERT and XLM-R, reaching F1-scores of 0.865 and 0.829, respectively. However, the proposed framework only identifies whether multilingual text is toxic and does not estimate the intensity of toxicity in a comment. Future work on toxicity detection should therefore address the intensity of toxicity in multilingual text and improve performance on low-resource languages.

References

1. https://www.smartinsights.com/social-mediamarketing/social-media-strategy/new-global-social-mediaresearch. Accessed: 2022-09-25.

2. Conneau, Khandelwal, Goyal, Chaudhary, Wenzek, Guzman, Grave, Ott, Zettlemoyer and Stoyanov, “Unsupervised cross-lingual representation learning at scale,” arXiv preprint arXiv:1911.02116, 2019.

3. Zhao, Zhou and Mao, “Automatic detection of cyberbullying on social networks based on bullying features,” in Proceedings of the 17th international conference on distributed computing and networking, pp. 1–6, 2016.

4. Pranckevicius and Marcinkevičius, “Application of logistic regression with part-of-the-speech tagging for multi-class text classification,” in 2016 IEEE 4th workshop on advances in information, electronic and electrical engineering (AIEEE), pp. 1–5, IEEE, 2016.

5. Gaydhani, Doma, Kendre and Bhagwat, “Detecting hate speech and offensive language on twitter using machine learning: An n-gram and tfidf based approach,” arXiv preprint arXiv:1809.08651, 2018.

6. Neogi A.S., Garg K.A., Mishra R.K. and Dwivedi Y.K., “Sentiment analysis and classification of indian farmers’ protest using twitter data,” International Journal of Information Management Data Insights 1(2) (2021), 100019.

7. Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez A.N., Kaiser Ł. and Polosukhin, “Attention is all you need,” Advances in Neural Information Processing Systems 30 (2017).

8. Devlin, Chang M.-W., Lee and Toutanova, “Bert: Pretraining of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.

9. Sanh, Debut, Chaumond and Wolf, “Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter,” arXiv preprint arXiv:1910.01108, 2019.

10. Liu, Ott, Goyal, Du, Joshi, Chen, Levy, Zettlemoyer M. L. and Stoyanov, “Roberta: A robustly optimized bert pretraining approach,” arXiv preprint arXiv:1907.11692, 2019.

11. Lan, Chen, Goodman, Gimpel, Sharma and Soricut, “Albert: A lite bert for self-supervised learning of language representations,” arXiv preprint arXiv:1909.11942, 2019.

12. Gao and Huang, “Detecting online hate speech using context aware models,” arXiv preprint arXiv:1710.07395, 2017.

13. Singh and Chand, “Pardeep at semeval-2019 task 6: Identifying and categorizing offensive language in social media using deep learning,” in Proceedings of the 13th International Workshop on Semantic Evaluation, pp. 727–734, 2019.

14. Ibrahim, Torki and El-Makky, “Imbalanced toxic comments classification using data augmentation and deep learning,” in 2018 17th IEEE international conference on machine learning and applications (ICMLA), pp. 875–878, IEEE, 2018.

15. Van Aken, Risch, Krestel and Loser, “Challenges for toxic comment classification: An in-depth error analysis,” arXiv preprint arXiv:1809.07572, 2018.

16. Ranasinghe, Zampieri and Hettiarachchi, “Brums at hasoc 2019: Deep learning models for multilingual hate speech and offensive language identification,” in FIRE (Working Notes), pp. 199–207, 2019.

17. Lample and Conneau, “Cross-lingual language model pretraining,” arXiv preprint arXiv:1901.07291, 2019.

18. Dong X.L. and de Melo, “A robust self-learning framework for cross-lingual text classification,” in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 6306–6310, 2019.

19. Wang, Liu, Ouyang and Sun, “Galileo at semeval-2020 task 12: Multi-lingual learning for offensive language identification using pre-trained language models,” arXiv preprint arXiv:2010.03542, 2020.

20. Sai and Sharma, “Siva@ hasoc-dravidian-codemixfire-2020: Multilingual offensive speech detection in codemixed and romanized text,” in FIRE (Working Notes), pp. 336–343, 2020.

21. Pant and Dadu, “Cross-lingual inductive transfer to detect offensive language,” arXiv preprint arXiv:2007.03771, 2020.

22. Malik, Aggrawal and Vishwakarma D.K., “Toxic speech detection using traditional machine learning models and bert and fasttext embedding with deep neural networks,” in 2021 5th International Conference on Computing Methodologies and Communication (ICCMC), pp. 1254–1259, IEEE, 2021.

23. Song, Huang and Xiao, “A study of multilingual toxic text detection approaches under imbalanced sample distribution,” Information 12(5) (2021), 205.

24. Jhaveri, Ramaiya and Chadha H.S., “Toxicity detection for indic multilingual social media content,” arXiv preprint arXiv:2201.00598, 2022.

25. Lashkarashvili and Tsintsadze, “Toxicity detection in online georgian discussions,” International Journal of Information Management Data Insights 2(1) (2022), 100062.

26. Mandl, Modha, Majumder, Patel, Dave, Mandlia and Patel, “Overview of the hasoc track at fire 2019: Hate speech and offensive content identification in indo-european languages,” in Proceedings of the 11th forum for information retrieval evaluation, pp. 14–17, 2019.

27. Rumelhart D.E., Hinton G.E. and Williams R.J., “Learning representations by back-propagating errors,” Nature 323(6088) (1986), 533–536.

28. Hochreiter and Schmidhuber, “Long short-term memory,” Neural Computation 9(8) (1997), 1735–1780.

29. Chung, Gulcehre, Cho and Bengio, “Empirical evaluation of gated recurrent neural networks on sequence modeling,” arXiv preprint arXiv:1412.3555, 2014.

30. Schuster and Paliwal K.K., “Bidirectional recurrent neural networks,” IEEE Transactions on Signal Processing 45(11) (1997), 2673–2681.

31. Sabour, Frosst and Hinton G.E., “Dynamic routing between capsules,” Advances in Neural Information Processing Systems 30 (2017).

32. Saif M.A., Medvedev A.N., Medvedev M.A. and Atanasova, “Classification of online toxic comments using the logistic regression and neural networks models,” in AIP conference proceedings 2048 (2018), 060011, AIP Publishing LLC.

33. Dhamija, Katarya, et al., “Comparative analysis of machine learning and deep learning algorithms for detection of online hate speech,” in Advances in Mechanical Engineering, pp. 509–520, Springer, 2021.

34. Ghosh, Kumar, Lepcha and Jain S.S., “Toxic text classification,” in Data Science and Security, pp. 251–260, Springer, 2021.