Abstract
Social media platforms allow people across the globe to share their thoughts and opinions and to communicate conveniently with each other. Despite these advantages, social media is also misused by some users for hate-mongering through toxic and offensive comments. Most earlier toxicity detection methods focus on the English language, and there is a lack of research on low-resource languages and multilingual text data. We propose the XRBi-GAC framework, comprising XLM-RoBERTa, Bi-GRU with self-attention, and capsule networks, for multilingual toxic text detection. We also present a loss function that fuses the binary cross-entropy loss and the focal loss to address the class imbalance problem. We evaluated the proposed framework on two datasets, namely the Jigsaw Multilingual Toxic Comment dataset and the HASOC 2019 dataset, and achieved F1-scores of 0.865 and 0.829, respectively. The experimental results show that the proposed framework outperforms the state-of-the-art multilingual models XLM-RoBERTa and mBERT on both datasets, which demonstrates the versatility and robustness of the proposed XRBi-GAC framework.
Introduction
Online social media platforms are undoubtedly among the most significant technological advancements of the 21st century and have had a tremendous cultural impact on people. In this era of social computing, interpersonal communication is intensifying, mainly through social media platforms and chat forums. Microblogging platforms allow individuals worldwide to express and share their opinions instantly and widely in various languages. The advantages of these platforms include the extensive diffusion of content across geographic borders and the facilitation of interactions in numerous languages, largely unrestricted by physical barriers other than infrastructure. On the other hand, social media has also turned into a platform for hate-mongering and toxic language, since users feel shielded by their virtual identities. Users may be intimidated on the basis of their political ideas, religious convictions, ethnicity, or colour. Such rhetoric cannot be tolerated on social media, since it may lead to anxiety, depression, suicide, and societal disorder. Given the massive volume of information created each day on social media in different languages, it is impossible for humans to manually detect toxic texts containing profanity and vulgarity. Therefore, there is a need to develop a model to detect toxicity in multilingual text. As of mid-2022, approximately 4.70 billion people use social media around the world, which is 59 per cent of the world's population. The number of social media users has continued to increase, with 227 million new users joining over the past year. That is an average of more than seven new users every second, which works out to a growth rate of 5.1% per year [1]. At the beginning of 2021, Google Search supported 149 languages, while Bing supported 40. Twitter currently supports 45 languages, Facebook supports over 100, and Instagram supports 90.
Most research on text has focused on English, which is resource-rich, while much less attention has been given to other languages. This disparity is mainly attributable to the lower quality, quantity, or availability of training datasets and corpora for other languages. Progress for other languages is frequently impeded by a lack of resources or by restricted access to them. Therefore, this study aims to develop a deep learning framework that addresses this multilingualism issue for toxic text detection. In our proposed model, we leverage the pre-trained XLM-RoBERTa [2] model introduced by Facebook as a base because it is trained on an enormous multilingual corpus (2.5 TB of CommonCrawl data covering about 100 languages). It improves cross-lingual transfer and is one of the strongest pre-trained models for classification. The proposed XLM-RoBERTa Bidirectional GRU Attention Capsule Network (XRBi-GAC) model exploits cross-lingual transfer learning for toxicity detection and performs better than the state-of-the-art models on multilingual text.
Researchers have worked on toxic comment detection using conventional machine learning models, which require manual feature engineering with features such as n-grams, tf-idf, bag-of-words, and part-of-speech (POS) tags [3–6]. These feature-based learning models performed satisfactorily in numerous text classification tasks. However, the reliability of text features depends heavily on the domain knowledge of the model developer, which is typically hard to obtain. Deep learning models, on the other hand, extract features from text data automatically, without manual feature engineering, and perform efficiently. In recent years, the Transformer architecture [7] gave rise to the pre-trained model BERT (Bidirectional Encoder Representations from Transformers) [8], and the era of transfer learning began in natural language processing, which is crucial for tasks such as translation, summarization, classification, and chatbots. Many models build on BERT, such as DistilBERT [9], RoBERTa [10], and ALBERT [11]. Recently, the focus has shifted to multilingual text with advances in multilingual language models such as multilingual BERT (mBERT) and XLM-RoBERTa, which are trained on enormous datasets in about 100 languages and also show promising results in cross-lingual transfer. Identifying toxicity in multilingual text is a challenging task. First, language differences create a barrier that is difficult to bridge for models developed in a monolingual environment. Second, in less commonly used languages there is often a paucity of annotated training data. The datasets used in this study suffer from such resource limitations.
The following are the primary contributions of this work: We develop a model named XRBi-GAC for toxicity detection in multilingual text data. It leverages the word representations from the pre-trained layers of XLM-RoBERTa, which are fed to the bidirectional gated recurrent unit for capturing better connotations from both directions. A self-attention mechanism is incorporated into Bi-GRU to capture vital information, followed by a capsule network to get a more refined semantic representation using a dynamic routing approach, which is then used to classify the comments. We assess the XRBi-GAC on two publicly available benchmark datasets: the Jigsaw Multilingual Toxic Comment dataset and HASOC 2019 dataset. The training set of each dataset contains data only in the English language, and the model is evaluated on multilingual text data. So, the cross-lingual learning approach is utilized in the model, which is a type of transductive transfer learning. We use the fusion of binary cross-entropy loss and focal loss to address the imbalance problem in the dataset.
Related work
The ubiquity of toxicity on social media has increased the importance of toxicity detection research in recent years. Prior efforts on toxicity identification used machine learning-based classifiers, but the focus has now shifted towards deep learning methods. In [12], an annotated corpus of hate speech with context information is introduced. A Logistic Regression (LR) model with context features and a Long Short-Term Memory (LSTM) model with learning components for context are discussed for automatic hate speech identification [13]. In [14], a hybrid model consisting of a Convolutional Neural Network (CNN), Bidirectional LSTM (Bi-LSTM) and Bidirectional GRU (Bi-GRU) is discussed, which uses data from Wikipedia talk pages to detect several types of toxicity generated on online platforms. An efficient augmentation method is also introduced, which integrates unique words and random masking to strengthen the proposed model. In [15], the performance of different models, including Logistic Regression, LSTM, Recurrent Neural Network (RNN), CNN, Bi-LSTM, and GRU with attention and word embedding techniques, is compared for toxicity identification. Among all the classifiers, a Bidirectional GRU network with an attention layer works best. Earlier, most research was conducted on the English language, but in recent years studies have also been conducted on other languages. In [16], different neural network architectures are evaluated for identifying offensive language in German, English, and Hindi. The results show that fine-tuning the BERT framework surpasses all other methods. In [17], two methods are discussed for cross-lingual language models: an unsupervised method that relies only on monolingual data, and a supervised method that employs parallel data with a new cross-lingual language model objective.
The focus of research has now shifted from monolingual to multilingual models, because multilingual models are more robust and versatile and can handle a variety of languages. In contrast, the performance of monolingual models is confined to a single language.
Table 1: Relevant work
In [18], the authors discuss a robust self-learning framework that combines the predictions of mBERT on unlabeled non-English data. This information is then used to fine-tune pre-trained multilingual representation models to excel in Multilingual Document Classification (MLDoc). In [19], a multilingual method using the pre-trained language models XLM-R and ERNIE is discussed. The authors propose a knowledge distillation method that is trained on soft labels generated by several supervised models. The results show that BERT-large performed better than BERT-base.
A method for classifying offensive speech in code-mixed and romanized Dravidian languages using selective translation and transliteration is discussed in [20]. For better results, fine-tuning and ensembling of XLM-RoBERTa and mBERT are introduced. In [21], a method for identifying offensive language in tweets in five different languages using cross-lingual inductive transfer learning is discussed. The authors use an ensemble of XLM-R (base) and XLM-R (large) cross-lingual embeddings. The top 10 closely related teams adopted BERT, RoBERTa, XLM-RoBERTa, CNN and LSTM models for hybridization.
In [22], authors performed preprocessing of data using different machine learning and ensemble algorithms in which LR and XGBoost performed best. They also explored various word embedding techniques, and the results are fed to DNN classifiers. The combination of CNN and BERT outperformed all other methods. In [23], a fusion-based method for detecting toxicity in multilingual text in uneven sample distribution is discussed. The authors used mBERT and XLM-RoBERTa for pretraining, which incorporate vital information in their models. The authors have used six base models and then combined them to produce three fusion models.
In [24], a model is discussed that flags toxicity and provides users with a safer platform by combining the XLM-RoBERTa and MuRIL frameworks. To tackle code-mixed classification challenges, the authors emphasise the use of multilingual transformer-based pre-trained and fine-tuned models. In [25], authors take comments from an online arena for public discourse for Georgian toxicity classification. They employed a novel NCP (Neural Circuit Policies) technique for toxic comment classification, which showed satisfactory results.
We discuss the proposed XRBi-GAC model for toxicity identification in multilingual text. It comprises three parts: Datasets, Data Preprocessing, and XLM-RoBERTa Bi-GRU Attention Capsule Network (XRBi-GAC), as discussed below.
Dataset
Here, we will discuss two datasets, namely, the Jigsaw Multilingual Toxic Comment dataset and HASOC 2019 dataset, as they will be used for experimental purposes.
Jigsaw multilingual toxic comment dataset
This dataset is taken from Kaggle and contains comments collected from Civil Comments and Wikipedia. The training set contains 223,549 English-only comments, each with a binary toxic label (0 or 1), of which 21,384 samples (about 9.6%) are labelled as toxic. The validation set contains 8,000 labelled samples in three languages, of which 1,230 (15.4%) are labelled as toxic, as shown in Figure 1.

Figure 1: Class distribution of the training dataset (left) and validation dataset (right).
There are 3000 Turkish (tr) samples, 2500 Italian (it) samples, and 2500 Spanish (es) samples. The language distribution with class in the validation set is shown in Figure 2.

Figure 2: Language distribution with class.
There are a total of 63,812 unlabeled samples in the test set, split between six different languages: Spanish (es) has 8,438 comments, French (fr) has 10,920 comments, Italian (it) has 8,494 comments, Portuguese (pt) has 11,012 comments, Russian (ru) has 10,948 comments, and Turkish (tr) has 14,000 comments.
The training data is skewed, with almost 90% of the comments labelled as non-toxic. Consequently, a metric such as accuracy does not reflect how well the minority class is detected. To tackle this challenge, resampling is employed to upsample the minority class, i.e., the toxic class. The F1-score is used because it balances precision and recall and improves only if the classifier correctly identifies more instances of a given class.
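The upsampling step can be illustrated with a minimal sketch. The paper does not state the resampling ratio or procedure, so the code below assumes plain random oversampling with replacement until both classes are the same size; the function name `upsample_minority` and the fixed seed are hypothetical choices made for illustration only.

```python
import random


def upsample_minority(texts, labels, minority_label=1, seed=42):
    """Randomly duplicate minority-class samples until both classes match in size.

    Assumption: full balancing via sampling with replacement; the ratio actually
    used in the experiments is not specified in the text.
    """
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == minority_label]
    majority = [i for i, y in enumerate(labels) if y != minority_label]
    if not minority or len(minority) >= len(majority):
        return list(texts), list(labels)
    # Draw extra minority indices with replacement to close the gap.
    extra = rng.choices(minority, k=len(majority) - len(minority))
    order = majority + minority + extra
    rng.shuffle(order)
    return [texts[i] for i in order], [labels[i] for i in order]


texts = ["a", "b", "c", "d", "e"]
labels = [0, 0, 0, 0, 1]
balanced_texts, balanced_labels = upsample_minority(texts, labels)
```

Only the training split should be resampled in this way; resampling the validation or test data would distort the reported metrics.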
The HASOC 2019 dataset [26] is taken from the HASOC (Hate Speech and Offensive Content Identification in Indo-European Languages) shared task. It contains Twitter and Facebook posts in English, Hindi and German, provided separately. The posts are to be classified into two classes: NOT, for posts that do not contain hateful content, and HOF, for posts that contain hateful and offensive content. We use the English data, containing 5,852 posts, for training. The test data is formed by combining the Hindi test data (1,318 posts), the English test data (1,153 posts), and the German test data (850 posts), giving 3,321 posts in total. This classification task is similar to that of the Jigsaw Multilingual Toxic Comment dataset, which is why this dataset is chosen to assess the versatility of the proposed model.
Data preprocessing
Data preprocessing is a critical step that takes the raw comments as input and converts them into a form that preserves their semantics, context and inherent linguistic information as far as possible. In this work, preprocessing converts each comment to lower case and removes special characters, URLs, and email addresses; stop words and punctuation are preserved, as they may carry crucial information. The proposed method takes the preprocessed comments as input. Table 2 shows an instance of comment preprocessing.
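A minimal sketch of such a preprocessing function is given below. The exact rules used in the experiments are not listed in the text, so the regular expressions, and in particular the set of symbols treated as "special characters", are assumptions made for illustration. A blacklist of symbols is used rather than a whitelist of word characters so that letters and combining marks in non-Latin scripts are left untouched.

```python
import re

# Assumed patterns; the paper does not specify the exact rules.
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
# Hypothetical set of "special characters"; ordinary punctuation is kept.
SPECIAL_RE = re.compile(r"[#@$%^&*+=<>\[\]{}|\\/~`]")
SPACE_RE = re.compile(r"\s+")


def preprocess(comment: str) -> str:
    """Lower-case a comment and strip URLs, emails and special characters."""
    text = comment.lower()
    text = URL_RE.sub(" ", text)
    text = EMAIL_RE.sub(" ", text)
    text = SPECIAL_RE.sub(" ", text)
    # Collapse the whitespace left behind by the removals.
    return SPACE_RE.sub(" ", text).strip()


cleaned = preprocess("Check THIS out: https://example.com/x?y=1 you are #1 @@ idiot!!")
```

Stop words and punctuation such as `!` and `?` pass through unchanged, in line with the description above.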
Table 2: An instance of preprocessing
The proposed XRBi-GAC model, as shown in Figure 3, mainly comprises three components. The first one is the XLM-RoBERTa base model, which is fine-tuned on the training data and generates the contextual encodings of the text. The second component is the Bidirectional GRU, followed by a self-attention mechanism, which helps in capturing finer semantics and important information. The last component is a capsule network, which enhances the learned semantic information and understands it in a more refined manner. Each component is explained as follows.

Figure 3: The proposed XRBi-GAC framework.
XLM-RoBERTa (XLM-R) is a multilingual masked language model built on the Transformer architecture and pre-trained on 2.5 TB of CommonCrawl data covering about 100 languages. XLM-R has achieved state-of-the-art performance in sequence labelling, cross-lingual classification and question answering, outperforming other contemporary multilingual models such as mBERT and XLM as well as various monolingual models. Pre-trained XLM-RoBERTa, as illustrated in Figure 4, is fine-tuned on the labelled English training data, starting from the pre-trained weights, so that the task-specific knowledge learned in English can be applied to text in multiple languages. This transfer of learning from one language to another is called cross-lingual transfer. After preprocessing, the raw input is tokenized using XLMRobertaTokenizer, which uses a SentencePiece model to tokenize the text. The tokens are then converted into embeddings that incorporate both the embedding vector and the positional encoding of each token in the sequence. The special tokens cls_token and sep_token are placed at the beginning and end of the sequence, respectively; sep_token is also used to separate two sentences in a sequence. The embeddings are then fed to the encoder layers. XLM-R has two variants, XLM-R (base) and XLM-R (large), with 12 and 24 encoder layers, respectively. Each encoder layer performs multi-head self-attention on its input and feeds its output to the next encoder layer. The last hidden state obtained from the pre-trained XLM-R contains C, which captures the context of the entire sequence, and the time-distributed sequence $t_1, t_2, \ldots, t_n$, which is used in the proposed method.

Figure 4: Fine-tuning of pre-trained XLM-RoBERTa.
The RNN [27] model is widely used in NLP, but it suffers from the vanishing gradient problem, in which the network fails to retain information from early positions in the sequence. Retaining this information is critical in text classification because some terms depend on words that appear early in the sequence. Both LSTM [28] and GRU [29] address the vanishing gradient problem. Compared with LSTM, GRU uses fewer training parameters and less memory and requires less data to generalise, although LSTM works better on datasets with longer sequences. GRU is therefore a suitable choice for encoding social media posts, which are often short. The GRU cell structure is depicted in Figure 5. It has two gates: the update gate and the reset gate. The update gate controls how much prior information is carried forward, while the reset gate governs how much prior information is forgotten. The GRU cell state computation for time t is as follows:
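In one common formulation of the GRU [29], the update gate $z_t$, reset gate $r_t$, candidate state $\tilde{h}_t$ and hidden state $h_t$ can be written as below. The parameter names $W$, $U$ and $b$ are illustrative assumptions, and some references swap the roles of $z_t$ and $1 - z_t$ in the last line; $x_t$ is the input at time $t$, $\sigma$ is the logistic sigmoid and $\odot$ is the element-wise product.

```latex
\begin{aligned}
z_t &= \sigma\left(W_z x_t + U_z h_{t-1} + b_z\right) \\
r_t &= \sigma\left(W_r x_t + U_r h_{t-1} + b_r\right) \\
\tilde{h}_t &= \tanh\left(W_h x_t + U_h \left(r_t \odot h_{t-1}\right) + b_h\right) \\
h_t &= \left(1 - z_t\right) \odot h_{t-1} + z_t \odot \tilde{h}_t
\end{aligned}
```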

Figure 5: GRU cell architecture.
A bidirectional [30] GRU is composed of forward and backward GRU units. It takes the output sequence from XLM-RoBERTa as input and encodes the information as internal state vectors. To represent the current state accurately, a Bi-GRU gathers information from both earlier and later time steps, as given in equations (5) and (6).
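A typical way to write the forward and backward passes is sketched below, assuming the XLM-RoBERTa outputs $t_1, \ldots, t_n$ as inputs. Concatenating the two directions into $h_i$ is the usual convention and is stated here as an assumption rather than as the paper's exact formulation.

```latex
\begin{aligned}
\overrightarrow{h}_i &= \mathrm{GRU}\left(t_i, \overrightarrow{h}_{i-1}\right) \\
\overleftarrow{h}_i &= \mathrm{GRU}\left(t_i, \overleftarrow{h}_{i+1}\right) \\
h_i &= \left[\overrightarrow{h}_i ; \overleftarrow{h}_i\right], \qquad i = 1, \ldots, n
\end{aligned}
```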
The attention mechanism enables the output to focus on the input while generating output. In contrast, inputs can interact with each other in a self-attention paradigm, which figures out the attention of all other inputs relative to a single input. Self-attention [7] is effective in modelling sequence segment dependencies. It uses information about location and observation value, and instead of conditioning on the whole sequence, it uses pairwise comparisons that are shown by vectors for both.
The self-attention mechanism, as illustrated in Figure 6, is viewed as a mapping from a query to a set of key-value pairs. Each vector has three representations, namely the query (Q), key (K), and value (V) vectors. When a word vector $x_n$ attends to the K-V pairs of all word vectors, including its own, its query Q is used; K and V are trained to provide the attention values. Attention is defined as "Scaled Dot-Product Attention", which is given by the following equation in terms of Q, K, and V:
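Scaled dot-product attention as defined in [7] is $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(QK^{T} / \sqrt{d_k}\right) V$, where $d_k$ is the key dimension. The NumPy sketch below illustrates this computation for a single sequence; the projection matrices `W_q`, `W_k`, `W_v`, the toy dimensions and the random inputs are hypothetical and stand in for learned parameters and Bi-GRU outputs.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)


def scaled_dot_product_attention(Q, K, V):
    """Return the attended values and the attention weights."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # pairwise query-key similarities
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights


def self_attention(X, W_q, W_k, W_v):
    """Self-attention: Q, K and V are all projections of the same input X."""
    return scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v)


rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))              # 5 time steps, 8 features (toy sizes)
W_q, W_k, W_v = (rng.normal(size=(8, 4)) for _ in range(3))
H, A = self_attention(X, W_q, W_k, W_v)
```

Row `A[i]` holds the attention that position `i` pays to every position in the sequence, including itself.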

Figure 6: Self-attention mechanism.
A capsule is a group of neurons whose outputs indicate distinct characteristics of the same entity. The activity vector of a capsule, which describes the instantiation parameters of a certain type of entity, has all the relevant information. CapsNet [31] is a capsule network that substitutes vector output capsules for scalar output feature detectors and max-pooling with routing-by-agreement. The CapsNet comprises two layers: the primary capsule layer, or lower level layer and the digit capsule layer or higher level layer. Primary capsules route to the digit capsule layer using dynamic routing. The primary and subsequent digit caps layers receive the high-level feature representation that the preceding Bi-GRU attention layer outputs as input. The primary caps layers produce capsules with vector outputs as given in equation (9), where ‘m’ is in the current lower level primary caps, and ‘n’ is in the higher level layer.
Dynamic routing (Algorithm 1) updates the connection weights during network training, as given in equation (10).
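Algorithm 1: Dynamic Routing Algorithm
1: for all primary capsule m and digit capsule n
2:     $b_{mn} \leftarrow 0$
3: end for
4: for r iterations do
5:     for all primary capsule m
6:         $c_{mn} \leftarrow \mathrm{softmax}(b_{mn})$
7:     end for
8:     for all digit capsule n
9:         $s_n \leftarrow \sum_m c_{mn} \hat{u}_{n|m}$
10:     end for
11:     for all digit capsule n
12:         $v_n \leftarrow \mathrm{squash}(s_n)$
13:     end for
14:     for all primary capsule m and digit capsule n
15:         $b_{mn} \leftarrow b_{mn} + \hat{u}_{n|m} \cdot v_n$
16:     end for
17: end for
18: return $v_n$
Here $\hat{u}_{n|m}$ denotes the prediction vector of primary capsule m for digit capsule n, following the routing-by-agreement procedure of [31].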
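The routing-by-agreement procedure of [31] can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions: the prediction vectors `u_hat` are taken as given (in the full model they come from multiplying primary-capsule outputs by learned weight matrices), the number of routing iterations defaults to 3 as in [31], and the toy shapes are hypothetical.

```python
import numpy as np


def squash(s, axis=-1, eps=1e-9):
    """Shrink vector length into [0, 1) while keeping its direction."""
    sq_norm = np.sum(s * s, axis=axis, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)


def softmax(x, axis):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)


def dynamic_routing(u_hat, num_iterations=3):
    """Route M primary capsules to N higher-level capsules.

    u_hat has shape (M, N, D): prediction of capsule m for capsule n.
    Returns the output vectors v with shape (N, D) and the last couplings c (M, N).
    """
    M, N, _ = u_hat.shape
    b = np.zeros((M, N))                           # routing logits start at 0
    for _ in range(num_iterations):
        c = softmax(b, axis=1)                     # couplings of each m over all n
        s = np.einsum("mn,mnd->nd", c, u_hat)      # weighted sum of predictions
        v = squash(s)                              # higher-level capsule outputs
        b = b + np.einsum("mnd,nd->mn", u_hat, v)  # agreement update
    return v, c


rng = np.random.default_rng(0)
u_hat = rng.normal(size=(6, 2, 4))                 # 6 primary, 2 higher-level capsules
v, c = dynamic_routing(u_hat)
```

Predictions that agree with a higher-level capsule's output receive larger coupling coefficients in the next iteration, which is how the routing concentrates on consistent features.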
Figure 7 shows that the major functions of a capsule are encoding significant textual features and their relationships by multiplying the input vectors by weight matrices, dynamic routing for transmitting output from low-level capsules to higher-level capsules, and summation of the weighted input vectors.

Figure 7: Operations within a capsule.
A capsule also applies a "squash" function, which scales the length of a vector to lie between 0 and 1 while preserving its direction, as given in equation (11).
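The squash non-linearity introduced with CapsNet [31] has the following form, written here with $s_n$ as the total input to higher-level capsule n and $v_n$ as its output, matching the symbols of the routing algorithm. Short vectors are shrunk to almost zero length and long vectors to a length slightly below 1.

```latex
v_n = \frac{\lVert s_n \rVert^{2}}{1 + \lVert s_n \rVert^{2}} \, \frac{s_n}{\lVert s_n \rVert}
```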
The output H from the self-attention layer is taken by the capsule layer as input. The capsule layer extracts prominent refined information from H, which is required for toxicity detection, and feeds it to the output layer.
Algorithm 2: Pseudocode for XRBi-GAC
1: Preprocess the input comments by converting them to lower case and removing special symbols, URLs and emails.
2: Tokenize the preprocessed comments using XLMRobertaTokenizer and add the special tokens cls_token (<s>) and sep_token (</s>) at the beginning and end of each comment, respectively, giving the form <s>, tok_1, tok_2, ..., tok_n, </s>.
3: Convert the tokens from step 2 into an embedding sequence of the form $e_{<s>}, e_1, e_2, \ldots, e_n, e_{</s>}$.
4: Feed the embedding sequence from step 3 to the pre-trained XLM-RoBERTa and capture the last hidden state as $t_1, t_2, \ldots, t_n$.
5: Feed the time-distributed sequence output from step 4 to the Bi-GRU layer, which produces forward and backward hidden states at each time step.
6: Pass the Bi-GRU output to the self-attention layer to further capture the context-related information of the comment; its output is $H = (H_1, H_2, \ldots, H_n)$.
7: Feed H to the capsule network followed by a sigmoid layer to classify whether the comment is toxic.
8: return $y \in \{0, 1\}$
The output layer uses a fully-connected layer with sigmoid activation as its final prediction layer to aid in estimating binary classification probabilities. The two classification categories are namely non-toxic and toxic.
The process of toxic text classification using the proposed XRBi-GAC framework is summarized in Algorithm 2.
Experimental evaluation
Here, we first discuss the evaluation metrics and the experimental setup, followed by the baseline models used in this study.
Evaluation metrics
We use accuracy, macro-average precision, macro-average recall and macro-average F1-score as the evaluation metrics for the proposed work.
Accuracy measures the fraction of correct predictions. It is informative for balanced data but can be misleading for imbalanced data.
Precision is the proportion of samples predicted as positive that are actually positive, and recall is the proportion of actual positive samples that are correctly identified.
F1-score is the harmonic mean of precision and recall. Compared with accuracy, it is a more indicative measure of model performance on imbalanced data.
Macro F1-score for the evaluation is computed using the arithmetic mean of all the per-class F1-scores.
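The following sketch computes these metrics from scratch for a binary task and shows why accuracy alone can be misleading on imbalanced data. The function names and the toy labels are hypothetical; in practice a library routine such as scikit-learn's `f1_score` with `average="macro"` would normally be used.

```python
def per_class_prf(y_true, y_pred, positive):
    """Precision, recall and F1 when `positive` is treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def macro_scores(y_true, y_pred, classes=(0, 1)):
    """Arithmetic mean of the per-class precision, recall and F1."""
    rows = [per_class_prf(y_true, y_pred, c) for c in classes]
    return tuple(sum(r[k] for r in rows) / len(rows) for k in range(3))


def accuracy(y_true, y_pred):
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# A classifier that always predicts "non-toxic" on an 80/20 split.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0] * 10
acc = accuracy(y_true, y_pred)
macro_p, macro_r, macro_f1 = macro_scores(y_true, y_pred)
```

In this toy case the accuracy is 0.8 even though no toxic comment is detected, while the macro F1-score drops to about 0.44 because the toxic class contributes an F1 of zero.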
The loss function used in this work is a fusion of two loss functions, namely the binary cross-entropy loss ($L_b$) and the focal loss ($L_f$). $L_b$ is given in equation (19) and $L_f$ in equation (20).
Since detecting toxic comments in this study is a typical two-class problem, $L_b$ is used. In addition, $L_f$ addresses the class imbalance problem, so it is combined with $L_b$ because the training data is highly imbalanced. The loss function is formulated as follows:
$L = \lambda_1 L_b + \lambda_2 L_f$
where $\lambda_1$ and $\lambda_2$ are the fusion weights.
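A minimal sketch of such a fused loss, written as a weighted sum of the two terms, is shown below. The default weights follow the 1:1.2 ratio of binary cross-entropy to focal loss reported in the experimental setup; the focal-loss parameters `gamma=2.0` and `alpha=0.25` are the commonly used defaults and are assumptions here, since the values used in the experiments are not stated in the text.

```python
import math


def _clip(p, eps=1e-7):
    """Keep probabilities away from 0 and 1 so the logarithm stays finite."""
    return min(max(p, eps), 1.0 - eps)


def binary_cross_entropy(y, p):
    p = _clip(p)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def focal_loss(y, p, gamma=2.0, alpha=0.25):
    """Focal loss: down-weights well-classified samples by (1 - p_t) ** gamma."""
    p = _clip(p)
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def fused_loss(y_true, y_prob, lambda1=1.0, lambda2=1.2, gamma=2.0, alpha=0.25):
    """Mean of lambda1 * BCE + lambda2 * focal loss over the batch."""
    per_sample = [
        lambda1 * binary_cross_entropy(y, p) + lambda2 * focal_loss(y, p, gamma, alpha)
        for y, p in zip(y_true, y_prob)
    ]
    return sum(per_sample) / len(per_sample)


loss = fused_loss([1, 0, 0, 1], [0.9, 0.2, 0.6, 0.3])
```

The focal term contributes little for confidently correct predictions and more for hard or misclassified ones, which is what makes it useful when the toxic class is rare.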
The proposed work is implemented in Python using Scikit-Learn and Keras with TensorFlow as the backend. Experiments are conducted on Kaggle Kernels, a free platform for executing Jupyter notebooks with CPU (default), Nvidia K80 GPU, and TPU v3-8 accelerators. We use the XLM-R (base) transformer, which has 12 encoder layers, a hidden size of 768, 12 attention heads and 270M parameters. XLM-R is fine-tuned on the training data, and hyperparameters such as the maximum sequence length, learning rate, batch size and number of epochs are tuned using random search. The maximum sequence length is set to 256 and the batch size to 16. The dropout rate is set to 0.1 to reduce overfitting. Table 3 shows the hyperparameters used in the proposed model.
Table 3: Hyperparameter setting
Early stopping is used as a regularization technique, with a default learning rate of 1e-5. The sigmoid activation function is used because our task is binary classification. The loss function combines the binary cross-entropy loss and the focal loss in a 1:1.2 ratio.
We use Logistic Regression [32] with tf-idf text representation, CNN [33], Bi-LSTM [34] with fastText word embeddings, and Bi-GRU with fastText word embeddings as the four baseline models. LR is a traditional supervised machine learning model that needs manual feature engineering. Bi-LSTM and Bi-GRU are RNN-based DL models used with fastText word embeddings. CNN is another DL model used as a baseline.
Results and discussion
We analyze the results of the proposed XRBi-GAC model on the Jigsaw Multilingual Toxic Comment dataset and compare its performance with recent SOTA methods. The results on this dataset are shown in Table 4. The best-performing model is our proposed XRBi-GAC model, which shows a significant improvement over state-of-the-art multilingual models. The performance of the baseline models is shown in the first four rows of Table 4. LR+tf-idf performs worst, with an F1-score and accuracy of 0.739 and 0.814, respectively. CNN produces better accuracy than Bi-GRU+fastText but a lower F1 than Bi-LSTM+fastText and Bi-GRU+fastText. Bi-GRU+fastText performs best among the baseline models, achieving the highest F1 of 0.808, followed by Bi-LSTM+fastText with a slightly lower F1 of 0.793.
Table 4: Experimental results on the Jigsaw Multilingual Toxic Comment dataset
The bottom five rows of Table 4 show the performance of the base versions of BERT and RoBERTa, the multilingual SOTA models mBERT and XLM-RoBERTa, and the proposed XRBi-GAC model. BERT achieves an F1 of 0.811, a 0.37% improvement over Bi-GRU+fastText, and RoBERTa achieves an F1 of 0.820, an improvement over BERT. Both mBERT and XLM-RoBERTa are SOTA models for multilingual text data and attain F1-scores of 0.834 and 0.845, respectively, performing better than BERT and RoBERTa. The F1-score of mBERT is 2.8% and 1.7% higher than that of BERT and RoBERTa, respectively. XLM-R shows an improvement of 4.2% and 3% over BERT and RoBERTa, respectively, in terms of F1-score. XLM-R also performs better than mBERT, with improvements of 1.3% in F1-score and 2.5% in accuracy. Our proposed XRBi-GAC model outperforms all the other models, attaining the highest F1-score of 0.865 with a good balance between precision and recall. Compared with the multilingual SOTA models used in this work, XRBi-GAC shows an improvement of 3.7% over mBERT and 2.3% over XLM-R in terms of F1-score. RoBERTa achieves the highest precision of 0.862 but with a precision-recall gap of 8.9%, and mBERT achieves the highest recall of 0.882 with the largest gap of 9.1%. In contrast, XLM-R achieves the highest accuracy of 92.8%.
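The percentage improvements quoted here appear to be relative gains, that is, the difference between two F1-scores divided by the lower one. The short sketch below reproduces this arithmetic from the F1 values reported in the text; the helper name `relative_gain_percent` is hypothetical, and small deviations from the quoted figures can arise because the inputs are already rounded to three decimals.

```python
def relative_gain_percent(new: float, old: float) -> float:
    """Relative improvement of `new` over `old`, expressed in percent."""
    return (new - old) / old * 100.0


# F1-scores on the Jigsaw dataset as reported in the text.
f1_scores = {
    "Bi-GRU+fastText": 0.808,
    "BERT": 0.811,
    "RoBERTa": 0.820,
    "mBERT": 0.834,
    "XLM-R": 0.845,
    "XRBi-GAC": 0.865,
}

comparisons = [
    ("BERT", "Bi-GRU+fastText"),
    ("mBERT", "BERT"),
    ("XLM-R", "mBERT"),
    ("XRBi-GAC", "mBERT"),
]
for better, worse in comparisons:
    gain = relative_gain_percent(f1_scores[better], f1_scores[worse])
    print(f"{better} vs {worse}: {gain:.2f}%")
```

For example, the step from mBERT (0.834) to XRBi-GAC (0.865) is an absolute gain of 0.031 in F1, which corresponds to a relative gain of roughly 3.7%.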
We have evaluated the proposed XRBi-GAC model on the HASOC 2019 dataset to check its versatility and robustness. The proposed model is trained on the English-only data and assessed on the combined dataset containing tweets in English, Hindi and German, which is a similar task as the one we had in the multilingual toxic comment dataset used previously.
We evaluated the performance of Bi-GRU+fastText, mBERT, XLM-R and the proposed XRBi-GAC model on the HASOC 2019 dataset; the results are depicted in Figure 8. XRBi-GAC performs best, with an F1 of 0.829. XLM-R performs second best, with an F1-score of 0.801, followed by mBERT with an F1 of 0.786. Bi-GRU+fastText has the lowest F1-score of 0.765. XRBi-GAC shows a significant improvement of 3.5% and 5.4% over XLM-R and mBERT, respectively, in terms of F1-score. The proposed framework maintains a good balance between precision and recall with a gap of 1.9%, which is the smallest among all the models used on this dataset. XRBi-GAC also achieves the highest accuracy of 86.1%. Thus, XRBi-GAC outperforms both XLM-R and mBERT on this dataset.

Figure 8: Performance of the models on the HASOC 2019 dataset.
The proposed method captures better semantics in multilingual text than the existing methods. The Bi-GAC (Bi-GRU with Self-attention Capsule Network) module, built on top of the pre-trained XLM-R, efficiently extracts the vital information from multilingual text that is crucial for toxicity classification. The overall analysis of the results shows that XRBi-GAC outperforms the multilingual toxic text detection methods it is compared with.
In this paper, we have discussed the XRBi-GAC model, a deep learning framework for multilingual toxic text detection. This framework leverages the word representations from the pre-trained XLM-R, which are fed to a Bi-GRU with a self-attention mechanism that helps capture finer semantics and crucial information. We have integrated a capsule network, which refines the learned semantic information further. We have used a loss function that fuses the binary cross-entropy loss and the focal loss, which helps deal with the imbalanced data to an extent, as shown in the experimental results. We have assessed the performance of the proposed model on two publicly available datasets: the Jigsaw Multilingual Toxic Comment dataset and the HASOC 2019 dataset. The XRBi-GAC framework performs better than the existing multilingual toxicity detection methods by significant margins. However, the proposed framework is limited to identifying the presence of toxicity in multilingual text and does not estimate the intensity of toxicity in the comments. Therefore, future work on toxicity detection should address the intensity of toxicity in multilingual text data and the improvement of performance on low-resource languages.
