Abstract
Text classification is a well-studied task that has made it possible to address a considerable number of problems. However, one of its main difficulties is learning from data with class imbalance, i.e., datasets with only a few examples in some classes, which often represent the most interesting cases for the task. In this context, text classifiers overfit some particular classes, showing poor performance. To address this problem, we propose a scheme that combines the outputs of different classifiers, encoding them with the encoder of a transformer. By also feeding a BERT encoding of each example, the encoder learns a joint representation of the text and the outputs of the classifiers. These encodings are used to train a new text classifier. Since the transformer is a highly complex model, we introduce a data augmentation technique that allows the representation learning task to be carried out without over-fitting the encoding to a particular class. The data augmentation technique also produces a balanced dataset. The combination of both methods, representation learning and data augmentation, improves the performance of the trained classifiers. Results on benchmark data for two text classification tasks (stance classification and online harassment detection) show that the proposed scheme outperforms all of its direct competitors.
Introduction
Text classification is a pervasive task in many real-world applications [44]. Text-based classifiers facilitate the analysis of massive text collections, helping to label content in classes of interest for researchers and practitioners. The rise of social media has driven the development of more and better text classification techniques, helping to address complex problems such as online harassment detection [10] and stance classification [39], among other relevant tasks. Often, the data used to train text classifiers reflect natural imbalances in the classes of interest, with many examples in some categories and only a few in others. This imbalance arises because events of interest are often infrequent [27]. Imbalance leads to over-fitting, that is, an over-learning of the frequent classes. Consequently, text classifiers achieve good performance in the majority classes at the cost of poor performance in the other classes.
Several strategies have been proposed for dealing with unbalanced classes. Techniques based on data resampling [13] have explored oversampling of minority classes and undersampling of majority classes to produce balanced datasets. The performance of data resampling combined with data augmentation techniques, such as SMOTE [11], has also been explored. These techniques have shown unconvincing results in text classification [29]. One likely cause is that they tend to discard highly relevant information, such as the sequential nature of the text and its semantics. Semantic-agnostic methods may produce more examples, but with little meaning.
Some techniques with better results in text classification have sought to combine classifier outputs, producing a sort of encoding of each example at the level of meta-features. Bravo-Marquez et al. [7] showed that these strategies yield improvements in some text classification tasks, such as sentiment analysis. These strategies take inspiration from committee machines [30], a kind of classifier aggregation scheme that combines classifier outputs to produce a boosted label imputation process. These techniques apply label-aggregation functions, such as majority voting, at the output.
Despite the success shown by resampling techniques combined with data augmentation strategies such as SMOTE in related areas such as image processing [35], the results in text classification are still unconvincing. With the rise of deep neural network models, which increase the number of parameters, the need for big datasets with a balanced proportion of examples per class has increased. Deep learning models such as convolutional [32] or recurrent neural networks [25] have been applied to text classification, showing the same limitations as their predecessors [8]. Moreover, since these architectures require more data during the training phase, they present higher risks of over-fitting. The dominance of powerful architectures, such as the transformer [49], makes addressing these difficulties a priority.
To address this problem, we combine different strategies in a single coherent proposal. We introduce a framework for text classification that uses a transformer encoder to combine the outputs of several text classifiers, recoding each example of the training dataset. Our proposal also considers the BERT representation of each sample [17], a text embedding strategy that is also based on the transformer. In this way, we obtain a joint representation of the classifier outputs and the text, producing a residual connection architecture for representation learning. The architecture learns a new encoding of the dataset, which is fed to a text classifier. To avoid over-fitting the most frequent classes, we introduce a data augmentation method that generates more examples in minority classes. The data augmentation technique depends on the semantics and syntax of the original cases. To produce the new examples, we resample the GloVe space [40], a text embedding space generated from web data, conditioned on the syntax of each text and the class to which each sample belongs. In this way, the new examples generated by resampling are semantically consistent with the classes from which they were created. The combination of both strategies, that is, representation learning based on classifier outputs and data augmentation based on resampling of the GloVe space, allows training text classifiers that consistently obtain better results than their competitors.
To achieve these results, we performed the following tasks:
Several text classification strategies are analyzed and compared with our solution based on the transformer model.
We address the problem of combining classifier outputs using the transformer, and we show how our proposal outperforms state-of-the-art solutions.
We consider the problem of data imbalance, and we design a strategy to reduce the differences among the classes in terms of the number of examples; our data augmentation strategy can produce more examples in minority classes while preserving the semantics and syntax of the original cases.
We study how to use our proposal in two relevant social media tasks, online harassment detection and stance classification; our experiments consistently show improvements over state-of-the-art solutions.
By addressing all the tasks mentioned above, we show how we can improve the performance of text classifiers in tasks relevant to social media content processing. The chosen approach considers the complexities that arise when performing this kind of task in these scenarios, namely, machine learning with little labeled text data and class imbalance.
This article extends and ties together, in a complete framework, the following contributions:
How to apply self-attention to produce a new encoding of training examples using the encoder of the transformer (previously introduced in [8] and applied to stance classification).
How to use the encoder of the transformer to produce a joint representation of the text and classifier outputs of each training example (previously introduced in [9] and applied to online harassment detection).
Novel, unpublished contributions of this article are:
A new method for data augmentation based on GloVe resampling; the method produces more examples in minority classes while preserving the semantics of the original cases.
A complete explanation of the proposed combination of classifier outputs, connecting it with the original proposal of the transformer.
A new set of tests, which evaluate the proposed framework by combining our data augmentation technique with the representation learning stage based on the transformer encoder.
The article is organized as follows. After an overview of the current related work in Section 2, we discuss in Section 3 how to use the encoder of the transformer to learn a joint representation of classifier outputs and text examples. After this, we introduce in Section 4 our data augmentation strategy based on GloVe resampling. Experiments validating our solution, conducted on benchmark datasets, are presented and discussed in Section 5. Finally, we conclude in Section 6, commenting on applications and extensions of our framework.
In this section, we review work related to our proposal. The section is divided into three parts. The first discusses text classification, the particularities of this problem, and its main challenges; descriptions of ensembles and data augmentation techniques for classifiers are included in this subsection. The next two subsections describe the related work in the specific tasks in which we evaluate our proposal. The works reviewed in stance classification and online harassment detection include text classification approaches based on deep learning architectures, against which we compare our proposal.
Text classification
Automatic text classification seeks to solve several problems related to the handling of large volumes of documentary data, such as the organization of collections, document partition for indexing systems, or content categorization [44]. The first automatic text classifiers worked on a representation of the content of each document based on features defined by experts. In all these representations, the features were the terms used in each document. Using rules of membership of keywords, the first automatic document categorizers were based on decision trees [18].
The vector representations of documents on the term-space used weight schemes based on, for example, the Tf-Idf model [41], initially proposed for information retrieval tasks. These vector representations made it possible to avoid over-representation of common terms by giving higher relevance to discriminative terms. Using support vector machines, it was shown that text classifiers could obtain good performance in benchmark datasets, close to the results obtained by human experts [26]. However, in multi-class classification schemes, automatic techniques generally showed over-fitting to some particular classes. To address this problem, some strategies based on committee machines were explored [30]. A first strategy based on bagging [31], that is, the training of several independent classifiers, showed good results in text classification based on label imputation by majority voting. However, there were no improvements in datasets with unbalanced classes. In this scenario, another ensemble strategy obtained better results; adaptive boosting (AdaBoost) [42] had the particularity of retraining classifiers using re-sampling, giving a higher probability to those documents in which the classifiers obtained worse results. In a kind of specialization strategy, AdaBoost showed improvements in some unbalanced datasets.
Many of the difficulties of text classifiers lie in the lack of labeled data to train machine learning algorithms, or in the imbalance of these datasets. The use of re-sampling strategies has been explored in classification, with emphasis on oversampling of minority classes or sub-sampling of majority classes [13]. The well-known SMOTE technique [11], which produces synthetically generated examples of minority classes by performing linear combinations between feature vectors, has shown good results in related areas such as image classification [35] but has unconvincing results in text classification. This limitation arises because SMOTE does not take into account the semantics and syntax of the original examples when generating new synthetic cases. We believe that a data augmentation technique that preserves the semantics of the original dataset can outperform the results achieved by its predecessors.
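To make the limitation concrete, the following minimal sketch shows the interpolation step at the core of SMOTE. It is an illustration under simplifying assumptions, not the reference implementation: the function name `smote_sample` and the toy vectors are hypothetical, and the nearest-neighbor search that SMOTE performs before interpolating is omitted.

```python
import random

def smote_sample(x, neighbor, rng):
    """Return a synthetic point on the segment between x and a neighbor."""
    lam = rng.random()  # interpolation weight in [0, 1)
    return [xi + lam * (ni - xi) for xi, ni in zip(x, neighbor)]

rng = random.Random(0)
x = [1.0, 0.0, 2.0]         # toy feature vector of a minority example
neighbor = [3.0, 0.0, 0.0]  # toy feature vector of a nearby minority example
synthetic = smote_sample(x, neighbor, rng)
```

When the coordinates are term weights, the synthetic vector mixes the terms of two documents without corresponding to any actual sentence, which is one way to see why the semantics and syntax of the originals are not preserved.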
The rise of deep learning, that is, the prevalence of learning methods based on deep neural network architectures, has modified the text classification scenario. A first new facet of text classification influenced by deep learning consists of the pervasive incorporation of word embeddings for the representation of documents. Word embeddings are dense vectors that provide continuous representations in a low-dimensional feature space. Word embeddings are trained on massive text collections, such as Wikipedia, to capture more semantic relationships between words, encoding them in the text representation space. Different NLP tasks have been evaluated using word embeddings, with BERT [17] being the dominant text encoding today. BERT embeddings are obtained by training a language model with a transformer architecture [49]. The encodings that the model computes to solve this task are those finally extracted from the transformer to represent each word.
The performance of text classifiers has improved by incorporating word embeddings during the learning process. Recurrent neural networks [25] have been explored in tasks related to text classification. Generally, these networks ingest the text word by word, taking advantage of the fact that word embeddings are computed at this level of abstraction. In the case of convolutional networks [32], a representation layer is often jointly learned with the classification task. In the following subsections, we will show how these strategies have been used in the two tasks in which we will explore the performance of our proposal.
Stance classification
One of the first works that addressed automatic stance classification aimed at recognizing stances in on-line ideological debates [48]. In that work, the authors explored the utility of sentiment and opinions, building an arguing lexicon from a manually annotated corpus. Using the entries of the lexicon as features, the authors used supervised learners for stance classification on four different on-line debate forums. Anand et al. [1] examined stance classification on a corpus of 4873 posts across 14 topics gathered from the website ConvinceMe.net. The authors showed that rebuttal posts were hard to classify using text features. They also demonstrated that methods that take into account the context of the post (in an on-line social network this refers to conversational threads) might be helpful for this task. Walker et al. [50] coded an extensive collection of posts, tagging the level of agreement between consecutive posts. The released corpus, named the Internet Argument Corpus (IAC), comprises posts manually sided for the topic of discussion, extracted from 4forums.com, a site devoted to on-line debate. The authors concluded that rebuttal posts are hard to detect even for human annotators. The main reason for these difficulties is the extensive use of stylistic ambiguity in rebuttals, such as sarcasm and irony. Faulkner [20] investigated stance classification at the document level, proposing a set of text-based features to capture the stance of students' essays (answer) concerning an essay prompt (affirmation). Several machine learning methods were studied for stance classification in two-sided debates by Hasan et al. [23]. The authors concluded that there is no clear winning method, but sequence models such as Hidden Markov Models outperform their competitors in some cases.
The first work on stance classification for news comments was authored by Sobhani et al. [47]. The authors used topic modeling to tag arguments, and the tags were subsequently used for stance classification (agree/disagree) of news comments. Stance classification on Twitter was addressed by Lukasik et al. [36], who explored the use of temporal dependencies along sequences of tweets to improve the performance of a stance classifier based on Hawkes processes. The relation between stance and rumor veracity was studied in the PHEME project [15], where several resources were developed to tag misinformation, disinformation, rumors, and speculations. As part of this project, several posts were labeled according to their stance towards target information. In this line of research, Zubiaga et al. [56] studied how a tree-CRF classifier performed on stance classification by modeling conversational threads in Twitter replies. Another contribution of this work is the construction of a fine-grained taxonomy for stance classification, extending the scenario from two classes (agree/disagree) to four categories (supporting/denying/questioning/commenting). There is a consensus that these four classes better represent the complexity of the task.
In recent years, deep learning architectures have become a dominant approach in stance classification. LSTM networks were applied for the first time to this task by Zubiaga et al. [55]. The authors showed that the use of sequential learning architectures is useful for this task. Bidirectional LSTM-based encodings of tweets were used for stance classification by Augenstein et al. [3] on the Twitter Stance Detection corpus, a dataset released by Mohammad et al. during the SemEval 2016 challenge [38]. Chen et al. [12] used convolutional neural networks (CNN) to jointly address stance classification and rumor detection on the corpus released for the SemEval 2017 challenge [16], while Lozano et al. [34] used an ensemble classification approach, achieving first place in that competition. In the same competition, Bahuleyan et al. [6] used XGBoost, an additive and iterative tree-based supervised machine learning approach based on AdaBoost. XGBoost achieved second place in the competition. Recent efforts aim to jointly learn rumor detection and stance classification using two-layered gated recurrent unit (GRU) networks [37].
Online harassment detection
Several works have approached the detection of online harassment from a classic machine learning perspective [10, 14, 51]. These articles generally combine features extracted from messages with features retrieved from user profiles, using a feature-engineering strategy. Combining both sources of information, several of these methods train supervised learning algorithms like support vector machines or random forests. A limitation of many of these works is that they are sensitive to the imbalance of labeled data. In practice, many of these methods fail to generalize well to other datasets, which limits their use in real environments. A thorough review of these types of techniques is addressed in [43].
More sophisticated models, such as those studied in deep learning, have also been applied to the problem of hate speech detection. One of the advantages of deep learning architectures is that they allow the neural network to acquire an adequate representation of the problem. The use of text encoders has given these types of models an advantage over conventional machine learning models. For example, convolutional networks [22] have shown good results in the Waseem and Hovy dataset [52]. Recurrent neural networks based on the GRU architecture have also demonstrated good results in this dataset [54]. Nearly perfect results in this dataset were also reported using deep learning by Badjatiya et al. [5]. Unfortunately, many of these models have overfitting problems and are therefore not transferable to production. Recently, Arango et al. [2] showed that there are also problems in the generation of the datasets considered standard for the evaluation of this type of task. Among these problems, the most worrying is the population bias in the samples that make up the dataset. These works show that the hate speech detection problem is far from being solved.
A major problem with these datasets is the imbalance between classes. Hate speech detection must be carried out in scenarios where most conversations are neutral and harassment is exceptional. However, being exceptional does not make it less critical: the consequences that harassment and hate speech have on social network users are severe. To address the problem of imbalance, the authors of [45] use techniques to augment and generate texts, producing training data with balanced classes. Along the same line, Sharifirad et al. [46] showed that a promising way to address the problem is to define a finer level of granularity for this task. Based on this latest work, the “SIMAH” challenge1
Far from having mature and robust solutions, this task still poses many challenges. For more details on all hate speech detection variants, the reader is referred to the survey by Fortuna and Nunes [21].
Our proposal is inspired by classifier ensemble techniques. In these techniques, classifiers are trained to solve a specific task. A key aspect of classifier ensembles is the definition of a label consolidation strategy. For each example, each base classifier imputes a class; the consolidation phase consists of determining which of these labels to assign. The imputation can be resolved in many ways, one of the most popular being majority voting.
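As a point of reference for the consolidation phase, majority voting can be sketched in a few lines. The function name `majority_vote`, the tie-breaking rule (the label seen first wins), and the toy labels are assumptions made for this illustration.

```python
from collections import Counter

def majority_vote(labels):
    """Return the most frequent label; ties go to the label seen first."""
    counts = Counter(labels)
    best = max(counts.values())
    for label in labels:
        if counts[label] == best:
            return label

# Labels imputed by three hypothetical base classifiers for one example.
votes = ["support", "deny", "support"]
consolidated = majority_vote(votes)
```

Such a rule is fixed in advance and has no parameters to learn from the data.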
There are two families of ensemble strategies, according to the type of training applied to the base classifiers. Bagging [30] consists of training several independent and decoupled classifiers. AdaBoost [42] trains classifiers in sequence, using the errors of the earlier classifiers to define the sampling probabilities. Through a resampling strategy, AdaBoost learns new classifiers on the examples in which previous classifiers performed worse.
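The reweighting idea behind AdaBoost can be illustrated with one round of the classic binary (discrete) update. This is a sketch under that assumption; the function name `adaboost_reweight` and the toy weights are hypothetical, and multi-class variants use a different formula.

```python
import math

def adaboost_reweight(weights, correct):
    """One AdaBoost round: raise the sampling weight of misclassified examples."""
    err = sum(w for w, ok in zip(weights, correct) if not ok) / sum(weights)
    alpha = 0.5 * math.log((1.0 - err) / err)  # assumes 0 < err < 1
    updated = [w * math.exp(-alpha if ok else alpha)
               for w, ok in zip(weights, correct)]
    total = sum(updated)
    return [w / total for w in updated], alpha

weights = [0.25, 0.25, 0.25, 0.25]    # uniform sampling probabilities
correct = [True, True, True, False]   # the last example was misclassified
new_weights, alpha = adaboost_reweight(weights, correct)
```

After the update, the misclassified example is more likely to be drawn when the next classifier is trained.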
Our proposal is inspired by Bagging since it is built on independent and decoupled classifiers. An essential difference between our proposal and Bagging is in the consolidation phase of the classifier outputs. Instead of using a predefined function (for example, majority voting), we encode each sample using the labels imputed by the classifiers. Our strategy replaces the consolidation phase with one based on representation learning. To achieve this goal, we define a set of learnable parameters that are adjusted to learn an encoding of each example that consolidates both the representation of the text and the classifiers’ outputs. To determine this representation, we ask the encoder to solve the original classification problem by adding a softmax classifier to the output of the encoder.
A key element of our proposal is the type of encoder used for the representation learning phase. We use the encoder of the transformer [49], a widespread deep-learning encoder-decoder architecture. The transformer has been used to learn text encodings, BERT [17] being the most famous of them. The key to BERT is in the use of the transformer architecture, which incorporates a self-attention mechanism. We take inspiration from BERT in the way it uses the transformer for representation learning. Like BERT, we use the transformer encoder to encode each example, taking advantage of the transformer’s self-attention mechanism. Unlike BERT, instead of computing a language model, we use the encoder to encode the classifiers’ outputs, asking the transformer encoder to solve a classification task. This purpose is achieved by introducing a set of learnable parameters that help to encode the outputs of the classifiers along with the encoding of the text of each example. As a result, our proposal learns a joint representation of the dataset that combines text and classifier outputs, in a sort of residual architecture that adds inputs and outputs in the same architecture layer.
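The construction of the encoder input can be sketched as follows. This is a simplified illustration and not the exact formulation: we assume that each base classifier emits a vector of class scores, that a single projection matrix maps these vectors to the dimensionality of the text encoding, and that the resulting vectors are stacked as a short sequence. The names `project` and `build_input_sequence`, the toy sizes, and the random stand-in for the BERT vector are hypothetical.

```python
import random

def project(vector, matrix):
    """Multiply a row vector (length c) by a c x d matrix."""
    return [sum(v * row[j] for v, row in zip(vector, matrix))
            for j in range(len(matrix[0]))]

def build_input_sequence(text_encoding, classifier_outputs, projection):
    """Stack the text encoding with the projected output of each classifier."""
    sequence = [list(text_encoding)]
    for output in classifier_outputs:
        sequence.append(project(output, projection))
    return sequence

rng = random.Random(0)
d, c = 8, 4  # toy sizes; a real BERT encoding is much larger
text_encoding = [rng.gauss(0, 1) for _ in range(d)]  # stand-in for a BERT vector
classifier_outputs = [
    [0.7, 0.1, 0.1, 0.1],
    [0.2, 0.5, 0.2, 0.1],
    [0.6, 0.2, 0.1, 0.1],
]
# Randomly initialized here; in practice these parameters would be learned.
projection = [[rng.gauss(0, 0.1) for _ in range(d)] for _ in range(c)]
sequence = build_input_sequence(text_encoding, classifier_outputs, projection)
```

Every row of `sequence` has the same dimensionality, so self-attention can relate the text encoding to each classifier output.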
Next, we will explain our proposal, detailing its connection with the transformer architecture and with BERT.
The transformer
The rationale of the transformer is based on the encoder-decoder architecture. Given an input sequence
The transformer architecture is defined using stacked blocks. This rationale allows the architecture to learn intermediate representations of the input sequence at different levels of abstraction. Usually, the transformer makes use of six stacked blocks, both in the encoder and in the decoder. Each of these blocks makes use of two essential elements for the transformer to work, residual connections [24] and layer normalization [4]. The residual connections of each block combine the encoding of the input with the output of the block, avoiding the problem of degradation that occurs in deep networks. Layer normalization centers the activations of each layer around their mean and scales them according to their variance. This correction helps to address vanishing and exploding gradients by countering the internal covariate shift between different sequences.
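The combination of a residual connection and layer normalization can be sketched for a single position as LayerNorm(x + SubLayer(x)). The sketch below is a minimal illustration: the learnable gain and bias of layer normalization are omitted, and the function names and the toy sublayer are assumptions.

```python
import math

def layer_norm(x, eps=1e-5):
    """Center a vector around its mean and scale it by its standard deviation."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def residual_block(x, sublayer):
    """Compute LayerNorm(x + SubLayer(x)) for one position."""
    out = sublayer(x)
    return layer_norm([a + b for a, b in zip(x, out)])

x = [1.0, 2.0, 3.0, 4.0]
# Toy sublayer standing in for self-attention or a feed-forward layer.
y = residual_block(x, lambda v: [0.5 * t for t in v])
```

Because the input is added back before normalizing, the block can fall back to passing its input through when the sublayer contributes little.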
The essential component of the transformer architecture is the self-attention block. The motivation to include this block is to allow the coding of each example to have non-local encodings, favoring the detection and coding of long-term dependencies in the input vectors. To achieve this goal, the transformer’s self-attention mechanism defines three sets of learnable parameters for the encoding of each training example; these are query, key, and value. These parameter matrices reduce the dimensionality of the original encoding of each case. To capture the long-term dependencies in the input, each word in the input is encoded by computing the dot product of its query vector and the key of a given position. This product is scaled and fed to a softmax to calculate a word attention score. The last step of the mechanism consists of multiplying the attention score by the vector of values, which yields an attention vector for the example. This step can be done on a matrix basis, for reasons of efficiency. Consequently, the self-attention layer is expressed as follows:
where
The transformer defines
The output of the self-attention block is connected to a residual connection layer with the input and to a normalization layer. A second block consists of a feed-forward layer, also mediated by a residual connection and a layer normalization. These blocks (self-attention
The transformer decoder is similar, but it includes a variant that allows computing a language model task with long-term dependencies with context both to the left and to the right of each target word. The modification is called the masked language model, and it consists of a block that randomly masks some words in the input and output in the self-attention layer. The rest of the transformer layer considers the same encoder blocks, that is, the self-attention and feed-forward layer mechanism, all of them combined with residual connections and layer normalization.
At the output of the decoder, a linear layer and a softmax are inserted, which compute scores on the vocabulary. The task with which the transformer typically trains is a language model (next word prediction). There are variants of the tasks with which the architecture is trained, which can be reviewed in the work of Vaswani et al. [49].
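The self-attention computation described in this subsection can be sketched with plain lists, following the scaled dot-product formulation of Vaswani et al. [49]. The sketch assumes that the query, key, and value vectors have already been obtained from the learned projections, which are omitted; the function names and toy vectors are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of row vectors."""
    d_k = len(keys[0])
    output, weights = [], []
    for q in queries:
        # Dot product of the query with every key, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        w = softmax(scores)
        weights.append(w)
        # Attention-weighted sum of the value vectors.
        output.append([sum(wi * v[j] for wi, v in zip(w, values))
                       for j in range(len(values[0]))])
    return output, weights

queries = [[1.0, 0.0], [0.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
output, weights = attention(queries, keys, values)
```

Each row of `weights` sums to one, so every output vector is a convex combination of the value vectors, regardless of how far apart the positions are in the sequence.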
BERT
BERT [17] makes use of the transformer encoder to build its word embeddings. To produce the input sequence, BERT tokenizes the text using a sub-word tokenizer known as the word-piece model (WPM) [53]. BERT uses fixed-length sequences, for which it truncates the sequences produced by WPM. Then, the input sequence is encoded using a token embedding matrix. BERT combines the input with two additional encodings, a position embedding that encodes the position of each symbol in the input, and a sentence encoding. In addition to the language model task (next word prediction), BERT defines a second task called next sentence prediction, which consists of indicating whether the following sentence is the continuation of the preceding sentence or not. To solve this task, the sequences of sentences intersperse random sentences in 50% of the cases. The BERT authors indicate that, thanks to this second task, the embeddings can encode characteristics at a higher level of abstraction. Therefore, the coding of the input in BERT is given by:
where
At the output, BERT connects a softmax, preceded by a fully connected layer with a GELU (Gaussian error linear unit) activation function. The softmax layer computes scores on the vocabulary:
where
Combining classifiers outputs with the transformer
Our proposal makes use of the transformer encoder and BERT word embeddings in a scheme that brings together both strategies. The proposed strategy combines the outputs of a set of base classifiers, which are concatenated to the BERT encoding of each input. To achieve this goal, a matrix of learnable parameters is defined, transforming the output vectors of each base classifier to the dimensionality of the BERT encoding. To encode each sentence, we use the encodings that BERT provides at this level of granularity. This method is implemented in a web service named BERT-as-service,2
Let
A matrix
where
assuming that
Each layer of the transformer maintains the dimensionality of the input, through a chain of blocks of the type LayerNorm (SubLayer
where
The last part of our encoding strategy includes a softmax, which produces a classification in the dataset classes from the example encoding:
where
The data augmentation strategy that we introduce is based on word substitution. Using the sentences from the original dataset, a word-level sampling process is conducted, replacing the original words with other words that preserve the original semantics and syntax of the examples. The resampling is done at the sentence level, and the substitution is driven by relationships found between GloVe vectors. For this purpose, we define a word substitution operator, which operates using linguistic regularities of the GloVe encodings. Linguistic regularities have been one of the great findings that these encodings offer, allowing the discovery of relationships between related terms or excluded terms. We take inspiration from Levy and Goldberg’s work on linguistic regularities [33] to define a word substitution operator that is useful for our data augmentation strategy.
The proposed data augmentation strategy follows a rationale similar to that of SMOTE, in the sense that it uses resampling on minority classes to produce class balance. Our approach is also inspired by the type of training BERT uses [17]. BERT uses a word resampling rate in the masked language model task, which is crucial for word embeddings, considering the left and right contexts around each target word. Unlike BERT, which is generally used in other types of datasets (for example, Wikipedia), resampling of sentences in social media such as Twitter imposes the difficulty of working with shorter sentences.
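The resampling-to-balance idea can be sketched as a loop that adds augmented copies of minority-class sentences until every class reaches the size of the largest one. This is an illustration under assumptions: the function name `balance_by_resampling` is hypothetical, and the `augment` argument is a placeholder for the word substitution procedure (the identity function is used in the toy call).

```python
import random
from collections import Counter

def balance_by_resampling(examples, augment, rng):
    """Add augmented minority-class sentences until all classes have equal size."""
    counts = Counter(label for _, label in examples)
    target = max(counts.values())
    by_label = {}
    for text, label in examples:
        by_label.setdefault(label, []).append(text)
    balanced = list(examples)
    for label, texts in by_label.items():
        for _ in range(target - counts[label]):
            source = rng.choice(texts)  # sample an original sentence of this class
            balanced.append((augment(source), label))
    return balanced

examples = [("a b", "comment")] * 4 + [("c d", "deny")]
balanced = balance_by_resampling(examples, lambda s: s, random.Random(0))
```

With a real `augment` function, each added sentence would differ from its source in the substituted words.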
The idea of our data augmentation strategy is to keep the semantics and syntax of the original examples safe when generating new sentences. For this, our approach takes two complementary actions. The first introduces syntactic invariance regarding the original sentence, constraining the type of words to be replaced. Syntactic invariance is fulfilled by constraining the substitution of a sampled word to another that has the same Part-of-Speech tag (POS-tag) in the generated sentence.
A key element of our data augmentation strategy is to preserve the semantics of the original sentence in the generated sentence. To achieve this goal, we define a word substitution operator that makes use of the linguistic regularities of the GloVe vectors. Given a word to substitute, our operator considers the context words in the original sentence as words related to the target word. The related words correspond to the terms contiguous to the target word. The notion of an adjacent word also considers skip-grams; that is, words that are at a distance greater than one. The number of words in the context window is a parameter of the operator. Usually, the length of the context window is five words, including the target term. The operator also considers contrary related terms, called negative words in the work of Levy and Goldberg [33]. To determine negative words for the target word, the operator randomly retrieves sentences from the other classes in the dataset in which the target word occurs. Negative words for the target word are terms that are used in the context of the target word in the sentences of the other classes and that do not belong to the list of positive words. Therefore, the substitution operator uses the lists of positive and negative words to implement the most-similar-word operation for the target, using a proximity function.
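The behavior of such an operator can be illustrated with a small sketch. The scoring function below is an assumption: it adds the cosine similarities to the target and the positive words and subtracts those to the negative words, in the spirit of the additive objective discussed by Levy and Goldberg [33], and it keeps only candidates with the same POS tag as the target. The function names, the two-dimensional toy vectors, and the tag dictionary are hypothetical stand-ins for GloVe vectors and a real tagger.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def substitute(target, positives, negatives, vectors, pos_tags):
    """Pick the same-POS word closest to the positives and farthest from the negatives."""
    excluded = {target, *positives, *negatives}
    best_word, best_score = None, -math.inf
    for word, vec in vectors.items():
        if word in excluded or pos_tags[word] != pos_tags[target]:
            continue  # enforce syntactic invariance through the POS tag
        score = cosine(vec, vectors[target])
        score += sum(cosine(vec, vectors[p]) for p in positives)
        score -= sum(cosine(vec, vectors[n]) for n in negatives)
        if score > best_score:
            best_word, best_score = word, score
    return best_word

vectors = {
    "good": [1.0, 0.1], "great": [0.9, 0.2], "bad": [-1.0, 0.1],
    "awful": [-0.9, 0.2], "movie": [0.1, 1.0], "film": [0.2, 0.9],
}
pos_tags = {"good": "ADJ", "great": "ADJ", "bad": "ADJ",
            "awful": "ADJ", "movie": "NOUN", "film": "NOUN"}
replacement = substitute("good", ["movie"], ["bad"], vectors, pos_tags)
```

In this toy vocabulary the operator replaces "good" with "great", since "awful" lies close to the negative word "bad" and the nouns are filtered out by the POS constraint.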
Sampling words in sentences for data augmentation
Our data augmentation technique operates at the sentence level. The method samples at random sentences in the minority classes of the dataset until class balance is achieved. The method works with a word-level sampling fraction. Let
The selection of terms contrary to the target is made by sampling at random a sentence in
Word substitution using linguistic regularities
The word substitution operator makes use of the linguistic regularities found in Levy and Goldberg’s work [33]. In that work, the authors used a set of related words and a set of negative words to implement a similarity search in the embedding space. We define an operator to replace the target word, using
where
Note that the number of words in the sets
Note that Eq. (7) allows carrying out the process of replacing
Equation (6) gives us a list of related terms sorted by similarity. The word substitution method picks from this list the term with the highest similarity score that has the same POS-tag of
To study the performance of our proposal, we worked on two tasks: stance classification and harassment detection. These two tasks correspond to multi-class classification problems. For both tasks, we work with datasets widely known in the area, labeled by experts and validated by the community. In each of these tasks, we test the performance of our proposal with and without data augmentation.3
Git repository:
In this section, we first explain some essential characteristics of datasets and tasks in which we study our proposal. Then, we provide details of the preprocessing performed on the text, the metrics and validation strategies used, and the specific strategies used to train each method. Finally, the results are shown, and the main findings derived from the validation are discussed.
Stance classification
We use a publicly available Twitter dataset named RumourEval, used on SemEval 2017 (Task 8) [16], consisting of two subtasks: (a) stance classification, and (b) veracity classification.
RumourEval data has already been annotated for veracity and stance following a published annotation scheme [56]. The labeling process was conducted as part of the PHEME project [15], where the relation between rumor veracity and stance was studied. This way, each tweet presents a stance related to a claim defined as follows:
Supporting (S): The tweet supports the veracity of the claim. Denying (D): The tweet denies the veracity of the claim. Questioning (Q): The tweet demands additional evidence. Commenting (C): The tweet is related to the claim but it is not helpful to infer its veracity.
The dataset considers three partitions: training, validation, and testing. The training and validation partitions comprise 297 threads collected for eight events, which include 4,519 tweets in total. These events include popular breaking news such as the Charlie Hebdo shooting in Paris, the Ferguson unrest in the US, and the Germanwings plane crash in the French Alps. The testing partition includes 1,021 tweets in total. These include 20 threads extracted from the same events as the training set and eight threads from two other events. The distribution of tweets per partition is summarized in Table 1. All the tweets are in English.
Distribution of tweets in training, validation, and testing partitions for RumourEval, subtask a
The task to be solved is to predict the stance of the tweet concerning the original claim. This task is a multi-class classification problem with four imbalanced classes, as indicated in Table 1. The majority classes (commenting and supporting) are probably the least interesting since they confirm or amplify the reach of the original claim. The minority classes are at the same time the most interesting since they question or deny the original claim. Tweets of this type usually add new information that allows questioning or denying the original claim. The imbalance ratio between the smallest class (denying) and the majority class (commenting) is almost 1 to 9, which shows a severe class imbalance in the dataset. The imbalance reflects what happens in social media, where a large volume of tweets tends to amplify the effect of a claim, and only a few provide new information, denying or questioning the validity of the original claim.
One of the discovery challenges raised within the ECML/PKDD 2019 conference consisted of the automatic detection of online harassment in social media.4
Distribution of tweets in training, validation, and testing partitions for SIMAH
Table 2 shows that the task corresponds to a multi-class classification problem. The challenge was introduced as two complementary tasks; one of them is a binary classification, in which the harassment class joins the three specific classes defined for the second task. We approach the task as a single classification problem with four imbalanced classes. The class imbalance is more severe than in the stance dataset. The majority class (non-harassment) corresponds to almost 2/3 of the total number of tweets in the dataset. The imbalance in the dataset between the minority class (physical harassment) and the majority class has a ratio of 1 to 30. The number of tweets provided in the training partition for the minority classes is extremely low, with a ratio of 1:70 with respect to the majority class. As in the stance dataset, the task is to predict the class at the tweet level.
Text preprocessing
As social media data sources are unstructured and noisy, we require a careful preprocessing procedure before ingesting the data. Text preprocessing saves space and computational time during the learning stage. It also prevents the ingestion of noisy data, limiting the effect of artifacts during the learning process. Accordingly, the text normalization procedure applied to the tweet datasets removes punctuation marks and digits and transforms the text to lowercase. We also applied a stemmer in order to remove morphological affixes from words. To process jargon, we removed emojis, and then we applied the following rules:
HTML marks were replaced by the symbol <html>. Hashtags ( Mentions (@word) were replaced by the symbol <user>. Cardinals were replaced by the symbol <number>.
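The normalization rules above can be sketched with regular expressions. This is a minimal sketch under stated assumptions: the regular expressions, the function name `normalize_tweet`, and the order in which the rules are applied are hypothetical choices, hashtags are simply stripped of their `#` by the punctuation step, and stemming and emoji removal are omitted.

```python
import re
import string

PLACEHOLDERS = {"<html>", "<user>", "<number>"}
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def normalize_tweet(text):
    """Apply placeholder rules first, then lowercase and strip punctuation."""
    # Assumed patterns: HTML tags, @mentions, and runs of digits.
    text = re.sub(r"</?[A-Za-z][^>]*>", " <html> ", text)
    text = re.sub(r"@\w+", " <user> ", text)
    text = re.sub(r"\d+", " <number> ", text)
    tokens = []
    for token in text.lower().split():
        if token not in PLACEHOLDERS:
            # Punctuation is removed only from ordinary tokens,
            # so the angle brackets of the placeholders survive.
            token = token.translate(PUNCT_TABLE)
        if token:
            tokens.append(token)
    return " ".join(tokens)

example = normalize_tweet("@BBC Is this TRUE?? 12 dead<br>")
```

Applying the placeholder rules before punctuation and digit removal is a design choice of this sketch; in the opposite order, mentions and cardinals would be destroyed before they could be replaced.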
Once each tweet was preprocessed, we used GloVe to encode words. GloVe vectors were pre-trained on a Twitter corpus of two billion tweets (27 billion tokens) with a vocabulary of 1.2 million words [40]. These vectors were used in the base classifiers and were ingested one-at-a-time as a sequence of word vectors per tweet. We used word vectors with 200 dimensions in the baselines.
We used five baselines to test our method. Two of them are based on convolutional neural networks (CNNs) and the other three on recurrent neural networks (RNNs). CNN1 was implemented using one convolutional layer, and CNN2 with two layers. For both CNNs, we used ReLU activation functions. In the case of RNNs, we used GRU cells. For one RNN, we used a two-layered GRU, while the other RNNs were implemented using a three-layered GRU. For the five baselines, the output was produced using a softmax. As a loss function, we used focal loss. For the RNN2 baseline, we replaced the focal loss function with categorical cross-entropy. Focal loss was defined with gamma 2.0 and class weights inversely proportional to the amount of data instances per class in the dataset. In Table 3, we show the parameters of each architecture.
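The loss configuration described above can be illustrated for a single example. This is a minimal sketch, assuming the usual focal loss form, weight times (1 - p)^gamma times the negative log-probability of the true class; the normalization of the inverse-frequency weights and the function names are assumptions, and the class counts in the example are hypothetical.

```python
import math

def inverse_frequency_weights(class_counts):
    """Weights inversely proportional to class size (one possible normalization)."""
    total = sum(class_counts)
    num_classes = len(class_counts)
    return [total / (num_classes * count) for count in class_counts]

def focal_loss(probs, label, weights, gamma=2.0):
    """Focal loss of one example given its softmax output and true label."""
    p_true = probs[label]
    return -weights[label] * (1.0 - p_true) ** gamma * math.log(p_true)

# Hypothetical two-class example with a 9:1 imbalance.
weights = inverse_frequency_weights([90, 10])
easy_majority = focal_loss([0.9, 0.1], 0, weights)
hard_minority = focal_loss([0.9, 0.1], 1, weights)
```

With gamma set to 0 and unit weights the expression reduces to categorical cross-entropy, so the two loss functions used by the baselines differ only in how strongly well-classified examples are down-weighted.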
Architecture of our baseline models
Training settings vary depending on the specific scenario of classification. We indicate the setting for each learning scenario, either with the original data or using data augmentation.
Unbalanced case
The baselines used in stance detection were trained for 20 epochs in the case of the CNNs and RNN2, and for ten epochs in the case of RNN1 and RNN3. For harassment detection, the baselines were trained with 15 and 8 epochs, respectively. All the baselines were trained using mini-batch gradient descent with batches of 32 samples and the Adam optimizer for parameter updating. After training the baselines, we used the transformer encoder, as explained in Section 3, to learn a joint representation of each example and the classifiers’ outputs. The transformer was trained for 75 epochs in stance detection and for 60 epochs in harassment detection. The transformer used focal loss with class weights inversely proportional to each class size. The size of the hidden units for feed-forward layers was set to 128, with a dropout of 0.35 and 32 elements per mini-batch. In both tasks, the transformer was trained using Adam with a warmup of 6000 steps and a factor of 1. This setting corresponds to increasing the learning rate linearly for the first warmup training steps and decreasing it thereafter proportionally to the inverse square root of the step number. We used two encoders with 768 or 200 dimensions, depending on the encoding of the text (BERT as Service or GloVe, respectively). We used multi-head attention with four attention heads.
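The warmup schedule described above can be written in a few lines. This is a minimal sketch, assuming the schedule popularized with the original transformer, in which the rate is additionally scaled by the inverse square root of the model dimension; that scaling and the function name `warmup_lr` are assumptions, while the warmup of 6000 steps, the factor of 1, and the 768 dimensions come from the setting above.

```python
def warmup_lr(step, d_model=768, warmup=6000, factor=1.0):
    """Linear increase for `warmup` steps, then inverse square-root decay."""
    step = max(step, 1)  # avoid a zero step in the negative power
    return factor * d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

peak = warmup_lr(6000)       # the rate is highest at the end of warmup
halfway = warmup_lr(3000)    # half of the peak during the linear phase
later = warmup_lr(24000)     # half of the peak again after 4x the steps
```

The two branches of the `min` meet exactly at the warmup step, which is where the learning rate peaks.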
Balanced case (Augmented data)
SemEval 2017 was augmented from 4238 tweets to 10956 tweets, while the SIMAH dataset was augmented from 6374 to 14644 tweets. With this augmentation, the number of training epochs was redefined to keep the number of iterations constant. Then, for a fixed mini-batch size, the number of epochs was defined by:
where
For validation, we used accuracy and F1 score metrics. The latter metric is recommended for evaluating multi-class classification with class imbalance. The training was performed on the training partitions of each dataset, using the validation partition to adjust parameters during training. Each testing partition was reserved to evaluate the performance of classifiers after training. The results that we show in the next section correspond to results in the testing partitions.
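The difference between accuracy and F1 under class imbalance can be seen with a small computation. This is a minimal sketch with hypothetical labels; the function names are illustrative, and a class without predictions or examples is assigned an F1 of zero by convention.

```python
def accuracy(y_true, y_pred):
    """Fraction of examples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_per_class(y_true, y_pred, labels):
    """F1 score of each class, computed from true/false positives and negatives."""
    scores = {}
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[c] = 2 * tp / denom if denom else 0.0
    return scores

def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of the per-class F1 scores."""
    scores = f1_per_class(y_true, y_pred, labels)
    return sum(scores.values()) / len(labels)

# Hypothetical imbalanced case: a classifier that always predicts the majority class.
y_true = ["C"] * 8 + ["D"] * 2
y_pred = ["C"] * 10
acc = accuracy(y_true, y_pred)
macro = macro_f1(y_true, y_pred, ["C", "D"])
```

Here the accuracy is 0.8 while the macro F1 is only about 0.44, because the minority class contributes an F1 of zero to the average.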
In this section, we show the results per task in three different modalities, without data augmentation, using SMOTE, and using our data augmentation technique. In this way, we managed to separate the effect of the proposed method from the effect produced by data augmentation. In each table, the results are shown according to accuracy, F1 score at the macro level, and F1 score by class. This validation design allows us to distinguish the effects that occur globally from those that affect each class in particular. To evaluate the variability of our method produced by the data augmentation technique, we use the training and validation partitions to produce five models in five independent runs. The reported results correspond to the averages of these five models in the testing partition.
Results reported on test data for stance classification. Results reported by the state of the art methods correspond to the testing partition’s best trial performance. When we evaluate our proposal using our best trial, the margin with which we surpass the other methods increases. For example, our performance in the best experiment in F1 macro with masking at 0.15 rises to 0.486. We prefer to report the average results even though the comparison does not benefit us
We now describe the experimental results in stance classification. Table 4 presents the performance of the studied methods on test data, including the results achieved by state-of-the-art methods. We show the results achieved using the CNNs proposed by Chen et al. [12] and Lozano et al. [34] and the XGBoost proposed by Bahuleyan et al. [6], which achieved the first places in the SemEval 2017 competition in this dataset. We also show the results achieved by Ma et al. using a two-layered GRU [37]. Furthermore, we show the results obtained using Random Forest, linear SVM, and Gaussian SVM. The performance achieved by our proposal is shown in the last row of the table. We also trained a variant of our proposal as a baseline, feeding the transformer with just the BERT vectors of the sentences, indicated as BERT. The variant of our proposal reported in Table 4 uses data augmentation with masking at 15%.
The results in Table 4 show that the performance of the state-of-the-art methods is quite uneven. Lozano’s method [34] achieves the best result in the supporting class, and XGBoost [6] does the same in the commenting class. Both are majority classes in the dataset, which is why the accuracy of both methods is high. In terms of accuracy, XGBoost achieves the best result in this experiment. The results of these methods in minority classes are poor, with the denying class being the most difficult to predict.
Chen’s method [12] achieves a slightly more even performance between classes, at the cost of lowering performance in the supporting class. This result reduces its performance in accuracy and F1 score. Something interesting happens with the method of Ma et al. [37], which outperforms the three state-of-the-art methods in the questioning and denying classes. However, the method fails to beat them in F1 score. Random Forest, Linear SVM, and Gaussian SVM show over-fitting to the majority class. The transformer fed only with BERT sentence encodings achieves poor results, showing over-fitting to the commenting class. Our proposal achieves the best balance between classes in the experiment, at the cost of reducing performance in the supporting class. The proposal achieves state-of-the-art results in the denying and questioning classes, with significantly better performance than the rest of the evaluated methods. Given this improvement, our proposal achieves the best result in F1 score. Using data augmentation enables improvements in the questioning and denying classes, at the cost of decreasing performance in the supporting class. This result indicates that our data augmentation strategy managed to improve performance in the most challenging classes.
Data augmentation
To evaluate the proposed data augmentation technique, we performed a comparison with SMOTE. To do this, SMOTE was applied to each baseline, over-sampling the minority classes until balance between classes was achieved. Using the baseline models, we trained three variants of committee machines: one using the output with the highest confidence among the baselines (best-fit), one using the normalized Hadamard product of the classifiers’ outputs (norm), and one using a majority voting approach (voting). The results of this experiment using SMOTE for data augmentation are shown in Table 5. Note that our proposal based on the transformer cannot directly use SMOTE since it requires the encoding of the text of each example. Because SMOTE produces new cases as linear combinations of existing cases, it does not generate examples in the text space but in the continuous feature vector encoding space.
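The three committee machines can be sketched over the softmax outputs of the baselines for a single example. This is a minimal sketch: the function names and the example probabilities are hypothetical, and the tie-breaking rule in the voting variant (the first label to reach the highest count wins) is an assumption.

```python
from collections import Counter

def argmax(values):
    """Index of the largest entry of a list."""
    return max(range(len(values)), key=values.__getitem__)

def best_fit(outputs):
    """Class predicted by the classifier with the most confident output."""
    most_confident = max(outputs, key=max)
    return argmax(most_confident)

def normalized_hadamard(outputs):
    """Element-wise product of the probability vectors, rescaled to sum to one."""
    product = [1.0] * len(outputs[0])
    for probs in outputs:
        product = [a * b for a, b in zip(product, probs)]
    total = sum(product)
    return [p / total for p in product]

def voting(outputs):
    """Majority vote over the classes predicted by each classifier."""
    votes = Counter(argmax(probs) for probs in outputs)
    return votes.most_common(1)[0][0]

# Hypothetical outputs of three baselines over three classes.
outputs = [
    [0.90, 0.05, 0.05],
    [0.30, 0.60, 0.10],
    [0.35, 0.55, 0.10],
]
decisions = (best_fit(outputs), argmax(normalized_hadamard(outputs)), voting(outputs))
```

In this example the variants disagree: one very confident classifier dominates best-fit and the Hadamard product (class 0), while voting follows the two less confident classifiers (class 1).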
Results reported on test data using SMOTE for stance classification
Table 5 shows that SMOTE produces better performance in minority classes. The most significant improvements are seen in the denying class, at the cost of worsening performance in the majority classes. The improvement produced by SMOTE does not outperform our proposal, which scores higher in all categories except the supporting class. Committee machines benefit from the use of SMOTE, improving their performance in the minority classes. However, their performance in the commenting class is low, below that of all the state-of-the-art methods shown in Table 4.
We study the performance of our proposal in several experimental settings. First, we explore our proposal without using the data augmentation technique. In this configuration, we test two variants, one with flat class weights, without penalizing one class over another, and another with class weights whose value is inversely proportional to the number of examples in each category. The second variant focuses more on minority classes, so it is expected that its results will exceed those obtained with a flat penalty. We study the performance of the committee machines in both scenarios. Our proposal is studied with two word embeddings. One is GloVe, and the other is BERT. With these, two baselines were included, one for each embedding without using the classifiers’ outputs. The results of this experimental setting are shown in Table 6.
Experimental results in stance classification using flat weights and class weights inversely proportional to class sizes. Both variants are evaluated without data augmentation. The results correspond to averages across five independent trials
Table 6 shows that CNN2 without class weights is very strong in accuracy. Its performance is due to the remarkable performance it gets in the commenting class. However, CNN2 performs poorly in the other classes, which evidences the presence of over-fitting. Neither the transformer nor the committee machines perform well with flat class weights. By incorporating class weights into the loss function, the transformer significantly improves its performance. In this configuration, our BERT-based proposal achieves the best results of these experiments in the minority classes. However, the cost paid by the method in the majority classes is high, which implies that its performance in F1 macro is low. The transformer performs better with BERT than with GloVe in both experimental settings. In terms of F1 macro, the best performing method is the voting committee, illustrating that the combination of classifiers’ outputs is beneficial in this problem. The performance of both text-based baselines is poor, mainly when class weights are not used.
Our data augmentation strategy is parameterized by the word sampling fraction, a parameter that we call masking. We test various masking settings, with values of 0.15, 0.5, and 0.85. Low masking values represent slight variations around the original sentences (low variability). High masking values produce more significant fluctuations in the sentences concerning the original sentences (high variability). We used our data augmentation technique, over-sampling minority classes until the balance between classes was achieved. The results of this experiment are shown in Table 7.
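The role of the masking parameter can be illustrated with a short sketch. This is a minimal sketch under assumptions: rounding the number of sampled words and always sampling at least one word are choices made here for illustration, the function names are hypothetical, and `replace` stands in for the word substitution operator.

```python
import random

def sample_positions(tokens, masking, rng):
    """Sample the word positions to be replaced, given the masking fraction."""
    count = max(1, round(masking * len(tokens)))
    return sorted(rng.sample(range(len(tokens)), count))

def augment(tokens, masking, replace, rng):
    """Generate a new sentence by replacing the sampled words."""
    positions = set(sample_positions(tokens, masking, rng))
    return [replace(tok) if i in positions else tok for i, tok in enumerate(tokens)]

rng = random.Random(0)
tokens = "the claim was denied by local police today".split()
# str.upper is a placeholder substitution that makes replaced words visible.
low = augment(tokens, 0.15, str.upper, rng)
high = augment(tokens, 0.85, str.upper, rng)
```

For this eight-word sentence, masking at 0.15 changes a single word, whereas masking at 0.85 changes seven of them, which is why high masking values yield sentences that fluctuate strongly around the original.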
Experimental results in stance classification using different settings for our data augmentation technique. The results correspond to averages across five independent trials
Table 7 shows that higher variability in our technique, produced by higher masking, allows methods to improve accuracy. Almost all models achieve improvements in the majority classes by using masking at 0.5 or 0.85. The best result in accuracy is achieved by our proposal when used with GloVe with masking at 0.85. Masking at 0.85 means a very high fluctuation of the generated text with respect to the original sentences. This variability benefits the behavior in the majority classes, allowing models to better separate these classes from the minority classes. However, in these configurations, no improvements are observed in the minority classes. All models suffer a deterioration in their performance in terms of F1 macro. Consequently, a higher variability in the data augmentation technique improves the separability of the majority classes but does not generate better performance in the minority ones. Our technique, in terms of F1 macro, gets its best results with masking at 0.15. Minor fluctuations in the texts generated from the minority classes allow for improvements in these classes without deteriorating the majority classes. The transformer based on BERT with masking at 0.15 achieves a good performance in minority classes, quite close to the optimum obtained using class weights without data augmentation, but with much less deterioration in the majority classes. The class that is most difficult for this configuration is the supporting class. Despite its poor performance in this class, the overall performance in terms of macro F1 is the best of all observed in the experiments.
To study the performance of our proposal in the harassment detection problem, we used the same experimental setting used to study stance classification. That is, we train five models of our proposal in independent trials on this dataset, using the validation partition for tuning during training. We evaluated each of these models in the testing partition once the training of the five trials ended. We report results averaged across the five models. We compared the results of our proposal with other methods published in the SIMAH competition of ECML/PKDD 2019, in which we obtained the first place. The comparison was made with a SMOTE-based data augmentation model [19], and with a model based on neural networks with attention layers [28]. We also compare our results with those obtained by Random Forest, Linear SVM, and Gaussian SVM. The best result in terms of F1 macro of our proposal was achieved using class weights without data augmentation on the transformer combined with BERT. The results of this experiment are shown in Table 8.
Results reported on test data in harassment detection. F-score results are shown per class and at macro level
Table 8 shows that Gaussian SVM performs very well in the non-harassment class, which is the majority class of this dataset. The cost of this performance is paid by the minority classes, among which indirect harassment stands out, in which Gaussian SVM fails to predict any example of the testing set. The result in physical harassment, another minority class, is also poor. The neural network method with attention layers [28] achieves the best performance in the indirect harassment class, but at the cost of substantially lowering its performance in the non-harassment majority class. As a result, its F1 macro is low. The SMOTE-based method [19] achieves a somewhat more even performance in the minority classes but at the cost of lower performance in the sexual harassment class. Our approach delivers good performance in the two majority classes, non-harassment and sexual harassment, and the best performance of this experiment in the physical harassment class. Accordingly, its performance in F1 macro is the best in this experiment.
We study the impact of SMOTE on this specific task. To do this, we train each baseline of our proposal using SMOTE for data augmentation. The augmented dataset achieved balance between the classes by oversampling the minority classes. Unlike the dataset used in stance classification, this dataset has a much more marked imbalance. Therefore, the effect that SMOTE produces on the dataset is much more noticeable. In this experiment, we evaluated the performance of committee machines, using the three variants studied in this work. The results are shown in Table 9.
Table 9 shows that SMOTE manages to improve the performance of these models in the minority classes. This effect is similar to that observed by Espinoza and Weiss [19], in which the improvement in the minority classes implies a deterioration in the majority classes. As a result of this deterioration, the results in F1 macro are low, all below the results obtained using our proposal. The results between baselines are quite uneven. While CNNs show better performance in the sexual harassment class, RNNs seem unable to learn in this scenario. Only RNN2 achieves competitive accuracy, due to its good performance in the majority class. On the other hand, none of the three committee machines performs well, obtaining low results in the sexual harassment class.
Results reported on test data using SMOTE for harassment detection
We study different variants of our proposal in this specific task. First, we examine the performance of our proposal without using data augmentation. Two options are evaluated for this purpose, one with flat class weights and the other with class weights inversely proportional to the class sizes. It should be noted that this dataset is more unbalanced than the one used for stance classification, which is why the effect of class weights is more noticeable. The imbalance between the majority and the minority classes in the training partition has a ratio of 1:70, showing an extremely severe class imbalance. The results of this experiment are shown in Table 10.
Table 10 shows that when using flat class weights, all configurations perform poorly in minority classes. The best result in the indirect harassment class is achieved by RNN2, with an F1 of only 0.159. Improvements in this class make F1 macro better. It is notable that this machine, with flat class weights, manages to learn (weakly) a minority class without using data augmentation. The examples of this class may have some characteristic that is well suited to sequential learning, for example, word sequences that are typically used in this type of case.
Another recurrent neural network, RNN1, achieves the best result in this class by using class weights, obtaining an F1 of 0.310, the best of all the experiments observed by us in this class. This same baseline achieves a good result in the physical harassment class, another minority class. As a result, this baseline is the best performer in F1 macro. This important finding indicates that a classic model, without attention, without data augmentation, and without a combination of machines, manages to outperform the other, more complex methods in these classes. A plausible explanation for this finding is that recurrent networks can learn word sequences that are very important to this class. With flat class weights, our method fails to outperform this baseline in this scenario of imbalance and overfits the majority classes, following a pattern very similar to that of the voting committee. This finding shows that in extreme imbalance scenarios, our proposal is not helpful. By using class weights, our proposal improves its performance in minority classes, being better in indirect harassment when using BERT, and having better results in physical harassment when using GloVe. This dependence on the word embedding used indicates that the performance in these classes strongly depends on particular linguistic characteristics.
Results reported on test data without data augmentation. The reported results correspond to averages across five independent trials. F-score results are shown per class and at macro level
We study our data augmentation technique in this problem with three different configurations in the masking parameter, with values at 0.15, 0.5, and 0.85. As this dataset has a much more marked imbalance of classes than in the case of stance classification, it is expected that the effect of the data augmentation will be more noticeable. Over-sampling of minority classes, which in this case are very unbalanced, stresses the data augmentation method to achieve class balance. This scenario allows evaluating this technique in extreme data imbalance scenarios. The results of these experiments are shown in Table 11.
Table 11 shows that a higher masking factor implies a deterioration in the methods’ performance, which is more evident in the majority classes. This fact makes the performance in accuracy low when using a higher masking factor. The effect produced by high maskings in minority classes is unclear, noting that there is no improvement in these classes in most configurations. The poor performance of our data augmentation technique for high masking factors is attributed to the extreme imbalance observed between the minority and majority classes. Our proposal with BERT obtains a performance very similar to that of the voting committee machine when using masking at 0.15, showing that in this scenario, the transformer approximates such a machine. The voting committee machine obtained the best result from this study in the non-harassment class, which, although it is the majority class, is at the same time the least interesting class in a harassment detection scenario. None of the machines perform well in minority classes, showing how difficult this problem is in these particular classes. The sexual harassment class is more approachable, and our proposal, combined with BERT, gets a good result, with F1 at 0.791, which is the highest of this experiment.
Results reported on test data applying data augmentation (balanced data). F-score results are shown per class and at macro level. The results correspond to the average obtained from 5 executions of each experiment
Limitations of this study
We have introduced a transformer-based method for multi-class text classification problems with strong class imbalance. Our contribution goes along two lines. First, starting from a set of baselines, the transformer combines the classifiers’ outputs in a sort of parameterized committee machine. The flexibility of the transformer allows these inputs to be easily mixed with text encodings. The proposal has been evaluated in two very complex tasks, and in both our method achieves the best performance. These results illustrate how powerful the transformer architecture is. The novelty of our proposal is to use the transformer to learn how to combine the results of other machines. Second, since the class imbalance is severe, we introduce a data augmentation technique. Our strategy is invariant to the syntax of the original examples. In addition, the strategy aims to maintain the semantics of the original texts using replacements in the space defined by the GloVe encoding. The results of the technique show that in scenarios with moderate imbalance, as in the case of stance classification, the method improves the results obtained without data augmentation. When the imbalance is more severe, as in the case of harassment detection, the technique is unable to overcome this difficulty. Although our proposal achieves the best results in both challenges, the results are far from satisfactory because these tasks are challenging. The complexity of these tasks lies in the low availability of data for the classes of interest, which are shown to be severely unbalanced. This is a reality in many contexts, such as social media analytics, the field to which the two tasks studied belong. A final thought that illustrates a limitation of this study is that, given the limited availability of examples, it is challenging even for a highly complex model to solve a machine learning task.
A data augmentation technique, such as the one introduced by us, requires a sufficient number of representative examples for each class. It cannot address the absence of observations in the classes of interest. The question about statistics and measurable conditions that allow determining when a sufficient amount of data is available to generalize in a class is a fundamental question for this area.
We have introduced a method based on the transformer that allows baselines to be combined. The flexibility of this architecture enables baselines to be combined with text encodings. The proposal, together with a new data augmentation technique, obtains good results in two multi-class text classification tasks with class imbalance. Despite the relative success that we show in this work, these tasks are far from solved.
This work can be extended along several lines. At an experimental level, studying our proposal in other problems, for example, in classic multi-class classification (e.g., 20 newsgroups), in which larger data volumes are managed and more classes are available, is an interesting scenario to evaluate its performance. The use of attention in conjunction with explanatory mechanisms can also help elucidate which characteristics of each class are most relevant during the training of our proposal. Studying this proposal in other classification tasks is another exciting line of work. Its use in the classification of images with severe imbalance, as in the case of medical images, could also be of interest for this type of architecture.
Acknowledgments
This research was possible due to the funding of Programa de Iniciación Científica PIIC-DGIP of Universidad Técnica Federico Santa María.
