Learning to combine classifiers outputs with the transformer for text classification

Abstract

Text classification is a well-explored task that has helped address a considerable number of problems. However, one of its main difficulties is learning from data with class imbalance, i.e., datasets with only a few examples in some classes, which often represent the most interesting cases for the task. In this context, text classifiers overfit some particular classes, showing poor performance. To address this problem, we propose a scheme that combines the outputs of different classifiers, coding them in the encoder of a transformer. By also feeding a BERT encoding of each example, the encoder learns a joint representation of the text and the outputs of the classifiers. These encodings are used to train a new text classifier. Since the transformer is a highly complex model, we introduce a data augmentation technique, which allows the representation learning task to be driven without over-fitting the encoding to a particular class. The data augmentation technique also allows for producing a balanced dataset. The combination of both methods, representation learning and data augmentation, improves the performance of the trained classifiers. Results in benchmark data for two text classification tasks (stance classification and online harassment detection) show that the proposed scheme outperforms all of its direct competitors.

Keywords

Unbalanced data; transformer; data augmentation; representation learning; text classification

1. Introduction

Text classification is a pervasive task in many real-world applications [44]. Text-based classifiers facilitate the analysis of massive text collections, helping to label content in classes of interest for researchers and practitioners. The rise of social media has driven the development of more and better text classification techniques, helping to address complex problems such as online harassment detection [10] and stance classification [39], among other relevant tasks. Many times, the data used to train text classifiers reflect natural imbalances in the classes of interest, with many examples in some categories and only a few examples in others. This imbalance arises because events of interest are often infrequent [27]. Imbalance leads to over-fitting, that is, an over-learning of frequent classes. Consequently, text classifiers achieve good performance in the majority classes but at the cost of obtaining poor performance in other classes.

Several strategies have been proposed for dealing with unbalanced classes. Techniques based on data resampling [13] have explored oversampling on minority classes and undersampling on majority classes to produce balanced datasets. The performance of data resampling combined with data augmentation techniques, such as SMOTE [11], has also been explored. These techniques have shown unconvincing results in text classification [29]. One possible cause of their limited success in text classification is that these techniques tend to discard highly relevant information, such as the sequential nature of the text and its semantics. Semantic-agnostic methods may produce more examples but with little meaning.

Some techniques with better results in text classification have sought to combine classifiers outputs, in a sort of encoding of each example at the meta-feature level. Bravo-Marquez et al. [7] showed that these strategies yield improvements in some text classification tasks, such as sentiment analysis. These strategies take inspiration from committee machines [30], a kind of classifier aggregation scheme that combines classifiers outputs to produce a boosted label imputation process. These techniques apply label-aggregation functions at the output, such as majority voting.

Despite the success shown by resampling techniques combined with data augmentation strategies such as SMOTE in related areas such as image processing [35], the results in text classification are still unconvincing. With the rise of deep neural network models, which increase the number of parameters, the need for big datasets with a balanced proportion of examples per class has increased. Deep learning models such as convolutional [32] or recurrent neural networks [25] have been applied to text classification, showing the same limitations as their predecessors [8]. Moreover, since these architectures require more data during the training phase, they present higher risks of over-fitting. The dominance of powerful architectures, such as the transformer [49], makes addressing these difficulties a priority.

To address this problem, we combine different strategies in a single coherent proposal. We introduce a framework for text classification that makes use of a transformer encoder to combine the outputs of several text classifiers, recoding each example of the training dataset. Our proposal also considers the BERT representation of each sample [17], a text embedding strategy that is also based on the transformer, obtaining a joint representation of the classifiers outputs and the text, producing a residual connection architecture for representation learning. The architecture learns a new encoding of the dataset, which is fed to a text classifier. To avoid over-fitting the most frequent classes, we introduce a data augmentation method that generates more examples in minority classes. The data augmentation technique is dependent on the semantics and syntax of the original cases. To produce the new examples, we resample the GloVe space [40], a text embedding space generated from web data, conditioned on the syntax of each text and the class to which each sample belongs. In this way, the new examples generated by resampling are semantically consistent with the classes from which they were created. The combination of both strategies, that is, representation learning based on classifiers outputs and data augmentation based on resampling of the GloVe space, allows training text classifiers which consistently obtain better results than their competitors.

To achieve these results, we performed the following tasks:

• Several text classification strategies are analyzed and compared with our solution based on the transformer model.
• We address the problem of the combination of classifiers outputs using the transformer, and we show how our proposal outperforms state-of-the-art solutions.
• We consider the problem of data imbalance, and we design a strategy to reduce the differences among the classes in terms of the number of examples; our data augmentation strategy can produce more examples in minority classes while preserving the semantics and syntax of the original cases.
• We study how to apply our proposal to two relevant tasks in social media, online harassment detection and stance classification; our experiments consistently show improvements over state-of-the-art solutions.

By addressing all the tasks mentioned above, we show how we can improve the performance of text classifiers in tasks relevant to social media content processing. The chosen approach considers the complexities that arise when performing this kind of task in these scenarios, namely, machine learning with few labeled text data and data imbalance.

This article extends and ties together, in a complete framework, the following contributions:

• How to apply self-attention to produce a new encoding of training examples using the encoder of the transformer (previously introduced in [8] and applied to stance classification).
• How to use the encoder of the transformer to produce a joint representation of the text and classifiers outputs of each training example (previously introduced in [9] and applied to online harassment detection).

Novel, unpublished contributions of this article are:

• A new method for data augmentation based on GloVe resampling; the method produces more examples in minority classes while preserving the semantics of the original cases.
• A complete explanation of the proposed combination of classifiers outputs, connecting it with the original proposal of the transformer.
• A new set of tests, which evaluate the proposed framework by combining our data augmentation technique with the representation learning stage based on the transformer encoder.

The article is organized as follows. After an overview of the current related work in Section 2, we discuss in Section 3 how to use the encoder of the transformer to learn a joint representation of classifiers outputs and text examples. After this, we introduce in Section 4 our data augmentation strategy based on GloVe resampling. Experiments validating our solution, conducted on benchmark datasets, are presented and discussed in Section 5. Finally, we conclude in Section 6, commenting on applications and extensions of our framework.

2. Related work

In this section, we review work related to our proposal. The section is divided into three parts. The first discusses text classification, the particularities of this problem, and its main challenges; descriptions of ensembles and data augmentation techniques for classifiers are included in this subsection. The next two subsections describe the related work on the specific tasks in which we evaluate our proposal. The works reviewed on stance classification and online harassment detection include text classification approaches based on deep learning architectures, against which we compare our proposal.

2.1 Text classification

Automatic text classification seeks to solve several problems related to the handling of large volumes of documentary data, such as the organization of collections, document partition for indexing systems, or content categorization [44]. The first automatic text classifiers worked on a representation of the content of each document based on features defined by experts. In all these representations, the features were the terms used in each document. Using rules of membership of keywords, the first automatic document categorizers were based on decision trees [18].

The vector representations of documents on the term-space used weight schemes based on, for example, the Tf-Idf model [41], initially proposed for information retrieval tasks. These vector representations made it possible to avoid over-representation of common terms by giving higher relevance to discriminative terms. Using support vector machines, it was shown that text classifiers could obtain good performance in benchmark datasets, close to the results obtained by human experts [26]. However, in multi-class classification schemes, automatic techniques generally showed over-fitting to some particular classes. To address this problem, some strategies based on committee machines were explored [30]. A first strategy based on bagging [31], that is, the training of several independent classifiers, showed good results in text classification based on label imputation by majority voting. However, there were no improvements in datasets with unbalanced classes. In this scenario, another ensemble strategy obtained better results; adaptive boosting (AdaBoost) [42] had the particularity of retraining classifiers using re-sampling, giving a higher probability to those documents in which the classifiers obtained worse results. In a kind of specialization strategy, AdaBoost showed improvements in some unbalanced datasets.

Many of the difficulties of text classifiers lie in the lack of labeled data to train machine learning algorithms, or in the imbalance of these datasets. The use of re-sampling strategies has been explored in classification, with emphasis on techniques of oversampling of minority classes or sub-sampling of majority classes [13]. The well-known SMOTE technique [11], which produces synthetically generated examples of minority classes by performing linear combinations between feature vectors, has shown good results in related areas such as image classification [35] but has unconvincing results in text classification. This limitation arises because SMOTE does not take into account the semantics and syntax of the original examples when generating new synthetic cases. We believe that a data augmentation technique that preserves the semantics of the original dataset can outperform the results achieved by its predecessors.
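To make this limitation concrete, the following minimal sketch shows the kind of linear interpolation between feature vectors on which SMOTE relies. The function name `smote_like_sample` and the toy vectors are illustrative assumptions; the complete method also selects the neighbor through a nearest-neighbor search among minority examples, which is omitted here.

```python
import random

def smote_like_sample(x, neighbor, rng=random):
    """Return a synthetic point on the segment between x and neighbor.

    Hypothetical helper for illustration only: real SMOTE first selects
    `neighbor` among the k nearest minority-class neighbors of `x`.
    """
    lam = rng.random()  # interpolation factor drawn uniformly from [0, 1)
    return [xi + lam * (ni - xi) for xi, ni in zip(x, neighbor)]

# Toy term-weight vectors for two minority-class documents (assumed data).
doc_a = [1.0, 0.0, 2.0]
doc_b = [0.0, 3.0, 2.0]
synthetic = smote_like_sample(doc_a, doc_b, random.Random(0))
```

The interpolated vector mixes the term weights of both documents but does not correspond to any actual word sequence, which illustrates why such synthetic examples may carry little meaning for text.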

The rise of deep learning, that is, the prevalence of learning methods based on deep neural network architectures, has modified the text classification scenario. A first new facet of text classification influenced by deep learning consists of the pervasive incorporation of word embeddings for the representation of documents. Word embeddings are dense vectors in a continuous, low-dimensional feature space. Word embeddings are trained on massive text collections, such as Wikipedia, to capture more semantic relationships between words, encoding them in the text representation space. Different NLP tasks have been evaluated using word embeddings, with BERT [17] being the dominant text encoding today. BERT embeddings are calculated by computing a language model through a transformer architecture [49]. The encodings that the model computes to solve this task are those that are finally extracted from the transformer to represent each word.

The performance of text classifiers has improved by incorporating word embeddings during the learning process. Recurrent neural networks [25] have been explored in tasks related to text classification. Generally, these networks ingest the text word by word, taking advantage of the fact that word embeddings are computed at this level of abstraction. In the case of convolutional networks [32], a representation layer is often jointly learned with the classification task. In the following subsections, we will show how these strategies have been used in the two tasks in which we will explore the performance of our proposal.

2.2 Stance classification

One of the first works that addressed automatic stance classification focused on the recognition of stances in on-line ideological debates [48]. In that work, the authors explored the utility of sentiment opinions, building an arguing lexicon from a manually annotated corpus. Using the entries of the lexicon as features, the authors used supervised learners for stance classification on four different on-line debate forums. Anand et al. [1] examined stance classification on a corpus of 4873 posts across 14 topics gathered from the website ConvinceMe.net. The authors showed that rebuttal posts were hard to classify using text features. They also demonstrated that methods that take into account the context of the post (in an on-line social network this refers to conversational threads) might be helpful for this task. Walker et al. [50] coded an extensive collection of posts, tagging the level of agreement between consecutive posts. The released corpus, named the Internet Argument Corpus (IAC), comprises posts manually sided for the topic of discussion extracted from 4forums.com, a site devoted to on-line debate. The authors concluded that rebuttal posts are hard to detect even for human annotators. The main reason for these difficulties is the extensive use of stylistic devices in rebuttals, such as sarcasm and irony. Faulkner [20] investigated stance classification at document-level, proposing a set of text-based features to capture the stance of students’ essays (answer) concerning an essay prompt (affirmation). Several machine learning methods were studied for stance classification in two-sided debates by Hasan et al. [23]. The authors concluded that there is no clear winning method, but sequence models such as Hidden Markov Models outperform their competitors in some cases.

The first work on stance classification for news comments was authored by Sobhani et al. [47]. The authors used topic modeling for argument tagging; the resulting tags are subsequently used for stance classification (agree/disagree) of news comments. Stance classification on Twitter was addressed by Lukasik et al. [36], who explored the use of temporal dependencies along sequences of tweets to improve the performance of a stance classifier based on Hawkes processes. The relation between stance and rumor veracity was studied in the PHEME project [15], where several resources were developed to tag misinformation, disinformation, rumors, and speculations. As part of this project, several posts were labeled according to the stance towards target information. In this line of research, Zubiaga et al. [56] studied how a tree-CRF classifier performed on stance classification modeling conversational threads in Twitter replies. Another contribution of this work is the construction of a fine-grained taxonomy for stance classification, extending the scenario from two classes (agree/disagree) to four categories (supporting/denying/questioning/commenting). There is a consensus that these four classes better represent the complexity of the task.

In recent years, deep learning architectures have become a dominant approach in stance classification. LSTM networks were applied for the first time to this task by Zubiaga et al. [55]. The authors showed that the use of sequential learning architectures is useful for this task. Bidirectional LSTM-based encodings of tweets were used for stance classification by Augenstein et al. [3] on the Twitter Stance Detection corpus, a dataset released by Mohammad et al. during the SemEval 2016 challenge [38]. Chen et al. [12] used convolutional neural networks (CNN) to jointly address stance classification and rumor detection on the corpus released for the SemEval 2017 challenge [16], while Lozano et al. [34] used an ensemble classification approach, achieving the first place in that competition. In the same competition, Bahuleyan et al. [6] used XGBoost, an additive and iterative tree-based supervised machine learning approach based on AdaBoost. XGBoost achieved second place in the competition. Recent efforts aim to jointly learn rumor detection and stance classification using two-layered gated recurrent unit (GRU) networks [37].

2.3 Online harassment detection

Several works have approached the detection of online harassment from a classic machine learning perspective [10, 14, 51]. These articles generally combine features extracted from messages with features retrieved from user profiles, using a feature-engineering strategy. Combining both sources of information, several of these methods train supervised learning algorithms like support vector machines or random forests. A limitation of many of these works is that they are sensitive to the imbalance of labeled data. In practice, many of these methods fail to generalize well to other datasets, which limits their use in real environments. A thorough review of these types of techniques is addressed in [43].

More sophisticated models, such as those studied in deep learning, have also been applied to the problem of hate speech detection. One of the advantages of deep learning architectures is that they allow the neural network to acquire an adequate representation of the problem. The use of text encoders has offered benefits to these types of models over conventional machine learning models. For example, convolutional networks [22] have shown good results in the Waseem and Hovy dataset [52]. Recurrent neural networks based on the GRU architecture have also demonstrated good results in this dataset [54]. Nearly perfect results in this dataset were also reported using deep learning by Badjatiya et al. [5]. Unfortunately, many of these models have overfitting problems, and therefore they are not transferable to production. Recently, Arango et al. [2] showed that there are also problems in the generation of the datasets considered as standard for the evaluation of this type of task. Among these problems, the most worrying is the population bias in the samples that make up the dataset. These works show that the hate speech detection problem is far from being solved.

A major problem of these datasets is the imbalance between classes. Hate speech detection must be carried out in scenarios where most conversations are neutral and harassment is exceptional. However, being exceptional does not make it less critical: the consequences of harassment and hate speech for social network users are severe. To address the problem of imbalance, the authors of [45] use techniques to augment and generate texts, producing training data with balanced classes. Along the same line, Sharifirad et al. [46] showed that a promising way to address the problem is to define a finer level of granularity for this task. Based on this latest work, the “SIMAH” challenge¹ defines a dataset with three types of harassment, which was addressed by our proposal, winning this competition in the 2019 edition of ECML/PKDD [9].

¹ https://ecmlpkdd2019.org/submissions/discovery/.

Far from having mature and robust solutions, this task still poses many challenges. For more details on all hate speech detection variants, the reader is referred to the survey by Fortuna and Nunes [21].

3. Combining classifiers outputs with the transformer

The techniques of ensembles of classifiers inspire our proposal. In these techniques, classifiers are trained to solve a specific task. A key aspect of classifier ensembles is to define a label consolidation strategy. For each example, each base classifier imputes a class; the consolidation phase consists of determining which of these labels to assign. The imputation can be resolved in many ways, among them one of the most popular is the strategy based on majority voting.
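As a point of reference, the majority voting consolidation described above can be sketched in a few lines. The function name `majority_vote`, the tie-breaking rule, and the example labels are assumptions made for illustration and are not taken from any particular ensemble library.

```python
from collections import Counter

def majority_vote(labels):
    """Return the label imputed by most base classifiers.

    Ties are broken in favor of the label that appears first in `labels`
    (an assumption of this sketch; other tie-breaking rules are possible).
    """
    counts = Counter(labels)
    best = max(counts.values())
    for label in labels:
        if counts[label] == best:
            return label

# Labels imputed by three hypothetical base classifiers for one example.
consolidated = majority_vote(["support", "deny", "support"])
```

This predefined function is the component that our proposal replaces with a learned representation.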

There are two families of ensemble strategies according to the type of training applied to the base classifiers. Bagging [30] consists of training several independent and decoupled classifiers. AdaBoost [42] trains classifiers in sequence, using the errors of earlier classifiers to define the sampling probability. Through a resampling strategy, AdaBoost learns new classifiers on the examples in which previous classifiers obtained worse results.

Our proposal is inspired by Bagging since it is built on independent and decoupled classifiers. An essential difference between our proposal and Bagging is in the consolidation phase of classifiers outputs. Instead of using a predefined function (for example, majority voting), we encode each sample using the labels imputed by the classifiers. Our strategy replaces the consolidation phase with one based on representation learning. To achieve this goal, we define a set of learnable parameters that are adjusted to learn an encoding of each example that consolidates both the representation of the text and the classifiers’ outputs. To determine this representation, we ask the encoder to solve the original classification problem by adding a softmax classifier to the output of the encoder.

A key element of our proposal is the type of encoder used for the representation learning phase. We use the transformer encoder [49], a widespread and pervasive deep-learning encoder-decoder architecture. The transformer has been used to learn text encodings, BERT [17] being the most famous of them. The key to BERT is in the use of the transformer architecture, which incorporates a self-attention mechanism. We take inspiration from BERT in the way it uses the transformer for representation learning. Like BERT, we use the transformer encoder to encode each example, taking advantage of the transformer’s self-attention mechanism. Unlike BERT, instead of computing a language model, we use the encoder to encode the classifiers’ outputs asking the transformer encoder to solve a classification task. This purpose is achieved by introducing a set of learnable parameters that helps to encode the outputs of the classifiers along with the encoding of the text of each example. As a result, our proposal learns a joint representation of the dataset that combines text and classifier outputs, in a sort of residual architecture that adds inputs and outputs in the same architecture layer.

Next, we will explain our proposal, detailing its connection with the transformer architecture and with BERT.

3.1 The transformer

The rationale of the transformer is based on the encoder-decoder architecture. Given an input sequence $(x_{1},\ldots,x_{n})$ , the encoder computes a representation (encoding) $(z_{1},\ldots,z_{n})$ that is fed into a decoder, producing an output sequence $(y_{1},\ldots,y_{m})$ . Typically, both encoder and decoder coincide in their architecture, and both components define a set of learnable parameters that are adjusted to solve a task. Usually, the task that the transformer solves is a sequence to sequence task (seq2seq), that is, a sequence is placed at the input and a target sequence at the output. The type of ingestion is auto-regressive, i.e., $y_{1},\ldots,y_{m-1}\rightarrow y_{m}$ , asking the transformer to learn how to generate the output sequence according to the input sequence. Different tasks can be defined in a seq2seq strategy. Without loss of generality, we will consider that the input sequence of the transformer consists of words. With this type of input, a common task is machine translation, setting input sequences in one language and output sequences in another. Another type of task is the computation of a language model, that is, learning to predict the next word in the input sequence, offsetting the input and output sequences by one symbol. This last type of task is the one that BERT uses to compute its text encodings.

The transformer architecture is defined using stacked blocks. This design allows the architecture to learn intermediate representations of the input sequence at different levels of abstraction. Usually, the transformer makes use of six stacked blocks, both in the encoder and in the decoder. Each of these blocks makes use of two essential elements for the transformer to work, residual connections [24] and layer normalization [4]. The residual connections of each block combine the encoding of the input with the output of the block, avoiding the problem of degradation that occurs in deep networks. Layer normalization centers the activations of each layer around their mean and scales them according to their variance. This correction helps to address gradient vanishing and gradient explosion by countering the internal covariate shift between different sequences.
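The combination of a residual connection and layer normalization can be sketched as follows. This is a minimal illustration under stated assumptions: the learned gain and bias of layer normalization are omitted, and `double` is a hypothetical stand-in for a real sub-layer such as self-attention or a feed-forward network.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and (approximately) unit variance.

    The learned gain and bias parameters are omitted in this sketch.
    """
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def residual_block(x, sublayer):
    """Compute LayerNorm(x + sublayer(x)), keeping the dimensionality of x."""
    y = sublayer(x)
    return layer_norm([a + b for a, b in zip(x, y)])

def double(x):
    """Hypothetical stand-in for a sub-layer (attention or feed-forward)."""
    return [2.0 * v for v in x]

out = residual_block([1.0, 2.0, 3.0], double)
```

Because the sub-layer output is added to its input before normalization, the block always returns a vector with the same number of components as the one it receives.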

The essential component of the transformer architecture is the self-attention block. The motivation to include this block is to allow the encoding of each example to be non-local, favoring the detection and coding of long-term dependencies in the input vectors. To achieve this goal, the transformer’s self-attention mechanism defines three sets of learnable parameters for the encoding of each training example; these are query, key, and value. These parameter matrices reduce the dimensionality of the original encoding of each case. To capture the long-term dependencies in the input, each word in the input is encoded by computing the dot product of its query vector and the key vector at each position. This product is scaled and fed to a softmax to calculate a word attention score. The last step of the mechanism consists of multiplying the attention scores by the value vectors, which yields an attention vector for the example. This step can be done on a matrix basis, for reasons of efficiency. Consequently, the self-attention layer is expressed as follows:

$\displaystyle\text{Attention}(Q,K,V)=\text{Softmax}\left(\frac{Q\cdot K^{T}}{\sqrt{d_{k}}}\right)\cdot V,$

where $d_{k}$ is a parameter for dot scaling and $Q$ , $K$ and $V$ are the learnable matrices of parameters for the query, key, and value abstractions.
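The attention formula can be illustrated with a small, dependency-free sketch. Here $Q$, $K$ and $V$ are passed directly as lists of row vectors (one row per position), $d_{k}$ is taken to be the length of each key vector, and the toy values are assumptions chosen only to make the computation easy to follow.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: Softmax(Q K^T / sqrt(d_k)) V.

    Q, K, V are lists of row vectors (one row per position).
    """
    d_k = len(K[0])
    output = []
    for q in Q:
        # Scaled dot product of this query with the key at every position.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)  # attention scores, summing to one
        # Weighted average of the value vectors.
        output.append([sum(w * v[c] for w, v in zip(weights, V))
                       for c in range(len(V[0]))])
    return output

# Toy input with two positions and d_k = 2 (assumed values).
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
context = attention(Q, K, V)
```

Each output row is a convex combination of the value vectors, so every position can draw information from any other position regardless of their distance in the sequence.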

The transformer defines $h$ different linear projections of the input to feed each of them in a different block of self-attention. This step defines $h$ blocks of self-attention, a mechanism known as multi-head attention. Multi-head attention allows encoding the input in different representation subspaces.
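A minimal sketch of multi-head attention under simplifying assumptions follows: the projection matrices are randomly initialized stand-ins for learned parameters, the helper names (`matmul`, `random_matrix`, `multi_head_attention`) are hypothetical, and the final output projection of the original architecture is omitted.

```python
import math
import random

def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention in matrix form."""
    d_k = len(K[0])
    K_t = [list(col) for col in zip(*K)]
    scores = matmul(Q, K_t)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V)

def random_matrix(rows, cols, rng):
    """Stand-in for a learned projection matrix (randomly initialized here)."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def multi_head_attention(X, heads):
    """Run one attention block per head and concatenate the results per position.

    `heads` is a list of (W_q, W_k, W_v) projection triples, one per head.
    """
    outputs = [attention(matmul(X, W_q), matmul(X, W_k), matmul(X, W_v))
               for W_q, W_k, W_v in heads]
    return [sum((head[i] for head in outputs), []) for i in range(len(X))]

rng = random.Random(0)
d_model, h = 4, 2
d_head = d_model // h
X = random_matrix(3, d_model, rng)  # three positions (assumed toy input)
heads = [tuple(random_matrix(d_model, d_head, rng) for _ in range(3))
         for _ in range(h)]
Z = multi_head_attention(X, heads)
```

With $h$ heads of size $d_{model}/h$, the concatenated output recovers the original dimensionality, while each head attends within its own representation subspace.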

The output of the self-attention block is connected to a residual connection layer with the input and to a normalization layer. A second block consists of a feed-forward layer, also mediated by a residual connection and a layer normalization. These blocks (self-attention $+$ feed-forward) are known as transformer layers. Typically, the transformer architecture stacks six layers of this type.

The transformer decoder is similar, but it includes a variant that allows computing a language model task with long-term dependencies with context both to the left and to the right of each target word. The modification is called the masked language model, and it consists of a block that randomly masks some words in the input and output in the self-attention layer. The rest of the transformer layer considers the same encoder blocks, that is, the self-attention and feed-forward layer mechanism, all of them combined with residual connections and layer normalization.

At the output of the decoder, a linear layer and a softmax are inserted, which compute scores on the vocabulary. The task with which the transformer typically trains is a language model (next word prediction). There are variants of the tasks with which the architecture is trained, which can be reviewed in the work of Vaswani et al. [49].

3.2 BERT

BERT [17] makes use of the transformer encoder to build its word embeddings. To produce the input sequence, BERT tokenizes the text using a sub-word tokenizer known as the word-piece model (WPM) [53]. BERT uses fixed-length sequences, for which it truncates the sequences produced by WPM. Then, the input sequence is encoded using a token embedding matrix. BERT combines the input with two additional encodings, a position embedding that encodes the position of each symbol in the input, and a sentence encoding. In addition to the language model task (next word prediction), BERT defines a second task called next sentence prediction, which consists of indicating whether the following sentence is the continuation of the predecessor sentence or not. To solve this task, the sequences of sentences intersperse random sentences in 50% of the cases. The BERT authors indicate that thanks to this second task, embeddings can encode characteristics of a higher level of abstraction. Therefore, the encoding of the input in BERT is given by:

$\displaystyle h_{i}^{0}=M_{i}\cdot W_{e}+W_{p}+W_{s},$

where $M_{i}$ is the input at position $i$ and $W_{e}$ , $W_{p}$ , and $W_{s}$ are the matrices of parameters of token, position, and sentence embeddings, respectively. BERT makes use of the transformer encoder, stacking several of these layers:

$\displaystyle h_{i}^{j}=\text{encoder}(h_{i}^{j-1}),\forall j\in\{1,\ldots,n\}.$

At the output, BERT connects a softmax, preceded by a fully connected layer with a GELU (Gaussian error linear unit) activation function. The softmax layer computes scores on the vocabulary:

$\displaystyle\text{Softmax}(h_{i}^{*}\cdot W_{e}^{T}+b_{e}),$

where $W_{e}$ and $b_{e}$ are the parameters of the softmax and $h_{i}^{*}$ is the concatenation of the encodings $[h_{i}^{j}]$ . Usually, BERT stacks 12 layers, and then $h_{i}^{*}$ corresponds to the concatenation of the last 4 encodings. To address the second task (next sentence prediction), BERT connects a second softmax to the output of the stacked layers, whose output scores two classes. After training, the $h_{i}^{*}$ vectors are used as word embeddings.
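The input encoding $h_{i}^{0}$ of this subsection can be illustrated with a short sketch. It assumes that $M_{i}$ is a one-hot vector, so that the product $M_{i}\cdot W_{e}$ reduces to a row lookup, and that the position and sentence embeddings are indexed by the position $i$ and by a sentence identifier, respectively. The function name `embed_input` and the toy embedding tables are hypothetical.

```python
def embed_input(token_ids, segment_ids, W_e, W_p, W_s):
    """Sum token, position, and sentence embeddings for each position.

    Multiplying a one-hot vector M_i by W_e selects row `token_ids[i]`
    of W_e, so the product is implemented as a row lookup.
    """
    encoded = []
    for i, (tok, seg) in enumerate(zip(token_ids, segment_ids)):
        encoded.append([e + p + s for e, p, s in zip(W_e[tok], W_p[i], W_s[seg])])
    return encoded

# Toy embedding tables with dimensionality 2 (assumed values).
W_e = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # vocabulary of three tokens
W_p = [[0.0, 1.0], [1.0, 0.0]]              # two positions
W_s = [[10.0, 10.0], [20.0, 20.0]]          # two sentence identifiers
h0 = embed_input([2, 0], [0, 1], W_e, W_p, W_s)
```

The three embeddings share the same dimensionality, which is what allows them to be added component by component before the first encoder layer.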

3.3 Combining classifiers outputs with the transformer

Our proposal makes use of the transformer encoder and BERT word embeddings in a scheme that brings together both strategies. The proposed strategy combines the outputs of a set of base classifiers, which are concatenated to the BERT encoding of each input. To achieve this goal, a matrix of learnable parameters is defined, transforming the output vectors of each base classifier to the dimensionality of the BERT encoding. To encode each sentence, we use the encodings that BERT provides at this level of granularity. This method is implemented in a web service named BERT-as-service,² a sentence encoding service based on BERT that maps a variable-length sentence to a fixed-length vector. The encoding of each sentence has 768 components. Accordingly, the projection matrix of classifiers outputs projects each output to a vector of 768 elements.

² https://github.com/hanxiao/bert-as-service.

Let $s_{i}$ be the $i$ -th sentence to encode. Let $h_{j}$ be the $j$ -th classifier in the set of base classifiers. We obtain the output $\vec{o}_{i,j}$ of each classifier by feeding the sentence in each classifier, after which we have the classifiers’ outputs:

$\displaystyle\vec{o}_{i,j}=h_{j}(s_{i}).$ (1)

A matrix $W_{e}$ of learnable parameters is defined, which projects the outputs of the classifiers to a space of the same dimensionality as that used by the encoding at the sentence level:

$\displaystyle\vec{x}_{i,j}=W_{e}\cdot\vec{o}_{i,j}$ (2)

where $\vec{x}_{i,j}$ is the encoding of the output learned during the recoding step. The sentence and the classifiers’ outputs are fed to the encoder of the transformer using a concatenation operator:

$\displaystyle\vec{x}_{i}=[\text{BERT}(s_{i}),\vec{x}_{i,1},\ldots,\vec{x}_{i,j},\ldots,\vec{x}_{i,n}],$ (3)

assuming that $n$ base classifiers are available. In this way, each example gets an encoding, denoted by $\vec{x}_{i}$ , which is ingested to the encoder of the transformer.

Each layer of the transformer maintains the dimensionality of the input, through a chain of blocks of the type LayerNorm (SubLayer $(\vec{x}_{i})+\vec{x}_{i}$ ), which considers residual connections and layer normalization at each encoding step. An invariant of the transformer architecture is the dimensionality of the encoding, which remains constant throughout the process. We take advantage of this dimensionality invariance to extract, at the encoder output, one encoding for each component of the input vector. This correspondence is supported not only by the invariance in dimensionality but also by the chain of residual connections that defines the encoder of the transformer. We denote the encoding of each input through the transformer encoder as $e_{i}^{j}$ , where $j$ ranges from 0 to $n$ , and therefore $e_{i}^{j}=\text{Encoding}(\vec{x}_{i,j})$ . Note that $e_{i}^{0}=\text{Encoding}(\text{BERT}(s_{i}))$ . Accordingly, the encoder output can be expressed as the concatenation of the input encodings $[e_{i}^{0},\ldots,e_{i}^{j},\ldots,e_{i}^{n}]$ . We combine these encodings into a single joint representation using the Schur product $\odot$ of the encodings. The Schur product (the element-wise product, also known as the Hadamard product) yields the vector that represents each example:

$\displaystyle\vec{h}_{i}=\odot_{j}(e_{i}^{j}),$ (4)

where $\vec{h}_{i}$ is the joint encoding of the sentence and its encoded classifiers outputs.
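Equations (2) to (4) can be illustrated with a minimal sketch in plain Python. It assumes that the concatenation in Eq. (3) is read as a sequence of $n+1$ positions and that a single matrix $W_{e}$ is shared by all base classifiers. The `encoder` argument is a hypothetical stand-in for the transformer encoder (here the identity), and the toy dimensionality is 3 instead of 768; none of these names come from the original implementation.

```python
from typing import Callable, List

Vector = List[float]


def project(W_e: List[Vector], o: Vector) -> Vector:
    """Eq. (2): x = W_e . o, with W_e of shape (d, k) and o of length k."""
    return [sum(w * v for w, v in zip(row, o)) for row in W_e]


def schur_product(encodings: List[Vector]) -> Vector:
    """Eq. (4): element-wise product of all encodings."""
    joint = [1.0] * len(encodings[0])
    for e in encodings:
        joint = [a * b for a, b in zip(joint, e)]
    return joint


def joint_representation(
    sentence_vec: Vector,
    outputs: List[Vector],
    W_e: List[Vector],
    encoder: Callable[[List[Vector]], List[Vector]],
) -> Vector:
    # Eq. (3): sequence [BERT(s_i), x_{i,1}, ..., x_{i,n}]
    tokens = [sentence_vec] + [project(W_e, o) for o in outputs]
    # The encoder must return one vector per input position (same dimension).
    encoded = encoder(tokens)
    return schur_product(encoded)


def identity_encoder(tokens: List[Vector]) -> List[Vector]:
    """Hypothetical placeholder for the trained transformer encoder."""
    return tokens


# Toy usage: d = 3, two base classifiers, two classes each.
W_e = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
h = joint_representation(
    [0.5, 2.0, 1.0], [[0.2, 0.8], [0.6, 0.4]], W_e, identity_encoder
)
```

Because the product is taken position by position, every encoding must keep the same dimensionality, which is the invariance the transformer encoder provides.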

The last part of our encoding strategy includes a softmax, which produces a classification in the dataset classes from the example encoding:

$\displaystyle p=\text{Softmax}(W_{o}\cdot\vec{h}_{i}+b_{o}),$ (5)

where $W_{o}$ and $b_{o}$ are the learnable parameters of the softmax layer. During training, we backpropagate the errors measured in the softmax layer to the learnable parameters of the encoder, so that the task that the encoder solves is to produce a joint representation of each example minimizing the loss function of the task. As a loss function, we use categorical cross-entropy.
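As an illustration of Eq. (5) and the loss, the following sketch computes the softmax output and the categorical cross-entropy for one example. The function names and the toy parameters are assumptions made for the example; in the actual model the parameters are learned jointly with the encoder.

```python
import math
from typing import List


def softmax(z: List[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def classify(W_o: List[List[float]], b_o: List[float], h: List[float]) -> List[float]:
    """Eq. (5): p = Softmax(W_o . h + b_o)."""
    logits = [sum(w * x for w, x in zip(row, h)) + b for row, b in zip(W_o, b_o)]
    return softmax(logits)


def cross_entropy(p: List[float], target: int) -> float:
    """Categorical cross-entropy for a single example with a one-hot target."""
    return -math.log(p[target])


# Toy usage: 4 classes, 3-dimensional joint encoding, untrained (zero) parameters.
W_o = [[0.0, 0.0, 0.0] for _ in range(4)]
b_o = [0.0] * 4
p = classify(W_o, b_o, [0.06, 0.64, 1.0])
loss = cross_entropy(p, target=1)
```

With zero parameters the prediction is uniform, so the loss equals $\log 4$; training lowers it by moving probability mass towards the correct class.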

4. A new data augmentation strategy based on word substitution on GloVe

The data augmentation strategy that we introduce is based on word substitution. Using the sentences from the original dataset, a word-level sampling process is conducted, replacing the original words with other words that preserve the semantics and syntax of the examples. The resampling is done at the sentence level, and the substitution is driven by relationships found between GloVe vectors. For this purpose, we define a word substitution operator, which exploits the linguistic regularities of the GloVe encodings. Linguistic regularities are one of the main findings about these encodings, allowing the discovery of relationships between related terms or excluded terms. We take inspiration from Levy and Goldberg’s work on linguistic regularities [33] to define a word substitution operator that is useful to our data augmentation strategy.

The proposed data augmentation strategy follows a rationale similar to that of SMOTE, in the sense that it uses resampling on minority classes to produce class balance. Our approach is also inspired by the type of training BERT uses [17]. BERT uses a word resampling rate in the masked language model task, which is crucial for word embeddings, considering the left and right contexts around each target word. Unlike BERT, which is generally used in other types of datasets (for example, Wikipedia), resampling of sentences in social media such as Twitter imposes the difficulty of working with shorter sentences.

The idea of our data augmentation strategy is to keep the semantics and syntax of the original examples safe when generating new sentences. For this, our approach takes two complementary actions. The first introduces syntactic invariance regarding the original sentence, constraining the type of words to be replaced. Syntactic invariance is fulfilled by constraining the substitution of a sampled word to another that has the same Part-of-Speech tag (POS-tag) in the generated sentence.

A key element of our data augmentation strategy is to keep the semantics of the original sentence safe relative to the generated sentence. To achieve this goal, we define a word substitution operator that makes use of the linguistic regularities of the GloVe vectors. Given a word to substitute, our operator considers the context words in the original sentences as words related to the target word. The related words correspond to the terms contiguous to the target word. The notion of an adjacent word also considers skip-grams; that is, words that are at a distance greater than one. The number of words in the context window is a parameter of the operator. Usually, the length of the context window is five words, including the target term. The operator also considers contrary related terms, called negative words in the work of Levy and Goldberg [33]. To determine negative words for the target word, the operator retrieves at random sentences from the other classes in the dataset in which the target word occurs. Negative words to the target word are terms that are used in the context of the target word in the sentences of the other classes and that do not belong to the list of positive words. Therefore, the substitution operator uses the list of positive and negative words to implement the most similar target word operation, using a proximity function.

4.1 Sampling words in sentences for data augmentation

Our data augmentation technique operates at the sentence level. The method samples sentences at random from the minority classes of the dataset until class balance is achieved. The method works with a word-level sampling fraction. Let $f$ be the sampling fraction of the algorithm and let $s_{i}$ be a sentence to sample. Let us assume that $s_{i}$ is an example belonging to a specific class in $L$ (the set of classes of the dataset), denoted as class $l$ . Word sampling is done at random, constrained to a set of candidate words to be substituted. The candidate words correspond to words that have a certain POS-tag in the sentence. Following this rule, we prevent the substitution technique from replacing words with little semantic content, such as conjunctions, prepositions, or determiners, among others. Our method focuses on replacing verb forms, adjectives, and nouns. Accordingly, the words to be sampled correspond to the subset of words that belong to any of these three categories. Let $W_{i}$ be the set of candidate words to be replaced in $s_{i}$ . The method picks words at random from $W_{i}$ according to the sampling fraction $f$ . Let $w_{i}$ be a word sampled from $W_{i}$ . We say that $w_{i}$ is a target word to replace in $s_{i}$ . We consider a context window $C_{i}=(w_{i-n},\ldots,w_{i},\ldots,w_{i+n})$ in $s_{i}$ , sometimes called surrounding text, around the target. Words in $C_{i}$ that are nouns, verb forms, or adjectives, including the target word $w_{i}$ , are considered related terms for the word substitution operator. We denote the set of related terms as $a$ .

The selection of terms contrary to the target is made by sampling at random a sentence in $L\setminus l$ (that is, in classes that do not correspond to the class of $s_{i}$ ) in which $w_{i}$ occurs. In that sentence, context words of $w_{i}$ like nouns, verbs or adjectives are used as negative related terms. We denote the set of negative related terms as $b$ .
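The sampling steps of this subsection can be sketched as follows. The sketch assumes that sentences are already tokenized and POS-tagged with a coarse tag set (`NOUN`, `VERB`, `ADJ`), that the sampling fraction $f$ is applied to the number of candidate words, and that at least one word is sampled when candidates exist. These choices, and all function names, are assumptions made for illustration.

```python
import random
from typing import List, Tuple

Tagged = List[Tuple[str, str]]  # (word, POS-tag) pairs
CONTENT_TAGS = {"NOUN", "VERB", "ADJ"}  # assumed coarse tag set


def candidate_positions(tagged: Tagged) -> List[int]:
    """Positions of W_i: nouns, verb forms, and adjectives."""
    return [i for i, (_, tag) in enumerate(tagged) if tag in CONTENT_TAGS]


def sample_targets(tagged: Tagged, f: float, rng: random.Random) -> List[int]:
    """Pick target positions at random according to the sampling fraction f."""
    cands = candidate_positions(tagged)
    if not cands:
        return []
    k = max(1, round(f * len(cands)))  # assumption: at least one target
    return sorted(rng.sample(cands, k))


def context_terms(tagged: Tagged, i: int, n: int = 2) -> List[str]:
    """Set a: content words in the window C_i (2n + 1 words, target included)."""
    lo, hi = max(0, i - n), min(len(tagged), i + n + 1)
    return [w for w, tag in tagged[lo:hi] if tag in CONTENT_TAGS]


def negative_terms(target: str, other_class: List[Tagged], positives: List[str],
                   rng: random.Random, n: int = 2) -> List[str]:
    """Set b: context words of the target in a random sentence of another class."""
    hits = [s for s in other_class if any(w == target for w, _ in s)]
    if not hits:
        return []
    s = rng.choice(hits)
    i = next(j for j, (w, _) in enumerate(s) if w == target)
    return [w for w in context_terms(s, i, n) if w not in positives]


rng = random.Random(0)
sentence = [("police", "NOUN"), ("confirm", "VERB"), ("the", "DET"),
            ("fake", "ADJ"), ("report", "NOUN"), ("today", "ADV")]
other = [[("nobody", "PRON"), ("trusts", "VERB"), ("this", "DET"),
          ("fake", "ADJ"), ("story", "NOUN")]]
targets = sample_targets(sentence, f=0.5, rng=rng)
a = context_terms(sentence, 3)               # target word: "fake"
b = negative_terms("fake", other, a, rng)
```

With `n = 2` the window contains five words including the target, which matches the usual window length mentioned in Section 4.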

4.2 Word substitution using linguistic regularities

The word substitution operator makes use of the linguistic regularities found in Levy and Goldberg’s work [33]. In that work, the authors used a set of related words and a set of negative words to implement a similarity search in the embedded space. We define an operator to replace the target word, using $a$ and $b$ as positively and negatively related terms to the target. The operator is implemented according to:

$\displaystyle\text{arg max}_{x\in V}\left(\text{sim}(\vec{x},\vec{a}-\vec{b})\right),$ (6)

where $x$ is a word in the vocabulary $V$ of the dataset, and $\vec{x}$ , $\vec{a}$ , and $\vec{b}$ are the word embeddings of $x$ and the words in the sets $a$ and $b$ , respectively. Usually, the vector representation of a set corresponds to the sum of the vectors of each of its elements, scaled by the number of elements in the set. Accordingly, $\vec{a}$ and $\vec{b}$ correspond to the scaled sum of the word embeddings of the words that belong to $a$ and $b$ , respectively. Since the embedded space is of relatively high dimensionality (200 dimensions in GloVe), the similarity function should work well in a space with these characteristics. Therefore, the similarity function used corresponds to the cosine similarity.

Note that the number of words in the sets $a$ and $b$ can vary. To avoid this difference affecting the calculation of Eq. (6), we introduce a scale factor proportional to the number of words in $a$ and $b$ . Let $n_{a}$ be the number of words in $a$ and let $n_{b}$ be the number of words in $b$ . Let $w_{i}$ be the target word to be replaced. The vector difference between the sets $a$ and $b$ is given by:

$\displaystyle\vec{a}-\vec{b}=\frac{1}{2(n_{a}+n_{b})}\cdot\left((n_{a}+n_{b})\cdot w_{i}+\sum_{w_{j}\in a}w_{j}-\sum_{w_{j}\in b}w_{j}\right).$ (7)

Note that Eq. (7) allows carrying out the replacement of $w_{i}$ for sets $a$ and $b$ with different numbers of elements. Equation (7) is undefined if $n_{a}+n_{b}=0$ , so the substitution is feasible only if there is at least one word in $a$ or $b$ .

Equation (6) gives us a list of related terms sorted by similarity. The word substitution method picks from this list the term with the highest similarity score that has the same POS-tag of $w_{i}$ , keeping safe the syntax of $s_{i}$ . Therefore, a new sentence is generated, replacing all the sampled words from $s_{i}$ with the words picked by the word substitution method.
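A minimal sketch of the substitution operator defined by Eqs. (6) and (7) is shown below. It uses toy two-dimensional vectors in place of the 200-dimensional GloVe embeddings, assumes a single POS-tag per vocabulary word, and excludes the target itself from the candidates; these simplifications and the function names are assumptions, not part of the original implementation.

```python
import math
from typing import Dict, List, Optional

Vector = List[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is null."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def query_vector(target: str, a: List[str], b: List[str],
                 emb: Dict[str, Vector]) -> Vector:
    """Eq. (7): scaled combination of the target, positive, and negative terms."""
    n = len(a) + len(b)
    if n == 0:
        raise ValueError("Eq. (7) is undefined when both a and b are empty")
    q = [n * x for x in emb[target]]
    for w in a:
        q = [x + y for x, y in zip(q, emb[w])]
    for w in b:
        q = [x - y for x, y in zip(q, emb[w])]
    return [x / (2 * n) for x in q]


def substitute(target: str, a: List[str], b: List[str],
               emb: Dict[str, Vector], pos: Dict[str, str]) -> Optional[str]:
    """Eq. (6) restricted to words with the same POS-tag as the target."""
    q = query_vector(target, a, b, emb)
    # Assumption: the target itself is not a valid replacement.
    candidates = [w for w in emb if w != target and pos[w] == pos[target]]
    if not candidates:
        return None
    return max(candidates, key=lambda w: cosine(emb[w], q))


# Toy vocabulary (hypothetical vectors, not real GloVe values).
emb = {"fake": [1.0, 0.0], "false": [0.9, 0.1], "true": [0.0, 1.0],
       "report": [0.5, 0.5], "story": [0.2, 0.9]}
pos = {"fake": "ADJ", "false": "ADJ", "true": "ADJ",
       "report": "NOUN", "story": "NOUN"}
replacement = substitute("fake", ["fake", "report"], ["story"], emb, pos)
```

In this toy setting the negative term pulls the query vector away from the second axis, so the operator picks "false" rather than "true" as the replacement for "fake".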

5. Experiments

To study the performance of our proposal, we worked on two tasks: stance classification and harassment detection. These two tasks correspond to multi-class classification problems. For both tasks, we work with datasets widely known in the area, labeled by experts and validated by the community. In each of these tasks, we test the performance of our proposal with and without data augmentation.³ In this way, we separate the effect produced by the classifier and the data augmentation technique. To make the comparison, we evaluated various related work methods, including methods considered state of the art for each of these tasks. In particular, we tested the performance of each of the baseline methods that the transformer uses. A classifier that only works on the text encoding (BERT or GloVe embeddings) of each sentence was also studied. In this way, we isolate the effect that the text representation has on the proposed combination of classifiers’ outputs. Among the methods used for the evaluation, committee machines are considered, which are implemented on the same baselines used by the transformer. In this way, we can compare the effect produced by our combination strategy and the strategies used by committee machines.

³ Git repository: https://github.com/Buguemar/Transformer_as_ensemble.

In this section, we first explain some essential characteristics of datasets and tasks in which we study our proposal. Then, we provide details of the preprocessing performed on the text, the metrics and validation strategies used, and the specific strategies used to train each method. Finally, the results are shown, and the main findings derived from the validation are discussed.

5.1 Tasks

5.1.1 Stance classification

We use a publicly available Twitter dataset named RumourEval, used on SemEval 2017 (Task 8) [16], consisting of two subtasks: (a) stance classification, and (b) veracity classification.

RumourEval data has already been annotated for veracity and stance following a published annotation scheme [56]. The labeling process was conducted as part of the PHEME project [15], where the relation between rumor veracity and stance was studied. This way, each tweet presents a stance related to a claim defined as follows:

•
Supporting (S): The tweet supports the veracity of the claim.
•
Denying (D): The tweet denies the veracity of the claim.
•
Questioning (Q): The tweet demands additional evidence.
•
Commenting (C): The tweet is related to the claim but it is not helpful to infer its veracity.

The dataset considers three partitions: training, validation, and testing. Training and validation partitions comprise 297 threads collected for eight events, which include 4,519 tweets in total. These events include popular breaking news such as the Charlie Hebdo shooting in Paris, the Ferguson unrest in the US, and the Germanwings plane crash in the French Alps. The testing partition includes 1,021 tweets in total. These include 20 threads extracted from the same events as the training set and eight threads from two other events. The distribution of tweets per partition is summarized in Table 1. All the tweets are in English.

Table 1
Distribution of tweets in training, validation, and testing partitions for RumourEval, subtask a

	Supporting	Denying	Questioning	Commenting
Training	841	333	330	2734
Validation	69	11	28	173
Testing	68	69	106	778
Total	978	413	464	3685

The task to be solved is to impute the stance of the tweet concerning the original claim. This task is a multi-class classification problem with four imbalanced classes, as indicated in Table 1. The majority classes (commenting and supporting) are probably the least interesting since they confirm or amplify the reach of the original claim. The minority classes are at the same time the most interesting since they question or deny the original claim. Tweets of this type usually add new information that allows questioning or denying the original claim. The imbalance ratio between the smallest class (denying) and the majority class (commenting) is almost 1 to 9, which shows a severe class imbalance in the dataset. The imbalance reflects what happens in social media, where a large volume of tweets tends to amplify the effect of a claim, and only a few provide new information, denying or questioning the validity of the original claim.

5.1.2 Harassment detection

One of the discovery challenges raised within the ECML/PKDD 2019 conference consisted of the automatic detection of online harassment in social media.⁴ The dataset provided for this challenge contains 10622 annotated tweets, split into training, validation, and testing partitions, as shown in Table 2. The competition has two related tasks: the first one is a binary classification (harassment or non-harassment tweet), and the second task is a multi-class classification of online harassment tweets into three categories: indirect harassment, sexual harassment, and physical harassment. Indirect harassment is understood as harassment not directed explicitly towards a person, but towards an ethnic group, community, or group with which a person can identify. Sexual harassment consists of direct person-to-person harassment with a sexual connotation. Finally, physical harassment is a type of harassment that reviles a person by highlighting a physical or mental characteristic.

⁴ https://ecmlpkdd2019.org/programme/discovery/.

Table 2

Distribution of tweets in training, validation, and testing partitions for SIMAH

	Non-Harassment	Indirect-Harassment	Sexual-Harassment	Physical-Harassment
Training	3661	55	2582	76
Validation	1493	71	525	36
Testing	1512	197	312	100
Total	6666	323	3419	212

Table 2 shows that the task corresponds to a multi-class classification problem. The challenge was introduced as two complementary tasks; one of them is a binary classification, in which the harassment class joins the three specific classes defined for the second task. We approach the task as a single classification problem with four imbalanced classes. The class imbalance is more severe than in the stance dataset. The majority class (non-harassment) corresponds to almost 2/3 of the total number of tweets in the dataset. The imbalance between the smallest class (physical harassment) and the majority class has a ratio of about 1 to 30. The number of tweets provided in the training partition for the minority classes is extremely low, with a ratio of 1:70 regarding the majority class. As in the stance dataset, the task is to impute the class at the tweet level.
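The imbalance figures quoted for both datasets can be checked directly from the class counts in Tables 1 and 2. The short sketch below is only an arithmetic illustration; the dictionary and function names are chosen for the example.

```python
from typing import Dict


def imbalance_ratio(counts: Dict[str, int]) -> float:
    """Size of the largest class divided by the size of the smallest one."""
    return max(counts.values()) / min(counts.values())


def majority_share(counts: Dict[str, int]) -> float:
    """Fraction of examples that belong to the largest class."""
    return max(counts.values()) / sum(counts.values())


# Class totals from Table 1 (stance) and Table 2 (harassment).
stance_total = {"supporting": 978, "denying": 413,
                "questioning": 464, "commenting": 3685}
harassment_total = {"non": 6666, "indirect": 323,
                    "sexual": 3419, "physical": 212}
# Training partition of Table 2.
harassment_train = {"non": 3661, "indirect": 55,
                    "sexual": 2582, "physical": 76}

stance_ratio = imbalance_ratio(stance_total)          # about 8.9
harassment_ratio = imbalance_ratio(harassment_total)  # about 31.4
train_ratio = imbalance_ratio(harassment_train)       # about 66.6
share = majority_share(harassment_total)              # about 0.63
```

The computed values are of the same order as the ratios stated in the text (almost 1 to 9, about 1 to 30, and roughly 1:70 in the training partition).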

5.2 Experimental setting

5.2.1 Text preprocessing

As social media data sources are unstructured and noisy, we require a careful preprocessing procedure before ingesting the data. Text preprocessing saves space and computational time during the learning stage. It also prevents the ingestion of noisy data, limiting the effect of artifacts during the learning process. Accordingly, the text normalization procedure applied to the tweet datasets removes punctuation marks and digits and transforms the text to lowercase. We also applied a stemmer in order to remove morphological affixes from words. To process jargon, we removed emojis, and then we applied the following rules:

•
HTML marks were replaced by the symbol <html>.
•
Hashtags ( $\#$ word) were replaced by the symbol <hashtag>.
•
Mentions (@word) were replaced by the symbol <user>.
•
Cardinals were replaced by the symbol <number>.
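The normalization rules listed above can be sketched with regular expressions as follows. The exact patterns and the order of the steps are assumptions (the placeholders are inserted first so that punctuation removal does not destroy them), and stemming is omitted because it would require an external library.

```python
import re

# Assumed patterns; the original implementation may differ.
HTML = re.compile(r"</?[a-z][^>]*>")
HASHTAG = re.compile(r"#\w+")
MENTION = re.compile(r"@\w+")
NUMBER = re.compile(r"\d+")
PLACEHOLDERS = {"<html>", "<hashtag>", "<user>", "<number>"}


def normalize(tweet: str) -> str:
    """Lowercase, insert placeholder symbols, drop punctuation, digits, emojis."""
    text = tweet.lower()
    text = HTML.sub(" <html> ", text)
    text = HASHTAG.sub(" <hashtag> ", text)
    text = MENTION.sub(" <user> ", text)
    text = NUMBER.sub(" <number> ", text)
    tokens = []
    for tok in text.split():
        if tok in PLACEHOLDERS:
            tokens.append(tok)
            continue
        # Keep only ASCII letters: removes punctuation, digits, and emojis.
        cleaned = re.sub(r"[^a-z]", "", tok)
        if cleaned:
            tokens.append(cleaned)
    return " ".join(tokens)


example = normalize("@bob Check THIS out!!! <b>3 people</b> were hurt #news")
```

Keeping only ASCII letters is a simplification that suits English tweets; other languages would need a broader character class.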

Once each tweet was preprocessed, we used GloVe to encode words. GloVe vectors were pre-trained on a Twitter corpus of two billion tweets (27 billion tokens) with a vocabulary of 1.2 million words [40]. These vectors were used in the base classifiers and were ingested one-at-a-time as a sequence of word vectors per tweet. We used word vectors with 200 dimensions in the baselines.

5.2.2 Model configuration

We used five baselines to test our method. Two of them are based on convolutional neural networks (CNNs) and the other three on recurrent neural networks (RNNs). CNN1 was implemented using one convolutional layer, and CNN2 with two layers. For both CNNs, we used ReLU activation functions. In the case of RNNs, we used GRU cells. For one RNN, we used a two-layered GRU, while the other RNNs were implemented using a three-layered GRU. For the five baselines, the output was produced using a softmax. As a loss function, we used focal loss. For the RNN2 baseline, we replaced the focal loss function with categorical cross-entropy. Focal loss was defined with gamma 2.0 and class weights inversely proportional to the amount of data instances per class in the dataset. In Table 3, we show the parameters of each architecture.

Table 3
Architecture of our baseline models

Model configuration
CNN1	CNN2	RNN1	RNN2	RNN3
BatchNorm	BatchNorm	GRU-128	GRU-256	GRU-256
conv5-128 (ReLU)	conv5-128 (ReLU)	dropout-0.45	dropout-0.65	dropout-0.65
max_pool3	max_pool3	BatchNorm	BatchNorm	BatchNorm
BatchNorm	BatchNorm	GRU-64	GRU-128	GRU-128
dropout-0.65	dropout-0.65	dropout-0.2	dropout-0.45	dropout-0.45
conv5-64 (ReLU)	conv5-64 (ReLU)	BatchNorm	BatchNorm	BatchNorm
max_pool3	max_pool3		GRU-64	GRU-64
BatchNorm	BatchNorm		dropout-0.35	dropout-0.35
dropout-0.65	dropout-0.65		BatchNorm	BatchNorm
F-100 (ReLU)	F-128 (ReLU)
BatchNorm	BatchNorm
	dropout-0.3
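The focal loss used for the baselines (gamma 2.0, with class weights inversely proportional to the number of instances per class) can be sketched as follows. The normalization of the weights, `total / (k * count)`, is an assumption: the text only states that the weights are inversely proportional to class size.

```python
import math
from typing import List


def inverse_frequency_weights(counts: List[int]) -> List[float]:
    """Class weights inversely proportional to class size (assumed scaling)."""
    total, k = sum(counts), len(counts)
    return [total / (k * c) for c in counts]


def focal_loss(probs: List[float], target: int, weights: List[float],
               gamma: float = 2.0) -> float:
    """Focal loss for one example: -w_t * (1 - p_t)^gamma * log(p_t)."""
    p = probs[target]
    return -weights[target] * (1.0 - p) ** gamma * math.log(p)


# Training counts of Table 1: supporting, denying, questioning, commenting.
weights = inverse_frequency_weights([841, 333, 330, 2734])

# A confident correct prediction contributes far less than an uncertain one.
easy = focal_loss([0.05, 0.90, 0.03, 0.02], target=1, weights=weights)
hard = focal_loss([0.60, 0.10, 0.20, 0.10], target=1, weights=weights)
```

With `gamma = 0` and unit weights the expression reduces to categorical cross-entropy, which is the loss used for the RNN2 baseline.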

Training settings vary depending on the specific scenario of classification. We indicate the setting for each learning scenario, either with the original data or using data augmentation.

Unbalanced case

The baselines used in stance detection were trained with 20 epochs for the CNNs and RNN2 models, while RNN1 and RNN3 used ten epochs. For harassment detection, the baselines were trained with 15 and 8 epochs, respectively. All the baselines were trained using mini-batch gradient descent with batches of 32 samples and the Adam optimizer for parameter updating. After preparing the baselines, we used the transformer encoder, as explained in Section 3, to learn a joint representation of each example and the classifiers’ outputs. To train the transformer, we fixed the number of epochs to 75 for stance detection and to 60 for harassment detection. The transformer used focal loss with class weights inversely proportional to each class size. The size of the hidden units for feed-forward layers was set to 128 with a dropout of 0.35, using 32 elements per mini-batch. For both tasks, the transformer was trained using Adam with a warmup of 6000 steps and a factor of 1. This setting corresponds to increasing the learning rate linearly for the first warmup training steps and decreasing it after that proportionally to the inverse square root of the step number. The number of stacked encoders was 2, with 768 or 200 dimensions depending on the encoding of the text (BERT-as-service or GloVe, respectively). We used multi-head attention with four attention heads.
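The warmup schedule described above can be sketched as a function of the training step. The sketch follows the schedule of the original transformer, including its scaling by the inverse square root of the model dimension; that scaling is an assumption here, since the text only specifies the linear warmup, the inverse square root decay, the warmup of 6000 steps, and the factor of 1.

```python
def learning_rate(step: int, d_model: int = 768, warmup: int = 6000,
                  factor: float = 1.0) -> float:
    """Linear warmup for `warmup` steps, then decay proportional to step^-0.5."""
    step = max(step, 1)  # avoid division by zero at step 0
    return factor * d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)


peak = learning_rate(6000)          # maximum, reached at the end of the warmup
halfway = learning_rate(3000)       # half of the peak (linear phase)
later = learning_rate(24000)        # half of the peak (inverse sqrt decay)
```

Under this schedule the rate doubles when the step doubles during the warmup, and afterwards it halves each time the step count is multiplied by four.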

Balanced case (Augmented data)

SemEval 2017 was augmented from 4238 tweets to 10956 tweets, while the SIMAH dataset was augmented from 6374 to 14644 tweets. With this augmentation, the number of training epochs was redefined to keep the number of iterations constant. Then, for a fixed mini-batch size, the number of epochs was defined by:

$\displaystyle e_{\textit{aug}}=e_{\textit{unb}}\frac{|D_{\textit{unb}}|}{|D_{\textit{aug}}|},$

where $|D_{\textit{unb}}|$ and $|D_{\textit{aug}}|$ are the amounts of training instances in the original and in the augmented datasets, and $e_{\textit{unb}}$ is the number of epochs used in the unbalanced scenario. Thus, the baselines in stance detection were trained with eight epochs for CNNs and RNN2 and with four epochs for RNN1 and RNN3. In harassment detection, we used seven and four epochs, respectively. Accordingly, the transformer for stance detection was trained with 30 epochs while in harassment detection, we used 26 epochs.
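As a worked check of this rescaling, the sketch below evaluates the ratio for the dataset sizes and epoch counts given above. It returns the unrounded value, since the text does not state how the result is converted to an integer; the reported epoch counts are nearby integers.

```python
def rescaled_epochs(e_unb: float, n_unb: int, n_aug: int) -> float:
    """Epochs that keep the number of iterations constant after augmentation."""
    return e_unb * n_unb / n_aug


# Stance classification: 4238 -> 10956 training tweets.
stance = {e: rescaled_epochs(e, 4238, 10956) for e in (20, 10, 75)}
# Harassment detection: 6374 -> 14644 training tweets.
harassment = {e: rescaled_epochs(e, 6374, 14644) for e in (15, 8, 60)}

for name, values in (("stance", stance), ("harassment", harassment)):
    for e_unb, e_aug in values.items():
        print(f"{name}: {e_unb} epochs -> {e_aug:.2f} epochs")
```

For instance, 20 epochs on the original stance data correspond to about 7.7 epochs on the augmented data, in line with the eight epochs reported for the CNNs and RNN2.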

For validation, we used accuracy and F1 score metrics. The latter metric is recommended for evaluating multi-class classification with class imbalance. The training was performed on the training partitions of each dataset, using the validation partition to adjust parameters during training. Each testing partition was reserved to evaluate the performance of classifiers after training. The results that we show in the next section correspond to results in the testing partitions.
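To make the difference between the two metrics concrete, the following sketch computes accuracy, per-class F1, and macro F1 from label lists. Assigning an F1 of 0 to a class with no true or predicted instances is a convention assumed here, and the toy labels are invented for the example.

```python
from typing import Dict, List


def accuracy(y_true: List[str], y_pred: List[str]) -> float:
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_per_class(y_true: List[str], y_pred: List[str],
                 labels: List[str]) -> Dict[str, float]:
    """F1 = 2TP / (2TP + FP + FN) for each class."""
    scores = {}
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[c] = 2 * tp / denom if denom else 0.0  # assumed convention
    return scores


def macro_f1(y_true: List[str], y_pred: List[str], labels: List[str]) -> float:
    """Unweighted mean of the per-class F1 scores."""
    scores = f1_per_class(y_true, y_pred, labels)
    return sum(scores.values()) / len(labels)


# A classifier that always predicts the majority class "C".
labels = ["S", "D", "Q", "C"]
y_true = ["C", "C", "C", "C", "C", "S", "D", "Q"]
y_pred = ["C"] * 8
acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred, labels)
```

In this toy case the accuracy is 0.625 while the macro F1 stays below 0.2, which illustrates why F1 is preferred under class imbalance.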

5.3 Results

In this section, we show the results per task in three different modalities, without data augmentation, using SMOTE, and using our data augmentation technique. In this way, we managed to separate the effect of the proposed method from the effect produced by data augmentation. In each table, the results are shown according to accuracy, F1 score at the macro level, and F1 score by class. This validation design allows us to distinguish the effects that occur globally from those that affect each class in particular. To evaluate the variability of our method produced by the data augmentation technique, we use the training and validation partitions to produce five models in five independent runs. The reported results correspond to the averages of these five models in the testing partition.

Table 4
Results reported on test data for stance classification. Results reported by the state of the art methods correspond to the testing partition’s best trial performance. When we evaluate our proposal using our best trial, the margin with which we surpass the other methods increases. For example, our performance in the best experiment in F1 macro with masking at 0.15 rises to 0.486. We prefer to report the average results even though the comparison does not benefit us

Method	Accuracy	F1 macro	Supporting	Denying	Questioning	Commenting
Voting Lozano et al. [34]	0.749	0.453	0.427	0.022	0.512	0.852
CNN Chen et al. [12]	0.392	0.407	0.195	0.114	0.507	0.813
XGBoost [6]	0.780	0.453	0.397	0.052	0.494	0.869
Two-layered GRU [37]	0.622	0.434	0.314	0.158	0.531	0.739
Random Forest	0.762	0.216	0.000	0.000	0.000	0.865
Linear SVM	0.629	0.309	0.089	0.164	0.200	0.782
Gaussian SVM	0.721	0.274	0.104	0.069	0.081	0.841
BERT	0.300	0.195	0.086	0.096	0.139	0.459
Transformer $+$ BERT	0.699	0.457	0.166	0.251	0.597	0.814

5.3.1 Stance classification

We provide a description of the experimental results in stance classification. Table 4 presents the performance of the studied methods on test data. Results achieved by state-of-the-art methods are shown first. We show the results achieved using the CNN proposed by Chen et al. [12], the voting method of Lozano et al. [34], and the XGBoost proposed by Bahuleyan et al. [6], which achieved the first places in the SemEval 2017 competition in this dataset. We also show the results achieved by Ma et al. using a two-layered GRU [37]. Furthermore, we show the results obtained using Random Forest, linear SVM, and Gaussian SVM. The performance achieved by our proposal is shown in the last row of the table. We also trained a variant of our proposal as a baseline, feeding the transformer with just the BERT vectors of the sentences, indicated as BERT. The variant of our proposal reported in Table 4 uses data augmentation with masking at 15%.

The results in Table 4 show that the performance of the state of the art methods is quite uneven. Lozano’s method [34] achieves the best result in the supporting class, and XGBoost [6] does the same in the commenting class. Both classes are the majority of the dataset, which is why the performance in accuracy that both methods exhibit is high. In terms of accuracy, XGBoost achieves the best result in this experiment. The results of these methods in minority classes are poor, with the denying class being the most difficult to predict.

Chen’s method [12] achieves a slightly more even performance between classes, at the cost of lowering performance in the supporting class. This result reduces its performance in accuracy and F1 score. Something interesting happens with the method of Ma et al. [37], which outperforms the three state-of-the-art methods in the questioning and denying classes. However, the method fails to beat them in F1 score. Random Forest, Linear SVM, and Gaussian SVM show over-fitting to the majority class. The transformer fed only with BERT sentence encodings achieves poor results, showing over-fitting to the commenting class. Our proposal is the one that achieves the best balance between classes in the experiment, at the cost of reducing performance in the supporting class. The proposal achieves state-of-the-art results in the denying and questioning classes, with significantly better performance than the rest of the evaluated methods. Given this improvement, our proposal achieves the best result in F1 score. Using data augmentation enables improvements in the questioning and denying classes, at the cost of decreasing performance in the supporting class. This result indicates that our data augmentation strategy managed to improve performance in the most challenging classes.

Data augmentation

To evaluate the proposed data augmentation technique, we performed a comparison with SMOTE. To do this, SMOTE was carried out on each baseline. Using the baseline models, we trained three variants of committee machines, using the output with the highest confidence between the baselines (best-fit), using the normalized Hadamard product between classifiers’ outputs (norm) and using a majority voting approach (voting). SMOTE was applied, achieving a balance between classes, over-sampling minority classes. The results of this experiment using SMOTE for data augmentation are shown in Table 5. Note that our proposal based on the transformer cannot directly use SMOTE since it requires the encoding of the text of each example. When producing new cases for data augmentation based on linear combinations of existing cases, SMOTE does not generate examples in the text space but in the continuous feature vector encoding space.
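The three committee machines can be sketched over the probability vectors produced by the baselines. The tie-breaking behavior and the function names are assumptions made for illustration; the toy outputs are chosen so that the three rules disagree.

```python
from collections import Counter
from typing import List

Probs = List[float]


def argmax(p: Probs) -> int:
    return p.index(max(p))


def best_fit(outputs: List[Probs]) -> int:
    """Class predicted by the classifier with the highest confidence."""
    return argmax(max(outputs, key=max))


def norm_product(outputs: List[Probs]) -> Probs:
    """Normalized Hadamard (element-wise) product of the classifiers' outputs."""
    prod = [1.0] * len(outputs[0])
    for o in outputs:
        prod = [a * b for a, b in zip(prod, o)]
    total = sum(prod)
    return [v / total for v in prod] if total else prod


def voting(outputs: List[Probs]) -> int:
    """Majority vote over each classifier's predicted class."""
    votes = Counter(argmax(o) for o in outputs)
    return votes.most_common(1)[0][0]  # ties: first class encountered


# Toy outputs of three baselines over three classes.
outputs = [[0.60, 0.05, 0.35],
           [0.10, 0.50, 0.40],
           [0.10, 0.50, 0.40]]
by_best_fit = best_fit(outputs)            # follows the most confident baseline
by_norm = argmax(norm_product(outputs))    # favors classes no baseline rejects
by_voting = voting(outputs)                # follows the majority of baselines
```

Here best-fit selects class 0, the normalized product selects class 2, and voting selects class 1, which shows how differently the three rules weigh confidence and agreement.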

Table 5
Results reported on test data using SMOTE for stance classification

Method	Accuracy	F1 macro	Supporting	Denying	Questioning	Commenting
CNN1	0.566	0.402	0.233	0.142	0.530	0.702
CNN2	0.567	0.401	0.235	0.148	0.521	0.700
RNN1	0.345	0.250	0.185	0.089	0.270	0.455
RNN2	0.349	0.254	0.094	0.134	0.319	0.469
RNN3	0.304	0.249	0.199	0.108	0.293	0.396
Best-fit committee	0.473	0.335	0.187	0.155	0.391	0.607
Norm committee	0.543	0.381	0.237	0.130	0.476	0.679
Voting committee	0.541	0.377	0.246	0.126	0.460	0.677

Table 5 shows that SMOTE produces better performance in minority classes. The most significant improvements are seen in the denying class, at the cost of worsening performance in the majority classes. The improvement produced by SMOTE does not outperform our proposal, which achieves higher performance in all categories except the supporting class. Committee machines benefit from the use of SMOTE, improving their performance in the minority classes. However, their performance in the commenting class is low, being below all the state-of-the-art methods shown in Table 4.

We study the performance of our proposal in several experimental settings. First, we explore our proposal without using the data augmentation technique. In this configuration, we test two variants, one with flat class weights, without penalizing one class over another, and another with class weights whose value is inversely proportional to the number of examples in each category. The second variant focuses more on minority classes, so it is expected that its results will exceed those obtained with a flat penalty. We study the performance of the committee machines in both scenarios. Our proposal is studied with two word embeddings. One is GloVe, and the other is BERT. With these, two baselines were included, one for each embedding without using the classifiers’ outputs. The results of this experimental setting are shown in Table 6.

Table 6

Experimental results in stance classification using flat weights and class weights inversely proportional to class sizes. Both variants are evaluated without data augmentation. The results correspond to averages across five independent trials

Method	Mask	Accuracy	F1 macro	Supporting	Denying	Questioning	Commenting
CNN1	0.0	0.770	0.357	0.029	0.040	0.495	0.865
CNN2	0.0	0.771*	0.309	0.011	0.000	0.356	0.868*
RNN1	0.0	0.730	0.371	0.125	0.028	0.493	0.840
RNN2	0.0	0.684	0.392	0.118	0.200	0.441	0.809
RNN3	0.0	0.751	0.399	0.159	0.010	0.575	0.850
Best-fit committee	0.0	0.703	0.398	0.118	0.187	0.465	0.822
Norm committee	0.0	0.752	0.380	0.064	0.118	0.484	0.855
Voting committee	0.0	0.767	0.365	0.058	0.045	0.494	0.864
BERT	0.0	0.758	0.221	0.021	0.000	0.000	0.863
Transformer + BERT	0.0	0.739	0.403	0.138	0.101	0.527	0.846
GloVe	0.0	0.762	0.216	0.000	0.000	0.000	0.865
Transformer + GloVe	0.0	0.769	0.348	0.050	0.010	0.467	0.866
CNN1	0.0 (cw)	0.653	0.421	0.086	0.224	0.601	0.773
CNN2	0.0 (cw)	0.627	0.420	0.077	0.237	0.626	0.742
RNN1	0.0 (cw)	0.452	0.348	0.133	0.209	0.487	0.562
RNN2	0.0 (cw)	0.611	0.392	0.134	0.232	0.453	0.749
RNN3	0.0 (cw)	0.445	0.358	0.158	0.219	0.513	0.540
Best-fit committee	0.0 (cw)	0.613	0.401	0.141	0.244	0.473	0.745
Norm committee	0.0 (cw)	0.672	0.431	0.120	0.246	0.567	0.792
Voting committee	0.0 (cw)	0.666	0.451	0.134	0.268	0.623	0.780
BERT	0.0 (cw)	0.081	0.098	0.109	0.107	0.173	0.002
Transformer + BERT	0.0 (cw)	0.557	0.443	0.189	0.273*	0.635*	0.675
GloVe	0.0 (cw)	0.086	0.102	0.121	0.113	0.173	0.000
Transformer + GloVe	0.0 (cw)	0.527	0.417	0.182	0.243	0.600	0.643

Table 6 shows that CNN2 without class weights obtains the highest accuracy (0.771). This result is due to its remarkable performance in the commenting class. However, CNN2 performs poorly in the other classes, which is evidence of over-fitting. Neither the transformer nor the committee machines perform well with flat class weights. By incorporating class weights into the loss function, the transformer significantly improves its performance in the minority classes. In this configuration, our BERT-based proposal achieves the best results of these experiments in the supporting, denying, and questioning classes. However, the cost paid by the method in the majority classes is high, which keeps its F1 macro (0.443) below that of the voting committee (0.451). In terms of F1 macro, the transformer performs better with BERT than with GloVe in both experimental settings. The best performing method in F1 macro is the voting committee with class weights, illustrating that the combination of classifiers’ outputs is beneficial in this problem. Both text-only baselines perform poorly: without class weights they fail almost completely in the supporting, denying, and questioning classes, and with class weights their accuracy drops below 0.1.

Our data augmentation strategy is parameterized by the word sampling fraction, a parameter that we call masking. We test three masking settings: 0.15, 0.5, and 0.85. Low masking values produce slight variations around the original sentences (low variability). High masking values produce larger departures from the original sentences (high variability). We applied our data augmentation technique by over-sampling the minority classes until the classes were balanced. The results of this experiment are shown in Table 7.
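
To make the role of the masking parameter concrete, the following sketch replaces a sampled fraction of the words of a sentence and over-samples a minority class until a target size is reached. It is only an illustration under stated assumptions: the `NEIGHBORS` table is a hypothetical stand-in for nearest-neighbor lookups in an embedding space, and the function names and the rounding rule are chosen here, not taken from the original implementation.

```python
import random

# Hypothetical stand-in for nearest neighbours in an embedding space.
NEIGHBORS = {
    "good": ["great", "fine"],
    "bad": ["poor", "awful"],
    "movie": ["film"],
}


def augment(tokens, masking, rng):
    """Replace roughly a `masking` fraction of the tokens with a neighbour."""
    n_replace = round(masking * len(tokens))
    positions = rng.sample(range(len(tokens)), n_replace)
    out = list(tokens)
    for i in positions:
        candidates = NEIGHBORS.get(out[i])
        if candidates:  # words without known neighbours are kept unchanged
            out[i] = rng.choice(candidates)
    return out


def oversample(examples, target_size, masking, seed=0):
    """Add augmented copies of `examples` until `target_size` is reached."""
    rng = random.Random(seed)
    augmented = list(examples)
    while len(augmented) < target_size:
        source = rng.choice(examples)
        augmented.append(augment(source, masking, rng))
    return augmented


# Toy minority class with two tokenized sentences, grown to six examples.
minority = [["good", "movie"], ["bad", "movie", "really"]]
balanced = oversample(minority, target_size=6, masking=0.5)
for sentence in balanced:
    print(" ".join(sentence))
```

In this sketch a low `masking` value changes few words per generated sentence, while a value close to 1 replaces almost every word that has a known neighbor.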

Table 7

Experimental results in stance classification using different settings for our data augmentation technique. The results correspond to averages across five independent trials

Method	Mask	Accuracy	F1 macro	Supporting	Denying	Questioning	Commenting
CNN1	0.15	0.696	0.440	0.097	0.246	0.609	0.810
CNN2	0.15	0.709	0.441	0.101	0.234	0.610	0.820
RNN1	0.15	0.595	0.414	0.201*	0.217	0.511	0.726
RNN2	0.15	0.655	0.399	0.101	0.216	0.495	0.786
RNN3	0.15	0.584	0.409	0.165	0.240	0.521	0.709
Best-fit committee	0.15	0.665	0.412	0.105	0.222	0.531	0.792
Norm committee	0.15	0.707	0.437	0.123	0.249	0.553	0.822
Voting committee	0.15	0.706	0.453	0.124	0.268	0.601	0.818
BERT	0.15	0.300	0.195	0.086	0.096	0.139	0.459
Transformer + BERT	0.15	0.699	0.457*	0.166	0.251	0.597	0.814
GloVe	0.15	0.390	0.215	0.095	0.096	0.105	0.565
Transformer + GloVe	0.15	0.685	0.452	0.142	0.258	0.608	0.799
CNN1	0.5	0.713	0.424	0.115	0.214	0.543	0.825
CNN2	0.5	0.698	0.428	0.107	0.213	0.580	0.813
RNN1	0.5	0.621	0.390	0.163	0.190	0.457	0.751
RNN2	0.5	0.632	0.387	0.113	0.196	0.476	0.764
RNN3	0.5	0.654	0.366	0.140	0.204	0.333	0.787
Best-fit committee	0.5	0.644	0.405	0.125	0.219	0.503	0.773
Norm committee	0.5	0.697	0.417	0.119	0.231	0.502	0.816
Voting committee	0.5	0.706	0.433	0.135	0.232	0.542	0.822
BERT	0.5	0.302	0.202	0.112	0.085	0.154	0.455
Transformer + BERT	0.5	0.705	0.438	0.141	0.243	0.549	0.820
GloVe	0.5	0.350	0.210	0.093	0.095	0.135	0.519
Transformer + GloVe	0.5	0.712	0.443	0.149	0.234	0.563	0.825
CNN1	0.85	0.736	0.408	0.091	0.225	0.474	0.844
CNN2	0.85	0.743	0.405	0.058	0.185	0.528	0.849
RNN1	0.85	0.712	0.396	0.089	0.180	0.487	0.827
RNN2	0.85	0.648	0.363	0.087	0.226	0.363	0.775
RNN3	0.85	0.714	0.361	0.076	0.144	0.396	0.830
Best-fit committee	0.85	0.673	0.374	0.077	0.220	0.403	0.797
Norm committee	0.85	0.733	0.400	0.077	0.235	0.446	0.842
Voting committee	0.85	0.740	0.407	0.081	0.221	0.480	0.847
BERT	0.85	0.296	0.202	0.117	0.113	0.132	0.446
Transformer + BERT	0.85	0.712	0.428	0.115	0.229	0.541	0.825
GloVe	0.85	0.347	0.218	0.112	0.104	0.142	0.513
Transformer + GloVe	0.85	0.744	0.414	0.096	0.203	0.508	0.849

Table 7 shows that higher variability in our technique, produced by higher masking, allows methods to improve accuracy. Almost all models improve in the majority classes when masking is set to 0.5 or 0.85. The best accuracy (0.744) is achieved by our proposal with GloVe and masking at 0.85. Masking at 0.85 means that the generated text departs strongly from the original sentences. This variability benefits the behavior in the majority classes, allowing models to better separate these classes from the minority classes. However, in these configurations, no improvements are observed in the minority classes, and all models except the two text-only baselines lose F1 macro with respect to masking at 0.15. Consequently, a higher variability in the data augmentation technique improves the separability of the majority classes but does not yield better performance in the minority ones. In terms of F1 macro, our technique gets its best results with masking at 0.15. Minor fluctuations in the texts generated from the minority classes allow for improvements in these classes without deteriorating the majority classes. The transformer based on BERT with masking at 0.15 achieves good performance in the minority classes, quite close to the best values obtained using class weights without data augmentation, but with much less deterioration in the majority classes. The most difficult class for this configuration is the supporting class. Despite its poor performance in this class, its F1 macro (0.457) is the best observed in these experiments.

5.3.2 Harassment detection

To study the performance of our proposal in the harassment detection problem, we used the same experimental setting as in stance classification. That is, we trained five models of our proposal on this dataset, using the validation partition for tuning during training. Once the five training runs ended, we evaluated each model on the testing partition, and we report results averaged across the five models. We compared the results of our proposal with other methods published in the SIMAH competition of ECML/PKDD 2019, in which we obtained first place. The comparison was made with a SMOTE-based data augmentation model [19] and with a model based on neural networks with attention layers [28]. We also compare our results with those obtained by Random Forest, Linear SVM, and Gaussian SVM. The best F1 macro of our proposal was achieved by the transformer combined with BERT, using class weights and no data augmentation. The results of this experiment are shown in Table 8.

Table 8

Results reported on test data in harassment detection. F-score results are shown per class and at macro level

Method	Accuracy	F1 macro	Non-Harassment	Indirect-Harassment	Physical-Harassment	Sexual-Harassment
MPA RNN [28]	–	0.471	0.687	0.356	0.127	0.714
SMOTE [19]	0.750	0.460	0.870	0.240	0.120	0.600
Random Forest	0.754	0.215	0.860	0.000	0.000	0.000
Linear SVM	0.804	0.439	0.886	0.070	0.120	0.678
Gaussian SVM	0.858	0.435	0.920	0.000	0.050	0.770
Transformer $+$ BERT	0.808	0.516	0.890	0.256	0.129	0.790

Table 8 shows that Gaussian SVM performs very well in the non-harassment class, which is the majority class of this dataset. The cost of this performance is paid by the minority classes. Indirect harassment stands out: Gaussian SVM fails to correctly predict any example of this class in the testing set. Its result in physical harassment, another minority class, is also deficient. The neural network method with attention layers [28] achieves the best performance in the indirect harassment class, but at the cost of substantially lowering its performance in the non-harassment majority class, which limits its F1 macro. The SMOTE-based method [19] achieves a somewhat more even performance in the minority classes, but at the cost of lower performance in the sexual harassment class. Our approach delivers good performance in the two majority classes, non-harassment and sexual harassment, and the best performance of this experiment in the physical harassment class. Accordingly, its F1 macro is the best in this experiment.
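
The F1 macro column of Table 8 is consistent with the unweighted mean of the four per-class F1 scores; for Transformer + BERT, (0.890 + 0.256 + 0.129 + 0.790) / 4 ≈ 0.516. The short sketch below repeats this check for three rows of the table. The helper name `macro_f1` is introduced here for illustration only.

```python
def macro_f1(per_class_f1):
    """Unweighted mean of the per-class F1 scores."""
    return sum(per_class_f1) / len(per_class_f1)


# Per-class F1 values copied from Table 8, in the order
# non-, indirect-, physical-, sexual-harassment.
table8 = {
    "Transformer + BERT": [0.890, 0.256, 0.129, 0.790],
    "Gaussian SVM": [0.920, 0.000, 0.050, 0.770],
    "Random Forest": [0.860, 0.000, 0.000, 0.000],
}

for name, scores in table8.items():
    print(f"{name}: {macro_f1(scores):.3f}")
```

Because every class counts equally in this average, a method that ignores the minority classes is penalized even when its accuracy is high, as the Random Forest and Gaussian SVM rows illustrate.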

We study the impact of SMOTE on this specific task. To do this, we train each baseline of our proposal using SMOTE for data augmentation. The augmented dataset achieved balance between the classes by oversampling the minority classes. Unlike the dataset used in stance classification, this dataset has a much more marked imbalance. Therefore, the effect that SMOTE produces on the dataset is much more noticeable. In this experiment, we evaluated the performance of committee machines, using the three variants studied in this work. The results are shown in Table 9.
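
For reference, the core step of SMOTE [11] builds a synthetic minority example by interpolating between a minority example and one of its nearest minority neighbors. The sketch below assumes that each example is already encoded as a fixed-length numeric vector; the helper names, the toy vectors, and the choice of k are illustrative only and say nothing about how the baselines were actually configured.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def smote_sample(minority, k, rng):
    """Create one synthetic vector from a list of minority-class vectors."""
    base = rng.choice(minority)
    # k nearest minority neighbours of the base vector (excluding itself)
    others = [v for v in minority if v is not base]
    neighbours = sorted(others, key=lambda v: squared_distance(base, v))[:k]
    neighbour = rng.choice(neighbours)
    lam = rng.random()  # interpolation factor in [0, 1)
    return [x + lam * (y - x) for x, y in zip(base, neighbour)]


rng = random.Random(0)
# Toy minority class: the four corners of the unit square.
minority_vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
synthetic = [smote_sample(minority_vectors, k=2, rng=rng) for _ in range(5)]
for point in synthetic:
    print([round(c, 3) for c in point])
```

Each synthetic point lies on the segment between two existing minority points, so over-sampling in this way adds examples without leaving the region already covered by the minority class.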

Table 9 shows that SMOTE improves the performance of these models in the minority classes. This effect is similar to that observed by Espinoza and Weiss [19], in which the improvement in the minority classes comes with a deterioration in the majority classes. As a result of this deterioration, the F1 macro results are modest, all below the result obtained with our proposal. The results across baselines are quite uneven. While CNNs show better performance in the sexual harassment class, RNNs seem unable to learn in this scenario. Among the RNNs, only RNN2 comes close to a competitive accuracy, due to its better performance in the majority class. On the other hand, none of the three committee machines performs well, obtaining low results in the sexual harassment class.

Table 9

Results reported on test data using SMOTE for harassment detection

Method	Accuracy	F1 macro	Non-Harassment	Indirect-Harassment	Physical-Harassment	Sexual-Harassment
CNN1	0.770	0.449	0.869	0.137	0.149	0.639
CNN2	0.745	0.455	0.853	0.154	0.180	0.632
RNN1	0.558	0.359	0.709	0.147	0.149	0.429
RNN2	0.686	0.381	0.809	0.123	0.120	0.472
RNN3	0.589	0.352	0.735	0.132	0.120	0.421
Best-fit committee	0.753	0.438	0.857	0.139	0.190	0.568
Norm committee	0.769	0.427	0.867	0.109	0.148	0.584
Voting committee	0.771	0.434	0.868	0.114	0.161	0.594

We study different variants of our proposal in this specific task. First, we examine the performance of our proposal without data augmentation. Two options are evaluated for this purpose, one with flat class weights and the other with class weights inversely proportional to the class sizes. This dataset is more unbalanced than the one used for stance classification, which is why the effect of class weights is more noticeable. The ratio between the minority and the majority classes in the training partition is 1:70, an extremely severe class imbalance. The results of this experiment are shown in Table 10.

Table 10 shows that, with flat class weights, all configurations perform poorly in the minority classes. The best result in the indirect harassment class is achieved by RNN2, with an F1 of only 0.159. This improvement also gives RNN2 the best F1 macro among the flat-weight configurations. It is notable that this model, with flat class weights, manages to learn (weakly) a minority class without data augmentation. The examples of this class may have characteristics that are well suited to sequential learning, for example, word sequences that are typical of this type of harassment.

Another recurrent neural network, RNN1, achieves the best result in this class when class weights are used, with an F1 of 0.310, the best we observed for this class in all experiments. This baseline also achieves a good result in the physical harassment class, another minority class. As a result, it is the best performer in F1 macro. This important finding indicates that a classic model, without attention, data augmentation, or a combination of classifiers, outperforms the other, more complex methods in these classes. A plausible explanation is that recurrent networks can learn word sequences that are very important to this class. Our method, in this scenario of imbalance, fails to surpass this baseline and overfits the majority classes, following a pattern very similar to that of the voting committee. This finding shows that, in extreme imbalance scenarios, our proposal does not improve on this baseline. By using class weights, our proposal improves its performance in the minority classes, performing better in indirect harassment with BERT and in physical harassment with GloVe. This dependence on the word embedding indicates that performance in these classes strongly depends on particular linguistic characteristics.

Table 10

Results reported on test data without data augmentation. The reported results correspond to averages across five independent trials. F-score results are shown per class and at macro level

Method	Mask	Accuracy	F1 macro	Non-Harassment	Indirect-Harassment	Physical-Harassment	Sexual-Harassment
CNN1	0.0	0.851	0.412	0.915	0.000	0.000	0.732
CNN2	0.0	0.858	0.421	0.919	0.000	0.000	0.765
RNN1	0.0	0.854	0.427	0.918	0.011	0.018	0.759
RNN2	0.0	0.826	0.453	0.901	0.159	0.048	0.703
RNN3	0.0	0.851	0.441	0.917	0.000	0.084	0.762
Best-fit committee	0.0	0.851	0.434	0.916	0.038	0.030	0.750
Norm committee	0.0	0.862	0.426	0.923	0.000	0.000	0.782
Voting committee	0.0	0.865*	0.429	0.925	0.000	0.000	0.793
BERT	0.0	0.725	0.229	0.841	0.000	0.000	0.076
Transformer + BERT	0.0	0.860	0.425	0.920	0.000	0.005	0.773
GloVe	0.0	0.751	0.216	0.858	0.000	0.000	0.006
Transformer + GloVe	0.0	0.865*	0.430	0.924	0.000	0.000	0.795*
CNN1	0.0 (cw)	0.834	0.411	0.906	0.032	0.060	0.648
CNN2	0.0 (cw)	0.818	0.411	0.898	0.132	0.060	0.553
RNN1	0.0 (cw)	0.776	0.529*	0.894	0.310*	0.280*	0.633
RNN2	0.0 (cw)	0.831	0.423	0.906	0.047	0.063	0.678
RNN3	0.0 (cw)	0.637	0.473	0.766	0.202	0.248	0.676
Best-fit committee	0.0 (cw)	0.846	0.490	0.915	0.166	0.160	0.719
Norm committee	0.0 (cw)	0.864	0.482	0.924	0.088	0.137	0.780
Voting committee	0.0 (cw)	0.862	0.502	0.923	0.147	0.173	0.765
BERT	0.0 (cw)	0.124	0.120	0.142	0.088	0.074	0.177
Transformer + BERT	0.0 (cw)	0.808	0.516	0.890	0.256	0.129	0.790
GloVe	0.0 (cw)	0.069	0.073	0.038	0.095	0.061	0.097
Transformer + GloVe	0.0 (cw)	0.741	0.496	0.824	0.214	0.207	0.739

We study our data augmentation technique in this problem with three configurations of the masking parameter: 0.15, 0.5, and 0.85. As this dataset has a much more marked class imbalance than the stance classification dataset, the effect of the data augmentation is expected to be more noticeable. Because the minority classes are very small, achieving class balance requires heavy over-sampling, which stresses the data augmentation method. This scenario allows us to evaluate the technique under extreme data imbalance. The results of these experiments are shown in Table 11.

Table 11 shows that a higher masking factor implies a deterioration in the methods’ performance, which is more evident in the majority classes. As a consequence, accuracy is lower when a higher masking factor is used. The effect of high masking values in the minority classes is unclear, since most configurations show no improvement in these classes. We attribute the poor performance of our data augmentation technique for high masking factors to the extreme imbalance between the minority and majority classes. Our proposal with BERT performs very similarly to the voting committee machine when using masking at 0.15, showing that in this scenario the transformer approximates such a machine. The voting committee machine obtained the best result of this study in the non-harassment class, which, although it is the majority class, is also the least interesting class in a harassment detection scenario. None of the machines performs well in the minority classes, showing how difficult this problem is for these particular classes. The sexual harassment class is more approachable, and our proposal, combined with BERT, obtains a good result, with an F1 of 0.791, the highest of this experiment.

Table 11

Results reported on test data applying data augmentation (balanced data). F-score results are shown per class and at macro level. The results correspond to the average obtained from 5 executions of each experiment

Method	Mask	Accuracy	F1 macro	Non-Harassment	Indirect-Harassment	Physical-Harassment	Sexual-Harassment
CNN1	0.15	0.847	0.433	0.913	0.054	0.035	0.729
CNN2	0.15	0.847	0.437	0.914	0.070	0.031	0.734
RNN1	0.15	0.836	0.495	0.911	0.120	0.198	0.749
RNN2	0.15	0.824	0.466	0.902	0.101	0.145	0.716
RNN3	0.15	0.834	0.509	0.907	0.176	0.182	0.770
Best-fit committee	0.15	0.843	0.505	0.913	0.159	0.190	0.757
Norm committee	0.15	0.860	0.457	0.924	0.064	0.054	0.785
Voting committee	0.15	0.862	0.469	0.926*	0.075	0.085	0.790
BERT	0.15	0.307	0.194	0.456	0.074	0.053	0.193
Transformer + BERT	0.15	0.862	0.457	0.925	0.060	0.054	0.791
GloVe	0.15	0.337	0.199	0.504	0.079	0.056	0.157
Transformer + GloVe	0.15	0.860	0.482	0.924	0.109	0.104	0.790
CNN1	0.5	0.833	0.449	0.905	0.112	0.046	0.735
CNN2	0.5	0.839	0.468	0.910	0.110	0.111	0.740
RNN1	0.5	0.830	0.482	0.903	0.116	0.159	0.751
RNN2	0.5	0.808	0.462	0.888	0.067	0.207	0.686
RNN3	0.5	0.785	0.481	0.874	0.162	0.180	0.707
Best-fit committee	0.5	0.824	0.484	0.902	0.105	0.206	0.722
Norm committee	0.5	0.846	0.466	0.914	0.044	0.149	0.758
Voting committee	0.5	0.847	0.482	0.915	0.064	0.186	0.762
BERT	0.5	0.337	0.209	0.498	0.071	0.075	0.191
Transformer + BERT	0.5	0.848	0.472	0.915	0.098	0.105	0.769
GloVe	0.5	0.316	0.200	0.464	0.071	0.068	0.198
Transformer + GloVe	0.5	0.850	0.481	0.916	0.079	0.160	0.769
CNN1	0.85	0.812	0.436	0.892	0.108	0.053	0.692
CNN2	0.85	0.814	0.445	0.893	0.109	0.078	0.700
RNN1	0.85	0.826	0.494	0.903	0.146	0.171	0.756
RNN2	0.85	0.795	0.452	0.880	0.141	0.108	0.681
RNN3	0.85	0.794	0.454	0.877	0.122	0.115	0.700
Best-fit committee	0.85	0.819	0.488	0.896	0.153	0.177	0.725
Norm committee	0.85	0.841	0.458	0.911	0.089	0.078	0.753
Voting committee	0.85	0.843	0.463	0.913	0.102	0.077	0.758
BERT	0.85	0.322	0.202	0.472	0.087	0.047	0.202
Transformer + BERT	0.85	0.839	0.455	0.910	0.092	0.066	0.750
GloVe	0.85	0.379	0.216	0.544	0.065	0.066	0.187
Transformer + GloVe	0.85	0.830	0.462	0.904	0.107	0.098	0.740

Limitations of this study

We have introduced a transformer-based method for multi-class text classification problems with strong class imbalance. Our contribution goes in two lines. First, based on a set of baselines, the transformer takes the classifiers’ outputs and combines them in a sort of parameterized committee machine. The flexibility of the transformer allows these inputs to be easily mixed with text encodings. The novelty of our proposal is to use the transformer to learn how to combine the results of other machines. The proposal has been evaluated in two very complex tasks, and in both it achieves the best or close to the best F1 macro, which illustrates how powerful the transformer architecture is.

Second, since the class imbalance is severe, we introduce a data augmentation technique. Our strategy is invariant to the syntax of the original examples. In addition, it aims to maintain the semantics of the original texts by using replacements in the space defined by the GloVe encoding. The results show that in scenarios with moderate imbalance, as in stance classification, the technique improves the results obtained without data augmentation. When the imbalance is more severe, as in harassment detection, the technique is unable to overcome this difficulty.

Although our proposal achieves strong results in both challenges, the results are far from satisfactory because these tasks are challenging. Their complexity lies in the low availability of data for the classes of interest, which are severely under-represented. This situation is common in many contexts, such as social media analytics, the field to which the two tasks studied belong. A final thought that illustrates a limitation of this study is that, given the limited availability of examples, it is challenging even for a highly complex model to solve a machine learning task. A data augmentation technique, such as the one introduced here, requires a sufficient number of representative examples for each class. It cannot address the absence of observations in the classes of interest. Which statistics and measurable conditions determine when enough data is available to generalize in a class remains a fundamental question for this area.

6. Conclusions

We have introduced a method based on the transformer that allows baselines to be combined. The flexibility of this architecture enables baselines to be combined with text encodings. The proposal, together with a new data augmentation technique, obtains good results in two multi-class text classification tasks with class imbalance. Despite the relative success shown in this work, these tasks are far from solved.

This work can be extended in several lines. At an experimental level, it would be interesting to study our proposal in other problems, for example, classic multi-class classification (e.g., 20 newsgroups), in which larger data volumes are managed and more classes are available. The use of attention in conjunction with explanatory mechanisms can also help elucidate which characteristics of each class are most relevant during the training of our proposal. Beyond text, its use in the classification of images with severe imbalance, as in the case of medical images, could also be of interest for this type of architecture.

Acknowledgments

This research was possible due to the funding of Programa de Iniciación Científica PIIC-DGIP of Universidad Técnica Federico Santa María.

References

1. Anand, Walker M.A., Abbott, Tree J.E.F., Bowmani and Minor, Cats rule and dogs drool!: Classifying stance in online debate, in: Proceedings of the 2nd Workshop on Computational Approaches to Subjectivity and Sentiment Analysis, WASSA@ACL ’11, Portland, OR, USA, 2011, pp. 1–9.

2. Arango, Pérez and Poblete, Hate speech detection is not as easy as you may think: A closer look at model validation, in: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’19, Paris, France, 2019, pp. 45–54.

3. Augenstein, Rocktäschel, Vlachos and Bontcheva, Stance detection with bidirectional conditional encoding, in: Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP ’16, Austin, Texas, USA, 2016, pp. 876–885.

4. Ba L.J., Kiros J.R. and Hinton G.E., Layer normalization, CoRR, abs/1607.06450, 2016.

5. Badjatiya, Gupta and Varma, Deep learning for hate speech detection in tweets, in: Proceedings of the 26th International Conference on World Wide Web Companion, Perth, Australia, 2017, pp. 759–760.

6. Bahuleyan and Vechtomova, UWaterloo at SemEval-2017 Task 8: Detecting stance towards rumours with topic independent features, in: Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval ’17), Vancouver, Canada, 2017, pp. 461–464.

7. Bravo-Marquez, Mendoza and Poblete, Meta-level sentiment models for big social data analysis, Knowledge-Based Systems 69 (2014), 86–99.

8. Bugueño and Mendoza, Applying self-attention for stance classification, in: Iberoamerican Congress on Pattern Recognition (CIARP ’19), Habana, Cuba, 2019, pp. 51–61.

9. Bugueño and Mendoza, Learning to detect online harassment on twitter with the transformer, in: Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Springer, 2019, pp. 298–306.

10. Chatzakou, Kourtellis, Blackburn, De Cristofaro, Stringhini and Vakali, Mean birds: Detecting aggression and bullying on twitter, in: ACM Web Science Conference (WebSci ’17), Troy, New York, USA, 2017, pp. 13–22.

11. Chawla N.V., Bowyer K.W., Hall L.O. and Kegelmeyer W.P., SMOTE: Synthetic minority over-sampling technique, Journal of Artificial Intelligence Research 16 (2002), 321–357.

12. Chen Y.-C., Liu Z.-Y. and Kao H.-Y., IKM at SemEval-2017 Task 8: Convolutional neural networks for stance detection and rumor verification, in: Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval ’17), Vancouver, Canada, 2017, pp. 465–469.

13. Chernick M.R., Resampling methods, Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 2(3) (2012), 255–262.

14. Davidson, Warmsley, Macy M.W. and Weber, Automated hate speech detection and the problem of offensive language, in: Proceedings of the Eleventh International Conference on Web and Social Media, ICWSM ’17, Montréal, Québec, Canada, 2017, pp. 512–515.

15. Derczynski and Bontcheva, Pheme: Veracity in digital social networks, in: Posters, Demos, Late-breaking Results and Workshop Proceedings of the 22nd Conference on User Modeling, Adaptation, and Personalization co-located with the 22nd Conference on User Modeling, Adaptation, and Personalization (UMAP ’14), Aalborg, Denmark, 2014.

16. Derczynski, Bontcheva, Liakata, Procter, Wong Sak Hoi and Zubiaga, SemEval-2017 Task 8: RumourEval: Determining rumour veracity and support for rumours, in: Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval ’17), Vancouver, Canada, 2017, pp. 69–76.

17. Devlin, Chang, Lee and Toutanova, BERT: pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT, Minneapolis, MN, USA, 2019, pp. 4171–4186.

18. Dumais S.T., Platt J.C., Heckerman and Sahami, Inductive learning algorithms and representations for text categorization, in: Proceedings of the 1998 ACM CIKM International Conference on Information and Knowledge Management, Bethesda, Maryland, USA, 1998, pp. 148–155.

19. Espinoza and Weiss, Detection of harassment on twitter with deep learning techniques, in: Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Springer, 2019, pp. 307–313.

20. Faulkner, Automated classification of stance in student essays: An approach using stance target information and the wikipedia link-based measure, in: Proceedings of the Twenty-Seventh International Florida Artificial Intelligence Research Society Conference, FLAIRS ’14, Pensacola Beach, Florida, USA, 2014.

21. Fortuna and Nunes, A survey on automatic detection of hate speech in text, ACM Computing Surveys 51(4) (2018), 1:85.

22. Gambäck and Sikdar U.K., Using convolutional neural networks to classify hate-speech, in: Proceedings of the First Workshop on Abusive Language Online, ALW@ACL ’17, Vancouver, BC, Canada, 2017, pp. 85–90.

23. Hasan K.S. and Ng, Stance classification of ideological debates: Data, models, features, and constraints, in: Sixth International Joint Conference on Natural Language Processing, IJCNLP ’13, Nagoya, Japan, 2013, pp. 1348–1356.

24. He, Zhang, Ren and Sun, Deep residual learning for image recognition, in: IEEE Conference on Computer Vision and Pattern Recognition, CVPR ’16, Las Vegas, NV, USA, 2016, pp. 770–778.

25. Hochreiter and Schmidhuber, Long short-term memory, Neural Computation 9(8) (1997), 1735–1780.

26. Joachims, Text categorization with support vector machines: Learning with many relevant features, in: Machine Learning: ECML ’98, 10th European Conference on Machine Learning, Chemnitz, Germany, 1998, pp. 137–142.

27. Johnson J.M. and Khoshgoftaar T.M., Survey on deep learning with class imbalance, Journal of Big Data 6(1) (2019), 27.

28. Karatsalos and Panagiotakis, Attention-based method for categorizing different types of online harassment language, arXiv preprint arXiv:1909.13104, 2019.

29. Krawczyk, Learning from imbalanced data: Open challenges and future directions, Progress in Artificial Intelligence 5(4) (2016), 221–232.

30. Kuncheva L.I., Combining Pattern Classifiers: Methods and Algorithms, Wiley, 2004.

31. Larkey L.S. and Croft W.B., Combining classifiers in text categorization, in: Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’96, Zurich, Switzerland, 1996, pp. 289–297.

32. LeCun, Matan, Boser B.E., Denker J.S., Henderson, Howard R.E., Hubbard W.E., Jackel L.D. and Baird H.S., Handwritten zip code recognition with multilayer networks, in: 10th IAPR International Conference on Pattern Recognition, Conference C: Image, Speech, and Signal Processing, and Conference D: Computer Architecture for Vision in Pattern Recognition, ICPR ’90, Atlantic City, NJ, USA, 1990, pp. 35–40.

33. Levy and Goldberg, Linguistic regularities in sparse and explicit word representations, in: Proceedings of the Eighteenth Conference on Computational Natural Language Learning, CoNLL ’14, Baltimore, Maryland, USA, 2014, pp. 171–180.

34. Lozano M.G., Lilja, Tjörnhammar and Karasalo, Mama Edha at SemEval-2017 Task 8: Stance classification with CNN and rules, in: Proceedings of the 11th International Workshop on Semantic Evaluation, SemEval@ACL ’17, Vancouver, Canada, 2017, pp. 481–485.

35. Lu and Weng, A survey of image classification methods and techniques for improving classification performance, International Journal of Remote Sensing 28(5) (2007), 823–870.

36. Lukasik, Srijith P.K., Vu, Bontcheva, Zubiaga and Cohn, Hawkes processes for continuous time sequence classification: an application to rumor stance classification in twitter, in: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL ’16, Berlin, Germany, Volume 2: Short Papers, 2016.

37. Ma, Gao and Wong, Detect rumor and stance jointly by neural multi-task learning, in: Companion of The Web Conference, WWW ’18, Lyon, France, 2018, pp. 585–593.

38. Mohammad, Kiritchenko, Sobhani, Zhu and Cherry, SemEval-2016 Task 6: Detecting stance in tweets, in: Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval ’16), San Diego, California, USA, 2016, pp. 31–41.

39. Mohammad S.M., Sobhani and Kiritchenko, Stance and sentiment in tweets, ACM Transactions on Internet Technology (TOIT) 17(3) (2017), 1–23.

40.

Pennington

Socher

and Manning

C.D.

, Glove: Global vectors for word representation, in: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1532–1543.

41.

Salton

and Buckley

, Term-weighting approaches in automatic text retrieval, Information Processing & Management 24(5) (1988), 513–523.

42.

Schapire

R.E.

and Singer

, Boostexter: A boosting-based system for text categorization, Machine Learning 39(2/3) (2000), 135–168.

43.

Schmidt

and Wiegand

, A survey on hate speech detection using natural language processing, in: Proceedings of the Fifth International Workshop on Natural Language Processing for Social Media, SocialNLP@EACL ’17, Valencia, Spain, 2017, pp. 1–10.

44. Sebastiani, Machine learning in automated text categorization, ACM Computing Surveys 34(1) (2002), 1–47.

45. Sharifirad, Jafarpour and Matwin, Boosting text classification performance on sexist tweets by text augmentation and text generation using a combination of knowledge graphs, in: Proceedings of the 2nd Workshop on Abusive Language Online, ALW@EMNLP ’18, Brussels, Belgium, 2018, pp. 107–114.

46. Sharifirad and Matwin, When a tweet is actually sexist. A more comprehensive classification of different online harassment categories and the challenges in NLP, CoRR, abs/1902.10584, 2019.

47. Sobhani, Inkpen and Matwin, From argumentation mining to stance classification, in: Proceedings of the 2nd Workshop on Argumentation Mining, ArgMining@HLT-NAACL ’15, Denver, Colorado, USA, 2015, pp. 67–77.

48. Somasundaran and Wiebe, Recognizing stances in online debates, in: ACL 2009, Proceedings of the 47th Annual Meeting of the Association for Computational Linguistics and the 4th International Joint Conference on Natural Language Processing of the AFNLP, Singapore, 2009, pp. 226–234.

49. Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez A.N., Kaiser Ł. and Polosukhin, Attention is all you need, in: Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.

50. Walker, Tree J.F., Anand, Abbott and King, A corpus for research on deliberation and debate, in: Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC ’12), Istanbul, Turkey, 2012, pp. 812–817.

51. Waseem, Are you a racist or am I seeing things? Annotator influence on hate speech detection on Twitter, in: Proceedings of the First Workshop on NLP and Computational Social Science, NLP+CSS@EMNLP ’16, Austin, TX, USA, 2016, pp. 138–142.

52. Waseem and Hovy, Hateful symbols or hateful people? Predictive features for hate speech detection on Twitter, in: Proceedings of the Student Research Workshop, SRW@HLT-NAACL ’16, The 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, California, USA, 2016, pp. 88–93.

53. Schuster, Chen, Q.V., Norouzi, Macherey, Krikun, Cao, Gao, Macherey, Klingner, Shah, Johnson, Liu, Kaiser, Gouws, Kato, Kudo, Kazawa, Stevens, Kurian, Patil, Wang, Young, Smith, Riesa, Rudnick, Vinyals, Corrado, Hughes and Dean, Google’s neural machine translation system: Bridging the gap between human and machine translation, CoRR, abs/1609.08144, 2016.

54. Zhang, Robinson and Tepper J.A., Detecting hate speech on Twitter using a convolution-GRU based deep neural network, in: The Semantic Web – 15th International Conference, ESWC ’18, Heraklion, Crete, Greece, 2018, pp. 745–760.

55. Zubiaga, Kochkina, Liakata, Procter and Lukasik, Stance classification in rumours as a sequential task exploiting the tree structure of social media conversations, in: COLING ’16, 26th International Conference on Computational Linguistics, Osaka, Japan, 2016, pp. 2438–2448.

56. Zubiaga, Liakata, Procter, Hoi G.W.S. and Tolmie, Analysing how people orient to and spread rumours in social media by looking at conversational threads, PLoS ONE 11(3) (2016), e0150989.


¹ https://ecmlpkdd2019.org/submissions/discovery/.

² https://github.com/hanxiao/bert-as-service.

³ Git repository: https://github.com/Buguemar/Transformer_as_ensemble.

⁴ https://ecmlpkdd2019.org/programme/discovery/.

Table 3 Architecture of our baseline models

Table 5 Results reported on test data using SMOTE for stance classification

Table 8 Results reported on test data in harassment detection. F-score results are shown per class and at macro level