Ranking based multi-label classification for sentiment analysis

Abstract

This paper proposes a sentiment analysis framework based on ranking learning. The framework uses a BERT model pre-trained on large-scale corpora to extract text features and has two sub-networks for different sentiment analysis tasks. The first sub-network consists of multiple fully connected layers with intermediate rectified linear units. Its main purpose is to learn the presence or absence of various emotions from the extracted text features, and its supervision signal comes from the cross-entropy loss function. The other sub-network is a ListNet. Its main purpose is to learn a distribution that approximates the real distribution of different emotions by exploiting the correlation between them. The predicted distribution can then be used to rank the emotions by importance. The two sub-networks are trained jointly and can contribute to each other, which avoids the bias of a single network. The framework has been tested on multiple datasets, and the results show its potential.

Keywords

Sentiment analysis, multi-label classification, ranking

1 Introduction

As a long-standing hot topic in the field of natural language processing, sentiment analysis has received much attention from researchers in recent years. Sentiment analysis has been widely applied in various applications because it can extract people’s opinions on certain documents [1]. It is essential for companies and organizations to understand the information underlying online comments in order to provide more satisfactory services [2].

Much effort has been devoted to understanding the sentiment of a text. Earlier works proposed using manually or automatically built sentiment lexicons to analyze the sentiment directly, since the sentiment of a text is often related to the words that carry sentiment [3]. Well-known works include SentiWordNet and Opinion Lexicon [4, 5]. This kind of method can make use of professional knowledge, but it is labor-intensive and may have subjective biases.

With the development of machine learning techniques, many machine learning based methods have been developed to solve this problem, e.g., Logistic Regression (LR), Naïve Bayes (NB), Support Vector Machines (SVM), Maximum Entropy (ME), etc. [6, 7]. Recently, with the boom of deep learning techniques, a large number of advanced deep learning models have also been proposed for this challenge, e.g., Convolutional Neural Network (CNN) [8] and Recurrent Neural Network (RNN) based models [9]. By using these methods, semantic information can be captured more efficiently for analysis.

Most currently developed sentiment analysis methods focus on binary or ternary classification of the polarity of a text, i.e., positive, neutral, negative. However, since sentiments in natural language are rich and may show different intensities, a sentence might express several types of emotion, which means a document may contain more than one polarity [10]. A long document in particular might cover different aspects, each with a different emotional expression. That means the sentiment polarity of an aspect is probably linked to only a part of the document. When we analyze the emotion from the perspectives of different aspects, the same document can have different sentiments.

For example, Weibo, one of the largest Chinese social media websites, lets people vote for the most accurate sentiment label of a given text, which produces more sentiment classes at low cost [11]. The popularization of social media provides more fine-grained sentiment-labeled texts. How to analyze this kind of multi-label sentiment has attracted much attention in the community [12].

Currently, most works on multi-label sentiment analysis use classification-based methods [10, 12], which focus on the presence or absence of various emotions: they determine whether a certain emotion exists in a sentence and label it true or false accordingly [13]. Methods of this kind only consider whether each emotion polarity exists in the document. However, emotions usually differ in intensity within the same document [14], which means the intensity ranking of different emotions can be useful information for the multi-label emotion classification task.

Given a text, it can be viewed as a query, and emotions it contains could be viewed as the query results we should return. Therefore the whole process can be viewed as a ranking learning process. Inspired by the ranking learning algorithm [15], in this research we also regard the emotion classification task as a ranking learning task and let the model give the ranking of different emotions together with classification results, so as to make better use of the intensity relationship information between different emotions.

In this paper we first adopt BERT [16] as the backbone of our model because of its strong representation ability. On this basis, we use a classification network to convert the extracted representation into an estimate of the presence of each kind of sentiment, which is essentially a multi-label classification task. To predict the importance of each kind of sentiment, we design a network inspired by ListNet [17] to estimate the distribution of sentiments. The results of our experiments show that, by incorporating a ranking learning module, the proposed model can exploit more fine-grained information about the relationships between emotions, which contributes to better classification performance.

The rest of the paper is organized as follows. Section 2 discusses related work on sentiment analysis. Section 3 describes the proposed ranking based multi-label sentiment analysis model in detail. Section 4 presents the experimental study, and Section 5 gives a summary and potential future directions.

2 Related Work

Sentiment analysis is one of the most fundamental applications in natural language processing and has attracted much attention for a long time. Considering that the emotion of a text is often related to the words that carry sentiment, earlier methods employed manually or automatically built sentiment lexicons. Manually crafted lexicons are widely used, but they normally require large amounts of labor and need to be rebuilt for each new domain [18, 19]. In contrast, Liu et al. generate the sentiment lexicon by manually creating a small list of seed words with sentiment labels and then growing the list [20]. They treat adjectives as sentiment words, assign each adjective in a sentence a sentiment polarity according to the generated lexicon, flip the polarity of a sentiment word when a negation word appears within a distance of five words, and finally determine the sentiment polarity by majority voting.

With the development of machine learning technology, many machine learning based approaches have been developed. For example, Liang et al. proposed an Auxiliary-Sentiment Latent Dirichlet Allocation (AS-LDA) model, in which they assume that the words in subjective documents consist of two parts: sentiment element words and auxiliary words [6]. Similarly, Li et al. combined TF-IDF and latent semantic analysis (LSA) in machine learning models to predict the polarity of a document [7]. Recently, deep learning has been successfully applied to various tasks thanks to its ability to extract features automatically. For example, the Convolutional Neural Network (CNN) was initially applied to images, but it has proved to be effective in natural language processing as well. Kim et al. reported on a series of experiments with CNNs trained on top of pre-trained word vectors for sentence-level classification tasks [8], which showed excellent results on multiple benchmarks. Conneau et al. [21] proposed a new architecture (VDCNN) for text processing which operates directly at the character level and uses only small convolutions and pooling operations.

In addition to CNN models, the Recurrent Neural Network (RNN) and its variants are also widely used in natural language processing, because they have advantages in modeling positional relationships and dependencies [22]. Tang et al. proposed Target-Dependent LSTM (TDLSTM) and Target-Connection LSTM (TCLSTM) for targeted sentiment analysis [23]. Day et al. [9] used LSTM for sentiment analysis on Google Play consumer reviews. Similarly, Tang et al. [24] introduced a model which first learns sentence representations with LSTM. Afterwards, the semantics of sentences and their relations are adaptively encoded in the document representation with a gated recurrent neural network. With this method, semantic information between sentences can be captured easily and better analysis results can be achieved. RNNs are effective in sentiment analysis, though they still have some limitations. For example, they cannot distinguish the importance of word context cues.

Some hybrid models are also reported in the literature. Rehman et al. [25] proposed a hybrid model using LSTM and a very deep CNN, named the Hybrid CNN-LSTM Model. The model combines the set of features extracted by convolution and global max-pooling layers with long-term dependencies. Gan et al. [26] proposed a sparse attention based separable dilated convolutional neural network (SA-SDCCN), which consists of a multichannel embedding layer, a separable dilated convolution module, a sparse attention layer, and an output layer. SA-SDCCN achieves comparable or even better performance than state-of-the-art methods.

Though most methods consider polarity detection for the document as a whole, some researchers have pointed out the importance of aspect-level emotion expression. Yu et al. extracted sentiment information in the form of feature vectors from the pros and cons sections of reviews, where the positive and negative opinions on the aspects are expressed explicitly, and used the sentiment information to train an SVM sentiment classifier [27]. Similarly, CNNs are also widely used for finer-grained emotion detection. Motivated by the assumption that key words may contain the aspect term and indicate the category or the polarity of an aspect regardless of their position [28], CNN models are utilized in aspect sentiment analysis. For example, Ruder et al. [29] concatenate every word embedding with the aspect vector and use a CNN to determine the sentiment of the aspect. Gu et al. organize CNN aspect mappers and a CNN sentiment classifier in a cascaded way, where the mappers detect the aspect category and the classifier predicts the aspect sentiment polarity [30]. Wu et al. propose a multitask CNN that also contains aspect mappers and a sentiment classifier [31].

To better capture fine-grained emotions, the importance of multi-label sentiment analysis has been recognized [12]. Huang et al. proposed a multi-task multi-label approach to solve this problem [10]. Emotion ranking is also reported to be an important feature for this task. For example, Yang et al. used topics as an important feature in emotion ranking [11]. Similarly, Zhou et al. proposed a ranking method for sentiment analysis [14]. Identifying emotion from different perspectives is regarded as a critical challenge.

3 Ranking based multi-label sentiment classification

The overall sentiment analysis model proposed in this paper is shown in Fig. 1, where w₁, w₂, …, w_n represent the words composing the sentence, while [CLS] and [SEP] are the special tokens added to the beginning and the end. First, we employ the BERT model to extract features from the preprocessed data. Then the classification and ranking sub-networks use the features to detect and sort the sentiments.

Fig. 1

The overall architecture.

3.1 Feature extraction using BERT

Recently, there has been a growing trend of fine-tuning features extracted from pre-trained language models on downstream tasks under supervision. Such pre-training has generally been shown to improve results on downstream tasks. Devlin et al. proposed Bidirectional Encoder Representations from Transformers (BERT) [16], a model that stacks multiple Transformer encoders [32]. It has been trained on a large corpus from Wikipedia and made advances in several natural language processing tasks.

Following the transfer learning paradigm, we load the BERT model pre-trained on large-scale corpora. A single encoder layer in BERT [16] is shown in Fig. 2. After pre-processing, sentences are converted to lists of tokens. Afterwards, the special tokens, i.e., [CLS] and [SEP], are inserted at the head and the tail of the token list. Apart from word embeddings, position and segment embeddings are calculated, and the three are summed at the embedding layer of BERT. In other words, the embedding of a sentence is calculated as: $X = e_{word} + e_{position} + e_{type}$ (1) where e represents an embedding vector. We refer to it as X because it can also be interpreted as the initial features of the sentence.

Fig. 2

A single encoder layer in BERT.

After layer normalization and dropout [33], X is sent to the 12 stacked identical encoder layers of BERT, in which more abstract features are extracted. The key idea of the encoder is the self-attention part, which can be calculated as: $\begin{matrix} Q & = W_{Q} X \\ K & = W_{K} X \\ V & = W_{V} X \end{matrix}$ (2) $Attention (Q, K, V) = dropout (softmax (\frac{{QK}^{T}}{\sqrt{d_{k}}})) V$ (3)

In these equations, Q, K and V represent the information of query, key and value, respectively. The self-attention part is followed by the linear layer, tanh activation function, dropout and layer normalization. These layers give the encoder highly expressive ability and resistance to overfitting.
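
To make Eqs. (2) and (3) concrete, the following is a minimal single-head sketch in plain Python. It is illustrative only and makes several assumptions that are not taken from the paper: tokens are stored as rows (so the projections are written as X W rather than W X), dropout is omitted, and the helper names `matmul`, `softmax` and `self_attention` are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention (Eq. 3) without dropout.

    x holds one token per row; w_q, w_k and w_v are d x d_k projections.
    Returns the attended features and the attention weights.
    """
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(q[0])
    k_t = [list(col) for col in zip(*k)]  # transpose of K
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, k_t)]
    weights = [softmax(row) for row in scores]  # one distribution per query token
    return matmul(weights, v), weights


# Toy input: three tokens with two features each, identity projections.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
eye = [[1.0, 0.0], [0.0, 1.0]]
out, weights = self_attention(x, eye, eye, eye)
```

Each row of `weights` is a probability distribution over the tokens of the sentence, and the corresponding row of `out` is the weighted average of the value vectors.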

The output part of the BERT module is depicted in Fig. 3. The standard BERT model only makes use of x_cls, the final hidden vector corresponding to [CLS], which is interpreted as the aggregate representation of the sentence. This may not be enough, so we design a position-attention module for feature extraction, because the intensities of words vary with their positions. This module obtains the feature x_attention through the following equation: $\begin{matrix} weights & = softmax (X_{seq} W_{attention}) \\ x_{attention} & = (weights)^{T} X_{seq} \end{matrix}$ (4) where X_seq represents the final hidden vectors not corresponding to [CLS] and weights denotes the position-attention weights.

Fig. 3

The output part of BERT module.

To further enrich the features, we calculate the max pooling and the mean pooling of X_seq along the sentence, which we refer to as x_max and x_mean. At last, we get the final representation of the sentence by concatenating x_cls, x_attention, x_max and x_mean, which are the features we use for classification and ranking: $x_{final} = [x_{cls}; x_{attention}; x_{\max}; x_{mean}]$ (5)
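
The pooling steps of Eqs. (4) and (5) can be sketched as follows. This is a minimal illustration under assumptions that the paper does not state: W_attention is taken to be a single vector of length d, so that the softmax runs over token positions, and the function name `pool_features` is hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of scores."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def pool_features(x_cls, x_seq, w_attention):
    """Build x_final = [x_cls; x_attention; x_max; x_mean] (Eq. 5).

    x_cls: list of d floats (hidden vector of [CLS]).
    x_seq: list of n token vectors, each a list of d floats.
    w_attention: list of d floats (assumed shape, see lead-in).
    """
    # Eq. (4): one attention score per position, normalised over positions.
    scores = [sum(h * w for h, w in zip(token, w_attention)) for token in x_seq]
    weights = softmax(scores)
    dims = range(len(x_seq[0]))
    x_attention = [sum(wt * token[j] for wt, token in zip(weights, x_seq)) for j in dims]
    # Max pooling and mean pooling along the sentence.
    x_max = [max(token[j] for token in x_seq) for j in dims]
    x_mean = [sum(token[j] for token in x_seq) / len(x_seq) for j in dims]
    return x_cls + x_attention + x_max + x_mean


features = pool_features(
    x_cls=[0.5, -0.5],
    x_seq=[[1.0, 2.0], [3.0, 0.0]],
    w_attention=[0.0, 0.0],  # zero weights give uniform attention
)
```

With hidden size d, the concatenated vector has 4d entries; in the toy call above d = 2, so `features` has 8 entries.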

3.2 Intensities based sentiment ranking

In this research, we consider sentiment intensity an important feature in multi-label classification. Ranking algorithms such as RankNet [34] and ListNet [17] have been widely applied in learning-to-rank tasks. We therefore propose to rank the sentiments by their intensities. The structure of this sub-network, which is slightly different from the standard ListNet, is depicted in Fig. 4.

Fig. 4

The structure of the ranking subnetwork.

Learning To Rank (L2R) is a supervised learning method that has been successful in document retrieval, collaborative filtering and many other tasks. Given the features of several documents X and a query q, an L2R model calculates the relevance between X and q and sorts all the documents according to that relevance. Frequently used features include the click rate, the quality and the number of words that co-occur in the query and the document. A direct way to perform L2R is to calculate the score of each document independently. This kind of method is called the pointwise approach. As only the independent information of each document is taken into consideration, the results of the pointwise approach are usually not very good.

Pairwise approaches such as RankNet [34] instead learn from the relative order of pairs of documents. As the pairwise approach only considers the relationship between one pair of documents at a time and ignores the global information, some researchers postulate that methods taking lists of documents as learning instances could perform better. These methods are referred to as the listwise approach, and a typical one is ListNet [17]. This model assigns each document x_i a score s_i, so the probability of the rank π = < x_π(1), x_π(2), …, x_π(n) > can be calculated as: $P (π) = \prod_{i = 1}^{n} \frac{Φ (s_{π (i)})}{\sum_{k = i}^{n} Φ (s_{π (k)})}$ (6) where Φ can be any positive increasing function (e.g., exp). The training process of ListNet fits the probability distribution over all possible ranks to that of the labels. However, a list of n documents has n! possible ranks, which can lead to a huge amount of computation. To simplify the training procedure, we can divide the ranks into n groups according to their top-one document and use the probability distribution over these groups to train the model. The probability of the group whose ranks start with document j is referred to as P_top-1 (j) and described as: $P_{top - 1} (j) = \sum_{π (1) = j, π \in Ω_{n}} P (π)$ (7)
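
The saving that the top-1 grouping brings can be checked numerically. The sketch below, which takes Φ = exp and uses hypothetical function names, evaluates Eq. (7) by brute force over all n! permutations and compares it with the closed form Φ(s_j) / Σ_k Φ(s_k), which is the known top-one probability result for ListNet and needs only n terms.

```python
import math
from itertools import permutations


def permutation_probability(scores, order):
    """Eq. (6) with Phi = exp: probability of ranking the items in `order`."""
    prob = 1.0
    for i in range(len(order)):
        # Normalise over the items that have not been placed yet.
        denom = sum(math.exp(scores[k]) for k in order[i:])
        prob *= math.exp(scores[order[i]]) / denom
    return prob


def top1_by_enumeration(scores, j):
    """Eq. (7): sum the probabilities of all permutations that start with item j."""
    n = len(scores)
    return sum(
        permutation_probability(scores, p)
        for p in permutations(range(n))
        if p[0] == j
    )


def top1_closed_form(scores, j):
    """Softmax of the scores, evaluated at item j."""
    return math.exp(scores[j]) / sum(math.exp(s) for s in scores)


scores = [0.2, 1.0, -0.5, 0.7]  # arbitrary example scores
brute = [top1_by_enumeration(scores, j) for j in range(len(scores))]
closed = [top1_closed_form(scores, j) for j in range(len(scores))]
```

The two lists agree up to floating-point error, so in practice only the softmax of the scores has to be computed during training.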

To estimate the similarity between the predicted probability distribution and the real one, we can use Kullback-Leibler divergence as the loss function: $loss = KL (P, Q) = \sum_{i = 1}^{n} p_{i} \log (\frac{p_{i}}{q_{i}})$ (8)

It should be noted that the ranking of text sentiments should follow their classification, because estimating sentiment intensities makes no sense if the sentiments are unknown. However, the classification module cannot perform well at the start of training. This means that if the classification module were followed by the ranking module during training, wrong classification information would flow into the ranking module, which could lead to cascading errors. For this reason, the two modules are designed in parallel and the final loss function of the model is the sum of the loss functions of the two modules. Once the model is trained and put into use, the ranking module can use the classification results predicted by the classification module.

To simplify the calculation, we only use the top-1 probability distribution to estimate and improve the performance of the model. The loss function used here is the Kullback-Leibler divergence: $\begin{matrix} loss & = KL (P_{true}, P_{pred}) \\ & = \sum_{i = 1}^{n} p_{true} [i] \log (\frac{p_{true} [i]}{p_{pred} [i]}) \\ & = \sum_{i = 1}^{n} \frac{e^{y_{true} [i]}}{\sum_{k = 1}^{n} e^{y_{true} [k]}} \log (\frac{e^{y_{true} [i]} \sum_{k = 1}^{n} e^{y_{pred} [k]}}{e^{y_{pred} [i]} \sum_{k = 1}^{n} e^{y_{true} [k]}}) \end{matrix}$ (9) where y_pred contains the predicted scores of each sentiment and y_true contains the scores calculated from the labels. Every p_pred[i] depends on y_pred[j] through the softmax normalization, so the whole sum has to be differentiated. The derivative of the loss with respect to the score of a sentiment y_pred[j] is: $\begin{matrix} \frac{\partial loss}{\partial y_{pred} [j]} & = \frac{\partial}{\partial y_{pred} [j]} \sum_{i = 1}^{n} p_{true} [i] \log (\frac{p_{true} [i]}{p_{pred} [i]}) \\ & = \frac{\partial}{\partial y_{pred} [j]} (- \sum_{i = 1}^{n} p_{true} [i] \log (p_{pred} [i])) \\ & = p_{pred} [j] - p_{true} [j] \end{matrix}$ (10)
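
As an illustration of the top-1 loss in Eq. (9), the sketch below computes the KL divergence between the softmax of the label scores and the softmax of the predicted scores. It also returns the gradient with respect to the predicted scores, which for softmax-normalized scores works out to p_pred[j] − p_true[j]. The function names and example values are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top1_kl_loss(y_true, y_pred):
    """KL(P_true || P_pred) between the top-1 distributions (Eq. 9)."""
    p_true, p_pred = softmax(y_true), softmax(y_pred)
    return sum(p * math.log(p / q) for p, q in zip(p_true, p_pred))


def top1_kl_gradient(y_true, y_pred):
    """Gradient of the loss with respect to each predicted score."""
    p_true, p_pred = softmax(y_true), softmax(y_pred)
    return [q - p for p, q in zip(p_true, p_pred)]


y_true = [0.9, 0.1, 0.0, 0.4]   # example intensity labels
y_pred = [0.3, -0.2, 0.5, 0.1]  # example predicted scores
loss = top1_kl_loss(y_true, y_pred)
grad = top1_kl_gradient(y_true, y_pred)
```

The gradient entries sum to zero, and the loss vanishes when the two score vectors induce the same distribution.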

3.3 Fine-grained Sentiment Classification based on Cross Entropy Loss

The feature x_final extracted from BERT can be used for multi-label classification of sentiments as well as ranking of sentiments. The classification module is demonstrated in Fig. 5.

Fig. 5

The structure of the classification subnetwork.

The structure is generally similar to that of the ranking module, but the two differ in the loss function and in the meaning of the output. In this sub-network, the output vector indicates the presence or absence of each sentiment rather than its proportion score. The output of the last fully connected layer is real-valued. The loss function layer uses the sigmoid function to map the output into (0, 1) and then uses the Binary Cross Entropy Loss to compute the loss of each sentiment. This procedure is given below: $\begin{matrix} loss (y_{true}, y_{pred}) = & - \sum_{i = 1}^{n} y_{true} [i] \log (σ (y_{pred} [i])) \\ & - \sum_{i = 1}^{n} (1 - y_{true} [i]) \log (1 - σ (y_{pred} [i])) \end{matrix}$ (11) where σ represents the sigmoid function.
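
A direct transcription of Eq. (11) is sketched below. It is a minimal illustration with hypothetical function names; it applies the sigmoid naively, so it is not protected against overflow or log(0) for extreme logits, which a production implementation would need to handle.

```python
import math


def sigmoid(x):
    """Map a real-valued logit into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def bce_with_logits(y_true, y_pred):
    """Binary cross entropy summed over all sentiment types (Eq. 11).

    y_true: list of 0/1 labels, one per sentiment.
    y_pred: list of real-valued outputs of the last fully connected layer.
    """
    loss = 0.0
    for t, z in zip(y_true, y_pred):
        p = sigmoid(z)
        loss -= t * math.log(p) + (1 - t) * math.log(1 - p)
    return loss


uninformed = bce_with_logits([1, 0], [0.0, 0.0])   # both probabilities are 0.5
confident = bce_with_logits([1, 0], [4.0, -4.0])   # correct and confident
```

Logits of zero correspond to a probability of 0.5 for every sentiment, so `uninformed` equals 2 ln 2, while the confident and correct prediction yields a much smaller loss.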

In this sub-network, multiple fully connected layers and activation functions are stacked to capture information for classification. The activation functions used here are Rectified Linear Units (ReLU) [35], which perform the following operation: $ReLU (x) = \max (0, x)$ (12)

The output of ReLU stays at 0 when the input is negative, so ReLU is resistant to noise and thus suits classification tasks.

4 Experimental Study

4.1 Dataset

There are many datasets related to sentiment analysis. However, as early studies of sentiment analysis mainly focused on judging sentiment polarity, most of the datasets that researchers collected and published are labeled with only a limited set of sentiment polarities. Typical examples include the IMDB film review dataset and Sentiment140. In these datasets, sentiments are simply classified as positive, negative or neutral. Human sentiments are far more varied, so three polarities are not enough to describe them. Therefore, some researchers started to construct datasets with fine-grained sentiment labels. Table 1 lists the datasets we use in this paper. Ren-CECps [36] is a Chinese microblog dataset with fine-grained labels. It collects 1487 microblogs in Chinese and gives each microblog scores ranging from 0 to 1 on 8 types of sentiment to represent the corresponding intensities. For the precision of the model, we ignore sentences whose length is greater than 300 or less than 5. The other dataset, SemEval2007, is the dataset for Task 14 of SemEval 2007. It contains 1250 English headlines, each of which is labeled with 6 types of sentiment in the form of scores ranging from 0 to 100. As English headlines are usually short, we use all of them. We shuffle the datasets, select the first 80 percent as the training set and use the rest for testing.

Table 1
Datasets used in this paper

Name	Amount of samples	Types of sentiment
Ren-CECps	35096	Joy, Hate, Love, Sorrow, Anxiety, Surprise, Anger, Expect
SemEval 2007	1250	Joy, Anger, Disgust, Fear, Sad, Surprise

4.2 Evaluation criteria

4.2.1 Criteria for classification performance

The classification module predicts the presence or absence of every type of sentiment, which is in fact a multi-label classification task. We use Hamming distance, macro-F1 score and micro-F1 score as criteria.

Hamming Distance. Hamming distance [37] is a measure commonly used in information theory to quantify the difference between two strings. The main idea is to describe how many changes are needed to convert one string into the other. In other words, it is the number of positions at which the characters of the two strings differ. For strings composed of 0 and 1, the Hamming distance can be calculated as follows: $HammingDistance (s_{1}, s_{2}) = \sum (s_{1} \oplus s_{2})$ (13) The classification results in this paper can be expressed as an N × K 0-1 matrix Y_pred, where N is the number of samples in the test set and K is the number of sentiment types. An entry of 1 in row i and column j of Y_pred means that the i^th sample is predicted to have the j^th sentiment, and an entry of 0 means that it is predicted not to have it. Y_true, the matrix of sentiment labels, is defined in the same way. Thus, the classification performance can be evaluated as: $HammingDistance (Y_{true}, Y_{pred}) = \frac{\sum (Y_{true} \oplus Y_{pred})}{N}$ (14) Obviously, the smaller the Hamming distance is, the closer the prediction is to the truth.
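
Eq. (14) amounts to counting mismatched entries and dividing by the number of samples, as the following short sketch shows. The function name and the toy matrices are hypothetical.

```python
def hamming_distance(y_true, y_pred):
    """Average number of mismatched labels per sample (Eq. 14).

    y_true, y_pred: N x K matrices of 0/1 values stored as lists of rows.
    """
    mismatches = sum(
        t != p
        for row_true, row_pred in zip(y_true, y_pred)
        for t, p in zip(row_true, row_pred)
    )
    return mismatches / len(y_true)


y_true = [[1, 0, 1], [0, 1, 0]]
y_pred = [[1, 1, 1], [0, 0, 0]]
distance = hamming_distance(y_true, y_pred)  # two mismatches over two samples
```

Note that, following Eq. (14), the count is normalized by N only; dividing additionally by K would bound the value in [0, 1].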

F1 Score. F1 Score is another criterion for evaluating classification performance, which takes precision and recall into account. Taking binary classification as an example, we can define TP, FP and FN as: $TP = \sum_{i = 1}^{N} (y_{true} [i] = 1 \land y_{pred} [i] = 1)$ (15) $FP = \sum_{i = 1}^{N} (y_{true} [i] = 0 \land y_{pred} [i] = 1)$ (16) $FN = \sum_{i = 1}^{N} (y_{true} [i] = 1 \land y_{pred} [i] = 0)$ (17) Then the precision, recall and F1 can be calculated as: $Precision = \frac{TP}{TP + FP}$ (18) $Recall = \frac{TP}{TP + FN}$ (19)

$F 1 (y_{true}, y_{pred}) = \frac{2 * Precision * Recall}{Precision + Recall}$ (20) As the F1 score takes both precision and recall into account, it is usually considered a good choice for classification evaluation. For multi-label problems, the binary F1 score is not directly applicable, but we can use the micro-F1 and macro-F1 scores instead. For the macro-F1 score, the precision of the model is defined as the mean of the precisions of each type of sentiment, and the recall is defined in the same way. For a dataset labeled with K types of emotion, the macro-F1 score can be calculated as: $\begin{matrix} macroF 1 (Y_{true}, Y_{pred}) & = \frac{2 * P_{macro} * R_{macro}}{P_{macro} + R_{macro}} \\ P_{macro} & = \frac{1}{K} \sum_{j = 1}^{K} P_{e_{j}} \\ R_{macro} & = \frac{1}{K} \sum_{j = 1}^{K} R_{e_{j}} \end{matrix}$ (21) where P_e_j and R_e_j are the precision and recall of the j^th type of sentiment. The micro-F1 score instead pools the counts of all types before computing precision and recall, so that types that occur more often carry more weight: $\begin{matrix} microF 1 (Y_{true}, Y_{pred}) & = \frac{2 * P_{micro} * R_{micro}}{P_{micro} + R_{micro}} \\ P_{micro} & = \frac{\sum_{j = 1}^{K} {TP}_{j}}{\sum_{j = 1}^{K} ({TP}_{j} + {FP}_{j})} \\ R_{micro} & = \frac{\sum_{j = 1}^{K} {TP}_{j}}{\sum_{j = 1}^{K} ({TP}_{j} + {FN}_{j})} \end{matrix}$ (22) where TP_j, FN_j and FP_j represent the corresponding statistics of the j^th type of sentiment. In this paper, we use the micro-F1 and macro-F1 scores to assess the classification performance.
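
The difference between the two averages in Eqs. (21) and (22) can be seen on a small example. The sketch below uses the standard definitions (a false positive is a sentiment that is predicted but absent, a false negative one that is present but missed). The function names are hypothetical, and returning 0 for an empty denominator is an assumption made here, not a convention stated in the paper.

```python
def label_counts(y_true, y_pred, j):
    """TP, FP and FN for the j-th sentiment over all samples."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
    return tp, fp, fn


def safe_div(a, b):
    """Return a / b, or 0.0 when the denominator is zero (assumed convention)."""
    return a / b if b else 0.0


def f1(precision, recall):
    return safe_div(2 * precision * recall, precision + recall)


def macro_f1(y_true, y_pred):
    """Eq. (21): average per-label precision and recall, then combine."""
    k = len(y_true[0])
    stats = [label_counts(y_true, y_pred, j) for j in range(k)]
    p_macro = sum(safe_div(tp, tp + fp) for tp, fp, fn in stats) / k
    r_macro = sum(safe_div(tp, tp + fn) for tp, fp, fn in stats) / k
    return f1(p_macro, r_macro)


def micro_f1(y_true, y_pred):
    """Eq. (22): pool the counts of all labels, then combine."""
    k = len(y_true[0])
    stats = [label_counts(y_true, y_pred, j) for j in range(k)]
    tp = sum(s[0] for s in stats)
    fp = sum(s[1] for s in stats)
    fn = sum(s[2] for s in stats)
    return f1(safe_div(tp, tp + fp), safe_div(tp, tp + fn))


y_true = [[1, 0], [1, 1], [1, 1], [0, 0]]
y_pred = [[1, 0], [0, 1], [0, 1], [1, 0]]
macro = macro_f1(y_true, y_pred)  # 12/17
micro = micro_f1(y_true, y_pred)  # 2/3
```

In this example the second sentiment is predicted perfectly, which lifts the macro average, while the pooled counts used by the micro average are dominated by the errors on the first sentiment.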

4.2.2 Criteria for Ranking Performance

The ranking module intends to evaluate intensities of different sentiments in the form of order. We assess its performance by typical evaluation indexes used in L2R.

Normalized Discounted Cumulative Gain. Normalized Discounted Cumulative Gain (NDCG) [38] is a frequently-used criterion for performance of L2R models. To get the NDCG, we have to first calculate Discounted Cumulative Gain (DCG). Given a query q and m recommended documents, supposing that r_i represents the relevance index of the i^th document to the query, we can get DCG as: $DCG = \sum_{i = 1}^{m} \frac{2^{r_{i}} - 1}{\log_{2} (i + 1)}$ (23)

This index grows as high-relevance documents are placed closer to the top, but it also grows when the relevance values themselves become larger. Therefore, it has to be normalized by the IDCG (Ideal Discounted Cumulative Gain), the DCG of the same documents sorted in order of relevance. For example, given k sentiments, which are e₁, e₂, …, e_k from top to bottom according to real intensity and are predicted in the order e_π(1), e_π(2), …, e_π(k), the NDCG is calculated as: $\begin{matrix} NDCG & = \frac{DCG}{IDCG} \\ & = \frac{\sum_{i = 1}^{k} \frac{2^{s_{π (i)}} - 1}{\log_{2} (i + 1)}}{\sum_{i = 1}^{k} \frac{2^{s_{i}} - 1}{\log_{2} (i + 1)}} \end{matrix}$ (24) where s_i is the real intensity of e_i. The value of NDCG ranges from 0 to 1, and a larger value means better ranking performance.
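
The NDCG computation can be sketched as follows: the sentiments are sorted by predicted score, their real intensities are read off in that order to obtain the DCG, and the result is divided by the DCG of the ideal order. The function names and example values are hypothetical, and returning 0 when all intensities are zero is an assumption.

```python
import math


def dcg(relevances):
    """Eq. (23): discounted cumulative gain of relevances listed in ranked order."""
    # enumerate starts at 0, so position i + 1 is discounted by log2(i + 2).
    return sum((2 ** r - 1) / math.log2(i + 2) for i, r in enumerate(relevances))


def ndcg(true_intensity, predicted_scores):
    """Eq. (24): DCG of the predicted order divided by the ideal DCG."""
    order = sorted(
        range(len(predicted_scores)),
        key=lambda j: predicted_scores[j],
        reverse=True,
    )
    ranked = [true_intensity[j] for j in order]
    ideal_dcg = dcg(sorted(true_intensity, reverse=True))
    return dcg(ranked) / ideal_dcg if ideal_dcg > 0 else 0.0


intensity = [0.9, 0.1, 0.5]               # real intensities of three sentiments
perfect = ndcg(intensity, [3.0, 1.0, 2.0])   # same order as the intensities
reversed_ = ndcg(intensity, [1.0, 3.0, 2.0])  # strongest and weakest swapped
```

A prediction that reproduces the true order reaches the maximum value of 1, and any misordering of sentiments with different intensities lowers the score.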

Mean Average Precision. Similar to NDCG in idea, Mean Average Precision (MAP) is also a ranking evaluation index that takes into account the position and the relevance of documents. It is composed of P, AP and MAP. MAP treats each document as either relevant or irrelevant to the query. Given a list of documents for a query, P of the i^th document is the proportion of relevant documents among the first i documents, and AP is the average of P over the relevant documents in the list. Furthermore, for several queries, MAP is the mean value of their AP. These values can be calculated as: $\begin{matrix} P & = \frac{\sum_{j = 1}^{i} r_{j}}{i} \\ AP & = {mean}_{i} (P) \\ MAP & = {mean}_{q} (AP) \end{matrix}$ (25) where r_j is 1 when the j^th document is relevant to the query and 0 otherwise. In this paper, the sentiments of a text sample play the role of the documents in the example, and a sentiment is assumed to be relevant only if the corresponding value in the label is not 0. Similarly, the value of MAP ranges from 0 to 1, and a larger value means better performance.
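
The three quantities in Eq. (25) can be computed as in the sketch below, where sentiments are sorted by predicted score and P is evaluated at the rank of every relevant sentiment. The function names are hypothetical, and returning an AP of 0 for a sample without relevant sentiments is an assumption.

```python
def average_precision(relevant, predicted_scores):
    """AP for one sample: mean of P at the ranks of the relevant sentiments."""
    order = sorted(
        range(len(predicted_scores)),
        key=lambda j: predicted_scores[j],
        reverse=True,
    )
    hits, precisions = 0, []
    for rank, j in enumerate(order, start=1):
        if relevant[j]:
            hits += 1
            precisions.append(hits / rank)  # P at this rank
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(relevant_lists, score_lists):
    """MAP: mean of the AP values over all samples."""
    aps = [average_precision(r, s) for r, s in zip(relevant_lists, score_lists)]
    return sum(aps) / len(aps)


# Relevant sentiments end up at ranks 1 and 3, so AP = (1/1 + 2/3) / 2 = 5/6.
ap = average_precision([1, 0, 1], [0.9, 0.8, 0.1])
# Second sample is ranked perfectly (AP = 1), so MAP = (5/6 + 1) / 2 = 11/12.
map_score = mean_average_precision(
    [[1, 0, 1], [1, 1, 0]],
    [[0.9, 0.8, 0.1], [0.7, 0.6, 0.2]],
)
```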

Ranking Loss. The last criterion to introduce is Ranking Loss. It also classifies sentiments as either relevant or irrelevant. For each prediction, it calculates the proportion of pairs <e_j, e_k> in which e_j is relevant and e_k is irrelevant, yet e_k is ranked ahead of e_j. The calculation is given by: $\begin{matrix} {Tuples}_{i} & = \{ < e_{i, j}, e_{i, k} > | e_{i, j} = 1, e_{i, k} = 0, s_{i, j} > s_{i, k} \} \\ RL & = \frac{1}{N} \sum_{i = 1}^{N} \frac{| {Tuples}_{i} |}{\sum_{j = 1}^{K} e_{i, j} \sum_{j = 1}^{K} (1 - e_{i, j})} \end{matrix}$ (26) where e_i,j represents the presence or absence of relevance of the j^th sentiment to the i^th sample and s_i,j represents the index of the corresponding sentiment in the ranking list that the model predicts.
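
A sketch of Eq. (26) is given below. It works with predicted scores instead of rank indices: a relevant sentiment that receives a lower score than an irrelevant one is exactly a pair in which the relevant sentiment has the larger rank index. The function name is hypothetical, and two conventions are assumptions made here: tied scores are not counted as errors, and a sample with no relevant or no irrelevant sentiment contributes 0.

```python
def ranking_loss(y_true, y_scores):
    """Average fraction of (relevant, irrelevant) pairs ranked in the wrong order."""
    total = 0.0
    for labels, scores in zip(y_true, y_scores):
        relevant = [j for j, e in enumerate(labels) if e == 1]
        irrelevant = [k for k, e in enumerate(labels) if e == 0]
        if not relevant or not irrelevant:
            continue  # no pairs to compare for this sample
        misordered = sum(
            1 for j in relevant for k in irrelevant if scores[j] < scores[k]
        )
        total += misordered / (len(relevant) * len(irrelevant))
    return total / len(y_true)


y_true = [[1, 0, 0], [1, 1, 0]]
y_scores = [[0.2, 0.5, 0.1], [0.9, 0.3, 0.6]]
rl = ranking_loss(y_true, y_scores)  # each sample has 1 of 2 pairs misordered
```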

4.3 Experimental configuration

In our experiments, we use Adam [39] for optimization. The learning rate is 3 × 10^-5 and the model is trained for 10 epochs. The parameters of the BERT module are loaded and warmed up [40] during the first epoch, i.e., a lower learning rate is used at the start of training in order to smooth the gradient updates. To avoid overfitting, we add a dropout layer after obtaining the feature x_final, with the dropout probability set to 0.2.

4.4 Baseline methods

We compare our model with several baselines, including CNN [8], BiLSTM [41], INN-RER [11] and SU4MLC [42].

CNN. A model based on CNN that uses a pre-trained word embedding matrix. The features are extracted from the matrix by the CNN and sent to fully connected layers for text classification.

BiLSTM. An attention-based bi-directional LSTM for classification. It uses the features extracted by the bi-directional LSTM, concatenated with position-attention-weighted word vectors, as the final features for text classification.

INN-RER. An interpretable neural network with topical information. This model generates topical information from topic models, transfers it into the neural network and thus trains the neural network to approximate the behavior of topic models in terms of topical distribution. Text classification is based on that distribution.

SU4MLC. A semantic-unit-based dilated CNN for multi-label text classification. Similarly, this model uses a bi-directional LSTM to extract the text features. Furthermore, abstract semantic features are extracted through multiple dilated convolution layers. Finally, an attention mechanism selectively blends semantic features from different layers for classification.

We apply the first three models and our model to SemEval 2007. The performances are described in Table 2. For Ren-CECps, we apply these models as well as SU4MLC. The comparison is shown in Table 3.

Table 2
Performance comparison on SemEval 2007. ↑ means that a larger value indicates better performance, and ↓ means that a smaller value indicates better performance

Model	micro-F1↑	macro-F1↑	Hamming loss↓	Ranking loss↓
CNN	0.7629	0.7530	0.3833	0.2856
BiLSTM	0.7758	0.7609	0.3411	0.2689
INN-RER	0.7156	0.6093	0.3005	0.2302
our model	0.8445	0.8234	0.2067	0.1302

Table 3

Performance comparison on Ren-CECps

Model	micro-F1↑	macro-F1↑	Hamming loss↓	Ranking loss↓
CNN	0.2976	0.2285	0.1508	0.1538
BiLSTM	0.5203	0.4454	0.1657	0.1702
INN-RER	0.6225	0.5133	0.3209	0.1924
SU4MLC	0.590	-	0.1782	-
Our model	0.6280	0.5537	0.1175	0.1093
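To make the classification columns of the tables easier to read, the following sketch computes Hamming loss, micro-F1 and macro-F1 on a small hypothetical example. It uses the standard textbook definitions, which are assumed here to match the criteria used in the paper; the toy label matrices and all function names are illustrative only.

```python
def hamming_loss(y_true, y_pred):
    """Fraction of (sample, label) slots that are predicted incorrectly."""
    total = sum(len(row) for row in y_true)
    wrong = sum(t != p for rt, rp in zip(y_true, y_pred) for t, p in zip(rt, rp))
    return wrong / total


def _counts(y_true, y_pred, j):
    """True positives, false positives and false negatives for label j."""
    tp = sum(1 for rt, rp in zip(y_true, y_pred) if rt[j] == 1 and rp[j] == 1)
    fp = sum(1 for rt, rp in zip(y_true, y_pred) if rt[j] == 0 and rp[j] == 1)
    fn = sum(1 for rt, rp in zip(y_true, y_pred) if rt[j] == 1 and rp[j] == 0)
    return tp, fp, fn


def _f1(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_f1(y_true, y_pred):
    """F1 computed from counts pooled over all labels."""
    k = len(y_true[0])
    counts = [_counts(y_true, y_pred, j) for j in range(k)]
    tp, fp, fn = (sum(c[i] for c in counts) for i in range(3))
    return _f1(tp, fp, fn)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-label F1 scores."""
    k = len(y_true[0])
    return sum(_f1(*_counts(y_true, y_pred, j)) for j in range(k)) / k


# Hypothetical data: 3 samples, 3 sentiment labels.
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 0], [1, 0, 1]]

scores = {
    "hamming": hamming_loss(y_true, y_pred),
    "micro_f1": micro_f1(y_true, y_pred),
    "macro_f1": macro_f1(y_true, y_pred),
}
```

Micro-F1 pools all label decisions and is therefore dominated by frequent sentiments, while macro-F1 weights every sentiment equally, which is why the two columns can diverge on imbalanced datasets.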

According to the tables, the proposed model outperforms the baseline models on every reported metric for both datasets. We attribute this to two reasons. First, we utilize the BERT model, which has been pre-trained on large-scale corpora; the development of deep learning in recent years has shown that large amounts of data can greatly improve model performance. Second, the classification module and the ranking module are trained in parallel but work together in applications, so that the problem is addressed at different levels.

To evaluate the ranking module independently, we apply our model with different ranking subnetworks to Ren-CECps. The results are shown in Table 4.

Table 4

Comparison of different ranking subnetworks

Subnetwork	micro-F1↑	macro-F1↑	NDCG↑	MAP↑
MSE	0.6205	0.5327	0.8434	0.7908
RankNet	0.6226	0.5282	0.8364	0.7802
ListNet	0.6280	0.5537	0.8477	0.7960

MSE refers to a network composed of three linear layers with ReLU that uses mean squared error as its loss function, which is a pointwise approach. RankNet is a pairwise approach and ListNet is a listwise approach. ListNet performs best on all four criteria. We think the reason ListNet outperforms RankNet in this task is that the number of sentiment types in the dataset is far smaller than the number of documents in a typical document retrieval task. As a result, RankNet has fewer item pairs to learn from and thus does not perform well. We therefore choose ListNet as the ranking subnetwork.
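The difference between the three families of ranking losses can be sketched on a single sample. The code below is a simplified illustration, not the implementation used in the experiments: the pairwise loss is the logistic loss on score differences for pairs with a known preference, the listwise loss is the cross entropy between top-one probability distributions obtained with a softmax, and the scores are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def mse_loss(pred, target):
    """Pointwise: each sentiment score is regressed independently."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def ranknet_loss(pred, target):
    """Pairwise: logistic loss over pairs (i, j) where i should outrank j."""
    losses = [
        math.log1p(math.exp(-(pred[i] - pred[j])))
        for i in range(len(pred))
        for j in range(len(pred))
        if target[i] > target[j]
    ]
    return sum(losses) / len(losses) if losses else 0.0


def listnet_loss(pred, target):
    """Listwise: cross entropy between top-one distributions of the full list."""
    p_true = softmax(target)
    p_pred = softmax(pred)
    return -sum(t * math.log(p) for t, p in zip(p_true, p_pred))


# Hypothetical sentiment intensities for one text with three sentiments.
target = [3.0, 1.0, 0.0]
good = [2.5, 1.0, 0.2]   # preserves the target order
bad = [0.2, 1.0, 2.5]    # reverses the target order

losses = {
    name: (fn(good, target), fn(bad, target))
    for name, fn in [("mse", mse_loss), ("ranknet", ranknet_loss), ("listnet", listnet_loss)]
}
```

With K sentiments, a sample yields at most K(K-1)/2 preference pairs for the pairwise loss, whereas the listwise loss always compares the whole predicted distribution with the target distribution in one term.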

4.5 Visualization of correlation matrix

Finally, we visualize the ranking performance of the model. For N samples, the classification result can be expressed as an N × K binary (0-1) matrix Y, where Y_i,j represents the presence or absence of the j^th sentiment in the i^th sample. The j^th column of Y can thus be seen as a representation of the j^th sentiment, which we denote by e_j. Based on these column vectors, we calculate the Pearson correlation coefficients between different sentiments and obtain the K × K correlation matrix M, which depicts their relationships. $\begin{matrix} M & = [M_{ij}]_{K \times K} \\ M_{ij} & = \frac{\mathrm{cov}(e_{i}, e_{j})}{\sigma_{e_{i}} \sigma_{e_{j}}} \end{matrix}$ (27)
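The computation of M can be sketched in a few lines. The label matrix below is hypothetical, the function names are illustrative, and returning NaN for a sentiment that never varies is a choice made for this example only. Since the coefficients are computed between the columns e_j, the result has one row and one column per sentiment.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if std_x == 0 or std_y == 0:
        # A constant column has no defined correlation.
        return float("nan")
    return cov / (std_x * std_y)


def correlation_matrix(Y):
    """Correlation between every pair of columns (sentiments) of Y."""
    k = len(Y[0])
    columns = [[row[j] for row in Y] for j in range(k)]
    return [[pearson(columns[i], columns[j]) for j in range(k)] for i in range(k)]


# Hypothetical 0-1 label matrix: N = 4 samples, K = 3 sentiments.
Y = [
    [1, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
    [0, 1, 1],
]
M = correlation_matrix(Y)
```

Applying the same function to the predicted and the true label matrices gives the two matrices that are compared visually.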

On this basis, we visualize the correlation matrices of Y_pred and Y_true, as depicted in Fig. 6. The correlation coefficients between sentiments in the predictions are similar to those in the labels, which means the model has good overall classification performance. However, the two matrices differ considerably in the correlation coefficient between Hate and Anger and in that between Sorrow and Anxiety, where the values for Y_pred are much larger. We think this is because these sentiments are closely related in text, a relation that is captured by the pre-trained BERT model, while the model lacks the ability to distinguish between them.

Fig. 6

Correlation matrix of sentiment.

To address this problem, the main approach is to add more training samples, so that the model can learn more semantic regularities and reduce its dependency on the pre-trained parameters in BERT. In addition, a branch that learns the relevance between different sentiments can be added to the model to provide extra supervision.

5 Conclusion and future work

In this paper, we have proposed a BERT-based classification model for fine-grained text sentiment analysis. The key idea is to enrich the features extracted by BERT with an attention mechanism and to use two subnetworks to detect and sort the sentiments; the subnetworks are trained in parallel but work together in applications. Experiments show that the proposed model outperforms the baseline models.

Although the proposed model performs well in terms of the similarity of correlation matrices, it may depend too heavily on the relevance between similar sentiments encoded in BERT. As future work, a direction worth exploring is to add a subnetwork that learns this relevance and supervises the existing model.

Footnotes

Acknowledgment

This work was partially supported by the State Key Laboratory of Software Development Environment of China (No. SKLSDE-2019ZX-16).

The source code of the proposed framework is made available at

References

1. Agarwal, Xie, Vovsha, Rambow and Passonneau, Sentiment analysis of twitter data, in Proceedings of the 2011 ACL Workshop on Language in Social Media, 2011, pp. 30–38.
2. Godbole, Srinivasaiah and Skiena, Large-scale sentiment analysis for news and blogs, in Proceedings of the First International Conference on Weblogs and Social Media, 2007.
3. Taboada, Brooke, Tofiloski, Voll K.D. and Stede, Lexicon-based methods for sentiment analysis, Computational Linguistics 37 (2) (2011), 267–307.
4. Esuli and Sebastiani, SENTIWORDNET: A publicly available lexical resource for opinion mining, in Proceedings of the 5th International Conference on Language Resources and Evaluation, 2006, pp. 417–422.
5. Baccianella, Esuli and Sebastiani, Sentiwordnet 3.0: An enhanced lexical resource for sentiment analysis and opinion mining, in Proceedings of the 7th International Conference on Language Resources and Evaluation, 2010.
6. Liang, Liu, Tan and Bai, Sentiment classification based on AS-LDA model, Procedia Computer Science 31 (2014), 511–516.
7. and Shen, Research on sentiment analysis of microblogging based on LSA and TF-IDF, in Proceedings of 3rd IEEE International Conference on Computer and Communications, 2017, pp. 2584–2588.
8. Kim, Convolutional neural networks for sentence classification, in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, 2014, pp. 1746–1751.
9. Day M.-Y. and Lin Y.-D., Deep learning for sentiment analysis on Google play consumer review, in Proceedings of 2017 IEEE International Conference on Information Reuse and Integration, 2017, pp. 382–388.
10. Huang, Peng, Li and Lee, Sentiment and topic analysis on social media: A multi-task multi-label classification approach, in Proceedings of ACM Web Science 2013 Conference, 2013, pp. 172–181.
11. Yang, Zhou and He, An interpretable neural network with topical information for relevant emotion ranking, in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018, pp. 3423–3432.
12. Liu S. M. and Chen, A multi-label classification based approach for sentiment classification, Expert Systems with Applications 42 (3) (2015), 1083–1093.
13. Howard and Ruder, Universal language model fine-tuning for text classification, in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, 2018, pp. 328–339.
14. Zhou, Yang and He, Relevant emotion ranking from text constrained with emotion relationships, in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2018, pp. 561–571.
15. A short introduction to learning to rank, IEICE Transactions on Information Systems 94-D (10) (2011), 1854–1862.
16. Devlin, Chang, Lee and Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2019, pp. 4171–4186.
17. Cao, Qin, Liu, Tsai and Li, Learning to rank: From pairwise approach to listwise approach, in Proceedings of the 24th International Conference on Machine Learning, 2007, pp. 129–136.
18. Das S. R. and Chen M. Y., Yahoo! for amazon: Extracting market sentiment from stock message boards, in Proceedings of the Asia Pacific Finance Association Annual Conference, 2001.
19. Huettner and Subasic, Fuzzy typing for document management, ACL 2000 Companion Volume: Tutorial Abstracts and Demonstration Notes, 2000, pp. 26–27.
20. and Liu, Mining and summarizing customer reviews, in Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2004, pp. 168–177.
21. Schwenk, Barrault, Conneau and LeCun, Very deep convolutional networks for text classification, in Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, 2017, pp. 1107–1116.
22. Rong, Peng, Ouyang, Li and Xiong, Structural information aware deep semi-supervised recurrent neural network for sentiment analysis, Frontiers of Computer Science 9 (2) (2015), 171–184.
23. Tang, Qin, Feng and Liu, Effective LSTMs for target-dependent sentiment classification, in Proceedings of the 26th International Conference on Computational Linguistics, 2016, pp. 3298–3307.
24. Tang, Qin and Liu, Document modeling with gated recurrent neural network for sentiment classification, in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 2015, pp. 1422–1432.
25. Rehman A.U., Malik A.K., Raza and Ali, A hybrid CNN-LSTM model for improving accuracy of movie reviews sentiment analysis, Multimedia Tools and Applications 78 (18) (2019), 26597–26613.
26. Gan, Wang, Zhang and Wang, Sparse attention based separable dilated convolutional neural network for target entities sentiment analysis, Knowledge-Based Systems 188 (2020).
27. Zha Z.-J., Wang and Chua T.-S., Aspect ranking: Identifying important product aspects from online consumer reviews, in Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, 2011, pp. 1496–1505.
28. Dohaiha H. H., Prasad, Maag and Alsadoon, Deep learning for aspect-based sentiment analysis: A comparative review, Expert Systems With Applications 118 (15) (2019), 272–299.
29. Ruder, Ghaffari and Breslin J. G., INSIGHT-1 at semeval-2016 task 5: Deep learning for multilingual aspect-based sentiment analysis, in Proceedings of the 10th International Workshop on Semantic Evaluation, 2016, pp. 330–336.
30. Gu and Wu, Cascaded convolutional neural networks for aspect-based opinion summary, Neural Processing Letters 46 (2) (2017), 581–594.
31. Gu, Sun and Gu, Aspect-based opinion summarization with convolutional neural networks, in Proceedings of 2016 International Joint Conference on Neural Networks, 2016, pp. 3157–3163.
32. Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez A. N., Kaiser and Polosukhin, Attention is all you need, in Proceedings of 2017 Annual Conference on Neural Information Processing Systems, 2017, pp. 6000–6010.
33. L.J., Kiros and Hinton G. E., Layer normalization, CoRR, abs/1607.06450, 2016.
34. Burges C.J.C., Shaked, Renshaw, Lazier, Deeds, Hamilton and Hullender G.N., Learning to rank using gradient descent, in Proceedings of the 22nd International Conference on Machine Learning, 2005, pp. 89–96.
35. Wang, Chen and Li, Empirical evaluation of rectified activations in convolutional network, CoRR, abs/1505.00853, 2015.
36. Quan and Ren, Construction of a blog emotion corpus for Chinese emotional expression analysis, in Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, 2009, pp. 1446–1454.
37. Hamming R. W., Error detecting and error correcting codes, The Bell System Technical Journal 29 (2) (1950), 147–160.
38. Järvelin and Kekäläinen, Cumulated gain-based evaluation of IR techniques, ACM Transactions on Information Systems 20 (4) (2002), 422–446.
39. Kingma D. P. and Ba, Adam: A method for stochastic optimization, in Proceedings of 3rd International Conference on Learning Representations, 2015.
40. Goyal, Dollár, Girshick R. B., Noordhuis, Wesolowski, Kyrola, Tulloch, Jia and He, Accurate, large minibatch SGD: training ImageNet in 1 hour, CoRR, abs/1706.02677, 2017.
41. Felbo, Mislove, Søgaard, Rahwan and Lehmann, Using millions of emoji occurrences to learn any-domain representations for detecting sentiment, emotion and sarcasm, in Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 2017, pp. 1615–1625.
42. Lin, Su, Yang, Ma and Sun, Semantic-unit-based dilated convolution for multi-label text classification, in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018, pp. 4554–4564.

Table 1

Datasets used in this paper

Name	Amount of samples	Types of sentiment
Ren-CECps	35096	Joy, Hate, Love, Sorrow, Anxiety, Surprise, Anger, Expect
SemEval 2007	1250	Joy, Anger, Disgust, Fear, Sad, Surprise