Abstract
This paper proposes a sentiment analysis framework based on ranking learning. The framework uses a BERT model pre-trained on large-scale corpora to extract text features and has two sub-networks for different sentiment analysis tasks. The first sub-network consists of multiple fully connected layers with rectified linear units between them. Its main purpose is to learn the presence or absence of various emotions from the extracted text features, and its supervision signal comes from the cross-entropy loss function. The other sub-network is a ListNet. Its main purpose is to learn a distribution that approximates the real distribution of different emotions by using the correlation between them. The predicted distribution can then be used to rank the emotions by importance. The two sub-networks are trained together and complement each other, which avoids the bias of a single network. The proposed framework has been tested on multiple datasets, and the results show its potential.
Introduction
As a long-standing hot topic in natural language processing, sentiment analysis has received much attention from researchers in recent years. It has been widely used in various applications because it can extract people’s opinions from documents [1]. It is essential for companies and organizations to understand the information behind online comments so that they can provide more satisfying services [2].
Much effort has been devoted to understanding the sentiment of a text. Earlier works used manually or automatically built sentiment lexicons to analyze the sentiment directly, since the sentiment of a text is often related to the sentiment-bearing words it contains [3]. Well-known examples include SentiWordNet and Opinion Lexicon [4, 5]. This kind of method can make use of expert knowledge, but it is labor-intensive and may carry subjective biases.
With the development of machine learning techniques, many machine learning based methods have been applied to this problem, e.g., Logistic Regression (LR), Naïve Bayes (NB), Support Vector Machines (SVM) and Maximum Entropy (ME) [6, 7]. More recently, along with the boom of deep learning, a large number of deep learning models have also been proposed for this challenge, e.g., Convolutional Neural Network (CNN) [8] and Recurrent Neural Network (RNN) based models [9]. These methods capture semantic information more efficiently for analysis.
Most current sentiment analysis methods focus on binary or ternary classification of the polarity of a text, i.e., positive, neutral or negative. However, sentiments in natural language are rich and vary in intensity, so a sentence may express several types of emotion, which means a document may contain more than one polarity [10]. A long document in particular may cover different aspects, each with its own emotional expression. The sentiment polarity of an aspect is then linked to only a part of the document, and the same document can carry different sentiments when analyzed from the perspective of different aspects.
For example, Weibo, one of the largest Chinese social media websites, lets people vote for the most accurate sentiment label of a given text, which produces more sentiment classes at low cost [11]. The popularization of social media thus provides more texts with fine-grained sentiment labels. How to analyze this kind of multi-label sentiment has attracted much attention in the community [12].
Currently, most works on multi-label sentiment analysis use classification-based methods [10, 12], which focus on whether a certain emotion is present in a sentence and annotate it as true or false accordingly [13]. Methods of this kind consider only whether each emotion polarity exists in the document. However, emotions in the same document usually differ in intensity [14], so the intensity ranking of different emotions can be useful information for the multi-label emotion classification task.
A given text can be viewed as a query, and the emotions it contains can be viewed as the query results to return. The whole process can therefore be viewed as a ranking learning process. Inspired by ranking learning algorithms [15], in this research we regard the emotion classification task as a ranking learning task as well, and let the model output the ranking of different emotions together with the classification results, so as to make better use of the intensity relationships between emotions.
In this paper we first adopt BERT [16] as the backbone of our model because of its strong representation ability. On this basis, we use a classification network to convert the extracted representation into an estimate for each kind of sentiment, which is essentially a multi-label classification task. To predict the importance of each kind of sentiment, we design a network inspired by ListNet [17] that estimates the distribution of sentiments. Our experimental results show that, by incorporating the ranking learning module, the proposed model can exploit more fine-grained information about emotional relationships, which contributes to better classification performance.
The rest of the paper is organized as follows. Section 2 discusses related work on sentiment analysis. Section 3 describes the proposed ranking based multi-label sentiment analysis model in detail. Section 4 presents the experimental study, and Section 5 summarizes the paper and outlines potential future directions.
Related Work
Sentiment analysis is one of the most fundamental applications in natural language processing and has attracted much attention for a long time. Considering that the emotion of a text is often related to its sentiment-bearing words, earlier methods employed manually or automatically built sentiment lexicons. Manually crafted lexicons are widely used, but they normally require large amounts of labor and need to be rebuilt for each new domain [18, 19]. In contrast, Liu et al. generate the sentiment lexicon by manually creating a small list of seed words with sentiment labels and then growing the list [20]. They treat adjectives as sentiment words, assign each adjective in a sentence a sentiment polarity according to the generated lexicon, flip the polarity of a sentiment word when a negation word appears within a distance of five words, and finally determine the sentiment polarity by majority voting.
With the development of machine learning technology, many machine learning based approaches have been developed. For example, Liang et al. proposed an Auxiliary-Sentiment Latent Dirichlet Allocation (AS-LDA) model, in which they assume that the words in subjective documents consist of two parts: sentiment element words and auxiliary words [6]. Similarly, Li et al. combined TF-IDF and latent semantic analysis (LSA) in machine learning models to predict the polarity of a document [7]. Recently, deep learning has been successfully applied to various tasks thanks to its ability to extract features automatically. For example, the Convolutional Neural Network (CNN) was initially used for images, but it has proved effective in natural language processing as well. Kim et al. reported a series of experiments with CNNs trained on top of pre-trained word vectors for sentence-level classification tasks [8], which showed excellent results on multiple benchmarks. Conneau et al. [21] proposed a new architecture (VDCNN) for text processing that operates directly at the character level and uses only small convolutions and pooling operations.
Alongside CNN models, the Recurrent Neural Network (RNN) and its variants are also widely used in natural language processing because of their advantages in modeling positional relationships and dependencies [22]. Tang et al. proposed Target-Dependent LSTM (TDLSTM) and Target-Connection LSTM (TCLSTM) for targeted sentiment analysis [23]. Day et al. [9] used LSTM for sentiment analysis of Google Play consumer reviews. Similarly, Tang et al. [24] introduced a model that first learns sentence representations with LSTM; the semantics of sentences and their relations are then adaptively encoded in the document representation with a gated recurrent neural network. In this way, semantic information between sentences can be captured and better analysis results can be achieved. RNNs are effective in sentiment analysis, though they still have limitations. For example, they cannot distinguish the importance of different contextual cues.
Some hybrid models are also reported in the literature. Rehman et al. [25] proposed a hybrid of LSTM and a very deep CNN, named the Hybrid CNN-LSTM Model, which combines features extracted by convolution and global max-pooling layers with long-term dependencies. Gan et al. [26] proposed a sparse attention based separable dilated convolutional neural network (SA-SDCCN), which consists of a multichannel embedding layer, a separable dilated convolution module, a sparse attention layer and an output layer. SA-SDCCN achieves performance comparable to or better than state-of-the-art methods.
Though most methods detect the polarity of the document as a whole, some researchers have pointed out the importance of aspect-level emotion expression. Yu et al. extracted sentiment information in the form of feature vectors from the pros and cons sections of reviews, where positive and negative opinions on the aspects are expressed explicitly, and used this information to train an SVM sentiment classifier [27]. CNNs are also widely used for finer-grained emotion detection. Motivated by the assumption that key words may contain the aspect term and indicate the category or polarity of an aspect regardless of their position [28], CNN models are used in aspect sentiment analysis. For example, Ruder et al. [29] concatenate every word embedding with the aspect vector and use a CNN to determine the sentiment of the aspect. Gu et al. organize CNN aspect mappers and a CNN sentiment classifier in a cascaded way, where the mappers detect the aspect category and the classifier predicts the aspect sentiment polarity [30]. Wu et al. propose a multitask CNN that also contains aspect mappers and a sentiment classifier [31].
To better capture fine-grained emotions, the importance of multi-label sentiment analysis has been recognized [12]. Huang et al. proposed a multi-task multi-label approach to this problem [10]. Emotion ranking is also reported to be an important feature for this task. For example, Yang et al. used topics as an important feature in emotion ranking [11]. Similarly, Zhou et al. proposed a ranking method for sentiment analysis [14]. Identifying emotions from different perspectives is regarded as a critical challenge.
Ranking based multi-label sentiment classification
The overall sentiment analysis model proposed in this paper is shown in Fig. 1, where $w_1, w_2, \dots, w_n$ represent the words composing the sentence, while [CLS] and [SEP] are the special tokens added to the beginning and the end. First, we employ the BERT model to extract features from the preprocessed data. Then the classification and ranking sub-networks use these features to detect and rank the sentiments.

Fig. 1. The overall architecture.
Recently, there has been a growing trend of fine-tuning features extracted from a pre-trained language model on downstream tasks under supervision. Such pre-training has generally been shown to improve the results of downstream tasks. Devlin et al. proposed Bidirectional Encoder Representations from Transformers (BERT) [16], a model that stacks multiple Transformer [32] encoders. It has been trained on a large corpus from Wikipedia and has advanced the state of several natural language processing tasks.
Following the transfer learning paradigm, we load the BERT model pre-trained on large-scale corpora. A single encoder layer of BERT [16] is shown in Fig. 2. After pre-processing, each sentence is converted to a list of tokens. The special tokens [CLS] and [SEP] are then inserted at the head and the tail of the token list. Apart from the word embeddings, position and segment embeddings are calculated, and the three are summed at the embedding layer of BERT. In other words, the embedding of a sentence is calculated as:
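Under the standard BERT input representation, in which the three embeddings have the same shape and are added element-wise, this sum can be sketched as follows. The symbol names $E_{token}$, $E_{position}$ and $E_{segment}$ are assumptions chosen here for illustration.

```latex
X = E_{token} + E_{position} + E_{segment}
```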

Fig. 2. A single encoder layer in BERT.
After layer normalization and dropout [33], X is sent to the 12 stacked identical encoder layers of BERT, in which more abstract features are extracted. The key idea of the encoder is the self-attention part, which can be calculated as:
In these equations, Q, K and V represent the query, key and value information, respectively. The self-attention part is followed by a linear layer, a tanh activation function, dropout and layer normalization. These layers give the encoder high expressive power and resistance to overfitting.
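As a minimal sketch of the self-attention computation described above, the following NumPy code assumes the usual scaled dot-product form with a single head and no masking. BERT itself uses several heads in parallel, and the projection matrices `W_q`, `W_k` and `W_v` as well as the toy dimensions are hypothetical names and values chosen for illustration.

```python
import numpy as np

def softmax(z, axis=-1):
    # Subtract the row maximum for numerical stability.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    # Project the token representations into queries, keys and values.
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = K.shape[-1]
    # Each row of `weights` says how much one token attends to every token.
    weights = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)
    return weights @ V, weights

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))  # 5 tokens, hidden size 8 (toy values)
W_q, W_k, W_v = (rng.normal(size=(8, 4)) for _ in range(3))
out, weights = self_attention(X, W_q, W_k, W_v)
```

Each row of `weights` sums to one, so every output vector is a convex combination of the value vectors.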
The part where features are output from BERT is depicted in Fig. 3. The standard BERT model only makes use of $x_{cls}$, the final hidden vector corresponding to [CLS], which is interpreted as the aggregate representation of the sentence. This may not be enough, so we design a position-attention module for feature extraction, because the intensities of words vary with their positions. This part obtains the feature $x_{attention}$ through the following equation:

Fig. 3. The output part of the BERT module.
To further enrich the features, we calculate the max pooling and the mean pooling of $X_{seq}$ along the sentence, which we refer to as $x_{max}$ and $x_{mean}$. Finally, we obtain the final representation of the sentence by concatenating $x_{cls}$, $x_{attention}$, $x_{max}$ and $x_{mean}$, which are the features we use for classification and ranking:
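The pooling and concatenation steps can be sketched in NumPy as follows. The attention step here is only a stand-in: it scores each token with one hypothetical learned vector `w_att` and is not the position-attention module of this paper, whose exact form is not reproduced. The function name and toy sizes are likewise assumptions.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def build_final_feature(X_seq, x_cls, w_att):
    # X_seq: (seq_len, hidden) token-level hidden states from the encoder.
    alpha = softmax(X_seq @ w_att)   # one weight per token (assumed scoring)
    x_attention = alpha @ X_seq      # attention-weighted sum of tokens
    x_max = X_seq.max(axis=0)        # max pooling along the sentence
    x_mean = X_seq.mean(axis=0)      # mean pooling along the sentence
    return np.concatenate([x_cls, x_attention, x_max, x_mean])

rng = np.random.default_rng(0)
X_seq = rng.normal(size=(6, 4))      # 6 tokens, hidden size 4 (toy values)
x_cls = rng.normal(size=4)
w_att = rng.normal(size=4)
x_final = build_final_feature(X_seq, x_cls, w_att)
```

With a hidden size of 4, the concatenated feature has 16 dimensions, four times the hidden size.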
In this research, we consider sentiment intensity an important feature in multi-label classification. Ranking algorithms such as RankNet [34] and ListNet [17] have been widely applied to ranking learning tasks. We propose to rank the sentiments by their intensities. The structure of this sub-network, which differs slightly from the standard ListNet, is depicted in Fig. 4.

Fig. 4. The structure of the ranking subnetwork.
Learning To Rank (L2R) is a supervised learning method that has been successful in document retrieval, collaborative filtering and many other tasks. Given the features of several documents X and a query q, an L2R model calculates the relevance between X and q and sorts all the documents according to that relevance. Frequently used features include the click rate, the quality of the document and the number of words co-occurring in the query and the document. A direct way to do L2R is to calculate the score of each document independently. This kind of method is called the pointwise approach. As only the independent information of each document is taken into consideration, the results of the pointwise approach are usually not very good.
The pairwise approach instead learns from the relative order of two documents at a time. As it only considers the relationship between one pair of documents at a time and ignores global information, some researchers postulate that methods taking lists of documents as learning instances could behave better. These methods are referred to as the listwise approach, and a typical one is ListNet [17]. This model supposes that the priority of each document $x_i$ accounts for a share $s_i$ of the sum, so the probability of the ranking $\pi = \langle x_{\pi(1)}, x_{\pi(2)}, \dots, x_{\pi(n)} \rangle$ can be calculated as:
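Assuming the permutation probability model commonly used with ListNet, in which documents are drawn one position at a time in proportion to their remaining scores, the probability can be sketched as follows. The notation $P_s(\pi)$ is chosen here for illustration, and ListNet usually passes the raw scores through an increasing positive function such as the exponential before applying this formula.

```latex
P_s(\pi) = \prod_{j=1}^{n} \frac{s_{\pi(j)}}{\sum_{k=j}^{n} s_{\pi(k)}}
```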
To estimate the similarity between the predicted probability distribution and the real one, we can use Kullback-Leibler divergence as the loss function:
It should be noted that ranking the sentiments of a text should follow their classification, because estimating sentiment intensities makes no sense if the sentiments are unknown. However, the classification module cannot perform well at the start of training. If the classification module were followed by the ranking module during training, wrong classification information would flow into the ranking module and could lead to cascading errors. For this reason, the two modules are designed in parallel, and the final loss function of the model is the sum of the loss functions of the two modules. Once the model is trained and deployed, the ranking module can use the classification results predicted by the classification module.
To simplify the calculation, we only use the top-1 probability distribution to estimate and improve the performance of the model. The loss function used here is the Kullback-Leibler divergence:
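A minimal NumPy sketch of a top-1 ListNet loss is given here. It assumes that both the real intensities and the predicted scores are turned into top-1 probabilities with a softmax, which is the usual ListNet choice but an assumption about this model. The function names and the `eps` smoothing term are hypothetical.

```python
import numpy as np

def top1_distribution(scores):
    # Top-1 probability of each item: softmax over the scores of one sample.
    z = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def listnet_kl_loss(true_scores, pred_scores, eps=1e-12):
    p = top1_distribution(np.asarray(true_scores, dtype=float))
    q = top1_distribution(np.asarray(pred_scores, dtype=float))
    # KL(p || q), summed over sentiments and averaged over samples.
    kl = np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)
    return float(np.mean(kl))

true_scores = np.array([[3.0, 1.0, 0.0]])
loss_same = listnet_kl_loss(true_scores, true_scores)
loss_reversed = listnet_kl_loss(true_scores, np.array([[0.0, 1.0, 3.0]]))
```

The loss is zero when the predicted scores reproduce the real distribution and grows as the predicted order departs from the real one.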
The feature $x_{final}$ extracted from BERT can be used for multi-label classification of sentiments as well as for ranking them. The classification module is shown in Fig. 5.

Fig. 5. The structure of the classification subnetwork.
The structure is generally similar to that of the ranking module, but the two differ in the loss function and in the meaning of the output. In this sub-network, the output vector indicates the presence or absence of each sentiment rather than the proportion score of each sentiment. The output of the last fully connected layer is real-valued. The loss function layer uses a sigmoid to map the output into (0, 1) and then uses Binary Cross Entropy Loss to compute the loss for each sentiment. This procedure is illustrated below:
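The sigmoid and binary cross-entropy step can be sketched in NumPy as follows. Averaging the per-sentiment losses over all sentiments and samples is an assumption, and the function names and the `eps` term are chosen for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_with_logits(logits, targets, eps=1e-12):
    # Map real-valued outputs into (0, 1), one probability per sentiment.
    p = sigmoid(np.asarray(logits, dtype=float))
    t = np.asarray(targets, dtype=float)
    # Binary cross entropy for every sentiment independently.
    loss = -(t * np.log(p + eps) + (1.0 - t) * np.log(1.0 - p + eps))
    return float(loss.mean())

targets = np.array([[1, 0, 1, 0]])
loss_uninformed = bce_with_logits(np.zeros((1, 4)), targets)
loss_confident = bce_with_logits(np.array([[10.0, -10.0, 10.0, -10.0]]), targets)
```

With all logits at zero every probability is 0.5, so the loss equals log 2; confident correct logits drive it towards zero.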
In this sub-network, multiple fully connected layers and activation functions are stacked to capture information for classification. The activation functions used here are Rectified Linear Units (ReLU) [35], which perform the following operation:
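In its standard form, applied element-wise to the input $x$, the operation is usually written as:

```latex
\mathrm{ReLU}(x) = \max(0, x)
```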
The output of ReLU stays at 0 when the input is negative, so ReLU has some resistance to noise and thus suits the needs of classification tasks.
Dataset
There are many datasets related to sentiment analysis. However, as early studies of sentiment analysis mainly focused on judging sentiment polarity, most of the datasets that researchers collected and published are labeled with a limited set of polarities. Typical examples include the IMDB film review dataset and Sentiment140. In these datasets, sentiments are simply classified as positive, negative or neutral. Human sentiments are far more varied, so three polarities are not enough to describe them. Therefore, some researchers started to construct datasets with fine-grained sentiment labels. Table 1 lists the datasets we use in this paper. Ren-CECps [36] is a Chinese microblog dataset with fine-grained labels. It collects 1487 microblogs in Chinese and gives each microblog scores ranging from 0 to 1 on 8 types of sentiment to represent the corresponding intensities. To keep the model precise, we ignore sentences whose length is greater than 300 or less than 5. The other dataset, SemEval2007, is the dataset for Task 14 of SemEval 2007. It contains 1250 English headlines, each of which is labeled with 6 types of sentiment in the form of scores ranging from 0 to 100. As English headlines are usually short, we use all of them. We shuffle each dataset, select the first 80 percent as the training set and use the rest for testing.
Table 1. Datasets used in this paper
Criteria for classification performance
The classification module predicts the presence or absence of every type of sentiment, which is in fact a multi-label classification task. We use Hamming distance, macro-F1 score and micro-F1 score as criteria.
Hamming Distance. Hamming distance [37] is an index usually used in information theory to measure the difference between two strings. The main idea is to describe how many changes are needed to convert one string into the other. In other words, it is the number of positions at which the characters of the two strings differ. For strings composed of 0 and 1, the Hamming distance can be calculated as follows:
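As a small sketch, the count of differing positions can be computed in NumPy as follows. The normalized variant `hamming_loss`, which divides by the number of label positions, is an assumption about how the value is averaged over a dataset; both function names are chosen for illustration.

```python
import numpy as np

def hamming_distance(y_true, y_pred):
    # Number of positions where the two 0-1 label vectors differ.
    return int(np.sum(np.asarray(y_true) != np.asarray(y_pred)))

def hamming_loss(Y_true, Y_pred):
    # Fraction of wrong labels over all samples and all sentiments.
    return float(np.mean(np.asarray(Y_true) != np.asarray(Y_pred)))

distance = hamming_distance([1, 0, 1, 1], [1, 1, 0, 1])
loss = hamming_loss([[1, 0], [0, 1]], [[1, 1], [0, 1]])
```

For this criterion a smaller value indicates better classification.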
F1 Score. The F1 score is another criterion for evaluating classification performance that takes both precision and recall into account. Taking binary classification as an example, we can define TP, FP and FN as below:
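The following NumPy sketch counts TP, FP and FN per sentiment and derives both averages, assuming the standard definitions: macro-F1 averages the per-sentiment F1 scores, while micro-F1 pools the counts over all sentiments first. The function names are hypothetical, and an F1 with an empty denominator is set to 0 by convention.

```python
import numpy as np

def counts(Y_true, Y_pred):
    # Per-sentiment counts over all samples (columns are sentiments).
    tp = np.sum((Y_true == 1) & (Y_pred == 1), axis=0)
    fp = np.sum((Y_true == 0) & (Y_pred == 1), axis=0)
    fn = np.sum((Y_true == 1) & (Y_pred == 0), axis=0)
    return tp, fp, fn

def f1_from_counts(tp, fp, fn):
    # F1 = 2TP / (2TP + FP + FN), the harmonic mean of precision and recall.
    denom = 2 * tp + fp + fn
    return np.where(denom > 0, 2 * tp / np.maximum(denom, 1), 0.0)

def macro_micro_f1(Y_true, Y_pred):
    Y_true, Y_pred = np.asarray(Y_true), np.asarray(Y_pred)
    tp, fp, fn = counts(Y_true, Y_pred)
    macro = float(np.mean(f1_from_counts(tp, fp, fn)))
    micro = float(f1_from_counts(tp.sum(), fp.sum(), fn.sum()))
    return macro, micro

Y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
Y_pred = [[1, 0, 0], [0, 1, 1], [1, 0, 0]]
macro, micro = macro_micro_f1(Y_true, Y_pred)
```

In this toy example the per-sentiment F1 scores are 1, 2/3 and 0, so macro-F1 is 5/9, while the pooled counts (TP = 3, FP = 1, FN = 2) give a micro-F1 of 2/3.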
The ranking module evaluates the intensities of different sentiments in the form of an order. We assess its performance with typical evaluation indexes used in L2R.
Normalized Discounted Cumulative Gain. Normalized Discounted Cumulative Gain (NDCG) [38] is a frequently used criterion for the performance of L2R models. To obtain the NDCG, we first have to calculate the Discounted Cumulative Gain (DCG). Given a query q and m recommended documents, and supposing that $r_i$ represents the relevance index of the $i$-th document to the query, we can get the DCG as:
This index grows as high-relevance documents are placed nearer the top, but it also grows if the relevance indexes themselves become bigger. Therefore, it has to be normalized by the IDCG (Ideal Discounted Cumulative Gain), the DCG of the same documents sorted in descending order of relevance. For example, given k documents, the sentiments of which are $e_1, e_2, \dots, e_k$ from top to bottom according to real intensity and are predicted to be in the order $e_{\pi(1)}, e_{\pi(2)}, \dots, e_{\pi(k)}$, the NDCG is calculated as:
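A minimal NumPy sketch of DCG and NDCG is given here. It assumes the gain form $r_i / \log_2(i + 1)$; another common variant uses $(2^{r_i} - 1)$ in the numerator, and which one is used in this paper is not stated. The function names are chosen for illustration.

```python
import numpy as np

def dcg(relevances):
    r = np.asarray(relevances, dtype=float)
    positions = np.arange(1, len(r) + 1)
    # Later positions are discounted logarithmically.
    return float(np.sum(r / np.log2(positions + 1)))

def ndcg(true_intensity, pred_scores):
    true_intensity = np.asarray(true_intensity, dtype=float)
    # Order the sentiments by predicted score, highest first.
    order = np.argsort(-np.asarray(pred_scores, dtype=float), kind="stable")
    # The ideal order sorts by the real intensity.
    idcg = dcg(np.sort(true_intensity)[::-1])
    return dcg(true_intensity[order]) / idcg if idcg > 0 else 0.0

perfect = ndcg([3, 2, 1], [0.9, 0.5, 0.1])
reversed_order = ndcg([3, 2, 1], [0.1, 0.5, 0.9])
```

A prediction that reproduces the real order reaches an NDCG of 1, and any other order scores lower.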
Mean Average Precision. Similar in idea to NDCG, Mean Average Precision (MAP) is an evaluation index of ranking that also takes into account the position and the relevance of documents. It is composed of P, AP and MAP. MAP treats each document as either relevant or irrelevant to the query. Given a list of documents for a query, P of the $i$-th document is the proportion of relevant documents among the first i ones, and AP is the average of P over the relevant documents in the list. For several queries, MAP is the mean value of their AP. These values can be calculated as:
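The three quantities can be sketched in NumPy as follows, assuming that the list is ordered by predicted score and that a query without any relevant item contributes an AP of 0. The function names are hypothetical.

```python
import numpy as np

def average_precision(relevant, pred_scores):
    relevant = np.asarray(relevant)
    # Rank the items by predicted score, highest first.
    order = np.argsort(-np.asarray(pred_scores, dtype=float), kind="stable")
    hits = relevant[order]
    if hits.sum() == 0:
        return 0.0
    # P at position i: share of relevant items among the first i.
    precision_at_i = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    # AP: average of P over the positions that hold a relevant item.
    return float(np.sum(precision_at_i * hits) / hits.sum())

def mean_average_precision(relevant_lists, score_lists):
    aps = [average_precision(r, s) for r, s in zip(relevant_lists, score_lists)]
    return float(np.mean(aps))

ap = average_precision([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1])
map_value = mean_average_precision(
    [[1, 0, 1, 0], [0, 1]],
    [[0.9, 0.8, 0.7, 0.1], [0.2, 0.9]],
)
```

In the first query the relevant items sit at ranks 1 and 3, so AP is (1 + 2/3) / 2 = 5/6; the second query is ranked perfectly, giving a MAP of 11/12.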
Ranking Loss. The last criterion to introduce is Ranking Loss. It also classifies sentiments as either relevant or irrelevant. For each prediction, it calculates the proportion of tuples $\langle e_j, e_k \rangle$ where $e_j$ is relevant and $e_k$ is irrelevant but $e_k$ is ranked ahead of $e_j$. The calculation is illustrated as:
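For a single sample, this proportion can be sketched in NumPy as follows. Counting ties between a relevant and an irrelevant sentiment as errors, and returning 0 when one of the two groups is empty, are assumptions made here; the function name is hypothetical.

```python
import numpy as np

def ranking_loss(relevant, pred_scores):
    relevant = np.asarray(relevant).astype(bool)
    scores = np.asarray(pred_scores, dtype=float)
    pos, neg = scores[relevant], scores[~relevant]
    if len(pos) == 0 or len(neg) == 0:
        return 0.0
    # A pair is wrongly ordered when the irrelevant sentiment scores at
    # least as high as the relevant one.
    wrong = np.sum(pos[:, None] <= neg[None, :])
    return float(wrong / (len(pos) * len(neg)))

loss = ranking_loss([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1])
perfect = ranking_loss([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1])
```

In the first example one of the four relevant-irrelevant pairs is inverted, so the loss is 0.25.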
In our experiments, we use Adam [39] for optimization. The learning rate is $3 \times 10^{-5}$ and the model is trained for 10 epochs. The parameters of the BERT module are loaded and warmed up [40] during the first epoch, that is, a lower learning rate is used at the start of training in order to smooth the gradient updates. To avoid overfitting, we add a dropout layer after obtaining the feature $x_{final}$, with the dropout probability set to 0.2.
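One possible warm-up schedule matching this description is sketched here. A linear ramp over the first epoch is an assumption, since the shape of the warm-up is not specified; the function name and the `steps_per_epoch` parameter are hypothetical.

```python
def warmup_lr(step, steps_per_epoch, base_lr=3e-5, warmup_epochs=1):
    # Linearly raise the learning rate during the warm-up epochs,
    # then keep it at the base value for the rest of training.
    warmup_steps = warmup_epochs * steps_per_epoch
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr

first = warmup_lr(0, steps_per_epoch=100)
last_warmup = warmup_lr(99, steps_per_epoch=100)
later = warmup_lr(500, steps_per_epoch=100)
```

The value returned at each step would be assigned to the optimizer before the corresponding update.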
Baseline methods
We compare our model with several baselines, including CNN [8], BiLSTM [41], INN-RER [11] and SU4MLC [42].
CNN. A CNN-based model that uses a pre-trained word embedding matrix. Features are extracted from the matrix by the CNN and sent to fully connected layers for text classification.
BiLSTM. An attention-based bi-directional LSTM for classification. It uses the features extracted by the bi-directional LSTM, concatenated with position-attention-weighted word vectors, as the final features for text classification.
INN-RER. An interpretable neural network with topical information. This model generates topical information from topic models, transfers it into the neural network and thus trains the neural network to approximate the behavior of topic models in terms of topical distribution. Text classification is based on that distribution.
SU4MLC. A semantic-unit-based dilated CNN for multi-label text classification. Similarly, this model uses a bi-directional LSTM to extract text features. Furthermore, abstract semantic features are extracted through multiple dilated convolution layers. Finally, an attention mechanism selectively blends semantic features from different layers for classification.
We apply the first three models and our model to SemEval 2007. The performances are described in Table 2. For Ren-CECps, we apply these models as well as SU4MLC. The comparison is shown in Table 3.
Table 2. Performance comparison on SemEval 2007. The character ↑ means that a larger value indicates better performance, and ↓ means the opposite.
Table 3. Performance comparison on Ren-CECps
According to the tables, the proposed model has a clear advantage over the baseline models. We attribute this to two reasons. On the one hand, we use the BERT model, which has been trained on large-scale corpora, and the development of deep learning in recent years has shown that large amounts of data can greatly improve model performance. On the other hand, the classification module and the ranking module are designed in parallel at the training stage but work together at application time, addressing the problem at different levels.
To evaluate the performance of the ranking module independently, we apply our model with different ranking subnetworks to Ren-CECps. The experimental results are shown in Table 4.
Table 4. Comparison of different ranking subnetworks
MSE refers to a network composed of three linear layers with ReLU that uses Mean Square Error as the loss function, which is a pointwise approach. RankNet is a pairwise approach and ListNet is a listwise approach. ListNet performs best on every criterion. We think the reason ListNet performs better than RankNet in this task is that the number of sentiment types in the dataset is far smaller than the number of documents in document retrieval. As a result, RankNet has fewer pairs to be trained on and thus does not perform very well. We therefore choose ListNet as the ranking subnetwork.
Finally, we show the ranking performance of the model in a visualized way. For N samples to classify, the result can be expressed as an N × K 0-1 matrix, which we refer to as Y, where $Y_{i,j}$ represents the presence or absence of the $j$-th sentiment in the $i$-th sample. Thus, the $j$-th column of Y can be seen as a representation of the $j$-th sentiment, which we refer to as $e_j$. Based on the column vectors, we can calculate the Pearson correlation coefficients between different sentiments and obtain the correlation matrix M, which depicts their relationships.
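The correlation matrix can be computed directly with NumPy, as in the sketch here. The function name and the toy label matrix are assumptions for illustration. A sentiment that is present in every sample or in none has zero variance, so its coefficients are undefined and NumPy returns NaN for them.

```python
import numpy as np

def sentiment_correlation(Y):
    # Y: (N, K) 0-1 matrix; columns are sentiments, rows are samples.
    Y = np.asarray(Y, dtype=float)
    # rowvar=False treats each column as one variable, giving a K x K matrix.
    return np.corrcoef(Y, rowvar=False)

Y = [[1, 1, 0],
     [0, 0, 1],
     [1, 1, 0],
     [0, 0, 1]]
M = sentiment_correlation(Y)
```

In this toy matrix the first two sentiments always co-occur and the third never appears with them, so their coefficients are 1 and -1 respectively.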
On this basis, we visualize the correlation matrices of $Y_{pred}$ and $Y_{true}$ in Fig. 6. The correlation coefficients between sentiments in the predictions are similar to those in the labels, which means the model has quite good overall classification performance. However, the two matrices differ considerably in the correlation coefficient of Hate and Anger and that of Sorrow and Anxiety, where the values for $Y_{pred}$ are much larger. We think this is because these sentiments are highly related in text, which is recorded in the pre-trained BERT model, while the model lacks the ability to distinguish between them.

Fig. 6. Correlation matrix of sentiments.
The main way to solve this problem is to add more training samples, so that the model can learn more semantic regularities and reduce its dependency on the pre-trained parameters in BERT. In addition, a branch that learns the relevance between different sentiments can be added to the model to provide extra supervision.
In this paper, we have proposed a BERT-based classification model for fine-grained text sentiment analysis. The key idea is to enrich the features BERT extracts with an attention mechanism and to use two subnetworks, trained in parallel but working together in applications, to detect and rank the sentiments. Experiments show that the proposed model outperforms the baseline models.
Though our proposed model performs well in terms of the similarity of the correlation matrices, it may depend too much on the relevance between similar sentiments recorded in BERT. As future work, a direction worth attempting is to add a subnetwork that learns this relevance and supervises the existing model.
