Abstract
The purpose of sentiment classification is to automatically judge the sentiment tendency of text. In sentiment classification of online reviews, traditional models focus on optimizing algorithm performance but ignore the imbalanced distribution of sentiment categories in online reviews, which seriously degrades classification performance in practical applications. The proposed method is divided into two stages. The first stage trains SimBERT on online review data so that SimBERT can fully learn the semantic features of online reviews. The second stage uses the trained SimBERT model to generate synthetic minority samples and mixes them with the original samples to obtain a dataset with a balanced distribution. The mixed dataset is then input into a deep learning model to complete the sentiment classification task. Experimental results show that, in the sentiment classification of hotel online reviews, this method achieves better classification performance than traditional deep learning models and models based on other imbalance processing methods.
Introduction
With the development of the Internet, users are not only the recipients of Internet information, but also the producers of Internet information. Online reviews usually contain sentiment information that users want to express, and analyzing sentiment information can not only provide reference for other users’ product purchases, but also provide data support for product recommendations and the application of opinion monitoring technology.
Mining the sentiment information of online reviews is therefore of great significance, and sentiment classification is the technology used to do so. Sentiment classification of online reviews refers to the automatic identification of the sentiment tendency of online reviews [1].
When completing the sentiment classification task of online reviews, many researchers train their models on data with a balanced distribution of sentiment categories in order to obtain satisfactory model performance [2–5]. However, this is inconsistent with the actual situation. Online reviews are often imbalanced across sentiment categories, which degrades the classification performance of such models in practical applications [6]. At the same time, directly training on data with an imbalanced sentiment distribution biases the model towards the category with more samples, so good classification performance is still difficult to obtain. For the sake of brevity, the sentiment tendency category with a large number of samples is called the majority category, and the sentiment tendency category with a small number of samples is called the minority category; their samples are called majority samples and minority samples respectively. Although minority samples are few, they usually have higher value than majority samples. For example, when users purchase goods on e-commerce platforms, they tend to pay more attention to the minority of online reviews with negative sentiment tendencies.
The traditional imbalanced classification methods can be divided into two categories: undersampling and oversampling [7–8]. After undersampling or oversampling, the number of samples in each category of the dataset is balanced. Undersampling removes part of the majority samples. Oversampling creates new minority samples by repeating minority samples or by simply learning their characteristics.
Traditional imbalanced classification methods therefore either delete part of the majority samples, or simply repeat or synthesize minority samples to balance the distribution of the number of samples. In order to fully learn the characteristics of minority samples, this paper uses Integrating Retrieval and Generation into BERT (SimBERT) [9] to complete the sentiment classification task of online reviews with imbalanced data distribution. The training of SimBERT is divided into two stages: pre-training and fine-tuning. In this paper, the default parameters of SimBERT are used in the pre-training stage, and only the fine-tuning stage is modified. First, SimBERT is fine-tuned using real online reviews. After that, the fine-tuned SimBERT is used to generate synthetic minority online reviews, and the generated reviews are mixed with the original review samples to balance the distribution of sample numbers. Finally, a Bi-directional Long Short-Term Memory (BiLSTM) neural network, which can capture bidirectional semantic dependencies, is used to complete the sentiment classification task of online reviews. Experiments show that, compared with traditional imbalanced sentiment classification methods, this method achieves higher sentiment classification performance.
The structure of this paper is arranged as follows: Section 2 introduces some work related to this paper in detail; Section 3 presents the imbalanced sentiment classification model based on SimBERT proposed in this paper; Section 4 conducts experiments and gives the results and corresponding analysis; Section 5 concludes this paper and presents an outlook for the future.
Related work
Sentiment classification
At present, sentiment classification follows two main research directions: dictionary-based sentiment classification and machine-learning-based sentiment classification [10]. Dictionary-based sentiment classification can achieve high classification performance, but constructing a sentiment dictionary consumes considerable labor, and dictionary-based methods struggle with the out-of-vocabulary (OOV) problem [11, 12]. Compared with dictionary-based methods, machine-learning-based sentiment classification avoids the OOV problem and reduces manual work [2]. However, traditional machine learning models require manual feature engineering, so model performance is heavily constrained by the quality of the features. In recent years, deep learning has developed rapidly as a branch of machine learning. Deep learning models remove the feature engineering step of traditional machine learning models, reducing the interference of human factors in model training, and they perform well in sentiment classification tasks [13–15]. Reference [16] proposed Bidirectional Encoder Representations from Transformers (BERT), which is a milestone work in the field of NLP. BERT has excellent classification performance in sentiment classification tasks.
These studies are all based on the premise that the distribution of each sentiment category of the sample is balanced, and the sentiment classification research for imbalanced online reviews is still relatively lacking.
Imbalanced classification
At present, imbalanced classification is mainly divided into three research directions: oversampling, undersampling, and combined sampling. Undersampling solves the problem of imbalanced sample distribution in the training set by deleting some majority samples. Random Under Sampling (RUS) [17] is a classic undersampling method, which directly deletes a certain proportion of majority samples in the training set to balance the distribution of samples in each category. The random undersampling algorithm is simple and fast to compute, but deleting samples causes information loss, which affects the classification performance of the model. The NearMiss algorithm is a prototype selection algorithm that retains the more representative majority samples and deletes the less representative ones. Compared with the RUS algorithm, the NearMiss algorithm reduces information loss and improves classification performance. Tomek Link [18] was also designed to reduce information loss. The Tomek Link algorithm is based on the K-nearest neighbor algorithm and removes the majority samples that are nearest to minority samples, so as to form a balanced data set. Undersampling methods such as NearMiss and Tomek Link reduce the information loss of majority samples compared with the RUS algorithm, but they cannot avoid it completely.
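As an illustration of the random undersampling idea described above, the following is a minimal sketch rather than the implementation used in the experiments. The function name random_under_sample and the toy reviews are hypothetical, and the sketch assumes that every category is reduced to the size of the smallest category.

```python
import random
from collections import Counter


def random_under_sample(samples, labels, seed=0):
    """Randomly drop samples so that every class keeps as many items
    as the smallest class (a simple form of random undersampling)."""
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    # Size of the smallest class is the target size for all classes.
    target = min(len(items) for items in by_class.values())
    balanced = []
    for label, items in by_class.items():
        kept = rng.sample(items, target)  # sampling without replacement
        balanced.extend((sample, label) for sample in kept)
    rng.shuffle(balanced)
    return balanced


# Toy data: four positive (1) and two negative (0) reviews.
reviews = ["good room", "nice staff", "clean", "great value", "dirty", "noisy"]
tags = [1, 1, 1, 1, 0, 0]
balanced_reviews = random_under_sample(reviews, tags)
print(Counter(label for _, label in balanced_reviews))
```

Two of the four positive reviews are discarded here, which is exactly the information loss that NearMiss and Tomek Link try to reduce.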
The oversampling method replicates or synthesizes minority samples and mixes them with the majority samples to form a balanced data set. Random Over Sampling (ROS) randomly samples and replicates minority samples to balance the number of samples in each category. Because the ROS algorithm simply copies minority samples, the model easily overfits them, resulting in a decline in classification performance. The synthetic minority oversampling technique (SMOTE) [19] is based on the K-nearest neighbor algorithm: it inserts synthetic minority samples between each minority sample and its K nearest neighbors, so as to balance the distribution of the number of samples. However, the SMOTE algorithm only uses the local features of minority samples and lacks an overall view of them, which leads to noise samples at the classification boundary and affects the classification performance. The Adaptive Synthetic Sampling Approach for Imbalanced Learning (ADASYN) [20] is also based on the K-nearest neighbor algorithm, and it considers the overall distribution of samples and the degree of imbalance. The ADASYN algorithm generates different numbers of new samples for different minority samples, thereby reducing the generation of noise samples at the classification boundary to a certain extent. SMOTE and ADASYN improve on the ROS algorithm, but the problem of overfitting minority samples remains.
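The interpolation at the core of SMOTE can be sketched as follows. This is a simplified illustration under stated assumptions, not the reference SMOTE implementation: the function name smote_like, the choice of Euclidean distance, and the two-dimensional toy vectors are all hypothetical.

```python
import math
import random


def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points, each lying on the segment between a
    minority point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority points.
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Toy minority vectors at the corners of the unit square.
minority_vectors = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_like(minority_vectors, n_new=3)
print(new_points)
```

Because every new point is built only from one sample and its nearest neighbours, the sketch also shows why SMOTE is described above as using only local features of the minority samples.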
Combined sampling refers to a sampling technology that combines undersampling and oversampling. The SMOTETomek [21] algorithm is a combined sampling algorithm that combines the SMOTE algorithm and the Tomek Link algorithm. First, the SMOTE algorithm is used to analyze the minority samples, generate new minority samples, and obtain an expanded data set; then the Tomek Link pairs are removed from the data set to form a balanced data set. The SMOTEENN algorithm is another combined sampling technique that combines oversampling and undersampling [22]. Combined sampling has the advantages of both oversampling and undersampling, but it also has the defects of both.
BERT proposes a model training method of pre-training and fine-tuning, which performs well in many NLP tasks and has received wide attention and recognition in industry and academia. However, BERT is not suitable for text generation tasks due to its mask structure. Reference [23] proposes the Unified Language Model (UniLM). Under the overall framework of BERT, it introduces an Attention Mask that endows the model with the ability to generate text. SimBERT uses similar texts to train text generation tasks under the UniLM framework. In addition, it adds a classification task for judging whether text pairs are similar. SimBERT performs well on similar text generation tasks. Therefore, this paper uses SimBERT to generate minority samples to balance the sample distribution and better complete the imbalanced sentiment classification task of online reviews.
Method
Data cleaning and data distribution
The data set used in this paper consists of online reviews of three economy hotels in Beijing (Home, Hanting, and Seven Days) from the Ctrip travel website. Online review samples usually contain a lot of noise information that has nothing to do with sentiment tendency, and this noise degrades the classification performance of the model. Data cleaning of online reviews can effectively reduce the noise in the samples and lay a solid foundation for the subsequent training of the sentiment classification model. Figure 1 shows the specific steps of data cleaning.

Fig. 1. Crawling and cleaning of online review data.
Annotate sentiment tendency: This paper completes a binary sentiment classification task on online reviews, so online reviews need to be tagged into two categories: positive sentiment tendency and negative sentiment tendency. Firstly, a comprehensive data annotation guideline was developed to guide the annotators and to ensure consistency and accuracy in the manual tagging process. The annotation guideline contains definitions of positive and negative sentiment and provides criteria for when each tag should be assigned. It also includes a limited number of annotated examples as references for the annotators. Subsequently, considering that the data consists of online reviews of hotels, nine students with prior hotel accommodation experience were recruited as annotators. Finally, each review was assigned to three annotators, with no communication allowed among the annotators during the tagging process. The final tag for each review was determined by majority vote.
Word segmentation: There are spaces between words in English text as a natural dividing line, but Chinese characters are all connected together. Word segmentation is to divide sentence samples into list samples whose elements are independent words, so as to prepare for the subsequent removal of stop words and low-frequency words.
Delete stop words: Stop words include auxiliary words, interjections, function words and other words without practical meaning. Removing stop words helps the model focus on words that carry sentiment tendency.
Delete low frequency words: Low frequency words are words that rarely appear in the entire data set. Low frequency words may contain certain sentiment information. However, due to their low frequency, paying attention to low frequency words will do more harm than good to the sentiment classification performance of the model.
After that, samples with empty content are deleted and the entire data set is deduplicated to reduce the impact of malicious repeated reviews. Finally, the data is stored in the database for subsequent experimental research.
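The cleaning steps above (word segmentation, stop word removal, low-frequency word removal, removal of empty samples, and deduplication) can be sketched as a small pipeline. This is a minimal sketch under assumptions: the function name clean_reviews, the frequency threshold min_count, and the whitespace tokenizer are hypothetical. For Chinese text, the tokenize argument would be replaced by a real word segmenter.

```python
from collections import Counter


def clean_reviews(reviews, stop_words, min_count=2, tokenize=str.split):
    """Tokenize each review, drop stop words and low-frequency words,
    then drop empty and duplicate reviews."""
    # Step 1: word segmentation (whitespace split is only a stand-in).
    tokenized = [tokenize(text) for text in reviews]
    # Step 2: delete stop words.
    tokenized = [[w for w in words if w not in stop_words] for words in tokenized]
    # Step 3: delete words that appear fewer than min_count times overall.
    freq = Counter(w for words in tokenized for w in words)
    tokenized = [[w for w in words if freq[w] >= min_count] for words in tokenized]
    # Step 4: delete empty samples and deduplicate, keeping first occurrences.
    seen, cleaned = set(), []
    for words in tokenized:
        key = tuple(words)
        if words and key not in seen:
            seen.add(key)
            cleaned.append(words)
    return cleaned


raw = ["the room is clean", "the room is clean", "the room was noisy", "is the"]
print(clean_reviews(raw, stop_words={"the", "is", "was"}))
```

In this toy run the word "noisy" is dropped as a low-frequency word even though it carries sentiment, which illustrates the trade-off mentioned in the low-frequency step.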
The database contains a total of 60033 online review samples. Figure 2 shows the number of samples for each sentiment tendency. It can be seen that the sample size distribution is extremely imbalanced.

Fig. 2. The distribution of the sample size of online reviews.
SimBERT is a model based on the idea of UniLM, and UniLM is an improved model based on BERT. Therefore, BERT and UniLM are introduced first, followed by the model structure of SimBERT.
BERT is built by stacking the encoder of the Transformer model [24]. BERT proposes two pre-training tasks: the masked language model (MLM) task and the Next Sentence Prediction (NSP) task. The MLM task randomly masks 15% of the words in a sentence and captures the semantic information of the text by predicting the masked words. The NSP task judges whether a sentence pair is composed of two consecutive sentences, so as to capture the semantic information between sentences. Figure 3 shows a schematic diagram of the model structure of BERT.

Fig. 3. Schematic diagram of the model structure of BERT.
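The masking step of the MLM task can be illustrated with a minimal sketch. The function name mask_tokens and its parameters are hypothetical, and the sketch only replaces the selected positions with [mask]; the original BERT procedure additionally leaves some selected tokens unchanged or swaps them for random tokens, which is omitted here.

```python
import random


def mask_tokens(tokens, mask_rate=0.15, mask_token="[mask]", seed=0):
    """Return (masked_tokens, targets), where targets maps each masked
    position to the original token the model should predict."""
    if not tokens:
        return [], {}
    rng = random.Random(seed)
    # Mask about mask_rate of the tokens, and at least one.
    n_mask = max(1, round(len(tokens) * mask_rate))
    positions = rng.sample(range(len(tokens)), n_mask)
    masked = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = masked[pos]
        masked[pos] = mask_token
    return masked, targets


sentence = "the hotel room was clean and the staff were very friendly".split()
masked_sentence, answers = mask_tokens(sentence)
print(masked_sentence)
print(answers)
```

The model sees masked_sentence as input and is trained to recover the entries of answers from the context on both sides of each masked position.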
[mask] in Fig. 3 represents the masked word, [sos] is the beginning of the sample, and the corresponding output [cls] represents the semantic vector of the entire sample. [eos] indicates the end of each of the two sentences. The MLM task uses contextual information on both sides to predict words. In text generation, however, usually only the preceding text is provided, and the model must generate the following text automatically. Therefore, BERT does not perform well in text generation tasks. In addition, reference [25] pointed out that the NSP task in BERT randomly shuffles all samples and then performs the binary classification task of judging whether a sentence pair is consecutive. BERT only needs to judge whether the two sentences come from the same article to complete this task well, and the task also has a negative impact on the MLM task. Therefore, reference [25] argues that the NSP task contributes little to BERT’s ability to capture the semantics of samples.
UniLM addresses the shortcoming that BERT is not suitable for text generation. UniLM includes four self-attention mask tasks: Bidirectional LM, Left-to-Right LM, Right-to-Left LM, and Seq-to-Seq LM. The structure of SimBERT mainly follows Seq-to-Seq LM, so only Seq-to-Seq LM is introduced here. A schematic diagram of the structure of UniLM’s Seq-to-Seq LM is given in Fig. 4.

Fig. 4. Schematic diagram of the structure of Seq-to-Seq LM.
In BERT, both the token of Segment 1 and the token of Segment 2 can accept bidirectional semantic information. In the Seq-to-Seq LM of UniLM, the token of Segment 1 can receive bidirectional semantic information, while the token of Segment 2 can only receive forward semantic information. Therefore, Seq-to-Seq LM of UniLM is more suitable for text generation tasks. The [cls] vector output by BERT contains the semantic information of segment 1 and segment 2, while the [cls] vector output by Seq-to-Seq LM only contains the semantic information of segment 1.
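The attention pattern described above can be written down as a small mask matrix. The following is a minimal sketch; the function name seq2seq_attention_mask is hypothetical, and special tokens are treated as ordinary positions of their segment for simplicity.

```python
def seq2seq_attention_mask(len_seg1, len_seg2):
    """Build the Seq-to-Seq LM attention mask.

    mask[i][j] == 1 means that token i may attend to token j.
    """
    total = len_seg1 + len_seg2
    mask = [[0] * total for _ in range(total)]
    for i in range(total):
        for j in range(total):
            if j < len_seg1:
                # Every token can see all of Segment 1 (bidirectional there).
                mask[i][j] = 1
            elif i >= len_seg1 and j <= i:
                # Segment 2 tokens see themselves and earlier Segment 2 tokens.
                mask[i][j] = 1
    return mask


for row in seq2seq_attention_mask(len_seg1=2, len_seg2=2):
    print(row)
```

Rows belonging to Segment 1 contain no ones in the Segment 2 columns, which is why a vector computed from Segment 1 positions carries only the semantic information of Segment 1.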
SimBERT is built on the Seq-to-Seq LM framework of UniLM. Similar sample pairs are input for training, so that the model acquires the ability to generate similar text. In the experiments of this paper, samples with the same sentiment classification are regarded as similar samples, and the online review samples are input into SimBERT for training, so that SimBERT acquires the ability to generate samples with the same sentiment classification. In addition, the [CLS] vectors of the entire batch are taken out to form a sentence vector matrix V ∈ R^(b×d) (b is batch_size, d is hidden_size). L2 normalization is applied along the d dimension to obtain the normalized sentence vector matrix used by the similarity classification task.

Fig. 5. Schematic diagram of similarity classification task.
SimBERT enhances the ability to discriminate samples of different sentiment tendencies through similarity classification tasks, thereby improving the quality of the same sentiment classification samples generated by SimBERT.
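The details of the similarity classification task are not fully reproduced in the text above, so the following is only a sketch of one common in-batch formulation and should be read as an assumption rather than as the exact SimBERT objective. Each sentence vector is L2-normalized, pairwise cosine similarities are computed within the batch, each sample is masked from itself, and a softmax cross-entropy asks the model to pick the similar partner. The function names, the partner list, and the scale factor are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def similarity_loss(cls_vectors, partner, scale=30.0):
    """Mean cross-entropy of picking each sample's similar partner among
    the other samples in the batch (partner[i] is the index of that sample)."""
    v = [l2_normalize(vec) for vec in cls_vectors]
    b = len(v)
    total = 0.0
    for i in range(b):
        # Scaled cosine similarity to every sample; self-similarity is masked.
        logits = [
            scale * sum(x * y for x, y in zip(v[i], v[j])) if j != i else -math.inf
            for j in range(b)
        ]
        # Numerically stable log-sum-exp over the row.
        top = max(logits)
        log_norm = top + math.log(sum(math.exp(z - top) for z in logits))
        total += log_norm - logits[partner[i]]
    return total / b


# Toy batch: samples 0 and 1 are similar, samples 2 and 3 are similar.
batch = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(similarity_loss(batch, partner=[1, 0, 3, 2]))
```

The loss is small when vectors of samples with the same sentiment tendency lie close together and vectors of different tendencies lie apart, which matches the discriminative role described above.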
The SimBERT model is trained on the online review data so that it can fully learn the semantic features of online reviews. Mixing the original online review samples with the synthetic minority samples generated by SimBERT yields a data set with a balanced distribution of sentiment tendencies. The BERT pre-trained model is then used to convert the online review text into a vector representation; the BERT model used is Google’s open source BERT-base version. A Bi-directional Long Short-Term Memory neural network (BiLSTM) [26] is then used to capture the bidirectional semantic information of online reviews. Because the experiment is a binary classification of positive and negative sentiment tendency, the final sentiment classification result is obtained through a Sigmoid layer, which is suitable for binary classification. Figure 6 shows the specific classification process.

Fig. 6. The specific steps of sentiment classification.
Experiment settings
First, the online review data is cleaned according to the steps in Fig. 1 in Section 3.1. In the data cleaning, the jieba word segmentation tool is used for word segmentation, and the stop word list of Harbin Institute of Technology is used to remove stop words. Afterwards, the data distribution is obtained as shown in Fig. 2 in Section 3.1, from which it can be seen that the positive sentiment tendency samples are the majority samples, and the negative sentiment tendency samples are the minority samples.
The experiment is divided into two stages. First, 30% of the samples are randomly selected from the original data set as a test set for evaluating the classification performance of the sentiment classification model. In the first stage, the remaining 70% of the data (training set A) is used to continue training the open source SimBERT model, which was trained on a general corpus. After this training, the model can deeply capture the semantic information of online reviews. Before the second stage, the SimBERT trained in the first stage is used to generate samples with negative sentiment tendency. The generated samples are then mixed with training set A to form training set B, which has a balanced sample distribution. Finally, training set B is converted into vectors using BERT and input into the sentiment classification model to complete the sentiment classification task. Table 1 shows the distribution of the number of online reviews of each sentiment tendency in the training set and test set of the experiment.
Table 1. The distribution of sample size for each sentiment tendency.
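The data flow of the two stages (hold out 30% as a test set, keep 70% as training set A, then add generated minority samples to obtain the balanced training set B) can be sketched as follows. This is a minimal sketch under assumptions: split_and_balance is a hypothetical helper, and generate_fn is a hypothetical stand-in for the fine-tuned SimBERT generator, which is not reproduced here.

```python
import random
from collections import Counter


def split_and_balance(samples, generate_fn, test_ratio=0.3, seed=0):
    """samples: list of (text, label) pairs.
    generate_fn(text) -> new text; a stand-in for the trained generator.
    Returns (training set A, balanced training set B, test set)."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_ratio)
    test_set, train_a = shuffled[:n_test], shuffled[n_test:]

    # Find the minority label in training set A and how many samples it lacks.
    counts = Counter(label for _, label in train_a)
    minority_label = min(counts, key=counts.get)
    deficit = max(counts.values()) - counts[minority_label]
    minority_texts = [text for text, label in train_a if label == minority_label]

    # Generate one new minority sample per missing sample, then mix.
    generated = [
        (generate_fn(rng.choice(minority_texts)), minority_label)
        for _ in range(deficit)
    ]
    train_b = train_a + generated
    rng.shuffle(train_b)
    return train_a, train_b, test_set


# Toy data: 80 positive (1) and 20 negative (0) reviews.
toy = [(f"good {i}", 1) for i in range(80)] + [(f"bad {i}", 0) for i in range(20)]
set_a, set_b, held_out = split_and_balance(toy, lambda text: text + " (generated)")
print(Counter(label for _, label in set_b))
```

Only the training portion is balanced in this sketch, so the held-out test set keeps the original imbalanced distribution.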
Three classic evaluation indicators are used to evaluate the classification performance of the sentiment classification model, which are Precision, Recall and F1 value. The higher the three classification indicators, the better the classification performance of the model. Among the three classification indicators, the F1 value is the most important and is used to evaluate the overall classification performance of the model. The specific calculation process of Precision, Recall and F1 is shown in Equations (1–3).
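The equations themselves are not reproduced in this text. Assuming the standard definitions, with TP, FP and FN denoting the numbers of true positives, false positives and false negatives for the category being evaluated, they can be written as:

```latex
\begin{align}
\mathrm{Precision} &= \frac{TP}{TP + FP} \tag{1} \\
\mathrm{Recall} &= \frac{TP}{TP + FN} \tag{2} \\
F1 &= \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{3}
\end{align}
```

Under these definitions, a model biased towards the majority category can keep Precision high while Recall drops, which is why the F1 value is treated as the overall indicator.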
The deep learning model is built using the open source framework TensorFlow. The number of tokens of each sample after data cleaning is counted, and samples with fewer than 65 tokens account for more than 99% of all samples. Therefore, the input vector length is set to 65. For online review samples with more than 65 tokens, the first 65 tokens are kept and converted into vectors. If a review sample has fewer than 65 tokens, it is converted into a vector and padded with 0 to a length of 65. The SimBERT in this paper continues training from the open source model; therefore, the default parameters of SimBERT are used in the experiments. Other hyperparameter settings involved in the experiments are given in Table 2.
Table 2. Hyperparameter settings of the experiments.
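The fixed input length of 65 described above amounts to a truncate-or-pad rule, which can be sketched as follows. The function name pad_or_truncate and the use of integer token ids are hypothetical; the sketch assumes that 0 is reserved as the padding value.

```python
MAX_LEN = 65  # input length chosen in the experiment settings


def pad_or_truncate(token_ids, max_len=MAX_LEN, pad_id=0):
    """Keep the first max_len tokens, then pad with pad_id up to max_len."""
    clipped = list(token_ids[:max_len])
    return clipped + [pad_id] * (max_len - len(clipped))


long_review = list(range(1, 101))   # 100 tokens: truncated to the first 65
short_review = [5, 6, 7]            # 3 tokens: padded with 62 zeros
print(len(pad_or_truncate(long_review)), len(pad_or_truncate(short_review)))
```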
A variety of classic imbalanced data processing methods are used for comparative experiments, including the RUS, NearMiss, Tomek Link, ROS, SMOTE, ADASYN, SMOTETomek and SMOTEENN algorithms. After the original hotel review text is converted into vectors using BERT, each of these methods is used to process the imbalanced data, which is then input into the BiLSTM layer for sentiment classification to obtain the classification result. In addition, the experiment that directly uses BERT to convert the original data into vectors and then uses the BiLSTM layer for sentiment classification without imbalance processing (Base_Model) is set as the control group. Table 3 shows the classification performance comparison of all experimental groups. The bold numbers in Table 3 are the optimal values of the classification performance indicators in each column.
Table 3. Classification performance comparison.
It can be seen from Table 3 that the Precision of Base_Model is 93.07%, which has achieved the optimal value of all experimental groups, but due to the imbalanced data distribution, the model is biased towards the majority samples. The Recall of Base_Model is very low, which is the lowest value in all experimental groups, and the F1 value of Base_Model is also the lowest value in all experimental groups. Therefore, it is necessary to handle data imbalance for online reviews in sentiment classification tasks.
Fig. 7 compares the classification performance of the three undersampling methods: RUS, NearMiss, and Tomek Link. It can be seen from Fig. 7 that RUS has the lowest classification performance of the three. The likely reason is that the RUS algorithm deletes majority samples at random, which causes the largest information loss. When NearMiss and Tomek Link delete majority samples, they follow certain rules to delete the less representative majority samples and retain the more representative ones, which reduces the loss of information. Therefore, the classification performance of NearMiss and Tomek Link is higher than that of RUS. Among the three undersampling methods, NearMiss has the best classification performance, but NearMiss does not completely solve the problem of information loss in undersampling.

Fig. 7. Classification performance comparison of three undersampling methods.
Fig. 8 compares the classification performance of the three oversampling methods: ROS, SMOTE, and ADASYN. It can be seen from Fig. 8 that the ROS algorithm has the lowest classification performance of the three. The likely reason is that the ROS algorithm randomly replicates minority samples, which easily causes overfitting of the minority samples. The SMOTE algorithm synthesizes minority samples, which improves their diversity compared with copying them as ROS does, so the classification performance is significantly improved. The ADASYN algorithm considers the overall sample distribution, reduces the generation of noise samples compared with the SMOTE algorithm, and further improves the classification performance. Among the three oversampling methods, the ADASYN algorithm has the best classification performance, but it is difficult for ADASYN to capture the deep semantics of minority samples.

Fig. 8. Classification performance comparison of three oversampling methods.
In order to make the follow-up analysis clearer, only the NearMiss algorithm and the ADASYN algorithm with the best classification performance among the undersampling method and the oversampling method are selected as representatives for subsequent analysis. Further analysis of NearMiss algorithm and ADASYN algorithm with other experimental groups is given in Fig. 9.

Fig. 9. Further comparison of classification performance.
It can be seen from Fig. 9 that all indicators of the NearMiss algorithm are lower than those of the ADASYN algorithm. This shows that the information loss of majority samples caused by NearMiss harms the classification performance of the sentiment classification model more than the overfitting of minority samples caused by ADASYN. The indicators of the two combined methods, SMOTETomek and SMOTEENN, differ little. They combine the advantages of oversampling and undersampling, and their classification performance is better than that of pure oversampling and pure undersampling. The Precision of Base_Model is the best of all experimental groups, and the Precision of SimBERT is the second best. Compared with Base_Model, SimBERT is only 1.72% lower in Precision, while its Recall and F1 value are higher by 13.25% and 17.45% respectively, which is a large improvement. In addition, the Recall and F1 value of SimBERT are the best of all experimental groups, exceeding the second-best Recall and F1 values by 4.23% and 4.95% respectively, which is a significant improvement.
Based on the above analysis, the imbalanced sentiment classification of online reviews based on SimBERT performs significantly better than other experimental groups in the sentiment classification task of hotel review data, and the classification performance is excellent.
To address the imbalanced distribution of sentiment categories in online reviews, an imbalanced sentiment classification method for online reviews based on SimBERT is proposed. The experiment is divided into two stages. In the first stage, SimBERT is trained using online reviews. In the second stage, the SimBERT trained in the first stage is used to generate synthetic minority online review samples. The generated samples are mixed with the original samples to balance the number of samples in each sentiment category, and the samples are then classified by sentiment. The advantages of SimBERT over other imbalance processing algorithms lie mainly in two aspects. On the one hand, SimBERT uses deep learning to learn deeper semantic information of online reviews; on the other hand, continuing training from the open source SimBERT model can make the model more robust. The experiments compare a variety of methods for dealing with imbalanced data, including classic undersampling, oversampling, and combined sampling methods. The experimental results show that, in the sentiment classification task on the online review data, the classification performance of the imbalanced sentiment classification model based on SimBERT is significantly better than that of traditional sentiment classification models for imbalanced samples.
In the next step, the model will be applied to sentiment classification tasks of online reviews in more fields, such as e-commerce reviews, movie reviews, etc., to verify the generality of the model. SimBERT is a supervised training model. In the future, we should try the similar text generation model of unsupervised training to make full use of massive online review data, so as to better complete the sentiment classification task of online reviews.
