A two-stage balancing strategy based on data augmentation for imbalanced text sentiment classification

Abstract

In practice, class imbalance is prevalent in sentiment classification tasks and is harmful to classifiers. Recently, over-sampling strategies based on data augmentation techniques have attracted the attention of researchers. They generate new samples by rewriting the original samples. Nevertheless, the samples to be rewritten are usually selected randomly, which means that useless samples may be selected and the number of such samples then grows. Based on this observation, we propose a novel balancing strategy for text sentiment classification. Our approach takes word replacement as its foundation and is divided into two stages, which not only balance the class distribution of the training set but also modify noisy data. In the first stage, we perform word replacement on specific samples instead of random samples to obtain new samples. Based on noise detection, the second stage revises the sentiment of noisy samples. Toward this aim, we propose an improved term weighting scheme called TF-IGM-CW for imbalanced text datasets, which helps to extract the samples to be rewritten and the feature words. We conduct experiments on four public sentiment datasets. Results suggest that our method outperforms several other resampling methods and can be easily integrated with various classification algorithms.

Keywords

Imbalanced text sentiment classification; resampling; noise modification; data augmentation; word replacement

1 Introduction

With the development of social networks and e-commerce, users have generated a large amount of text data. In practical applications, it is necessary to mine the subjective tendencies of these opinionated texts so that we can provide valuable information for practitioners and help them make better decisions [1]. As a fundamental task of natural language processing, sentiment classification is used to predict the sentiment polarity of a text. Nevertheless, most existing text classification algorithms assume that the number of samples in different categories is balanced. In real scenarios, however, the class distribution of a dataset is usually skewed: the sample size of one class is much larger than that of the other. As the imbalance becomes more severe, the performance of classifiers drops dramatically [2]. This kind of sentiment classification is known as imbalanced text sentiment classification. The class with the larger sample size is called the majority class and the class with the smaller sample size is called the minority class.

Many methods have been proposed to solve this problem, and they can generally be divided into two types: data preprocessing and algorithm-level methods [3]. On the one hand, data preprocessing mainly refers to resampling methods, including over-sampling and under-sampling. Over-sampling methods generate new minority class samples, whereas under-sampling methods eliminate majority class samples to balance the dataset. On the other hand, algorithm-level methods mainly include cost-sensitive learning [4] and ensemble learning [5]. Cost-sensitive learning assigns different misclassification costs to different classes; a class that is easily misclassified receives a higher cost. Ensemble learning under-samples the majority class to obtain several balanced subsets and then trains a classifier on each one. Finally, all classifiers work together to predict the label of new data.

Compared with algorithm-level methods, data preprocessing strategies are more universal. They are conducted before the training stage and can be easily integrated with different classification algorithms [6]. While various data preprocessing strategies perform well on numerical data, most of them are not suitable for text sentiment classification since text is not structured data. The basic units of texts are words rather than numbers [7]. Although we can use term weighting to represent a text as a vector, there are so many words in a corpus that the vector will be high-dimensional and sparse. In addition, if we perform numerical resampling directly in vector space, the semantics of the text will be ignored. A popular scheme is to employ data augmentation techniques for over-sampling, including generation-based methods and replacement-based methods [8]. Generation-based methods depend on deep learning models to generate samples, but the quality of the new samples cannot be guaranteed because of the limited capability of the models [9]. Replacement-based methods generate new samples by replacing words in original samples with synonyms or antonyms [10]. However, most existing replacement-based methods share a shortcoming. The samples to be rewritten are randomly selected, which means that noisy data with vague or wrong sentiment may be selected. Therefore, the amount of noisy data will increase, bringing challenges to the classifier.

To this end, we propose a novel balancing strategy for imbalanced sentiment datasets. The approach is based on word replacement and is divided into two stages: over-sampling and noise modification. In the over-sampling stage, we extract representative data whose sentiment polarity is explicit according to the data distribution in vector space. Then we generate new samples to balance the dataset by replacing feature words of the representative data. The second stage performs noise detection to find noisy data whose sentiment polarity is vague or wrong. After that, we modify these noisy data with the right feature words. The whole process is illustrated in Fig. 1. In short, the main contributions of our study are 4-fold:

Inspired by TF-IGM, we propose a novel term weighting scheme called TF-IGM-CW for the two-stage strategy. The weighting can be used to extract feature words and separate different classes in vector space.

The concept of sentiment centroid is developed to extract representative data and noisy data in the two stages. Sentiment centroid is the centroid vector of all text vectors in a class, which represents the class effectively.

We propose a two-stage balancing strategy, which not only can balance the class distribution of the training set but also can modify the sentiment polarity of noisy data.

We conduct experiments on four public sentiment datasets. Results show that our method is superior to several other resampling methods. Both the over-sampling stage and the noise modification stage can improve the performances of classifiers.

Fig. 1

Overview of our two-stage balancing strategy.

The remainder of this paper is organized as follows. In Section 2, related works are summarized. The proposed term weighting and balancing strategy are explained in Section 3 and Section 4. Experimental results and analyses are presented in Section 5. Section 6 concludes the paper.

2 Related works

In this section, we review the issue of imbalanced sentiment classification and existing main solutions.

2.1 The issue of imbalanced sentiment classification

Skewed data distribution is very prevalent in sentiment datasets. Taking product reviews as an example, positive reviews are generally the majority and negative reviews are the minority. Businesses are more interested in the latter because they can improve the product by analyzing negative reviews. Nevertheless, imbalanced datasets damage the performance of classifiers. We can measure the severity of the imbalance problem with the imbalance ratio (IR) [11], which is defined as the ratio between the sample sizes of the majority class and the minority class: $IR = \frac{N_{Ma}}{N_{Mi}}$ (1) where $N_{Ma}$ is the sample size of the majority class and $N_{Mi}$ is the sample size of the minority class. As the value of IR gets larger, the performance of classifiers drops significantly.
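As a small illustration of Eq. (1), the following sketch computes IR from a list of binary labels, together with the accuracy obtained by a trivial classifier that always predicts the majority class, which equals IR / (IR + 1) in the binary case. The function names and the 900/100 split are hypothetical and chosen only for illustration.

```python
from collections import Counter


def imbalance_ratio(labels):
    """IR = N_Ma / N_Mi for a binary label list (Eq. 1)."""
    counts = Counter(labels)
    if len(counts) != 2:
        raise ValueError("expected exactly two classes")
    n_ma, n_mi = max(counts.values()), min(counts.values())
    return n_ma / n_mi


def majority_baseline_accuracy(ir):
    """Accuracy of always predicting the majority class: N_Ma / (N_Ma + N_Mi)."""
    return ir / (ir + 1.0)


# Hypothetical toy split: 900 positive vs. 100 negative reviews.
labels = ["pos"] * 900 + ["neg"] * 100
ir = imbalance_ratio(labels)
baseline = majority_baseline_accuracy(ir)
print(ir, baseline)
```

The baseline grows with IR, which is why accuracy alone can hide a classifier that ignores the minority class.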

In a general way, imbalanced text sentiment datasets mainly have three characteristics and the caused problems are as follows:

Imbalanced size: It refers to the large gap in sample size between the majority class and the minority class. To pursue high accuracy, the classifier ignores the minority class and prefers the majority class [12]. Consider an extreme case in the binary classification task. When IR = 99, the majority class accounts for 99% of the samples, so the accuracy can be 99% even if the classifier predicts all samples as the majority class.

Small disjuncts [13]: Since the sample size is relatively small, the minority class has much fewer features. Consequently, minority class samples are distributed in numerous feature spaces [14] and surrounded by majority class samples, as shown in Fig. 2(a). This results in great difficulty for classifiers to learn the features of minority class.

Class overlapping [15]: In imbalanced datasets, samples belonging to different classes may overlap, especially for the minority class, as shown in Fig. 2(b). The sentiment of samples that fall in the area of the other class is vague. These samples are disadvantageous to classifiers and are called noisy data.

Fig. 2

Examples of imbalanced dataset: (a) small disjuncts and (b) class overlapping. The red points are the samples of majority class and the blue points are the samples of minority class.

2.2 Solutions for imbalanced sentiment classification

The existing solutions for imbalanced sentiment classification consist of data preprocessing methods and algorithm-level methods.

Data preprocessing methods include over-sampling and under-sampling, but few related studies focus on text sentiment classification. Li et al. [2] propose a clustering-based under-sampling framework, which groups the majority class into several clusters and only retains the representative samples in each cluster. Prusa et al. [16] discard majority class samples with random under-sampling (RUS) to balance the class distribution of Twitter datasets. Wang et al. [17] propose the BRC algorithm to cut majority class samples in the dense boundary region. However, under-sampling methods may lose many features of the majority class. As for over-sampling, the basic strategy is random over-sampling (ROS), which augments the minority class by randomly duplicating minority class samples, but it suffers from the problem of over-fitting.
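The two baseline strategies mentioned above, ROS and RUS, can be sketched in a few lines. This is a minimal illustration using only the standard library; the function names, the seed argument, and the toy sample identifiers are assumptions and do not come from the cited works.

```python
import random


def random_over_sample(minority, target_size, seed=0):
    """ROS: duplicate randomly chosen minority samples until target_size is reached."""
    rng = random.Random(seed)
    extra = [rng.choice(minority) for _ in range(target_size - len(minority))]
    return list(minority) + extra


def random_under_sample(majority, target_size, seed=0):
    """RUS: keep a random subset of the majority class of size target_size."""
    rng = random.Random(seed)
    return rng.sample(list(majority), target_size)


# Hypothetical toy data: 10 majority samples and 4 minority samples.
majority = [f"pos_{i}" for i in range(10)]
minority = [f"neg_{i}" for i in range(4)]
ros = random_over_sample(minority, len(majority))
rus = random_under_sample(majority, len(minority))
```

ROS only repeats existing minority samples, which adds no new features, while RUS discards majority samples together with whatever features they carry.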

To address the over-fitting problem of ROS, researchers employ data augmentation techniques for over-sampling, mainly including generation-based methods and replacement-based methods. Generation-based methods address imbalance problems with generative models. Hu et al. [18] generate controlled text with an auto-encoder backbone. Luo et al. [19] utilize sequence generative adversarial networks to add minority samples. Nevertheless, the quality of the sentences generated by these models is often poor. Replacement-based methods generate new sentences by replacing some words in the original sentences. Zhang et al. [20] first extract replaceable words from the text and then randomly select some of them to replace with synonyms. Fadaee et al. [21] pay attention to rare words and replace only these words in a sentence. Wei et al. [22] present a set of easy data augmentation (EDA) techniques for text classification tasks, including synonym replacement, random insertion, random swap, and random deletion. The above replacement-based methods are effective in imbalanced text classification. However, they share a drawback. The sentences to be processed are selected randomly, which means that data contributing nothing to the classifier may be selected. As a result, the amount of data that is easily misclassified by the classifier increases. In this paper, we address this disadvantage and propose a non-random replacement-based method for imbalanced sentiment classification. The sentiment polarity of new samples can be controlled to a large extent.

Algorithm-level methods are another solution. Since they are not the focus of our paper, we only give a brief introduction to them. Madabushi et al. [23] combine cost-sensitivity with BERT to improve the performance of classifiers on imbalanced text datasets. Wang et al. [24] study the effectiveness of ensemble learning for imbalanced text sentiment classification.

3 Proposed term weighting scheme

In our method, we extract representative data and noisy data based on the class distribution of text datasets in vector space. Toward this aim, we improve TF-IGM weighting which is described in Section 3.1 and propose a novel term weighting called TF-IGM-CW in Section 3.2. Moreover, feature words of different classes are extracted respectively, which is explained in Section 3.3.

3.1 Overview to TF-IGM weighting

Generally, we utilize term weighting schemes (TWS) to represent a word as a number and transform texts into vectors. In this way, we can mine the deep relationships between texts, such as similarity. Besides, term weighting can measure the class distinguishing power of a word [25], which can be used to extract feature words. A term weighting usually consists of local factor and global factor. Term frequency (TF) is generally used as local factor, which refers to the frequency of a term in a document. Global factor reflects the occurrences of a specific term in the entire corpus, which can be utilized to extract feature words.

In this subsection, we review a recently proposed term weighting scheme called TF-IGM (term frequency & inverse gravity moment) [26], which can precisely measure the class distinguishing power of a term. The formula of TF-IGM is as follows: $\text{TF-IGM}(t_{i}) = TF(t_{i}, d_{k}) * \left(1 + \lambda * \overbrace{\frac{f_{i1}}{\sum_{r = 1}^{M} f_{ir} * r}}^{IGM(t_{i})}\right)$ (2)

In this formula, TF(t_i, d_k) is the term frequency of term t_i in document d_k. M represents the number of classes in the dataset. f_ir is the number of documents containing the term t_i in the r-th class, where the classes are sorted in descending order of this count and r is the rank. λ is the balance coefficient.

The core of TF-IGM is a statistical model called “Inverse Gravity Moment” (IGM). IGM is a global factor, which can show the class distinguishing power of a term by considering the inter-class distribution. For term t_i, IGM sorts all classes in descending order according to the number of documents containing t_i, and finally weights them. λ is used to balance local factor and global factor. In the related literature [26], its value range is set to be 5.0–9.0 and the default is 7.0, which is appropriate for different datasets. By calculating the IGM value of each word, we can extract feature words that can distinguish the class effectively in the dataset. The word with a larger IGM value has a stronger ability to distinguish classes.
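The IGM factor and the TF-IGM weight of Eq. (2) can be illustrated with a minimal sketch. The function names are hypothetical, the per-class document frequencies are made-up numbers, and the zero return value for a term that occurs in no document is an assumption made to avoid division by zero.

```python
def igm(class_doc_freqs):
    """IGM(t_i) = f_i1 / sum_r (f_ir * r), with class frequencies ranked in descending order."""
    ranked = sorted(class_doc_freqs, reverse=True)
    moment = sum(f * r for r, f in enumerate(ranked, start=1))
    return ranked[0] / moment if moment > 0 else 0.0


def tf_igm(tf, class_doc_freqs, lam=7.0):
    """TF-IGM weight of a term in one document (Eq. 2); lam is the balance coefficient."""
    return tf * (1.0 + lam * igm(class_doc_freqs))


# A term concentrated in one class has the maximum IGM value of 1.
concentrated = igm([0, 40])
# A term spread evenly over two classes has a much smaller IGM value.
uniform = igm([20, 20])
print(concentrated, uniform, tf_igm(2, [40, 0]))
```

In the two-class case the IGM value ranges from 1/3 (evenly spread term) to 1 (term confined to one class), which shows why IGM alone cannot tell which class a term favours.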

3.2 Incorporating the class-specific factor into TF-IGM

Due to the full use of inter-class distribution information, the term weighting TF-IGM can accurately measure the class distinguishing power of words. And the related literature [26] proves that TF-IGM has an excellent performance in text classification tasks. However, TF-IGM has two drawbacks: (a) By means of IGM, we can only find the feature words of the entire dataset rather than feature words belonging to different classes; (b) Since each word shares the same IGM value in different classes, after the vectorization of text data with TF-IGM, the data of different classes will be mixed in vector space. Taking the binary-class dataset Movie Review Data (MR) [27] as an example, we select 4,265 positive and 1,705 negative movie reviews, so positive reviews are the majority class and negative reviews are the minority class. Particularly, the imbalance ratio is 2.5. We use TF-IGM to vectorize sentences. The dimensionality of sentence vector is equal to the number of words in the corpus, and each word corresponds to a dimension. The value of each dimension is the TF-IGM value of the corresponding word in the sentence. Then we apply t-SNE [28] to these vectors and plot their 2-D representations, as shown in Fig. 3. The data points of the majority class and the minority class are mixed in vector space, so the useful information available to us is quite limited. Similar problems also exist in many popular term weighting schemes, such as TF-IDF.

Fig. 3

Distribution of MR dataset in vector space based on TF-IGM.

In order to solve the above problems, we propose a class-specific factor called class weighting (CW). CW considers the intra-class distribution of a term and can measure the representative power of a term in a specific class. The CW values of each term in different classes are independent and can be formulated as in the following equation: $CW(t_{ij}) = \log\left(1 + \frac{a_{ij}}{D_{j}} * \frac{D}{b_{ij} + 1}\right)$ (3) where CW(t_ij) denotes the CW value of term t_i in class C_j. a_ij is the number of documents in class C_j that contain t_i, and b_ij is the number of documents outside class C_j that contain t_i. D_j is the total number of documents in class C_j and D is the number of documents not belonging to class C_j.

Formula (3) indicates that CW(t_ij) depends on the intra-class distribution of term t_i in the dataset. CW(t_ij) will be assigned a larger value when term t_i frequently occurs in class C_j and only occasionally occurs in other classes. A word with better representative power will have a larger CW value. The minimum value of CW(t_ij) is 0, reached when t_i never occurs in class C_j (a_ij = 0), and the maximum value is log(1 + D), reached when t_i occurs in every document of class C_j and in no document of the other classes (a_ij = D_j, b_ij = 0). To unify the distribution of CW values in different classes, the CW value can be transformed to fall in the interval [0, 1.0] by applying Min-Max normalization: $nCW(t_{ij}) = \frac{CW(t_{ij})}{\log(1 + D)}$ (4) where nCW(t_ij) denotes the normalized value of CW(t_ij) and D is the number of documents not belonging to class C_j.
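Equations (3) and (4) can be checked numerically with the following minimal sketch. The function names and the document counts are hypothetical; the natural logarithm is used here as an assumption, although the base cancels out in the normalized value.

```python
import math


def cw(a_ij, b_ij, d_j, d_other):
    """CW(t_ij) of Eq. (3); d_other is D, the number of documents outside class C_j."""
    return math.log(1.0 + (a_ij / d_j) * (d_other / (b_ij + 1.0)))


def ncw(a_ij, b_ij, d_j, d_other):
    """Normalized CW of Eq. (4), which lies in [0, 1]."""
    return cw(a_ij, b_ij, d_j, d_other) / math.log(1.0 + d_other)


# Hypothetical class with 100 documents and 250 documents in the other classes.
upper = ncw(a_ij=100, b_ij=0, d_j=100, d_other=250)    # term in every in-class document only
lower = ncw(a_ij=0, b_ij=30, d_j=100, d_other=250)     # term never occurs in the class
specific = ncw(a_ij=50, b_ij=10, d_j=100, d_other=250)
shared = ncw(a_ij=50, b_ij=100, d_j=100, d_other=250)
print(upper, lower, specific, shared)
```

With the same in-class frequency, the term that appears less often outside the class (`specific`) receives the larger normalized weight.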

We incorporate the class-specific factor CW into TF-IGM and obtain an improved term weighting scheme called TF-IGM-CW:

$\text{TF-IGM-CW}(t_{ij}) = \text{TF-IGM}(t_{i}) * nCW(t_{ij})$ (5)

where TF-IGM-CW(t_ij) represents the TF-IGM-CW value of term t_i in class C_j, and λ in TF-IGM(t_i) is set to the default value of 7. We visualize the selected MR data with TF-IGM-CW in the same way, as shown in Fig. 4. With this weighting, data points belonging to different classes are separated in vector space. We can therefore extract the representative data and noisy data according to the class distribution.
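The vectorization with Eq. (5) can be sketched as follows, assuming that the IGM and nCW values of every vocabulary word have already been computed. The function name, the use of raw counts as term frequency, and all numeric values are assumptions made for illustration.

```python
from collections import Counter


def tf_igm_cw_vector(tokens, vocab, igm_value, ncw_value, lam=7.0):
    """TF-IGM-CW vector (Eq. 5) of one document belonging to class C_j.

    igm_value[t] is IGM(t) and ncw_value[t] is nCW(t, C_j).
    Raw counts are used as TF, which is an assumption of this sketch.
    """
    tf = Counter(tokens)
    return [tf[t] * (1.0 + lam * igm_value[t]) * ncw_value[t] for t in vocab]


# Hypothetical three-word vocabulary and weights for the positive class.
vocab = ["funny", "movie", "unfunny"]
igm_value = {"funny": 0.8, "movie": 0.34, "unfunny": 0.9}
ncw_pos = {"funny": 0.6, "movie": 0.1, "unfunny": 0.0}
vec = tf_igm_cw_vector(["funny", "movie", "movie"], vocab, igm_value, ncw_pos)
print(vec)
```

Because nCW is class-specific, the same word contributes differently to documents of different classes, which is what pulls the two classes apart in vector space.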

Fig. 4

Distribution of MR dataset in vector space based on TF-IGM-CW.

3.3 Feature words selection

Word replacement is one of the most popular data augmentation techniques for imbalanced sentiment classification. In previous studies [20–22], the word to be replaced is usually selected randomly or based on its part of speech and occurrence frequency. Taking a different approach, we replace feature words in a target text with synonyms or antonyms, which can alleviate the problem of imbalanced features. Generally, the global factor of a term weighting scheme can be used to extract feature words. In this study, in order to extract the feature words of each class separately, we use the combination of the global factor IGM and the class-specific factor CW in TF-IGM-CW. Owing to the combination of inter-class and intra-class information, IGM-CW measures not only the class distinguishing power of a term but also its representative power for a specific class. The IGM-CW value of term t_i in class C_j (j = Ma, Mi) can be expressed by the following equation: $\text{IGM-CW}(t_{ij}) = IGM(t_{i}) * nCW(t_{ij})$ (6) where IGM(t_i) denotes the IGM value of t_i in the entire dataset and nCW(t_ij) is the normalized value of CW(t_ij). In essence, nCW(t_ij) is a restriction factor of IGM(t_i). A word with stronger class distinguishing power and representative power will be assigned a greater weight. To identify the class bias of a term, we define a feature indicator (FI) for each term t_i: $FI(t_{i}) = \text{IGM-CW}(t_{i-Ma}) - \text{IGM-CW}(t_{i-Mi})$ (7) where IGM-CW(t_i-Ma) and IGM-CW(t_i-Mi) denote the IGM-CW values of t_i in the majority class and the minority class, respectively. When FI(t_i) > 0, term t_i is biased toward the majority class and is called a majority class word. When FI(t_i) < 0, t_i is called a minority class word. Taking the selected MR data as an example, we first perform lowercase conversion and word stemming, and then remove stop words and rare words. Finally, 2,465 words are retained. We calculate the FI value of each word and plot them on the coordinate axis, as shown in Fig. 5.

Fig. 5

The FI values of words in MR dataset.

In addition, we further divide all words into ordinary words, feature words of the majority class and the minority class in the light of threshold K (0 < K < 1):

Feature words of the majority class: According to the descending order of FI value, we select majority class words with a ratio of K as feature words of the majority class. These words have a much higher frequency of occurrence in the majority class than in the minority class. They can effectively represent the majority class and are sparsely distributed at the top of the coordinate axis.

Feature words of the minority class: According to the ascending order of FI value, we select minority class words with a ratio of K as feature words of the minority class. These words have a much higher frequency of occurrence in the minority class. They can represent the minority class and are sparsely distributed at the bottom of the coordinate axis.

Ordinary words: Excluding all feature words, the rest are ordinary words. These words have a similar frequency of occurrence in both classes. Their class representative power is quite weak and they are densely distributed on both sides of FI = 0.
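The division into the three word groups above can be sketched as follows, assuming the IGM-CW values of every word in both classes are available as dictionaries. The function names, the truncation used when applying the ratio K, and the toy values are assumptions of this sketch.

```python
def feature_indicator(igm_cw_ma, igm_cw_mi):
    """FI(t) = IGM-CW(t, Ma) - IGM-CW(t, Mi) for every term (Eq. 7)."""
    return {t: igm_cw_ma[t] - igm_cw_mi[t] for t in igm_cw_ma}


def split_words(fi, k):
    """Return (majority feature words, minority feature words, ordinary words)."""
    ma_words = sorted((t for t in fi if fi[t] > 0), key=lambda t: fi[t], reverse=True)
    mi_words = sorted((t for t in fi if fi[t] < 0), key=lambda t: fi[t])
    # Keep the top fraction K of each group; truncation is an assumption.
    ma_feat = ma_words[: int(len(ma_words) * k)]
    mi_feat = mi_words[: int(len(mi_words) * k)]
    selected = set(ma_feat) | set(mi_feat)
    ordinary = [t for t in fi if t not in selected]
    return ma_feat, mi_feat, ordinary


# Hypothetical IGM-CW values in the majority (Ma) and minority (Mi) class.
igm_cw_ma = {"wonderfully": 0.9, "entertainment": 0.6, "film": 0.30, "story": 0.25,
             "plot": 0.2, "badly": 0.05, "unfunny": 0.0}
igm_cw_mi = {"wonderfully": 0.0, "entertainment": 0.1, "film": 0.28, "story": 0.2,
             "plot": 0.3, "badly": 0.7, "unfunny": 0.9}
fi = feature_indicator(igm_cw_ma, igm_cw_mi)
ma_feat, mi_feat, ordinary = split_words(fi, k=0.5)
print(ma_feat, mi_feat, ordinary)
```

A larger K moves more words from the ordinary group into the feature groups, so K directly controls how many words are candidates for replacement.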

The value of K determines the number of feature words, so we show the effect of K on classification performance in the experimental section. In order to demonstrate the effectiveness of our method for extracting feature words, we set K to 30% and thus get two boundaries in Fig. 5. Words above the blue dotted line are feature words of the majority class, and words below the red dotted line are feature words of the minority class. We utilize WordCloud 1 to visualize some feature words according to the FI value, as shown in Fig. 6. Most feature words in the majority class can express a positive sentiment, such as ‘wonderfully’ and ‘entertainment’. And most feature words in the minority class are negative, such as ‘unfunny’ and ‘badly’.

Fig. 6

Word cloud of feature words in MR.

4 The proposed two-stage balancing strategy

There is great randomness in most of the existing over-sampling strategies based on word replacement. Specifically speaking, the data to be replaced may have no contribution to the classifier or even bring interference to it, such as noisy data whose sentiment polarity is wrong. Thus, this type of data may be added after over-sampling. Our non-random two-stage strategy mitigates the weakness by performing word replacement on some specific data. In this part, sentiment centroid is developed to extract these specific data, which is explained in Section 4.1. Then details of the over-sampling stage and the noise modification stage are introduced in Section 4.2 and Section 4.3.

4.1 Sentiment centroid

In Section 3, we show TF-IGM-CW can vectorize the text data and separate different classes in vector space. In this subsection, we propose the concept of sentiment centroid (SC). For text sentiment datasets, sentiment centroid is the centroid of all sentence vectors in a class. According to the distance between the sentence vector and the sentiment centroid, we can identify whether a sentence is representative data or noisy data. Generally, to obtain the centroid of a group of vectors, we calculate the average value of the corresponding dimensions of all vectors, and the resulting vector is the centroid. However, from our perspective, the significances of sentences are unequal in sentiment datasets. The sentence with stronger class distinguishing power should be assigned a greater weight when we calculate the centroid.

Since IGM-CW can measure the class distinguishing power and the class representative power of a word, we calculate the average IGM-CW value of all words in a sentence to identify its importance. The obtained average value is the weight of the sentence, which can be defined as the following equation: $\text{Sentence\_weight}(S_{kj}) = \frac{\sum_{t_{i} \in S_{k}} \text{IGM-CW}(t_{ij})}{N_{k}}$ (8) where Sentence_weight(S_kj) denotes the sentence weight of sentence S_k belonging to class C_j and the numerator is the sum of the IGM-CW values of all words in sentence S_k. N_k is the number of words in sentence S_k.

The reason for averaging is to avoid extreme scenarios. For example, given a long sentence consisting only of ordinary words and a short sentence consisting only of feature words, the latter should be assigned a greater weight than the former, whereas a plain sum could favor the long sentence. Finally, we calculate the weighted average value of the corresponding dimensions of all sentence vectors: $SC(C_{j}) = \sum_{S_{kj} \in C_{j}} \text{Sentence\_weight}(S_{kj}) * Vector(S_{kj})$ (9) where SC(C_j) is the sentiment centroid of class C_j, Sentence_weight(S_kj) is the weight of S_k belonging to class C_j, and Vector(S_kj) is the vector of sentence S_k. Notice that we normalize all sentence weights belonging to the same class in advance to ensure that the sum of all weights is 1.
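Equations (8) and (9) can be illustrated with a minimal pure-Python sketch. The function names are hypothetical, words missing from the IGM-CW dictionary are treated as having zero weight (an assumption), and the vectors and weights are toy values.

```python
def sentence_weight(tokens, igm_cw):
    """Average IGM-CW value of the words in a sentence (Eq. 8)."""
    return sum(igm_cw.get(t, 0.0) for t in tokens) / len(tokens)


def sentiment_centroid(vectors, weights):
    """Weighted average of the sentence vectors of one class (Eq. 9)."""
    total = sum(weights)
    norm = [w / total for w in weights]  # normalise so the weights sum to 1
    dim = len(vectors[0])
    return [sum(w * v[d] for w, v in zip(norm, vectors)) for d in range(dim)]


# Hypothetical IGM-CW values and two toy sentence vectors of the same class.
igm_cw_pos = {"funny": 0.8, "movie": 0.2}
w_short = sentence_weight(["funny", "movie"], igm_cw_pos)
centroid = sentiment_centroid([[1.0, 0.0], [0.0, 1.0]], [0.3, 0.1])
print(w_short, centroid)
```

The centroid is pulled toward the first vector because its sentence weight is three times larger, which is the intended effect of weighting sentences by their feature content.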

4.2 Over-sampling with representative data

In the over-sampling stage, we replace feature words in the representative data with the external knowledge base WordNet 2 , which not only can generate new samples with explicit sentiment label, but also can alleviate the problem of imbalanced features to some extent. Representative data refers to the data with strong class representative power, and they are not easy to be misclassified.

Since sentiment centroid is the centroid vector of all sentence vectors in a category, we can extract representative data according to the Euclidean distance between data and their centroids. Data with stronger class representative power will have a smaller distance. Taking the selected MR data as an example, we present some representative data based on the ascending order of distance, as shown in Table 1.

Table 1
Representative data in MR dataset

Class	Representative data
Majority class	quite funny for the type of movie it is...
	a terrific date movie, whatever your orientation.
	an engaging, formulaic sports drama that carries a charge of genuine excitement.
	an entertaining documentary that freshly considers arguments the bard’s immortal plays were written by somebody else.
	bloody Sunday has the grace to call for prevention rather than to place blame, making it one of the best war movies ever made.
Minority class	a relative letdown.
	hopelessly inane, humorless and under-inspired.
	it is messy, uncouth, incomprehensible, vicious and absurd.
	after all the big build-up, the payoff for the audience, as well as the characters, is messy, murky, unsatisfying.
	ultimately this is a frustrating patchwork: an uneasy marriage of louis begley’s source novel (about schmidt) and an old payne screenplay.

Sentiment words are shown in bold italics. The sentiment label of majority class is positive and minority class is negative.

Table 1 shows that representative data have the following characteristics: an explicit sentiment tendency, little redundant information, and the presence of sentiment words. A sentiment word is a word that contributes greatly to the sentiment polarity of a sentence, such as ‘funny’, ‘terrific’, ‘entertaining’ in positive data and ‘letdown’, ‘hopelessly’, ‘incomprehensible’ in negative data. These words rarely occur in the other category, so most of them are feature words. Using the knowledge base, we replace feature words with synonyms or antonyms to generate new minority class samples, which enriches the features of the minority class.

In binary classification tasks, assuming that the difference between the sample sizes of the two classes is N and the value of the imbalance ratio is IR, we extract $\frac{IR}{IR + 1} \cdot N$ representative samples from the majority class and $\frac{1}{IR + 1} \cdot N$ from the minority class. Notice that we perform the extraction in ascending order of the distance from each sample to its centroid.
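The extraction of representative data can be sketched as follows. The function names are hypothetical and rounding to the nearest integer is an assumption; the class sizes in the usage example are those of the selected MR data, while the toy vectors are made up.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def representative_counts(n_ma, n_mi):
    """Numbers of representative samples drawn from the majority and minority class."""
    gap = n_ma - n_mi                          # N, the number of samples to generate
    ir = n_ma / n_mi
    from_ma = round(ir / (ir + 1.0) * gap)     # IR / (IR + 1) * N
    return from_ma, gap - from_ma              # the rest is 1 / (IR + 1) * N


def closest_to_centroid(vectors, centroid, n):
    """Indices of the n vectors nearest to the centroid, in ascending distance."""
    order = sorted(range(len(vectors)), key=lambda i: euclidean(vectors[i], centroid))
    return order[:n]


counts = representative_counts(4265, 1705)
nearest = closest_to_centroid([[0.0, 0.0], [3.0, 4.0], [1.0, 1.0]], [0.0, 0.0], 2)
print(counts, nearest)
```

The two counts sum to N, so rewriting each extracted sample once yields exactly enough new minority samples to balance the two classes.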

Suppose the sentence length is l. We select α • l words to replace, where the parameter α (0 < α < 1) is the proportion of words in a sentence to be replaced. The specific process is as follows:

Word selection. Feature words are selected first in the representative data. And if a sentence does not have enough feature words, we select extra ordinary words following the descending order of IGM-CW values. The importance of these ordinary words is second only to feature words.

Replacement. Using WordNet, we replace a majority class feature word with an antonym and a minority class feature word with a synonym. To preserve the data distribution of the class, an ordinary word or a feature word that has no synonym or antonym is replaced with the minority class feature word whose IGM-CW value is closest to it.
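The two steps above can be sketched as follows. This is only an illustration under several assumptions: the dictionaries `synonyms` and `antonyms` are toy stand-ins for WordNet look-ups, `weight` is a single hypothetical word-to-score map standing in for IGM-CW values, the number of replaced words is rounded up, and feature words of the other class that appear in the sentence are treated as ordinary words.

```python
import math


def rewrite_sentence(tokens, source_class, weight, ma_feat, mi_feat,
                     synonyms, antonyms, alpha=0.3):
    """Rewrite one representative sentence into a new minority-class sample.

    source_class is "Ma" or "Mi"; weight maps a word to an IGM-CW-style score.
    """
    own_feat = ma_feat if source_class == "Ma" else mi_feat
    lexicon = antonyms if source_class == "Ma" else synonyms
    n_replace = math.ceil(alpha * len(tokens))  # rounding up is an assumption

    # Word selection: feature words first, then ordinary words by descending score.
    ranked = sorted(
        range(len(tokens)),
        key=lambda i: (tokens[i] not in own_feat, -weight.get(tokens[i], 0.0)),
    )
    chosen = set(ranked[:n_replace])

    out = []
    for i, tok in enumerate(tokens):
        if i not in chosen:
            out.append(tok)
        elif tok in own_feat and lexicon.get(tok):
            out.append(lexicon[tok][0])  # antonym (Ma source) or synonym (Mi source)
        else:
            # Fallback: minority feature word with the closest score.
            score = weight.get(tok, 0.0)
            out.append(min(mi_feat, key=lambda w: abs(weight[w] - score)))
    return out


# Hypothetical toy lexicon and scores.
weight = {"terrific": 0.9, "date": 0.2, "movie": 0.1, "a": 0.0, "awful": 0.85, "dull": 0.3}
new_sample = rewrite_sentence(
    ["a", "terrific", "date", "movie"], "Ma", weight,
    ma_feat=["terrific"], mi_feat=["awful", "dull"],
    synonyms={}, antonyms={"terrific": ["awful"]}, alpha=0.5,
)
print(new_sample)
```

In this toy run the majority feature word is flipped with its antonym and the highest-scoring ordinary word falls back to the nearest minority feature word, so the generated sentence carries minority class features only.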

4.3 Modify noisy data

After the over-sampling stage, the dataset becomes balanced. Nevertheless, the dataset still has the problem of class overlapping. Some samples from one class may occur in the area of another class in the feature space, which is disadvantageous for classifiers to learn class features. These samples are called noisy data whose sentiment polarity is vague or even wrong. Part of these samples come from the original dataset and part from the generated samples in the over-sampling stage. Thus, it is necessary to extract these noisy data and modify their sentiment polarity.

Similarly, we identify noisy data based on the class distribution in vector space. A sentence whose Euclidean distance to its own sentiment centroid is larger than its distance to the other centroid is defined as noisy data. In this way, we perform noise detection on the selected MR data mentioned in Section 3.2 and present some of the noisy samples in Table 2.
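The noise detection rule can be written as a one-line comparison of two distances. The sketch below is a minimal illustration; the function names and the two-dimensional toy vectors are assumptions.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def is_noisy(vector, own_centroid, other_centroid):
    """A sample is noisy if it lies farther from its own centroid than from the other one."""
    return euclidean(vector, own_centroid) > euclidean(vector, other_centroid)


# Hypothetical sentiment centroids of the two classes.
sc_own = [0.0, 1.0]
sc_other = [1.0, 0.2]
flag_far = is_noisy([1.0, 0.0], sc_own, sc_other)
flag_near = is_noisy([0.1, 0.9], sc_own, sc_other)
print(flag_far, flag_near)
```

Samples that are equally distant from both centroids are not flagged by this strict comparison, which is a choice of the sketch rather than something stated in the text.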

Table 2
Noisy data in MR dataset

Class	Noisy data
Majority class	the date movie that franz kafka would have made.
	underachieves only in not taking the shakespeare parallels far enough.
	an uncomfortable movie, suffocating and sometimes almost senseless, the grey zone does have a center, though a morbid one.
	strip it of all its excess debris, and you’d have a 90-minute, four-star movie. as it is, it’s too long and unfocused.
	often messy and frustrating, but very pleasing at its best moments, it’s very much like life itself.
Minority class	faultlessly professional but finally slight.
	at least it’s a fairly impressive debut from the director, charles stone iii.
	the film is surprisingly well-directed by brett ratner, who keeps things moving well – at least until the problematic third act.
	though it was made with careful attention to detail and is well-acted by james spader and maggie gyllenhaal, i felt disrespected.
	handsome and sincere but slightly awkward in its combination of entertainment and evangelical boosterism.

Sentiment words of the other class are shown in bold italics.

Table 2 shows that noisy data fall mainly into two types: (a) samples that carry no sentiment tendency or merely state an objective fact, such as the first noisy positive sample; (b) samples that contain sentiment words of the other class, such as ‘uncomfortable’ and ‘messy’ in the noisy positive samples and ‘faultlessly’ and ‘impressive’ in the noisy negative samples. Most of these sentiment words are feature words, and they contribute to the vague or wrong sentiment polarity of a sentence.

We modify the two types of noisy data by replacing feature words. The specific process is:

Word selection. All the feature words of the other class are selected. If the sentence contains no feature words of the other class, we select the two most important ordinary words, namely the two words with the largest IGM-CW values.

Replacement. Feature words of the other class are replaced with their antonyms in WordNet. An ordinary word, or a feature word that has no antonym, is replaced with the minority class feature word whose IGM-CW value is closest to its own.
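The two steps above can be illustrated with the following sketch. It is not a definitive implementation: the antonym dictionary stands in for WordNet, the IGM-CW scores are invented, and for brevity the fallback ranks all tokens of the sentence rather than separately filtering out feature words of the sentence's own class.

```python
def select_words(tokens, other_class_features, igm_cw, fallback_count=2):
    """Choose the words to rewrite in one noisy sentence."""
    hits = [t for t in tokens if t in other_class_features]
    if hits:
        return hits
    # No feature word of the other class: take the highest-weighted words instead.
    unique = list(dict.fromkeys(tokens))  # keeps first-occurrence order for ties
    ranked = sorted(unique, key=lambda t: igm_cw.get(t, 0.0), reverse=True)
    return ranked[:fallback_count]


def modify_sentence(tokens, other_class_features, minority_features, igm_cw, antonyms):
    """Rewrite the selected words of a noisy sentence and return the new tokens."""
    targets = set(select_words(tokens, other_class_features, igm_cw))
    rewritten = []
    for t in tokens:
        if t not in targets:
            rewritten.append(t)
        elif t in other_class_features and t in antonyms:
            rewritten.append(antonyms[t])
        else:
            weight = igm_cw.get(t, 0.0)
            rewritten.append(min(minority_features, key=lambda m: abs(igm_cw[m] - weight)))
    return rewritten


# Hypothetical scores and lexicon entries, for illustration only.
igm_cw = {"the": 0.01, "date": 0.20, "movie": 0.35, "boring": 0.30, "messy": 0.25}
negative_features = ["boring", "messy"]
antonyms = {"messy": "tidy"}

with_feature = modify_sentence(["often", "messy", "but", "pleasing"],
                               negative_features, negative_features, igm_cw, antonyms)
without_feature = modify_sentence(["the", "date", "movie"],
                                  negative_features, negative_features, igm_cw, antonyms)
```

In the first call only "messy" is rewritten (to "tidy"); in the second call no feature word is present, so the two highest-weighted tokens are replaced by the minority class feature words with the nearest scores.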

5 Experiment

In this section, we present the details of our experiments. In Section 5.1, we briefly introduce the experimental settings and in Section 5.2, results and analysis are shown.

5.1 Experimental setting

We conduct experiments on four public sentiment review datasets. All of them are binary classification datasets consisting of positive and negative reviews. We process the training sets of the four datasets in the same way: we keep the full set of positive samples as the majority class and randomly select a subset of the negative reviews as the minority class, so that IR = 2.5. The four datasets are as follows:

MR [27]: Movie Review Data (MR) collected by Pang and Lee, including 5,331 positive and 5,331 negative reviews. We use 80% of them as the training set and the remaining 20% as the testing set.

SST-2 [29]: Stanford Sentiment Treebank (binary version), an extension of the MR but with train/dev/test splits. The split is 6,920 training, 872 validation, and 1,821 testing.

IMDB [30]: A large dataset of movie reviews from the Internet Movie Database (IMDB) consisting of 25,000 training and 25,000 testing reviews. Each sample has multiple sentences.

CR [31]: Customer reviews (CR) of five products, including 2,406 positive and 1,367 negative reviews. Analogously, 80% of them are randomly selected as the training set and the remaining 20% form the testing set.
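The construction of the imbalanced training sets can be sketched as below. The sketch assumes that IR denotes the ratio of majority to minority class size and that the minority size is rounded down; the function name and the fixed seed are our own choices for illustration.

```python
import random


def make_imbalanced(positives, negatives, ir=2.5, seed=0):
    """Keep every positive sample and draw len(positives) / ir negatives at random."""
    n_minority = int(len(positives) / ir)
    if n_minority > len(negatives):
        raise ValueError("not enough negative samples for the requested ratio")
    rng = random.Random(seed)
    return list(positives), rng.sample(list(negatives), n_minority)


# Toy identifiers standing in for review texts.
positives = [f"pos_{i}" for i in range(1000)]
negatives = [f"neg_{i}" for i in range(1000)]
majority, minority = make_imbalanced(positives, negatives)
```

With 1,000 positive samples and IR = 2.5, the minority class in this toy example contains 400 negative samples.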

We compare our method with three popular random resampling strategies for imbalanced text sentiment classification. All of the dataset settings are as follows:

Raw imbalanced dataset (Raw): directly using the raw imbalanced dataset.

Random under-sampling (RUS) [16]: under-sampling by randomly discarding the majority class samples.

Random over-sampling (ROS) [2]: over-sampling by randomly duplicating the minority class samples.

Random synonym replacement (RSR): over-sampling by generating new samples with random synonym replacement. We follow the specific process in [22].

Over-sampling with representative data (OSRD): the approach we proposed in the over-sampling stage, which over-samples by replacing feature words in the representative data.

Over-sampling with representative data+Revise noisy data (OSRD+RND): combining the over-sampling stage with the noise modification stage, which modifies the sentiment polarity of noisy data after OSRD.
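For reference, the two simplest baselines above, RUS and ROS, can be sketched as follows. We assume here that both resample until the two classes have equal size; the function names and the seed are hypothetical.

```python
import random


def random_under_sample(majority, minority, seed=0):
    """RUS: discard majority samples at random until both classes have equal size."""
    rng = random.Random(seed)
    return rng.sample(list(majority), len(minority)), list(minority)


def random_over_sample(majority, minority, seed=0):
    """ROS: duplicate randomly chosen minority samples until both classes have equal size."""
    rng = random.Random(seed)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return list(majority), list(minority) + extra


# Toy identifiers standing in for review texts (IR = 2.5).
majority = [f"pos_{i}" for i in range(10)]
minority = [f"neg_{i}" for i in range(4)]
rus_major, rus_minor = random_under_sample(majority, minority)
ros_major, ros_minor = random_over_sample(majority, minority)
```

RUS shrinks the training set to the minority size, whereas ROS keeps all original samples and only adds exact duplicates, which is why it adds no new lexical information.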

To validate the effectiveness of our method when integrated with different classification algorithms, we consider four basic models, namely logistic regression (LR), naive Bayes (NB), convolutional neural network (CNN) and recurrent neural network (RNN), and two state-of-the-art deep learning models, ACNN and AC-BiLSTM. Their specific settings are introduced below:

LR and NB: implemented with sklearn [32] using the default settings.

CNN: the CNN architecture described by Kim [33] is adopted: input layer, 1D convolutional layer with 128 filters of size 5, global 1D max pool layer, dense layer of 64 hidden units and output layer.

RNN: the RNN architecture described by Liu et al. [34] is used: input layer, a bi-directional layer with 128 LSTM cells, dropout layer with rate of 0.5, dense layer of 64 hidden units and output layer.

ACNN [35]: a CNN model combined with an attention mechanism is adopted: input layer, BiLSTM of 128 hidden units with attention mechanism, 1D convolutional layers with 100 filters each of size 3, 4 and 5, 1D max pool layer, dropout layer with rate of 0.5, dense layer of 64 hidden units and softmax layer.

AC-BiLSTM [36]: a BiLSTM model with attention mechanism and convolutional layer is used: input layer, 1D convolutional layer with 100 filters of size 3, BiLSTM of 150 hidden units with attention mechanism, dropout layer with rate of 0.7 and softmax layer.

In addition, we utilize Accuracy (Acc) to evaluate the performance of each classification algorithm.

5.2 Results and analysis

5.2.1 The effectiveness of our two-stage balancing method

We conduct experiments on four imbalanced datasets with different combinations of balancing strategies and classification algorithms. Threshold K is set to 30% and parameter α is set to 10%.

The comparison results in Table 3 indicate that our method outperforms all three of the random resampling methods considered. Compared to the raw imbalanced datasets, the average improvement of OSRD is 2.1%, and a further 0.9% improvement is achieved after applying RND. It should be noted that the average improvements of OSRD and RND on the two state-of-the-art models are 0.3% and 0.1% lower, respectively, than on the four basic models, probably because the two state-of-the-art models are better at learning the characteristics of raw imbalanced datasets. We also observe that the under-sampling method RUS yields the lowest improvement, because feature information may be lost during under-sampling, especially for small imbalanced datasets. On the IMDB dataset, however, RUS achieves an average improvement of 1.1%, as models can still generalize well when a large dataset loses some data. Moreover, among the over-sampling strategies, RSR performs better than ROS, which may be attributed to RSR alleviating over-fitting to some extent. Our over-sampling stage performs best because it limits the addition of useless samples. By modifying the noisy data, RND further improves on OSRD, which demonstrates the necessity of addressing noisy data. In short, our method performs well in both stages and can be effectively integrated with different classification algorithms.

Table 3
Performances (%) of the different combinations on four datasets

Classifier	Dataset	Raw	RUS	ROS	RSR	OSRD	OSRD+RND
LR	MR	73.9	74.3	74.8	75.3	76.1	77.0
	SST-2	74.2	74.5	75.1	75.6	76.5	77.2
	IMDB	71.4	72.3	72.4	72.8	73.3	74.2
	CR	76.0	75.8	76.8	77.4	78.5	79.2
NB	MR	72.2	72.4	73.3	73.6	74.1	75.2
	SST-2	73.8	73.9	74.6	75.1	75.7	76.9
	IMDB	74.1	75.1	75.2	75.3	76.4	77.3
	CR	76.3	76.5	77.2	77.6	78.1	79.1
CNN	MR	74.7	75.1	75.8	76.1	76.8	77.8
	SST-2	76.3	76.8	77.2	78.1	78.7	79.5
	IMDB	73.1	74.3	74.5	74.8	75.6	76.3
	CR	79.5	80.4	80.9	81.2	81.7	82.6
RNN	MR	75.5	76.0	76.2	76.6	77.8	78.8
	SST-2	76.8	77.4	77.9	78.4	79.2	80.1
	IMDB	74.5	75.8	75.8	76.1	76.9	77.9
	CR	80.3	80.8	81.1	81.6	82.3	83.0
ACNN	MR	79.5	79.8	80.2	80.5	81.4	82.2
	SST-2	81.6	81.8	82.3	82.8	83.6	84.5
	IMDB	80.4	81.5	81.8	82.0	82.5	83.3
	CR	83.5	83.6	83.8	84.3	85.2	85.8
AC-BiLSTM	MR	78.9	79.2	79.3	79.6	80.7	81.6
	SST-2	82.2	82.5	83.1	83.7	84.3	85.1
	IMDB	81.3	82.2	82.3	82.4	83.2	83.9
	CR	83.6	84.0	84.1	84.5	85.5	86.0
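The average improvements quoted in the text can be recomputed from Table 3 with a few lines of arithmetic. The sketch below copies the Raw, OSRD and OSRD+RND columns of the table and averages the differences over all 24 classifier and dataset combinations; the variable names are ours.

```python
# (Raw, OSRD, OSRD+RND) accuracies from Table 3, in table order:
# LR, NB, CNN, RNN, ACNN, AC-BiLSTM, each on MR, SST-2, IMDB, CR.
ROWS = [
    (73.9, 76.1, 77.0), (74.2, 76.5, 77.2), (71.4, 73.3, 74.2), (76.0, 78.5, 79.2),
    (72.2, 74.1, 75.2), (73.8, 75.7, 76.9), (74.1, 76.4, 77.3), (76.3, 78.1, 79.1),
    (74.7, 76.8, 77.8), (76.3, 78.7, 79.5), (73.1, 75.6, 76.3), (79.5, 81.7, 82.6),
    (75.5, 77.8, 78.8), (76.8, 79.2, 80.1), (74.5, 76.9, 77.9), (80.3, 82.3, 83.0),
    (79.5, 81.4, 82.2), (81.6, 83.6, 84.5), (80.4, 82.5, 83.3), (83.5, 85.2, 85.8),
    (78.9, 80.7, 81.6), (82.2, 84.3, 85.1), (81.3, 83.2, 83.9), (83.6, 85.5, 86.0),
]


def mean(values):
    """Arithmetic mean of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)


osrd_gain = mean(osrd - raw for raw, osrd, _ in ROWS)    # OSRD over Raw
rnd_gain = mean(both - osrd for _, osrd, both in ROWS)   # RND over OSRD
total_gain = mean(both - raw for raw, _, both in ROWS)   # both stages over Raw
```

The three averages come to roughly 2.10, 0.85 and 2.95 percentage points, which agree with the reported 2.1%, 0.9% and 3.0% after rounding to one decimal place.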

5.2.2 The effect of threshold K

In Section 3.2, we select feature words with threshold K. The value of K affects the number of feature words to be replaced in the over-sampling stage and the noise modification stage. Since our method performs consistently across the six classification algorithms, we take RNN and ACNN as examples, setting the parameter α to 10% and testing the effect of K between 5% and 50%. The results are shown in Figs. 7(a) and 8(a). The highest classification accuracy is reached at around K = 30%. At small K, the performance of OSRD is similar to that of RSR because only a few feature words are identified. For K > 30%, the classification accuracy levels off and then decreases slightly, which may be attributed to the limited number of replaced words: at large K, a sentence contains far more feature words than the number of words to be replaced. Additionally, when K is too large, many words with weak class-distinguishing power are identified as feature words. We call these words pseudo-feature words. Consequently, in the over-sampling stage, pseudo-feature words rather than true feature words may be replaced in a majority class sample, so the generated sample is not a genuine new minority class sample and interferes with the classifier. Thus, taking the four datasets into account, the best value of K for OSRD is 30%.

Fig. 7

The effects of threshold K and parameter α on the proposed method with RNN.

Fig. 8

The effects of threshold K and parameter α on the proposed method with ACNN.

In the noise modification stage, we replace feature words belonging to the other class in the noisy data, so the value of K also influences the performance of RND. We test the performance gain of RND over OSRD at different values of K; the experimental results are shown in Figs. 7(b) and 8(b). The highest accuracies on the four datasets are obtained at K = 25% or K = 30%. The performance gain decreases significantly for K > 35% and can even become negative. This might be attributed to the increase of pseudo-feature words, as a result of which too many words in the noisy data are replaced. In this case, most of the replaced words are pseudo-feature words, which may change the meaning of a sentence. Furthermore, the best performance on the large dataset IMDB is at 20% or 25%. Since IMDB has a large vocabulary, feature words with strong class-distinguishing power can be extracted at small K. On the other hand, the problem of pseudo-feature words is severe at large K, which can be observed in Figs. 7(b) and 8(b).

In conclusion, considering both OSRD and RND, we suggest setting K to 30%, which works well across the board.
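To make the role of K concrete, the sketch below treats K as the percentage of the vocabulary, ranked by descending weight, that is kept as feature words. This reading is an assumption made for illustration (the selection rule itself is defined in Section 3.2), and the weights and function name are hypothetical.

```python
def select_feature_words(weights, k_percent=30):
    """Return the top k_percent of words ranked by descending weight (at least one)."""
    ranked = sorted(weights, key=weights.get, reverse=True)
    n_keep = max(1, len(ranked) * k_percent // 100)
    return ranked[:n_keep]


# Hypothetical term weights for a ten-word vocabulary.
weights = {
    "awful": 0.91, "boring": 0.85, "dull": 0.78, "mess": 0.60, "slow": 0.41,
    "plot": 0.22, "film": 0.15, "scene": 0.12, "the": 0.02, "a": 0.01,
}
top_30 = select_feature_words(weights, 30)
top_50 = select_feature_words(weights, 50)
```

Under this reading, raising K from 30% to 50% admits lower-weighted words such as "mess" and "slow", which mirrors the pseudo-feature-word effect discussed above.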

5.2.3 The effect of parameter α for OSRD

In the over-sampling stage, we select a certain number of words for replacement according to the ratio α. The number of replaced words can affect the quality of a new sample. We set K to its suggested value of 30% and study the effect of α between 5% and 50% on the performance of the classifiers. As before, we take RNN and ACNN as examples; the results are shown in Figs. 7(c) and 8(c). Accuracies are high for small α, especially on the large dataset IMDB. As α increases, the classification performance begins to decrease, likely because replacing too many words destroys the sentence structure and generates unintelligible sentences. Thus, the suggested value of α is 10%, which is appropriate for the different datasets according to our experiments.
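The effect of α on a single sentence can be sketched as follows. We assume here that α is applied to the sentence length and that the highest-weighted tokens are chosen first; both points, as well as the weights and the function name, are assumptions for illustration only.

```python
def words_to_replace(tokens, weights, alpha_percent=10):
    """Return positions of the highest-weighted alpha_percent of tokens (at least one)."""
    n_replace = max(1, len(tokens) * alpha_percent // 100)
    ranked = sorted(range(len(tokens)),
                    key=lambda i: weights.get(tokens[i], 0.0), reverse=True)
    return sorted(ranked[:n_replace])


tokens = "a truly good and enjoyable film with a weak ending".split()
weights = {"good": 0.8, "enjoyable": 0.7, "weak": 0.6}  # hypothetical scores
small_alpha = words_to_replace(tokens, weights, 10)
large_alpha = words_to_replace(tokens, weights, 50)
```

For this ten-token sentence, α = 10% rewrites only "good", whereas α = 50% rewrites five tokens, including low-weight function words, which illustrates why large α tends to damage sentence structure.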

6 Conclusion

This paper proposes a two-stage balancing strategy for imbalanced text sentiment classification. Our approach is based on word replacement but removes the randomness of conventional approaches. By means of the improved term weighting TF-IGM-CW, the over-sampling stage generates new samples with explicit sentiment by rewriting representative data. In the noise modification stage, the sentiment polarity of noisy data is modified, which yields a relatively clean dataset. Experimental results indicate that our approach outperforms several previous resampling strategies and can be easily integrated with different classification algorithms. Compared to raw imbalanced datasets, the total average improvement of the proposed two-stage balancing strategy is 3.0%. Specifically, the average improvement of OSRD is 2.1% and that of RND is 0.9%.

In this work, we only study the imbalanced binary sentiment classification problem. It is also worth considering the imbalanced multi-class problem in text sentiment classification; for instance, a sentiment dataset may carry a neutral label in addition to the positive and negative labels. Therefore, future work will focus on developing suitable term weighting schemes to study the class distribution of multi-class datasets and on investigating corresponding balancing strategies.

Acknowledgments

This work has been supported by the National Natural Science Foundation of China under Grants: Methodologies for Understanding Big Data and Knowledge Discovery (61836016).

References

1. et al., Imbalanced text sentiment classification using universal and domain-specific knowledge, Knowledge-Based Systems 160 (2018), 1–15.

2. et al., Imbalanced sentiment classification, Proceedings of the 20th ACM international conference on Information and knowledge management, 2011, 2469–2472.

3. Kaur, Pannu H.S. and Malhi A.K., A systematic review on imbalanced data challenges in machine learning: Applications and solutions, ACM Computing Surveys (CSUR) 52(4) (2019), 1–36.

4. et al., Cost-sensitive and hybrid-attribute measure multi-decision tree over imbalanced data sets, Information Sciences 422 (2018), 242–256.

5. Nanni, Fantozzi and Lazzarini, Coupling different methods for overcoming the class imbalance problem, Neurocomputing 158 (2015), 48–61.

6. Haixiang, et al., Learning from class-imbalanced data: Review of methods and applications, Expert Systems with Applications 73 (2017), 220–239.

7. Baroni, et al., Entailment above the word level in distributional semantics, Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, 2012, 23–32.

8. Zhang, et al., Imbalanced sentiment classification enhanced with discourse marker, International Conference on Artificial Neural Networks, 2019, 117–129.

9. et al., Conditional BERT contextual augmentation, International Conference on Computational Science, 2019, 84–95.

10. Giridhara P.K.B., et al., A Study of Various Text Augmentation Techniques for Relation Classification in Free Text, ICPRAM 3 (2019), 5.

11. Fernández, del Jesus M.J. and Herrera, On the 2-tuples based genetic tuning performance for fuzzy rule based classification systems in imbalanced data-sets, Information Sciences 180(8) (2010), 1268–1291.

12. et al., Word embedding composition for data imbalances in sentiment and emotion classification, Cognitive Computation 7(2) (2015), 226–240.

13. Sáez J.A., Krawczyk and Woźniak, Analyzing the oversampling of different classes and types of examples in multi-class imbalanced datasets, Pattern Recognition 57 (2016), 164–178.

14. Lin W.-C., et al., Clustering-based undersampling in class-imbalanced data, Information Sciences 409 (2017), 17–26.

15. Devi, Biswas S.K. and Purkayastha, Learning in presence of class imbalance and class overlapping by using one-class SVM and undersampling technique, Connection Science 31(2) (2019), 105–142.

16. Prusa, et al., Using random undersampling to alleviate class imbalance on tweet sentiment data, 2015 IEEE international conference on information reuse and integration, 2015, 197–202.

17. Wang, et al., Sample cutting method for imbalanced text sentiment classification based on BRC, Knowledge-Based Systems 37 (2013), 451–461.

18. et al., Toward Controlled Generation of Text, International Conference on Machine Learning, 2017, 1587–1596.

19. Luo, et al., A novel oversampling method based on SeqGAN for imbalanced text classification, 2019 IEEE International Conference on Big Data (Big Data), 2019, 2891–2894.

20. Zhang, Zhao and LeCun, Character-level convolutional networks for text classification, Advances in Neural Information Processing Systems 28 (2015), 649–657.

21. Fadaee, Bisazza and Monz, Data Augmentation for Low-Resource Neural Machine Translation, Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 2017, 567–573.

22. Wei and Zou, EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks, Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019, 6383–6389.

23. Madabushi H.T., Kochkina and Castelle, Cost-Sensitive BERT for Generalisable Sentence Classification with Imbalanced Data, arXiv preprint arXiv:2003.11563, (2020).

24. Wang, et al., Sentiment classification: The contribution of ensemble learning, Decision Support Systems 57 (2014), 77–93.

25. Parlak and Uysal A.K., The impact of feature selection on medical document classification, 2016 11th Iberian Conference on Information Systems and Technologies (CISTI), 2016, 1–5.

26. Chen, et al., Turning from TF-IDF to TF-IGM for term weighting in text classification, Expert Systems with Applications 66 (2016), 245–260.

27. Pang and Lee, Seeing stars: exploiting class relationships for sentiment categorization with respect to rating scales, Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, 2005, 115–124.

28. Arora, Hu and Kothari P.K., An Analysis of the t-SNE Algorithm for Data Visualization, Conference On Learning Theory, 2018, 1455–1462.

29. Socher, et al., Recursive deep models for semantic compositionality over a sentiment treebank, Proceedings of the 2013 conference on empirical methods in natural language processing, 2013, 1631–1642.

30. Maas, et al., Learning word vectors for sentiment analysis, Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies, 2011, 142–150.

31. and Liu, Mining and summarizing customer reviews, Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, 2004, 168–177.

32. Hao and Ho T.K., Machine learning made easy: A review of scikit-learn package in python programming language, Journal of Educational and Behavioral Statistics 44(3) (2019), 348–361.

33. Kim, Convolutional Neural Networks for Sentence Classification, Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, 1746–1751.

34. Liu, Qiu and Huang, Recurrent neural network for text classification with multi-task learning, Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, 2016, 2873–2879.

35. Liu, et al., Recurrent networks with attention and convolutional networks for sentence representation and classification, Applied Intelligence 48(10) (2018), 3797–3806.

36. Liu and Guo, Bidirectional LSTM with attention mechanism and convolutional layer for text classification, Neurocomputing 337 (2019), 325–338.