Abstract
Class imbalance is often a problem in various real-world datasets, where one class contains a small number of data and the other contains a large number of data. It is notably difficult to develop an effective model using traditional data mining and machine learning algorithms without using data preprocessing techniques to balance the dataset. Oversampling is often used as a pretreatment method for imbalanced datasets. Specifically, synthetic oversampling techniques focus on balancing the number of training instances between the majority class and the minority class by generating extra artificial minority class instances. However, the current oversampling techniques simply consider the imbalance of quantity and pay no attention to whether the distribution is balanced or not. Therefore, this paper proposes an entropy difference and kernel-based SMOTE (EDKS) which considers the imbalance degree of dataset from distribution by entropy difference and overcomes the limitation of SMOTE for nonlinear problems by oversampling in the feature space of support vector machine classifier. First, the EDKS method maps the input data into a feature space to increase the separability of the data. Then EDKS calculates the entropy difference in kernel space, determines the majority class and minority class, and finds the sparse regions in the minority class. Moreover, the proposed method balances the data distribution by synthesizing new instances and evaluating its retention capability. Our algorithm can effectively distinguish those datasets with the same imbalance ratio but different distribution. The experimental study evaluates and compares the performance of our method against state-of-the-art algorithms, and then demonstrates that the proposed approach is competitive with the state-of-art algorithms on multiple benchmark imbalanced datasets.
Introduction
Imbalanced classification, where the majority class is dominant to the minority class, is widely used in data mining and machine learning, such as enterprise credit evaluation [1], bankruptcy prediction [2], fraud detection [3, 4] and others [5].
Many famous machine learning models have presented to solve classification problems based on reasonable assumptions of balanced class distribution. However, these algorithms do not necessarily work well on imbalanced data, because their goal is to pursue a minimized global error rate [6]. In other words, a classifier can achieve a high degree of accurate classification, although it does not correctly predict any of the instances in the minority class. For example, assuming that 0.1% of credit card transactions are fraudulent, a simple classifier that classifies all transactions as legitimate will get a classification accuracy of 99.9%. In this case, however, all fraud cases remain undetected.
Techniques aimed at improving classification performance in the case of class imbalance can be divided into two broad categories: algorithm level methods and data level methods.
Algorithm level methods include modifying classical algorithms, cost-sensitive methods and ensemble of classifiers. The strategies which modify the classification algorithm to deal with the imbalance are algorithm level techniques [7]. Such techniques include training separate classifiers for each class and changing decision thresholds [8, 9]. The purpose of cost-sensitive methods is to provide classification algorithms which different misclassification costs for each class. However, they require knowledge of the cost of misclassification, which are commonly unknown or difficult to quantify and dataset-dependent. In addition, these algorithms should be able to incorporate the misclassification cost of each class or sample into its optimization. Therefore, these methods are considered to operate at the data and algorithm level [10]. Ensemble of classifiers are designed to improve the accuracy of a single classifier by training several different models and combining their decisions to predict a single class label [10, 11]. The main disadvantage of the ensemble method is the high time complexity. Data level methods could be seen as a classifier-independent type of technology, which are used to rebalance the data distribution to make standard algorithms center on the goals of the user [12]. In particular, data level methods can be categorized as under-sampling majority class instances [13] and oversampling minority class instances [14, 15]. Unlike algorithm-level methods, which are bound to problem-specific classifiers and need to be implemented by classifiers, data-level methods, especially oversampling methods, are universal and therefore more versatile [10].
Nonlinear separability could markedly influence the performance of machine learning algorithms in the imbalanced datasets. Fortunately, SVMs are suitable for imbalanced problem as the final decision function depends only on one smaller subset of training instances known as support vectors inherently [16]. Moreover, SVMs work well on nonlinear problems by identifying the separating boundary based on a kernel function that maps the dataset to a feature space. In this paper, entropy difference (ED) is introduced in the kernel space as a measure of distribution imbalance. It is well known that entropy can measures the chaos of data distribution. ED is one of the class-position statistics, it reflects inter-class distribution by intra-class concentration. SMOTE technology is used to synthesize instances where the intra-class concentration is small. In order to reduce the ED of datasets, this paper evaluates each synthetic instance and only qualified instances can be retained.
Josey et al. [16] developed a weighted kernel-based SMOTE (WK-SMOTE) method to solve the two problems of nonlinear and class imbalance at the same time, and obtained some impressive results. In order to improve the classification performance of WK-SMOTE, in this paper, we fully consider the balance of data distribution and propose an entropy difference and kernel-based SMOTE technique for solving the imbalanced problem. First, the EDKS method maps the input data into a feature space to increase the separability of the data. Then EDKS calculates the ED in kernel space, determines the majority and minority class and finds the sparse regions in the minority class. Moreover, the proposed method balances the data distribution by synthesizing new instance and evaluating its retention capability. The main contribution of this paper is to use ED to measure the data distribution in kernel space to ensure that new instances stay within a reasonable area.
The remainder of this paper is organized as follows: Section 2 introduces the class imbalance problem, discusses some oversampling methods for rebalancing imbalanced datasets and shows some kernel methods related to nonlinear separable classification problems in brief. In Section 3, we describe the proposed entropy difference and kernel-based SMOTE in detail. The expermental results are present in Section 4. The conclusions of the paper and further work on this topic are discussed in Section 5.
Although many different types of methods have been proposed for class imbalance problems, Section 2.1 only reviews the work of data resampling related to this article. Some kernel methods related to nonlinear separable classification problems are also introduced in Section 2.2.
Imbalance problem
Undersampling methods try to create a balanced subset of the original imbalanced dataset by reducing the size of majority class instances. Random undersampling (RUS) is a nonheuristic method which aims to rebalance class distribution through the random elimination of majority class instances. Obviously, the drawback to RUS is that it loses a lot of majority instances information. In order to keep more useful information about the majority instances, recently, the clustering technique has been introduced into undersampling methods [13, 17, 18]. Clustering-based undersampling method [13] involves clustering the majority instances by k-means algorithm [19], and the authors set the parameter k equal to the number of minority instances. Then, the k cluster centers, which are produced by the k-means algorithm over the majority instances, are used to replace the entire majority class dataset. Consequently, both the majority and minority class datasets contain the same number of instances.
Oversampling techniques balance the number of instances by increasing the number of the minority class instances. The simplest method of data oversampling is Random Over-Sampling (ROS) [7], which replicates the minority instances randomly and adds them into the training dataset. However, training a classification model from a balanced data set processed by ROS may lead to overfitting. Chawla et al. [20] proposed a traversal algorithm based on linear interpolation named SMOTE, which can effectively expand the range of a small number of instances and avoid overfitting. Its main idea is to select randomly one or more of the k nearest neighbors for minority class instance and to create new positive instances along the line segments joining selected instances. Since then, many variants of SMOTE have been proposed, such as adaptive synthetic sampling approach [15], Borderline-SMOTE [21], safe-level synthetic minority oversampling technique [22], density-based synthetic minority oversampling technique [23] and majority weighted minority oversampling technique (MWMOTE) [24]. In [25], Fernández et al. provided a summative analysis about the variations of SMOTE and pointed out their differences from the original SMOTE algorithm systematically. Overall, although SMOTE reduces the risk of overfitting, it also leads to the problem of over generalization [26]. In [27], the method proposed by Zhu et al. is designed for combating the multi-class imbalance problem and avoids over generalization by a corresponding selection weight to each neighbor direction. Menardi and Torelli’s research showed the effect of class imbalance on model training and evaluation [28]. Moreover, they have proposed an unified and systematic data processing framework for imbalanced datasets based on a smoothed bootstrap resampling technique. Liu et al. [29] made full use of the correlation among different attributes and developed a fuzzy rule-based oversampling technique for addressing the imbalance problem, solving the problems of imbalance and missing values at the same time. RWO-sampling [30] is a random walk oversampling method, which redistributes different class instances by generating synthetic instances from the real data. This method also allowed the extension of the classification border. In [31], Das et al. remarked that existing oversampling methods for addressing the class imbalance problem generally did not consider the probability distribution of the minority class. Thus, they presented two probabilistic oversampling methods, termed as RACOG and wRACOG, which utilize the joint probability distribution of data attributes and Gibbs sampling to creat artificial instances for the minority class.
It is well known that the class imbalance is not a problem in itself and various data distributions may be the real cause of the performance degradation of classifiers. There have been many papers discussing how to deal with these special issues. SMOTE-IPF [14] has been devoted to controlling borderline instances and the noise produced by SMOTE. Besides, Mahalanobis distance-based oversampling technique (MDO) [32] payed attention to maintaining the covariance structure of minority instances and reduced the risk of overlap between different class regions that are considered serious challenge in the multi-class problems.
It is important to measure imbalance degree between the minority and the majority. The traditional measure of class imbalance is the imbalance ratio (IR), which is defined as the ratio of the number of majority instances to the number of minority instances. The simplicity of IR leads to its application in most studies of imbalanced data. IR reflects the imbalance degree of dataset in size, but does not measure the imbalance degree in distribution. Even if the datasets is quantitatively balanced, the imbalance of class distribution may still exist [33]. Furthermore, the classification accuracy of minority class is related to the number of information instances rather than the number of minority instances [34]. Therefore, this paper uses the entropy difference to measure the imbalanced degree of dataset distribution, which is completely different from the previous IR. The advantages of using ED can be clearly seen in Fig. 1. The two datasets have different ED and the same IR. For Fig. 1a, without overlapping regions and clear classification boundaries between two classes make it easily recognized by any simiple classifier. Figure 1b is quite different. Obviously, IR does not have the ability to distinguish these two datasets with different distributions. In short, those representative instances of minority class are crucial in the learning of minority distribution. Previous works [33, 34] have shown that the more representative instances in the minority class, the better the classifier’s performance when fixed IR. Therefore, it is not appropriate to use IR as an only indicator to measure the imbalance degree.

Entropy often used to measure the uncertainty of data distribution, and it can be considered as the contrary of distribution information. In other words, the more randomly distributed a dataset is, the less informations it contains [35]. For imbalanced data, a more dispersed and more imbalanced intra-class distribution will imply a high level of entropy. In this occasion, entropy is introduced into the kernel space as a measure of data distribution.
The classical oversampling methods, such as SMOTE [20], AdaSyn [15], borderline [21], are usually to generate instances for the minority class to balance the dataset. A new minority instance is inserted into the line segment between
where

Although SMOTE has been proved to obtain good classification performance in many applications areas, its performance is largely limited on nonlinear separable problems. For example, consider the two-dimensional classification problem shown in Fig. 2a. In order to balance the dataset artificially, SMOTE identifies minority classes and their neighbors in the input space, and generates a random minority instances on the line segments connecting them. It is worth noting that some synthetic instances have been clearly generated in the majority class region. This problem has been addressed in literatures by modifying the selection of neighbors in SMOTE. In [21], the danger set is composed of the minority instances with a large number of majority neighbors. These borderline danger instances are then used to generate instances. Such methods, however, depend on specific data distribution characteristics, which limits their application to specific domains. On the other hand, classifiers such as neural networks and SVM solve nonlinear problems by mapping the instances to a feature space and building decision function in this space. The two-dimensional non-linear separable classification task in Fig. 2a can be transformed into a three-dimensional feature space using the transformation function
As shown in Fig. 2b, the nonlinear separable problem is transformed into a linear separable problem. Hence, performing oversampling in this feature space leads to more representative instances generation. In addition to modifying the decision function, the selection of the kernel function also affects the generalization ability of SVM classifier [36, 37, 38]. Some kernel-based methods for imbalanced classification have also been proposed. For example, in [39] and [40], preimages of instances generated in feature space are used to balance the imbalanced data set.
A synthetic instance between two minority instances
where
The situation is different for instances generated in the input space. Addition of synthetic instances increases the representation of the minority class and forces the classification model to move the decision boundary toward the majority class.
In this section, we propose an ED and kernel-based SMOTE technique (EDKS). EDKS for SVM involves three key steps. In the first step, the entropy difference between two classes of data is calculated in feature space and minority classes are determined (Section 3.1). The second step, the pairs of seed and neighbor used to generate synthetic instances are identified from the minority classes, and then synthetic instances are generated on the line segments between the pairs (Section 3.2). Finally, the new synthetic instance retention capability is evaluated and only qualified instances are retained (Section 3.3). Particularly, this section is discussed in feature space. Figure 3 shows the frame of this paper.
The structure of EDKS.
Given a training dataset
the
We define a density-based instance-position statistic for the
where
where
The entropy of each class is calculated by Eq. (7). Let
where
[H] EDID (
Remark
In traditional IR measurement methods, the number of minority classes should be less than the majority class. However, the result of measuring the imbalance degree based on the ED is not always consistent. The instance number of minority class can be equal to or even more than that of majority class, which proves that the majority class distribution plays a dominant role in the classification results.
In this section, we introduce synthesize minority instances. Firstly, instance-position statistics for each minority class instance are computed by Eq. (5), and
In order to train the SVM classifier in kernel space, the inner product of every pair of instances also known as the Gram matrix needs to be computed. The inner product of training instances is conveniently represented by
where
The elements of
Similarly, the elements of
From Eqs (9)–(11), it is clear that the augmented kernel matrix K is only composed of the training instances in
It is worth noting that other oversampling algorithms like borderline and AdaSyn can also be easily adapted to operate in the feature space of SVM classifier using this method. The Euclidean distance used in such algorithms is replaced by the feature space distance
In order to reduce the ED of dataset, we evaluate each synthetic instance and retain the qualified instances which can reduce ED when add it to dataset. Therefore, the qualified instances have the ability to balance the majority and minority class on data distribution.
The implementation of EDKS is described in Algorithm 2. First, EDKS uses Algorithm 1 to determine the minority class, and finds its seed instances as well as neighbor instances. Then it uses Eqs (9)–(11) to oversampling the minority class in feature space. Finally, the retention capability of new synthetic instances is evaluated.
[H] EDKS (
Before concluding this section, we want to briefly discuss the reasonableness of the synthetic data obtained by EDKS. To overcome the limitations of SMOTE, we propose an entropy difference and kernel-based SMOTE. Kernel machines work by mapping the input data into a feature space to reduce or eliminate overlapping areas, and then building linear algorithms in the feature space to implement nonlinear counterparts in the input data space. On this basis, entropy difference is used to measure the data distribution, determine the location of the data to be generated, and detect the quality of the synthesized data. EDKS not only improves the quality of synthetic data, but also avoids data overfitting to a greately extent.
Experimental results and analyses
In order to illustrate the effectiveness of the proposed EDKS algorithm in solving class imbalance problems, the following experiments are carried out.
Experimental setup
Confusion matrix for two-class classification problem
Confusion matrix for two-class classification problem
The performance measurement is a key factor in comparison of various models. Accuracy is the most commonly used evaluation metric. However, in the framework of imbalanced datasets, accuracy is no longer a proper measure, as it does not distinguish between the numbers of correctly classified examples of different classes. Therefore, it may lead to erroneous conclusions, i.e., a classifier achieving an accuracy of 90% in a dataset with an IR value of 9, is not accurate if it classifies all examples as negatives [41]. In our experiments, Area Under the Curve (AUC) [42], Recall and G-mean [43] are together used as the performance metrics. AUC, the area under the receiver operating characteristic curve (ROC), is widely applied to estimate the accuracy of imbalanced models with all possible scopes of thresholds. It is an objective measure which is not affected by subjective factors on account of its independence from the decision criterion selected and prior probabilities of class distributions [7, 44]. Recall refers to the classification accuracy on the positive instances, and G-mean is the harmonic mean of Recall and Precision. In a two-class problem, the confusion matrix (shown in Table 1) records the results of correctly and incorrectly recognized instances of each class. The definitions of accuracy, recall, precision and G-Mean are as follows:
Apparently, the greater the values of these three selected performance metrics, the better the performance of the algorithm.
The datasets used in the evaluation are briefly described here. Nineteen binary datasets from UCI machine learning repository [45] and the KEEL data repository [46] are selected to evaluate the effectiveness of the proposed algorithm. Their detailed information is presented in Table 2. The number of total instances, the number of attributes, the number of minority instances, IR and ED have shown. The performance results on each datasets are achieved from the average of ten independent performs with stratified five fold cross-validation.
The proposed EDKS is evaluated against the following seven baseline algorithms.
Description of the 19 datasets used in the experiments
SVM [47]: Traditional SVM classifier is used on datasets without additional oversampling or weighting. Sampling methods:
SMOTE [9]: It is used to generate synthetic minority instances as shown in Eq. (1). Borderline [21]: It adds new synthetic instances near the border between the two classes. The minority instances that have more than half of their Safe-Level [22]: It adds new synthetic instances to the security areas of the minority class. The AdaSyn [15]: In adaptive synthetic sampling, the number of synthetic instances generated for each minority instance is proportional to the number of majority neighbors it has. ROS [7]: Instances of the minority class are randomly selected and added to the training dataset. WK-SMOTE [16]: It oversampling in the feature space of SVM classifier and then the synthetic instances and the original instances are distinguished by weighting.
Proper selection of hyper-parameters is critical to obtain a robust classifier that generalizes well to unknown instances. The hyper-parameters of SVM include the choice of the kernel, its parameters, and the soft-margin constraint
Tables 3–5 show the AUC values, Recall rate and G-mean values of the eight algorithms on nineteen datasets, respectively. The rank of each algorithm is given in brackets and the score with maximum value is highlighted in bold for each dataset. The average rank of each algorithm is given in last row. Table 6 shows the comprehensive performance of eight algorithms on different datasets. Comprehensive performance refers to the average of AUC, Recall and G-mean rankings of the algorithm on all datasets.
AUC score and rank of the eight algorithms on nineteen datasets
AUC score and rank of the eight algorithms on nineteen datasets
Recall score and rank of the eight algorithms on nineteen datasets
G-Mean score and rank of the eight algorithms on nineteen datasets
Comprehensive performance of eight algorithms
It can be seen that seven kinds of sampling techniques can enhance classification performance on imbalanced datasets with the classifier compared to the original data processing. But the ROS leads to classifier overfitting, resulting in limited performance improvement, which is proved in the experiments. On the other hand, SMOTE and its variants of the algorithm can avoid overfitting, and improve the performance of the imbalance problem using the oversampling technique and interpolation theory. However, the performance of datasets using these algorithms are not always better than ROS. For example, the German and glass6 datasets show better performance when using the ROS. The experiments prove that there is no absolute best algorithm to deal with all imbalanced datasets.
Table 3 shows the AUC results obtained by the algorithms, it objectively reflects the comprehensive prediction ability of classifiers for imbalanced datasets. EDKS algorithm achieves the highest AUC score on five datasets, especially on ecoli0146vs5, ecoli067vs35 and sonar. In addition, the performance of EDKS algorithm is generally good on other datasets, and significantly improved the performance of AUC in ecoli0146vs5, ecoli067vs35 and sonar. However, the AUC scores on some datasets are not satisfactory, such as Pima and vehicle0. It may be that the ED of these two datasets are very small, and the technique which we propose can not distinguish the minority class from the majority class very well, which leads to the unsatisfactory AUC score.
For imbalanced problems, the recall of the minority class is important, which means the proportion of the minority classes classified correctly, and the minority class is usually more important than the majority class. Table 4 shows that the EDKS algorithm significantly improves recall on 17 of the 19 datasets. Our proposed algorithm achieves the optimal G-mean value on some datasets, but the average ranking of G-mean score is not ideal. This is because the EDKS algorithm only uses the original data as the seed instance and does not consider the synthesis instance, thus increase the risk of overfitting.
ED and IR on all dataset, where blue and red represent ED and IR, respectively.
Figure 4 shows the imbalance degree between ED and IR metrics for all datasets. It can be seen that if these datasets are sorted by ED and IR respectively, the order obtained is basically the same. However, Some datasets have higher ED ranking than IR, such as ecoli067vs35, ecoli0vs1, glass1, etc. Interestingly, EDKS generally performs better on these datasets than other algorithms. It shows that the EDKS algorithm is more effective when the data distribution is more imbalanced than the instance size.
As for the robustness of the EDKS algorithm, if the noise is a majority class instance, there is no significant impact on the algorithm. However, if the noise comes from the minority class, it may have an impact on the subsequent model learning process. This is because the noise may be involved in the oversampling process. The following strategies can be used to deal with this problem: The method of outlier detection is used to detect the noise of the minority instances after the datasets is mapped to the feature space, such as KF algorithm [48]. In the process of oversampling, do not use noise data. If there was a missing value, it could be filled in the input space by using the padding technique, like [49], [50], and etc.
In this paper, EDKS, an entropy difference and kernel-based oversampling algorithm, is proposed to balance the class distribution in an SVM classifier. It generalizes the popular SMOTE algorithm for nonlinear separable data by finding sparse regions and generating minority instances in the feature space of classifier instead of the input data space. Compared with many baseline methods on the benchmark imbalanced datasets in the UCI machine learning library and KEEL database, the proposed algorithm can improve AUC, recall rate and G-mean score.
In future research, we will study the influence of different intra-class distributions on entropy in depth, and further improve the diversity of synthetic data by developing techniques to expand the range of seeds and neighbors in kernel space. For the case of
Footnotes
Acknowledgments
The authors would like to thank the anonymous reviewers for their valuable comments and suggestions. This research was supported by National Natural Science Foundation of China (61573266).
