Abstract
Ensemble learning is an excellent method for imbalance classification. However, the existing ensemble methods often ignore noise in the dataset, which may reduce the accuracy of classifier. In this paper, we propose a density-based undersampling algorithm (DBU) and integrate it with AdaBoost (DBUBoost) to improve the classification performance. The major contribution of this paper is the development of an undersampling strategy for dealing with both noise and class imbalance problem. We first divide the examples from each class into three categories: useful examples, noise and potentially useful examples. Then we introduce a similarity coefficient to distinguish the examples from each category. Through a selection mechanism based on similarity coefficients, we retain the useful examples and remove the noisy examples. To demonstrate the effectiveness, we compare our DBUBoost with four ensemble methods and three anti-noise methods. The experiments were conducted on 9 KEEL datasets and their noise-modified datasets. Experimental results have shown that our DBUBoost performs better than other state-of-the-art methods.
Introduction
Imbalance classification has been a research hotspot in many of current applications, such as financial crisis prediction [1], medical diagnosis [2] and e-mail filtering [3]. Focusing on two-class problem, a dataset is imbalanced if one class (called minority class) has fewer examples than the other (called majority class) [4]. Due to the asymmetric data distribution of imbalanced dataset, traditional algorithms often result in misclassification of the minority class examples. Actually, researchers tend to focus more on the minority class [1, 2, 3, 18, 19].
Ensemble learning, which embeds the sampled method into bagging or boosting, has been proven to be an effective way to address the class imbalance problem [5, 6, 7]. Among these ensemble algorithms, class imbalance problem is mainly overcome by oversampling methods or undersampling methods. In these two sampled approaches, the oversampling approach solves the class imbalance problem by generating synthetic examples. However, oversampling approach may result in over-fitting of the model in its building process [8]. As a result, some researchers believe that the undersampling approaches are more competitive than the oversampling approaches [5, 6, 7]. The simplest undersampling approach is random undersampling. Many related ensemble algorithms have been proposed by embedding random undersampling into Bagging or Boosting, e.g. UnderBagging [10], RUSBoost [5], EasyEnsemble and BalanceCascade [11]. However, random undersampling may lose some useful information, thus some researchers propose other algorithms to solve this problem, such as EUSBoost [6], RHSBoost [20], CBUBoost [7], etc. Since these proposed ensemble methods ignore the noisy examples in the dataset, it is likely to reduce the accuracy of classification. Several algorithms are proposed to eliminate the influence of noise. Jing et al. [13] combined K-nearest neighbor algorithm with DBSCAN algorithm to remove the noisy and redundant examples in the majority class. Sáez et al. [14] used an iterative-partitioning filter (IPF) to remove the noisy examples generated by the synthetic minority over-sampling technique (SMOTE). Kang et al. [17] proposed an undersampling scheme that uses K-nearest neighbor filter to remove the noisy examples in the minority class. To the best of our knowledge, there is no algorithm for imbalance classification that can remove noise from both majority and minority class.
Not all examples in the dataset are useful for classification [13, 14]. Some examples may be useless for classification, while some may degrade the classification performance. The former is called redundant example and the latter is called noise. Real-world dataset always contains many examples with similar characteristics. For these similar examples, we can select one of them to represent the whole group, while the remaining examples are treated as redundant examples. Meanwhile, noise is the example which locates deep inside the region of the other class. In feature space, similar examples are relatively close to each other while noisy example is far from other examples belonging to the same class [7]. Thus, these similar examples have a high density, while the noisy example has a low density.
In this paper, we propose a density-based undersampling algorithm (DBU) and embed it into AdaBoost (DBUBoost) to improve the classification performance. The purpose of our method is to build a non-noise balanced dataset which consists of examples with different characteristics. To achieve this purpose, we first divide the examples from each class into three categories according to their KNN and densities: useful examples, noise and potentially useful examples. Potentially useful examples can be divided into relatively useful examples and redundant examples based on their distances to useful examples. Then we introduce a similarity coefficient to distinguish the examples from each category, which the similarity coefficients of noisy and redundant examples are lower than other examples. Note that the noisy examples have a similarity coefficient equal to 0. For the majority class, we select a certain number of examples based on their similarity coefficients as the elements of re-sampled majority class. For the minority class, we remove the noise by just deleting the examples whose similarity coefficient is 0.
The main contribution of this paper is twofold. First, we propose an undersampling algorithm (DBU) to retain useful examples and remove noisy examples, which has never been done before. Second, we propose an anti-noise ensemble algorithm by embedding our DBU into Adaboost.M2, which can remove the noisy examples from both majority and minority class.
The remainder of this paper is organized as follows. Section 2 briefly describes related works for handling the class imbalance problem. Section 3 describes our DBU and DBUBoost in detail. Section 4 compares our DBUBoost with some state-of-the-art methods on KEEL datasets and their noise-modified datasets, Section 5 summarizes this paper.
Related works
The imbalance problem rises from the under-representation of the important minority class, which leads to the fact that learned models tend to focus more on the majority class examples, ignoring the minority class examples. Many algorithms have been proposed to solve such imbalance problem, which can be mainly divided into the following three categories: data level, algorithmic level and ensemble learning.
Data level
Data level algorithms try to balance the data distribution by oversampling the minority class or undersampling the majority class. The former is named as oversampling methods, while the later is named as undersampling methods.
The simplest oversampling method is random oversampling, which constructs a balanced dataset by randomly replicating some examples from the minority class. Actually, random oversampling does not enhance the representation for the minority class. Thus, Chawla et al. [22] propose SMOTE to generate synthetic minority class examples by linear interpolation. However, SMOTE may result in the over-fitting problem, since its generated minority class examples locate inside the convex hull of the majority class examples. Generally, these anomalous examples are named as noise. There are two approaches to avoid the appearance of noise. The first approach is to modify the generation mechanism of the artificial examples in SMOTE, including SL-SMOTE [23], RSB-SMOTE [24], B-SMOTE [25], LN-SMOTE [26], etc. The second approach is to embed a selection mechanism of the artificial examples into SMOTE, including SMOTE-TL [27], SMOTE-IPF [14], etc. However, these variants of SMOTE cannot eliminate the noisy examples in the original dataset. Furthermore, a main shortage of these variants is the large computational cost.
The simplest undersampling method is random undersampling, which constructs a balanced dataset by randomly removing some examples from the majority class. However, random undersampling may lose some useful information. Many algorithms have been proposed to overcome this shortage. In order to obtain a useful subset of the original dataset, EUS [6] randomly samples several data subsets and then evolves them until the currently best re-sampled dataset cannot be further improved. CBU [7] clusters the majority class examples by the k-means clustering algorithm [9] and uses the cluster centers or the nearest neighbors of the cluster centers to represent the majority class.
Algorithmic level
Algorithmic level algorithms modify the traditional classifiers for balanced datasets, such that the modified algorithms can be applied to imbalanced datasets. Tax and Duin [28] proposed support vector domain description (SVDD) by modifying support vector machine. Since SVDD only needs one class to train the learning model, it can be applied in the imbalance classification. Furthermore, some researchers incorporate the cost-sensitive strategy into neural networks [29], SVM [30], AdaBoost [31], trees [32] and use these modified classifiers to classify the imbalance dataset. In recent years, some new approaches have been proposed to incorporate cost-sensitive strategy into the deep CNN [33, 34, 35, 36]. However, these algorithmic level methods require special knowledge of both the corresponding classifier and the application domain, which may be a complicated task for researchers.
Ensemble learning
Ensemble learning, which embeds the data level method into bagging or boosting, has been proven to be an effective way to address the class imbalance problem [5, 6, 7]. Among these ensemble algorithms, the class imbalance problem is mainly overcome by the data level methods. According to the type of data level methods, ensemble methods can be divided into three categories: oversampling-, undersampling- and hybrid-based ensembles.
Oversampling-based ensembles make use of oversampling methods to deal with imbalance problem. Chawla et al. [39] proposed SMOTEBoost to learn the minority class by embedding SMOTE into boosting. Wang and Yao [40] introduced SMOTEBagging to explore the impact of diversity on imbalanced datasets. Like SMOTE, these two ensembles methods also generate noise during linear interpolation.
Undersampling-based ensembles make use of oversampling methods to deal with imbalance problem. Many related algorithms have been proposed by embedding random undersampling into Bagging or Boosting, e.g. UnderBagging [10], RUSBoost [5], EasyEnsemble and BalanceCascade [11]. Furthermore, Galar et al. [6] proposed EUSBoost algorithm, which builds a balanced dataset by using evolutionary undersampling method (EUS). Lin et al. [7] presented clustering-based ensemble algorithm (CBUBoost) to retain the useful information in imbalanced dataset. However, these ensembles methods have never considered noise in the dataset.
Hybrid-based ensembles make use of both oversampling and undersampling methods to deal with imbalance problem. Under a bagging scheme, Wang and Yao [38] proposed UnderOverBagging algorithm by combining random oversampling with random undersampling. Gong and Kim [20] introduced RHSBoost algorithm by using random undersampling and ROSE sampling [37] under a boosting scheme. Analogously, these two methods also ignore noise in the dataset.
The proposed method
In this section, we first describe our DBU algorithm in detail, and then we propose DBUBoost by embedding DBU into AdaBoost.M2.
Proposed DBU
Similar examples are relatively close to each other and noisy example is far from other examples belonging to the same class in feature space. Based on this assumption, we propose a density-based undersampling algorithm called DBU. Its purpose is to build a balanced dataset without noisy and redundant examples. To achieve this purpose, we calculate a similarity coefficient for each example, which the similarity coefficients of noisy and redundant examples are lower than other examples. To implement DBU, we first calculate a local density
In this paper, we estimate the local density of example
To detect the density-based noise, we keep
Our DBU in two dimensions. a. Data distribution; b. Decision graph 
Obviously, the local density of example
According to their KNN and local densities, we can divide the majority class examples into three categories. The three categories are defined as follows:
Useful examples, or called central examples (e.g. points 6 and 10 in Fig. 1a): Example Noise (e.g. points 26–28 in Fig. 1a): Example Potentially useful examples: Examples other than noisy and useful examples are potentially useful examples. Potentially useful examples include examples far from the central example and examples close to the central example. The former is called relatively useful examples (e.g. points 13, 17, 25, etc.) and the latter is called redundant examples (e.g. points 2, 11, 22, etc.).
In our DBU, central examples are the preferred examples selected as the elements of re-sampled majority class, followed by relatively useful examples and redundant examples, and noise is the example to be removed. To highlight noisy and redundant examples, we propose a new approach to measure the similarity, called similarity coefficient (
For a central example, the similarity coefficient
Specially, for the example with highest density, the distance is calculated by
For a noisy example, the similarity coefficient
For a potentially useful example, the similarity coefficient
As Fig. 1c shows, the central examples 6 and 10 have the highest values
Note that the value of
Boosting is an effective technique that can improve the performance of any weak classifier [5]. In boosting, the classifiers are trained iteratively, with the weights of the training examples modified according to the performance of the previous classifiers. The main idea is that the classifier should pay more attention to those examples that are difficult to learn. For imbalance classification, since the minority class examples are usually more likely to be misclassified, some researchers introduce boosting into the field of imbalance classification to improve the accuracy of minority class.
The framework of our DBUBoost.
Based on RUSBoost, we embed our DBU into AdaBoost.M2 to improve the performance for minority class (DBUBoost). Some researchers believe that the diversity of base classifiers is helpful to improve the performance of ensemble methods [6]. Without affecting the accuracy of base classifiers, our DBUBoost modifies the step 4 in our DBU to improve the diversity of base classifiers by selecting
The details of our DBUBoost algorithm are as follows. Let
In step 1, the weight of each example in the training dataset
Experimental setup
In this paper, we select 9 two-class imbalanced datasets from the KEEL dataset repository (
Datasets for experiment
Datasets for experiment
We select four ensemble methods RUSBoost [5], EUSBoost [6], CBUBoost [7], RHSBoost [20] and three anti-noise methods DBSCAN-KNN [13], EE-KF [17], SMOTE-IPF [14] as the comparing methods. The experiments are organized in two sections. In the first section, the experiments are conducted on 9 two-class datasets to test the performance of our DBUBoost. In the second section, the experiments are conducted on their noise-modified datasets to test the noise immunity of our DBUBoost. All the experiments are conducted with C4.5 as the base classifier, which has been widely used in imbalance classification. Note that the parameters in C4.5 are set according to the default parameters used in the Weka software package. In our DBUBoost, the parameter
The average accuracy is the most popular evaluation criterion to evaluate the classification performance of classifiers. Since this evaluation criterion usually results in misclassification of the minority class examples, it has been proved unsuitable for imbalance classification. Consequently, other evaluation criteria have been proposed successively, including F-measure, ROC and AUC. In this paper, all experiments use AUC as the evaluation criterion [16].
In the first experiment, we compare our DBUBoost with four ensemble algorithms: RUSBoost, EUSBoost, CBUBoost, RHSBoost and three anti-noise algorithms: DBSCAN-KNN, EE-KF and SMOTE-IPF. Table 2 shows their AUC results, where Avg. are their average AUC results on the 9 datasets. Table 3 shows the Wilcoxon’s rank-sum test results for the comparison between DBUBoost and the rest of algorithms, where
AUC results on KEEL datasets (%)
AUC results on KEEL datasets (%)
Wilcoxon rank-sum test results for the comparison between our DBUBoost and other methods
In the second experiment, we corrupt the class labels and attribute values of some examples in the 9 KEEL datasets by introducing class noise and attribute noise separately, and then conduct experiments on these noise-modified datasets to verify the noise immunity of these algorithms. Given a noise level
AUC results on noise-modified KEEL datasets (%)
AUC results on noise-modified KEEL datasets (%)
Wilcoxon rank-sum test results on class noise-modified datasets
Wilcoxon rank-sum test results on attribute noise-modified datasets
For the class noise, we randomly choose ( For the attribute noise, we randomly choose [
In this study, we introduce two different levels of noise (
Imbalance classification is an important task in many domains. Many algorithms have been proposed to solve the imbalance problem. However, the existing algorithms always lose useful examples or ignore the noisy examples in the dataset. In this paper, a new anti-noise ensemble algorithm is proposed to retain the useful examples and remove the noisy examples. The experimental results on 9 KEEL datasets show that our DBUBoost achieves the best performance. This fact is because our DBUBoost can reflect the real distribution of data, which means that the re-sampled datasets retain the useful information of the original datasets. Furthermore, the experimental results on noise-modified datasets show that our DBUBoost is the most stable algorithm. It results from that our DBUBoost can remove the noisy examples in both majority and minority class. Thus, our proposed algorithm can be widely used for the classification on imbalanced datasets, especially on imbalanced datasets containing noise.
Footnotes
Acknowledgments
This work was supported by National Natural Science Foundation of China [Grant 51275431] and the Sichuan Province Science and Technology Support Program Project [Grant 2016GZ0194, Grant 2018GZ0361].
