An anti-noise ensemble algorithm for imbalance classification

Abstract

Ensemble learning is an effective approach to imbalance classification. However, existing ensemble methods often ignore noise in the dataset, which may reduce the accuracy of the classifier. In this paper, we propose a density-based undersampling algorithm (DBU) and integrate it with AdaBoost (DBUBoost) to improve the classification performance. The major contribution of this paper is an undersampling strategy that deals with both noise and the class imbalance problem. We first divide the examples from each class into three categories: useful examples, noise and potentially useful examples. Then we introduce a similarity coefficient to distinguish the examples from each category. Through a selection mechanism based on similarity coefficients, we retain the useful examples and remove the noisy examples. To demonstrate its effectiveness, we compare DBUBoost with four ensemble methods and three anti-noise methods. The experiments were conducted on 9 KEEL datasets and their noise-modified versions. The experimental results show that DBUBoost achieves the highest average AUC among the compared methods.

Keywords

Imbalance classification, noise, undersampling, AdaBoost

1. Introduction

Imbalance classification has been a research hotspot in many current applications, such as financial crisis prediction [1], medical diagnosis [2] and e-mail filtering [3]. In the two-class setting, a dataset is imbalanced if one class (called the minority class) has fewer examples than the other (called the majority class) [4]. Because of this asymmetric data distribution, traditional algorithms often misclassify the minority class examples, even though the minority class is usually the class of greater interest to researchers [1, 2, 3, 18, 19].

Ensemble learning, which embeds a sampling method into bagging or boosting, has been proven to be an effective way to address the class imbalance problem [5, 6, 7]. In these ensemble algorithms, the class imbalance problem is mainly overcome by oversampling or undersampling. Oversampling solves the class imbalance problem by generating synthetic examples, but it may cause the model to over-fit during training [8]. As a result, some researchers believe that undersampling approaches are more competitive than oversampling approaches [5, 6, 7]. The simplest undersampling approach is random undersampling. Many related ensemble algorithms have been proposed by embedding random undersampling into bagging or boosting, e.g. UnderBagging [10], RUSBoost [5], EasyEnsemble and BalanceCascade [11]. However, random undersampling may lose useful information, so other algorithms have been proposed to solve this problem, such as EUSBoost [6], RHSBoost [20] and CBUBoost [7]. Since these ensemble methods ignore the noisy examples in the dataset, their classification accuracy is likely to suffer. Several algorithms have been proposed to eliminate the influence of noise. Jing et al. [13] combined the K-nearest neighbor algorithm with the DBSCAN algorithm to remove the noisy and redundant examples in the majority class. Sáez et al. [14] used an iterative-partitioning filter (IPF) to remove the noisy examples generated by the synthetic minority over-sampling technique (SMOTE). Kang et al. [17] proposed an undersampling scheme that uses a K-nearest neighbor filter to remove the noisy examples in the minority class. To the best of our knowledge, there is no algorithm for imbalance classification that can remove noise from both the majority and the minority class.

Not all examples in the dataset are useful for classification [13, 14]. Some examples may be useless for classification, while others may degrade the classification performance. The former are called redundant examples and the latter are called noise. Real-world datasets often contain many examples with similar characteristics. For these similar examples, we can select one of them to represent the whole group, while the remaining examples are treated as redundant. Meanwhile, a noisy example is one that lies deep inside the region of the other class. In feature space, similar examples are relatively close to each other, while a noisy example is far from the other examples of its own class [7]. Thus, similar examples have a high density, while a noisy example has a low density.

In this paper, we propose a density-based undersampling algorithm (DBU) and embed it into AdaBoost (DBUBoost) to improve the classification performance. The purpose of our method is to build a noise-free balanced dataset that consists of examples with different characteristics. To achieve this, we first divide the examples from each class into three categories according to their K nearest neighbors (KNN) and densities: useful examples, noise and potentially useful examples. Potentially useful examples can be further divided into relatively useful examples and redundant examples based on their distances to useful examples. Then we introduce a similarity coefficient to distinguish the examples from each category, such that the similarity coefficients of noisy and redundant examples are lower than those of other examples. Note that noisy examples have a similarity coefficient equal to 0. For the majority class, we select a certain number of examples based on their similarity coefficients as the elements of the re-sampled majority class. For the minority class, we remove the noise by simply deleting the examples whose similarity coefficient is 0.

The contribution of this paper is twofold. First, we propose an undersampling algorithm (DBU) that retains useful examples and removes noisy examples, which, to the best of our knowledge, has not been done before. Second, we propose an anti-noise ensemble algorithm by embedding DBU into AdaBoost.M2, which can remove the noisy examples from both the majority and the minority class.

The remainder of this paper is organized as follows. Section 2 briefly describes related works on the class imbalance problem. Section 3 describes our DBU and DBUBoost in detail. Section 4 compares DBUBoost with several state-of-the-art methods on KEEL datasets and their noise-modified versions. Section 5 summarizes this paper.

2. Related works

The imbalance problem arises from the under-representation of the important minority class, which causes learned models to focus on the majority class examples and ignore the minority class examples. Many algorithms have been proposed to solve this problem, and they can be divided into three main categories: data level, algorithmic level and ensemble learning.

2.1 Data level

Data level algorithms try to balance the data distribution by oversampling the minority class or undersampling the majority class. The former are called oversampling methods, while the latter are called undersampling methods.

The simplest oversampling method is random oversampling, which constructs a balanced dataset by randomly replicating some examples from the minority class. However, random oversampling does not enhance the representation of the minority class. Thus, Chawla et al. [22] proposed SMOTE, which generates synthetic minority class examples by linear interpolation. However, SMOTE may result in over-fitting, since some of the generated minority class examples lie inside the convex hull of the majority class examples. Generally, these anomalous examples are regarded as noise. There are two approaches to avoid the appearance of noise. The first approach is to modify the generation mechanism of the artificial examples in SMOTE, including SL-SMOTE [23], RSB-SMOTE [24], B-SMOTE [25], LN-SMOTE [26], etc. The second approach is to embed a selection mechanism for the artificial examples into SMOTE, including SMOTE-TL [27], SMOTE-IPF [14], etc. However, these variants of SMOTE cannot eliminate the noisy examples in the original dataset. Furthermore, a main shortcoming of these variants is their large computational cost.

The simplest undersampling method is random undersampling, which constructs a balanced dataset by randomly removing some examples from the majority class. However, random undersampling may lose useful information. Many algorithms have been proposed to overcome this shortcoming. In order to obtain a useful subset of the original dataset, EUS [6] randomly samples several data subsets and then evolves them until the best re-sampled dataset found so far cannot be further improved. CBU [7] clusters the majority class examples with the k-means clustering algorithm [9] and uses the cluster centers or the nearest neighbors of the cluster centers to represent the majority class.

2.2 Algorithmic level

Algorithmic level algorithms modify traditional classifiers designed for balanced datasets so that they can be applied to imbalanced datasets. Tax and Duin [28] proposed support vector domain description (SVDD) by modifying the support vector machine. Since SVDD only needs one class to train the learning model, it can be applied to imbalance classification. Furthermore, some researchers incorporate a cost-sensitive strategy into neural networks [29], SVM [30], AdaBoost [31] and trees [32], and use these modified classifiers on imbalanced datasets. In recent years, some new approaches have been proposed to incorporate a cost-sensitive strategy into deep CNNs [33, 34, 35, 36]. However, these algorithmic level methods require special knowledge of both the corresponding classifier and the application domain, which may be a complicated task for researchers.

2.3 Ensemble learning

Ensemble learning, which embeds the data level method into bagging or boosting, has been proven to be an effective way to address the class imbalance problem [5, 6, 7]. Among these ensemble algorithms, the class imbalance problem is mainly overcome by the data level methods. According to the type of data level methods, ensemble methods can be divided into three categories: oversampling-, undersampling- and hybrid-based ensembles.

Oversampling-based ensembles make use of oversampling methods to deal with the imbalance problem. Chawla et al. [39] proposed SMOTEBoost to learn the minority class by embedding SMOTE into boosting. Wang and Yao [40] introduced SMOTEBagging to explore the impact of diversity on imbalanced datasets. Like SMOTE, these two ensemble methods also generate noise during linear interpolation.

Undersampling-based ensembles make use of undersampling methods to deal with the imbalance problem. Many related algorithms have been proposed by embedding random undersampling into bagging or boosting, e.g. UnderBagging [10], RUSBoost [5], EasyEnsemble and BalanceCascade [11]. Furthermore, Galar et al. [6] proposed the EUSBoost algorithm, which builds a balanced dataset by using the evolutionary undersampling method (EUS). Lin et al. [7] presented a clustering-based ensemble algorithm (CBUBoost) to retain the useful information in the imbalanced dataset. However, these ensemble methods do not consider noise in the dataset.

Hybrid-based ensembles make use of both oversampling and undersampling methods to deal with the imbalance problem. Under a bagging scheme, Wang and Yao [38] proposed the UnderOverBagging algorithm by combining random oversampling with random undersampling. Gong and Kim [20] introduced the RHSBoost algorithm, which uses random undersampling and ROSE sampling [37] under a boosting scheme. These two methods likewise ignore noise in the dataset.

3. The proposed method

In this section, we first describe our DBU algorithm in detail, and then we propose DBUBoost by embedding DBU into AdaBoost.M2.

3.1 Proposed DBU

In feature space, similar examples are relatively close to each other, while a noisy example is far from the other examples of its own class. Based on this assumption, we propose a density-based undersampling algorithm called DBU. Its purpose is to build a balanced dataset without noisy and redundant examples. To achieve this, we calculate a similarity coefficient for each example, such that the similarity coefficients of noisy and redundant examples are lower than those of other examples. To implement DBU, we first calculate a local density $\rho$ for each example. Then we introduce a similarity coefficient $\delta$ to measure the similarity between different examples. For convenience, we describe DBU for the majority class.

In this paper, we estimate the local density of example $i$ based on the distances to its $K$ nearest neighbors. To introduce our approach, we begin with the definition of $k$ -th distance.

Definition 1: ( $k$ -th distance, $\textit{dis}_{k}$ ) the $k$ -th distance of example $i$ , denoted as $\textit{dis}_{k}(i)$ , is defined as the HVDM distance (Heterogeneous Value Difference Metric) [12] to its $k$ -th nearest neighbor.

$\displaystyle\textit{dis}_{k}(i)=d_{ik}=\textit{HVDM}(i,k)$ (1)

To detect the density-based noise, we keep $K$ as the only variable and use the value $\textit{dis}_{k}(i)$ as a measure to determine the density of example $i$ .

Definition 2: (local density $\rho_{i}$ of example $i$ ) the local density of example $i$ is defined as:

$\displaystyle\rho_{i}=\frac{K}{\sum\limits_{k\in\textit{KNN}}\textit{dis}_{k}(i)}$ (2)

Figure 1.

Our DBU in two dimensions. a. Data distribution; b. Decision graph $\rho$ -n, $K=$ 5; c. Decision graph $\delta$ -n, $K=$ 5.

Obviously, the local density of example $i$ is the inverse of the average distance based on its $K$ nearest neighbors. Note that the local density can be $\infty$ , if there are at least $K$ duplicates of example $i$ in the dataset. For simplicity, we assume that there are no duplicates in the dataset.
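As an illustration of Eq. (2), the following minimal sketch computes the local density of every example from its $K$ nearest neighbors. It is not the authors' implementation: the function name `local_density` is hypothetical, and plain Euclidean distance is used as a stand-in for HVDM, which is adequate only for purely numerical attributes without missing values.

```python
import numpy as np

def local_density(X, K):
    """Return rho_i = K / (sum of distances to the K nearest neighbours), as in Eq. (2)."""
    X = np.asarray(X, dtype=float)
    # Pairwise Euclidean distances (assumption: stand-in for the HVDM distance).
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)            # an example is not its own neighbour
    knn_dist = np.sort(dist, axis=1)[:, :K]   # dis_1(i), ..., dis_K(i)
    return K / knn_dist.sum(axis=1)

# Three neighbouring points on a line and one isolated point.
X = [[0.0], [1.0], [2.0], [10.0]]
rho = local_density(X, K=2)
print(rho)
```

The middle point of the cluster receives the highest density and the isolated point the lowest, which is the behavior the categories defined next rely on.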

According to their KNN and local densities, we can divide the majority class examples into three categories. The three categories are defined as follows:

• Useful examples, also called central examples (e.g. points 6 and 10 in Fig. 1a): Example $i$ is useful if $\rho_{i}>\rho_{k}$ for all $k\in\textit{KNN}$.

• Noise (e.g. points 26–28 in Fig. 1a): Example $i$ is noise if $\rho_{i}<\rho_{k}$ for all $k\in\textit{KNN}$ and $\textit{dis}_{1}(i)>\textit{dis}_{1}(i_{1})$, where example $i_{1}$ is the nearest neighbor of example $i$.

• Potentially useful examples: Examples other than noisy and useful examples are potentially useful examples. Potentially useful examples include examples far from the central example and examples close to the central example. The former are called relatively useful examples (e.g. points 13, 17, 25, etc.) and the latter are called redundant examples (e.g. points 2, 11, 22, etc.).
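The three categories can be sketched in a few lines. The code below is an illustrative reading of the definitions, not the authors' code: `categorize` and the label strings are hypothetical names, Euclidean distance again stands in for HVDM, and the noise test uses the inequality $\textit{dis}_{1}(i)>\textit{dis}_{1}(i_{1})$ as it is written in Step 3c of the DBU pseudo-code.

```python
import numpy as np

def categorize(X, K):
    """Label each example as 'useful', 'noise' or 'potential' from its KNN and densities."""
    X = np.asarray(X, dtype=float)
    dist = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)
    knn = np.argsort(dist, axis=1)[:, :K]              # indices of the K nearest neighbours
    knn_dist = np.take_along_axis(dist, knn, axis=1)   # dis_1(i), ..., dis_K(i)
    rho = K / knn_dist.sum(axis=1)                     # local density
    dis1 = knn_dist[:, 0]                              # distance to the nearest neighbour
    labels = []
    for i in range(len(X)):
        neighbour_rho = rho[knn[i]]
        nearest = knn[i, 0]
        if np.all(rho[i] > neighbour_rho):
            labels.append("useful")                    # denser than all of its neighbours
        elif np.all(rho[i] < neighbour_rho) and dis1[i] > dis1[nearest]:
            labels.append("noise")                     # sparser than all neighbours and isolated
        else:
            labels.append("potential")
    return labels

labels = categorize([[0.0], [1.0], [2.0], [10.0]], K=2)
print(labels)
```

On this toy input the cluster center is labeled useful, the isolated point is labeled noise, and the two remaining points are potentially useful.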

In our DBU, central examples are the preferred examples selected as the elements of re-sampled majority class, followed by relatively useful examples and redundant examples, and noise is the example to be removed. To highlight noisy and redundant examples, we propose a new approach to measure the similarity, called similarity coefficient ( $\delta$ ).

For a central example, the similarity coefficient $\delta$ is measured by calculating its minimum distance to other examples with higher densities, so that the computed distance is much larger than the typical nearest neighbor distance [15].

$\displaystyle\delta_{i}=\mathop{\min}\limits_{j:\rho_{j}>\rho_{i}}(d_{ij})$ (3)

As a special case, for the example with the highest density, the similarity coefficient is calculated as $\delta_{i}=\max_{j}(d_{ij})$ .

For a noisy example, the similarity coefficient $\delta$ is set as 0, so that the noisy example can be removed conveniently.

$\displaystyle\delta_{i}=0$ (4)

For a potentially useful example, the similarity coefficient $\delta$ is measured by calculating its distance to the neighbor with the highest density, so that relatively useful example has a higher value $\delta$ than redundant example.

$\displaystyle\delta_{i}=(d_{il})_{l:\rho_{l}>\rho_{k},\ l,k\in\textit{KNN}}$ (5)

As Fig. 1c shows, the central examples 6 and 10 have the highest values $\delta$ , while the noisy examples 26–28 have the lowest values $\delta=$ 0. For the potentially useful examples, relatively useful examples (13, 17, 25, etc.) have higher values than redundant examples (2, 11, 22, etc.). We select the top $N_{-}$ examples in descending order of $\delta$ to represent the majority class ( $N_{-}$ is the number of minority class examples), so that the noisy and redundant examples are removed automatically. The pseudo-code for our DBU is as follows.

The Pseudo-code for DBU Algorithm
Step 1. Input the majority class: $M_{a}=\{x_{i}\in D\mid y_{i}=c_{1}\}$ and the minority class: $M_{i}=\{x_{i}\in D\mid y_{i}=c_{2}\}$
Step 2. Get the number of majority class examples ( $N_{+}$ ) and the number of minority class examples ( $N_{-}$ )
Step 3. Do for $i=1,2,\ldots,N_{+}$
a. Calculate $k$ -th distance: $\textit{dis}_{k}(i)=\textit{HVDM}(i,k)$
b. Calculate the local density: $\rho_{i}=K/\sum\limits_{k\in\textit{KNN}}\textit{dis}_{k}(i)$
c. Calculate the similarity coefficient: $\displaystyle\delta_{i}=\begin{cases}\min\limits_{j:\rho_{j}>\rho_{i}}(d_{ij})&\text{if }\rho_{i}>\rho_{k},\ \forall k\in\textit{KNN}\\ 0&\text{if }\rho_{i}<\rho_{k},\ \forall k\in\textit{KNN},\text{ and }\textit{dis}_{1}(i)>\textit{dis}_{1}(i_{1})\\ (d_{il})_{l:\rho_{l}>\rho_{k},\ l,k\in\textit{KNN}}&\text{otherwise}\end{cases}$
Step 4. $S\_M_{a}\leftarrow$ remove the noisy examples from $M_{a}$ and select the top $N_{-}$ examples in descending order of $\delta$
Step 5. Output the balanced dataset: $S^{\prime}=S\_M_{a}\cup M_{i}$
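Steps 3c and 4 can be sketched as follows, given a distance matrix, the local densities, the neighbor indices and the category of each example. This is a hedged illustration rather than the authors' code: the function names, the label strings and the toy inputs are assumptions, and the "otherwise" branch is read as the distance to the neighbor with the highest density, as described in the text before Eq. (5).

```python
import numpy as np

def similarity_coefficients(dist, rho, knn, labels):
    """Compute delta_i for every example (Step 3c)."""
    n = len(rho)
    delta = np.zeros(n)
    for i in range(n):
        if labels[i] == "noise":
            delta[i] = 0.0                                  # Eq. (4)
        elif labels[i] == "useful":
            higher = np.flatnonzero(rho > rho[i])           # examples denser than i
            others = np.arange(n) != i
            # Eq. (3); the densest example falls back to its maximum distance.
            delta[i] = dist[i, higher].min() if higher.size else dist[i, others].max()
        else:
            l = knn[i][np.argmax(rho[knn[i]])]              # densest neighbour of i
            delta[i] = dist[i, l]                           # Eq. (5)
    return delta

def select_majority(delta, n_minority):
    """Drop noise (delta == 0) and keep the top n_minority examples by delta (Step 4)."""
    order = np.argsort(-delta, kind="stable")
    keep = [int(i) for i in order if delta[i] > 0][:n_minority]
    return np.array(keep, dtype=int)

# Toy majority class: points 0, 1, 2 and 10 on a line, K = 2.
dist = np.array([[0, 1, 2, 10], [1, 0, 1, 9], [2, 1, 0, 8], [10, 9, 8, 0]], dtype=float)
rho = np.array([2 / 3, 1.0, 2 / 3, 2 / 17])
knn = np.array([[1, 2], [0, 2], [1, 0], [2, 1]])
labels = ["potential", "useful", "potential", "noise"]
delta = similarity_coefficients(dist, rho, knn, labels)
kept = select_majority(delta, n_minority=2)
print(delta, kept)
```

The central example gets the largest coefficient, the noisy example gets 0 and is never kept, and ties among the remaining examples are broken by index order in this sketch.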

Note that the value of $K$ in our DBU determines whether an example is noise. As Fig. 1a shows, when $K=$ 5, point 26 is a noisy example. However, when $K=$ 27, point 26 is not a noisy example. The best value of $K$ is selected based on the experimental results, which will be discussed in Section 4.1.

3.2 Embedding DBU into AdaBoost

Boosting is an effective technique that can improve the performance of any weak classifier [5]. In boosting, the classifiers are trained iteratively, with the weights of the training examples modified according to the performance of the previous classifiers. The main idea is that the classifier should pay more attention to those examples that are difficult to learn. For imbalance classification, since the minority class examples are usually more likely to be misclassified, some researchers introduce boosting into the field of imbalance classification to improve the accuracy of minority class.

Figure 2.

The framework of our DBUBoost.

The Pseudo-code for DBUBoost Algorithm
Step 1. Initialize $W_{1}(i)=1/N$
Step 2. Do for $t=1,2,\ldots,T$
a. Get a balanced training dataset $S^{\prime}_{t}$ with weight distribution $W^{\prime}_{t}$ using our DBU
b. Train WeakLearner using C4.5, providing it with dataset $S^{\prime}_{t}$ and its weight distribution $W^{\prime}_{t}$
c. Get a weak hypothesis $h_{t}:X\times Y\to$ [0, 1]
d. Calculate the pseudo-loss based on $D$ and $W_{t}$ : $\varepsilon_{t}=\sum\limits_{(i,y):y_{i}\neq y}W_{t}(i)\,(1-h_{t}(x_{i},y_{i})+h_{t}(x_{i},y))$
e. Calculate the weight update parameter: $\alpha_{t}=\varepsilon_{t}/(1-\varepsilon_{t})$
f. Update $W_{t}$ : $W_{t+1}(i)=W_{t}(i)\,\alpha_{t}^{(1/2)(1+h_{t}(x_{i},y_{i})-h_{t}(x_{i},y:y\neq y_{i}))}$
g. Normalize $W_{t+1}$ : $W_{t+1}(i)=W_{t+1}(i)/\sum\limits_{j}W_{t+1}(j)$
Step 3. Output the final hypothesis: $H(x)=\mathop{\arg\max}\limits_{y\in Y}\sum\limits_{t=1}^{T}h_{t}(x,y)\log(1/\alpha_{t})$

Following RUSBoost, we embed DBU into AdaBoost.M2 to improve the performance on the minority class (DBUBoost). Some researchers believe that the diversity of base classifiers helps to improve the performance of ensemble methods [6]. To improve the diversity of base classifiers without affecting their accuracy, DBUBoost modifies Step 4 of DBU: instead of taking the top $N_{-}$ examples, it selects $N_{-}$ examples at random, with selection probabilities specified by $\delta$ . In other words, the chance that example $i$ of the majority class is selected as an element of the re-sampled majority class is governed by $\delta_{i}$ . A noisy example has $\delta=0$ , so DBUBoost can never select it. Analogously, DBU can also remove the noisy examples in the minority class by simply removing the examples with $\delta=0$ . Figure 2 shows the framework of DBUBoost.
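The randomized selection can be sketched as weighted sampling without replacement. The paper does not state how $\delta$ is turned into probabilities, so the normalization $\delta_{i}/\sum_{j}\delta_{j}$ used below is an assumption, as is the function name `sample_majority`.

```python
import numpy as np

def sample_majority(delta, n_minority, rng):
    """Draw majority-class indices with probability proportional to delta (assumed rule)."""
    delta = np.asarray(delta, dtype=float)
    p = delta / delta.sum()                     # assumes at least one delta is positive
    # Sampling without replacement cannot return more items than have non-zero probability.
    n_pick = min(n_minority, int(np.count_nonzero(p)))
    return rng.choice(len(delta), size=n_pick, replace=False, p=p)

rng = np.random.default_rng(0)
delta = [1.0, 9.0, 1.0, 0.0]                    # the last example is noise
picked = sample_majority(delta, n_minority=2, rng=rng)
print(picked)
```

Because the noisy example has zero probability, it cannot appear in any re-sampled majority class, while different boosting rounds may draw different subsets.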

The details of the DBUBoost algorithm are as follows. Let $x_{i}$ be the feature vector of an example in dataset $D$ and $y_{i}$ be its class label. Let $T$ be the number of iterations and $h_{t}$ be the weak hypothesis trained in the $t$ -th iteration. Let $W_{t}(i)$ be the weight of example $i$ in the $t$ -th iteration.

In step 1, the weight of each example in the training dataset $D$ is initialized to 1/ $N$ . In step 2, $T$ weak hypotheses are trained. In step 2a, DBU is used to obtain a new balanced training dataset $S^{\prime}_{t}$ with a weight distribution $W^{\prime}_{t}$ . In step 2b, $S^{\prime}_{t}$ and $W^{\prime}_{t}$ are used to train the base classifier, WeakLearner, which returns a weak hypothesis $h_{t}$ (step 2c). In steps 2d–2e, the pseudo-loss $\varepsilon_{t}$ and the weight update parameter $\alpha_{t}$ are calculated based on the original training dataset $D$ and its weight distribution $W_{t}$ . In steps 2f–2g, the weight distribution of the next iteration ( $W_{t+1}$ ) is updated and then normalized. In step 3, a final hypothesis $H(x)$ is obtained by combining the $T$ weak hypotheses.
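For the two-class case, each example has exactly one incorrect label, so steps 2d–2g reduce to a few array operations. The sketch below follows the formulas as printed in the pseudo-code and is only an illustration: `m2_round_update`, `h_true` (the confidence $h_{t}(x_{i},y_{i})$ in the correct label) and `h_false` (the confidence in the incorrect label) are hypothetical names, and the toy values are chosen so that $\alpha_{t}<1$.

```python
import numpy as np

def m2_round_update(W, h_true, h_false):
    """One weight update of steps 2d-2g for a two-class problem."""
    W = np.asarray(W, dtype=float)
    h_true = np.asarray(h_true, dtype=float)
    h_false = np.asarray(h_false, dtype=float)
    eps = np.sum(W * (1.0 - h_true + h_false))                # step 2d: pseudo-loss
    alpha = eps / (1.0 - eps)                                 # step 2e
    W_next = W * alpha ** (0.5 * (1.0 + h_true - h_false))    # step 2f
    return eps, alpha, W_next / W_next.sum()                  # step 2g: normalise

W = np.full(4, 0.25)                        # uniform initial weights, 1/N
h_true = np.array([0.95, 0.9, 0.9, 0.6])    # the last example is learned poorly
h_false = 1.0 - h_true
eps, alpha, W_next = m2_round_update(W, h_true, h_false)
print(eps, alpha, W_next)
```

With $\alpha_{t}<1$ , well-classified examples are multiplied by a smaller factor, so after normalization the poorly learned example carries the largest weight in the next round.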

4. Experiments

4.1 Experimental setup

In this paper, we select 9 two-class imbalanced datasets from the KEEL dataset repository (http://www.keel.es/) to conduct the experiments. Table 1 shows the characteristics of these datasets, where Num is the number of examples, Att is the number of attributes and $\textit{IR}$ is the imbalance ratio. However, the 9 datasets do not contain noisy examples. To evaluate DBUBoost on noisy datasets, we need to construct the corresponding noisy datasets from these 9 datasets. The details of the construction process are described in Section 4.3. Furthermore, we do not treat missing values separately, since the HVDM distance [12] used in our algorithm handles them automatically.

Table 1
Datasets for experiment

Datasets	Num	Att	$\textit{IR}$	$K_{ma}/K_{mi}$
pimaimb	768	8	1.87	20/20
glass016vs2	184	9	10.29	30/10
glass2	214	9	11.59	28/10
ecoli4	336	7	15.8	15/5
car-good	1728	6	24.04	20/10
yeast1289vs7	947	8	30.57	30/10
wine-red3vs5	691	11	68.1	12/5
poker89vs5	2075	10	82	15/15
abalone19	4177	8	129.44	40/15

We select four ensemble methods (RUSBoost [5], EUSBoost [6], CBUBoost [7], RHSBoost [20]) and three anti-noise methods (DBSCAN-KNN [13], EE-KF [17], SMOTE-IPF [14]) for comparison. The experiments are organized in two parts. In the first part, the experiments are conducted on the 9 two-class datasets to test the performance of DBUBoost. In the second part, the experiments are conducted on their noise-modified versions to test the noise immunity of DBUBoost. All experiments use C4.5 as the base classifier, which has been widely used in imbalance classification. The parameters of C4.5 are set to the default values used in the Weka software package. In DBUBoost, the parameter $K$ is chosen based on its performance. Table 1 shows the best $K$ values for the different datasets, where $K_{ma}$ and $K_{mi}$ are the $K$ values for the majority and minority class respectively. In DBSCAN-KNN: $p=$ 0.8, $\textit{Minpts}=$ 12, and KNN parameter $k=$ 9. The parameter $K$ in EE-KF is the same as $K_{mi}$ in Table 1. Since experiments with more iterations did not produce significantly superior results, ten iterations were used in all ensemble algorithms. Furthermore, a 5-fold cross-validation strategy is used to evaluate the performance of all methods.

Average accuracy is the most popular criterion for evaluating classifiers. However, a classifier can reach a high average accuracy while misclassifying most minority class examples, so this criterion has been shown to be unsuitable for imbalance classification. Consequently, other evaluation criteria have been proposed, including F-measure, ROC and AUC. In this paper, all experiments use AUC as the evaluation criterion [16].
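For reference, AUC can be computed as the fraction of (minority, majority) pairs that a classifier ranks correctly, with ties counted as one half. The sketch below is a generic illustration of that rank-based definition, not the evaluation code used in the experiments; it assumes the minority class is the positive class (label 1), and the function name and toy scores are made up for the example.

```python
def auc(scores, labels):
    """Rank-based AUC: probability that a positive example scores above a negative one."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    # Count correctly ordered pairs; a tie contributes one half.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

scores = [0.9, 0.4, 0.35, 0.8, 0.1]   # classifier confidence in the positive class
labels = [1, 1, 0, 0, 0]              # two minority and three majority examples
value = auc(scores, labels)
print(value)
```

Unlike accuracy, this quantity does not change when the majority class is enlarged with correctly ranked examples at the same rate, which is why it is preferred for imbalanced data.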

4.2 Experiments on KEEL datasets

In the first experiment, we compare DBUBoost with four ensemble algorithms (RUSBoost, EUSBoost, CBUBoost, RHSBoost) and three anti-noise algorithms (DBSCAN-KNN, EE-KF and SMOTE-IPF). Table 2 shows their AUC results, where Avg. is the average AUC over the 9 datasets. Table 3 shows the Wilcoxon’s rank-sum test results for the comparison between DBUBoost and each of the other algorithms, where $R^{+}$ and $R^{-}$ are the sums of ranks and $P_{\textit{Wilcoxon}}$ is the $p$ -value of the test. If $P_{\textit{Wilcoxon}}<$ 0.05, the difference between the two methods is considered significant. From Table 2, we can observe that DBUBoost achieves the highest average AUC, although it is not the best method on every dataset (e.g. glass016vs2, car-good and poker89vs5). From the test results in Table 3, where all $p$ -values are below 0.05, we conclude that DBUBoost achieves significantly better results than the other algorithms.

Table 2
AUC results on KEEL datasets (%)

Datasets	RUSBoost	EUSBoost	CBUBoost	RHSBoost	DBSCAN-KNN	EE-KF	SMOTE-IPF	DBUBoost
pima	73.95	73.13	75.34	72.21	73.57	74.81	73.87	76.56
glass016vs2	67.93	78.5	79.05	77.54	70.95	71.19	66.57	76.17
glass2	69.54	73.12	63.56	72.32	74.04	73.74	68.54	82.49
ecoli4	90.68	88.87	91.05	90.05	89.77	81.74	89.77	93.1
car-good	93.01	95.37	88.87	94.51	95.76	93.59	90.7	93.34
yeast1289vs7	69.07	70.15	69.24	70.2	69.44	68.51	67.83	72.56
wine-red3vs5	73.69	74.76	74.44	76.56	75.83	75.31	74.82	79.57
poker89vs5	69.12	63.12	66.54	68.43	67.38	68.52	70.11	69.54
abalone19	66.29	68.53	72.8	68.75	69.16	66.85	61.54	73.33
Avg.	74.81	76.17	75.65	76.73	76.21	74.92	73.75	79.63

Table 3

Wilcoxon rank-sum test results for the comparison between our DBUBoost and other methods

Methods	$R^{+}$	$R^{-}$	$P_{\textit{Wilcoxon}}$
DBUBoost vs. RUSBoost	45	0	0.0039
DBUBoost vs. EUSBoost	42	3	0.0195
DBUBoost vs. CBUBoost	41	4	0.0273
DBUBoost vs. RHSBoost	40	5	0.0391
DBUBoost vs. DBSCAN-KNN	43	2	0.0117
DBUBoost vs. EE-KF	44	1	0.0078
DBUBoost vs. SMOTE-IPF	44	1	0.0078
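In every row of Table 3 the rank sums add up to 45 = 9·10/2, which is what ranking the nine paired AUC differences by absolute value produces. Under that assumption (paired differences, signed ranks, an exact two-sided $p$ -value obtained by enumerating all sign patterns, no ties or zero differences), the first two rows can be recomputed from Table 2 with the sketch below. The function names are illustrative and this is not the authors' test code.

```python
from itertools import product

def signed_rank_sums(a, b):
    """Return (R+, R-) for the paired differences a - b, ranked by absolute value."""
    diffs = [x - y for x, y in zip(a, b)]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    r_plus = sum(rank for rank, i in enumerate(order, start=1) if diffs[i] > 0)
    r_minus = sum(rank for rank, i in enumerate(order, start=1) if diffs[i] < 0)
    return r_plus, r_minus

def exact_two_sided_p(r_plus, r_minus, n):
    """Exact two-sided p-value: enumerate all 2**n assignments of signs to ranks 1..n."""
    w = min(r_plus, r_minus)
    count = sum(
        1
        for signs in product((0, 1), repeat=n)
        if sum(rank for rank, s in enumerate(signs, start=1) if s) <= w
    )
    return min(1.0, 2 * count / 2 ** n)

# AUC columns copied from Table 2 (nine datasets, in table order).
dbu = [76.56, 76.17, 82.49, 93.1, 93.34, 72.56, 79.57, 69.54, 73.33]
rus = [73.95, 67.93, 69.54, 90.68, 93.01, 69.07, 73.69, 69.12, 66.29]
eus = [73.13, 78.5, 73.12, 88.87, 95.37, 70.15, 74.76, 63.12, 68.53]

for name, other in (("RUSBoost", rus), ("EUSBoost", eus)):
    r_plus, r_minus = signed_rank_sums(dbu, other)
    print(name, r_plus, r_minus, round(exact_two_sided_p(r_plus, r_minus, len(dbu)), 4))
```

With these inputs the sketch gives $R^{+}=45$ , $R^{-}=0$ for RUSBoost and $R^{+}=42$ , $R^{-}=3$ for EUSBoost, with $p$ -values that round to 0.0039 and 0.0195, matching the corresponding rows of Table 3.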

4.3 Experiments on noise-modified KEEL datasets

In the second experiment, we corrupt the class labels and attribute values of some examples in the 9 KEEL datasets by introducing class noise and attribute noise separately, and then conduct experiments on these noise-modified datasets to verify the noise immunity of these algorithms. Given a noise level $x$ , we introduce noise according to the following schemes [10]:

Table 4
AUC results on noise-modified KEEL datasets (%)

Datasets	RUSBoost	EUSBoost	CBUBoost	RHSBoost	DBSCAN-KNN	EE-KF	SMOTE-IPF	DBUBoost
Class noise $x=$ 20
pima	71.83	69.87	69.55	70.54	72.54	68.51	68.47	74.58
glass016vs2	63.38	64.14	61.86	63.88	66.75	63.74	62.55	70.15
glass2	73.04	72.15	71.9	72.02	75.49	69.82	67.87	77.32
ecoli4	82.75	79.26	80.59	83.1	87.39	81.75	82.62	85.56
car-good	93.1	89.81	57.56	89.71	88.75	90.62	89.01	92.75
yeast1289vs7	64.37	66.29	62.64	67.25	63.44	63.17	62.08	69.74
wine-red3vs5	70.67	71.37	74.72	72.36	73.87	75.78	71.24	76.38
poker89vs5	60.44	58.17	62.56	62.45	61.54	61.83	59.16	64.12
abalone19	61.77	62.07	66.35	63.8	67.89	60.92	58.71	69.8
Avg.	71.26	70.35	67.53	71.68	73.07	70.68	69.08	75.6
Class noise $x=$ 40
pima	66.68	67.7	65.75	67.8	70.13	67.35	64.73	72.87
glass016vs2	61.05	63.6	56.05	60.01	59.01	58.18	60.02	66.14
glass2	58.29	57.68	56.53	57.72	55.21	60.2	56.82	62.47
ecoli4	75.35	76.24	77.38	76.2	79.01	78.43	75.61	78.76
car-good	89.57	88.56	50.0	87.92	87.0	89.35	87.3	90.01
yeast1289vs7	58.06	60.54	58.64	61.57	59.49	61.53	58.21	65.69
wine-red3vs5	65.72	67.62	68.33	69.63	63.04	68.75	64.82	71.1
poker89vs5	57.98	56.42	60.15	58.4	55.73	60.25	55.07	60.31
abalone19	59.88	62.35	65.52	60.73	66.15	65.84	57.17	64.17
Avg.	65.84	66.75	62.04	66.66	66.09	67.76	64.42	70.17
Attribute noise $x=$ 20
pima	71.53	72.51	71.2	72.04	72.87	70.79	69.5	73.59
glass016vs2	64.24	65.38	56.79	64.89	60.4	65.8	66.81	68.54
glass2	65.19	67.6	66.55	64.3	66.6	68.13	66.71	66.47
ecoli4	88.38	87.02	87.59	88.85	88.34	87.32	86.4	89.03
car-good	91.26	85.71	69.02	88.64	89.13	88.71	84.15	91.84
yeast1289vs7	59.3	60.43	56.80	61.37	70.95	62.63	58.92	69.03
wine-red3vs5	70.02	71.41	70.25	68.79	73.25	72.81	69.52	75.65
poker89vs5	58.61	56.43	60.24	59.03	59.61	59.8	56.14	62.54
abalone19	64.33	65.91	59.44	66.9	63.7	66.51	58.26	67.5
Avg.	70.32	70.27	66.43	70.53	71.65	71.39	68.49	73.8
Attribute noise $x=$ 40
pima	69.99	70.78	65.0	68.09	68.48	69.53	68.09	72.29
glass016vs2	60.12	63.66	53.57	63.29	57.61	63.83	61.74	65.47
glass2	63.79	62.42	52.75	61.8	62.69	64.32	58.77	64.17
ecoli4	84.21	85.33	84.58	86.51	88.4	86.4	82.01	87.84
car-good	89.42	84.61	50.0	88.07	89.78	88.15	81.43	90.7
yeast1289vs7	56.81	57.11	54.69	58.34	65.83	58.51	54.89	65.77
wine-red3vs5	63.11	62.54	60.94	64.9	61.93	62.72	64.27	65.17
poker89vs5	54.29	51.87	54.22	55.52	50.34	52.87	54.27	56.71
abalone19	62.75	61.74	52.09	62.29	62.54	59.25	53.14	64.41
Avg.	67.17	66.67	58.65	67.65	67.51	67.29	64.29	70.28

Table 5

Wilcoxon rank-sum test results on class noise-modified datasets

Methods	$R^{+}$	$R^{-}$	$P_{\textit{Wilcoxon}}$	$R^{+}$	$R^{-}$	$P_{\textit{Wilcoxon}}$
	Class noise $x=$ 20			Class noise $x=$ 40
DBUBoost vs. RUSBoost	44	1	0.0078	45	0	0.0039
DBUBoost vs. EUSBoost	45	0	0.0039	45	0	0.0039
DBUBoost vs. CBUBoost	45	0	0.0039	43	2	0.0117
DBUBoost vs. RHSBoost	45	0	0.0039	45	0	0.0039
DBUBoost vs. DBSCAN-KNN	43	2	0.0117	42	3	0.0195
DBUBoost vs. EE-KF	45	0	0.0039	41	4	0.0273
DBUBoost vs. SMOTE-IPF	45	0	0.0039	45	0	0.0039

Table 6

Wilcoxon rank-sum test results on attribute noise-modified datasets

Methods	$R^{+}$	$R^{-}$	$P_{\textit{Wilcoxon}}$	$R^{+}$	$R^{-}$	$P_{\textit{Wilcoxon}}$
	Attribute noise $x=$ 20			Attribute noise $x=$ 40
DBUBoost vs. RUSBoost	45	0	0.0039	45	0	0.0039
DBUBoost vs. EUSBoost	43	2	0.0117	45	0	0.0039
DBUBoost vs. CBUBoost	44	1	0.0078	45	0	0.0039
DBUBoost vs. RHSBoost	45	0	0.0039	45	0	0.0039
DBUBoost vs. DBSCAN-KNN	40	5	0.0391	42	3	0.0195
DBUBoost vs. EE-KF	43	2	0.0117	44	1	0.0078
DBUBoost vs. SMOTE-IPF	44	1	0.0078	45	0	0.0039

For the class noise, we randomly choose ( $x\times N_{-}$ )/100 pairs of examples from the majority and minority class respectively, then incorrectly label them.

For the attribute noise, we randomly choose [ $x\times(N_{-}+N_{+})$ ]/100 examples from the dataset and then corrupt each attribute $A_{i}$ with a random value between the minimum and maximum of domain (if $A_{i}$ is numerical), or a random value of the domain (if $A_{i}$ is nominal).
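The two noise schemes can be sketched as follows. This is one possible reading of the description, not the authors' code: class noise is interpreted as swapping the labels of $(x\times N_{-})/100$ examples drawn from each class, attribute noise replaces every attribute of the chosen examples, only numerical attributes are handled, and fractional counts are truncated. The function names and the label encoding (1 for minority, 0 for majority) are assumptions.

```python
import numpy as np

def add_class_noise(y, x, rng, minority=1, majority=0):
    """Mislabel (x * N_minority) / 100 examples from each class."""
    y = np.asarray(y).copy()
    n_flip = int(x * (y == minority).sum() / 100)
    min_idx = rng.choice(np.flatnonzero(y == minority), size=n_flip, replace=False)
    maj_idx = rng.choice(np.flatnonzero(y == majority), size=n_flip, replace=False)
    y[min_idx] = majority
    y[maj_idx] = minority
    return y

def add_attribute_noise(X, x, rng):
    """Corrupt x% of the examples with uniform values inside each attribute's range."""
    X = np.asarray(X, dtype=float).copy()
    n_corrupt = int(x * len(X) / 100)
    rows = rng.choice(len(X), size=n_corrupt, replace=False)
    lo, hi = X.min(axis=0), X.max(axis=0)          # attribute domains before corruption
    X[rows] = rng.uniform(lo, hi, size=(n_corrupt, X.shape[1]))
    return X

rng = np.random.default_rng(0)
y = np.array([1] * 10 + [0] * 40)                  # 10 minority, 40 majority examples
X = rng.normal(size=(50, 3))
y_noisy = add_class_noise(y, x=20, rng=rng)
X_noisy = add_attribute_noise(X, x=20, rng=rng)
print(int((y_noisy != y).sum()), int((X_noisy != X).any(axis=1).sum()))
```

Under this reading the class sizes stay the same after class noise is added, because the same number of labels is flipped in each direction.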

In this study, we introduce two different levels of noise ( $x=$ 20, 40) into the datasets to evaluate our algorithm. Table 4 shows the AUC results on these noise-modified datasets, and Tables 5 and 6 show the Wilcoxon’s rank-sum test results. From Table 4, we can observe that the average AUC of every method decreases as the noise level increases. DBUBoost achieves the best average AUC for both types of noise (class and attribute noise) and both noise levels ( $x=$ 20 and $x=$ 40). From Tables 5 and 6, we can observe that the $p$ -values in all these scenarios are lower than 0.05, which means that the differences in these comparisons are significant.

5. Conclusions

Imbalance classification is an important task in many domains. Many algorithms have been proposed to solve the imbalance problem. However, existing algorithms often lose useful examples or ignore the noisy examples in the dataset. In this paper, a new anti-noise ensemble algorithm is proposed to retain the useful examples and remove the noisy examples. The experimental results on 9 KEEL datasets show that DBUBoost achieves the best average performance. We attribute this to the fact that DBUBoost reflects the real distribution of the data, i.e. the re-sampled datasets retain the useful information of the original datasets. Furthermore, the experimental results on noise-modified datasets show that DBUBoost is the most stable algorithm, because it can remove the noisy examples in both the majority and the minority class. Thus, the proposed algorithm is suitable for classification on imbalanced datasets, especially those containing noise.

Acknowledgments

This work was supported by the National Natural Science Foundation of China [Grant 51275431] and the Sichuan Province Science and Technology Support Program Project [Grant 2016GZ0194, Grant 2018GZ0361].
