Critical Instances Removal based Under-Sampling (CIRUS): A solution for class imbalance problem

Abstract

Class imbalance is one of the most critical issues in real-world applications. Imbalanced datasets are common across domains including banking, health care and finance. When typical classification algorithms are trained on such datasets, they tend to be biased towards the majority class. The learning task becomes more challenging when instances from different classes also overlap. In this paper, we propose an undersampling framework for binary classification datasets, called Critical Instances Removal based Under-Sampling (CIRUS), which removes overlapped data points. Our method is designed to identify and eliminate majority class instances from the overlapping region. Accurate identification and elimination of these instances maximises the visibility of the minority class instances and at the same time minimises excessive elimination of data, which reduces loss of information. Extensive experiments using simulated and real-world datasets were carried out, and the results show performance comparable with state-of-the-art methods across common metrics, with exceptional and statistically significant improvements in sensitivity.

Keywords

Imbalanced dataset, undersampling, k-NN, class overlap, classification

1. Introduction

In machine learning, classification algorithms learn from previously known information to predict unknown events. However, most datasets from real-world domains contain noisy instances [24]. Typical examples include finance, fraud detection, medical diagnosis, customer churn prediction and many more [10]. Training on these samples degrades classification performance dramatically: the classifier shows bias towards the over-represented class, called the majority class, and ignores the under-represented class, called the minority class. Moreover, imbalance occurs in binary classification problems, and most of the time the minority samples are of great importance. This problem has been addressed by the machine learning research community over the past decades. The proposed solutions are broadly classified into data-level and algorithm-level techniques [21, 38, 2, 15].

Data-level techniques consist of sampling methods that adjust the class distribution, while algorithm-level techniques involve modifying existing algorithms or creating new ones. Algorithm-level techniques need a deep understanding of the algorithms and are complicated to implement, whereas data-level techniques are simple and concentrate on the resampling process, so they can be applied with any classification algorithm. The most popular and commonly used resampling methods include random under-sampling, random oversampling and the Synthetic Minority Oversampling TEchnique (SMOTE) [7]. Recently proposed resampling techniques include k-means clustering [12], density-based clustering [4, 6], and ensembles [40]. These techniques are meant to balance the data distribution before classification. However, a number of authors have argued that classifier performance is affected not only by unequal class distribution but also by other factors such as class overlap, small disjuncts and small sample size.

Figure 1. Overlapping data regions.

Figure 2. After removal of overlapping data.

Consider Fig. 1, which shows a class distribution in which the data overlap between the classes; such data may be difficult for a classification algorithm to train on. In real-world applications, datasets are usually not only imbalanced but also overlapped. Therefore, removing majority samples from the overlapped region, as shown in Fig. 2, is a rational approach to improving classification performance. In this work, we propose a nearest-neighbor-based undersampling approach for finding and eliminating the negative/majority samples from the overlapped region. By using this approach, we assume that most of the majority samples from the overlapped region are eliminated from the dataset. The advantages of our approach are twofold: first, the visibility of positive/minority samples increases in the complete dataset; second, overlapped majority instances are identified more specifically using k-nearest neighbors, which avoids unnecessary information loss. The main contribution of this paper is a framework for handling overlapping data at the decision boundary of a skewed data distribution. Extensive experiments were carried out on highly imbalanced and overlapped datasets.

1.1 Imbalanced data classification problem: An overview

A dataset is said to have a skewed distribution when the samples of one class far outnumber those of the other classes. Moreover, the class with the smaller number of samples is usually the class of interest from the learning point of view [8]. This problem is of great interest in many real-world applications, such as telecommunication customer churn [13], oil spill detection in satellite radar images [26], fraudulent telephone call detection [14], and especially medical diagnosis [30, 16].

Traditional classifiers trained on such datasets are biased towards the classes with the larger number of instances (i.e., the majority class). In turn, minority class samples are usually ignored because they are treated as noise. As a result, minority class samples are most often misclassified even though they are important for classification. The learning task is hindered not only by the skewed data distribution but also by a series of related issues such as small sample size, overlap between classes and small disjuncts. In Fig. 3, we illustrate examples of these three kinds of imbalanced class distribution.

Figure 3. Example of class distribution for two-class imbalance problems [38].

Small sample size: This refers to the problem where not all classes of a given dataset are represented equally. A high imbalance ratio may lead to poor classification, with the under-represented class being ignored entirely [10].

Class overlapping: In the presence of overlap between the classes, the classifier tends to misclassify the minority instances [35]. Hence, the combination of class overlap with a high imbalance ratio generally results in a high misclassification rate for the minority class samples.

Small disjuncts: Small disjuncts occur in a dataset when the classes are made up of smaller sub-concepts. Their existence also increases the complexity of the problem, because each sub-concept covers only a small fraction of the data instances, and these are usually not balanced.

The rest of this paper is organized as follows. In Section 2 we review the related work. Section 3 discusses the various challenges in handling skewed data distributions. Section 4 presents the evaluation metrics for the class imbalance problem domain. The proposed method is discussed in detail in Section 5. Section 6 discusses the experimental setup and results. Finally, Section 7 presents the conclusion and discusses future scope.

2. Literature survey

The most popular and common approach for balancing skewed data is to use data-level techniques. Data-level solutions apply re-sampling, either oversampling the minority class instances or under-sampling the majority class instances. At the algorithm level, a new algorithm or a modification of an existing algorithm is proposed to handle the class imbalance problem; in data-level techniques, by contrast, the learning algorithm is not changed. Ensemble-based methods combine data-level techniques with algorithm-level methods to handle imbalanced datasets. As the scope of this paper is data-level techniques, readers are referred to [29, 34, 19, 20, 36, 18] for detailed reviews of algorithm-level and ensemble techniques. The class imbalance problem has attracted the research community, and various data-level solutions have been proposed in the literature. However, if the imbalanced dataset is linearly separable or sufficiently large, the degree of imbalance does not affect the results. Recently, a few research studies showed that class overlap has a higher impact on classifier performance than skewed data distribution.

Thus, we broadly discuss the existing solutions for balancing the class distribution and for handling class overlap. The most popular and widely used data-level technique is random resampling, which either undersamples the majority class instances or oversamples the minority class instances. The main drawbacks of these two methods are that undersampling may lead to loss of important information, while oversampling may lead to overfitting. As an alternative to random sampling, SMOTE was introduced. This technique synthesizes minority samples by linear interpolation between nearest neighbors. Various well-known extensions have been proposed, such as Borderline-SMOTE [22], SMOTE-IPF [39, 37], Safe-level-SMOTE [5] and DBSMOTE [6]. Other recent methods based on clustering [41, 33] and deep neural networks [25] have also been proposed.
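The interpolation idea behind SMOTE can be illustrated with a short sketch. This is a simplified illustration written for this discussion, not the reference implementation of [7]; the function name `smote_like_samples` and its parameters `k`, `n_new` and `seed` are assumptions made for the example.

```python
import math
import random

def smote_like_samples(minority, k=2, n_new=4, seed=0):
    """Create synthetic minority points by linear interpolation between a
    minority point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point, excluding itself
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic

minority_points = [(1.0, 1.0), (1.5, 1.2), (2.0, 0.8), (1.2, 1.9)]
new_points = smote_like_samples(minority_points)
```

Because every synthetic point lies on a segment between two existing minority points, the new samples stay inside the region already occupied by the minority class, which is also why overlapped regions are not cleaned by this step.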

As this paper deals with data-level techniques, a brief introduction to them is given here. In the data-level approach, the sample dataset is modified to balance the class distribution. The foremost aim is to equalize the class distribution of the dataset using sampling methods such as over-sampling, under-sampling or a combination of both. Oversampling and under-sampling are the two popular techniques in sampling-based classification for addressing imbalanced datasets. In oversampling, samples are added to the minority class to balance the dataset when very little information is available for the minority class. In under-sampling, some samples of the majority class are eliminated to balance the dataset. Hybrid techniques usually combine both over- and under-sampling methods. Figure 4 presents the different approaches applied at the data level to address the class imbalance problem.
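The two basic resampling strategies described above can be sketched in a few lines. The helper names `random_undersample` and `random_oversample` are hypothetical and the toy data are made up for illustration; the sketch assumes the majority class is the larger of the two.

```python
import random

def random_undersample(majority, minority, seed=0):
    """Randomly discard majority samples until both classes have equal size."""
    rng = random.Random(seed)
    return rng.sample(majority, len(minority)), list(minority)

def random_oversample(majority, minority, seed=0):
    """Randomly duplicate minority samples until both classes have equal size."""
    rng = random.Random(seed)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return list(majority), list(minority) + extra

majority = [(float(i), 0.0) for i in range(20)]  # toy majority class
minority = [(float(i), 1.0) for i in range(4)]   # toy minority class

under_maj, under_min = random_undersample(majority, minority)
over_maj, over_min = random_oversample(majority, minority)
```

Undersampling here discards 16 of the 20 majority samples, while oversampling adds 16 duplicated minority samples, which illustrates the information-loss and overfitting risks mentioned in the text.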

Figure 4. Different data-level techniques proposed for handling class imbalance problem [38].

These methods balance the class distribution based on the original data. A common drawback, however, is that they are affected by the degree of imbalance. If the class imbalance is high, a drastic loss of information may occur, and if the imbalance is low, overfitting may result. Since this paper concentrates on class overlapping regions, we reviewed the literature on that topic. Class overlap concerns samples near the borderline and can extend far from the class boundaries. Few existing studies offer solutions to the class overlap problem.

In [42], the authors proposed an oversampling-based undersampling technique that removes negative instances from the overlapping region. They stated that the proposed method provides significant improvements over state-of-the-art class distribution methods. In [4], the authors proposed DBMUTE, which uses density-based clustering to identify and remove the majority instances from the overlapping boundaries. Another well-established method, the ADAptive SYNthetic sampling approach (ADASYN) [23], generates more synthetic samples for minority instances that have majority instances as their neighbours. Results showed better sensitivity compared with other state-of-the-art methods. However, this method does not guarantee the visibility of the minority class, because majority instances may still be present in the overlapping areas. Another method, Edited Nearest Neighbour (ENN) [9], focuses on boundary instances. It uses k nearest neighbors (k $=$ 3) to remove majority class samples that lie within the boundary of the other class. The authors stated that the value of k has a significant impact on performance. An extension of ENN, the Neighbourhood CLeaning rule (NCL) [27], considers both majority and minority k-nearest neighbours when discarding majority samples, and the results show better performance than ENN. Later, combinations of data cleaning and resampling were proposed, such as SMOTE-IPF [39], in which noisy instances are removed before new samples are generated for the minority class.
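To make the ENN-style cleaning step concrete, the following is a minimal sketch under stated assumptions: neighbours are found by Euclidean distance over the whole dataset, only majority samples are candidates for removal, and a majority sample is dropped when fewer than half of its k neighbours share its label. The function name `enn_undersample` and the toy data are hypothetical and are not taken from [9].

```python
import math

def enn_undersample(points, labels, k=3, majority_label=0):
    """Drop majority samples whose k nearest neighbours mostly belong to
    the other class; minority samples are always kept."""
    keep = []
    for i, (p, y) in enumerate(zip(points, labels)):
        if y != majority_label:
            keep.append(i)
            continue
        # indices of the k nearest other samples
        nearest = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )[:k]
        agree = sum(labels[j] == y for j in nearest)
        if agree * 2 > k:  # most neighbours share the majority label
            keep.append(i)
    return [points[i] for i in keep], [labels[i] for i in keep]

points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5), (5, 6), (6, 5), (6, 6)]
labels = [0, 0, 0, 0, 0, 1, 1, 1]  # 0 = majority, 1 = minority
clean_points, clean_labels = enn_undersample(points, labels)
```

In this toy dataset the majority point at (5, 5) sits among the three minority points, so all three of its nearest neighbours disagree with its label and it is removed; the four majority points near the origin are kept.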

In [22], the authors proposed BorderLine-SMOTE (BLSMOTE), which oversamples the minority samples near the borderline. The authors stated that their method performs better in terms of F-Measure than existing methods. Redundancy-driven modified Tomek-link based undersampling [11] detects outlier, redundant and noisy instances, which contribute least to estimating accurate class labels. Evolutionary undersampling [17] has also been proposed. The Majority Weighted Minority Oversampling TEchnique (MWMOTE) [3] works by identifying the minority class instances in boundary regions and assigning them weights based on their distance from majority class samples. It then forms clusters of these minority samples for generating synthetic data. Adaptive Semi-Unsupervised Weighted Over-sampling (A-SUWO) [32] considers minority samples close to the boundary region and marks them as hard-to-learn samples. Those samples are not involved in generating new samples. In this section, we discussed the different data-level techniques proposed in the literature for solving the class imbalance problem. The next section discusses the challenges in handling imbalanced datasets and skewed data distributions.

3. Challenges in handling skewed data distribution

Napierala and Stefanowski [31] proposed a method to analyze minority samples by assigning them to predefined categories such as safe, borderline, rare and outlier. Such methods help in understanding the difficulties present in the data. Some open challenges are listed below:

a) As a future direction, it is important to propose new classification algorithms that account for the different difficulties in the data. In addition, when designing the classifier, attention should be paid to individual minority samples. Another important issue is extreme class imbalance.

Extremely imbalanced datasets exist in many real-world problems, such as fraud detection with an Imbalance Ratio (IR) of approximately 1:3000. This poses a great challenge for classification algorithms trained on such datasets. A third challenge is inefficient feature extraction for some problems, such as protein data and online transaction data. It is important for the classifier to be able to train on such high-dimensional and sparse feature sets.
b) Another way to solve the class imbalance problem is to modify the learning algorithm. However, a major drawback is that such learning models give too much importance to minority samples, thus increasing majority class misclassification. A technique is needed that selects only uncertain samples and adjusts the output accordingly.
c) Recently, ensemble learning has become one of the most popular techniques for handling class imbalance. Algorithms such as Bagging, Boosting, Stacking and Random Forests are robust in handling data difficulties. Ensemble learning combined with sampling techniques provides better performance on skewed data distributions. The main drawback concerns diversity among the majority and minority classes. There is also no clear indication of how large the ensemble should be, as its size is selected arbitrarily.
d) Another problem is multi-class imbalanced classification. Multi-class imbalance occurs when there are more than two classes, with one majority class and multiple minority classes. Deeper insight into handling multi-class imbalanced problems is needed.
e) Data pre-processing techniques are highly important in balancing imbalanced datasets, as they are independent of the classifier. The difficulties that may appear in the data are class overlap, noise and small disjuncts. Therefore, efficient data cleaning and sampling techniques are needed to balance the data. For multi-class imbalance problems, efficient sampling techniques still need to be proposed.
f) Multi-class imbalance learning needs special care when applying sampling techniques. Researchers should focus on developing algorithms that are robust in handling such skewed distributions.
g) In ensemble learning algorithms such as bagging and boosting, there may be different levels of uncertainty when sampling the data into bags. There is a high probability that a single bag consists of samples from the same class. Probability distributions such as the normal or binomial distribution can be used to check the class balance in each bag.

However, many difficulties may arise from the data distribution in each bag, and each bag may contain a certain amount of noise, which makes the classifier perform poorly. Efficient techniques are therefore needed for choosing the size of the bags and the distribution of samples into each bag.
h) Another important and popular challenge in class imbalance is learning from continuous data. The process of learning from continuous data is called data streaming. Work on active learning algorithms that address data streams is still in its infancy. The general open issue is how to sample streaming data and classify it.
i) Lastly, the extraction of efficient features and instances is also of major concern. Real-world data such as bank data or genomic data essentially have high-dimensional and sparse features. New approaches for high-dimensional data are much needed, allowing at the same time efficient processing and better discrimination of the minority class. Another interesting direction is to investigate the possibilities of using decomposition-based solutions.

In this section, we briefly discussed the challenges in handling imbalanced datasets and skewed data distributions. The next section presents the metrics used to evaluate a classifier trained on an imbalanced dataset.

4. Evaluation metrics in skewed data distribution domain

Most studies in the skewed data distribution domain concentrate on binary classification problems. By convention, the positive class label denotes the minority class and the negative class label denotes the majority class. Table 1 illustrates the confusion matrix of a binary-class problem. TP and TN denote the number of positive and negative examples that are correctly classified, while FN and FP denote the number of misclassified positive and negative examples respectively.

Table 1
Confusion matrix

	Positive prediction	Negative prediction
Positive class	TP	FN
Negative class	FP	TN
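The four cells of Table 1 can be counted directly from true and predicted labels. The sketch below is illustrative only; the function name `confusion_counts` and the toy label vectors are assumptions, with label 1 standing for the positive (minority) class.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FN, FP and TN, treating `positive` as the minority label."""
    tp = fn = fp = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1  # positive correctly detected
            else:
                fn += 1  # positive missed
        else:
            if pred == positive:
                fp += 1  # negative flagged as positive
            else:
                tn += 1  # negative correctly rejected
    return tp, fn, fp, tn

y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 1, 0]
tp, fn, fp, tn = confusion_counts(y_true, y_pred)
```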

Accuracy is a well-known performance metric used in classification. It is defined as the ratio of correctly classified samples to the total number of samples, as shown in Eq. (1). On imbalanced datasets, accuracy is biased towards the majority class and can lead to wrong decisions. Therefore, different performance metrics are needed to assess a classifier trained on imbalanced datasets. Suitable metrics include precision, recall and Area Under Curve (AUC).

Precision is the proportion of true positives to the total number of true positives and false positives, as shown in Eq. (2). Recall/sensitivity/True Positive Rate (TPR) represents how well the model detects the true positives, as shown in Eq. (3). The F-Score/F-measure combines both recall and precision and is defined in Eq. (4). The F-measure is therefore more suitable than accuracy in imbalanced scenarios.

$\displaystyle\textit{Accuracy}=\left(TP+TN\right)/\left(TP+TN+FP+FN\right)$ (1)

$\displaystyle\textit{Precision}=\left(TP\right)/\left(TP+FP\right)$ (2)

$\displaystyle\textit{Recall}=\left(TP\right)/\left(TP+FN\right)$ (3)

$\displaystyle\textit{F-Score}=2\ast\left(\textit{Precision}\ast\textit{Recall}\right)/\left(\textit{Precision}+\textit{Recall}\right)$ (4)

In this paper, the various performance metrics used are AUC, F-Score and G-Mean.
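The following sketch computes the metrics of Eqs. (1)-(4) from confusion-matrix counts and contrasts two hypothetical classifiers on a dataset with 10 positive and 990 negative samples. The paper does not give a formula for G-Mean; the sketch assumes the commonly used definition, the square root of the product of sensitivity and specificity. The function name `imbalance_metrics` and the counts are made up for illustration.

```python
import math

def imbalance_metrics(tp, fn, fp, tn):
    """Accuracy, precision, recall and F-Score as in Eqs. (1)-(4), plus
    G-Mean, assumed here to be sqrt(sensitivity * specificity)."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    if precision + recall:
        f_score = 2 * (precision * recall) / (precision + recall)
    else:
        f_score = 0.0
    g_mean = math.sqrt(recall * specificity)
    return {"accuracy": accuracy, "precision": precision, "recall": recall,
            "f_score": f_score, "g_mean": g_mean}

# Classifier A labels every sample as negative.
all_negative = imbalance_metrics(tp=0, fn=10, fp=0, tn=990)
# Classifier B finds half of the positives at the cost of 10 false alarms.
half_found = imbalance_metrics(tp=5, fn=5, fp=10, tn=980)
```

Classifier A reaches an accuracy of 0.99 while its recall, F-Score and G-Mean are all zero; classifier B has a slightly lower accuracy of 0.985 but an F-Score of 0.4. This is the bias of accuracy towards the majority class described above.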

5. Proposed method

This section describes the proposed method in detail. In Fig. 1 we showed how class overlap makes classification difficult under a skewed data distribution. It affects the performance of a classifier trained on highly overlapped imbalanced datasets. To overcome this, we propose a framework that removes majority samples from the boundary regions and provides maximum visibility for minority samples. By using the proposed approach, we assume that most of the majority samples from the overlapped region are removed from the dataset. The advantages of our approach are twofold: first, the visibility of positive/minority samples increases in the complete dataset; second, overlapped majority instances are identified more specifically using k-nearest neighbors, which avoids unnecessary information loss.

The main contribution of this paper is a framework for handling overlapping data at the decision boundary of a skewed data distribution. This is implemented by removing the majority samples that are nearest to the minority class samples. The nearest neighbors of minority samples are computed using the k-nearest neighbor algorithm. The value of K is vital in identifying the samples to be discarded. Here, we empirically set the value of K by considering the imbalance ratio together with the size of the dataset. Eq. (5) shows the computation of K.

$\displaystyle K=\sqrt{N}\times\sqrt{\textit{ImbRatio}}$ (5)

where N is the number of samples in the dataset and ImbRatio is the imbalance ratio, i.e., the proportion of majority samples to minority samples. Unlike existing methods, the value of K is derived from the dataset and its imbalance ratio rather than assigned manually. In this paper, we propose a boundary-region-based undersampling method, described in Algorithm 1. The method differs from existing algorithms in how it identifies and eliminates majority samples that overlap with minority samples. Class imbalance is not a problem by itself; rather, it is the presence of overlapping classes, small disjuncts and small sample size that makes it difficult for the classifier to perform well.
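As a worked example of Eq. (5), the sketch below evaluates K for two datasets from Table 2. The paper does not state how a non-integer result is converted into a neighbour count, so rounding to the nearest integer (with a minimum of 1) is an assumption of this sketch, as is the function name `cirus_k`.

```python
import math

def cirus_k(n_samples, imbalance_ratio):
    """K = sqrt(N) * sqrt(ImbRatio) as in Eq. (5).
    Rounding to the nearest integer is an assumption, not stated in the paper."""
    return max(1, round(math.sqrt(n_samples) * math.sqrt(imbalance_ratio)))

k_pima = cirus_k(768, 1.87)       # Pima: N = 768, IR = 1.87
k_wisconsin = cirus_k(683, 1.86)  # Wisconsin: N = 683, IR = 1.86
```

Under this rounding assumption, Pima gives K = 38 (the unrounded value is about 37.9) and Wisconsin gives K = 36 (about 35.6).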

Algorithm 1: Critical Instances Removal based Under-Sampling (CIRUS)
Data: Training set T (N samples), K
Result: Undersampled dataset D$^{*}$ with overlapped majority samples removed
begin:
T: training data
T$_{\text{pos}}$: positive or minority instances
T$_{\text{neg}}$: negative or majority instances
For each instance in the minority class, compute its nearest neighbours based on K (as defined in Eq. (5)).
Remove from the majority class the majority instances that are nearest to most of the minority samples (the samples with more than 2 neighbours).
After removal of the overlapped samples, combine the T$_{\text{pos}}$ and T$_{\text{neg}}$ samples as the final undersampled dataset D$^{*}$.
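Algorithm 1 can be read in more than one way, so the following is only one possible interpretation, written as a minimal sketch. It assumes that neighbours are searched over the whole training set using Euclidean distance, and that a majority instance is removed when it appears among the K nearest neighbours of more than 2 minority instances. The function name `cirus_undersample`, the parameter `min_hits` and the toy data are hypothetical.

```python
import math
from collections import Counter

def cirus_undersample(t_pos, t_neg, k, min_hits=2):
    """Remove majority instances that are among the k nearest neighbours
    of more than `min_hits` minority instances (one reading of Algorithm 1)."""
    pool = [(p, "pos", i) for i, p in enumerate(t_pos)]
    pool += [(q, "neg", j) for j, q in enumerate(t_neg)]
    hits = Counter()  # majority index -> number of minority samples it neighbours
    for i, p in enumerate(t_pos):
        # distance from this minority sample to every other training sample
        candidates = [
            (math.dist(p, q), cls, j)
            for q, cls, j in pool
            if not (cls == "pos" and j == i)
        ]
        candidates.sort(key=lambda c: c[0])
        for _, cls, j in candidates[:k]:
            if cls == "neg":
                hits[j] += 1
    kept_neg = [q for j, q in enumerate(t_neg) if hits[j] <= min_hits]
    return list(t_pos), kept_neg  # together these form D*

t_pos = [(5.0, 5.0), (5.0, 6.0), (6.0, 5.0), (6.0, 6.0)]
t_neg = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (5.5, 5.5)]
kept_pos, kept_neg = cirus_undersample(t_pos, t_neg, k=3)
```

In the toy data the majority instance at (5.5, 5.5) lies in the middle of the minority cluster and is the nearest neighbour of all four minority samples, so it is removed; the four majority instances near the origin are never among the nearest neighbours and are kept, as are all minority samples.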

Table 2

Datasets used

Dataset	Features	Sample size	Minority sample size	IR
Wisconsin	9	683	239	1.86
Pima	8	768	268	1.87
Glass0	9	214	70	2.06
Vehicle1	18	846	217	2.9
Ecoli1	7	336	77	3.36
New-thyroid2	5	215	35	5.14
Segment0	19	2308	329	6.02
Yeast3	8	1484	163	8.1
Vowel0	13	988	90	9.98
Yeast1vs7	7	459	30	14.3
Page-blocks13vs2	10	472	28	15.86
Abalone09-18	8	731	42	16.4
Yeast4	8	1484	51	28.1
Ecoli0137vs26	7	281	7	39.14
Yeast6	8	1484	35	41.4

Table 3

The AUC performance measure on different datasets using SVM

Dataset	SMOTE	BLSMOTE	ENN	Cluster-based under-sampling	CIRUS
Wisconsin	96.66	96.66	96.66	96.66	96.66
Pima	60.02	61.43	64.6	67.8	61.43
Glass0	68.14	64.29	74.23	80.34	64.29
Vehicle1	48.93	51.54	55.91	70.4	51.54
Ecoli1	70.11	80.85	80.85	84.95	77.46
New-thyroid2	100	92.58	98.6	95.74	98.6
Segment0	97.54	98.85	97.67	97.95	97.67
Yeast3	67.68	70.04	74.14	90.11	69.9
Vowel0	62.9	54.77	70.71	88.47	63.25
Yeast1vs7	84.98	94.28	94.28	94.28	94.28
Page-blocks13vs2	62.9	62.9	62.9	64.17	62.9
Abalone09-18	100	89.44	99.43	97.12	99.43
Yeast4	35.1	66.52	66.52	66.52	66.52
Ecoli0137vs26	100	100	100	96.23	100
Yeast6	75.46	37.67	53.27	74.8	37.8

In this work, we concentrate on the boundary regions and identify the majority samples that lie in the overlapping class region. This process eliminates those samples without disturbing the rest of the data, which limits loss of information. The method showed better accuracy for minority class samples, as discussed in Section 6. The algorithm works by identifying and eliminating the majority class samples from the boundary regions. The undersampled data are then used for training the classification algorithm.

In this section, we discussed our solution for handling the class overlap problem in imbalanced datasets. The next section presents the experimental results of the proposed method on 15 real-world datasets.

Table 4

The F-Score performance measure on different datasets using SVM

Dataset	SMOTE	BLSMOTE	ENN	Cluster-based under-sampling	CIRUS
Wisconsin	91.02	91.02	91.06	91.06	91.06
Pima	57.77	55.73	57.42	50.28	55.73
Glass0	77.29	85.38	86.61	68.14	85.37
Vehicle1	67.96	66.73	54.65	41.45	66.73
Ecoli1	66.91	91	91	70.81	100
New-thyroid2	100	100	87.5	70	87.5
Segment0	98.43	95.57	100	94.09	100
Yeast3	71.79	76.51	75.33	60	73.08
Vowel0	100	100	100	100	100
Yeast1vs7	16.54	16.54	16.54	16.54	16.54
Page-blocks13vs2	100	100	84.73	52.6	84.73
Abalone09-18	34.29	34.29	34.29	11.54	34.29
Yeast4	37.92	37.92	0	13.85	100
Ecoli0137vs26	100	100	100	25.64	100
Yeast6	79.96	33.27	49.93	39.93	100

Table 5

The G-Mean performance measure on different datasets using SVM

Dataset	SMOTE	BLSMOTE	ENN	Cluster-based under-sampling	CIRUS
Wisconsin	92.06	92.06	92.06	92.06	92.06
Pima	57.77	55.83	57.42	50.28	55.83
Glass0	77.29	85.37	88.61	68.14	85.37
Vehicle1	57.96	66.73	53.64	41.45	66.73
Ecoli1	66.91	91	91	70.81	100
New-thyroid2	100	100	87.5	70	87.5
Segment0	98.43	95.57	100	94.09	100
Yeast3	71.79	76.51	75.33	60	73.08
Vowel0	100	100	100	100	100
Yeast1vs7	16.54	16.54	16.54	16.54	16.54
Page-blocks13vs2	100	100	84.73	52.6	84.73
Abalone09-18	34.29	34.29	34.29	11.54	34.29
Yeast4	37.92	37.92	37.92	13.85	100
Ecoli0137vs26	100	100	100	25.64	100
Yeast6	79.96	33.27	49.93	39.93	100

Table 6

The AUC performance measure on different datasets using J48

Dataset	SMOTE	BLSMOTE	ENN	Cluster-based under-sampling	CIRUS
Wisconsin	96.66	96.66	96.66	96.66	96.66
Pima	60.02	61.43	64.6	67.8	61.43
Glass0	68.14	64.29	74.23	80.34	64.29
Vehicle1	48.93	51.54	55.91	70.4	51.54
Ecoli1	70.11	80.85	80.85	84.95	77.46
New-thyroid2	100	92.58	98.6	95.74	98.6
Segment0	97.54	98.85	97.67	97.95	97.67
Yeast3	67.68	70.04	74.14	90.11	69.9
Vowel0	62.9	54.77	70.71	88.47	63.25
Yeast1vs7	84.98	94.28	94.28	94.28	94.28
Page-blocks13vs2	62.9	62.9	62.9	64.17	62.9
Abalone09-18	100	89.44	99.43	97.12	99.43
Yeast4	35.1	66.52	66.52	66.52	66.52
Ecoli0137vs26	100	100	100	96.23	100
Yeast6	75.46	37.67	53.27	74.8	37.8

Table 7

The F-Score performance measure on different datasets using J48

Dataset	SMOTE	BLSMOTE	ENN	Cluster-based under-sampling	CIRUS
Wisconsin	92.06	92.06	92.06	92.06	92.06
Pima	57.77	55.83	57.42	50.28	55.83
Glass0	77.29	85.37	88.61	68.14	85.37
Vehicle1	57.96	66.73	53.64	41.45	66.73
Ecoli1	66.91	91	91	70.81	100
New-thyroid2	100	100	87.5	70	87.5
Segment0	98.43	95.57	100	94.09	100
Yeast3	71.79	76.51	75.33	60	73.08
Vowel0	100	100	100	100	100
Yeast1vs7	16.54	16.54	16.54	16.54	16.54
Page-blocks13vs2	100	100	84.73	52.6	84.73
Abalone09-18	34.29	34.29	34.29	11.54	34.29
Yeast4	37.92	37.92	0	13.85	100
Ecoli0137vs26	100	100	100	25.64	100
Yeast6	79.96	33.27	49.93	39.93	100

Table 8

The G-Mean performance measure on different datasets using J48

Dataset	SMOTE	BLSMOTE	ENN	Cluster-based under-sampling	CIRUS
Wisconsin	92.06	92.06	92.06	92.06	92.06
Pima	57.77	55.83	57.42	50.28	55.83
Glass0	77.29	85.37	88.61	68.14	85.37
Vehicle1	57.96	66.73	53.64	41.45	66.73
Ecoli1	66.91	91	91	70.81	100
New-thyroid2	100	100	87.5	70	87.5
Segment0	98.43	95.57	100	94.09	100
Yeast3	71.79	76.51	75.33	60	73.08
Vowel0	100	100	100	100	100
Yeast1vs7	16.54	16.54	16.54	16.54	16.54
Page-blocks13vs2	100	100	84.73	52.6	84.73
Abalone09-18	34.29	34.29	34.29	11.54	34.29
Yeast4	37.92	37.92	37.92	13.85	100
Ecoli0137vs26	100	100	100	25.64	100
Yeast6	79.96	33.27	49.93	39.93	100

Table 9

The AUC performance measure on different datasets using RF

Dataset	SMOTE	BLSMOTE	ENN	Cluster-based under-sampling	CIRUS
Wisconsin	96.66	96.66	96.66	96.66	96.66
Pima	60.02	61.43	64.6	67.8	61.43
Glass0	68.14	64.29	74.23	80.34	64.29
Vehicle1	48.93	51.54	55.91	70.4	51.54
Ecoli1	70.11	80.85	80.85	84.95	77.46
New-thyroid2	100	92.58	98.6	95.74	98.6
Segment0	97.54	98.85	97.67	97.95	97.67
Yeast3	67.68	70.04	74.14	90.11	69.9
Vowel0	62.9	54.77	70.71	88.47	63.25
Yeast1vs7	84.98	94.28	94.28	94.28	94.28
Page-blocks13vs2	62.9	62.9	62.9	64.17	62.9
Abalone09-18	100	89.44	99.43	97.12	99.43
Yeast4	35.1	66.52	66.52	66.52	66.52
Ecoli0137vs26	100	100	100	96.23	100
Yeast6	75.46	37.67	53.27	74.8	37.8

Table 10

The F-Score performance measure on different datasets using RF

Dataset	SMOTE	BLSMOTE	ENN	Cluster-based under-sampling	CIRUS
Wisconsin	92.06	92.06	92.06	92.06	92.06
Pima	57.77	55.83	57.42	50.28	55.83
Glass0	77.29	85.37	88.61	68.14	85.37
Vehicle1	57.96	66.73	53.64	41.45	66.73
Ecoli1	66.91	91	91	70.81	100
New-thyroid2	100	100	87.5	70	87.5
Segment0	98.43	95.57	100	94.09	100
Yeast3	71.79	76.51	75.33	60	73.08
Vowel0	100	100	100	100	100
Yeast1vs7	16.54	16.54	16.54	16.54	16.54
Page-blocks13vs2	100	100	84.73	52.6	84.73
Abalone09-18	34.29	34.29	34.29	11.54	34.29
Yeast4	37.92	37.92	0	13.85	100
Ecoli0137vs26	100	100	100	25.64	100
Yeast6	79.96	33.27	49.93	39.93	100

Table 11

The G-Mean performance measure on different datasets using RF

Dataset	SMOTE	BLSMOTE	ENN	Cluster-based under-sampling	CIRUS
Wisconsin	92.06	92.06	92.06	92.06	92.06
Pima	57.77	55.83	57.42	50.28	55.83
Glass0	77.29	85.37	88.61	68.14	85.37
Vehicle1	57.96	66.73	53.64	41.45	66.73
Ecoli1	66.91	91	91	70.81	100
New-thyroid2	100	100	87.5	70	87.5
Segment0	98.43	95.57	100	94.09	100
Yeast3	71.79	76.51	75.33	60	73.08
Vowel0	100	100	100	100	100
Yeast1vs7	16.54	16.54	16.54	16.54	16.54
Page-blocks13vs2	100	100	84.73	52.6	84.73
Abalone09-18	34.29	34.29	34.29	11.54	34.29
Yeast4	37.92	37.92	37.92	13.85	100
Ecoli0137vs26	100	100	100	25.64	100
Yeast6	79.96	33.27	49.93	39.93	100

6. Experimental setup

Extensive experiments using 15 public real-world datasets were carried out for evaluation. Experimental results were compared with state-of-the-art methods, namely SMOTE [7], clustering-based undersampling [28], BLSMOTE [22] and ENN [9]. Support Vector Machine (SVM), Decision Tree (J48) and Random Forest (RF) were chosen as the learning algorithms because they are widely used for skewed data distributions.

6.1 Data set

We evaluate the proposed algorithm using 15 datasets from the KEEL repository [1] with different imbalance ratios (IR). Table 2 shows the details of the imbalanced datasets, including the number of features and the imbalance ratio.
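The IR column of Table 2 is consistent with the ratio of majority to minority samples, which can be checked from the sample sizes given in the table. The short sketch below does this for three of the datasets; the function name `imbalance_ratio` is an assumption made for the example.

```python
def imbalance_ratio(n_samples, n_minority):
    """Ratio of majority samples to minority samples."""
    n_majority = n_samples - n_minority
    return n_majority / n_minority

# (sample size, minority sample size) taken from Table 2
table2 = {"Wisconsin": (683, 239), "Pima": (768, 268), "Yeast4": (1484, 51)}
ratios = {name: round(imbalance_ratio(n, m), 2) for name, (n, m) in table2.items()}
```

The rounded results are 1.86, 1.87 and 28.1, matching the values listed for Wisconsin, Pima and Yeast4.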

6.2 Results

In this section, we compared the results using different state-of-the-art techniques namely, SMOTE, clustering-based undersampling, BLSMOTE and ENN on different classification algorithms such as SVM, J48 and RF. The metric used are AUC, F-Score and G-Mean. Tables 3–11 shows the results of different state-of-the-art techniques compared with our proposed method on different classification algorithms. The experiments are carried out using 3 classification algorithms.

Figure 5.

Comparison of the proposed method with state-of-the-methods using J48 classifier (AUC metric).

Figure 6.

Comparison of the proposed method with state-of-the-methods using SVM classifier (AUC metric).

Figure 7.

Comparison of the proposed method with state-of-the-methods using Random Forest classifier (AUC metric).

From the experimental results we observe that the performance of the proposed method is consistent across different algorithms and datasets. Figures 5–7 shows the comparison based on AUC on the proposed method with state-of-the-methods using three different classifiers.

The experimental results show that the proposed method outperforms SMOTE, clustering-based undersampling, BLSMOTE and ENN on most of the datasets. CIRUS performs better at finding and eliminating the overlapping majority samples in the boundary region. In terms of F-Score and AUC, the proposed model performs better on the majority of datasets with SVM and Random Forest. We therefore conclude that the proposed method is superior to the state-of-the-art methods on most of the datasets.

7. Conclusion

In this paper, we proposed a novel framework for undersampling the critical majority instances in the boundary regions. The proposed CIRUS method effectively identifies and removes the majority instances in the boundary regions. Extensive experiments on real-world datasets were carried out, in which the proposed method was compared against state-of-the-art methods and performed well. In general, the method can be applied to imbalanced datasets with any classification algorithm.

References

1. Alcalá-Fdez, Fernández, Luengo, Derrac, García, Sánchez and Herrera, KEEL data-mining software tool: data set repository, integration of algorithms and experimental analysis framework, Journal of Multiple-Valued Logic & Soft Computing 17 (2011).

2. Aşkan and Sayın, SVM classification for imbalanced data sets using a multiobjective optimization framework, Annals of Operations Research 216(1) (2014), 191–203.

3. Barua, Islam M.M., Yao and Murase, MWMOTE-majority weighted minority oversampling technique for imbalanced data set learning, IEEE Transactions on Knowledge and Data Engineering 26(2) (2012), 405–425.

4. Bunkhumpornpat and Sinapiromsaran, DBMUTE: density-based majority under-sampling technique, Knowledge and Information Systems 50(3) (2017), 827–850.

5. Bunkhumpornpat, Sinapiromsaran and Lursinsap, Safe-level-SMOTE: safe-level-synthetic minority over-sampling technique for handling the class imbalanced problem, in: Pacific-Asia Conference on Knowledge Discovery and Data Mining, Springer, 2009, pp. 475–482.

6. Bunkhumpornpat, Sinapiromsaran and Lursinsap, DBSMOTE: density-based synthetic minority over-sampling technique, Applied Intelligence 36(3) (2012), 664–684.

7. Chawla N.V., Bowyer K.W., Hall L.O. and Kegelmeyer W.P., SMOTE: synthetic minority over-sampling technique, Journal of Artificial Intelligence Research 16 (2002), 321–357.

8. Chawla N.V., Japkowicz and Kotcz, Special issue on learning from imbalanced data sets, ACM SIGKDD Explorations Newsletter 6(1) (2004), 1–6.

9. Cunningham and Delany S.J., k-nearest neighbour classifiers, Multiple Classifier Systems 34(8) (2007), 1–17.

10. Das, Datta and Chaudhuri B.B., Handling data irregularities in classification: foundations, trends, and future challenges, Pattern Recognition 81 (2018), 674–693.

11. Devi, Purkayastha et al., Redundancy-driven modified Tomek-link based undersampling: a solution to class imbalance, Pattern Recognition Letters 93 (2017), 3–12.

12. Douzas, Bacao and Last, Improving imbalanced learning through a heuristic oversampling method based on k-means and SMOTE, Information Sciences 465 (2018), 1–20.

13. Ezawa K.J., Singh and Norton S.W., Learning goal oriented Bayesian networks for telecommunications risk management, in: Proceedings of the International Conference on Machine Learning, 1996, pp. 139–147.

14. Fawcett and Provost F.J., Combining data mining and machine learning for effective user profiling, in: Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, 1996, pp. 8–13.

15. Fernández, López, Galar, Del Jesus M.J. and Herrera, Analysing the classification of imbalanced data-sets with multiple classes: binarization techniques and ad-hoc approaches, Knowledge-Based Systems 42 (2013), 97–110.

16. Freitas, Costa-Pereira and Brazdil, Cost-sensitive decision trees applied to medical data, in: International Conference on Data Warehousing and Knowledge Discovery, Springer, 2007, pp. 303–312.

17. García and Herrera, Evolutionary undersampling for classification with imbalanced datasets: proposals and taxonomy, Evolutionary Computation 17(3) (2009), 275–306.

18. Gillala Rekha, A.K.T. and Krishna Reddy, Chaotic salp swarm optimization using SVM for class imbalance problems, in: 19th International Conference on Hybrid Intelligent Systems (HIS 2019), Springer, 2019.

19. Gillala Rekha, A.K.T. and Krishna Reddy, A novel approach for solving skewed classification problem using cluster based ensemble method, Mathematical Foundations of Computing, 2020.

20. Gong and Kim, RHSBoost: improving classification performance in imbalance data, Computational Statistics & Data Analysis 111 (2017), 1–13.

21. Haixiang, Yijing, Shang, Mingyun, Yuanyue and Bing, Learning from class-imbalanced data: review of methods and applications, Expert Systems with Applications 73 (2017), 220–239.

22. Han, Wang W.-Y. and Mao B.-H., Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning, in: International Conference on Intelligent Computing, Springer, 2005, pp. 878–887.

23. Bai, Garcia E.A. and Li, ADASYN: adaptive synthetic sampling approach for imbalanced learning, in: 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), IEEE, 2008, pp. 1322–1328.

24. Japkowicz and Stephen, The class imbalance problem: a systematic study, Intelligent Data Analysis 6(5) (2002), 429–449.

25. Johnson J.M. and Khoshgoftaar T.M., Survey on deep learning with class imbalance, Journal of Big Data 6(1) (2019), 27.

26. Kubat, Holte R.C. and Matwin, Machine learning for the detection of oil spills in satellite radar images, Machine Learning 30(2–3) (1998), 195–215.

27. Laurikkala, Improving identification of difficult small classes by balancing class distribution, in: Conference on Artificial Intelligence in Medicine in Europe, Springer, 2001, pp. 63–66.

28. Lin W.-C., Tsai C.-F., Y.-H. and Jhang J.-S., Clustering-based undersampling in class-imbalanced data, Information Sciences 409 (2017), 17–26.

29. López, Del Río, Benítez J.M. and Herrera, Cost-sensitive linguistic fuzzy rule based classification systems under the MapReduce framework for imbalanced big data, Fuzzy Sets and Systems 258 (2015), 5–38.

30. Mazurowski M.A., Habas P.A., Zurada J.M., J.Y., Baker J.A. and Tourassi G.D., Training neural network classifiers for medical decision making: the effects of imbalanced datasets on classification performance, Neural Networks 21(2–3) (2008), 427–436.

31. Napierala and Stefanowski, Types of minority class examples and their influence on learning classifiers from imbalanced data, Journal of Intelligent Information Systems 46(3) (2016), 563–597.

32. Nekooeimehr and Lai-Yuen S.K., Adaptive semi-unsupervised weighted oversampling (A-SUWO) for imbalanced datasets, Expert Systems with Applications 46 (2016), 405–416.

33. Onan, Consensus clustering-based undersampling approach to imbalanced learning, Scientific Programming 2019 (2019).

34. Patel and Thakur G.S., Classification of imbalanced data using a modified fuzzy-neighbor weighted approach, International Journal of Intelligent Engineering and Systems 10(1) (2017), 56–64.

35. Prati R.C., Batista G.E. and Monard M.C., Class imbalances versus class overlapping: an analysis of a learning system behavior, in: Mexican International Conference on Artificial Intelligence, Springer, 2004, pp. 312–321.

36. Rekha and Tyagi A.K., Necessary information to know to solve class imbalance problem: from a user’s perspective, in: Proceedings of ICRIC 2019, Springer, 2020, pp. 645–658.

37. Rekha, Tyagi A.K. and Krishna Reddy, Solving class imbalance problem using bagging, boosting techniques, with and without using noise filtering method, International Journal of Hybrid Intelligent Systems (Preprint) (2019), 1–10.

38. Rekha, Tyagi A.K. and Krishna Reddy, A wide scale classification of class imbalance problem and its solutions: a systematic literature review, Journal of Computer Science 15 (2019), 886–929.

39. Sáez J.A., Luengo, Stefanowski and Herrera, SMOTE-IPF: addressing the noisy and borderline examples problem in imbalanced classification by a re-sampling method with filtering, Information Sciences 291 (2015), 184–203.

40. Sun, Song, Zhu, Sun and Zhou, A novel ensemble method for classifying imbalanced data, Pattern Recognition 48(5) (2015), 1623–1637.

41. Tsai C.-F., Lin W.-C., Y.-H. and Yao G.-T., Under-sampling class imbalanced datasets by combining clustering analysis and instance selection, Information Sciences 477 (2019), 47–54.

42. Vuttipittayamongkol, Elyan, Petrovski and Jayne, Overlap-based undersampling for improving imbalanced data classification, in: International Conference on Intelligent Data Engineering and Automated Learning, Springer, 2018, pp. 689–697.
