Impact of class imbalance ratio on ensemble methods for imbalance problem: A new perspective

Abstract

The class imbalance problem (CIP) arises when the class distribution is not uniform. Many real-world scenarios exhibit CIP, which has attracted researchers’ attention. Training machine learning (ML) models on class-imbalanced datasets is challenging. Ensemble methods in ML train multiple classifiers and combine or average their predictions to reach a final prediction. Specifically designed ensemble-based methods can overcome the difficulty faced by traditional classifiers and can handle the CIP. In this paper, we assess the performance of 19 ensemble methods on 44 imbalanced datasets in order to observe the effects of the class imbalance ratio (CIR). For performance evaluation, we divide these datasets into three categories based on CIR: Slightly Imbalance (SI), Moderately Imbalance (MI) and Highly Imbalance (HI). With the proposed perspective, we observe that different ensemble methods perform well in different categories, suggesting that the percentage of the minority or majority class could be a criterion for selecting ensemble methods for class-imbalanced datasets. Moreover, visual representations and several non-parametric statistical tests are used to make the results more reliable.

Keywords

Ensemble methods, boosting, bagging, hybrid approaches, classification

1 Introduction

Machine learning (ML) uses various algorithms to extract information from large amounts of raw data. These ML algorithms have been applied in different domains. Most learning algorithms assume that training datasets are balanced [1]. However, many real-world applications involve imbalanced datasets, in which one class’s proportion is higher than that of the other class. The class with the larger number of samples is called the majority class and the one with the smaller number of samples is called the minority class [2]. The learning procedure of most ML algorithms is guided by evaluation criteria that can relegate the minority class and bias the model towards the majority class. This problem is popularly referred to as the Class Imbalance Problem (CIP). It makes classifier learning from imbalanced datasets a challenging task [3]. This situation is significant because it arises in many real-world domains such as fault diagnosis [4] and anomaly detection [5].

Several approaches have been reported in the literature to address the CIP. These approaches are grouped into three categories: Data level (DL), Algorithm level (AL) and Hybrid level (HL) [2, 6]. DL approaches are considered external approaches: either the instances of the over-represented class are under-sampled or the instances of the under-represented class are over-sampled to balance the dataset. This prevents ML classifiers from becoming biased towards the majority class. However, DL approaches may lead to overfitting due to random over-sampling or may cause loss of information due to under-sampling. AL approaches are considered internal approaches: existing classification algorithms are modified or new algorithms are developed. However, AL approaches require in-depth knowledge and understanding of the algorithm. Lastly, HL approaches take advantage of both the DL and AL approaches. Cost-sensitive approaches and ensemble methods come under this category [3, 6]. Cost-sensitive approaches assign weights to each instance in the dataset, and these weights are updated based on the classifier’s performance. The weights represent the probability of an instance being selected for the classifier’s training in the next iteration. The ensemble approach involves training different classifiers (base classifiers) and combining their predictions to reach a final hypothesis, as shown in Fig. 1. Ensemble methods improve the overall performance by handling the overfitting issue and reducing the variance of the model. Each classifier uses a different training dataset to minimize bias.

Fig. 1

An ensemble approach.

In recent years, many researchers have presented comparative studies using state-of-the-art techniques on imbalanced datasets.

Wongvorachan et al. in 2023 [7] compared a variety of sampling methods on a dataset to address different ratios of the CIP. The authors used Random Forest to evaluate the performance of each sampling method. The findings indicated that hybrid resampling worked well for highly imbalanced data and random oversampling for moderately imbalanced data. Jiang et al. in 2023 [8] proposed a new taxonomy for imbalanced data learning approaches and compared traditional approaches with Generative Adversarial Networks. However, as reported by the researchers, the study has a few limitations: it did not explore more datasets and did not take emerging techniques such as transfer learning into consideration. Singh et al. in 2022 [9] contrasted the effectiveness of class balancing techniques and classification algorithms on a single dataset. The authors observed that ensemble classification models such as AdaBoost, XGBoost and Random Forest performed well when oversampling was followed by under-sampling.

In the literature, there are multiple studies which provide the comparison of the ensemble-based methods which have been used for the CIP [3, 6]. Here, the researchers have put these methods into 3 categories: Boosting-based, Bagging-based and Hybrid methods.

The performance of these methods was compared on different datasets according to the category of the ensemble methods.

Research gap identified: From the above studies, we identify that there is a lack of a framework where the performance of these methods is measured w.r.t. the degree of imbalance in datasets.

Contrary to previous studies, our study aims to analyze the impact of CIR on the performance of the ensemble methods using a new perspective of dataset categorization (discussed in Section 4) into Slightly, Moderately and Highly Imbalance datasets. CIR measures the degree of imbalance in a dataset [6]. It is the ratio of the majority class size to the minority class size. Methods may perform well on datasets with low CIR and may fail on datasets with high CIR.
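As a minimal illustration of this definition, the following Python sketch computes the CIR from a list of class labels. The function name `class_imbalance_ratio` and the toy labels are hypothetical and are not taken from the experiments in this paper.

```python
from collections import Counter


def class_imbalance_ratio(labels):
    """Return the majority-class size divided by the minority-class size."""
    counts = Counter(labels)
    if len(counts) < 2:
        raise ValueError("at least two classes are required")
    return max(counts.values()) / min(counts.values())


# Hypothetical toy dataset: 90 negative and 10 positive instances.
toy_labels = ["negative"] * 90 + ["positive"] * 10
toy_cir = class_imbalance_ratio(toy_labels)  # 90 / 10
```

A perfectly balanced binary dataset has a CIR of 1, and the value grows as the minority class shrinks.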

To address this research gap, we design the experimental framework with 19 ensemble methods using 44 imbalanced datasets. Out of these 19 ensemble methods, 10 are Boosting based, 7 are Bagging based, whereas the remaining 2 are Hybrid methods.

The key contributions of this study are as follows:

We propose a new perspective for the performance evaluation of ensemble methods using dataset categorization. The 44 imbalanced datasets used in this study have been categorized as Slightly Imbalance, Moderately Imbalance and Highly Imbalance datasets depending on the CIR. Figure 2 shows the visual representation of these categories, including a balanced dataset.

Methods are evaluated using the Area Under the Curve (AUC) to analyze their performance category-wise. Finally, different non-parametric statistical tests, as used in the literature [3, 6], are employed for the comparison of ensemble methods.

For Slightly and Moderately imbalance datasets, over-sampling and under-sampling approaches perform considerably well, whereas under-sampling performs better for Highly imbalanced datasets.

For Slightly imbalance datasets, UnderBagging performs better, whereas SmoteBagging outperforms the other methods for Moderately imbalance datasets. RUSBoost has the best performance for Highly imbalance datasets.

It is observed that the EUSBoost method takes more model learning time than the other methods, whereas no significant differences in model learning time are noticed among the others.

Fig. 2

Datasets with different degrees of imbalance data.

In several application fields such as sentiment analysis, medical diagnosis and credit card fraud detection, imbalanced data is a serious issue that impairs the model’s performance. Overall, this study offers researchers and practitioners a novel viewpoint: considering the CIR when choosing the best imbalance classification model.

The paper is organized as follows: Section 2 describes the state of the art of the different ensemble methods used. Section 3 discusses the performance metric used for evaluation and Section 4 presents the dataset properties. Section 5 describes the experimental setup and results, and Section 6 concludes the paper.

2 Ensemble methods

This section discusses the ensemble-based methods used to handle the CIP. An ensemble approach [10–15] tries to improve on the performance of single classifiers by inducing and combining numerous classifiers to create a new classifier that outperforms them all. The primary concept is therefore to build numerous classifiers and combine their predictions when unseen cases are presented. Figure 3 shows the ensemble methods reviewed in the paper.

Fig. 3

Ensemble methods used to handle CIP.

2.1 Ensemble using boosting

Boosting aims to turn weak learners into a strong learner by increasing the accuracy of their predictions. Boosting trains numerous weak classifiers successively, with the cases misclassified in each iteration being emphasized for the next classifier. The process continues until the required accuracy is obtained or a certain number of classifiers has been employed.

Several approaches have been proposed in the literature and are discussed below.

2.1.1 AdaBoost (1997)

Freund and Schapire [16] proposed the Adaptive Boosting (AdaBoost) algorithm.

AdaBoost Algorithm
Input: N labelled training instances {(x_i, y_i)}, i = 1, …, N, where x_i
are in some domain Ω and y_i ∈ {-1, +1}.
Output: Final hypothesis H(x)
1. For each iteration t = 1, …, T,
the distribution D_t is calculated, where D_1(i) = 1/N for
i = 1, …, N.
2. A weak hypothesis is computed using the weak learner,
h_t: Ω ⟶ {-1, +1}, where the goal is to find a weak
hypothesis
with low weighted error ε_t w.r.t. D_t.
3. Choose the assigned weight α_t:
$α_{t} = \frac{1}{2} \ln ((1 - ε_{t}) / ε_{t})$
4. Update D_{t+1} for i = 1, …, N, where Z_t is the normalization factor:
$D_{t+1} (i) = D_{t} (i) \exp (-α_{t} y_{i} h_{t} (x_{i})) / Z_{t}$
5. Compute the final hypothesis:
$H (x) = sign ( \sum_{t = 1}^{T} α_{t} h_{t} (x) )$

Initially, all the instances are assigned equal weights and the entire dataset is used to train the classifiers sequentially. After every iteration, the next classifier focuses more on the instances that are hard to classify. The AdaBoost algorithm selects the class label for an unseen instance by a weighted majority vote of the weak classifiers. In this way, each weak classifier contributes and the overall accuracy is improved [17–19].
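The boosting loop described above can be sketched in Python as follows. This is a minimal illustration rather than the implementation used in the experiments: it assumes one-level threshold stumps as the weak learner (the study itself uses C4.5), and the names `train_stump`, `stump_predict`, `adaboost` and `predict`, as well as the toy data, are hypothetical.

```python
import math


def train_stump(X, y, weights):
    """Return (error, feature, threshold, polarity) of the best threshold stump."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X}):
            for polarity in (1, -1):
                preds = [polarity if row[j] <= thr else -polarity for row in X]
                err = sum(w for w, p, t in zip(weights, preds, y) if p != t)
                if best is None or err < best[0]:
                    best = (err, j, thr, polarity)
    return best


def stump_predict(stump, row):
    _, j, thr, polarity = stump
    return polarity if row[j] <= thr else -polarity


def adaboost(X, y, T=10):
    n = len(X)
    D = [1.0 / n] * n                      # uniform initial distribution
    ensemble = []
    for _ in range(T):
        stump = train_stump(X, y, D)       # weak hypothesis with low weighted error
        eps = min(max(stump[0], 1e-10), 1 - 1e-10)   # clip to avoid log(0)
        alpha = 0.5 * math.log((1 - eps) / eps)
        ensemble.append((alpha, stump))
        # Increase the weight of misclassified instances, then normalize.
        D = [d * math.exp(-alpha * yi * stump_predict(stump, xi))
             for d, xi, yi in zip(D, X, y)]
        Z = sum(D)
        D = [d / Z for d in D]
    return ensemble


def predict(ensemble, row):
    score = sum(alpha * stump_predict(stump, row) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1


# Toy 1-D data that no single stump can separate.
X_toy = [[1], [2], [3], [4], [5], [6]]
y_toy = [1, 1, -1, -1, 1, 1]
model = adaboost(X_toy, y_toy, T=3)
toy_predictions = [predict(model, x) for x in X_toy]
```

On this toy set, each individual stump misclassifies at least two instances, whereas the weighted vote of three stumps separates the two classes.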

2.1.2 AdaBoost NC (2010)

AdaBoost NC, a variation of AdaBoost proposed by S. Wang et al. [20], is a negative correlation learning algorithm for classification ensembles. Unlike AdaBoost, AdaBoost NC uses a penalty term to improve diversity. AdaBoost NC attempts to handle the overfitting issue reported in AdaBoost and thus has better generalization ability. Moreover, in AdaBoost NC, the information exchange is better than in AdaBoost, which results in lower computation time.

2.1.3 AdaC2 (2007)

AdaC2 was proposed by Sun et al. [21] as a variation of AdaBoost which represents the different identification importance by introducing cost items. AdaC2 is able to differentiate between different types of samples and ensures that the learning is biased towards the minority class.

2.1.4 AdaBoost M1 and AdaBoost M2

These two algorithms are extensions of the AdaBoost algorithm. Unlike AdaBoost, they are designed for multiclass problems [16, 22], and they differ from each other in how the weight update is calculated.

2.1.5 SmoteBoost (2003)

SmoteBoost [23], proposed by Chawla and Hall, combines the Smote preprocessing technique with the boosting method to handle the class imbalance problem. However, SmoteBoost requires more model learning time because of the preprocessing and may create synthetic data points around noisy instances.
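To make the synthetic-sample step concrete, the sketch below generates one Smote-style point by interpolating between a minority instance and one of its nearest minority neighbours. It is an illustration under simplifying assumptions (squared Euclidean distance, a small neighbourhood size `k`, one sample per call); the function name `smote_sample` and the toy points are hypothetical.

```python
import random


def smote_sample(minority, k, rng):
    """Create one synthetic point on the segment between a random minority
    instance and one of its k nearest minority neighbours."""
    base = rng.choice(minority)
    others = [p for p in minority if p is not base]
    # Sort the remaining minority points by squared Euclidean distance.
    others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
    neighbour = rng.choice(others[:k])
    gap = rng.random()  # interpolation factor in [0, 1)
    return [a + gap * (b - a) for a, b in zip(base, neighbour)]


rng = random.Random(0)
minority_points = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
synthetic = [smote_sample(minority_points, k=2, rng=rng) for _ in range(20)]
```

Because every synthetic point lies between two existing minority instances, a noisy minority instance can pull synthetic points into the majority region, which is the weakness noted above.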

2.1.6 DataBoost-IM (2004)

DataBoost-IM, proposed by Hongyu Guo and Herna L. Viktor [24], combines boosting with data generation to increase the classifier’s prediction capability and address the problem of class imbalance. However, DataBoost-IM takes significantly more model learning time than AdaBoost.

2.1.7 MsmoteBoost (2009)

MsmoteBoost was proposed by Hu et al. [25] as an improvement on the Smote algorithm. MsmoteBoost combines Msmote and AdaBoost to handle the class imbalance problem. Unlike the Smote algorithm, Msmote also considers latent noise and recognizes the minority class better. However, the Msmote algorithm does not consider the differences in important features.

2.1.8 RUSBoost (2009)

The random under-sampling boosting method (RUSBoost) [26], proposed by Seiffert et al., is a variation of the SmoteBoost algorithm. RUSBoost combines random under-sampling with boosting. Compared with SmoteBoost, RUSBoost is faster and simpler. However, RUSBoost does not employ an intelligent method for under-sampling and may therefore lose potentially useful information.
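The random under-sampling step that RUSBoost applies before training each weak classifier can be sketched as follows. This is a simplified illustration that assumes a 1:1 target class ratio; the function name `random_undersample` and the toy data are hypothetical.

```python
import random


def random_undersample(X, y, minority_label, rng):
    """Keep all minority instances and an equally sized random subset of the
    majority instances (assumes a 1:1 target ratio)."""
    minority_idx = [i for i, label in enumerate(y) if label == minority_label]
    majority_idx = [i for i, label in enumerate(y) if label != minority_label]
    kept_majority = rng.sample(majority_idx, len(minority_idx))
    idx = sorted(minority_idx + kept_majority)
    return [X[i] for i in idx], [y[i] for i in idx]


rng = random.Random(0)
X_toy = [[i] for i in range(100)]
y_toy = ["positive"] * 10 + ["negative"] * 90
X_bal, y_bal = random_undersample(X_toy, y_toy, "positive", rng)
```

The discarded majority instances are chosen blindly, which is why informative examples may be lost.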

2.1.9 EUSBoost (2013)

Evolutionary under-sampling boosting (EUSBoost) [27] was proposed by Galar et al. as a variation of the RUSBoost algorithm. Unlike RUSBoost, EUSBoost performs evolutionary under-sampling and, for each boosting iteration, obtains the best subset, i.e., one that cannot be improved further. Thus, EUSBoost improves the predictive accuracy. However, EUSBoost takes more model learning time than other ensembles.

2.2 Ensemble using Bagging

2.2.1 Bagging (1996)

The algorithm proposed by Breiman [28] is also known as Bootstrap aggregating. In this algorithm, several classifiers are trained in parallel on diverse subsets of the original dataset [29, 30]. It helps to overcome overfitting and improves the accuracy of the classifier [20–22].

Bagging Algorithm
Input: N = training set, T = number of iterations,
m = size of each bootstrap sample, h = weak learner
Output: Final hypothesis (bagged classifier):
$H (x) = sign ( \sum_{t = 1}^{T} h_{t} (x) )$
where h_t ∈ [–1, +1] are the weak learners.
1. For t = 1 to T
2. N_t ← RandomSampleReplacement (m, N)
3. h_t ← h (N_t)
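The bagging procedure above can be sketched in Python as follows. The weak learner `fit_mean_threshold` is a hypothetical one-dimensional stand-in chosen only to keep the example self-contained (the study uses C4.5), and the names `bagging_fit` and `bagging_predict` are likewise illustrative.

```python
import random


def bagging_fit(X, y, fit, T=10, m=None, seed=0):
    """Train T weak learners, each on a bootstrap sample of size m."""
    rng = random.Random(seed)
    m = m or len(X)
    models = []
    for _ in range(T):
        idx = [rng.randrange(len(X)) for _ in range(m)]  # sample with replacement
        models.append(fit([X[i] for i in idx], [y[i] for i in idx]))
    return models


def bagging_predict(models, x):
    """Sign of the summed votes; a tie is resolved as +1."""
    return 1 if sum(h(x) for h in models) >= 0 else -1


def fit_mean_threshold(X, y):
    """Hypothetical 1-D weak learner: threshold halfway between class means."""
    pos = [row[0] for row, label in zip(X, y) if label == 1]
    neg = [row[0] for row, label in zip(X, y) if label == -1]
    if not pos or not neg:               # bootstrap sample contained one class only
        constant = 1 if pos else -1
        return lambda x: constant
    pos_mean, neg_mean = sum(pos) / len(pos), sum(neg) / len(neg)
    thr = (pos_mean + neg_mean) / 2
    side = 1 if pos_mean > neg_mean else -1
    return lambda x: side if x[0] > thr else -side


X_toy = [[0], [1], [2], [8], [9], [10]]
y_toy = [-1, -1, -1, 1, 1, 1]
models = bagging_fit(X_toy, y_toy, fit_mean_threshold, T=10)
low, high = bagging_predict(models, [0.5]), bagging_predict(models, [9.5])
```

Because every learner sees a different bootstrap sample, the individual thresholds differ slightly and the vote averages out their variance.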

2.2.2 Underbagging

The UnderBagging algorithm combines the under-sampling technique with bagging [31]. In UnderBagging, the number of instances drawn depends on the re-sampling rate a, whose value ranges from 10 to 100. In each subset, the majority class instances are under-sampled.

2.2.3 Overbagging

The OverBagging algorithm combines the over-sampling technique with bagging [31]. In OverBagging, a = 100, which leads to drawing all the instances from the majority class, and oversampling will be done to the minority class instances.

2.2.4 UnderOverbagging (UOB)

The UnderOverBagging algorithm uses a hybrid of over-sampling and under-sampling with bagging [31]. UnderOverBagging considers a different resampling rate in every iteration and creates more diverse subsets. As the value of a approaches 100, the ensemble changes from UnderBagging to OverBagging, thus modelling a hybrid.

2.2.5 IIVotes (2010)

The Imbalanced Ivotes method was proposed by J. Blaszczynski et al. [32] to address the CIP and to have a better balance between sensitivity and specificity. IIVotes combines the selective pre-processing method, i.e., SPIDER, and the adaptive ensemble, Ivotes proposed by Breiman [33]. IIVotes is more focused on the minority class and can achieve acceptable overall accuracy.

2.2.6 Smotebagging and Msmotebagging (MSB)

SmoteBagging and MsmoteBagging combine Smote and Msmote, respectively, with the bagging method [3, 31]. Both techniques create synthetic samples during subset generation, so that the instances of the under-represented class are over-sampled to address the CIP.

2.3 Ensemble using hybrid bagging and boosting

Hybrid ensembles combine the benefits of both bagging and boosting. Xu-Ying Liu et al. [34] proposed two strategies in the literature, which are discussed below.

2.3.1 Easy ensemble

EasyEnsemble is an unsupervised approach that appears as an ensemble of ensembles, although its output is a single ensemble. It employs bagging as the main ensemble learning approach. It explores the majority class set N by creating subsets N_1, N_2, …, N_T and then combines each N_i with P (the minority class instance set) to train AdaBoost. In the end, all the predictions are collected and voting is done to classify the final output [34].
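The subset construction described above can be sketched as follows. The sketch assumes that each N_i is drawn without replacement and has the same size as P, which is one common choice but is not stated in this section; the function name `easy_ensemble_subsets` and the toy instances are hypothetical. Training an AdaBoost ensemble on each returned set and voting over them is omitted.

```python
import random


def easy_ensemble_subsets(majority, minority, T, seed=0):
    """Build T training sets, each combining a random majority subset N_i
    (assumed here to satisfy |N_i| = |P|) with the full minority set P."""
    rng = random.Random(seed)
    return [rng.sample(majority, len(minority)) + list(minority)
            for _ in range(T)]


N_toy = [("neg", i) for i in range(50)]   # majority class set N
P_toy = [("pos", i) for i in range(5)]    # minority class set P
subsets = easy_ensemble_subsets(N_toy, P_toy, T=4)
```

Since the subsets are sampled independently of any classifier output, different subsets can cover different parts of the majority class.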

2.3.2 BalanceCascade

BalanceCascade works in a supervised manner. Unlike EasyEnsemble, BalanceCascade removes the correctly classified majority class instances from N after every iteration, thus reducing the size of the subset to be learned by the next AdaBoost ensemble. The learning process terminates after the last iteration T, when the size of N becomes less than that of P. BalanceCascade reduces the model learning time and works better than under-sampling [34].

3 Performance evaluation metric

To evaluate the performance of classifiers, different performance metrics such as Precision, Recall, F1 Score and AUC have been used in the literature. In this paper, we use AUC to compare the performance of the ensemble methods because the quality it measures is unaffected by class imbalance [35]. The ROC (Receiver Operating Characteristic) curve plots the percentage of correctly classified positive instances, i.e., the true positive rate TP_rate, against the percentage of misclassified negative instances, i.e., the false positive rate, which equals 1 − TN_rate, where the true negative rate TN_rate is the percentage of correctly classified negative instances. The larger the area under the ROC curve, the better the classifier is performing [6]. AUC is evaluated using equation (1): $AUC = \frac{TP_{rate} + TN_{rate}}{2}$ (1)
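Equation (1) can be evaluated directly from the entries of a confusion matrix, as in the following sketch. The function name `auc_from_confusion` and the counts are hypothetical.

```python
def auc_from_confusion(tp, fn, tn, fp):
    """AUC as the mean of the true positive rate and the true negative rate."""
    tp_rate = tp / (tp + fn)
    tn_rate = tn / (tn + fp)
    return (tp_rate + tn_rate) / 2


# Hypothetical test set with 10 positive and 100 negative instances.
balanced_auc = auc_from_confusion(tp=8, fn=2, tn=90, fp=10)

# A classifier that labels every instance as negative is 90.9% accurate
# on this set, yet its AUC is only 0.5.
trivial_auc = auc_from_confusion(tp=0, fn=10, tn=100, fp=0)
```

The second case shows why accuracy is misleading on imbalanced data: ignoring the minority class entirely is penalized by equation (1) but not by accuracy.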

4 Dataset analysis

We have used 44 imbalanced datasets obtained from the KEEL dataset repository. These datasets have no missing values and the class attribute represents two values which are ‘positive’ and ‘negative’. Table 1 lists the attributes of the datasets.

Table 1
Properties of datasets

Datasets	#Dimensions	Dataset size	Minority class (%)	IR
Glass1	9	214	35.46	1.82
Ecoli-0_vs_1	7	220	34.97	1.86
Wisconsin	9	683	34.97	1.86
Pima	8	768	34.84	1.87
Iris0	4	150	33.33	2.00
Glass0	9	214	32.68	2.06
Yeast1	8	1484	28.90	2.46
Haberman	3	306	26.46	2.78
Vehicle2	18	846	25.77	2.88
Vehicle1	18	846	25.64	2.90
Vehicle3	18	846	25.06	2.99
Glass-0-1-2-3_vs_4-5-6	9	214	23.81	3.20
Vehicle0	18	846	23.53	3.25
Ecoli1	7	336	22.94	3.36
New-thyroid2	5	215	16.29	5.14
New-thyroid1	5	215	16.29	5.14
Ecoli2	7	336	15.48	5.46
Segment0	19	2308	14.25	6.02
Glass6	9	214	13.55	6.38
Yeast3	8	1484	10.99	8.10
Ecoli3	7	336	10.42	8.60
Page-blocks0	10	5472	10.21	8.79
Yeast-2_vs_4	8	514	9.92	9.08
Yeast-0-5-6-7-9_vs_4	8	528	9.66	9.35
Vowel0	13	988	9.11	9.98
Glass-0-1-6_vs_2	9	192	8.86	10.29
Glass2	9	214	7.94	11.59
Shuttle-c0-vs-c4	9	1829	6.72	13.87
Yeast-1_vs_7	7	459	6.54	14.30
Glass4	9	214	6.07	15.46
Ecoli4	7	336	5.95	15.80
Page-blocks-1-3_vs_4	10	472	5.93	15.86
Abalone9-18	8	731	5.75	16.40
Glass-0-1-6_vs_5	9	184	4.89	19.44
Shuttle-c2-vs-c4	9	129	4.65	20.50
Yeast-1-4-5-8_vs_7	8	693	4.33	22.10
Glass5	9	214	4.21	22.78
Yeast-2_vs_8	8	482	4.15	23.10
Yeast4	8	1484	3.44	28.10
Yeast-1-2-8-9_vs_7	8	947	3.17	30.57
Yeast5	8	1484	2.96	32.78
Ecoli-0-1-3-7_vs_2-6	7	281	2.49	39.14
Yeast6	8	1484	2.36	41.40
Abalone19	8	4174	0.77	129.44

Based on the CIR, we have divided these datasets into three categories: 11 Slightly imbalance, 11 Moderately imbalance and 22 Highly imbalance datasets. Table 2 shows the criteria used for this division. Here, the percentage of minority class instances is denoted by ‘m’ and the percentage of majority class instances by ‘M’.

Table 2

Criteria used for datasets categorization

Categories	Majority class (M) percentage	Minority class (m) percentage
Slightly Imbalance	50 < M ≤ 75	50 > m ≥ 25
Moderately Imbalance	75 < M < 90	10 < m < 25
Highly Imbalance	M ≥ 90	m ≤ 10

A dataset falls under the Slightly imbalance category if the percentage of the majority class is more than 50% but less than or equal to 75%, or equivalently, the percentage of the minority class is less than 50% but greater than or equal to 25%.

In the Moderately imbalance category, the percentage of the majority class is more than 75% but less than 90%, or equivalently, the percentage of the minority class is greater than 10% but less than 25%.

For the Highly imbalance category, the percentage of the majority class is greater than or equal to 90%, or equivalently, the percentage of the minority class is less than or equal to 10%.
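The criteria of Table 2 can be expressed as a small function of the minority class percentage, as sketched below. The function name `imbalance_category` is hypothetical; the percentages used in the example are taken from Table 1.

```python
def imbalance_category(minority_pct):
    """Map a minority class percentage to SI, MI or HI following Table 2."""
    if not 0 < minority_pct < 50:
        raise ValueError("minority percentage must lie strictly between 0 and 50")
    if minority_pct >= 25:
        return "SI"   # Slightly Imbalance: 25 <= m < 50
    if minority_pct > 10:
        return "MI"   # Moderately Imbalance: 10 < m < 25
    return "HI"       # Highly Imbalance: m <= 10


# Minority class percentages from Table 1.
examples = {
    "Vehicle3": imbalance_category(25.06),
    "Glass-0-1-2-3_vs_4-5-6": imbalance_category(23.81),
    "Page-blocks0": imbalance_category(10.21),
    "Yeast-2_vs_4": imbalance_category(9.92),
}
```

These four datasets sit on either side of the two category boundaries, which is why they appear as the last or first entries of their columns in the category-wise division.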

Table 3 shows the division of datasets category-wise.

Table 3

Category-wise division of datasets

Slightly imbalance (11)	Moderately imbalance (11)	Highly imbalance (22)
Glass1	Glass-0-1-2-3_vs_4-5-6	Yeast-2_vs_4
Ecoli-0_vs_1	Vehicle0	Yeast-0-5-6-7-9_vs_4
Wisconsin	Ecoli1	Vowel0
Pima	New-thyroid2	Glass-0-1-6_vs_2
Iris0	New-thyroid1	Glass2
Glass0	Ecoli2	Shuttle-c0-vs-c4
Yeast1	Segment0	Yeast-1_vs_7
Haberman	Glass6	Glass4
Vehicle2	Yeast3	Ecoli4
Vehicle1	Ecoli3	Page-blocks-1-3_vs_4
Vehicle3	Page-blocks0	Abalone9-18
		Glass-0-1-6_vs_5
		Shuttle-c2-vs-c4
		Yeast-1-4-5-8_vs_7
		Glass5
		Yeast-2_vs_8
		Yeast4
		Yeast-1-2-8-9_vs_7
		Yeast5
		Ecoli-0-1-3-7_vs_2-6
		Yeast6
		Abalone19

5 Experimental setup and results

In this paper, we perform a comparative analysis of 19 ensemble methods on Slightly, Moderately and Highly imbalance datasets to address the CIP. We have used the C4.5 decision tree as the base classifier in the KEEL tool. Different statistical tests are conducted using the AUC metric to evaluate the performance of the ensemble methods. The initial parameter settings in the tool are shown in Table 4.

Table 4
Initial parameter settings

Parameter	Value
Pruned	True
Cross-Validation	5 Folds
Instances per leaf	2
Confidence Level	0.25
Number of Classifiers	10
Base Classifier	C4.5

The average performance of the classifiers in terms of AUC values category-wise is shown in Table 5.

Table 5

Category-wise average AUC values comparison of the classifiers

Ensemble methods	SI	MI	HI
AdaBoost	0.8021	0.8815	0.7586
AdaC2	0.8083	0.9120	0.7893
AdaBoost NC	0.8000	0.8992	0.7524
AdaBoost M1	0.8010	0.8806	0.7573
AdaBoost M2	0.8010	0.8806	0.7573
SmoteBoost	0.8203	0.9225	0.8099
EUSBoost	0.8211	0.9322	0.8017
DataBoost	0.5709	0.6608	0.7561
MsmoteBoost	0.8161	0.9260	0.7764
RUSBoost	0.8272	0.9184	0.8456
Bagging	0.7965	0.8913	0.7263
SmoteBagging	0.8200	0.9343	0.8161
UnderBagging	0.8358	0.9276	0.8396
OverBagging	0.8107	0.9012	0.7831
UOB	0.8040	0.9175	0.8053
IIVotes	0.8131	0.9054	0.7567
MSB	0.8161	0.9275	0.8038
BalanceCascade	0.8067	0.9133	0.8095
EasyEnsemble	0.8080	0.9149	0.8044

5.1 Category-wise evaluation using visual representation

Figures 4–6 show the category-wise average performance of all the ensemble methods.

Fig. 4

Average performance for SI datasets using AUC.

Fig. 5

Average performance for MI datasets using AUC.

Fig. 6

Average performance for HI datasets using AUC.

From the visual representations, it is observed that for Slightly imbalance datasets, the variation in AUC values ranges from 0.5709 to 0.8358 and the best performing classifier is UnderBagging. For Moderately imbalance datasets, the average performance of the classifiers is best with AUC values ranging from 0.6608 to 0.9343 and SmoteBagging outperformed other classifiers. In the case of Highly imbalance datasets, the AUC values range from 0.7263 to 0.8456 and RUSBoost attained the best results under this category.

The results show that the overall performance of the classifiers is lowest for Highly imbalance datasets, which can be attributed to the smaller number of minority class instances available for the classifier’s learning. Moreover, apart from DataBoost, whose average AUC is markedly lower than that of the other methods in the SI and MI categories (Table 5), the performance variation among the classifiers is larger for Highly imbalanced datasets because of the difficulty in the learning process.

5.2 Statistical tests and analysis

We use three statistical tests, namely the Friedman Test (F Test), the Holm post-hoc test (HP) and the Wilcoxon Matched Pair signed-rank test (WMPSR Test), to perform the empirical assessment of ensemble methods for the different categories of datasets. The F Test is used to obtain the average ranking of the ensemble methods, where a lower average rank indicates better performance, and HP analysis is then performed to find the significant differences between the ensemble methods. Thereafter, we apply the WMPSR Test to find the best performer among the ensemble methods in each category.
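The ranking step of the F Test can be sketched as follows: on each dataset the methods are ranked by AUC (rank 1 for the highest value, average ranks for ties), the ranks are averaged over the datasets, and the Friedman chi-square statistic is computed from the average ranks. The function names and the toy AUC values are hypothetical and do not reproduce the reported results.

```python
def average_ranks(scores):
    """scores[d][j] is the AUC of method j on dataset d; rank 1 = highest AUC."""
    k = len(scores[0])
    totals = [0.0] * k
    for row in scores:
        order = sorted(range(k), key=lambda j: -row[j])
        i = 0
        while i < k:
            j = i
            while j + 1 < k and row[order[j + 1]] == row[order[i]]:
                j += 1                      # extend the group of tied methods
            shared_rank = (i + j) / 2 + 1   # average rank of the tied group
            for t in range(i, j + 1):
                totals[order[t]] += shared_rank
            i = j + 1
    return [total / len(scores) for total in totals]


def friedman_chi_square(scores):
    """Friedman statistic computed from the average ranks."""
    n, k = len(scores), len(scores[0])
    ranks = average_ranks(scores)
    return 12 * n / (k * (k + 1)) * (sum(r * r for r in ranks) - k * (k + 1) ** 2 / 4)


# Hypothetical AUC values: 3 datasets (rows) x 3 methods (columns).
toy_scores = [[0.90, 0.80, 0.70],
              [0.85, 0.80, 0.60],
              [0.70, 0.90, 0.80]]
toy_ranks = average_ranks(toy_scores)
toy_chi2 = friedman_chi_square(toy_scores)
```

Tie handling matters in practice whenever two methods obtain identical AUC values on a dataset, since they then share the same average rank.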

5.2.1 Slightly imbalance datasets

For Slightly imbalance datasets, we run the F Test as shown in Fig. 7. As per the F Test, UnderBagging achieves the lowest rank and turns out to be the top performer.

Fig. 7

Average rankings obtained by Friedman Test for Slightly imbalance category.

Table 6 shows the F Test statistics for the Slightly imbalance category. Here, N represents the number of datasets used and the degree of freedom is one less than the number of methods used. The test computed the p-value as 0.004214, which means that significant differences exist between the ensemble methods. Hence, we used the HP test with UnderBagging as the control method. The result of the HP comparison is shown in Table 7. The null hypothesis is rejected for all the ensemble methods except RUSBoost, as their p-values are less than 0.05. For RUSBoost, the p-value is 0.05, which means that no significant difference exists between UnderBagging and RUSBoost, so the null hypothesis is not rejected. We therefore run the WMPSR Test to compare UnderBagging and RUSBoost.

Table 6

Friedman test statistics

N	44
Chi-Square	37.72
Degree of freedom	18
p-value	0.004214
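As a sanity check on Table 6, the p-value can be recomputed from the chi-square statistic and the degrees of freedom. The sketch below uses the closed-form upper tail of the chi-square distribution that holds for even degrees of freedom, so it needs only the standard library; the function name `chi2_sf_even_df` is hypothetical.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail probability P(X > x) of a chi-square variable with even df:
    exp(-x/2) * sum_{k=0}^{df/2 - 1} (x/2)^k / k!"""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive even df")
    half = x / 2
    term, total = 1.0, 1.0
    for k in range(1, df // 2):
        term *= half / k      # (x/2)^k / k! built incrementally
        total += term
    return math.exp(-half) * total


p_si = chi2_sf_even_df(37.72, 18)   # statistic and df from Table 6
```

With the rounded statistic 37.72 the result is approximately 0.0042, in line with the reported p-value of 0.004214.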

Table 7

Holm Post-hoc comparison for SI category

Control method: UnderBagging (Rank: 4.7273)
Algorithm	Z value	Holm	Hypothesis
			(α=0.05)
DataBoost	3.731846	0.002778	Rejected
BalanceCascade	3.125658	0.002941	Rejected
EasyEnsemble	3.011998	0.003125	Rejected
Bagging	3.011998	0.003333	Rejected
UOB	2.955167	0.003571	Rejected
AdaBoost NC	2.86045	0.003846	Rejected
AdaBoost M1	2.822564	0.004167	Rejected
AdaBoost M2	2.822564	0.004545	Rejected
AdaBoost	2.784677	0.005	Rejected
OverBagging	2.5763	0.005556	Rejected
AdaC2	2.557356	0.00625	Rejected
MSB	2.121659	0.007143	Rejected
MsmoteBoost	2.064829	0.008333	Rejected
SmoteBagging	1.667018	0.01	Rejected
IIVotes	1.136603	0.0125	Rejected
EUSBoost	1.003999	0.016667	Rejected
SmoteBoost	0.852452	0.025	Rejected
RUSBoost	0.644075	0.05	Not Rejected
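The values in the Holm column of Table 7 appear to coincide with the usual Holm step-down thresholds α/(m − i + 1) for m = 18 comparisons against the control method. The sketch below computes these thresholds and applies the standard step-down rule to a list of p-values; the function names and the three example p-values are hypothetical.

```python
def holm_thresholds(m, alpha=0.05):
    """Thresholds alpha/m, alpha/(m-1), ..., alpha/1 for the sorted p-values."""
    return [alpha / (m - i) for i in range(m)]


def holm_reject(p_values, alpha=0.05):
    """Reject hypotheses in order of increasing p-value until the first one
    whose p-value exceeds its threshold; all later ones are retained."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    rejected = [False] * m
    for step, idx in enumerate(order):
        if p_values[idx] > alpha / (m - step):
            break
        rejected[idx] = True
    return rejected


thresholds = holm_thresholds(18)
decisions = holm_reject([0.001, 0.04, 0.03])   # hypothetical p-values
```

In the hypothetical example only the smallest p-value is rejected: 0.03 exceeds its threshold 0.05/2, so the procedure stops there.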

Table 8 shows the WMPSR Test statistics comparison for UnderBagging vs RUSBoost.

Table 8

Wilcoxon test statistics for UnderBagging vs RUSBoost

Comparison	R⁺	R⁻	Exact P-value	Hypothesis
UnderBagging vs RUSBoost	43.0	12.0	0.13086	Not Rejected

R⁺ is the sum of ranks for the datasets on which the first method (UnderBagging) performed better than the second method (RUSBoost), and R⁻ is the sum of ranks for the datasets on which the second method performed better than the first. The p-value recorded is 0.13086, i.e., greater than 0.05; therefore, the null hypothesis is not rejected and no significant difference exists between the methods. As R⁺ is greater than R⁻, UnderBagging is selected as the best performer in Slightly imbalanced datasets.
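The two rank sums can be computed as in the following sketch. It assumes one common convention, in which datasets with identical scores are dropped and tied absolute differences share their average rank; the convention used by the tool in this study may differ. The function name `wilcoxon_rank_sums` and the AUC values are hypothetical.

```python
def wilcoxon_rank_sums(first, second):
    """Return (R_plus, R_minus) for paired scores of two methods."""
    diffs = [a - b for a, b in zip(first, second) if a != b]  # drop zero differences
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1                           # extend the group of tied differences
        for t in range(i, j + 1):
            ranks[order[t]] = (i + j) / 2 + 1
        i = j + 1
    r_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    r_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return r_plus, r_minus


# Hypothetical AUC values of two methods on four datasets.
method_a = [0.90, 0.80, 0.75, 0.95]
method_b = [0.85, 0.90, 0.75, 0.80]
r_plus, r_minus = wilcoxon_rank_sums(method_a, method_b)
```

Here the first method wins on the datasets with the smallest and largest differences (ranks 1 and 3) and loses on the one with rank 2, so R⁺ exceeds R⁻.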

5.2.2 Moderately imbalance datasets

For Moderately imbalanced datasets, F Test average rankings are shown in Fig. 8. As per the F Test, SmoteBagging achieves the lowest rank and turns out to be the top performer.

F Test statistics for Moderately imbalanced datasets are shown in Table 9.

Fig. 8

Average rankings obtained by Friedman Test for Moderately imbalanced datasets.

Table 9

Friedman test statistics

N	44
Chi-Square	75.16
Degree of freedom	18
p-value	0.0000

The p-value computed is 0.0000, which means that significant differences exist. Hence, we used the HP test with SmoteBagging as the control method. The result of the HP comparison is shown in Table 10. The null hypothesis is rejected for all the ensemble methods except EUSBoost, as their p-values are less than 0.05. The p-value of EUSBoost is 0.05, which means that no significant difference exists between SmoteBagging and EUSBoost, so the null hypothesis is not rejected.

Table 10

Holm Post-hoc comparison for MI category

Control method: SmoteBagging (Rank: 4.9545)
Algorithm	Z value	Holm	Hypothesis
			(α=0.05)
AdaBoost M1	4.319091	0.002778	Rejected
AdaBoost M2	4.319091	0.002941	Rejected
AdaBoost	4.110714	0.003125	Rejected
DataBoost	3.959167	0.003333	Rejected
Bagging	3.466639	0.003571	Rejected
OverBagging	2.841507	0.003846	Rejected
AdaBoost NC	2.500526	0.004167	Rejected
EasyEnsemble	2.273206	0.004545	Rejected
AdaC2	2.121659	0.005	Rejected
BalanceCascade	1.989055	0.005556	Rejected
IIVotes	1.989055	0.00625	Rejected
UOB	1.477584	0.007143	Rejected
RUSBoost	1.439697	0.008333	Rejected
MSB	1.136603	0.01	Rejected
MsmoteBoost	0.833509	0.0125	Rejected
SmoteBoost	0.549358	0.016667	Rejected
UnderBagging	0.454641	0.025	Rejected
EUSBoost	0.17049	0.05	Not Rejected

We run the WMPSR Test to compare SmoteBagging and EUSBoost. Table 11 shows the WMPSR Test statistics for SmoteBagging vs EUSBoost. The p-value recorded is ≥ 0.2, which is greater than 0.05, so the null hypothesis is not rejected. As R⁺ is greater than R⁻, SmoteBagging turns out to be the best performer in Moderately imbalance datasets.

Table 11

Wilcoxon test statistics for SmoteBagging vs EUSBoost

Comparison	R⁺	R⁻	Exact P-value	Hypothesis
SmoteBagging vs EUSBoost	41.0	25.0	≥0.2	Not Rejected

5.2.3 Highly imbalance datasets

The F test rankings for Highly imbalance datasets are shown in Fig. 9. As per the F Test, RUSBoost achieves the lowest rank and turns out to be the top performer.

Table 12 shows the F Test statistics for the Highly imbalance category.

Fig. 9

Average rankings obtained by Friedman Test for Highly imbalanced category.

The p-value computed is 0.0000, which means that significant differences exist. Hence, we use the HP test with RUSBoost as the control method. The result of the HP comparison is shown in Table 13. The null hypothesis is rejected for all the ensemble methods except UnderBagging, as their p-values are less than 0.05. The p-value of UnderBagging is 0.05, which means that no significant difference exists between RUSBoost and UnderBagging. We therefore run the WMPSR Test to compare RUSBoost and UnderBagging. Table 14 shows the WMPSR Test statistics for RUSBoost vs UnderBagging. The p-value recorded is ≥ 0.2, which is greater than 0.05; therefore, the null hypothesis is not rejected. As R⁺ is greater than R⁻, RUSBoost turns out to be the best performer in Highly imbalanced datasets.

Table 12

Friedman test statistics for Highly imbalanced category

N	44
Chi-Square	123.54
Degree of freedom	18
p-value	0.0000
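
To make the ranking step concrete, the following is a minimal Python sketch of how average ranks and the Friedman chi-square statistic can be computed from a datasets × methods score matrix. The score values are hypothetical and are not taken from the experiments reported here; the helper names `average_ranks` and `friedman` are illustrative, ties receive the mean rank, and no tie correction is applied to the statistic.

```python
from typing import List, Sequence, Tuple


def average_ranks(row: Sequence[float]) -> List[float]:
    """Rank one dataset's scores: rank 1 = highest score, ties share the mean rank."""
    order = sorted(range(len(row)), key=lambda j: -row[j])
    ranks = [0.0] * len(row)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of tied scores.
        while j + 1 < len(order) and row[order[j + 1]] == row[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for pos in range(i, j + 1):
            ranks[order[pos]] = mean_rank
        i = j + 1
    return ranks


def friedman(scores: Sequence[Sequence[float]]) -> Tuple[List[float], float]:
    """Return (average rank per method, Friedman chi-square statistic)."""
    n = len(scores)       # number of datasets
    k = len(scores[0])    # number of methods
    per_dataset = [average_ranks(row) for row in scores]
    avg = [sum(r[j] for r in per_dataset) / n for j in range(k)]
    chi2 = 12 * n / (k * (k + 1)) * (sum(a * a for a in avg) - k * (k + 1) ** 2 / 4)
    return avg, chi2


# Hypothetical scores: 4 datasets (rows) x 3 methods (columns).
scores = [
    [0.90, 0.85, 0.80],
    [0.88, 0.86, 0.70],
    [0.92, 0.80, 0.85],
    [0.95, 0.90, 0.85],
]
avg_ranks, chi2 = friedman(scores)
print(avg_ranks, chi2)
```

With k methods, the statistic is referred to a chi-square distribution with k − 1 degrees of freedom, which is why Table 12 lists 18 degrees of freedom for 19 ensemble methods.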

Table 13

Holm Post-hoc comparison for HI category

Control method: RUSBoost (Rank: 4.2273)
Algorithm	Z value	Holm	Hypothesis (α=0.05)
Bagging	6.617127	0.002778	Rejected
IIVotes	5.612502	0.002941	Rejected
AdaBoost NC	5.371392	0.003125	Rejected
DataBoost	5.183862	0.003333	Rejected
AdaBoost M1	4.902567	0.003571	Rejected
AdaBoost M2	4.902567	0.003846	Rejected
Adaboost	4.875777	0.004167	Rejected
MsmoteBoost	4.621273	0.004545	Rejected
OverBagging	3.777388	0.005	Rejected
AdaC2	2.946898	0.005556	Rejected
EasyEnsemble	2.906713	0.00625	Rejected
BalanceCascade	2.531654	0.007143	Rejected
MSB	2.357519	0.008333	Rejected
SmoteBoost	2.076224	0.01	Rejected
UOB	2.036039	0.0125	Rejected
EUSBoost	1.781534	0.016667	Rejected
SmoteBagging	1.393079	0.025	Rejected
UnderBagging	0.75012	0.05	Not Rejected
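
The Holm column of Table 13 corresponds to the step-down thresholds α/(m − i) for m = 18 comparisons (0.05/18 ≈ 0.002778 down to 0.05/1 = 0.05). The sketch below illustrates a Holm procedure against a control method in Python. It assumes the commonly used z statistic, z = (R_i − R_control) / sqrt(k(k+1)/(6N)), with a two-sided normal p-value; whether this matches the exact software settings used for Table 13 is an assumption. The average ranks in the usage example and the names `holm_thresholds` and `holm_vs_control` are hypothetical.

```python
from math import sqrt
from statistics import NormalDist
from typing import Dict, List, Tuple


def holm_thresholds(m: int, alpha: float = 0.05) -> List[float]:
    """Step-down thresholds alpha/m, alpha/(m-1), ..., alpha/1."""
    return [alpha / (m - i) for i in range(m)]


def holm_vs_control(
    avg_ranks: Dict[str, float], control: str, n_datasets: int, alpha: float = 0.05
) -> List[Tuple[str, float, float, bool]]:
    """Return (method, z, Holm threshold, rejected) ordered by increasing p-value."""
    k = len(avg_ranks)
    se = sqrt(k * (k + 1) / (6 * n_datasets))  # standard error of a rank difference
    rows = []
    for name, rank in avg_ranks.items():
        if name == control:
            continue
        z = (rank - avg_ranks[control]) / se
        p = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided p-value
        rows.append((name, z, p))
    rows.sort(key=lambda r: r[2])  # most significant comparison first

    results = []
    rejecting = True
    for (name, z, p), threshold in zip(rows, holm_thresholds(len(rows), alpha)):
        # Once one hypothesis is retained, all later ones are retained as well.
        rejecting = rejecting and p < threshold
        results.append((name, z, threshold, rejecting))
    return results


# Hypothetical average ranks for 4 methods over 20 datasets (ranks sum to 10).
example = {"A": 1.5, "B": 2.0, "C": 3.5, "D": 3.0}
results = holm_vs_control(example, control="A", n_datasets=20)
for name, z, threshold, rejected in results:
    print(f"{name}\t{z:.6f}\t{threshold:.6f}\t{'Rejected' if rejected else 'Not Rejected'}")
```

The step-down rule stops rejecting at the first comparison whose p-value is not below its threshold, so the least significant comparison is judged against the unadjusted α.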

Table 14

Wilcoxon test statistics for RUSBoost vs UnderBagging

Comparison	R⁺	R⁻	Exact P-value	Hypothesis
RUSBoost vs. UnderBagging	142.0	88.5	≥0.2	Not Rejected
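
As an illustration of how the rank sums R⁺ and R⁻ reported in Tables 11 and 14 are formed, the following Python sketch computes them for two methods evaluated on the same datasets. It assumes the classic convention in which zero differences are discarded and tied absolute differences share the mean rank; other conventions split the ranks of zero differences between the two sums, and which one was used for the tables is not stated. The scores and the name `signed_rank_sums` are hypothetical.

```python
from typing import Sequence, Tuple


def signed_rank_sums(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float]:
    """Return (R_plus, R_minus) for paired scores a and b.

    R_plus sums the ranks of datasets where a beats b, R_minus those where b beats a.
    """
    diffs = [x - y for x, y in zip(a, b) if x != y]  # drop zero differences
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of equal absolute differences.
        while j + 1 < len(order) and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = (i + j) / 2 + 1  # mean rank for the tied run
        i = j + 1
    r_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    r_minus = sum(r for d, r in zip(diffs, ranks) if d < 0)
    return r_plus, r_minus


# Hypothetical scores (in percent) of two methods on five datasets.
method_a = [90, 85, 70, 88, 92]
method_b = [85, 87, 70, 80, 91]
r_plus, r_minus = signed_rank_sums(method_a, method_b)
print(r_plus, r_minus)
```

Under this convention the two sums add up to n(n+1)/2 for n non-zero differences; for example, the sums in Table 11 total 66, which equals 11·12/2. The test statistic is the smaller of the two sums, and its exact p-value is read from the Wilcoxon signed-rank distribution.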

6 Conclusion

In this paper, we observe the impact of the CIR by evaluating the performance of 19 ensemble methods on 44 imbalanced datasets with different values of CIR. For performance evaluation, we categorize the datasets as Slightly Imbalance, Moderately Imbalance and Highly Imbalance. We observe that the best-performing methods vary as the degree of imbalance increases. The experimental results show that UnderBagging, SmoteBagging and RUSBoost are the top-performing methods on Slightly, Moderately and Highly imbalanced datasets, respectively. We also observe that the learning of the ensemble methods is significantly affected on Highly imbalanced datasets. To make the results more reliable, we use visual representations and non-parametric statistical tests, namely the F Test, the HP test and the WMPSR Test. The study has a few limitations: issues such as class overlapping and noise disturbance are not considered in this work. In future work, we will study the joint impact of CIR and noise disturbance on the performance of ensemble methods for CIP.

Acknowledgments

The authors express their gratitude to USICT, GGSIPU, New Delhi, India for the opportunity to carry out this research.
