A review of intelligent data analysis: Machine learning approaches for addressing class imbalance in healthcare

Abstract

Intelligent data analysis is rapidly transforming healthcare by improving patient care and predicting health outcomes through machine learning (ML) techniques. These advanced analytical methods allow intelligent healthcare systems to process large amounts of health data, improving diagnosis, treatment, and patient monitoring. The success of these systems is highly dependent on the quality and balance of the data they analyze. Class imbalance, a situation where certain classes dominate the dataset, can significantly affect the accuracy and effectiveness of ML models. In healthcare, it is not only crucial, but urgent, to accurately represent all conditions, including rare diseases, to ensure proper diagnosis and treatment. For this analysis, data were gathered from six reputable academic databases: ScienceDirect, IEEE Xplore, Scopus, Web of Science, Google Scholar, and PubMed. This review offers a comprehensive overview of current approaches to handling class imbalance, including data preprocessing methods such as oversampling, undersampling, and hybrid techniques, and ensemble learning strategies such as bagging and boosting (for example, AdaBoost). It also addresses the limitations of these methods and the ongoing challenges in effectively managing class imbalance in healthcare data. Furthermore, the review explores innovative and promising strategies that have shown success in overcoming class imbalance, with a particular emphasis on fairness, diversity, and ethical considerations, offering a hopeful outlook for the future of healthcare data analysis. The discussion highlights how class imbalance can impact the accuracy and reliability of intelligent healthcare systems, underscoring its significance in improving patient care, healthcare delivery, and the broader medical community.

Keywords

machine learning; class imbalance; imbalanced data; medical data; preprocessing techniques; data quality

1 Introduction

Class imbalance presents challenges across various disciplines, bearing particularly significant implications in the healthcare sector due to its direct impacts on patient outcomes and overall health. Addressing class imbalances in medical datasets necessitates a comprehensive understanding of the ethical, clinical, and public health considerations that differentiate healthcare from other fields such as finance and security. The emergence of intelligent healthcare systems, underpinned by advanced technologies, data analytics, and connectivity, has transformed traditional medical practices. These systems leverage smart devices, infrastructure, Artificial Intelligence (AI), the Internet of Things (IoT), big data analytics, and Machine Learning (ML) to enhance patient care, improve operational efficiency, and drive superior health outcomes. Nonetheless, managing class imbalance within this framework is imperative to ensure that these intelligent healthcare solutions remain reliable and equitable.^1,2 These networks generate vast amounts of data necessitating data classification. In recent years, there has been a notable increase in the exploration of data classification. Imbalanced data collection is commonplace in ML and results in the development of inaccurate classification algorithms. Various methodologies have been employed to classify unbalanced data sets. The majority of traditional classifier learning methods operate under the presumption of a generally balanced class distribution and equal error costs, which constitutes a significant limitation when classifying data characterized by an uneven class distribution.³ Given the potential for bias, it is imperative to exercise stringent caution concerning the quantity, quality, and processing of the data.

Such bias compromises the dependability of the resulting model and raises questions of ethics, fairness, and diversity. As a type of inaccuracy in ML, “data bias” occurs when some portions of a dataset are given excessive weight compared to others.^4,5 It is critical to understand how class imbalances in medical datasets can impact machine learning models, leading to skewed predictions and reduced accuracy due to disproportionate representation of classes, such as gender imbalances and socioeconomic disparities. In addition to complicating the development of accurate healthcare analytics, this issue exacerbates existing inequalities, raising ethical concerns and costs. The development of machine learning applications in healthcare must use sophisticated methodologies to ensure that they accurately reflect and serve diverse patient populations, advancing fair and effective healthcare delivery.^6–8 The existing literature classifies the methods into three areas: data-level, algorithmic, and cost-sensitive. In fraud detection, for example, it is critical to determine how mislabeling the minority class affects the cost of classification, and whether the classifier declares fraudulent transactions as normal.^9,10 Classifiers often overlook the class represented by only a few examples, although this class is essential, and achieve apparently high accuracy by assigning its members to the majority class. Data that might be helpful for training are lost if samples from the minority class are eliminated.

The predominant strategy employed to attain an even distribution of instances across all classes is resampling, that is, undersampling or oversampling. In contrast to algorithmic methodologies, preprocessing procedures restrict the number of classifiers utilized as cost-sensitive approaches. Conversely, problem-dependent methods involve the careful selection and application of classifiers that are specifically designed to confront the distinctive challenges associated with imbalanced data. Data are externally resampled to equalize the instance count within each class. For a dataset to be classified as highly imbalanced, it must contain a substantially greater number of instances from one class relative to another. The class ratio indicates by how much the number of majority-class samples surpasses the number of minority-class samples, and can range from 100:1 to 1000:1. Current research endeavors aim to minimize the impact of unbalanced data on classification algorithms by developing new algorithms or enhancing existing ones. Conventional practice entails the use of ensemble learning or a cost-sensitive learning approach.¹¹ The former enhances the accuracy of the baseline classification algorithm, while the latter increases the penalty for misclassifying minority class members compared to majority class members. The cost-sensitive learning technique utilizes

Undersampling, Oversampling, and Hybrid Sampling. Undersampling is a methodology utilized within the domain of machine learning and data analysis to mitigate the problem of class imbalance within a dataset by reducing the number of majority-class instances. Class imbalance arises when one class (or category/label) in a classification task has a markedly lower number of instances compared to another class. This imbalance can lead to skewed model performance, where the model may excel in predicting the majority class but perform inadequately for the minority class.¹² Through oversampling, the class distribution is rebalanced by augmenting the number of samples in the underrepresented class. Finally, hybrid sampling techniques combine oversampling and undersampling to yield statistically significant results.^13,14 This systematic literature review (SLR) is crucial to deepen the understanding of the complexities associated with class imbalance, biases in favor of dominant data, and overall data integrity in healthcare analytics. Although a substantial body of literature exists on imbalanced datasets, most studies fail to address the challenges and intricacies pertinent to medical data. This SLR endeavors to bridge these gaps by investigating the impact of class imbalances on healthcare classifier performance, contrasting various preprocessing techniques, and identifying optimal strategies to balance medical data and classifiers.
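As a minimal illustration of the class ratio and of the two basic resampling directions described above, the following Python sketch balances a toy dataset by random undersampling and by random oversampling. The function names and the toy data (95 "healthy" records and 5 "rare disease" records) are hypothetical and are not taken from any of the reviewed studies; the methods surveyed later use more careful sample selection than uniform random choice.

```python
import random
from collections import Counter


def imbalance_ratio(labels):
    """Majority-to-minority class size ratio (19.0 means 19:1)."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def _group_by_class(samples, labels):
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    return by_class


def random_undersample(samples, labels, seed=0):
    """Randomly drop samples so every class shrinks to the smallest class size."""
    rng = random.Random(seed)
    by_class = _group_by_class(samples, labels)
    target = min(len(xs) for xs in by_class.values())
    return [(x, y) for y, xs in by_class.items() for x in rng.sample(xs, target)]


def random_oversample(samples, labels, seed=0):
    """Randomly duplicate samples so every class grows to the largest class size."""
    rng = random.Random(seed)
    by_class = _group_by_class(samples, labels)
    target = max(len(xs) for xs in by_class.values())
    balanced = []
    for y, xs in by_class.items():
        extra = [rng.choice(xs) for _ in range(target - len(xs))]
        balanced.extend((x, y) for x in xs + extra)
    return balanced


# Hypothetical toy data: record ids 0..99, the last five are the minority class.
samples = list(range(100))
labels = [0] * 95 + [1] * 5

under = random_undersample(samples, labels)
over = random_oversample(samples, labels)
print(imbalance_ratio(labels))
print(Counter(y for _, y in under), Counter(y for _, y in over))
```

Undersampling here discards 90 of the 95 majority records, while oversampling repeats each minority record many times, which illustrates the information loss and overfitting risks that motivate the hybrid techniques.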

Moreover, this SLR addresses data preprocessing methods, classification algorithms, model evaluation, and challenges with future prospects. Initially, we present a comprehensive research methodology for conducting this SLR, followed by a taxonomy of the established areas of unbalanced learning application, and subsequently, we examine the applications within each category. Ultimately, the reviewed publications are synthesized to furnish new avenues for the investigation of unbalanced learning challenges and rare event identification. Machine Learning algorithms are explored to tackle classification challenges in unbalanced medical datasets, encompassing preprocessing techniques, joint learning strategies, algorithmic approaches, and cost sensitivity. Various preprocessing methods are compared to address classification imbalances, thereby achieving optimal balance and classification of healthcare data. Additionally, the SLR classifies and scrutinizes current unbalanced learning applications and proposes new methodologies for the identification of uncommon events and unbalanced learning. As part of the objectives, a research methodology for the SLR will be developed, encompassing the analysis of data-level, algorithmic, and cost-sensitive techniques, along with an overview of imbalanced learning applications.

The remainder of the paper is structured as follows: Section 2 outlines the research methodology for this literature review. Section 3 examines the taxonomy of the literature concerning imbalanced data. Section 4 delves into data-level algorithms, exploring both internal and external data-level algorithms. Section 5 discusses the evaluation metrics for imbalanced datasets, while Section 6 addresses the extant challenges and opportunities.

2 Research methodology

An SLR is a type of review that follows a defined sequence of steps to reduce research bias. This SLR is based on existing literature on ML algorithms that address classification problems in imbalanced medical datasets. Figure 1 shows the research methodology framework used to carry out this SLR and report the results. This SLR adopted a three-step review process comprising planning, conducting, and documenting.

Figure 1.

Research methodology framework.

The steps of the research methodology comprise planning, conducting, and documenting the SLR. In the planning phase, the SLR begins by identifying the existing literature on ML algorithms to overcome classification problems in imbalanced medical datasets. First, research questions are formulated to extract relevant studies from the literature, such as:

What are the different ML algorithms for healthcare data analysis?

What are the possible solutions to overcome the problem of an imbalanced data set?

What are the current solutions to solve the problems of the unbalanced data set?

The data for this analysis were meticulously gathered from six esteemed academic databases, namely ScienceDirect, IEEE Xplore, Scopus, Web of Science, Google Scholar, and PubMed. Following the identification of a need for a systematic literature review (SLR), the research questions were employed to explicitly refine and direct the review process. An analysis of these databases facilitated the retrieval of approximately 1200 peer-reviewed articles, originating from an initial collection of approximately 500 literature sources. To uphold the quality of the review, stringent inclusion and exclusion criteria were enforced. Only studies published in scientific peer-reviewed journals were considered, whereas studies from alternative fields were excluded.

Throughout the review phase, each article was assessed against quality evaluation criteria, as delineated in Section 3. In conclusion, 152 works were earmarked for an in-depth examination in accordance with these criteria. These selected articles were systematically analyzed and documented to substantiate the findings and conclusions presented in our evaluation.

2.1 Data quality requirements

The data quality requirements for imbalanced data are vital in many domains, including artificial intelligence and machine learning. Classification tasks become more difficult in imbalanced data sets when minority classes are underrepresented. Researchers have investigated many methods to improve the performance of classification models when faced with unbalanced data in an effort to resolve this issue. General classification performance can be improved with a balanced dataset as compared to an imbalanced one, according to studies.¹⁵ However, the traits of imbalanced datasets and the number of classifiers required to successfully manage them while preserving individual quality and complementarity must be carefully considered.¹⁶ Several ensemble learning-based methods have been suggested for dealing with imbalanced datasets; they include EasyEnsemble, SMOTEBagging, Balanced Random Forest, and SMOTEBoost.¹⁷ Ensemble models and intelligent algorithms for base classification are necessary to improve the precision of classification data in many domains, including diagnosis and treatments for prostate cancer.¹⁸ Investigating deep neural networks to deal with extremely imbalanced data in bioinformatics has brought attention to the importance of complex procedures in the resolution of data quality issues.¹⁹

Inconsistencies in real-world data pose significant challenges for statistical analysis. Imbalances are common in medical datasets, making data analysis solutions to identify healthy individuals more expensive to develop. Class imbalance is a growing concern in data mining, particularly when constructing analysis tools from nonuniform medical datasets. This problem increases the cost of misdiagnosing healthy individuals as ill. To improve clinical trial data analysis and quality, a quality assurance system has been implemented in medical records. A user-centered data quality theory has also been introduced, grounded in well-defined concepts and three domain-specific language categories. An ensemble classification method is proposed that offers implicit regularization to mitigate overfitting issues, particularly when handling binary imbalanced data. This approach involves creating two additional virtual spaces alongside the original data set for use with the Support Vector Machine (SVM) classifier.²⁰ Additionally, fuzzy concepts are incorporated to expedite search times, and a four-step approach is presented: component analysis, feature selection, SVM-based minority classification, and sampling. To address the challenge of data asymmetry, random sampling is a commonly used technique. However, it has limitations, highlighting the need for a reliable genetic algorithm-based approach to determine sample ratios. To evaluate effectiveness, 14 data sets were used and a novel weighting method was introduced to improve the performance of less capable classifiers, using an improved AdaBoost algorithm.^21,22 Another model handles imbalanced data by considering the Hellinger distance between samples. Additionally, it addresses feature drift issues, aiding in the identification of relevant features for spam detection, thereby improving spam detection capabilities.^23,24
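The Hellinger distance mentioned above can be sketched as follows for two discrete distributions. This is a generic illustration of the standard definition, not the model of the cited studies; the helper `normalise` and the example histograms (counts of a binned feature for a large and a small class) are assumptions made for the example.

```python
import math


def normalise(counts):
    """Turn raw bin counts into a probability distribution."""
    total = sum(counts)
    return [c / total for c in counts]


def hellinger(p, q):
    """Hellinger distance between two discrete distributions.

    Returns 0 for identical distributions and 1 for distributions
    with no overlapping support.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of bins")
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared / 2.0)


# Hypothetical binned feature: 1000 majority samples versus 10 minority samples.
majority_hist = normalise([700, 200, 100])
minority_hist = normalise([1, 2, 7])
print(hellinger(majority_hist, minority_hist))
```

Because each class histogram is normalised separately, the distance depends on the shape of the two distributions rather than on how many samples each class contains, which is one reason the measure is attractive for skewed data.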

A range of oversampling methods have been employed to equate the number of instances in the minority class with those in the majority class. Nonetheless, the introduction of synthetic examples through oversampling may contribute noise to the dataset. Another study²⁵ indicates that synthetic instances can modify the decision boundaries of classifiers. This investigation underscores the advantages of integrating data-level and ensemble methodologies to alleviate the risks tied to data resampling, which might lead to the irreversible loss of critical instances. Classifier performance was assessed using eight distinct datasets, subsequent to the application of specific preprocessing techniques.^26,27 Fuzzy methodologies were implemented involving two families of classifiers, one operating at the bag level and the other at the individual instance level, to tackle the issue of imbalance.²⁸ Furthermore, a cost-sensitive classification method was proposed, which mitigates classification errors by inducing a high bias.²⁹

2.2 Visualization techniques for imbalanced data

To address the challenges associated with imbalanced data, a variety of visualization techniques have been proposed. Researchers have conducted a thorough review that investigates progress in the field of imbalanced data learning, encompassing present issues, characteristics, emerging technologies, and the metrics employed to assess learning effectiveness. Furthermore, it acts as a catalyst for prospective research by emphasizing potential opportunities, challenges, and directions within imbalanced data learning.¹⁵ The review delves into various aspects of imbalanced data learning, including data streams, classification, clustering, and regression applied in practical contexts. The study systematically categorizes open challenges inherent in imbalanced learning, which include binary imbalanced data, multi-label learning, imbalances in big data, as well as unsupervised and semi-supervised learning.¹⁶

In,³⁰ an innovative data visualization mechanism for the healthcare sector is introduced, engineered to automate the collection of data from medical documents. This solution utilizes a variety of data processing and visualization tools to compile information that supports decision-making. Additionally, it encompasses the creation of a comprehensive medical information dashboard, with empirical results attesting to its effectiveness in visualizing healthcare knowledge. In,³¹ a detailed taxonomy of existing applications in the realm of imbalanced learning, notably within data mining and machine learning, is presented. This research consolidates prior reviews, brings forth novel insights, and offers recommendations for future research directions. Addressing the pivotal challenge of evaluating models on imbalanced datasets, the study³² describes an evaluation model that combines several measures, including G-mean, F-measure, Likelihood Ratio, and Youden Index, alongside a variety of AUC-based metrics such as AUC, Partial AUC, Weighted AUC, Cumulative AUC, and AUL. This holistic approach improves the assessment of churn prediction models, a predominant challenge in data mining. Furthermore, in,³³ the authors point out that the evaluation of imbalanced data has received less attention than that of balanced data. They propose the application of Deep Neural Networks (DNN) for the classification of imbalanced data, employing Mean False Error (MFE) and Mean Squared False Error (MSFE) as loss functions for training DNN models.
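To make some of the evaluation measures named above concrete, the sketch below computes accuracy, G-mean, F-measure, and the Youden Index from the four cells of a binary confusion matrix, using their standard textbook definitions. The function name and the example counts (10 positive cases among 1000 records) are hypothetical and do not come from the cited studies.

```python
import math


def _ratio(numerator, denominator):
    """Safe division that returns 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def imbalance_metrics(tp, fn, fp, tn):
    """Common evaluation measures for a binary, imbalanced problem.

    tp/fn count minority (positive) cases, fp/tn count majority (negative) cases.
    """
    sensitivity = _ratio(tp, tp + fn)   # recall on the minority class
    specificity = _ratio(tn, tn + fp)   # recall on the majority class
    precision = _ratio(tp, tp + fp)
    return {
        "accuracy": _ratio(tp + tn, tp + fn + fp + tn),
        "g_mean": math.sqrt(sensitivity * specificity),
        "f_measure": _ratio(2 * precision * sensitivity, precision + sensitivity),
        "youden_index": sensitivity + specificity - 1.0,
    }


# Hypothetical classifier that finds only half of the 10 rare cases.
scores = imbalance_metrics(tp=5, fn=5, fp=10, tn=980)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

In this example the accuracy is 0.985 even though half of the minority cases are missed, whereas the F-measure is 0.4, which shows why accuracy alone is a misleading summary on imbalanced data.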

3 Taxonomy of literature on imbalanced data

Over the years, numerous strategies have been implemented and evaluated to address the issue of imbalanced data. This section examines various data preprocessing techniques. Integrating these with other methodologies may tackle class imbalance problems in multiple ways, and the conclusions derived from the reported studies provide grounds for optimism. These techniques can modify the ensemble learning method without altering the underlying classifier. Where alterations have been made to the ensemble learning algorithm itself, initial data preparation is still required before the learning stage. Other combinations are also possible. Figure 2 depicts the taxonomy of literature studies concerning imbalanced data approaches.

Figure 2.

Taxonomy of literature studies for imbalanced data solution methods.

3.1 Preprocessing techniques

Machine learning classifiers face significant challenges in addressing outliers and imbalanced datasets. As cited in,³⁴ a selective data preparation strategy was proposed, utilizing an artificially constructed subset that modules can integrate to address outlier instances. To enhance the diversity within the training data, this strategy employed the Synthetic Minority Oversampling Technique (SMOTE), applied subsequent to the oversampling of anomalies, irrespective of class. The objective is to lessen the influence of outliers on the training dataset. Findings indicate that selective oversampling provides an improved classification method for SMOTE, as illustrated in Figure 2. Three principal techniques exist for managing imbalanced data: undersampling, oversampling, and hybrid methodologies, each intended to equilibrate the class distribution within the dataset.

Generative models such as Variational Autoencoders (VAEs) and Generative Adversarial Networks (GANs) have gained significant traction in the management of imbalanced datasets. In the context of tasks such as fault diagnosis and anomaly detection, GANs have shown considerable effectiveness in generating synthetic data that closely resembles actual data.^35,36 Conversely, VAEs are recognized as powerful instruments for unsupervised learning, facilitating the representation of intricate high-dimensional data within a low-dimensional latent space. Recent scholarly investigations have explored the integration of GANs and VAEs to enhance generative model performance. Scholars have endeavored to optimize both generative and reconstructive capabilities by leveraging the strengths inherent in both models.³⁷

3.1.1 Under-sampling

The principal aim of undersampling methodologies is to enhance classification accuracy for the minority class by reducing the size of the majority class. Randomly selecting and removing samples from the majority class is the most straightforward undersampling technique. However, classifier performance may be adversely affected when random samples are eliminated from the majority class, as this can discard crucial information. Researchers have therefore devised algorithms that carefully select for removal those majority-class samples that carry little pertinent information.³⁴ The stacked undersampling ensemble is another approach employed to mitigate class imbalance via undersampling. This technique uses a stacked ensemble as the undersampling strategy, taking class imbalance, class overlap, and class noise into account. Rather than implementing undersampling solely as a preprocessing step, the method integrates the undersampling process into the stacked ensemble itself.

This stacked ensemble design is intended to counter the performance degradation caused by class imbalance, class overlap, and class noise. A novel undersampling procedure has also been proposed that uses a three-stage framework comprising denoising, fuzzy K-Means clustering, and representative sample selection.³⁵ The denoising phase precedes the clustering-based undersampling strategy to enhance the system's resilience to noisy data.

To select representatives of the majority class, random sampling is used.³⁶ Clustering-based undersampling is used in predictive modeling of healthcare-associated infections to undersample only certain regions of the variable space, thereby reducing the likelihood of overfitting and the severity of the negative consequences of undersampling. A clustering method is applied in an n-dimensional space, where n is the number of attributes other than the class attribute, to generate these regions.³⁷

An approach for the predictive diagnosis of diseases from imbalanced datasets has also been suggested. To remedy the sampling discrepancy, it uses an undersampling technique based on the overlap between the two classes, which results in more representative data for the underrepresented minority in the overlapping areas. This is achieved by identifying and excluding negative-class samples from the overlap area, increasing class separability in the data space.³⁸ Another study implemented a pair of undersampling techniques, the first based on cluster centers and the second on nearest neighbors, to balance the classes. Five different classification methods were used to evaluate the performance of the new method on various datasets. The experimental findings³⁹ showed that these methods are more accurate when sampling techniques are compared on imbalanced medical data. A further study examined the impact of imbalance strategies on the diagnosis of patients with lung cancer using three classifiers (logistic regression, random forest, and linear SVC), and compared 10 undersampling techniques, seven oversampling techniques, and two integrated sampling techniques. The findings favor oversampling techniques over undersampling and hybrid approaches.⁴⁰
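The pair of clustering-based undersampling strategies mentioned above can be sketched roughly as follows. This is an illustrative reading rather than the cited authors' implementation: it assumes plain k-means with as many clusters as majority samples to keep, and the function names, the `use_neighbours` switch, and the toy grid of majority points are all hypothetical.

```python
import math
import random


def kmeans_centers(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns k cluster centres as tuples."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initialise from k distinct samples
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[nearest].append(p)
        # Recompute each centre as its cluster mean (keep the old centre if empty).
        centers = [
            tuple(sum(axis) / len(cl) for axis in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
    return centers


def cluster_undersample(majority, n_keep, use_neighbours=True, seed=0):
    """Reduce the majority class to n_keep representatives.

    use_neighbours=False: return the cluster centres themselves.
    use_neighbours=True: return the real sample nearest to each centre.
    """
    centers = kmeans_centers(majority, n_keep, seed=seed)
    if not use_neighbours:
        return centers
    remaining = list(majority)
    kept = []
    for c in centers:
        best = min(remaining, key=lambda p: math.dist(p, c))
        kept.append(best)
        remaining.remove(best)  # never pick the same sample twice
    return kept


# Hypothetical data: 20 majority points on a grid, 4 minority points.
majority = [(float(x), float(y)) for x in range(5) for y in range(4)]
minority = [(10.0, 10.0), (10.5, 10.0), (10.0, 10.5), (10.5, 10.5)]
kept = cluster_undersample(majority, n_keep=len(minority))
print(kept)
```

Using cluster centres yields synthetic representatives, whereas using their nearest neighbours keeps only real records, which may matter when every retained sample must correspond to an actual patient.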

3.1.2 Oversampling

Oversampling strategies, in contrast to undersampling ones, focus on increasing the number of samples from the minority class. To improve the classification performance on minority samples, SMOTE generates additional minority samples by linear interpolation between a minority sample and randomly chosen nearby neighbors of the same class.^41,42 However, oversampling techniques have drawbacks, such as the introduction of overlap, blurred boundaries, noise, and other artifacts into the samples. A set of SMOTE algorithm improvements has been presented to address these issues. Most recently, the misclassification-oriented synthetic minority oversampling technique (M-SMOTE) has been used in conjunction with Edited Nearest Neighbor (ENN) cleaning based on Random Forest (RF). M-SMOTE is used to generate more samples for the underrepresented class, and its oversampling rate is calculated from the RF misclassification rate.
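The linear interpolation at the core of SMOTE can be sketched in a few lines. The version below is a simplified illustration with assumed names and toy data: each synthetic point is placed at a random position on the segment between a minority sample and one of its k nearest minority neighbours. Library implementations add further details that are omitted here.

```python
import math
import random


def smote(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic minority samples by linear interpolation."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # k nearest neighbours of the base sample within the minority class.
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Hypothetical minority class with four two-feature samples.
minority = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (2.0, 2.0)]
new_points = smote(minority, n_new=6)
print(new_points)
```

Every synthetic point lies between two existing minority samples, which explains both the strength of the method (new points stay inside the minority region) and the drawback noted above (points interpolated near the class boundary can increase overlap with the majority class).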

In,^43,44 two oversampling methods are proposed that aim to preserve the distributional characteristics of the original data. The first, Adaptive-SMOTE, selectively picks groups of inner and danger data from the minority class and synthesizes new minority samples from the selected data. Here, danger data refers to minority samples that lie close to the class boundary, where they are easily misclassified, whereas inner data lie well inside the minority region. The second approach, known as Gaussian oversampling, combines dimensionality reduction with the Gaussian distribution⁴⁵ to produce a flatter distribution with a narrower tail. Synthetic minority oversampling is used to create minority class instances and rebalance medical datasets. This method is improved by applying the orchard approach. In,⁴⁶ an inter-class and intra-cluster distribution-aware, distributed, fuzzy-based adaptive synthetic oversampling technique is proposed. Its components comprise fuzzy c-means clustering, a weighted distribution, and a mixed synthetic approach. A synthetic minority oversampling approach based on the definition of natural neighbors is presented in.⁴⁷

3.1.3 Hybrid sampling

Hybrid sampling balances a dataset by increasing the number of minority samples while reducing the number of majority samples. This mitigates the common problems of oversampling, such as overfitting, and of undersampling, which can discard important data. By combining the two approaches, hybrid sampling provides a more reliable way to handle imbalanced data.^44,48 In health record analysis, for instance, there is often a severe imbalance with only a few minority samples. To deal with this, an algorithm called Husdos-boost is used, which combines boosting with a distribution-based oversampling technique. This method also uses undersampling to remove duplicate majority samples and creates new minority samples based on the actual distribution of that class.⁴⁹ A common technique in this scenario is M-SMOTE, where more samples are generated for the underrepresented class.

In this process, many samples are then removed in the Edited Nearest Neighbor (ENN) undersampling step. Classification prediction is subsequently performed on the hybrid-sampled data, with the stopping threshold for the iterations established according to changes in the classification index.⁵⁰ In another study,⁵¹ a new hybrid method combining synthetic minority oversampling with ENN was used to clean and filter the data. This approach effectively addresses imbalanced data by leveraging the strengths of both over- and undersampling. The K-Means method was applied to find representative samples from both the majority and minority classes, with a local search making the process more efficient. The distribution of data across classes is crucial for accurate classification.
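The ENN cleaning step used in these hybrid methods can be illustrated with the sketch below, which follows the usual definition of Edited Nearest Neighbours: a sample is removed when its label disagrees with the majority vote of its k nearest neighbours. The function name, the optional `clean_class` argument, and the toy data are assumptions for illustration, not details of the cited studies.

```python
import math
from collections import Counter


def edited_nearest_neighbours(points, labels, k=3, clean_class=None):
    """Drop samples whose label differs from the vote of their k nearest neighbours.

    If clean_class is given, only samples of that class are candidates for removal.
    """
    keep = []
    for i, (p, y) in enumerate(zip(points, labels)):
        if clean_class is not None and y != clean_class:
            keep.append(i)
            continue
        neighbours = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )[:k]
        vote = Counter(labels[j] for j in neighbours).most_common(1)[0][0]
        if vote == y:
            keep.append(i)
    return [points[i] for i in keep], [labels[i] for i in keep]


# Hypothetical data: a majority cluster, a minority cluster, and one
# majority-labelled sample sitting inside the minority cluster.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
          (5.0, 5.0), (5.0, 6.0), (6.0, 5.0),
          (5.5, 5.5)]
labels = [0, 0, 0, 0, 1, 1, 1, 0]

clean_points, clean_labels = edited_nearest_neighbours(points, labels, k=3)
print(clean_points, clean_labels)
```

In a hybrid pipeline this filter would typically be applied after oversampling, so that both noisy original samples and poorly placed synthetic samples near the class boundary are removed.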

However, in imbalanced datasets, the minority class is often poorly represented, making it challenging for classifiers to perform well, especially when there is class overlap or a limited number of samples. Class imbalance is a persistent issue in data mining, particularly in real-world classification tasks. There are several strategies to tackle class imbalance, generally falling into three main categories. Figure 3 shows the application of the SMOTE technique to improving the accuracy of diabetes prediction. It illustrates how hybrid sampling techniques are applied in practice, effectively balancing data and improving prediction outcomes.^52–54

Figure 3.

Improving diabetes prediction accuracy by smoothing out noisy training data using the interquartile range and the SMOTE technique.⁵⁵

3.2 Existing algorithms

One way to deal with unbalanced data is to develop new algorithms or improve existing ones. This section discusses the different types of existing algorithms.

3.2.1 Variants of SVM

Due to the prevalence of class imbalance in DNA microarray data, the predictive ability for minority classes is significantly compromised, often leading to the neglect of the minority group in model predictions. SVM is highlighted in this section because of its ability to handle complex, high-dimensional, and imbalanced datasets, especially in areas such as DNA microarray data. Its robustness and adaptability, through various modifications and hybrid models, make it a powerful tool for improving prediction accuracy, particularly in challenging scenarios.^56,57 To address the imbalance issue, a novel undersampling strategy based on the concept of Ant Colony Optimization (ACO) has been developed. SVM is a commonly used classification method for imbalanced data.^58–60

The Megatrend Diffusion (MTD) strategy may also be used to increase minority group representation. In the prediction phase, MTD can be combined with ML methods such as SVM and KNN to form the hybrid prediction models MTD-KNN and MTD-SVM. The researchers also considered cost^61,62 and reported that MTD-SVM is superior to the alternatives of RF, Naive Bayes, and KNN. A separate study proposed a prediction algorithm that identifies people at high risk of type 2 diabetes. The investigation has two parts. First, data imputation methods, namely median value imputation, KNN imputation, and iterative imputation, are compared in terms of their performance on the missing data problem. Second, several classification techniques (including linear, tree-based, and ensemble algorithms) are used to verify the effects of these imputations on classification precision. An artificial neural network is then used to model the highest-quality imputed data, while SMOTE-Tomek ensures that all classes are fairly represented. This method achieves an accuracy of 98% on the test data, higher than any other method tested on the same data. Population and gender are the main themes in the data set.⁶²
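The two simpler imputation strategies named above can be illustrated with a minimal sketch. It is not the cited study's implementation: the function names, the toy data, and the choice to measure distance only over columns observed in both rows are assumptions made for illustration.

```python
import math
from statistics import median


def median_impute(rows):
    """Replace each None with the median of the observed values in its column.

    Assumes every column has at least one observed value.
    """
    n_cols = len(rows[0])
    medians = [median(r[j] for r in rows if r[j] is not None) for j in range(n_cols)]
    return [[medians[j] if v is None else v for j, v in enumerate(r)] for r in rows]


def knn_impute(rows, k=2):
    """Replace each None with the mean of that column over the k nearest donor rows.

    Distance is computed only over columns observed in both rows (an assumption).
    """
    def dist(a, b):
        shared = [(x - y) ** 2 for x, y in zip(a, b) if x is not None and y is not None]
        return math.sqrt(sum(shared) / len(shared)) if shared else math.inf

    out = []
    for i, r in enumerate(rows):
        filled = list(r)
        for j, v in enumerate(r):
            if v is None:
                # Donors are the other rows that have column j observed.
                donors = [(dist(r, o), o[j]) for m, o in enumerate(rows)
                          if m != i and o[j] is not None]
                donors.sort(key=lambda t: t[0])
                nearest = [val for _, val in donors[:k]]
                filled[j] = sum(nearest) / len(nearest)
        out.append(filled)
    return out


# Hypothetical toy data: the second row is missing its second feature.
data = [[1.0, 10.0], [2.0, None], [3.0, 30.0], [10.0, 100.0]]
print(median_impute(data)[1])    # the gap is filled with the column median
print(knn_impute(data, k=2)[1])  # the gap is filled from the two closest rows
```

On this toy data the two strategies fill the same gap with different values, which is why the downstream classifiers are needed to judge which imputation serves the task better.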

When a dataset is skewed toward the majority class, the kernel scaling approach can be used to improve SVM. PSS-SVM, which combines Parallel Selective Sampling (PSS) with the SVM, demonstrated impressive results on benchmark datasets, reportedly much better than those of ordinary SVM, a gap attributed to the latter's lack of convergence. It can reduce the imbalance in large data sets by selecting data from the dominant class.^63,64 A mixed ensemble of SVMs (EnSVM) applies under- and oversampling tactics to improve prediction performance. Extensive testing has shown that this strategy is superior to standalone SVM and several alternative classifiers. The research compared the performance of the basic EnSVM model with the selective ensemble EnSVM+ under various resampling strategies.⁶⁴ Finally, using SVM training as a preprocessor improved the results of machine learning algorithms including Multilayer Perceptron (MLP), RF, and Logistic Regression (LR).

The balancing approach is implemented in two phases. In the first stage, the SVM transforms the imbalanced data into better-balanced data; in the second stage, MLP, RF, and LR use these enhanced data as input.^65–67 Table 1 presents the advantages and disadvantages of SVM and its variations. Each method, such as Megatrend Diffusion SVM (MTD-SVM), Ant Colony Optimization SVM (ACO-SVM), and Ensemble SVM with Resampling (EnSVM-R), offers benefits tailored to specific imbalanced data challenges, particularly in healthcare. For instance, MTD-SVM effectively improves minority class representation but performs well only on smaller datasets, whereas EnSVM-R enhances predictive accuracy but requires careful parameter tuning. This comparison clarifies why specific algorithms may be preferred depending on the dataset's characteristics and the available computational resources.

Table 1.
A summary of various SVM algorithm approaches.

Algorithm	Advantages	Disadvantages
Ant Colony Optimization SVM (ACO-SVM)^68,69	- Effectively addresses unbalanced data classification problems. - Utilizes ACO-based sample selection for optimization.	- High computational and storage overhead.
Megatrend Diffusion SVM (MTD-SVM) ⁶⁵	- Enhances the representation of minority class samples through synthetic data generation.	- Synthetic data generation is resource-intensive. - Performs optimally only on small datasets.
Kernel-Transformed SVM with Adjusted F-Measure ²⁰	- Balances cost functions and addresses imbalance using kernel scaling.	- Requires efficient parameter and kernel estimation strategies.
Parallel Selective Sampling SVM (PSS-SVM) ⁶⁴	- Offers precise statistical predictions with minimal computational complexity.	- Necessitates parallel and distributed computing capabilities.
Ensemble SVM with Resampling (EnSVM-R) and EnSVM + -R ⁶³	- Demonstrates superior performance compared to standard SVM approaches.	- Lacks automatic mechanisms for determining the optimal value of k.
Intelligent SVM Preprocessor for MLP, LR, and RF (ISPM) ⁷⁰	- Balances the dataset effectively while increasing minority class representation.	- Computational intensity may rise with larger datasets.
Second-Order Cone Programming SVM (SOCP-SVM) ⁷¹	- Robust classification performance due to SVM-LP formulation.	- May require substantial computational resources for implementation.
Near Bayesian SVM (NBSVM) ³⁴	- Minimizes the misclassification cost for minority class samples.	- Trade-offs between Bayesian adjustments and computational speed.

3.2.1 Clustering

Clustering groups similar data points, while outlier detection identifies points that do not fit any group. The similarity-based hierarchical decomposition method builds on both clustering algorithms and outlier detection.⁷² Hierarchy construction has two parts: the first addresses misclassification when classifying clusters, while the second addresses correct classification.⁷³ Data similarities of labeled subgroups at every level are used to generate the hierarchy and the feature sets for each level. This method can prevent issues such as class overlap and imbalance between groups.⁷⁴ Clustering-based undersampling has also been proposed for data with an unequal class distribution.⁷⁵ Fuzzy Rule-Based Classification Systems (FRBCSs) are also used to improve classification accuracy. Combined with 2-tuple genetic tuning, which increases the efficiency of FRBCSs, this approach is useful for imbalanced data because it can handle both low and high imbalance ratios. Clustering-based oversampling is a data-level resampling strategy for improving learning from class-imbalanced datasets. Its underlying idea is that the number of new sample points to generate for a minority class sample can be inferred from the distance between that sample and the centroid of its cluster.^63,76
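The centroid-distance idea behind clustering-based oversampling can be sketched as follows. The text states only that the number of new points is inferred from the distance to the cluster centroid; allocating the budget in direct proportion to that distance, along with the function name and the toy cluster, is an assumption made here for illustration.

```python
import math


def allocate_synthetic(cluster, total_new):
    """Split a budget of synthetic samples across the points of one minority cluster.

    Assumption: a point's share is proportional to its distance from the centroid.
    """
    dim = len(cluster[0])
    centroid = tuple(sum(p[d] for p in cluster) / len(cluster) for d in range(dim))
    dists = [math.dist(p, centroid) for p in cluster]
    total = sum(dists)
    if total == 0:
        # All points coincide with the centroid: split the budget evenly.
        shares = [total_new / len(cluster)] * len(cluster)
    else:
        shares = [total_new * d / total for d in dists]
    counts = [math.floor(s) for s in shares]
    # Hand any remaining samples to the points with the largest fractional share.
    order = sorted(range(len(cluster)), key=lambda i: shares[i] - counts[i], reverse=True)
    for i in order[: total_new - sum(counts)]:
        counts[i] += 1
    return counts


# Hypothetical minority cluster on a line; its centroid is (4.0, 0.0).
cluster = [(0.0, 0.0), (2.0, 0.0), (4.0, 0.0), (10.0, 0.0)]
print(allocate_synthetic(cluster, 6))
```

Under this assumption, the point sitting on the centroid receives no synthetic neighbors, while the most peripheral point receives the most.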

3.2.2 Feature selection methods

For high-dimensional data, feature selection is frequently the preliminary step for many machine learning algorithms, and it is particularly beneficial when the data are also heterogeneous.⁷⁷ Applications such as bioinformatics, text mining, and image classification are affected by both high dimensionality and class imbalance. Based on experiments across various domains with a Random Forest classifier, a hybrid approach has been advocated that integrates feature selection techniques with strategies for addressing class imbalance. This hybrid methodology handles datasets exhibiting both characteristics and performs better than either technique used alone,⁷⁸ as illustrated in Figure 4. For imbalanced datasets, existing feature selection methods have been shown to be insufficient. One proposed paradigm for feature selection distinguishes between majority and minority characteristics; by assessing the data from both the majority and the minority perspective, existing criteria can be adapted accordingly.

Figure 4.

Hybrid approaches for combining feature selection with imbalance learning.⁷⁸

Another method, proposed in,⁷⁹ is effective on several high-dimensional, imbalanced, small-sample datasets. The authors of⁸⁰ proposed a decomposition-based technique and a Hellinger distance-based approach for feature selection on imbalanced datasets. One method addresses the conflict caused by the unequal class distribution by calculating the distributional gaps; the other partitions large classes into arbitrary subclasses by using a wide range of classification algorithms. Recent work has also investigated the efficacy of hybrid learning strategies that combine feature selection, to lower the data dimensionality, with methods that counter the negative impact of class imbalance (in particular, data balancing and cost-sensitive techniques). Extensive experiments were conducted on data sets from many domains using RF, a popular classifier that is effective in high-dimensional spaces and has been applied to imbalanced problems. The findings demonstrated the advantages of this combined strategy over feature selection or imbalance learning alone.⁸¹
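A Hellinger distance-based feature ranking can be sketched as below. This is an illustrative sketch, not the method of the cited work: the equal-width binning, the number of bins, the function names, and the toy columns are all assumptions. The score compares the class-conditional distributions of a feature, so it depends on the proportions within each class and not on the class sizes.

```python
import math


def hellinger_score(values, labels, bins=4):
    """Hellinger distance between the class-conditional histograms of one feature.

    Labels are 0 (majority) and 1 (minority); equal-width binning is an assumption.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # guard against a constant feature

    def histogram(cls):
        members = [v for v, y in zip(values, labels) if y == cls]
        counts = [0] * bins
        for v in members:
            counts[min(int((v - lo) / width), bins - 1)] += 1
        # Normalise within the class so that class size does not matter.
        return [c / len(members) for c in counts]

    p, q = histogram(1), histogram(0)
    return math.sqrt(sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)))


def rank_features(columns, labels, bins=4):
    """Return feature names sorted from most to least class-separating."""
    scores = {name: hellinger_score(vals, labels, bins) for name, vals in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical data: 8 majority samples, 2 minority samples.
labels = [0] * 8 + [1] * 2
columns = {
    "noise": [1, 9, 2, 10, 1, 9, 2, 10, 2, 9],
    "informative": [1, 2, 1, 2, 1, 2, 1, 2, 9, 10],
}
print(rank_features(columns, labels))
```

In this form the score ranges from 0 (identical class-conditional distributions) to the square root of 2 (no overlap at all).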

3.2.3 One-class learning

Traditional learning approaches are hampered by class imbalance, which reduces performance and produces inaccurate results. Class imbalance arises when one class is disproportionately represented relative to the others in the data set. It poses a significant problem because the cost of misclassifying minority instances is high. Overlapping instances and data inconsistencies can further degrade learning severely.⁸² In one-class learning, the algorithm is trained only on the samples that belong to the target class, so it can be used to classify imbalanced data. For example, only minority class samples are used for training, and all other classes are excluded.
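The idea of training on a single class can be illustrated with a deliberately simple sketch. The centroid-plus-radius model below is an assumption chosen for brevity; it is not one of the one-class methods reviewed here, and the class name, the quantile parameter, and the toy samples are hypothetical.

```python
import math


class CentroidOneClass:
    """Toy one-class model: accept a point if it lies close to the target-class centroid."""

    def fit(self, target_samples, quantile=0.95):
        dim = len(target_samples[0])
        n = len(target_samples)
        self.center = tuple(sum(s[d] for s in target_samples) / n for d in range(dim))
        # The radius is an empirical quantile of the training distances.
        dists = sorted(math.dist(s, self.center) for s in target_samples)
        self.radius = dists[min(int(quantile * n), n - 1)]
        return self

    def predict(self, x):
        """Return True if x is accepted as a member of the target class."""
        return math.dist(x, self.center) <= self.radius


# Only minority (target) samples are used for training; no majority data is needed.
minority = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (2.0, 2.0)]
model = CentroidOneClass().fit(minority)
print(model.predict((1.5, 1.6)), model.predict((5.0, 5.0)))
```

Because the model never sees the other classes, its behavior is unaffected by how many majority samples exist, which is the property that makes one-class learning attractive for heavily skewed data.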

One-class learning has a clear advantage on large and imbalanced datasets. A related rule-induction system, Ripper, applies the divide-and-conquer principle to produce rules iteratively.⁸³ Ripper has been used for this purpose in earlier work. Rules are generated for each class, from the rarest to the most common, with more conditions being added to each rule over time. The strength of Ripper is reported to lie in its ability to learn rules for minority classes. Highly skewed datasets with noisy features and a high-dimensional feature space can benefit greatly from one-class learning. In most cases, the benefits of one-class learning outweigh the additional costs of its implementation.⁸⁴ Figure 5 shows the framework for the prediction of type 2 diabetes used in.⁶²

Figure 5.

Framework for the prediction of type 2 diabetes.⁶²

Figure 6.

HyperSMURF approach (Polikar, 2006⁸⁵).

3.3 Ensemble learning algorithms

The quality of the base learners is crucial to the success of ensemble approaches. It is well known that ensembles generally improve the accuracy and robustness of a learning machine's predictions.⁸⁶ Ensemble classifiers aim to improve on a single classifier by training multiple classifiers and combining their outputs into one classifier that performs better than any individual member. Ensemble-based techniques for imbalanced data therefore combine ensemble learning algorithms with other approaches, such as data-level methods or cost-sensitive solutions. Algorithmic techniques such as ensemble learning build on the underlying learner rather than replacing the base classifier. In the same way, combining numerous classifiers can yield a single, more effective one.

In ML, ensembles have been used to improve on the accuracy of a single classifier. Although neither ensembles nor individual classifiers can resolve the class imbalance problem on their own, the problem can be addressed with carefully designed learning methods.⁸⁷ Starting from a weak learning algorithm, a wide variety of strategies and procedures can be used to build an ensemble learning algorithm. One study described a new strategy, hyperSMURF, that outperforms state-of-the-art algorithms in two separate settings, namely the prediction of noncoding variants associated with Mendelian and with complex disorders, by exploiting imbalance-aware learning strategies based on resampling techniques and a hyper-ensemble approach.⁸⁸ Figure 6 shows the hyperSMURF method. The blue rectangles represent the majority class, which is subdivided into n partitions. For each partition, oversampling generates new instances from the minority class, while the majority class examples are undersampled. Finally, hyperSMURF employs an ensemble-of-ensembles method to combine the predictions of n RFs trained independently on the resulting balanced data sets.
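The data-preparation step described above (partition the majority class, oversample the minority class, undersample each majority partition) can be sketched as follows. This covers only the construction of the balanced training sets, not the random forests or the final combination, and it is not the published hyperSMURF code: the function name, the oversampling factor, and the use of interpolation between random minority pairs in place of nearest-neighbor SMOTE are assumptions.

```python
import random


def balanced_partitions(majority, minority, n_parts, oversample_factor=2, seed=0):
    """Build one balanced (majority, minority) training set per majority partition."""
    rng = random.Random(seed)
    shuffled = majority[:]
    rng.shuffle(shuffled)
    # Disjoint partitions of the majority class.
    parts = [shuffled[i::n_parts] for i in range(n_parts)]

    training_sets = []
    for part in parts:
        # Oversample: add synthetic points interpolated between random minority pairs
        # (a simplification of SMOTE, which interpolates toward nearest neighbors).
        synthetic = []
        for _ in range((oversample_factor - 1) * len(minority)):
            a, b = rng.sample(minority, 2)
            t = rng.random()
            synthetic.append(tuple(x + t * (y - x) for x, y in zip(a, b)))
        minority_over = minority + synthetic
        # Undersample: shrink the partition to the size of the oversampled minority.
        majority_under = rng.sample(part, min(len(part), len(minority_over)))
        training_sets.append((majority_under, minority_over))
    return training_sets


# Hypothetical data: 40 majority points and 4 minority points.
majority = [(float(i), 0.0) for i in range(40)]
minority = [(0.0, 5.0), (1.0, 5.0), (0.0, 6.0), (1.0, 6.0)]
sets = balanced_partitions(majority, minority, n_parts=4)
print([(len(maj), len(mino)) for maj, mino in sets])
```

Each of the resulting sets would then train one base ensemble, and the predictions of those ensembles would be averaged.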

3.3.1 Bagging learning

Breiman introduced the idea of bootstrap aggregating (bagging) for constructing ensembles. Because minority samples make up a disproportionately small share of an imbalanced dataset, bagging is adapted so that each bootstrap sample is rebalanced. Over-Bagging, Under-Bagging, and Under-Over-Bagging are among the main bagging ensembles of this kind, all of which preserve diversity among the base classifiers.⁸⁹
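An Under-Bagging style ensemble can be sketched as below: each bag pairs a bootstrap sample of the minority class with an equally sized random subset of the majority class, and the bags vote. The 1-nearest-neighbor base learner, the function name, and the toy data are assumptions chosen to keep the example short; any base classifier could take its place.

```python
import math
import random
from collections import Counter


def under_bagging_predict(majority, minority, query, n_bags=5, seed=0):
    """Majority vote over balanced bags; label 1 is the minority, 0 the majority class."""
    rng = random.Random(seed)
    votes = []
    for _ in range(n_bags):
        # Undersample the majority class to the minority size (without replacement).
        bag = [(p, 0) for p in rng.sample(majority, len(minority))]
        # Bootstrap the minority class (with replacement).
        bag += [(p, 1) for p in rng.choices(minority, k=len(minority))]
        # Base learner: 1-nearest neighbor inside the bag.
        _, label = min(bag, key=lambda item: math.dist(item[0], query))
        votes.append(label)
    return Counter(votes).most_common(1)[0][0]


# Hypothetical data: 20 majority points near the origin, 3 minority points near (5, 5).
majority = [(x * 0.5, y * 0.5) for x in range(5) for y in range(4)]
minority = [(5.0, 5.0), (5.2, 4.9), (4.9, 5.3)]
print(under_bagging_predict(majority, minority, (4.8, 5.1)))
print(under_bagging_predict(majority, minority, (0.2, 0.1)))
```

Because every bag draws a different majority subset, the base learners differ from one another even though each sees balanced data.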

3.3.2 Boosting

Boosting is an ML ensemble technique that aims to improve the performance of weak learners (typically simple models) by strategically combining their predictions. Research within the Probably Approximately Correct (PAC) learning framework confirmed that a weak learner can be transformed into a strong one. Like bagging, boosting trains each learner on a modified version of the training data, but it concentrates on the points that were previously predicted incorrectly.⁹⁰ One study presents an ensemble method, BPSO-Adaboost-KNN, for multiclass imbalanced data classification. The central concept of this algorithm is to integrate feature selection and boosting into an ensemble. The authors also employ a dedicated assessment metric for multiclass classification called AUCarea.⁸⁵

3.3.3 Adaboost

AdaBoost uses the whole dataset to train each classifier serially; in each iteration, it increases the weights of the samples that are hard to classify (often minority instances), which counteracts the bias that learning from imbalanced data introduces. AdaBoost.M1 and AdaBoost.M2 are two well-known variants used in imbalanced settings.⁹¹ Another ensemble approach for imbalanced data divides the data set into a large number of equal-sized subsets and builds a number of classifiers on each subset, each using a different classification algorithm.⁹² Figure 6 shows the hyperSMURF approach.
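The reweighting step that makes AdaBoost focus on hard samples can be sketched for the binary case as follows. This shows a single round of the standard discrete AdaBoost update with labels in {-1, +1}; the function name and the toy predictions are assumptions, and the multiclass variants mentioned above use different update rules.

```python
import math


def adaboost_round(weights, y_true, y_pred):
    """One discrete AdaBoost round: return the learner weight and the new sample weights."""
    # Weighted error of the current weak learner.
    err = sum(w for w, y, h in zip(weights, y_true, y_pred) if y != h)
    err = min(max(err, 1e-10), 1 - 1e-10)  # keep the logarithm finite
    alpha = 0.5 * math.log((1 - err) / err)
    # Misclassified samples (y * h = -1) are scaled up, correct ones scaled down.
    new = [w * math.exp(-alpha * y * h) for w, y, h in zip(weights, y_true, y_pred)]
    z = sum(new)
    return alpha, [w / z for w in new]


# Hypothetical round: four majority samples (-1), one minority sample (+1),
# and a weak learner that predicts the majority class everywhere.
y_true = [-1, -1, -1, -1, +1]
y_pred = [-1, -1, -1, -1, -1]
alpha, weights = adaboost_round([0.2] * 5, y_true, y_pred)
print(alpha, weights)
```

After this round the single misclassified minority sample carries half of the total weight, so the next weak learner can no longer ignore it.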

3.4 Cost-sensitivity methods

Misclassification costs can be incorporated at both the data level and the algorithm level. Cost-sensitive methods aim to minimize the total cost of misclassification. A cost-sensitive approach is most relevant when recognizing positive instances matters more than recognizing negative ones.^22,93 In the medical domain, the cost of misclassifying a non-cancerous patient is limited to additional medical tests. In contrast, misdiagnosing a potentially cancerous patient as healthy can be fatal. An evaluation framework is used to bridge the gap between internal and external approaches when evaluating potential cost changes. The learning procedure is modified to accept costs, and costs are attached to samples, by combining algorithmic and data-level techniques. When the minority class carries the higher misclassification cost, the classifier is biased toward the minority class so that the overall misclassification cost across both classes is reduced. The cost and emotional impact of a false negative in cancer diagnostics (when a patient who has cancer is classified as negative) are generally higher than those of a false positive (when a patient who does not have cancer is classified as positive).⁸⁵

For example, if cancer patients form the positive class (i.e., the minority class) and non-cancer patients form the negative class (the majority class), a cancer patient who is incorrectly classified as negative could lose their life because of the incorrect categorization or the resulting delay in diagnosis and treatment. To reduce both misclassification costs and total test costs, various adjustments to the cost matrices used in cost-sensitive training have been investigated.⁹⁴ To treat classes unequally, cost-sensitive boosting keeps the central AdaBoost learning architecture while introducing cost items into the weight update rule. These methods therefore differ mainly in how they modify the weight update procedure. AdaC1, AdaC2, AdaC3, CSB1, CSB2, and AdaCost are the best-known cost-sensitive boosting algorithms.^95,96 Table 2 summarizes and analyzes the clustering, feature selection, one-class learning, cost-sensitive, and ensemble learning approaches.
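The effect of unequal misclassification costs on a single decision can be illustrated with a minimum-expected-cost rule. This is a generic sketch and not one of the boosting algorithms listed above: the function names and the cost values are hypothetical, and correct decisions are assumed to cost nothing.

```python
def min_cost_decision(p_positive, cost_fn, cost_fp):
    """Return 1 (positive) or 0 (negative), whichever has the lower expected cost.

    p_positive is the estimated probability that the patient is positive.
    """
    expected_cost_if_negative = p_positive * cost_fn        # risk of a false negative
    expected_cost_if_positive = (1 - p_positive) * cost_fp  # risk of a false positive
    return 1 if expected_cost_if_positive < expected_cost_if_negative else 0


def cost_threshold(cost_fn, cost_fp):
    """Probability above which predicting positive is the cheaper decision."""
    return cost_fp / (cost_fp + cost_fn)


# Hypothetical costs: a missed cancer case is ten times as costly as a false alarm.
print(cost_threshold(cost_fn=10, cost_fp=1))
print(min_cost_decision(0.2, cost_fn=10, cost_fp=1))  # flagged despite p < 0.5
print(min_cost_decision(0.2, cost_fn=1, cost_fp=1))   # not flagged under equal costs
```

Raising the false-negative cost lowers the decision threshold, which is the same bias toward the minority class that cost-sensitive training builds into the classifier itself.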

Table 2.
Clustering, feature selection, one-class learning, cost sensitivity, and ensemble learning techniques.

Methods	Advantages	Disadvantages
AdaC1, AdaC2, AdaC3; CSB1, CSB2, and AdaCost (Buda et al., 2018 ⁹⁷ ; Mazurowski et al., 2008 ⁹⁸ )	- Effectively ranks features using statistical measures.	- High solution complexity, especially during the training phase.
Similarity-Based Hierarchical Decomposition (Beyan & Fisher, 2015 ⁹⁹ )	- Provides sufficient classification performance, particularly for datasets with low imbalance ratios.	- Does not consider multiclass factors, leading to various other issues.
Density-Based Feature Selection (DBFS) (Alibeigi et al., 2012 ¹⁰⁰ )	- Handles small-sized samples and high-dimensional data effectively. - Performs well on imbalanced data streams.	- Experiments are limited to three feature selection methods.
Hellinger Distance-Based Imbalanced Data Classification (Grzyb et al., 2021 ²³ )	- Utilizes true and false positive rates to improve classification of imbalanced data.	- The method fails to account for critical misclassification factors.
Deep Learning-Based One-Class Transfer Learning Method (Perera & Patel, 2019 ¹⁰¹ )	- Performs well with high-dimensional data, especially for detecting abnormalities in images.	- Struggles with complex anomalies that overlap with normal class data in the feature space.
Cost-Sensitive Learning Method (Fernández et al., 2018 ¹⁰² )	- Achieves better results by reducing error costs.	- Adds extra layers of complexity to training, resulting in longer training times.
AdaCost: Misclassification Cost-Sensitive Method (Fan et al., 1999 ¹⁰³ )	- Reduces the upper bound of cumulative misclassification costs.	- Limited to binary class imbalanced data, excluding multiclass scenarios.
Ensemble Learning Method for Classification (Sagi & Rokach, 2018 ⁹⁰ )	- Improves both accuracy and generalization ability. - Increases the accuracy of individual classifiers.	- In some cases, ambiguity is introduced during classification.
A Novel Ensemble Method for Classifying Unbalanced Data (Sun et al., 2015 ¹⁰⁴ )	- Performs well for imbalanced data classification and improves minority class coverage.	- Suffers from imbalanced class distribution and faces challenges in addressing class issues comprehensively.
Multi-Class Pattern Classification (Rong et al., 2008 ¹⁰⁵ )	- Demonstrates robust classification capabilities.	- Limited to binary imbalanced data and does not generalize effectively for multiclass imbalanced datasets.

4 Data-level algorithms

Many classification models have been presented to perform the classification task because of their importance in efficient data mining. On the other hand, regular classification models are highly sensitive to each dataset's specifics. Standard classification models are biased toward the most common patterns, leading to the misclassification of unusual cases when applied to datasets with a skewed class distribution. As a result of the gravity of class disparity issues, much effort has been put into finding effective ways to address them. In terms of how they deal with class differences, these ideas may be broken down into three camps: those that take an external (or data-level) approach, those that use an internal (algorithmic) approach, and those that are cost-sensitive. In addition, ensemble learning classifiers are also crucial in the categorization of imbalanced data.⁹⁰ Despite the extensive literature on data-level classifications, the computational burden of these methods is high.

The primary goal of algorithmic methods is to enhance the precision of a model by proposing new algorithms or adjusting existing ones. A comparison of various methods in¹⁰⁴ found that the naive Bayes and K-Nearest Neighbor (KNN) approaches tend to outperform the SVM and Random Forest (RF) approaches. Genetic Programming (GP), an evolutionary approach, can be applied to a wide range of problems. In GP, cost adjustment is as simple as choosing a suitable fitness function, one that rewards solutions that perform well on both the minority and the majority class. Moreover, both the accuracy (acc) and the average class accuracy (ave) were reported to be adequate as fitness functions for addressing imbalanced classification. The Ames, Incr, and Corr fitness functions are used as well.^105,106

4.1 External data-level algorithm

Resampling techniques used to prepare imbalanced data fall into three categories. Random undersampling removes majority class specimens at random, creating a subset of the original dataset with a more balanced class ratio. It risks information loss, because data that could be useful in the induction process are omitted. Random oversampling duplicates existing samples at random, creating a superset of the original data that contains more samples from the underrepresented class. However, the added duplicates can lead to overfitting. Finally, the hybrid method combines the two sampling approaches to obtain a more uniform distribution.¹⁰⁷ The Synthetic Minority Oversampling Technique (SMOTE) generates new samples by interpolating between existing minority class specimens. To create a synthetic specimen, SMOTE randomly chooses one of the k nearest neighbors (kNN) of a minority specimen and interpolates between the two. The extent to which the minority class region expands into majority class territory is limited by a decision taken at the boundary level. The approach mitigates the overfitting problem but can produce noisy and questionable specimens.¹⁰⁸
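The interpolation step of SMOTE can be sketched in a few lines. This is a minimal sketch under stated assumptions, not a reference implementation: the function name, the default k, the fixed seed, and the toy minority set are all hypothetical, and production code would normally rely on an established library.

```python
import math
import random


def smote(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic points by interpolating toward nearest minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbors of the chosen base point.
        neighbors = sorted((p for p in minority if p is not base),
                           key=lambda p: math.dist(p, base))[:k]
        neighbor = rng.choice(neighbors)
        # Place the new point at a random position on the segment base -> neighbor.
        gap = rng.random()
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbor)))
    return synthetic


# Hypothetical minority class with four two-dimensional samples.
minority = [(1.0, 1.0), (1.5, 1.2), (2.0, 1.0), (1.2, 1.8)]
new_points = smote(minority, n_new=6)
print(new_points)
```

Because every synthetic point lies on a segment between two existing minority samples, the new points are not exact duplicates, which is what distinguishes SMOTE from random oversampling.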

Some filtering-based approaches (SMOTE-TL and EL) are employed to suppress noise in imbalanced data sets. In addition, standard sampling techniques have been extended with Neighborhood-Balanced Bagging (NBBag) to deal with imbalanced data. An enhanced version of SMOTE, known as the Modified Synthetic Minority Oversampling Technique (MSMOTE), has also been developed. This method divides the minority class into three categories, namely latent noise, safe, and border, based on the distances between all samples. MSMOTE creates new instances while discarding latent noise points identified with the kNN classification approach.¹⁰⁹ However, it only addresses the latent noise instances and does not emphasize key features. To handle noise while maintaining clean class boundaries, researchers have combined an extended version of SMOTE with an iterative partitioning filter (IPF).¹¹⁰ SMOTE has also been modified in many other ways to obtain more powerful data-level strategies, such as SMOTE extension, B1-SMOTE, and B2-SMOTE.¹¹¹ The Majority Weighted Minority Oversampling Technique (MWMOTE) is an efficient method for selecting and weighting hard-to-learn minority class samples, and it can generate realistic synthetic instances. Another approach is Selective Preprocessing of Imbalanced Data (SPIDER), which combines filtering of difficult majority class instances with local oversampling of the minority class in a single step.¹¹²

To correct skewed data, researchers have proposed Inverse Random Undersampling (IRUS), a technique in which the ratio of the class cardinalities is inverted. It has also been applied to multi-label classification systems. A Radial Basis Function Network (RBFN) has been proposed for dealing with imbalanced datasets.¹¹³ It is built with both local and global terms in mind and employs a method of training local weights. Local weight training produced better results at higher values of the Imbalance Ratio (IR), while data sets with a lower IR can be balanced with any technique. A classifier technique has been suggested that combines PSO and SMOTE with well-known classifiers such as Logistic Regression (LR), the C5 decision tree model (C5), and 1-nearest neighbor (1-NN) search. This collection of classifiers was shown to be effective using performance measures such as accuracy indices and the G-mean. The experimental findings demonstrated the efficacy of the hybrid algorithm PSO + SMOTE + C5 for predicting 5-year survival in breast cancer patients. The combination of PSO, SMOTE, and a Radial Basis Function (RBF) classifier has been proposed as another effective method for imbalanced binary class problems; evaluation with various metrics revealed that this approach works well for moderately imbalanced data sets but falls short for extremely imbalanced ones.¹¹⁴
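Two quantities used in this discussion, the imbalance ratio (IR) and the G-mean, can be computed as in the sketch below. The definitions used are the common ones (IR as majority count divided by minority count, G-mean as the geometric mean of sensitivity and specificity); the function names and the toy labels are assumptions.

```python
import math
from collections import Counter


def imbalance_ratio(labels):
    """Size of the largest class divided by the size of the smallest class."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def g_mean(y_true, y_pred, positive=1):
    """Geometric mean of sensitivity and specificity for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return math.sqrt(sensitivity * specificity)


# Hypothetical labels: 2 positive and 8 negative cases.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
print(imbalance_ratio(y_true), g_mean(y_true, y_pred))
```

The G-mean drops to zero whenever one class is never predicted correctly, which is why it is preferred over plain accuracy for comparing classifiers on skewed data.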

4.2 Internal data-level algorithms

One way to deal with imbalanced data is to develop new algorithms or improve existing ones to account for their effects on underrepresented groups. Table 3 summarizes the benefits and drawbacks of the data-level approaches.

Table 3.
Approaches at the data level with advantages and disadvantages.

Approaches	Advantages	Disadvantages
Synthetic Minority Oversampling Technique (SMOTE) ¹¹⁵	- Effectively balances classification by increasing the number of minority class examples.	- May lead to overfitting in some cases.
Neighborhood-Balanced Bagging (NBBag) ¹⁰⁹	- Performs well in oversampling bagging extensions. - Demonstrates robustness in randomly balanced bagging.	- Computationally expensive.
Modified SMOTE (MSMOTE) ¹⁰⁹	- Reduces noise in the dataset.	- Neglects prioritization of important features.
Iterative Partitioning Filter + SMOTE (SMOTE-IPF) ¹¹⁶	- Effectively handles noise and borderline examples in imbalanced datasets.	- Limited performance with small sample sizes.
Majority Weighted Minority Oversampling Technique (MWMOTE) ⁴⁵	- Creates artificial samples for the minority class to address imbalance.	- Struggles with multiclass imbalance problems.
Selective Preprocessing of Imbalanced Data (SPIDER) ¹¹²	- Filters difficult examples effectively.	- Suffers from complexity and implementation challenges.
Novel Inverse Random Under Sampling (IRUS) ⁴⁸	- Provides high accuracy for multi-label classification. - Feasible for datasets with irregular sizes.	- Requires different applications for multi-label classification.
Radial Basis Function Networks (RBFN) SMOTE ¹¹³	- Produces better results as the imbalance ratio (IR) increases.	- Demands significant storage space.
SMOTE + PSO + C5 ⁴⁴	- Applied successfully for analyzing breast cancer patients over a five-year period.	- Specifically designed for cancer datasets, limiting generalization.
Combined SMOTE and PSO-Based RBF Classifiers (SMOTE + PSO-RBF) ¹¹⁷	- Performs well in generating synthetic samples for minority classes.	- Requires extensive storage space.

5 Strategies for handling imbalanced datasets

One strategy balances the data with K-Means before classification, which improves accuracy and reduces processing time compared with the original imbalanced dataset. The process has two stages: first, K-Means is used to balance the data, and then a support vector machine classifies the resulting balanced dataset.¹¹⁸ Cross-validation is a fundamental method of performance evaluation, but researchers are often unfamiliar with applying it to imbalanced data, and traditional classification algorithms assume a balanced class distribution.¹¹⁹ Data complexity metrics (CMs) were introduced to detect dataset attributes that indicate classification difficulty and affect classifier accuracy. One study developed two CMs tailored to imbalanced datasets to better explain the decline in classifier performance.
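Since the paragraph above raises cross-validation on imbalanced data, a minimal sketch of stratified fold assignment may help. Stratification is a common practice, not something the cited studies are confirmed to use. The function name `stratified_folds` and the toy labels are illustrative assumptions.

```python
from collections import defaultdict


def stratified_folds(labels, k):
    """Assign each sample index to one of k folds, class by class."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        # Deal the indices of one class round-robin across the folds,
        # so every fold receives a similar share of that class.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds


# Hypothetical data: 90 majority (0) and 10 minority (1) samples.
labels = [0] * 90 + [1] * 10
folds = stratified_folds(labels, k=5)
minority_per_fold = [sum(labels[i] for i in fold) for fold in folds]
print(minority_per_fold)  # [2, 2, 2, 2, 2]
```

With plain random splitting, a fold could end up with no minority samples at all, which makes recall on that fold undefined. Dealing each class out separately avoids that case whenever a class has at least k samples.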

These measures utilize weighted KNN to account for the challenges posed by imbalanced class distributions. Class distribution skews in such datasets often lead models to favor majority classes, complicating classifier evaluation. While balanced accuracy is a commonly used statistic in these scenarios, it has limitations, particularly when class significance varies or when the distribution of class sizes is highly skewed.¹²⁰
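To make the contrast between plain and balanced accuracy concrete, the following sketch computes both for a classifier that always predicts the majority class. The data are hypothetical and the helper names are illustrative; balanced accuracy is taken here as the unweighted mean of per-class recall.

```python
def per_class_recall(y_true, y_pred):
    """Return {class: recall} for every class present in y_true."""
    recalls = {}
    for cls in set(y_true):
        total = sum(1 for t in y_true if t == cls)
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        recalls[cls] = hits / total
    return recalls


def balanced_accuracy(y_true, y_pred):
    """Unweighted mean of the per-class recalls."""
    recalls = per_class_recall(y_true, y_pred)
    return sum(recalls.values()) / len(recalls)


# Hypothetical screening data: 95 healthy (0), 5 diseased (1).
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100  # a classifier that always predicts the majority class

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)                           # 0.95
print(balanced_accuracy(y_true, y_pred))  # 0.5
```

Because every class contributes equally to the mean, balanced accuracy cannot express that missing a diseased patient may matter more than misclassifying a healthy one, which is one form of the limitation noted above.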

Adaptive-SMOTE improves the SMOTE approach by adaptively selecting groups of inner and danger data from the minority class to create a new minority class, which keeps the class boundary from expanding and enhances the distributional characteristics of the original data.¹²¹ Another study evaluated its proposed strategy on nine medical datasets and compared it with resampling (SMOTE and undersampling), PSO, and MetaCost; a decision tree then generates decision rules to simplify the research results.¹⁵ A further study combined the metaheuristic Whale Optimization Algorithm (WOA) with the local-search Late Acceptance Hill-Climbing Algorithm (LAHCA) for feature weighting in nearest neighbor imputation. The resulting Metaheuristic and Local Search-based Feature Weighted Nearest Neighbor Imputation (kNN + LAHCAWOA) method learns different k values for distinct test locations. It was tested on benchmark EHR datasets with SVM, RF, and DNN classifiers, and kNN + LAHCAWOA outperformed its competitors in classification performance because of its effective imputation mechanism.¹²²
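As background for the SMOTE variants discussed above, here is a minimal sketch of the basic SMOTE interpolation step only. It does not reproduce Adaptive-SMOTE's selection of inner and danger groups. The function name, the parameters (`k`, `seed`), and the toy points are assumptions made for illustration.

```python
import random


def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself).
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: sq_dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Hypothetical two-feature minority class.
minority = [(1.0, 1.0), (1.5, 1.2), (2.0, 1.0), (1.2, 1.8)]
new_points = smote(minority, n_new=6)
print(len(new_points))  # 6
```

Every synthetic point lies on a segment between two real minority samples, so plain SMOTE can place points in regions that overlap the majority class. Variants such as Adaptive-SMOTE aim to control this by choosing which minority samples take part.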

R-Ensembler is a parameter-free greedy ensemble attribute selection approach that uses the attribute-class, attribute-significance, and attribute-relevance measures from rough set theory. From a pool of distinct attribute subsets, it selects the attributes that are most relevant, significant, and non-redundant for predicting the presence or absence of different diseases in a medical dataset.⁴⁸ Another article¹²³ addressed data problems with an ML workflow for coping with small data: at the data-source level, it covers high-throughput computations and experiments; at the algorithm level, it covers modeling algorithms for small data and imbalanced learning. Table 4 summarizes the evaluation measures of several ML techniques that aim to solve class imbalance.

Table 4.
Evaluation metrics of several ML algorithms for addressing class imbalance

Algorithm approaches	Accuracy	Precision	Recall	F1-Score	AUC-ROC	Hyperparameters	Ref.
Logistic Regression (unbalanced)	75.65%	63.57%	47.91%	54.94%	0.7905	C = 1.0, solver='liblinear'	(Wang & Cheng, 2021¹²³)
Logistic Regression (SMOTE)	74.44%	59.06%	68.87%	63.67%	0.8122	C = 1.0, solver='liblinear'	(Maheshwari et al., 2017¹²⁴)
Decision Tree (unbalanced)	69.61%	46.34%	54.25%	49.98%	0.6815	max_depth = 5, min_samples_split = 2	(Bania & Halder, 2020¹²⁵)
Decision Tree (SMOTE)	71.11%	50.92%	62.08%	56.04%	0.7664	max_depth = 5, min_samples_split = 2	(Breja & Yadav, 2021¹²⁶)
RF (unbalanced)	77.39%	64.60%	43.11%	51.80%	0.7318	n_estimators = 100, max_depth = None, min_samples_split = 2	(Xu et al., 2023¹²⁷)
RF (SMOTE)	79.22%	67.10%	65.45%	66.27%	0.8135	n_estimators = 100, max_depth = None, min_samples_split = 2	(Xu et al., 2023¹²⁷)
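The relationship between the precision, recall, and F1 columns of Table 4 can be checked with a short calculation. The sketch below copies the precision and recall values from the table, computes the recall gain attributed to SMOTE, and recomputes F1 as the harmonic mean. The recomputed F1 can differ slightly from the reported column, for instance if the reported metrics were averaged across folds, which is an assumption and not something the source states.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    return 2 * precision * recall / (precision + recall)


# (precision %, recall %) pairs copied from Table 4.
table4 = {
    "Logistic Regression": {"unbalanced": (63.57, 47.91), "SMOTE": (59.06, 68.87)},
    "Decision Tree": {"unbalanced": (46.34, 54.25), "SMOTE": (50.92, 62.08)},
    "RF": {"unbalanced": (64.60, 43.11), "SMOTE": (67.10, 65.45)},
}

for name, rows in table4.items():
    recall_gain = rows["SMOTE"][1] - rows["unbalanced"][1]
    f1_before = f1_score(*rows["unbalanced"])
    f1_after = f1_score(*rows["SMOTE"])
    print(f"{name}: recall +{recall_gain:.2f} points, "
          f"F1 {f1_before:.2f} -> {f1_after:.2f}")
```

On these inputs, recall rises for all three models after SMOTE, and the recomputed F1 values stay within about 0.3 percentage points of the reported ones.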

6 Challenges in addressing class imbalance in healthcare

Class imbalance in healthcare datasets poses considerable challenges that can substantially affect the efficacy of machine learning models. Our review of the literature highlights several key issues:

Data Complexity and Heterogeneity: Healthcare data is inherently diverse, encompassing various types such as genomic sequences, medical imaging, and electronic health records. This heterogeneity complicates the application of traditional balancing techniques. For instance, standard oversampling methods like SMOTE often struggle to capture the intricate relationships within complex medical data, leading to suboptimal model performance.

Limited Sample Sizes: Many medical conditions, particularly rare diseases, suffer from limited sample sizes. This scarcity in minority classes makes it difficult to apply conventional oversampling techniques without risking overfitting or introducing synthetic noise. Researchers must navigate the delicate balance between generating additional samples and preserving data integrity.

Ethical Considerations: The manipulation of healthcare data brings forth significant ethical concerns. Creating synthetic samples or removing existing data points must be handled with extreme caution to ensure patient privacy and maintain the integrity of clinical information. Striking a balance between data equilibrium and ethical standards is a complex yet necessary endeavor in the medical field.

Evolving Nature of Medical Knowledge: The rapid advancement of medical science means that the relevance and interpretation of data can change over time. This dynamic nature of medical knowledge challenges the creation of stable and reliable balanced datasets that remain pertinent as medical understanding evolves.

7 Conclusion

In conclusion, ensuring the accuracy and precision of machine learning models is imperative to avert issues such as majority intraclass bias and class imbalance within data types. This paper identifies and quantifies data biases and delineates steps that lead to practical strategies to mitigate their impact. The narrative review stresses the importance of robust data quality measures throughout the data lifecycle to mitigate misclassification bias and enhance model performance. It also elaborates on the challenges and perspectives associated with using machine learning to address class imbalance in healthcare data.

An overview of current methodologies is provided, covering data preprocessing approaches such as oversampling, undersampling, and hybrid sampling, and ensemble learning strategies including bagging, boosting, and AdaBoost. Research on ensemble methods and hybrid resampling techniques, such as Iterative Partitioning Filter (IPF) and Edited Nearest Neighbors (ENN), has demonstrated promising solutions for managing imbalanced datasets, particularly in domains characterized by small sample sizes such as medical registrations, where undersampling and oversampling methods are applied across all categories. This study further elucidates the effectiveness of various approaches in addressing data imbalance and emphasizes the need for ongoing research and innovation in this area. Future work should prioritize the identification of key data types and the development of improved methodologies to enhance classification accuracy. Integrating imaging techniques into the intelligent segmentation of health information is another promising avenue for future research. The insights from this review can support the development of more relevant and reliable AI systems.

Footnotes

ORCID iDs

Bashar Hamad Aubaidan

Bakr Ahmed Taha

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research is funded by the Institute of Visual Informatics Universiti Kebangsaan Malaysia.

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

References

1. Abbasi, Bashir, Qureshi, et al. Deep learning-based feature extraction and optimizing pattern matching for intrusion detection using finite state machine. Comput Electr Eng 2021; 92: 107094.
2. Qureshi, Khan, Jamil SUU, et al. Internet of Things enables smart solid waste bin management system for a sustainable environment. Environ Sci Pollut Res 2023; 30: 125188–125196.
3. Sun, Wong, Kamel. Classification of imbalanced data: a review. Int J Pattern Recognit Artif Intell 2009; 23: 687–719.
4. Iqbal, Maryam, Qureshi, et al. Automised flow rule formation by using machine learning in software defined networks based edge computing. Egyptian Inf J 2021; 23: 149–157.
5. Naseem, Alhudhaif, Anwar, et al. Artificial general intelligence-based rational behavior detection using cognitive correlates for tracking online harms. Pers Ubiquitous Comput 2022; 27: 119–137.
6. Van Hulse, Khoshgoftaar. Knowledge discovery from imbalanced and noisy data. Data Knowl Eng 2009; 68: 1513–1542.
7. Taha, Al Mashhadany, Al-Jubouri, et al. Next-generation nanophotonic-enabled biosensors for intelligent diagnosis of SARS-CoV-2 variants. Sci Total Environ 2023; 880: 163333.
8. Taha, Al-Jubouri, Al Mashhadany, et al. Density estimation of SARS-CoV2 spike proteins using super pixels segmentation technique. Appl Soft Comput 2023; 138: 110210.
9. Kang, Shi, Zhou, et al. A distance-based weighted undersampling scheme for support vector machines and its application to imbalanced classification. IEEE Trans Neural Networks Learn Syst 2017; 29: 4152–4165.
10. Taha, Al Mashhadany, Al-Jubouri, et al. Uncovering the morphological differences between SARS-CoV-2 and SARS-CoV based on transmission electron microscopy images. Microbes Infect 2023; 25: 105187.
11. Zadrozny, Elkan. Learning and making decisions when costs and probabilities are both unknown. In: Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining, 2001, pp.204–213.
12. Salunkhe, Mali. Classifier ensemble design for imbalanced data classification: a hybrid approach. Procedia Comput Sci 2016; 85: 725–732.
13. Keele. Guidelines for performing systematic literature reviews in software engineering. Technical report, ver. 2.3, EBSE Technical Report. EBSE, 2007.
14. Qureshi, Jeon, et al. Deep learning-based ambient assisted living for self-management of cardiovascular conditions. Neural Comput Appl 2022; 34: 10449–10467.
15. Garcia. Learning from imbalanced data. IEEE Trans Knowl Data Eng 2009; 21: 1263–1284.
16. Krawczyk. Learning from imbalanced data: open challenges and future directions. Prog Artif Intell 2016; 5: 221–232.
17. Goyal, Khiari. Diversity-aware weighted majority vote classifier for imbalanced data. Proc Int Jt Conf Neural Networks 2020. https://doi.org/10.1109/IJCNN48605.2020.9207261
18. Karrar. Investigate the ensemble model by intelligence analysis to improve the accuracy of the classification data in the diagnostic and treatment interventions for prostate cancer. Int J Adv Comput Sci Appl 2022; 13: 181–188.
19. Bugnon, Yones, Milone, et al. Deep neural architectures for highly imbalanced data in bioinformatics. IEEE Trans Neural Networks Learn Syst 2020; 31: 2857–2867.
20. Tian, Bian, Tang, et al. A new non-kernel quadratic surface approach for imbalanced data classification in online credit scoring. Inf Sci 2021; 563: 150–165.
21. Zheng, Sun, et al. An automatic sampling ratio detection method based on genetic algorithm for imbalanced data classification. Knowl-Based Syst 2021; 216: 106800.
22. Wang, Sun. The improved AdaBoost algorithms for imbalanced data classification. Inf Sci 2021; 563: 358–374.
23. Grzyb, Klikowski, Woźniak. Hellinger distance weighted ensemble for imbalanced data stream classification. J Comput Sci 2021; 51: 101314.
24. Alanazi, Abdullah, Qureshi. A critical review for developing accurate and dynamic predictive models using machine learning methods in medicine and health care. J Med Syst 2017; 41: 69.
25. Kovács, Tinya, Németh, et al. Unfolding the effects of different forestry treatments on microclimate in oak forests: results of a 4-yr experiment. Ecol Appl 2020; 30: e02043.
26. Ashokkumar, Don. Link-based clustering algorithm for clustering web documents. J Test Eval 2019; 47: 20180497.
27. Puri, Gupta. Knowledge discovery from noisy imbalanced and incomplete binary class data. Expert Syst Appl 2021; 181: 115179.
28. Fernández, del Jesus, Herrera. On the 2-tuples based genetic tuning performance for fuzzy rule based classification systems in imbalanced data-sets. Inf Sci 2010; 180: 1268–1291.
29. Nnamoko, Korkontzelos. Efficient treatment of outliers and class imbalance for diabetes prediction. Artif Intell Med 2020; 104: 101815.
30. Welzer, Eder, Podgorelec, et al. New Trends in Databases and Information Systems: ADBIS 2019 Short Papers, Workshops BBIGAP, QAUCA, SemBDM, SIMPDA, M2P, MADEISD, and Doctoral Consortium, Bled, Slovenia, September 8–11, 2019, Proceedings. Springer Nature, 2019.
31. Haixiang, Yijing, Shang, et al. Learning from class-imbalanced data: review of methods and applications. Expert Syst Appl 2017; 73: 220–239.
32. Bekkar, Djemaa, Alitouche. Evaluation measures for models assessment over imbalanced data sets. J Inf Eng Appl 2013; 3.
33. Wang, Liu, et al. Training deep neural networks on imbalanced data sets. In: 2016 international joint conference on neural networks (IJCNN), 2016, pp.4368–4374: IEEE.
34. Datta, Das. Near-Bayesian support vector machines for imbalanced data classification with equal or unequal misclassification costs. Neural Networks 2015; 70: 39–52.
35. Wang, Han, et al. Generalization of deep neural networks for imbalanced fault classification of machinery using generative adversarial networks. IEEE Access 2019; 7: 111168–111180.
36. Jiang, Hong, Zhou, et al. A GAN-based anomaly detection approach for imbalanced industrial time series. IEEE Access 2019; 7: 143608–143619.
37. Munjal, Paul, Krishnan. Implicit discriminator in variational autoencoder. Proc Int Joint Conf Neural Networks 2020. https://doi.org/10.1109/IJCNN48605.2020.9207307
38. Alanazi, Abdullah, Qureshi, et al. Predicting the outcomes of traumatic brain injury using accurate and dynamic predictive model. J Theor Appl Inf Technol 2016; 93: 561.
39. Mahmood, Butt, Rehman, et al. Generation of controlled synthetic samples and impact of hyper-tuning parameters to effectively classify the complex structure of overlapping region. Appl Sci 2022; 12: 8371.
40. Seng, Kareem, Varathan. A neighborhood undersampling stacked ensemble (NUS-SE) in imbalanced classification. Expert Syst Appl 2021; 168: 114246.
41. Lin W-C, Tsai C-F, Y-H, et al. Clustering-based undersampling in class-imbalanced data. Inf Sci 2017; 409: 17–26.
42. Khushi, Shaukat, Alam, et al. A comparative performance analysis of data resampling methods on imbalance medical data. IEEE Access 2021; 9: 109960–109975.
43. Soltanzadeh, Hashemzadeh. RCSMOTE: range-controlled synthetic minority over-sampling technique for handling the class imbalance problem. Inf Sci 2021; 542: 92–111.
44. Sáez, Luengo, Stefanowski, et al. SMOTE–IPF: addressing the noisy and borderline examples problem in imbalanced classification by a re-sampling method with filtering. Inf Sci 2015; 291: 184–203.
45. Wang K-J, Makond, Chen K-H, et al. A hybrid classifier combining SMOTE with PSO to estimate 5-year survivability of breast cancer patients. Appl Soft Comput 2014; 20: 15–24.
46. Shen, Nie, et al. A hybrid sampling algorithm combining M-SMOTE and ENN based on random forest for medical imbalanced data. J Biomed Inform 2020; 107: 103465.
47. Q. A hybrid sampling SVM approach to imbalanced data classification. In: Abstract and applied analysis, vol. 2014. Cairo, Egypt: Hindawi, 2014.
48. Pan, Zhao, et al. Learning imbalanced datasets based on SMOTE and Gaussian distribution. Inf Sci 2020; 512: 1214–1233.
49. Xifra-Porxas, Ghosh, Mitsis, et al. Estimating brain age from structural MRI and MEG data: insights from dimensionality reduction techniques. Neuroimage 2021; 231: 117822.
50. Wang, et al. SP-SMOTE: a novel space partitioning based synthetic minority oversampling technique. Knowl-Based Syst 2021; 228: 107269.
51. Seiffert, Khoshgoftaar, Van Hulse. Hybrid sampling for imbalanced data. Integr Comput-Aided Eng 2009; 16: 193–210.
52. Fujiwara, Huang, Hori, et al. Over- and under-sampling approach for extremely imbalanced and small minority data problem in health record analysis. Front Public Health 2020; 8: 178. https://doi.org/10.3389/fpubh.2020.00178
53. Popel, Hasib, Habib, et al. A hybrid under-sampling method (HUSBoost) to classify imbalanced data. In: 2018 21st international conference of computer and information technology (ICCIT), 2018, pp.1–7: IEEE.
54. Zhu, Chu, Wang, et al. Prediction of rockhead using a hybrid N-XGBoost machine learning framework. J Rock Mech Geotech Eng 2021; 13: 1231–1245.
55. Shen, et al. A hybrid method to predict postoperative survival of lung cancer using improved SMOTE and adaptive SVM. Comput Math Methods Med 2021; 2021.
56. Cui, Jia, Lin T-Y, et al. Class-balanced loss based on effective number of samples. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp.9268–9277.
57. ElSeddawy, Karim, Hussein, et al. Predictive analysis of diabetes-risk with class imbalance. Comput Intell Neurosci 2022; 2022.
58. Rahim, Rasheed, Azam, et al. An integrated machine learning framework for effective prediction of cardiovascular diseases. IEEE Access 2021; 9: 106575–106588.
59. Majid, Ali, Iqbal, et al. Prediction of human breast and colon cancers from imbalanced data using nearest neighbor and support vector machines. Comput Methods Programs Biomed 2014; 113: 792–808.
60. Vuttipittayamongkol, Elyan. Overlap-based undersampling method for classification of imbalanced medical datasets. In: Artificial Intelligence Applications and Innovations: 16th IFIP WG 12.5 International Conference, AIAI 2020, Neos Marmaras, Greece, June 5–7, 2020, Proceedings, Part II 16, 2020, pp.358–369: Springer.
61. Sreejith, Nehemiah, Kannan. Clinical data classification using an enhanced SMOTE and chaotic evolutionary feature selection. Comput Biol Med 2020; 126: 103991.
62. Athitya Kumaraguru, Vinod, Rajkumar, et al. Parallel selective sampling for imbalance data sports activities. In: Soft Computing: Theories and Applications: Proceedings of SoCTA 2018, 2020, pp.879–886: Springer.
63. D’Addabbo, Maglietta. Parallel selective sampling method for imbalanced and large data classification. Pattern Recognit Lett 2015; 62: 61–67.
64. Claesen, De Smet, Suykens, et al. EnsembleSVM: A library for ensemble learning using support vector machines. arXiv preprint arXiv:1403.0745, 2014.
65. Roy, Ahmad, Waqar, et al. An enhanced machine learning framework for type 2 diabetes classification using imbalanced data with missing values. Complexity 2021; 2021: 1–21.
66. Devi, Biswas, Purkayastha. Learning in presence of class imbalance and class overlapping by using one-class SVM and undersampling technique. Connection Sci 2019; 31: 105–142.
67. Zhou, Tang, et al. A DBN-based resampling SVM ensemble learning paradigm for credit classification with imbalanced data. Appl Soft Comput 2018; 69: 192–202.
68. Hassan, Amiri. Classification of imbalanced data of diabetes disease using machine learning algorithms. Age (Years) 2019; 21: 33.24.
69. Mirzaei, Nikpour, Nezamabadi-pour. CDBH: a clustering and density-based hybrid approach for imbalanced data classification. Expert Syst Appl 2021; 164: 114035.
70. Vigneron, Chen. A multi-scale seriation algorithm for clustering sparse imbalanced data: application to spike sorting. Pattern Anal Appl 2016; 19: 885–903.
71. Roy, Ahmad, Waqar, et al. An enhanced machine learning framework for type 2 diabetes classification using imbalanced data with missing values. Complexity 2021; 2021: 1–21.
72. Maldonado, Weber, Famili. Feature selection for high-dimensional class-imbalanced data sets using support vector machines. Inf Sci 2014; 286: 228–246.
73. Shen, et al. A novel combined dynamic ensemble selection model for imbalanced data to detect COVID-19 from complete blood count. Comput Methods Programs Biomed 2021; 211: 106444.
74. Shi. Improving k-nearest neighbors algorithm for imbalanced data classification. In: IOP Conference Series: Materials Science and Engineering, 2020, vol. 719, p.012072: IOP Publishing.
75. Liu, Zhou, et al. Weighted Gini index feature selection method for imbalanced data. In: 2018 IEEE 15th international conference on networking, sensing and control (ICNSC), 2018, pp.1–6: IEEE.
76. Zhou, et al. Online feature selection for high-dimensional class-imbalanced data. Knowl-Based Syst 2017; 136: 187–199.
77. Khaldy, Kambhampati. Resampling imbalanced class and the effectiveness of feature selection methods for heart failure dataset. Int Rob Autom J 2018; 4: 1–10.
78. Pes. Learning from high-dimensional and class-imbalanced datasets using random forests. Information 2021; 12: 286.
79. Valentini. Ensemble methods: a review. In: Advances in Machine Learning and Data Mining for Astronomy (ed. Kumar, V.), pp.563–594. Chapman & Hall, 2012.
80. Ben Brahim, Limam. Ensemble feature selection for high dimensional data: a new method and a comparative study. Adv Data Anal Classif 2018; 12: 937–952.
81. Güney. Feature selection-integrated classifier optimisation algorithm for network intrusion detection. Concurrency Comput: Pract Exper 2023; 35: e7807.
82. Fujita, Wang, Xiao, et al. Advances and trends in artificial intelligence. In: Theory and Applications: 36th International Conference on Industrial, Engineering and Other Applications of Applied Intelligent Systems, IEA/AIE 2023, Shanghai, China, July 19–22, 2023, Proceedings, Part I, 2023.
83. Neeraj, Maurya. A review on machine learning (feature selection, classification and clustering) approaches of big data mining in different area of research. J Crit Rev 2020; 7: 2610–2626.
84. Schubach, Robinson, et al. Imbalance-aware machine learning for predicting rare and common disease-associated non-coding variants. Sci Rep 2017; 7: 2959.
85. Polikar. Ensemble based systems in decision making. IEEE Circuits Syst Mag 2006; 6: 21–45.
86. Islam, Lima, Das, et al. A comprehensive survey on the process, methods, evaluation, and challenges of feature selection. IEEE Access 2022; 10: 99595–99632.
87. Breiman. Bagging predictors. Mach Learn 1996; 24: 123–140.
88. Sampath, Maurtua, Aguilar Martin, et al. A survey on generative adversarial networks for imbalance problems in computer vision tasks. J Big Data 2021; 8: 1–59.
89. Guo, et al. A boosting based ensemble learning algorithm in imbalanced data classification. Xitong Gongcheng Lilun yu Shijian/Syst Eng Theory Pract 2016; 36: 189–199.
90. Sagi, Rokach. Ensemble learning: a survey. Wiley Interdiscip Rev: Data Min Knowl Discovery 2018; 8: e1249.
91. Ghojogh, Crowley. The theory behind overfitting, cross validation, regularization, bagging, and boosting: tutorial. arXiv preprint arXiv:1905.12787, 2019.
92. Bao, Juan, et al. Boosted near-miss under-sampling on SVM ensembles for concept detection in large-scale imbalanced datasets. Neurocomputing 2016; 172: 198–206.
93. Haixiang, Yijing, Yanan, et al. BPSO-Adaboost-KNN ensemble learning algorithm for multi-class imbalanced data classification. Eng Appl Artif Intell 2016; 49: 176–193.
94. López, Del Río, Benítez, et al. Cost-sensitive linguistic fuzzy rule based classification systems under the MapReduce framework for imbalanced big data. Fuzzy Sets Syst 2015; 258: 5–38.
95. Zhang, Tan, et al. A cost-sensitive deep belief network for imbalanced classification. IEEE Trans Neural Networks Learn Syst 2018; 30: 109–122.
96. Siers, Islam. Class imbalance and cost-sensitive decision trees: a unified survey based on a core similarity. ACM Trans Knowl Discovery Data (TKDD) 2020; 15: 1–31.
97. Buda, Maki, Mazurowski. A systematic study of the class imbalance problem in convolutional neural networks. Neural Networks 2018; 106: 249–259.
98. Mazurowski, Habas, Zurada, et al. Training neural network classifiers for medical decision making: the effects of imbalanced datasets on classification performance. Neural Networks 2008; 21: 427–436.
99. Beyan, Fisher. Classifying imbalanced data sets using similarity based hierarchical decomposition. Pattern Recognit 2015; 48: 1653–1672.
100. Alibeigi, Hashemi, Hamzeh. DBFS: an effective density based feature selection scheme for small sample size and high dimensional imbalanced data sets. Data Knowl Eng 2012; 81: 67–103.
101. Perera, Patel. Learning deep features for one-class classification. IEEE Trans Image Process 2019; 28: 5450–5463.
102. Fernández, García, Galar, et al. Cost-sensitive learning. Learn Imbalanced Data Sets 2018: 63–78.
103. Fan, Stolfo, Zhang, et al. Adacost: misclassification cost-sensitive boosting. ICML 1999; 99: 97–105.
104. Sun, Song, Zhu, et al. A novel ensemble method for classifying imbalanced data. Pattern Recognit 2015; 48: 1623–1637.
105. Rong H-J, Huang G-B, Ong Y-S. Extreme learning machine for multi-categories classification applications. In: 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), 2008, pp.1709–1713: IEEE.
106. Devi, Biswas, Purkayastha. A review on solution to class imbalance problem: undersampling approaches. In: 2020 international conference on computational performance evaluation (ComPE), 2020, pp.626–631: IEEE.
107. Naghibi, Ahmadi, Daneshi. Application of support vector machine, random forest, and genetic algorithm optimized random forest models in groundwater potential mapping. Water Resour Manage 2017; 31: 2761–2775.
108. Bhowan, Johnston, Zhang, et al. Reusing genetic programming for ensemble selection in classification of unbalanced data. IEEE Trans Evol Comput 2013; 18: 893–908.
109. Bhowan, Johnston, Zhang, et al. Evolving diverse ensembles using genetic programming for classification with unbalanced data. IEEE Trans Evol Comput 2012; 17: 368–386.
110. Jackson, Stevens, Ren, et al. Extrapolating survival from randomized trials using external data: a review of methods. Med Decis Making 2017; 37: 377–390.
111. Soltanzadeh, Hashemzadeh. RCSMOTE: range-controlled synthetic minority over-sampling technique for handling the class imbalance problem. Inf Sci 2021; 542: 92–111.
112. Liang, et al. MSMOTE: improving classification performance when training data is imbalanced. In: 2009 second international workshop on computer science and engineering, 2009, vol. 2, pp.13–17: IEEE.
113. Sun, Dai. Two-stage cost-sensitive learning for data streams with concept drift and class imbalance. IEEE Access 2020; 8: 191942–191955.
114. Turlapati VPK, Prusty. Outlier-SMOTE: a refined oversampling technique for improved detection of COVID-19. Intell-based Med 2020; 3: 100023.
115. Malhotra, Kamal. An empirical study to investigate oversampling methods for improving software defect prediction using imbalanced data. Neurocomputing 2019; 343: 120–140.
116. Ghosh, Nag. An overview of radial basis function networks. Radial Basis Funct Networks 2: Adv Des 2001: 1–36.
117. Barua, Islam, Yao, et al. MWMOTE–Majority weighted minority oversampling technique for imbalanced data set learning. IEEE Trans Knowl Data Eng 2012; 26: 405–425.
118. Tahir, Kittler, Yan. Inverse random under sampling for class imbalance problem and its application to multi-label classification. Pattern Recognit 2012; 45: 3738–3750.
119. Ali, Salleh MNM, Saedudin, et al. Imbalance class problems in data mining: a review. Indones J Electr Eng Comput Sci 2019; 14: 1560–1571.
120. Gao, Hong, Chen, et al. A combined SMOTE and PSO based RBF classifier for two-class imbalanced problems. Neurocomputing 2011; 74: 3456–3466.
121. Shukla, Bhowmick. To improve classification of imbalanced datasets. In: 2017 International Conference on Innovations in Information, Embedded and Communication Systems (ICIIECS), 2017, pp.1–5: IEEE.
122. Singh, Gosain, Saha. Weighted k-nearest neighbor based data complexity metrics for imbalanced datasets. Stat Anal Data Min: ASA Data Sci J 2020; 13: 394–404.
123. Wang Y-C, Cheng C-H. A multiple combined method for rebalancing medical data with class imbalances. Comput Biol Med 2021; 134: 104527.
124. Maheshwari, Jain, Jadon. A review on class imbalance problem: analysis and potential solutions. Int J Comput Sci Issues (IJCSI) 2017; 14: 43–51.
125. Bania, Halder. R-Ensembler: a greedy rough set based ensemble attribute selection algorithm with kNN imputation for classification of medical data. Comput Methods Programs Biomed 2020; 184: 105122.
126. Breja, Yadav. Genre-based recommendation on community cloud using Apriori algorithm. In: Proceedings of international conference on machine intelligence and data science applications: MIDAS 2020, 2021.
127. Xu, et al. Small data machine learning in materials science. npj Comput Mater 2023; 9: 42.
