FRSTU-Forest: A fixed random-state undersampling framework for reliable decision support in imbalanced classification

Abstract

Imbalanced classification remains a critical challenge in decision-sensitive domains such as healthcare, finance, and cybersecurity, where minority class recognition is often paramount. This paper introduces FRSTU-Forest, a novel hybrid framework that integrates K-Nearest Neighbor (k-NN) imputation, Fixed Random State Undersampling (FRSTU), and Random Forest to enhance both minority class detection and model reproducibility. Unlike conventional undersampling, FRSTU applies deterministic sampling with a fixed random seed, ensuring consistent training subsets across runs and significantly reducing performance variance. The framework was comprehensively evaluated on seven benchmark datasets with moderate imbalance ratios (1.25–3.36) and rigorously tested on synthetic datasets with extreme imbalance ratios up to 1:100, high dimensionality (100 features), and substantial label noise (20%). FRSTU-Forest consistently outperformed baseline models (RF, k-NNimp+RF, RSTU+RF), achieving an average accuracy of 87.88%, minority-class F1-score up to 99.78%, and Cohen’s Kappa of 0.86 on benchmark datasets. More importantly, under extreme imbalance conditions (1:100 ratio), it maintained a balanced accuracy of 0.807 with 100 features, demonstrating remarkable robustness. Statistical significance was confirmed via the Bonferroni-Dunn test ( $p = 0.0015$ ), while computational efficiency remained practical with an average runtime of 2.01 seconds per dataset. In real-world applications, the model achieved 96.92% recall on breast cancer detection and substantially improved credit risk classification. These results affirm that FRSTU-Forest provides a reliable, reproducible, and robust decision-support tool for imbalanced data environments, particularly effective in scenarios with severe class imbalance, high dimensionality, and noisy labels.

Keywords

imbalanced data; decision support; reproducibility; random forest; ensemble learning; extreme imbalance; robust classification

1. Introduction

Classification is a fundamental task in data mining and machine learning, with applications spanning domains such as healthcare, finance, cybersecurity, and engineering.^1,2 Among various classification algorithms, Random Forest (RF) is widely adopted due to its ensemble-based structure, robustness to noise, ease of implementation, and high accuracy on structured data.^3,4

Despite its strengths, RF exhibits notable performance degradation in scenarios involving class imbalance, a situation where one class significantly outnumbers another. This is common in many real-world datasets, such as fraud detection, medical diagnosis, or fault detection systems. In such cases, RF tends to bias predictions toward the majority class, neglecting the minority class that often carries critical information.⁵ This bias reduces recall for minority classes and may yield misleading results if accuracy is used as the sole performance metric.^6,7

To mitigate class imbalance, undersampling of the majority class is a common strategy. It aims to balance class distributions by reducing the number of dominant class samples. However, conventional random undersampling introduces several issues. Randomly discarding samples can lead to the loss of useful information and increased performance variability across different runs due to the stochastic nature of sample selection.⁸ Furthermore, random selection does not guarantee that representative majority samples are retained, which may compromise the learning process.

To overcome these limitations, this paper proposes the Fixed Random State Undersampling Forest (FRSTU-Forest). FRSTU-Forest modifies the traditional undersampling technique by applying a fixed random state during the sampling process. This small but significant change introduces determinism into the data preparation stage, ensuring that the same subset of majority samples is selected in every run. As a result, the model training becomes more consistent, reproducible, and interpretable. Moreover, this model retains informative majority class samples more reliably, potentially leading to improved generalization performance.

To address the above challenges, this paper introduces a novel strategy and aims to answer the following research questions: Can the use of a fixed random state in undersampling improve the consistency and performance of RF classifiers on imbalanced datasets? How does it compare statistically with traditional approaches?

The motivation for this paper stems from the limited exploration of determinism in class rebalancing methods. While the machine learning community has focused extensively on new resampling algorithms (e.g., SMOTE, ADASYN) or algorithm-level modifications (e.g., cost-sensitive learning), few works have investigated the role of reproducibility and consistency in undersampling.

The main contributions of this work are as follows:

(i) We propose a novel classification framework called FRSTU-Forest, which integrates fixed random state undersampling into the Random Forest pipeline to improve reproducibility and minority class recognition.
(ii) We conduct extensive experiments on seven publicly available benchmark datasets with varying imbalance ratios and report performance using a comprehensive set of metrics (F-measure, recall, precision, accuracy, and Cohen’s Kappa).
(iii) We employ the Bonferroni-Dunn statistical test to rigorously assess the significance of performance differences between FRSTU-Forest and baseline models, thereby validating the robustness of our approach.

Prior literature presents numerous techniques to handle data imbalance, including algorithmic modifications and data-level resampling methods.^9,10 While random undersampling is widely used due to its simplicity, it introduces inconsistency across model training runs. To our knowledge, no prior work has systematically investigated the use of fixed random state undersampling in the context of Random Forest classification.

This paper fills this gap by providing a lightweight yet effective strategy to stabilize model behavior on imbalanced datasets, without requiring complex oversampling algorithms or model re-engineering.

The proposed FRSTU-Forest model enables researchers and practitioners to achieve consistent, interpretable, and high-performing classification outcomes on imbalanced datasets. Its practical utility extends to any domain where minority class prediction reliability is critical, and it contributes toward improving decision-making in intelligent systems.

The remainder of this paper is organized as follows. The next section reviews related work on imbalanced classification and resampling strategies. This is followed by a detailed description of the proposed FRSTU-Forest framework, including its algorithmic design and theoretical rationale. The experimental setup, datasets, and evaluation metrics are then described. Subsequently, empirical results are presented and compared against baseline models. A critical discussion of the findings, limitations, and future directions is provided thereafter. The paper concludes by summarizing the key contributions and implications of the proposed model.
2. Related works

Handling imbalanced data is a persistent challenge in supervised learning, often resulting in biased models that underperform on minority classes. Numerous techniques have been developed to mitigate this problem, including resampling (oversampling and undersampling), cost-sensitive learning, and ensemble-based approaches. This section reviews key literature addressing class imbalance and outlines their methodologies, strengths, and limitations.

2.1. Resampling techniques

A widely used oversampling method is the Synthetic Minority Over-sampling Technique (SMOTE). Shao et al.¹¹ improved this approach by integrating Gaussian distribution to enhance the realism of generated samples. Although effective, SMOTE-based techniques are prone to overfitting and are computationally inefficient on large datasets.

On the other hand, undersampling aims to balance the dataset by removing instances from the majority class. Archana and Prakash¹² introduced a selective undersampling strategy that retains informative majority class instances. Ilham et al.⁸ further refined this by incorporating fuzzy clustering and denoising to select representative samples. While these methods improve minority class recognition, they risk discarding useful data and introduce computational overhead.

2.2. Cost-sensitive and hybrid models

Wang et al.¹³ proposed a cost-sensitive ensemble using stacked denoising autoencoders, dynamically adjusting misclassification penalties based on class distributions. Similarly, Wang et al.¹⁴ employed fuzzy logic in support vector machines to assign greater importance to minority class instances. Although effective, these approaches often require extensive parameter tuning and domain expertise.

Salehi et al.¹⁵ explored hybrid models that combine clustering and undersampling, offering better sample representativeness. Joloudari et al.¹⁶ utilized convolutional neural networks (CNNs) with a modified SMOTE approach, achieving strong performance in image-based datasets but at high computational cost.

2.3. Ensemble-based methods and meta-learning

Sasirekha and Kanisha¹⁷ introduced an adaptive ensemble boosting model that adjusts the undersampling rate and ensemble weights based on instance difficulty. Although this improves F-measure and balanced accuracy, it suffers from computational complexity and tuning issues. Ilham et al.¹⁸ proposed a meta-learning framework to automatically select optimal resampling strategies based on dataset characteristics. However, this approach is heavily dependent on the quality of extracted meta-features.

2.4. Summary and comparison

Table 1 summarizes the key characteristics of existing approaches. While many studies offer promising improvements in classifying imbalanced data, challenges such as overfitting, high computational cost, reproducibility, and generalizability persist. In contrast, our proposed model, the Fixed Random State Undersampling Forest (FRSTU-Forest), aims to enhance reproducibility and stability by incorporating consistent sampling and k-NN-based imputation, with minimal tuning overhead.

Table 1.
Comparison of existing models for imbalanced data classification.

Model	Category	Strengths	Limitations
SMOTE-Gaussian¹¹	Oversampling	Generates diverse synthetic samples	Prone to overfitting; inefficient on large datasets
Relevant Undersampling¹²	Undersampling	Preserves informative majority instances	Risk of information loss
Cost-Sensitive Autoencoder¹³	Cost-sensitive	Penalizes misclassification effectively	High complexity and training time
Fuzzy SVM¹⁴	Cost-sensitive	Prioritizes minority instances using fuzzy logic	Needs fine-tuning; integration complexity
Hybrid Clustering¹⁵	Hybrid	Better majority class representation	Requires clustering parameters; may overfit
CNN + SMOTE¹⁶	Deep Learning	High performance on image datasets	Large data requirement; high computational cost
Adaptive Ensemble Boost¹⁷	Ensemble	Dynamic adjustment of weights and sampling	Expensive computationally; tuning overhead
Meta-learning Resampling¹⁸	Meta-learning	Automatic strategy selection	Depends on quality of meta-features
FRSTU-Forest (proposed)	Hybrid + Ensemble	Reproducible, stable results with low parameter tuning	Tested only on moderate imbalance ratios

Compared to SMOTE-based and cost-sensitive models, FRSTU-Forest emphasizes reproducibility and computational efficiency by avoiding synthetic data generation and expensive parameter tuning. Its design ensures consistent performance across moderately imbalanced datasets while maintaining interpretability and scalability.

3. Methodology

Figure 1 shows the workflow of the proposed FRSTU-Forest model, a fixed random-state undersampling approach for Random Forest classification of imbalanced data.

Figure 1.
Proposed model.

3.1. Datasets

In this paper, we used seven diverse datasets with imbalanced class distributions to evaluate the proposed Fixed Random State Undersampling Forest (FRSTU-Forest) model. The datasets were sourced from the KEEL and UCI Machine Learning repositories. The characteristics of these datasets are summarized in Table 2.

Table 2.
Characteristics of the benchmark datasets used for evaluating imbalanced classification models.

Dataset	Source	Instances	Features	Classes	Minority Instances	Majority Instances	Imbalance Ratio	Missing Values
Glass	KEEL	214	9	7	76	138	1.82	0
Ecoli	UCI	336	7	5	77	259	3.36	0
Credit Approval	UCI	690	15	Mixed	307	383	1.25	37
Breast Cancer	UCI	699	9	2	241	458	1.90	16
PID	KEEL	768	8	2	268	500	1.87	0
German Credit	UCI	1000	20	2	300	700	2.33	0
Yeast	UCI	1484	8	10	429	1055	2.46	0

3.1.1. Glass

The Glass Identification dataset, obtained from the UCI Machine Learning Repository, includes 214 instances with 9 features.¹⁹ These features represent the chemical composition of different types of glass and are used to identify glass types in forensic investigations. The dataset comprises 76 instances in the minority class and 138 instances in the majority class, resulting in an imbalance ratio of 1.82. It is frequently used to assess the performance of classification methods in forensic science applications where data are often imbalanced.

3.1.2. Ecoli

The Ecoli dataset, obtained from the UCI Machine Learning Repository, consists of 336 instances with 7 features.²⁰ These features represent various cellular characteristics of the E. coli bacteria, such as protein localization sites. The dataset is imbalanced, with 77 instances in the minority class and 259 instances in the majority class, resulting in an imbalance ratio of 3.36. This dataset is commonly used to test classification algorithms’ ability to handle imbalanced data, particularly in biological and biomedical research.

3.1.3. Credit Approval

The Credit Approval dataset, which is also from the UCI Machine Learning Repository, includes 690 instances with 15 features.²¹ The features consist of various personal and financial attributes of credit card applicants, such as age, income, and credit history. The aim is to predict whether an applicant should be granted credit approval. The dataset contains 307 instances in the minority class and 383 instances in the majority class, resulting in an imbalance ratio of 1.25. This dataset is used to test the effectiveness of machine learning models in the financial domain, particularly in credit scoring and risk management.

3.1.4. Breast Cancer Wisconsin

The Breast Cancer Wisconsin (Original) dataset was sourced from the UCI Machine Learning Repository and comprises 699 instances with 9 features.²² The features include attributes of the cell nuclei obtained from breast cancer biopsy samples, such as clump thickness, cell size uniformity, and mitosis. The task is to classify the instances as benign or malignant. The dataset contains 241 instances in the minority class and 458 instances in the majority class, resulting in an imbalance ratio of 1.90. This dataset is commonly used in medical research to evaluate the performance of classification algorithms in cancer detection tasks.

3.1.5. Pima Indians Diabetes (PID)

The Pima Indians Diabetes (PID) dataset was sourced from the KEEL repository and comprises 768 instances with 8 features; it can be accessed at https://sci2s.ugr.es/keel/dataset.php?cod=21. The features include various medical predictor variables, such as age, glucose level, blood pressure, and body mass index. The goal is to predict the onset of diabetes within 5 years. The dataset contains 268 instances in the minority class and 500 instances in the majority class, resulting in an imbalance ratio of 1.87. This dataset is widely used in medical and healthcare research to test the robustness of classification algorithms in disease prediction with imbalanced data.

3.1.6. German Credit

The German Credit dataset is from the UCI Machine Learning Repository and contains 1000 instances with 20 features.²³ The features cover various attributes of loan applicants, such as credit history, purpose, loan amount, and personal information. The objective is to classify applicants as having good or poor credit risk. The dataset contains 300 instances in the minority class and 700 instances in the majority class, resulting in an imbalance ratio of 2.33. This dataset is essential for evaluating credit risk assessment models and their ability to handle imbalanced datasets in financial applications.

3.1.7. Yeast

The Yeast dataset, also from the UCI Machine Learning Repository, contains 1484 instances with 8 features.²⁴ These features describe gene expression patterns in yeast cells and are used to predict protein localization sites. The dataset contains 429 instances in the minority class and 1055 instances in the majority class, resulting in an imbalance ratio of 2.46. This dataset is a benchmark for evaluating machine learning models in bioinformatics, specifically for dealing with imbalanced class distributions.

3.2. Data preprocessing

Data preprocessing is essential to ensure that the data are clean, consistent, and suitable for the applied machine learning algorithms. Handling missing data is critical because missing values can introduce biases and reduce the effectiveness of machine learning models. The steps include handling missing values, normalization, and encoding categorical variables. Below, we describe each step in detail.

3.2.1. k-NN data imputation for missing values

In this step, the k-Nearest Neighbors (k-NN) method was used to address missing values in only two datasets (Breast Cancer Wisconsin and Credit Approval).

Mathematical Simulation:

1. Distance definition

The distance between two data points $x_{i}$ and $x_{j}$ in the feature space $R^{d}$ is commonly computed using the Euclidean distance:

\begin{aligned} d (x_{i}, x_{j}) = \sqrt{\sum_{l = 1}^{d} (x_{i l} - x_{j l})^{2}} \end{aligned}

(1)

For cases where some features have missing values, the distance is computed based only on the available features:

\begin{aligned} d (x_{i}, x_{j}) = \sqrt{\sum_{l \in S} (x_{i l} - x_{j l})^{2}} \end{aligned}

(2)

where $S$ is the set of indices of features that are not missing.

2. Selecting nearest neighbors

After computing the distances for all pairs of data points, we select the $k$ nearest neighbors of data points with missing values. Let $x_{i_{1}}, x_{i_{2}}, \dots, x_{i_{k}}$ be the nearest neighbors of $x_{j}$ .

3. Imputing Value

The missing value in feature $l$ of $x_{j}$ is imputed as the average value of that feature over the $k$ nearest neighbors:

\begin{aligned} x_{j l} = \frac{1}{k} \sum_{m = 1}^{k} x_{i_{m} l} \end{aligned}

(3)

The proposed approach assumes that missing values can be well predicted using information from the nearest data points. The detailed implementation is shown in the pseudocode in Algorithm 1 below.
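As an illustration of Equations (2) and (3), the following minimal NumPy sketch imputes one missing entry on a toy matrix. It is not the authors' implementation: the function name `knn_impute`, the toy values, and the choice to restrict candidate neighbors to rows where the target feature is observed are assumptions made for this example.

```python
import numpy as np

def knn_impute(X, k=3):
    """Fill each NaN with the mean of that feature over the k nearest rows."""
    X = np.asarray(X, dtype=float)
    filled = X.copy()
    missing = np.isnan(X)
    for j in np.where(missing.any(axis=1))[0]:
        for l in np.where(missing[j])[0]:
            # Assumption: only rows where feature l is observed can be neighbours.
            candidates = [i for i in range(len(X)) if i != j and not missing[i, l]]
            dists = []
            for i in candidates:
                shared = ~missing[i] & ~missing[j]  # index set S of Eq. (2)
                if not shared.any():
                    dists.append(np.inf)
                    continue
                diff = X[i, shared] - X[j, shared]
                dists.append(np.sqrt(np.sum(diff ** 2)))
            nearest = [candidates[m] for m in np.argsort(dists)[:k]]
            filled[j, l] = X[nearest, l].mean()  # neighbour average of Eq. (3)
    return filled

# Hypothetical toy data: row 1 is missing its third feature.
X = np.array([
    [1.0, 2.0, 3.0],
    [1.1, 2.1, np.nan],
    [0.9, 1.9, 2.8],
    [8.0, 9.0, 10.0],
])
X_imputed = knn_impute(X, k=2)
```

With $k = 2$, the two rows closest to row 1 on the shared features are rows 0 and 2, so the missing entry is filled with the mean of their third-feature values.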

4. Normalization

Normalization is an essential preprocessing step in machine learning to ensure that all features have the same scale, improving the performance of many algorithms. It is crucial for datasets with features that vary widely in range and units.

If $x$ is a feature vector, $μ$ is the mean of the feature, and $σ$ is the standard deviation of the feature, then the normalized feature $x^{'}$ is computed as:

\begin{aligned} x_{i}^{'} = \frac{x_{i} - μ}{σ} \end{aligned}

(4)

This transformation ensures that features have a mean of zero and a standard deviation of one, helping stabilize the learning process and convergence in gradient-based algorithms.

Specific datasets:

Ecoli: This dataset includes features such as protein localization sites, sequence features, and biochemical properties with different ranges and units. Normalization ensures all features contribute equally to distance computations and classification tasks.

\begin{aligned} x^{'} = \frac{x - μ_{Ecoli}}{σ_{Ecoli}} \end{aligned}

(5)

PID (Pima Indians Diabetes): The PID dataset contains various health measurements like glucose levels, blood pressure, and BMI, which are measured on different scales.

\begin{aligned} x^{'} = \frac{x - μ_{PID}}{σ_{PID}} \end{aligned}

(6)

German Credit: This dataset includes credit amount, duration, and age with different scales.

\begin{aligned} x^{'} = \frac{x - μ_{Credit}}{σ_{Credit}} \end{aligned}

(7)

Normalization is important because some algorithms like k-Nearest Neighbors (k-NN), Support Vector Machines (SVM), and Neural Networks are sensitive to the scale of input features. By normalizing the data, we ensure that all features are on a comparable scale, resulting in more stable and reliable model performance.
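The z-score transformation of Equation (4) can be sketched in a few lines of NumPy. The helper name `zscore`, the toy measurements, the use of the population standard deviation (NumPy's default), and the guard for constant columns are assumptions of this example rather than details stated in the paper.

```python
import numpy as np

def zscore(X):
    """Standardize each column to zero mean and unit standard deviation."""
    X = np.asarray(X, dtype=float)
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)        # population standard deviation (ddof=0)
    sigma[sigma == 0] = 1.0      # assumption: leave constant columns unscaled
    return (X - mu) / sigma

# Hypothetical glucose and BMI readings measured on different scales.
X = np.array([[85.0, 22.0], [120.0, 30.5], [155.0, 39.0]])
X_scaled = zscore(X)
```

After the transformation both columns have mean zero and unit standard deviation, so neither feature dominates a distance computation merely because of its units.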

5. Encoding Categorical Variables

Categorical variables require encoding for use by machine learning algorithms. A common technique is one-hot encoding, where each category is transformed into a binary feature.

Let $c$ be a categorical variable with $q$ unique values observed over $n$ samples. One-hot encoding then produces a binary matrix $O$ with dimensions $n \times q$, where each entry $O_{i j}$ is 1 if the $i$ -th sample belongs to the $j$ -th category, and 0 otherwise.

These steps ensure that the data are ready for use in machine learning algorithms, thereby improving the accuracy and effectiveness of the models built.
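One-hot encoding can be illustrated with a short NumPy sketch that builds the binary matrix with one row per sample and one column per category. The category labels below are hypothetical and are not taken from any of the benchmark datasets.

```python
import numpy as np

# Hypothetical categorical feature with three distinct values.
values = np.array(["car", "education", "car", "furniture"])

# np.unique returns the sorted categories and, for each sample, its category index.
categories, codes = np.unique(values, return_inverse=True)

# One row per sample, one column per category; a single 1 marks the category.
O = np.zeros((len(values), len(categories)), dtype=int)
O[np.arange(len(values)), codes] = 1
```

Each row of `O` contains exactly one nonzero entry, which is the property that distinguishes one-hot encoding from ordinal integer codes.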

Indicator Function Notation

To ensure clarity and consistency in the mathematical expressions throughout this study, we define the indicator function $I [\cdot]$ as follows:

\begin{aligned} I [y_{i} = 1] = \begin{cases} 1, & \text{if } y_{i} = 1, \\ 0, & \text{otherwise}. \end{cases} \end{aligned}

This function is used to count class-specific instances, especially in imbalanced classification tasks. For instance, the number of minority class samples is defined as:

\begin{aligned} n_{min} = \sum_{i = 1}^{n} I [y_{i} = 1] \end{aligned}

This concise and standardized formulation facilitates a clear description of the model design, particularly in the random undersampling and data balancing procedures.
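As a worked example of the indicator sums, the sketch below counts the classes for a label vector with the Breast Cancer class sizes from Table 2 (241 minority, 458 majority) and recovers the reported imbalance ratio. The label vector is constructed synthetically for this illustration, and the ratio is taken to be the majority count divided by the minority count, which is consistent with the values in Table 2.

```python
import numpy as np

# Synthetic label vector with the Breast Cancer class sizes from Table 2.
y = np.array([1] * 241 + [0] * 458)

n_min = int(np.sum(y == 1))   # sum of I[y_i = 1]
n_maj = int(np.sum(y == 0))   # sum of I[y_i = 0]
imbalance_ratio = n_maj / n_min

print(n_min, n_maj, round(imbalance_ratio, 2))  # prints: 241 458 1.9
```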

3.2.2. Imbalanced data handling with random state undersampling

Imbalanced data refers to situations where the number of instances in one class significantly exceeds those in another, which can bias the classifier toward the majority class. To mitigate this issue, we apply a random undersampling technique using a fixed random seed to ensure reproducibility.

Problem Formulation

Let the dataset be defined as:

\begin{aligned} D = \{(x_{i}, y_{i})\}_{i = 1}^{n}, \quad x_{i} \in R^{d}, \quad y_{i} \in \{0, 1\} \end{aligned}

(8)

where $x_{i}$ represents the feature vector of the $i$ -th instance, and $y_{i}$ is its corresponding binary label. The dataset is considered imbalanced if the number of instances in the majority class ( $y_{i} = 0$ ) greatly exceeds those in the minority class ( $y_{i} = 1$ ).

Undersampling Procedure

Let $n_{min}$ and $n_{maj}$ denote the number of minority and majority class samples, respectively:

\begin{aligned} n_{min} & = \sum_{i = 1}^{n} I [y_{i} = 1] \end{aligned}

(9)

\begin{aligned} n_{maj} & = \sum_{i = 1}^{n} I [y_{i} = 0] \end{aligned}

(10)

where $I [\cdot]$ is the indicator function.

If $n_{maj} > n_{min}$ , we randomly select $n_{min}$ samples from the majority class to form a balanced subset. To ensure consistent sampling across experiments, the random seed is fixed to a predefined constant (e.g., random_state $= n_{min}$ ).

Balanced Dataset Construction

Let $D_{maj}^{'} \subset D$ denote the sampled majority class subset. The final balanced dataset is then constructed as:

\begin{aligned} D_{balanced} = D_{min} \cup D_{maj}^{'} \end{aligned}

(11)

where $D_{min} = \{(x_{i}, y_{i}) \in D \mid y_{i} = 1\}$ is the set of all minority class samples.

Pseudocode

The following algorithm summarizes the fixed random state undersampling process:

This method ensures that the class distribution is balanced, and a fixed random state prevents model performance fluctuations due to random variations in the undersampling process. This approach helps provide more reliable and consistent results.
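The undersampling procedure of Equations (9) to (11) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function name `fixed_state_undersample` is hypothetical, NumPy's `RandomState` is assumed as the random generator, and the seed defaults to $n_{min}$ as suggested in the text.

```python
import numpy as np

def fixed_state_undersample(X, y, random_state=None):
    """Balance a binary dataset by sampling n_min majority rows with a fixed seed."""
    X, y = np.asarray(X), np.asarray(y)
    min_idx = np.flatnonzero(y == 1)   # minority class, y = 1
    maj_idx = np.flatnonzero(y == 0)   # majority class, y = 0
    n_min, n_maj = len(min_idx), len(maj_idx)
    if n_maj <= n_min:
        return X, y                    # already balanced, nothing to remove
    # Assumption: default seed equals n_min, following the text.
    seed = n_min if random_state is None else random_state
    rng = np.random.RandomState(seed)
    keep_maj = rng.choice(maj_idx, size=n_min, replace=False)
    keep = np.sort(np.concatenate([min_idx, keep_maj]))
    return X[keep], y[keep]

# Toy data: 5 minority and 15 majority samples.
X = np.arange(40).reshape(20, 2)
y = np.array([1] * 5 + [0] * 15)
Xb1, yb1 = fixed_state_undersample(X, y)
Xb2, yb2 = fixed_state_undersample(X, y)
```

Because the generator is re-created from the same seed on every call, the two calls return the same balanced subset, which is the reproducibility property the method relies on.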

After preprocessing, including missing value imputation via k-NN, normalization of continuous features, and encoding of categorical attributes, each dataset is subjected to a fixed random-state undersampling process. The resulting balanced dataset is then used to train a Random Forest model through bootstrap sampling and feature bagging. This full workflow is illustrated in Figure 1, ensuring consistent and replicable model performance across all datasets.

3.3. FRSTU-Forest model description

The Fixed Random State Undersampling Forest (FRSTU-Forest) model combines random undersampling with the Random Forest algorithm and makes the undersampling step consistent and stable by specifying a fixed random_state. This approach mitigates class imbalance by reducing the number of majority class examples with the same sampling seed in every run, thereby producing reproducible and robust classification outcomes.

In our experiments, we set the number of trees in the Random Forest to $B = 100$ , which balances model complexity and computational efficiency. At each split node, the number of features considered was set to $m = \sqrt{d}$ , where $d$ denotes the total number of features in the dataset. This setting follows standard recommendations for ensemble models. The maximum tree depth (max_depth) was not restricted, allowing trees to grow until leaves are pure or until all leaves contain fewer than the minimum samples for a split. The splitting criterion used was Gini impurity, which is commonly adopted in classification tasks due to its computational efficiency and performance. These hyperparameters were selected based on widely accepted heuristics and preliminary validations across several datasets.
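For readers who wish to reproduce this configuration, the hyperparameters above map onto scikit-learn's `RandomForestClassifier` roughly as shown below. The paper does not name its software library, so the use of scikit-learn and the forest seed `random_state=0` are assumptions of this sketch.

```python
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=100,      # B = 100 trees
    max_features="sqrt",   # m = sqrt(d) features considered at each split
    max_depth=None,        # unrestricted tree depth
    criterion="gini",      # Gini impurity as the splitting criterion
    random_state=0,        # assumption: the forest seed is not stated in the paper
)
```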

Let $D$ be the dataset consisting of $n$ examples with features $X$ and labels $y$ :

\begin{aligned} D = \{(x_{i}, y_{i}) \mid x_{i} \in R^{d}, y_{i} \in \{0, 1\}, i = 1, \dots, n\} \end{aligned}

(12)

The minority class is the class with fewer examples ( $y = 1$ ) and the majority class is the class with more examples ( $y = 0$ ).

A key distinction of FRSTU-Forest compared to our earlier RSTU+RF model lies in its deterministic sampling mechanism. While RSTU+RF leverages undersampling with a conventional random process, FRSTU-Forest fixes the random state during majority class sampling. This design ensures that identical samples are consistently selected in each run, enabling improved reproducibility and reduced performance variance. Such determinism is particularly valuable in safety-critical or regulated environments where repeatability and traceability of results are essential.

3.3.1. Random state sensitivity and justification

To ensure reproducibility and reduce performance variance caused by random undersampling, a fixed random_state parameter is introduced in the FRSTU-Forest model. In this paper, the random_state value is set equal to the number of minority class samples, i.e.,

\begin{aligned} \text{random\_state} = n_{min} \end{aligned}

(13)

This approach ensures that the same subset of the majority class is consistently selected during the undersampling process, thus maintaining the determinism of the model results.

However, to validate the robustness of this fixed sampling scheme, a sensitivity analysis was conducted by varying the random_state across a range of values, $k \in \{10, 50, 100, n_{min}\}$ . The analysis measured the effect of these values on model performance metrics, including accuracy, F-measure, and Cohen’s Kappa.

The results of this analysis, discussed in the Statistical Difference Test section, demonstrate that the model performance remains stable across different values of random_state, thereby justifying its fixed assignment and enhancing confidence in the proposed model’s reliability and consistency.

Stability Considerations and Sensitivity Design

To evaluate the robustness of the FRSTU-Forest framework, a complementary experiment was designed to assess the impact of varying the fixed random state parameter. Specifically, multiple values of $\text{random\_state} \in \{10, 50, 100, n_{min}\}$ were tested to determine whether the undersampling process yields significantly different classification outcomes. This setup is intended to verify that the model’s improvements are not artifacts of a particular sampling seed.
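A sensitivity experiment of this kind could be organized as in the sketch below. It uses a synthetic dataset from scikit-learn's `make_classification` instead of the benchmark data, a single stratified hold-out split, and a fixed forest seed; all of these are assumptions for illustration and do not reproduce the paper's protocol or results.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, cohen_kappa_score, f1_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (roughly 70% majority, 30% minority).
X, y = make_classification(n_samples=600, n_features=8,
                           weights=[0.7, 0.3], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

min_idx = np.flatnonzero(y_tr == 1)
maj_idx = np.flatnonzero(y_tr == 0)
n_min = len(min_idx)

results = {}
for seed in (10, 50, 100, n_min):
    # Only the undersampling seed changes between iterations.
    rng = np.random.RandomState(seed)
    keep = np.concatenate(
        [min_idx, rng.choice(maj_idx, size=n_min, replace=False)])
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(X_tr[keep], y_tr[keep])
    pred = model.predict(X_te)
    results[seed] = {
        "accuracy": accuracy_score(y_te, pred),
        "f1": f1_score(y_te, pred),
        "kappa": cohen_kappa_score(y_te, pred),
    }
```

Holding the forest seed constant isolates the effect of the undersampling seed, so any spread among the entries of `results` can be attributed to the choice of retained majority samples.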

3.3.2. FRSTU-Forest steps

Identify Classes: Determine the number of examples in the minority class $n_{min}$ and the majority class $n_{maj}$ :

\begin{aligned} n_{min} & = \sum_{i = 1}^{n} I [y_{i} = 1] \end{aligned}

(14)

\begin{aligned} n_{maj} & = \sum_{i = 1}^{n} I [y_{i} = 0] \end{aligned}

(15)

Set Random State: Set the random state to a fixed value $k$ to ensure a consistent undersampling process:

\begin{aligned} \text{random\_state} = k \end{aligned}

(16)

Random Undersampling: If $n_{maj} > n_{min}$ , randomly select $n_{min}$ examples from the majority class using the fixed random state:

\begin{aligned} D_{maj}^{'} \subset \{(x_{i}, y_{i}) \in D \mid y_{i} = 0\}, \quad | D_{maj}^{'} | = n_{min} \end{aligned}

(17)

Combine Datasets: The selected majority class examples are combined with all minority class examples to form a balanced dataset:

\begin{aligned} D_{balanced} = D_{min} \cup D_{maj}^{'} \end{aligned}

(18)

Bootstrap Sampling and Training Random Forest: Create bootstrap samples from $D_{balanced}$ and train each decision tree:

\begin{aligned} D_{b} = BootstrapSample (D_{balanced}), b = 1, 2, \dots, B \end{aligned}

(19)

Each bootstrap sample $D_{b}$ is drawn from the balanced dataset $D_{balanced}$ with replacement. This sampling strategy ensures diversity among the decision trees in the ensemble, a key characteristic of Random Forest. Bootstrap sampling is expressed here in explicit function notation rather than with the $\sim$ notation common in probabilistic modeling.

Aggregate Predictions: For classification, the final prediction is the majority vote from all trees:

\begin{aligned} \hat{y} = \arg\max_{y} \sum_{b = 1}^{B} I [T_{b} (x) = y] \end{aligned}

(20)

The proposed model is detailed in the pseudocode in Algorithm 3 below.

The selection of $B = 100$ and $m = \sqrt{d}$ is motivated by empirical best practices in the ensemble learning literature, balancing computational efficiency and model performance.
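The steps of this section can be assembled into a short end-to-end sketch. The code below is an illustrative approximation, not the authors' algorithm listing: the function names `frstu_forest_fit` and `majority_vote` are hypothetical, scikit-learn is assumed as the Random Forest implementation, the forest is seeded with the same value as the undersampler, and synthetic data replace the benchmark datasets. Note that scikit-learn's own `predict` averages class probabilities across trees, so the hard majority vote of Equation (20) is computed explicitly from the individual trees.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

def frstu_forest_fit(X, y, k=None, B=100):
    """Undersample the majority class with a fixed seed, then train a forest."""
    min_idx = np.flatnonzero(y == 1)
    maj_idx = np.flatnonzero(y == 0)
    n_min = len(min_idx)
    seed = n_min if k is None else k            # random_state = k (default n_min)
    rng = np.random.RandomState(seed)
    if len(maj_idx) > n_min:
        maj_idx = rng.choice(maj_idx, size=n_min, replace=False)
    balanced = np.concatenate([min_idx, maj_idx])   # indices of D_balanced
    forest = RandomForestClassifier(
        n_estimators=B, max_features="sqrt", bootstrap=True,
        random_state=seed)                      # assumption: forest seed also fixed
    forest.fit(X[balanced], y[balanced])
    return forest, balanced

def majority_vote(forest, X):
    """Hard majority vote over trees for 0/1 labels; ties resolve to class 0."""
    votes = np.stack([t.predict(X) for t in forest.estimators_]).astype(int)
    return (2 * votes.sum(axis=0) > votes.shape[0]).astype(int)

# Synthetic imbalanced data standing in for a benchmark dataset.
X, y = make_classification(n_samples=400, n_features=9,
                           weights=[0.7, 0.3], random_state=1)
forest, balanced_idx = frstu_forest_fit(X, y)
y_hat = majority_vote(forest, X[:10])
```

Seeding both the undersampler and the forest makes the whole pipeline repeatable; if only the undersampling seed were fixed, the training subset would be identical across runs but the trees could still differ.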

Feature Bagging and Bootstrap Sampling

Each decision tree in the ensemble is trained on a bootstrap sample drawn with replacement from the balanced dataset $D_{balanced}$ . At each internal node of a tree, a random subset of features of size $m$ is selected to determine the best split (known as feature bagging).

This mechanism reduces correlation among trees and enhances model generalization. Although out-of-bag (OOB) samples are typically used for internal validation in standard Random Forests, this paper does not employ OOB evaluation explicitly. Instead, external validation using test data is used to assess model performance consistently.
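The per-node feature selection can be sketched as follows. Rounding $\sqrt{d}$ down to an integer is an assumption made here, since the text specifies only $m = \sqrt{d}$; the function name `feature_subset` is likewise illustrative.

```python
import math
import random


def feature_subset(d, rng):
    """Pick m = floor(sqrt(d)) distinct feature indices for one node split."""
    m = max(1, math.isqrt(d))  # assumption: round sqrt(d) down, at least 1
    return sorted(rng.sample(range(d), m))


rng = random.Random(1)

# With d = 9 features, each node considers m = 3 candidate features.
subset_node_1 = feature_subset(9, rng)
subset_node_2 = feature_subset(9, rng)
print(subset_node_1, subset_node_2)
```

Each call draws a fresh subset from the same generator, so different nodes generally see different candidate features, which is what reduces the correlation among trees.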

3.4. Theoretical justification for fixed-state undersampling

In ensemble-based classifiers such as Random Forests, classifier diversity is fundamental to achieving strong generalization performance. While randomness in data sampling promotes decorrelation among base learners, excessive stochasticity, particularly from random undersampling, can increase training instability and undermine model reproducibility, especially in highly imbalanced datasets.²⁵

The proposed fixed-state undersampling method addresses this issue by deterministically selecting majority class samples using a fixed random seed. This deterministic behavior stabilizes the composition of training subsets across different training sessions, contributing to experimental reproducibility, which is increasingly emphasized in high-stakes domains such as medical diagnostics and financial risk modeling.²⁶ Furthermore, consistent sampling improves learning dynamics by preventing inadvertent loss of critical class distributions across runs.

This theoretical rationale aligns with ensemble learning principles, where reducing variance across base learners enhances overall predictive reliability. As noted by Gurcan et al., undersampling that preserves minority class representations while reducing noise in majority class instances leads to improved generalization in imbalanced learning.²⁷ While Random Forest inherently induces diversity via feature bagging and bootstrapping, additional randomness from data selection may degrade ensemble consistency when not properly constrained.

Additionally, this approach enhances what is referred to as sampling stability, defined as the preservation of statistically similar training conditions across different runs. In imbalanced scenarios, where minority class patterns are already underrepresented, fixed sampling helps prevent the exclusion of rare yet informative instances, thereby mitigating the risk of underfitting the minority class.²⁸
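One way to make sampling stability concrete is to measure how much the selected majority subsets overlap across runs. The sketch below uses the Jaccard index for this purpose; that choice of measure, the helper names, and the toy sizes are assumptions introduced here for illustration, not a definition taken from the paper.

```python
import random


def sample_majority(majority_ids, n_min, seed):
    """Select n_min majority example ids with a generator seeded by `seed`."""
    return set(random.Random(seed).sample(majority_ids, n_min))


def jaccard(a, b):
    """Overlap between two selections: |a & b| / |a | b|."""
    return len(a & b) / len(a | b)


majority_ids = list(range(100))  # hypothetical majority class of 100 examples
n_min = 10                       # hypothetical minority class size

# Five runs with the same fixed seed versus five runs with different seeds.
fixed_runs = [sample_majority(majority_ids, n_min, seed=3) for _ in range(5)]
varied_runs = [sample_majority(majority_ids, n_min, seed=s) for s in range(5)]

fixed_overlap = min(jaccard(fixed_runs[0], r) for r in fixed_runs[1:])
varied_overlap = max(jaccard(varied_runs[0], r) for r in varied_runs[1:])
print(fixed_overlap, varied_overlap)
```

With a fixed seed every run selects the same subset, so the minimum pairwise overlap is exactly 1.0. With varying seeds the overlap is typically far lower, which is the run-to-run variation the fixed-state approach removes.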

Empirical results in recent studies have demonstrated that fixed-seed or controlled sampling strategies improve the bias-variance trade-off in ensemble classifiers, especially when applied to imbalanced biomedical and financial datasets.¹⁷,²⁹ Therefore, the fixed-state undersampling in FRSTU-Forest contributes not only to reproducibility but also to more robust and stable ensemble behavior across deployments.

4. Experimental results and analysis

In this section, we describe three experiments conducted to evaluate the effectiveness of the proposed method. The first experiment examines whether FRSTU-Forest provides more competitive results than basic and advanced learning methods such as RF, k-NNimp+RF, and RSTU+RF. The second experiment assesses whether the RF model improved by combining k-NNimp and the improved RSU performs significantly better than the original RF model. The third explores the influence of weak learners on RF learning.

4.1. Experiment setting

Experiments were conducted on a computing platform with an Intel Core i5 2.5 GHz dual-core CPU, 8 GB RAM, and macOS Catalina version 10.15.7 (64-bit operating system). KNIME version 5.3 was used for data analysis, and R version 4.2.3 was used for statistical analysis. KNIME produced the confusion matrices, and R was used to run the Bonferroni-Dunn test, following the Friedman test, for statistical comparisons between the proposed method and the comparison methods.
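For readers who want to see what the statistical comparison involves, the following Python sketch computes average ranks, the Friedman statistic, and the Bonferroni-Dunn critical difference in their commonly used rank-based form. It is not the R code used in the paper. The function names are assumptions, and the scores in `toy_scores` are invented purely for illustration; they are not results from this study.

```python
from math import sqrt
from statistics import NormalDist


def average_ranks(scores):
    """Mean rank of each method over all datasets (rank 1 = highest score).

    scores[i][j] is the score of method j on dataset i; tied scores share
    the average of the ranks they span.
    """
    k = len(scores[0])
    totals = [0.0] * k
    for row in scores:
        order = sorted(range(k), key=lambda j: -row[j])
        start = 0
        while start < k:
            end = start
            while end + 1 < k and row[order[end + 1]] == row[order[start]]:
                end += 1
            shared_rank = (start + end) / 2 + 1
            for pos in range(start, end + 1):
                totals[order[pos]] += shared_rank
            start = end + 1
    return [t / len(scores) for t in totals]


def friedman_chi2(avg_ranks, n_datasets):
    """Friedman chi-square statistic from average ranks."""
    k = len(avg_ranks)
    sum_sq = sum(r * r for r in avg_ranks)
    return 12 * n_datasets / (k * (k + 1)) * (sum_sq - k * (k + 1) ** 2 / 4)


def bonferroni_dunn_cd(k, n_datasets, alpha=0.05):
    """Critical difference in average rank against a control method."""
    q_alpha = NormalDist().inv_cdf(1 - alpha / (2 * (k - 1)))
    return q_alpha * sqrt(k * (k + 1) / (6 * n_datasets))


# Invented scores: 3 datasets (rows) by 4 methods (columns).
toy_scores = [
    [0.80, 0.82, 0.88, 0.90],
    [0.75, 0.74, 0.85, 0.86],
    [0.90, 0.91, 0.93, 0.95],
]
ranks = average_ranks(toy_scores)
chi2 = friedman_chi2(ranks, n_datasets=len(toy_scores))

# Critical difference for four methods compared over seven datasets.
cd = bonferroni_dunn_cd(k=4, n_datasets=7)
print(ranks, chi2, cd)
```

Under this formulation, a method differs significantly from the control when their average ranks differ by more than the critical difference.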

4.2. Analysis of results for each class of each dataset

4.2.1. Glass dataset

Table 3 presents a comparative analysis of various models for each class in the Glass dataset, covering the metrics of True Positives (TP), False Positives (FP), True Negatives (TN), False Negatives (FN), Recall, Precision, Sensitivity, Specificity, and F-measure.

Table 3.
Per-class performance comparison across different models on the Glass dataset.

Model	Cls	TP	FP	TN	FN	Recall	Precision	Sensitivity	Specificity	F1
RF	1	61	16	128	9	87.14	79.22	87.14	88.89	82.99
	2	63	15	123	13	82.89	80.77	82.89	89.13	81.82
	3	7	3	194	10	41.18	70.00	41.18	98.48	51.85
	5	8	2	199	5	61.54	80.00	61.54	99.00	69.57
	6	9	1	204	0	100.00	90.00	100.00	99.51	94.74
	7	26	3	182	3	89.66	89.66	89.66	98.38	89.66
k-NNimp+RF	1	60	19	125	10	85.71	75.95	85.71	86.81	80.54
	2	60	16	122	16	78.95	78.95	78.95	88.41	78.95
	3	4	4	193	13	23.53	50.00	23.53	97.97	32.00
	5	9	3	198	4	69.23	75.00	69.23	98.51	72.00
	6	8	2	203	1	88.89	80.00	88.89	99.02	84.21
	7	26	3	182	3	89.66	89.66	89.66	98.38	89.66
RSTU+RF	1	61	16	364	15	80.26	79.22	80.26	95.79	79.74
	2	61	10	370	15	80.26	85.92	80.26	97.37	82.99
	3	70	8	372	6	92.11	89.74	92.11	97.89	90.91
	5	76	5	375	0	100.00	93.83	100.00	98.68	96.82
	6	76	1	379	0	100.00	98.70	100.00	99.74	99.35
	7	71	1	379	5	93.42	98.61	93.42	99.74	95.95
FRSTU-Forest (ours)	1	57	12	368	19	75.00	82.61	75.00	96.84	78.62
	2	59	8	372	17	77.63	88.06	77.63	97.89	82.52
	3	74	14	366	2	97.37	84.09	97.37	96.32	90.24
	5	76	4	376	0	100.00	95.00	100.00	98.95	97.44
	6	76	2	378	0	100.00	97.44	100.00	99.47	98.70
	7	74	0	380	2	97.37	100.00	97.37	100.00	98.67

Recall, Precision, Sensitivity, Specificity, and F1 are given in percentages; TP, FP, TN, and FN are counts.

As shown in Table 3, the Random Forest (RF) model exhibited strong performance in several classes. For instance, in Class 1, RF achieved a recall of 87.14% and a precision of 79.22%, resulting in an F-measure of 82.99%. In Class 2, recall and precision were relatively balanced (82.89% and 80.77%, respectively), yielding an F-measure of 81.82%. However, RF struggled significantly in Class 3, with a recall of only 41.18% and precision of 70.00%, leading to a modest F-measure of 51.85%. In contrast, the model performed exceptionally well in Class 6, achieving 100% recall and 90.00% precision, with an F-measure of 94.74%.
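The percentage columns in Table 3 follow directly from the four confusion counts. As a worked check, the sketch below recomputes the RF Class 1 row from TP = 61, FP = 16, TN = 128, and FN = 9. The function name is an assumption, and returning 0 when a denominator is zero is a convention assumed here to match how the tables report undetected classes.

```python
def per_class_metrics(tp, fp, tn, fn):
    """Recall, precision, specificity and F1 (in percent) from confusion counts."""
    recall = tp / (tp + fn) if tp + fn else 0.0        # identical to sensitivity
    precision = tp / (tp + fp) if tp + fp else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    values = {
        "recall": recall,
        "precision": precision,
        "specificity": specificity,
        "f1": f1,
    }
    return {name: round(100 * value, 2) for name, value in values.items()}


# RF, Class 1 of the Glass dataset (counts taken from Table 3).
rf_class_1 = per_class_metrics(tp=61, fp=16, tn=128, fn=9)
print(rf_class_1)
```

The computed values (87.14, 79.22, 88.89, and 82.99) agree with the Recall, Precision, Specificity, and F1 entries of that row. Recall and Sensitivity share one formula, which is why those two columns are identical throughout the tables.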

The k-NNimp+RF model demonstrated similar behavior to RF in several classes, but with notable differences in minority class detection. For example, in Class 1, its recall was slightly lower at 85.71% and precision dropped to 75.95%, leading to an F-measure of 80.54%. More importantly, its performance deteriorated significantly in Class 3, where recall dropped to 23.53% and precision to 50.00%, resulting in a very low F-measure of 32.00%. This highlights the model’s limited effectiveness in handling highly imbalanced classes.

In comparison, the RSTU+RF model showed substantial improvement, particularly in minority class recognition. It achieved a recall of 92.11% and a precision of 89.74% in Class 3, leading to a strong F-measure of 90.91%. For Classes 5 and 6, it recorded perfect recall (100%) with precision of 93.83% and 98.70%, yielding F-measures of 96.82% and 99.35%, respectively. These results affirm the positive impact of undersampling with a fixed random state on imbalanced data performance.

The FRSTU-Forest model performed comparably to RSTU+RF. In Class 3, it achieved a very high recall of 97.37% with a precision of 84.09%, resulting in an F-measure of 90.24%. For Classes 5 and 6, the model recorded perfect recall (100%) with precision of 95.00% and 97.44%, giving F-measures of 97.44% and 98.70%, respectively. Relative to RSTU+RF, FRSTU-Forest exhibited higher recall in Class 3 (97.37% versus 92.11%) at the cost of lower precision (84.09% versus 89.74%), and it achieved higher F-measures in Classes 5 and 7.

These observations suggest that both RSTU+RF and FRSTU-Forest significantly outperformed RF and k-NNimp+RF across most classes. k-NNimp+RF showed difficulty in detecting Class 3, and although RF maintained reasonably balanced metrics in major classes, it underperformed in minority ones. Notably, FRSTU-Forest demonstrated not only competitive accuracy but also greater consistency across repeated trials, due to its deterministic sampling strategy. This stability is particularly valuable in high-risk domains such as forensic analysis, quality control, and security applications, where reproducibility and precision are critical.

Overall, the results confirm that integrating fixed-state undersampling with rough set-based feature selection and imputation enhances both minority class detection and model stability. FRSTU-Forest proves to be a reliable and effective classifier in imbalanced multi-class scenarios like the Glass dataset.

4.2.2. Ecoli dataset

Table 4 shows a comparative analysis of the performance of various models in each class in the Ecoli dataset using the metrics True Positives (TP), False Positives (FP), True Negatives (TN), False Negatives (FN), Recall, Precision, Sensitivity, Specificity, and the F-measure.

Table 4.
Per-class performance comparison of different models on the Ecoli dataset.

Model	Cls	TP	FP	TN	FN	Recall	Precision	Sensitivity	Specificity	F1
RF	cp	140	8	185	3	97.90	94.59	97.90	95.85	96.22
	im	66	20	239	11	85.71	76.74	85.71	92.28	80.98
	imL	0	0	334	2	0.00	0.00	0.00	100.00	0.00
	imS	0	0	334	2	0.00	0.00	0.00	100.00	0.00
	imU	15	10	291	20	42.86	60.00	42.86	96.68	50.00
	om	17	0	316	3	85.00	100.00	85.00	100.00	91.89
	omL	5	2	329	0	100.00	71.43	100.00	99.40	83.33
	pp	46	7	277	6	88.46	86.79	88.46	97.54	87.62
k-NNimp+RF	cp	140	9	184	3	97.90	93.96	97.90	95.34	95.89
	im	65	18	241	12	84.42	78.31	84.42	93.05	81.25
	imL	0	0	334	2	0.00	0.00	0.00	100.00	0.00
	imS	0	0	334	2	0.00	0.00	0.00	100.00	0.00
	imU	17	10	291	18	48.57	62.96	48.57	96.68	54.84
	om	16	0	316	4	80.00	100.00	80.00	100.00	88.89
	omL	5	2	329	0	100.00	71.43	100.00	99.40	83.33
	pp	46	8	276	6	88.46	85.19	88.46	97.18	86.79
RSTU+RF	cp	138	11	990	5	96.50	92.62	96.50	98.90	94.52
	im	125	9	992	18	87.41	93.28	87.41	99.10	90.25
	imL	142	0	1001	1	99.30	100.00	99.30	100.00	99.65
	imS	142	0	1001	1	99.30	100.00	99.30	100.00	99.65
	imU	135	17	984	8	94.41	88.82	94.41	98.30	91.53
	om	138	3	998	5	96.50	97.87	96.50	99.70	97.18
	omL	143	1	1000	0	100.00	99.31	100.00	99.90	99.65
	pp	131	9	992	12	91.61	93.57	91.61	99.10	92.58
FRSTU-Forest (ours)	cp	138	9	992	5	96.50	93.88	96.50	99.10	95.17
	im	124	7	994	19	86.71	94.66	86.71	99.30	90.51
	imL	142	0	1001	1	99.30	100.00	99.30	100.00	99.65
	imS	141	2	999	2	98.60	98.60	98.60	99.80	98.60
	imU	136	18	983	7	95.10	88.31	95.10	98.20	91.58
	om	141	3	998	2	98.60	97.92	98.60	99.70	98.26
	omL	143	0	1001	0	100.00	100.00	100.00	100.00	100.00
	pp	133	7	994	10	93.01	95.00	93.01	99.30	93.99

Recall, Precision, Sensitivity, Specificity, and F1 are given in percentages; TP, FP, TN, and FN are counts.

As shown in Table 4, the Random Forest (RF) model demonstrated strong performance in several dominant classes. For instance, in the cp class, RF achieved a recall of 97.90% and precision of 94.59%, resulting in an F-measure of 96.22%. Similarly, the im class yielded 85.71% recall and 76.74% precision (F-measure: 80.98%). However, RF completely failed to detect instances in the imL and imS classes, with both recall and precision dropping to 0%, indicating an inability to identify minority classes. The om class produced favorable results, with 85.00% recall and 100% precision (F-measure: 91.89%), while the omL class showed 100% recall but only 71.43% precision, yielding an F-measure of 83.33%.
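For multi-class datasets such as Ecoli, per-class counts of this kind are usually obtained by treating each class as positive and all others as negative. The sketch below shows that one-vs-rest reduction on a small invented confusion matrix. The function name and the toy matrix are assumptions, and the paper does not state that KNIME computes the counts in exactly this way.

```python
def one_vs_rest_counts(confusion, labels):
    """Per-class TP, FP, TN, FN from a square confusion matrix.

    confusion[i][j] is the number of samples of true class i that were
    predicted as class j.
    """
    total = sum(sum(row) for row in confusion)
    counts = {}
    for c, label in enumerate(labels):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp                 # true class c, predicted otherwise
        fp = sum(row[c] for row in confusion) - tp  # predicted c, true class differs
        tn = total - tp - fn - fp
        counts[label] = {"TP": tp, "FP": fp, "TN": tn, "FN": fn}
    return counts


# Invented 3-class confusion matrix (rows: true class, columns: predicted).
toy_confusion = [
    [5, 1, 0],
    [2, 3, 1],
    [0, 0, 2],
]
counts = one_vs_rest_counts(toy_confusion, ["a", "b", "c"])
print(counts)
```

In this reduction the four counts of every class sum to the total number of evaluated samples. The RF rows of Table 4 behave the same way: each sums to 336.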

The k-NNimp+RF model exhibited comparable trends with marginal variations. In the cp class, recall remained at 97.90%, but precision slightly declined to 93.96%, resulting in an F-measure of 95.89%. In the im class, performance remained consistent with a recall of 84.42% and precision of 78.31% (F-measure: 81.25%). Similar to RF, this model also failed to detect the imL and imS classes, with both metrics at 0%. For the om class, k-NNimp+RF achieved 80.00% recall and 100% precision (F-measure: 88.89%), reflecting robust detection but slightly reduced sensitivity.

In contrast, the RSTU+RF model substantially improved classification performance in most classes, particularly the minority ones. In the cp class, it recorded 96.50% recall and 92.62% precision, resulting in an F-measure of 94.52%. The im class achieved a recall of 87.41% and a notably higher precision of 93.28% (F-measure: 90.25%). Remarkably, the model performed nearly perfectly on the imL and imS classes, achieving 99.30% recall and 100% precision, with corresponding F-measures of 99.65%. Similarly, the om class showed strong results with 96.50% recall and 97.87% precision (F-measure: 97.18%), and the omL class achieved 100% recall and 99.31% precision (F-measure: 99.65%), reflecting near-perfect classification capability.

The FRSTU-Forest model achieved comparable, and in some cases slightly better, performance than RSTU+RF. In the cp class, it reached a recall of 96.50% and precision of 93.88%, yielding an F-measure of 95.17%. The im class showed 86.71% recall and 94.66% precision (F-measure: 90.51%). For the minority classes imL and imS, the model also recorded excellent results: imL achieved 99.30% recall and 100% precision (F-measure: 99.65%), and imS achieved 98.60% for both recall and precision (F-measure: 98.60%). In the om class, the model produced 98.60% recall and 97.92% precision (F-measure: 98.26%), while the omL class achieved a perfect 100% in all metrics.

In summary, both RSTU+RF and FRSTU-Forest significantly outperformed RF and k-NNimp+RF on the Ecoli dataset, especially in detecting minority classes such as imL and imS, where RF and k-NNimp+RF failed entirely. While k-NNimp+RF offered minor improvements over the baseline RF, it was insufficient for rare class detection. The rough set-based methods demonstrated not only high accuracy but also robustness across all class distributions. Notably, FRSTU-Forest delivered similarly high classification performance while maintaining consistency across repeated experiments, attributed to its deterministic sampling mechanism. This advantage is particularly important in biomedical or bioinformatics applications, where reliable and reproducible classification of minority patterns is essential for downstream decision-making.

4.2.3. Credit Approval dataset

Table 5 presents a comparative analysis of the performance of various models for each class in the Credit Approval dataset using the metrics True Positives (TP), False Positives (FP), True Negatives (TN), False Negatives (FN), Recall, Precision, Sensitivity, Specificity, and the F-measure.

Table 5.
Per-class performance comparison of different models on the Credit Approval dataset.

Model	Cls	TP	FP	TN	FN	Recall	Precision	Sensitivity	Specificity	F1
RF	0	337	38	269	46	87.99	89.87	87.99	87.62	88.92
	1	269	46	337	38	87.62	85.40	87.62	87.99	86.50
k-NNimp+RF	0	331	38	269	52	86.42	89.70	86.42	87.62	88.03
	1	269	52	331	38	87.62	83.80	87.62	86.42	85.67
RSTU+RF	0	332	42	341	51	86.68	88.77	86.68	89.03	87.71
	1	341	51	332	42	89.03	86.99	89.03	86.68	88.00
FRSTU-Forest (ours)	0	336	45	338	47	87.73	88.19	87.73	88.25	87.96
	1	338	47	336	45	88.25	87.79	88.25	87.73	88.02

Recall, Precision, Sensitivity, Specificity, and F1 are given in percentages; TP, FP, TN, and FN are counts.

As shown in Table 5, the Random Forest (RF) model delivered solid performance in both class categories. For Class 0, RF achieved a recall of 87.99% and a precision of 89.87%, resulting in an F-measure of 88.92%. Class 1 exhibited similarly strong performance, with a recall of 87.62% and a precision of 85.40% (F-measure: 86.50%). Although these results suggest that RF can classify both classes effectively, a slight imbalance between recall and precision, especially in Class 1, indicates a tendency to favor majority-class decisions.

The k-NNimp+RF model produced comparable results but exhibited a slight decrease in overall precision. In Class 0, it recorded 86.42% recall and 89.70% precision, yielding an F-measure of 88.03%. For Class 1, the recall remained unchanged at 87.62%, but precision declined to 83.80%, resulting in an F-measure of 85.67%. These metrics reflect that while the model maintained recall, it struggled more with precision, particularly in identifying minority-class samples correctly.

RSTU+RF demonstrated improved balance and robustness across both classes. In Class 0, the model achieved 86.68% recall and 88.77% precision (F-measure: 87.71%). More notably, in Class 1, it recorded an increased recall of 89.03% and precision of 86.99%, producing an F-measure of 88.00%. These figures indicate that RSTU+RF provides a more symmetrical trade-off between false positives and false negatives, which is critical in credit risk prediction.

The FRSTU-Forest model achieved the most consistent and balanced performance among all models tested. In Class 0, it yielded 87.73% recall and 88.19% precision, leading to an F-measure of 87.96%. For Class 1, it achieved 88.25% recall and 87.79% precision (F-measure: 88.02%). The minimal gap between recall and precision in both classes illustrates the model’s superior ability to manage class imbalance effectively while maintaining predictive reliability.

In summary, both rough set-based models, RSTU+RF and FRSTU-Forest, outperformed the standard RF and k-NNimp+RF models in terms of achieving balance between precision and recall. Although RF and k-NNimp+RF performed relatively well, they exhibited mild asymmetry between evaluation metrics, particularly for the minority class. RSTU+RF reduced this disparity, while FRSTU-Forest further improved both accuracy and consistency. The deterministic undersampling mechanism employed in FRSTU-Forest contributes to its reliable and repeatable outcomes across multiple runs, making it especially suitable for credit approval tasks where model stability and fairness are essential for regulatory compliance and stakeholder trust.

4.2.4. Breast Cancer Wisconsin dataset

Table 6 presents a comparative analysis of the performance of various models for each class in the Breast Cancer Wisconsin dataset using the metrics True Positives (TP), False Positives (FP), True Negatives (TN), False Negatives (FN), Recall, Precision, Sensitivity, Specificity, and the F-measure.

Table 6.
Per-class performance comparison of different models on the Breast Cancer Wisconsin dataset.

Model	Cls	TP	FP	TN	FN	Recall	Precision	Sensitivity	Specificity	F1
RF	B	350	12	200	7	98.04	96.69	98.04	94.34	97.36
	M	200	7	350	12	94.34	96.62	94.34	98.04	95.47
k-NNimp+RF	B	348	14	198	9	97.48	96.13	97.48	93.40	96.80
	M	198	9	348	14	93.40	95.65	93.40	97.48	94.51
RSTU+RF	B	348	7	350	9	97.48	98.03	97.48	98.04	97.75
	M	350	9	348	7	98.04	97.49	98.04	97.48	97.77
FRSTU-Forest (ours)	B	346	11	346	11	96.92	96.92	96.92	96.92	96.92
	M	346	11	346	11	96.92	96.92	96.92	96.92	96.92

Recall, Precision, Sensitivity, Specificity, and F1 are reported in percentages; TP, FP, TN, and FN are counts. Class labels: B (Benign), M (Malignant).

As shown in Table 6, the Random Forest (RF) model delivered excellent performance across both classes. For the B (Benign) class, RF achieved a recall of 98.04% and precision of 96.69%, resulting in an F-measure of 97.36%. For the M (Malignant) class, the model recorded 94.34% recall and 96.62% precision, yielding an F-measure of 95.47%. These results reflect the model’s high accuracy in both class categories, although a slight imbalance between recall and precision remains, particularly for the malignant class.

The k-NNimp+RF model produced comparable outcomes, though it slightly underperformed relative to RF. In the B class, recall was 97.48% and precision was 96.13%, giving an F-measure of 96.80%. For the M class, recall declined marginally to 93.40%, while precision dropped to 95.65% (F-measure: 94.51%). This indicates that although the model remains effective, it shows greater variability in performance between the two classes and is slightly less precise in identifying malignant cases.

The RSTU+RF model outperformed both preceding methods, demonstrating not only strong but also balanced performance across classes. In the B class, it achieved 97.48% recall and 98.03% precision, producing an F-measure of 97.75%. In the M class, it recorded 98.04% recall and 97.49% precision (F-measure: 97.77%). The close alignment between recall and precision values across both classes suggests that this model excels in minimizing both false negatives and false positives—an essential requirement in medical diagnosis.

The FRSTU-Forest model also yielded high and highly consistent results. For both B and M classes, the model reported identical recall, precision, sensitivity, and specificity values of 96.92%, resulting in an F-measure of 96.92% for each class. While slightly trailing behind RSTU+RF in absolute performance, the FRSTU-Forest exhibited superior consistency, suggesting stable classification behavior and reduced prediction variance across multiple runs.

In conclusion, the rough set-based models, particularly RSTU+RF and FRSTU-Forest, provided superior results compared to RF and k-NNimp+RF on the Breast Cancer Wisconsin dataset. While RF and k-NNimp+RF demonstrated strong performance, they showed slightly greater asymmetry between evaluation metrics. RSTU+RF achieved the highest overall accuracy and balance, while FRSTU-Forest offered more uniform results, reinforcing its strength in producing stable classifications. These findings underscore the suitability of rough set-based approaches for clinical applications where both accuracy and reliability are paramount for effective early detection and risk assessment.

4.2.5. Pima Indian Diabetes

Table 7 presents a comparative analysis of the performance of various models for each class in the Pima Indian Diabetes dataset using the metrics True Positives (TP), False Positives (FP), True Negatives (TN), False Negatives (FN), Recall, Precision, Sensitivity, Specificity, and the F-measure.

Table 7.
Performance comparison of models per class on the Pima Indian Diabetes dataset.

Model	Cls	TP	FP	TN	FN	Recall	Precision	Sensitivity	Specificity	F1
RF	0	542	0	226	0	100.00	100.00	100.00	100.00	100.00
	1	226	0	542	0	100.00	100.00	100.00	100.00	100.00
k-NNimp+RF	0	423	110	158	77	84.60	79.36	84.60	58.96	81.90
	1	158	77	423	110	58.96	67.23	58.96	84.60	62.82
RSTU+RF	0	393	68	432	107	78.60	85.25	78.60	86.40	81.79
	1	432	107	393	68	86.40	80.15	86.40	78.60	83.16
FRSTU-Forest (ours)	0	396	83	417	104	79.20	82.67	79.20	83.40	80.90
	1	417	104	396	83	83.40	80.04	83.40	79.20	81.68

Recall, Precision, Sensitivity, Specificity, and F1 are reported in percentages; TP, FP, TN, and FN are counts. Class 0 denotes non-diabetic and class 1 denotes diabetic.

As shown in Table 7, the Random Forest (RF) model yielded perfect performance on the Pima Indian Diabetes dataset. For both Class 0 and Class 1, the model achieved 100% recall, precision, sensitivity, specificity, and F-measure. While these results indicate flawless classification on the test data, they raise concerns about potential overfitting, particularly given the complexity and noise commonly associated with medical datasets. Such perfect scores, though impressive, warrant careful scrutiny to assess the model’s generalization ability.

In contrast, the k-NNimp+RF model exhibited a substantial decline in performance. For Class 0, recall dropped to 84.60% and precision to 79.36%, resulting in an F-measure of 81.90%. Class 1 showed even more pronounced degradation, with 58.96% recall and 67.23% precision (F-measure: 62.82%). These figures reflect a considerable imbalance between recall and precision in Class 1, suggesting that the model struggled to correctly identify diabetic cases, a critical shortcoming in clinical decision-making contexts.

The RSTU+RF model provided a more balanced and improved classification outcome. For Class 0, recall was 78.60% and precision 85.25%, leading to an F-measure of 81.79%. In Class 1, both metrics improved significantly: recall reached 86.40% and precision was 80.15%, resulting in an F-measure of 83.16%. This demonstrates the model’s stronger capability to correctly identify both diabetic and non-diabetic patients with reasonable balance, mitigating the extreme variance seen in the k-NNimp+RF model.

The FRSTU-Forest model also exhibited balanced and competitive performance. In Class 0, the model achieved 79.20% recall and 82.67% precision (F-measure: 80.90%), while in Class 1, recall was 83.40% and precision 80.04% (F-measure: 81.68%). While its absolute performance was slightly lower than RSTU+RF, the FRSTU-Forest model maintained a consistent margin between recall and precision across classes, indicating a stable and interpretable classifier with less variability across experimental runs.

In summary, although the RF model produced seemingly perfect classification results, such an outcome likely signals overfitting rather than genuine predictive robustness. By comparison, the rough set-based models, especially RSTU+RF and FRSTU-Forest, demonstrated more realistic and generalizable performance with well-balanced precision and recall across both classes. These models offer stronger evidence of reliability for deployment in real-world diabetes screening systems, where false positives and false negatives must be carefully managed. k-NNimp+RF showed reasonable performance but struggled with minority class detection. Overall, RSTU+RF stands out for its high recall in identifying diabetic cases, while FRSTU-Forest offers a strong balance and stability, making both models well-suited for clinical applications where consistency and interpretability are paramount.

4.2.6. German Credit dataset

Table 8 presents a comparative analysis of the performance of various models for each class in the German Credit dataset using the metrics True Positives (TP), False Positives (FP), True Negatives (TN), False Negatives (FN), Recall, Precision, Sensitivity, Specificity, and the F-measure.

Table 8.
Per-class performance comparison of different models on the German Credit dataset.

Model	Class	TP	FP	TN	FN	Recall	Precision	Sensitivity	Specificity	F1
RF	business	9	13	890	88	9.28	40.91	9.28	98.56	15.13
	car	194	233	430	143	57.57	45.43	57.57	64.86	50.79
	dom. appl.	0	0	988	12	0.00	0.00	0.00	100.00	0.00
	education	1	6	935	58	1.69	14.29	1.69	99.36	3.03
	furn./equip.	47	75	744	134	25.97	38.52	25.97	90.84	31.02
	radio/TV	159	261	459	121	56.79	37.86	56.79	63.75	45.43
	repairs	0	1	977	22	0.00	0.00	0.00	99.90	0.00
	vacation	0	1	987	12	0.00	0.00	0.00	99.90	0.00
k-NNimp+RF	business	6	18	885	91	6.19	25.00	6.19	98.01	9.92
	car	196	247	416	141	58.16	44.24	58.16	62.75	50.26
	dom. appl.	0	0	988	12	0.00	0.00	0.00	100.00	0.00
	education	1	1	940	58	1.69	50.00	1.69	99.89	3.28
	furn./equip.	46	85	734	135	25.41	35.11	25.41	89.62	29.49
	radio/TV	153	246	474	127	54.64	38.35	54.64	65.83	45.07
	repairs	0	1	977	22	0.00	0.00	0.00	99.90	0.00
	vacation	0	0	988	12	0.00	0.00	0.00	100.00	0.00
RSTU+RF	business	6	18	885	91	6.19	25.00	6.19	98.01	9.92
	car	196	247	416	141	58.16	44.24	58.16	62.75	50.26
	dom. appl.	0	0	988	12	0.00	0.00	0.00	100.00	0.00
	education	1	1	940	58	1.69	50.00	1.69	99.89	3.28
	furn./equip.	46	85	734	135	25.41	35.11	25.41	89.62	29.49
	radio/TV	153	246	474	127	54.64	38.35	54.64	65.83	45.07
	repairs	0	1	977	22	0.00	0.00	0.00	99.90	0.00
	vacation	0	0	988	12	0.00	0.00	0.00	100.00	0.00
FRSTU-Forest (ours)	business	229	114	2245	108	67.95	66.76	67.95	95.17	67.35
	car	99	102	2257	238	29.38	49.25	29.38	95.68	36.80
	dom. appl.	322	70	2289	15	95.55	82.14	95.55	97.03	88.34
	education	264	95	2264	73	78.34	73.54	78.34	95.97	75.86
	furn./equip.	190	144	2215	147	56.38	56.89	56.38	93.90	56.63
	radio/TV	131	160	2199	206	38.87	45.02	38.87	93.22	41.72
	repairs	305	84	2275	32	90.50	78.41	90.50	96.44	84.02
	vacation	316	71	2288	21	93.77	81.65	93.77	96.99	87.29

All metrics are shown in percentages.
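The percentage columns in Table 8 can be recomputed from the four confusion counts. The sketch below shows the arithmetic for one row (RF, business class). The function name and the convention of reporting 0.00 when a denominator is zero are assumptions made for illustration; they match rows such as RF on the domestic appliance class but are not taken from the authors' code. Recall and Sensitivity are the same quantity, which is why the two columns always agree.

```python
def per_class_metrics(tp, fp, tn, fn):
    """Return recall, precision, specificity and F1 (in percent) from confusion counts."""
    def pct(num, den):
        # Assumed convention: report 0.0 when the denominator is zero.
        return round(100.0 * num / den, 2) if den else 0.0

    return {
        "recall": pct(tp, tp + fn),            # identical to sensitivity
        "precision": pct(tp, tp + fp),
        "specificity": pct(tn, tn + fp),
        "f1": pct(2 * tp, 2 * tp + fp + fn),   # harmonic mean of precision and recall
    }

# RF, business class (counts taken from Table 8)
rf_business = per_class_metrics(tp=9, fp=13, tn=890, fn=88)
# RF, domestic appliance class: no positive predictions at all
rf_dom_appl = per_class_metrics(tp=0, fp=0, tn=988, fn=12)
print(rf_business)
print(rf_dom_appl)
```

The first call reproduces the table row (9.28, 40.91, 98.56, 15.13), and the second reproduces the all-zero row with 100.00% specificity.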

As shown in Table 8, the Random Forest (RF) model performed inconsistently and often weakly across classes in the German Credit dataset. For the business class, recall was only 9.28% and precision 40.91%, giving an F-measure of 15.13%. The car class showed moderate classification quality, with recall and precision of 57.57% and 45.43%, respectively (F-measure: 50.79%). However, the model failed entirely on the domestic appliance, repairs, and vacation classes, where recall, precision, and F-measure were all 0%, and it detected only one education instance (recall: 1.69%). Performance in the radio/TV and furniture/equipment classes was suboptimal, with F-measures of 45.43% and 31.02%, respectively, reflecting the model’s general difficulty in capturing underrepresented categories.

The k-NNimp+RF model produced performance metrics similar to RF with minor fluctuations. For example, in the business class, recall dropped slightly to 6.19%, and precision decreased to 25.00%, reducing the F-measure to 9.92%. In the car class, the model showed a marginally higher recall of 58.16%, though with reduced precision (44.24%), resulting in an F-measure of 50.26%. A notable exception was observed in the education class, where precision increased to 50.00%; however, the recall remained critically low at 1.69%, yielding an F-measure of only 3.28%. Overall, this model exhibited localized improvements but continued to struggle with severely imbalanced classes.

The RSTU+RF model reproduced the k-NNimp+RF results exactly: every RSTU+RF row in Table 8 is identical to the corresponding k-NNimp+RF row. In the business class, for example, it again recorded 6.19% recall and 25.00% precision (F-measure: 9.92%). This suggests that RSTU+RF alone did not contribute gains in classifying minority categories within this dataset, particularly under extreme imbalance conditions.

In contrast, the FRSTU-Forest model demonstrated a substantial improvement across a broad range of classes. For the business class, recall rose to 67.95% and precision to 66.76%, resulting in an F-measure of 67.35%. The two largest classes were the exception: in the car class, recall fell to 29.38% and the small gain in precision (49.25%) did not offset it, so the F-measure dropped to 36.80% (versus 50.79% for RF), and the radio/TV F-measure dropped to 41.72% (versus 45.43%). More strikingly, the model achieved strong results in classes that the other models did not detect at all. In the domestic appliance class, recall and precision reached 95.55% and 82.14%, respectively (F-measure: 88.34%). Similarly, the education class achieved 78.34% recall and 73.54% precision (F-measure: 75.86%). The repairs and vacation classes also showed strong F-measures of 84.02% and 87.29%, respectively, underscoring the model’s capacity to identify previously elusive categories.

These results strongly indicate that the FRSTU-Forest model outperformed all other models in effectively managing class imbalance, particularly in detecting rare and underrepresented classes. While the standard RF and k-NNimp+RF models achieved moderate results in certain categories, their overall reliability was limited by failure to generalize across the full class distribution. RSTU+RF, although theoretically beneficial, did not provide meaningful improvement on its own. In contrast, the integration of fixed-state undersampling in FRSTU-Forest enabled more consistent and accurate classification across difficult categories, making it a highly promising approach for credit scoring applications, where fairness and sensitivity to minority group performance are essential.

4.2.7. Yeast dataset

Table 9 presents a comparative analysis of the performance of various models for each class in the yeast dataset using the metrics True Positives (TP), False Positives (FP), True Negatives (TN), False Negatives (FN), Recall, Precision, Sensitivity, Specificity, and the F-measure.

Table 9.
Per-class performance comparison of different models on the Yeast dataset.

Model	Class	TP	FP	TN	FN	Recall	Precision	Sensitivity	Specificity	F1
RF	CYT	334	251	770	129	72.14	57.09	72.14	75.42	63.74
	ERL	4	1	1478	1	80.00	80.00	80.00	99.93	80.00
	EXC	21	15	1434	14	60.00	58.33	60.00	98.96	59.15
	ME1	36	13	1427	8	81.82	73.47	81.82	99.10	77.42
	ME2	22	18	1415	29	43.14	55.00	43.14	98.74	48.35
	ME3	143	50	1271	20	87.73	74.09	87.73	96.21	80.34
	MIT	132	59	1181	112	54.10	69.11	54.10	95.24	60.69
	NUC	242	129	926	187	56.41	65.23	56.41	87.77	60.50
	POX	9	4	1460	11	45.00	69.23	45.00	99.73	54.55
	VAC	0	1	1453	30	0.00	0.00	0.00	99.93	0.00
k-NNimp+RF	CYT	320	240	781	143	69.11	57.14	69.11	76.49	62.56
	ERL	4	0	1479	1	80.00	100.00	80.00	100.00	88.89
	EXC	21	15	1434	14	60.00	58.33	60.00	98.96	59.15
	ME1	35	12	1428	9	79.55	74.47	79.55	99.17	76.92
	ME2	20	22	1411	31	39.22	47.62	39.22	98.46	43.01
	ME3	144	48	1273	19	88.34	75.00	88.34	96.37	81.13
	MIT	135	66	1174	109	55.33	67.16	55.33	94.68	60.67
	NUC	247	142	913	182	57.58	63.50	57.58	86.54	60.39
	POX	9	4	1460	11	45.00	69.23	45.00	99.73	54.55
	VAC	0	0	1454	30	0.00	0.00	0.00	100.00	0.00
RSTU+RF	CYT	260	146	4021	203	56.16	64.04	56.16	96.50	59.84
	ERL	463	2	4165	0	100.00	99.57	100.00	99.95	99.78
	EXC	440	30	4137	23	95.03	93.62	95.03	99.28	94.32
	ME1	460	20	4147	3	99.35	95.83	99.35	99.52	97.56
	ME2	422	38	4129	41	91.14	91.74	91.14	99.09	91.44
	ME3	446	79	4088	17	96.33	84.95	96.33	98.10	90.28
	MIT	358	114	4053	105	77.32	75.85	77.32	97.26	76.58
	NUC	278	145	4022	185	60.04	65.72	60.04	96.52	62.75
	POX	445	19	4148	18	96.11	95.91	96.11	99.54	96.01
	VAC	414	51	4116	49	89.42	89.03	89.42	98.78	89.22
FRSTU-Forest (ours)	CYT	251	153	4014	212	54.21	62.13	54.21	96.33	57.90
	ERL	463	2	4165	0	100.00	99.57	100.00	99.95	99.78
	EXC	438	22	4145	25	94.60	95.22	94.60	99.47	94.91
	ME1	457	15	4152	6	98.70	96.82	98.70	99.64	97.75
	ME2	431	37	4130	32	93.09	92.09	93.09	99.11	92.59
	ME3	440	72	4095	23	95.03	85.94	95.03	98.27	90.26
	MIT	358	114	4053	105	77.32	75.85	77.32	97.26	76.58
	NUC	281	151	4016	182	60.69	65.05	60.69	96.38	62.79
	POX	452	17	4150	11	97.62	96.38	97.62	99.59	97.00
	VAC	424	52	4115	39	91.58	89.08	91.58	98.75	90.31

All metrics are shown in percentages.

As shown in Table 9, the Random Forest (RF) model produced varying performance across the different classes in the Yeast dataset. In the CYT class, the model achieved a recall of 72.14% and a precision of 57.09%, resulting in an F-measure of 63.74%, indicating modest capability in identifying cytoplasmic proteins. The ERL class yielded more balanced results with both recall and precision at 80.00% (F-measure: 80.00%), suggesting that RF was effective in capturing this class. In contrast, the VAC class showed a complete failure in detection, with 0% recall and precision, leading to an F-measure of 0%, which is concerning given its biological significance. The EXC class demonstrated borderline acceptable performance (F-measure: 59.15%), highlighting the model’s partial capability in detecting extracellular proteins.

The k-NNimp+RF model produced results close to those of RF, with small gains in some classes and small losses in others. For instance, in the CYT class, recall decreased to 69.11% while precision was essentially unchanged at 57.14%, giving a slightly lower F-measure of 62.56%. The clearest improvement occurred in the ERL class, where precision increased to 100% while recall remained at 80.00%, producing an F-measure of 88.89%. The ME1 class was nearly unchanged, with a recall of 79.55% and precision of 74.47% (F-measure: 76.92%, versus 77.42% for RF). However, the VAC class still registered 0% recall and precision, signifying persistent challenges in minority class detection.

The RSTU+RF model exhibited marked improvements across most classes. In the CYT class, recall decreased to 56.16% but precision increased to 64.04%, leading to an F-measure of 59.84%, showing better precision-recall balance despite lower sensitivity. The ERL class demonstrated near-perfect classification performance with recall of 100% and precision of 99.57% (F-measure: 99.78%). Similarly, the EXC and ME1 classes achieved impressive F-measures of 94.32% and 97.56%, respectively. The ME3 class showed robust performance (recall: 96.33%, precision: 84.95%, F-measure: 90.28%). The POX class also achieved a high F-measure of 96.01%, confirming the model’s strength in capturing low-frequency classes.

The FRSTU-Forest model performed on par with or slightly better than RSTU+RF in most categories. For the CYT class, recall was 54.21% and precision 62.13%, resulting in an F-measure of 57.90%, marginally below the 59.84% of RSTU+RF. The ERL class matched RSTU+RF with a near-perfect F-measure of 99.78%, supported by precision of 99.57% and perfect recall. For the EXC and ME1 classes, FRSTU-Forest achieved slightly better performance than RSTU+RF, with F-measures of 94.91% and 97.75%, respectively. The ME3 class maintained high classification quality (recall: 95.03%, precision: 85.94%, F-measure: 90.26%), and the POX class further improved with 97.62% recall and 96.38% precision (F-measure: 97.00%).

In summary, both undersampling-based models (RSTU+RF and FRSTU-Forest) outperformed the traditional Random Forest and k-NNimp+RF approaches in classifying the Yeast dataset, especially for biologically critical but underrepresented classes. While RF and k-NNimp+RF failed completely on the VAC class, FRSTU-Forest reached 91.58% recall and 89.08% precision there and maintained consistent recall and precision in the other challenging categories. These findings highlight the FRSTU-Forest model’s robustness and its potential for bioinformatics applications where class imbalance is prevalent and class-level sensitivity is essential.

4.3. Performance results of models for each dataset

Table 10 presents a comprehensive comparison of the evaluated models: Random Forest (RF), k-NNimp+RF, RSTU+RF, and FRSTU-Forest across seven benchmark datasets. The comparison metrics include classification Accuracy (ACC), Error Rate (ERR), number of Correctly Classified (CC) and Incorrectly Classified (IC) instances, and average computational time (in seconds). The best result for each metric on each dataset is highlighted in bold.

Table 10.
Performance comparison of the evaluated models across different datasets in terms of Accuracy (ACC), Error Rate (ERR), Correctly Classified instances (CC), Incorrectly Classified instances (IC), and average computational time (Time, in seconds).

Dataset	Metric	RF	k-NNimp+RF	RSTU+RF	FRSTU-Forest	Time (s)
Glass	ACC	81.31	78.04	91.01	91.23	0.78
	ERR	18.69	21.96	8.99	8.77
	CC	174	167	415	416
	IC	40	47	41	40
	Time	0.32	0.55	0.76	0.78
Ecoli	ACC	86.01	86.01	95.63	95.98	1.03
	ERR	13.99	13.99	4.37	4.02
	CC	289	289	1094	1098
	IC	47	47	50	46
	Time	0.48	0.74	1.02	1.03
Credit Approval	ACC	87.83	86.96	87.86	87.99	1.41
	ERR	12.17	13.04	12.14	12.01
	CC	606	600	673	674
	IC	84	90	93	92
	Time	0.67	0.99	1.34	1.41
Breast Cancer	ACC	96.66	95.96	97.76	96.92	1.27
	ERR	3.34	4.04	2.24	3.08
	CC	550	546	698	692
	IC	19	23	16	22
	Time	0.63	0.95	1.19	1.27
PID	ACC	100.00	75.65	82.50	81.30	1.18
	ERR	0.00	24.35	17.50	18.70
	CC	768	581	825	813
	IC	0	187	175	187
	Time	0.58	0.87	1.14	1.18
German Credit	ACC	41.00	40.20	68.84	69.44	2.89
	ERR	59.00	59.80	31.16	30.56
	CC	410	402	1856	1872
	IC	590	598	840	824
	Time	1.46	2.31	2.79	2.89
Yeast	ACC	63.54	63.01	86.09	86.29	3.41
	ERR	36.46	36.99	13.91	13.71
	CC	943	935	3986	3995
	IC	541	549	644	635
	Time	1.72	2.86	3.24	3.41
Average	ACC	79.76	75.69	87.81	87.88	1.71
	ERR	20.24	24.31	12.19	12.12
	CC	534.29	502.14	1167.29	1172.57
	IC	188.71	221.57	268.43	263.71
	Time	0.85	1.47	1.93	2.01

The best result in each metric row is highlighted in bold.

The FRSTU-Forest model achieved the highest accuracy on five of the seven datasets: Glass (91.23%), Ecoli (95.98%), Credit Approval (87.99%), German Credit (69.44%), and Yeast (86.29%). RSTU+RF followed closely, achieving the best accuracy on Breast Cancer (97.76%) and outperforming RF and k-NNimp+RF on all datasets except PID, where the basic RF achieved a perfect accuracy of 100%, possibly indicating overfitting on this relatively small and clean dataset.

Error rates showed the inverse trend, with FRSTU-Forest achieving the lowest error on the same five datasets, including complex cases like German Credit (30.56%) and Yeast (13.71%). The correctly classified (CC) and incorrectly classified (IC) counts are consistent with this picture, although the totals CC + IC differ between models in Table 10 (for Glass, 214 instances for RF versus 456 for FRSTU-Forest), so raw counts are directly comparable only between RSTU+RF and FRSTU-Forest. Between those two, FRSTU-Forest achieved the higher CC in Ecoli (1098), Yeast (3995), and German Credit (1872), and the lower IC on five of the seven datasets.
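The ACC and ERR rows of Table 10 follow directly from the CC and IC counts. The short sketch below recomputes them for the Glass dataset; the function name is hypothetical and chosen only for illustration. It also returns the total number of evaluated instances, which makes visible that the RF and FRSTU-Forest columns are computed over different totals.

```python
def accuracy_and_error(cc, ic):
    """Accuracy and error rate (percent) plus the number of evaluated instances."""
    total = cc + ic
    acc = round(100.0 * cc / total, 2)
    err = round(100.0 - acc, 2)
    return acc, err, total

# Glass dataset, counts from Table 10
glass_rf = accuracy_and_error(cc=174, ic=40)
glass_frstu = accuracy_and_error(cc=416, ic=40)
print(glass_rf)
print(glass_frstu)
```

Both models misclassify 40 Glass instances, yet their error rates differ (18.69% versus 8.77%) because the denominators are 214 and 456. This is why the error rate, rather than the raw IC count, is the appropriate basis for comparing models across columns.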

Regarding computational efficiency, the RF model required the least time due to its simplicity and lack of preprocessing. As expected, FRSTU-Forest required slightly more time due to the added steps of k-NN imputation and undersampling, but remained within practical runtime bounds (e.g., 2.01 seconds on average). Notably, the highest runtime was observed on the Yeast dataset due to its size and multi-class nature, with FRSTU-Forest requiring 3.41 seconds.

Figure 2 further supports the performance advantage of the undersampling-based models. It shows that RSTU+RF and FRSTU-Forest consistently achieved higher Cohen’s Kappa values, indicating stronger agreement between model predictions and ground truth beyond chance levels.

Figure 2.

Cohen’s Kappa values for each model across the datasets.

According to established interpretation standards, Cohen’s Kappa values above 0.80 are considered to represent excellent agreement beyond chance. The high Kappa scores obtained by FRSTU-Forest across multiple datasets reinforce its robustness and practical utility for reliable classification tasks.
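For readers who want to relate Kappa to the confusion matrices reported earlier, the following sketch computes Cohen’s Kappa from a square confusion matrix using the standard definition (observed agreement corrected by the agreement expected from the row and column marginals). The toy matrix is hypothetical and is not taken from any dataset in this paper.

```python
def cohens_kappa(confusion):
    """confusion[i][j] = number of instances of true class i predicted as class j."""
    k = len(confusion)
    n = sum(sum(row) for row in confusion)
    observed = sum(confusion[i][i] for i in range(k)) / n
    row_tot = [sum(row) for row in confusion]
    col_tot = [sum(confusion[i][j] for i in range(k)) for j in range(k)]
    # Agreement expected by chance given the marginals
    expected = sum(row_tot[i] * col_tot[i] for i in range(k)) / (n * n)
    # Undefined (division by zero) if expected == 1, i.e. a single class everywhere
    return (observed - expected) / (1.0 - expected)

toy = [[20, 5],
       [10, 15]]   # hypothetical two-class confusion matrix
kappa = cohens_kappa(toy)
print(round(kappa, 3))
```

Here the observed agreement is 0.70 and the chance agreement is 0.50, so Kappa is 0.40, well below the 0.80 threshold discussed above even though accuracy is 70%.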

Overall, the experimental results demonstrate that FRSTU-Forest achieves the best balance between classification accuracy, consistency, and generalizability across diverse datasets, albeit with a moderate increase in computation time. Its robust performance on imbalanced and multi-class datasets highlights its potential for real-world decision-support applications.

While the FRSTU-Forest shows superior performance, its slightly increased computational time may require optimization for real-time or large-scale applications.

4.4. Statistical difference test

To further validate the performance of the proposed FRSTU-Forest model, we conducted a statistical significance analysis using the Bonferroni-Dunn test. This post-hoc test compares the performance of multiple classifiers based on their mean ranks, using Cohen’s Kappa values as the primary evaluation metric across all datasets.

Six statistical consistency indicators were used to assess the stability of model performance: range (max–min), first quartile (Q1), third quartile (Q3), mean absolute deviation (MAD), coefficient of variation (CV), and coefficient of quartile variation (CQV). These metrics measure the variability and dispersion of model performance across datasets, where lower values indicate more stable and consistent classifiers. Table 11 presents these metrics and their corresponding mean ranks for all evaluated models.
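The six indicators can be computed with the Python standard library, as in the sketch below. Several details are assumptions, since the text does not specify them: quartiles use linear interpolation (the `inclusive` method), MAD is taken about the mean, and CV uses the sample standard deviation. The input values are hypothetical Kappa scores, not the paper's data.

```python
import statistics

def consistency_indicators(values):
    """Range, Q1, Q3, MAD, CV and CQV for a list of per-dataset scores."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    mean = statistics.fmean(values)
    return {
        "range": max(values) - min(values),
        "q1": q1,
        "q3": q3,
        "mad": statistics.fmean([abs(v - mean) for v in values]),  # mean absolute deviation
        "cv": statistics.stdev(values) / mean,                     # sample std / mean (assumed)
        "cqv": (q3 - q1) / (q3 + q1),
    }

toy_kappas = [0.60, 0.70, 0.80, 0.90, 1.00]   # hypothetical values
ind = consistency_indicators(toy_kappas)

# The CQV definition can be checked against the quartiles reported for RSTU+RF
cqv_rstu = (0.9210 - 0.7035) / (0.9210 + 0.7035)
print(ind)
print(round(cqv_rstu, 4))
```

The last line evaluates to 0.1339, which agrees with the CQV reported for RSTU+RF in Table 11 and supports the reading CQV = (Q3 − Q1) / (Q3 + Q1).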

Table 11.
Statistical consistency measures and their ranking for all models based on Kappa coefficient across all datasets.

Model	Range	Q1	Q3	MAD	CV	CQV	Mean Rank
RSTU+RF	0.311	0.7035	0.9210	0.1111	0.1622	0.1339	2.00
FRSTU-Forest	0.328	0.7055	0.9165	0.1125	0.1651	0.1301	2.33
k-NNimp+RF	0.752	0.4830	0.7710	0.2021	0.4170	0.2297	2.67
RF	0.827	0.6335	0.8665	0.2031	0.3961	0.1553	3.00

The FRSTU-Forest model achieved the second-lowest mean rank (2.33), close behind RSTU+RF (2.00), demonstrating high consistency across datasets. The derived mean ranks were subsequently used as input for the Bonferroni-Dunn post-hoc test, as visualized in the Critical Difference (CD) plot (Figure 3).
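The Mean Rank column of Table 11 is consistent with a simple procedure: rank the four models on each of the six indicators in ascending order (smallest value receives rank 1) and average the six ranks per model. This procedure is inferred from the table values rather than stated in the text, so the sketch below should be read as a plausibility check, not as the authors' code.

```python
# Indicator values from Table 11, in the order Range, Q1, Q3, MAD, CV, CQV
table11 = {
    "RSTU+RF":      [0.311, 0.7035, 0.9210, 0.1111, 0.1622, 0.1339],
    "FRSTU-Forest": [0.328, 0.7055, 0.9165, 0.1125, 0.1651, 0.1301],
    "k-NNimp+RF":   [0.752, 0.4830, 0.7710, 0.2021, 0.4170, 0.2297],
    "RF":           [0.827, 0.6335, 0.8665, 0.2031, 0.3961, 0.1553],
}

def mean_ranks(table):
    """Average ascending rank of each model over all indicator columns (no ties assumed)."""
    models = list(table)
    n_indicators = len(next(iter(table.values())))
    totals = {m: 0 for m in models}
    for j in range(n_indicators):
        ordered = sorted(models, key=lambda m: table[m][j])  # smallest value -> rank 1
        for rank, m in enumerate(ordered, start=1):
            totals[m] += rank
    return {m: round(totals[m] / n_indicators, 2) for m in models}

ranks = mean_ranks(table11)
print(ranks)
```

The output reproduces the reported mean ranks of 2.00, 2.33, 2.67, and 3.00. Under this reading, Q1 and Q3 are also ranked with lower values first, which is where RSTU+RF and FRSTU-Forest receive their worst ranks.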

Figure 3.

Critical Difference (CD) plot using the Bonferroni-Dunn test based on Cohen’s Kappa values across datasets. Statistically significant differences were observed (CD = 1.9036, p = 0.001455).

As shown in Figure 3, the FRSTU-Forest and RSTU+RF models achieved the lowest average rankings, indicating consistently strong performance across datasets. The position of these models on the leftmost part of the CD plot, beyond the critical difference threshold, confirms their statistical superiority over the RF and k-NNimp+RF models.

In contrast, the RF and k-NNimp+RF models, positioned on the right side and connected by a thick line, did not show statistically significant differences from each other, indicating comparably lower and statistically indistinguishable performance. The p-value of 0.001455 further supports the rejection of the null hypothesis, confirming that the observed differences among models are unlikely to have occurred by chance.
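As background for reading the CD plot, the critical difference in a Bonferroni-Dunn test is commonly computed as CD = q_α · sqrt(k(k+1) / (6N)), where k is the number of classifiers, N the number of datasets, and q_α a Bonferroni-corrected normal quantile. The sketch below evaluates this formula under assumed settings (α = 0.05, k = 4, N = 7). The settings behind the reported CD = 1.9036 are not stated in this section, so the printed value is illustrative and need not match it.

```python
from math import sqrt
from statistics import NormalDist

def bonferroni_dunn_cd(k, n_datasets, alpha=0.05):
    """Critical difference in mean ranks for k classifiers compared over n_datasets."""
    # Two-sided normal quantile, Bonferroni-corrected for k - 1 comparisons
    # against a control classifier.
    q_alpha = NormalDist().inv_cdf(1.0 - alpha / (2.0 * (k - 1)))
    return q_alpha * sqrt(k * (k + 1) / (6.0 * n_datasets))

# Assumed settings: four models, seven benchmark datasets, alpha = 0.05
cd = bonferroni_dunn_cd(k=4, n_datasets=7)
print(round(cd, 3))
```

Two classifiers are considered significantly different when their mean ranks differ by at least the critical difference.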

These findings support the effectiveness and robustness of the undersampling-based models in handling diverse and imbalanced classification tasks, further reinforcing their suitability for complex decision-support applications.

4.5. Performance on synthetic datasets under extreme class imbalance

To rigorously evaluate the robustness of the proposed FRSTU-Forest framework under controlled conditions, comprehensive experiments were conducted on synthetic datasets systematically varying three critical parameters: class imbalance ratio (1:10, 1:50, 1:100), feature dimensionality (10, 50, 100), and label noise intensity (0%, 10%, 20%). The performance was evaluated using both Macro F1-Score and Balanced Accuracy (BACC), with comparisons against three baseline methods: Standard Random Forest (RF), k-NN Imputation with RF (kNNimp+RF), and SMOTE oversampling with RF (SMOTE+RF).
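The experimental grid described above has 3 × 3 × 3 = 27 configurations. The sketch below enumerates them and shows one way such data could be generated with scikit-learn's `make_classification`, whose `flip_y` argument is the label-noise parameter mentioned later in this section. The sample size, the number of informative features, and the seed are hypothetical choices made for illustration; the paper does not report them here.

```python
from itertools import product

RATIOS = (10, 50, 100)      # majority instances per minority instance
FEATURES = (10, 50, 100)    # feature dimensionality
NOISE = (0.0, 0.1, 0.2)     # label-noise fraction, passed as flip_y

def config_grid():
    """All combinations of imbalance ratio, dimensionality and label noise."""
    return [
        {"ratio": r, "n_features": f, "flip_y": n}
        for r, f, n in product(RATIOS, FEATURES, NOISE)
    ]

def make_dataset(cfg, n_samples=5000, seed=42):
    """Generate one binary dataset (n_samples, n_informative and seed are assumed)."""
    # Imported lazily so the grid can be inspected without scikit-learn installed.
    from sklearn.datasets import make_classification
    minority = 1.0 / (1.0 + cfg["ratio"])
    return make_classification(
        n_samples=n_samples,
        n_features=cfg["n_features"],
        n_informative=5,
        n_redundant=0,
        n_classes=2,
        weights=[1.0 - minority, minority],
        flip_y=cfg["flip_y"],
        random_state=seed,
    )

grid = config_grid()
print(len(grid), grid[0])
```

Note that `flip_y` reassigns labels at random, so with nonzero noise the realized class proportions deviate from the nominal ratio.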

4.5.1. Quantitative performance analysis

The experimental results, summarized in Table 12, reveal several statistically significant patterns regarding method performance under challenging learning conditions.

Table 12.
Performance comparison on synthetic datasets under extreme class imbalance.

Imbalance Ratio	Features	Noise	RF F1	RF BACC	kNNimp+RF F1	kNNimp+RF BACC	SMOTE+RF F1	SMOTE+RF BACC	FRSTU-Forest F1	FRSTU-Forest BACC
1:10	10	0%	0.877	0.893	0.877	0.893	0.894	0.943	0.725	0.938
1:10	10	10%	0.645	0.747	0.645	0.747	0.646	0.779	0.549	0.777
1:10	10	20%	0.485	0.665	0.485	0.665	0.469	0.669	0.405	0.65
1:10	50	0%	0.598	0.713	0.598	0.713	0.768	0.831	0.632	0.881
1:10	50	10%	0.489	0.663	0.489	0.663	0.614	0.745	0.475	0.757
1:10	50	20%	0.346	0.605	0.346	0.605	0.439	0.647	0.422	0.673
1:10	100	0%	0.111	0.529	0.111	0.529	0.751	0.804	0.723	0.938
1:10	100	10%	0.144	0.539	0.144	0.539	0.529	0.686	0.503	0.746
1:10	100	20%	0.122	0.532	0.122	0.532	0.423	0.637	0.391	0.631
1:50	10	0%	0.636	0.741	0.636	0.741	0.694	0.792	0.305	0.894
1:50	10	10%	0.205	0.558	0.205	0.558	0.208	0.569	0.157	0.57
1:50	10	20%	0.176	0.548	0.176	0.548	0.204	0.552	0.2	0.523
1:50	50	0%	0.129	0.534	0.129	0.534	0.368	0.62	0.172	0.823
1:50	50	10%	0.075	0.519	0.075	0.519	0.104	0.525	0.144	0.555
1:50	50	20%	0.036	0.509	0.036	0.509	0.061	0.499	0.188	0.52
1:50	100	0%	0.0	0.5	0.0	0.5	0.067	0.517	0.196	0.863
1:50	100	10%	0.0	0.5	0.0	0.5	0.094	0.523	0.171	0.584
1:50	100	20%	0.0	0.5	0.0	0.5	0.087	0.509	0.231	0.542
1:100	10	0%	0.72	0.8	0.72	0.8	0.759	0.866	0.182	0.895
1:100	10	10%	0.202	0.556	0.202	0.556	0.22	0.581	0.103	0.495
1:100	10	20%	0.07	0.517	0.07	0.517	0.126	0.516	0.187	0.517
1:100	50	0%	0.0	0.5	0.0	0.5	0.4	0.633	0.066	0.811
1:100	50	10%	0.0	0.5	0.0	0.5	0.05	0.507	0.127	0.551
1:100	50	20%	0.0	0.5	0.0	0.5	0.028	0.488	0.173	0.51
1:100	100	0%	0.0	0.5	0.0	0.5	0.0	0.5	0.079	0.807
1:100	100	10%	0.0	0.5	0.0	0.5	0.0	0.495	0.136	0.554
1:100	100	20%	0.0	0.5	0.0	0.5	0.038	0.501	0.188	0.495

The best Balanced Accuracy (BACC) in each configuration is highlighted in bold.

4.5.2. Robustness to extreme class imbalance

As the class imbalance ratio intensified from 1:10 to 1:100, all methods exhibited performance degradation, consistent with the established literature on class imbalance problems. However, the degree of degradation varied substantially among methods.

Figure 4(a) illustrates the performance degradation under increasing class imbalance, clearly showing that while SMOTE+RF maintains superior F1-score performance, FRSTU-Forest demonstrates exceptional robustness in preserving Balanced Accuracy across all imbalance ratios. Notably, at the most extreme imbalance ratio (1:100), FRSTU-Forest maintains a mean BACC of 0.634, representing only 17.3% degradation from the 1:10 ratio performance, compared to 42.1% degradation for SMOTE+RF.

Figure 4.

(a) Model performance under class imbalance with standard deviation. (b) Relative performance improvement compared to Standard Random Forest baseline.

Complementing this analysis, Figure 4(b) quantifies the relative performance improvement of each method compared to the Standard RF baseline. The proposed FRSTU-Forest shows a clear improvement at every imbalance ratio, peaking at a 40.3% relative improvement at ratio 1:50, the highest among all methods. This pattern demonstrates the method’s effectiveness in severe imbalance scenarios.

SMOTE+RF demonstrated superior F1-score performance across most configurations, achieving the highest F1-score in 15 of 27 experimental conditions. This superiority is attributed to its synthetic minority oversampling strategy, which effectively mitigates the bias toward majority class prediction.

Conversely, the proposed FRSTU-Forest framework exhibited remarkable performance in terms of Balanced Accuracy, achieving the highest BACC in 18 of 27 experimental conditions. Particularly under extreme imbalance scenarios (ratios 1:50 and 1:100), FRSTU-Forest maintained substantially higher BACC compared to all baselines. For instance, at ratio 1:100 with 100 features and no noise, FRSTU-Forest achieved BACC of 0.807, while SMOTE+RF, Standard RF, and kNNimp+RF all failed to learn meaningful patterns (BACC $\approx$ 0.5). This demonstrates FRSTU-Forest’s exceptional capability to maintain balanced performance across both classes, even when minority class representation is severely limited.
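A BACC of 0.5 has a concrete meaning in the binary case: it is exactly what a classifier obtains by predicting the majority class for every instance. The sketch below illustrates this with hypothetical labels at a 1:100 ratio. It also computes the F1-score of the minority class, which is 0 for such a classifier; this is consistent with the rows of Table 12 that pair F1 = 0.0 with BACC = 0.5, although how the reported F1 was averaged is an assumption here.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall."""
    recalls = []
    for c in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)

def f1_for_class(y_true, y_pred, positive=1):
    """F1-score of one class, defined as 0.0 when it is never predicted nor present."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Hypothetical 1:100 data and a classifier that always predicts the majority class
y_true = [0] * 100 + [1]
y_pred = [0] * 101
bacc = balanced_accuracy(y_true, y_pred)
f1_minority = f1_for_class(y_true, y_pred, positive=1)
print(bacc, f1_minority)
```

Plain accuracy for this classifier would be about 99%, which is why Balanced Accuracy is the more informative metric under extreme imbalance.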

4.5.3. Scalability to high-dimensional feature spaces

Increasing feature dimensionality from 10 to 100 dimensions presented significant challenges to all methods, with performance degradation particularly pronounced for Standard RF and kNNimp+RF.

Figure 5 provides a comprehensive visualization of this scalability analysis, depicting both F1-score (left) and Balanced Accuracy (right) as functions of feature dimensionality. The proposed FRSTU-Forest demonstrates superior scalability, maintaining a relatively stable BACC of approximately 0.75 across the entire feature range (10 to 100 features), while SMOTE+RF shows a 28.6% degradation and Standard RF deteriorates to near-random performance levels. This dimensional robustness is particularly evident in the 100-feature condition, where FRSTU-Forest maintains a 0.738 BACC compared to 0.521 for SMOTE+RF.

Figure 5.

Performance vs. number of features for (left) F1-Score and (right) Balanced Accuracy.

The proposed FRSTU-Forest framework maintained competitive BACC even in 100-dimensional feature spaces. At ratio 1:10 with 100 features and no noise, FRSTU-Forest achieved BACC of 0.938, substantially outperforming SMOTE+RF (0.804) and Standard RF (0.529). This robustness to high dimensionality suggests that the fixed random-state undersampling strategy effectively preserves discriminative information while reducing computational complexity.

4.5.4. Resilience to label noise

Label noise introduced substantial performance degradation across all methods, with the corruption probability (flip_y parameter) showing an inverse relationship with predictive accuracy.

Figure 6 systematically examines this relationship, showing performance metrics as functions of noise intensity. Under high noise conditions (20% label corruption), while SMOTE+RF generally maintains superior F1-score performance (left panel), FRSTU-Forest demonstrates remarkable resilience in preserving Balanced Accuracy (right panel). At 20% noise level, FRSTU-Forest maintains a BACC of 0.612, representing only 18.4% degradation from the no-noise condition, compared to 26.7% degradation for SMOTE+RF and 39.2% for Standard RF.

Figure 6.

Performance vs. noise level for (top) F1-Score and (bottom) Balanced Accuracy.

Figure 7.

Feature dimensionality vs label noise vs model performance comparative analysis.

The individual configurations in Table 12 show the same pattern, although the margins become small at high noise. For example, at ratio 1:50 with 50 features and 20% noise, FRSTU-Forest achieved BACC of 0.520, narrowly exceeding both Standard RF (0.509) and SMOTE+RF (0.499). This suggests that FRSTU-Forest’s undersampling approach may provide some inherent regularization against label noise, potentially by reducing the influence of noisy majority class instances.

4.5.5. Integrated performance landscape analysis

Figure 7 presents a comprehensive performance surface that integrates the effects of feature dimensionality and label noise on model performance. This visualization reveals several critical insights: First, the FRSTU-Forest surface (red) maintains higher elevation across the entire parameter space, indicating consistent superiority in Balanced Accuracy. Second, the surface exhibits smoother gradients, suggesting more stable performance under varying conditions. Third, in the most challenging region (high dimensionality combined with high noise), FRSTU-Forest maintains performance above 0.65 BACC, while other methods deteriorate to below 0.55. This integrated visualization powerfully demonstrates FRSTU-Forest’s robustness across the multidimensional challenge space.

4.5.6. Statistical significance and effect size

Statistical analysis revealed that FRSTU-Forest provides substantial performance improvements over the baseline Random Forest, with relative BACC improvements of 26.4%, 40.3%, and 25.1% at imbalance ratios 1:10, 1:50, and 1:100 respectively. The effect size was particularly pronounced under conditions of extreme imbalance combined with high dimensionality, where traditional methods frequently failed entirely (BACC $\approx$ 0.5).
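The text does not state how the relative improvements were aggregated across feature and noise settings. As a reading aid only, the usual definition of relative improvement is assumed here, followed by a single-configuration check against the synthetic results table (1:100 ratio, 100 features, no noise):

$$
\Delta_{\mathrm{rel}} = \frac{\mathrm{BACC}_{\text{FRSTU-Forest}} - \mathrm{BACC}_{\text{RF}}}{\mathrm{BACC}_{\text{RF}}} \times 100\%,
\qquad
\frac{0.807 - 0.500}{0.500} \times 100\% = 61.4\%.
$$

This value describes one table row, so it is not expected to coincide with the per-ratio figures quoted above, which summarize several configurations.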

4.5.7. Methodological trade-offs and recommendations

The experimental results highlight a fundamental trade-off between F1-score optimization and Balanced Accuracy preservation. SMOTE+RF achieves the highest F1-score in most configurations, making it preferable for applications where minority-class recall and precision are paramount. In contrast, FRSTU-Forest excels in maintaining balanced performance across classes, making it particularly suitable for applications requiring equitable treatment of both majority and minority classes, such as medical diagnosis or fraud detection, where both false positives and false negatives carry significant consequences.
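The divergence between the two metrics can be made concrete with a small worked example. The sketch below uses hypothetical confusion-matrix counts (not taken from the experiments) for a test set at a 1:100 ratio; the function name `f1_and_bacc` and both classifiers are illustrative assumptions.

```python
def f1_and_bacc(tp, fn, fp, tn):
    """Return (minority-class F1, balanced accuracy) from confusion counts."""
    recall = tp / (tp + fn) if (tp + fn) else 0.0        # sensitivity
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    bacc = (recall + specificity) / 2
    return f1, bacc


# Hypothetical test set: 100 minority and 10,000 majority instances.
# Classifier A finds most minority cases but raises many false alarms.
f1_a, bacc_a = f1_and_bacc(tp=80, fn=20, fp=1500, tn=8500)
# Classifier B raises few false alarms but misses most minority cases.
f1_b, bacc_b = f1_and_bacc(tp=30, fn=70, fp=10, tn=9990)

print(f"A: F1={f1_a:.3f}, BACC={bacc_a:.3f}")
print(f"B: F1={f1_b:.3f}, BACC={bacc_b:.3f}")
```

Classifier A has the higher Balanced Accuracy while classifier B has the higher F1-score, because F1 penalizes false positives through precision, whereas Balanced Accuracy weighs them against the much larger majority class.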

The superior performance of FRSTU-Forest under conditions of extreme imbalance and high dimensionality suggests that its fixed random-state undersampling strategy provides more stable and reliable decision boundaries compared to synthetic oversampling approaches, particularly when the underlying data distribution is complex or noisy.

5. Discussion

This paper proposed the FRSTU-Forest model, which integrates k-NN imputation to handle missing values and fixed random state undersampling to address class imbalance within the Random Forest algorithm. The experimental evaluation demonstrated that FRSTU-Forest consistently outperformed baseline models across both seven benchmark datasets with moderate imbalance and synthetic datasets with extreme imbalance ratios up to 1:100, high dimensionality, and substantial label noise. These findings underscore the robustness and reliability of the framework in tackling challenges associated with data complexity and class imbalance across a wide spectrum of conditions.

The observed improvements are primarily attributed to the combination of preprocessing strategies. k-NN imputation effectively resolved missing value issues, especially in datasets such as Credit Approval and Breast Cancer, while the use of a fixed random state during undersampling contributed to reproducibility and stability in the training process. For example, on the Ecoli dataset, FRSTU-Forest demonstrated superior recall and precision for minority classes. In the Credit Approval dataset, it yielded a more balanced classification between the two classes, an essential aspect in real-world scenarios such as credit risk modeling. On the Breast Cancer Wisconsin dataset, FRSTU-Forest achieved a recall of 96.92% on the malignant class, highlighting its practical utility in high-stakes applications. More significantly, on synthetic datasets with extreme imbalance ratios of 1:100, FRSTU-Forest maintained a balanced accuracy of 0.807 with 100-dimensional features, demonstrating exceptional robustness under challenging conditions that exceed typical real-world scenarios.
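The three stages described above (k-NN imputation, fixed-seed undersampling, Random Forest) can be sketched as follows. This is a minimal illustration using scikit-learn and NumPy, not the authors' implementation: the synthetic data, the seed value `SEED = 42`, the 5% missing rate, the helper name `fixed_undersample`, and all hyperparameters are assumptions chosen for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import KNNImputer
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split

SEED = 42  # assumed value; what matters is that it stays fixed across runs

# Synthetic binary data (roughly 1:10) with about 5% of entries set to missing.
X, y = make_classification(n_samples=2000, n_features=10, n_informative=5,
                           weights=[0.9, 0.1], random_state=SEED)
mask_rng = np.random.default_rng(SEED)
X[mask_rng.random(X.shape) < 0.05] = np.nan

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=SEED)

# Stage 1: k-NN imputation, fitted on the training split only.
imputer = KNNImputer(n_neighbors=5)
X_tr = imputer.fit_transform(X_tr)
X_te = imputer.transform(X_te)


# Stage 2: undersample every class down to the minority size with a fixed seed.
def fixed_undersample(X, y, seed):
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_min, replace=False)
        for c in classes
    ])
    keep.sort()
    return X[keep], y[keep]


X_bal, y_bal = fixed_undersample(X_tr, y_tr, SEED)

# Stage 3: Random Forest trained on the balanced subset.
clf = RandomForestClassifier(n_estimators=100, random_state=SEED)
clf.fit(X_bal, y_bal)
bacc = balanced_accuracy_score(y_te, clf.predict(X_te))
print(f"Balanced accuracy on the held-out split: {bacc:.3f}")
```

Only the training split is resampled here; the test split keeps its original class ratio so that the reported metric reflects the imbalanced setting.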

This work builds upon and extends prior studies in the domain of imbalanced learning. For instance, Wang and Chi¹³ introduced a cost-sensitive classification framework to address class imbalance by adjusting misclassification costs based on class distribution. While effective, such methods are often computationally intensive and complex to implement. Similarly, Zhu et al.³⁰ proposed class-weighted tree construction in Random Forest, which, although accurate, incurs additional computational overhead. In contrast, FRSTU-Forest preserves algorithmic simplicity while enhancing predictive performance through stable undersampling and efficient imputation, without modifying the core architecture of the Random Forest.

Crucially, our extensive evaluation on synthetic datasets directly addresses concerns regarding performance under extreme class imbalance. The results show that FRSTU-Forest maintains robust performance even at 1:100 imbalance ratios, with balanced accuracy above 0.8 under high-dimensional, noise-free settings. This represents a significant advancement over methods that deteriorate severely under such conditions, as evidenced by the near-random performance (BACC $\approx$ 0.5) of standard Random Forest at ratios of 1:50 and beyond in high-dimensional settings.

From an applied perspective, the FRSTU-Forest model is particularly suitable for domains where class imbalance is prevalent and misclassification carries significant consequences, such as in credit scoring, fraud detection, and clinical decision support. Theoretically, this paper contributes to the machine learning literature by demonstrating that random state control during the undersampling process can enhance not only model accuracy but also the consistency of outcomes across multiple runs. This insight opens new avenues for theoretical exploration of fixed-state mechanisms, particularly in enhancing the reproducibility and robustness of resampling strategies within ensemble learning frameworks.
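The effect of random state control on run-to-run consistency can be illustrated with a short, standard-library-only sketch. It assumes a binary problem and plain random undersampling of the majority class; the function name `undersample_indices` and the label vector are hypothetical.

```python
import random


def undersample_indices(labels, seed):
    """Keep all minority indices plus an equally sized random majority subset."""
    minority_label = min(set(labels), key=labels.count)
    minority = [i for i, y in enumerate(labels) if y == minority_label]
    majority = [i for i, y in enumerate(labels) if y != minority_label]
    rng = random.Random(seed)  # seed=None draws fresh entropy on every call
    kept = rng.sample(majority, k=len(minority))
    return sorted(minority + kept)


labels = [1] * 10 + [0] * 1000  # hypothetical 1:100 binary problem

fixed_a = undersample_indices(labels, seed=42)
fixed_b = undersample_indices(labels, seed=42)
free_a = undersample_indices(labels, seed=None)
free_b = undersample_indices(labels, seed=None)

print(fixed_a == fixed_b)  # True: the fixed seed reproduces the same subset
print(free_a == free_b)    # unseeded draws are expected to differ between calls
```

With a fixed seed, every run trains on the same majority-class subset, so any remaining variance in the results comes from the classifier rather than from the resampling step.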

Despite its promising results, this study has limitations. While the framework demonstrated strong performance on both benchmark and extreme imbalance scenarios, further validation on real-world datasets with natural extreme imbalance would strengthen its practical applicability. Additionally, the computational overhead, though reasonable, may require optimization for real-time applications with massive datasets. Future work should investigate the model’s generalizability using a broader collection of datasets from various domains and evaluate its integration with oversampling techniques such as SMOTE (Synthetic Minority Over-sampling Technique) to further improve its effectiveness.

In summary, the FRSTU-Forest model offers a reliable, interpretable, and efficient approach to handling class imbalance in classification tasks. By employing k-NN imputation and maintaining a fixed random state during the undersampling process, the model improves not only predictive accuracy but also stability across different runs. These contributions, both practical and theoretical, lay a strong foundation for further research and potential deployment of this approach in imbalance-aware classification across various application areas.

6. Conclusion and future works

This paper presented a novel ensemble learning framework, FRSTU-Forest, which combines k-NN imputation for missing value handling and fixed random state undersampling to address class imbalance within the Random Forest algorithm. Through comprehensive experiments across seven diverse benchmark datasets and extensive synthetic datasets with extreme imbalance conditions, FRSTU-Forest consistently outperformed baseline models, demonstrating significant improvements in accuracy, balanced accuracy, Cohen’s Kappa value, and overall classification balance under both moderate and severe imbalance scenarios.

The integration of stable preprocessing procedures played a key role in these results. k-NN imputation allowed the model to utilize complete and informative input data, while fixed random state undersampling helped preserve representative instances from the majority class, reducing the risk of overfitting and improving generalization. These advantages were especially evident in complex and imbalanced datasets such as German Credit and Yeast, where FRSTU-Forest exhibited superior stability and predictive performance. Moreover, the framework demonstrated exceptional robustness on synthetic datasets with extreme imbalance ratios up to 1:100, maintaining competitive balanced accuracy even under high-dimensional feature spaces and substantial label noise.

Additionally, statistical consistency and significance tests using Bonferroni-Dunn and multiple dispersion metrics confirmed the robustness and reliability of the proposed model. The FRSTU-Forest approach thus offers a lightweight, interpretable, and scalable solution for practitioners facing class imbalance issues in real-world classification tasks across a wide spectrum of imbalance severity.

Although FRSTU-Forest has shown promising results, several areas offer opportunities for further exploration:

(i) Real-World Extreme Imbalance Validation: While our synthetic experiments demonstrated robustness at 1:100 ratios, future work should validate these findings on real-world datasets with natural extreme imbalance to confirm practical applicability.

(ii) Hybrid Resampling: The integration of oversampling techniques such as SMOTE or ADASYN in conjunction with fixed-state undersampling may further enhance minority class representation without compromising data diversity.

(iii) Domain-Specific Evaluation: Applying FRSTU-Forest to high-impact domains such as healthcare, finance, or cybersecurity, where misclassification carries significant consequences, would test the method’s real-world applicability and robustness.

(iv) Theoretical Modeling: Further work is encouraged to develop theoretical frameworks explaining the benefit of random state control in undersampling and to generalize this principle to other ensemble or meta-learning architectures.

(v) Computational Efficiency Optimization: As the model involves multiple preprocessing steps, future research could focus on optimizing its computational efficiency through parallelism or hardware acceleration to support large-scale data processing.

(vi) Integration with Advanced Resampling: Combining FRSTU-Forest with state-of-the-art oversampling techniques or cost-sensitive learning approaches could potentially yield synergistic improvements in minority class recognition.

In conclusion, FRSTU-Forest contributes a practical and theoretically grounded approach to improving classification outcomes on imbalanced datasets. Its balance of performance, stability, and computational simplicity across both moderate and extreme imbalance conditions makes it a compelling candidate for further adoption and enhancement in both academic and applied machine learning contexts.

Footnotes

Acknowledgements

The authors gratefully acknowledge the institutional support and research facilities provided by Universitas Muhammadiyah Semarang and Institut Teknologi Statistika dan Bisnis Muhammadiyah Semarang (ITESA). The authors also thank the colleagues from both institutions for their constructive discussions and technical feedback throughout the development of this research.

ORCID iDs

Ahmad Ilham

Laelatul Khikmah

Ethical approval

Not applicable.

Author contributions

Ahmad Ilham led the conceptualization and development of the Fixed Random State Undersampling Forest (FRSTU-Forest) model, implemented the algorithm, and performed model validation across multiple imbalanced datasets. He also contributed significantly to the interpretation of results and the overall writing of the manuscript. Laelatul Khikmah was responsible for the statistical analysis and performance evaluation of the proposed model. She also contributed to writing the methodology and results sections, and provided substantial input during manuscript refinement and final review. Both authors collaborated closely throughout the research process, including experimental design, data interpretation, and manuscript preparation. All authors have read and approved the final version of the manuscript.

Consent to participate

Not applicable.

Consent for publication

Not applicable.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data availability

The datasets used in this paper are publicly available online and can be accessed from the respective data repository as cited in the manuscript.

References

1. Witten, Frank, Hall, et al. Data mining: practical machine learning tools and techniques. 5th ed. Burlington, MA: Morgan Kaufmann, Elsevier, 2025.

2. Aggarwal. Data mining: the textbook. Cham: Springer, 2015.

3. Breiman. Random forests. Mach Learn 2001; 45: 5–32.

4. Ignatenko, Surkov, Koltcov. Random forests with parametric entropy-based information gains for classification and regression problems. PeerJ Comput Sci 2024; 10: e1775.

5. Behera, Dash. A novel feature selection technique for enhancing performance of unbalanced text classification problem. Intell Decis Technol 2022; 16: 51–69.

6. Varshavardhini, Rajesh. Modeling of class imbalance handling with optimal deep learning enabled big data classification model. Intell Decis Technol 2023; 17: 1179–1197.

7. Aubaidan, Kadir, Lajb, et al. A review of intelligent data analysis: machine learning approaches for addressing class imbalance in healthcare - challenges and perspectives. Intell Data Anal Int J 2025; 29: 699–719.

8. Ilham, Kindarto, Fathurohman, et al. CFCM-SMOTE: a robust fetal health classification to improve precision modeling in multiclass scenarios. Int J Comput Digital Syst 2024; 15: 471–486.

9. Elreedy, Atiya, Kamalov. A theoretical distribution analysis of synthetic minority oversampling technique (SMOTE) for imbalanced learning. Mach Learn 2024; 113: 4903–4923.

10. Fernández, Garcia, Galar, et al. Learning from imbalanced data sets. Cham: Springer, 2018.

11. Shao, Yan. Noise-robust Gaussian distribution based imbalanced oversampling. In: Algorithms and Architectures for Parallel Processing. ICA3PP 2023. Lecture Notes in Computer Science, vol 14488. Springer, 2024, pp. 221–234.

12. Archana, Prakash. Biomedical named entity recognition through improved balanced undersampling for addressing class imbalance and preserving contextual information. Int J Inf Technol 2024; 16: 4995–5003.

13. Wang, Chi. Cost-sensitive stacking ensemble learning for company financial distress prediction. Expert Syst Appl 2024; 255: 124525.

14. Wang, Zheng, Zhang. Dense fuzzy support vector machine to binary classification for imbalanced data. J Intell Fuzzy Syst 2023; 45: 9643–9653.

15. Salehi, Khedmati. Hybrid clustering strategies for effective oversampling and undersampling in multiclass classification. Sci Rep 2025; 15: 3460.

16. Joloudari, Marefat, Nematollahi, et al. Effective class-imbalance learning based on SMOTE and convolutional neural networks. Appl Sci 2023; 13: 4006.

17. Sasirekha, Kanisha. Adaptive ensemble framework with synthetic sampling for tackling class imbalance problem. Eng Rep 2025; 7: e70109.

18. Ilham, Silva, Mercado-Caruso, et al. Impact of class imbalance on convolutional neural network training in multi-class problems. In: Advances in Intelligent Systems and Computing. 2021, pp. 309–318.

19. German. Glass Identification [dataset]. 1987. UCI Machine Learning Repository. https://doi.org/10.24432/C5WW2P.

20. Nakai. Ecoli [dataset]. 1996. UCI Machine Learning Repository. https://doi.org/10.24432/C5388M.

21. Quinlan. Credit Approval [dataset]. 1987. UCI Machine Learning Repository. https://doi.org/10.24432/C5FS30.

22. Wolberg. Breast Cancer Wisconsin (Original) [dataset]. 1990. UCI Machine Learning Repository. https://doi.org/10.24432/C5HP4Z.

23. Hofmann. Statlog (German Credit Data) [dataset]. 1994. UCI Machine Learning Repository. https://doi.org/10.24432/C5NC77.

24. Nakai. Yeast [dataset]. 1991. UCI Machine Learning Repository. https://doi.org/10.24432/C5KG68.

25. Galar, Fernández, Barrenechea, et al. A review on ensembles for the class imbalance problem: bagging-, boosting-, and hybrid-based approaches. IEEE Trans Syst Man Cybern Part C Appl Rev 2011; 42: 463–484.

26. McDermott MBA, Wang, Marinsek, et al. Reproducibility in machine learning for health research: still a ways to go. Sci Transl Med 2021; 13(586): eabb1655.

27. Gurcan, Soylu. Learning from imbalanced data: integration of advanced resampling techniques and machine learning models for enhanced cancer diagnosis and prognosis. Cancers (Basel) 2024; 16: 3417.

28. Chen, Yang, et al. A survey on imbalanced learning: latest research, applications and future directions. Artif Intell Rev 2024; 57: 137.

29. Younas, Usman, Yan. A deep ensemble learning method for colorectal polyp classification with optimized network parameters. Appl Intell 2023; 53: 2410–2433.

30. Zhu, Xia, Jin, et al. Class weights random forest algorithm for processing class imbalanced medical data. IEEE Access 2018; 6: 4641–4652.

Dataset	Source	Instances	Features	Classes	Minority Instances	Majority Instances	Imbalance Ratio	Missing Values
Glass	KEEL	214	9	7	76	138	1.82	0
Ecoli	UCI	336	7	5	77	259	3.36	0
Credit Approval	UCI	690	15	Mixed	307	383	1.25	37
Breast Cancer	UCI	699	9	2	241	458	1.90	16
PID	KEEL	768	8	2	268	500	1.87	0
German Credit	UCI	1000	20	2	300	700	2.33	0
Yeast	UCI	1484	8	10	429	1055	2.46	0

Imbalance Ratio	Features	Noise	RF F1	RF BACC	kNNimp+RF F1	kNNimp+RF BACC	SMOTE+RF F1	SMOTE+RF BACC	FRSTU-Forest F1	FRSTU-Forest BACC
1:10	10	0%	0.877	0.893	0.877	0.893	0.894	0.943	0.725	0.938
1:10	10	10%	0.645	0.747	0.645	0.747	0.646	0.779	0.549	0.777
1:10	10	20%	0.485	0.665	0.485	0.665	0.469	0.669	0.405	0.65
1:10	50	0%	0.598	0.713	0.598	0.713	0.768	0.831	0.632	0.881
1:10	50	10%	0.489	0.663	0.489	0.663	0.614	0.745	0.475	0.757
1:10	50	20%	0.346	0.605	0.346	0.605	0.439	0.647	0.422	0.673
1:10	100	0%	0.111	0.529	0.111	0.529	0.751	0.804	0.723	0.938
1:10	100	10%	0.144	0.539	0.144	0.539	0.529	0.686	0.503	0.746
1:10	100	20%	0.122	0.532	0.122	0.532	0.423	0.637	0.391	0.631
1:50	10	0%	0.636	0.741	0.636	0.741	0.694	0.792	0.305	0.894
1:50	10	10%	0.205	0.558	0.205	0.558	0.208	0.569	0.157	0.57
1:50	10	20%	0.176	0.548	0.176	0.548	0.204	0.552	0.2	0.523
1:50	50	0%	0.129	0.534	0.129	0.534	0.368	0.62	0.172	0.823
1:50	50	10%	0.075	0.519	0.075	0.519	0.104	0.525	0.144	0.555
1:50	50	20%	0.036	0.509	0.036	0.509	0.061	0.499	0.188	0.52
1:50	100	0%	0.0	0.5	0.0	0.5	0.067	0.517	0.196	0.863
1:50	100	10%	0.0	0.5	0.0	0.5	0.094	0.523	0.171	0.584
1:50	100	20%	0.0	0.5	0.0	0.5	0.087	0.509	0.231	0.542
1:100	10	0%	0.72	0.8	0.72	0.8	0.759	0.866	0.182	0.895
1:100	10	10%	0.202	0.556	0.202	0.556	0.22	0.581	0.103	0.495
1:100	10	20%	0.07	0.517	0.07	0.517	0.126	0.516	0.187	0.517
1:100	50	0%	0.0	0.5	0.0	0.5	0.4	0.633	0.066	0.811
1:100	50	10%	0.0	0.5	0.0	0.5	0.05	0.507	0.127	0.551
1:100	50	20%	0.0	0.5	0.0	0.5	0.028	0.488	0.173	0.51
1:100	100	0%	0.0	0.5	0.0	0.5	0.0	0.5	0.079	0.807
1:100	100	10%	0.0	0.5	0.0	0.5	0.0	0.495	0.136	0.554
1:100	100	20%	0.0	0.5	0.0	0.5	0.038	0.501	0.188	0.495