Improving Credit Scoring with Feature Selection and Predictive Modeling

Abstract

Credit scoring, which forecasts the probability of loan default based on borrower attributes and credit history, is still a crucial task in the financial industry. Finding the most important characteristics to improve credit scoring accuracy has become more difficult due to the complexity of borrower profiles. This paper presents a systematic and multidimensional evaluation of the impact of different feature selection techniques, namely wrapper-based, filter-based, and embedded methods, on the performance of various machine learning classifiers such as Random Forest (RF) and Extreme Gradient Boosting (XGBoost). The influence of data resampling techniques to address class imbalance is also explored. The study evaluates all combinations under three settings: original, oversampled, and undersampled data, using three publicly available datasets: German, Taiwan, and Australian credit scoring datasets. Experimental results show that ensemble classifiers, especially XGBoost and RF, consistently outperform single classifier models. Additionally, feature selection methods, especially embedded and wrapper techniques, enhance model performance and reduce false positive and false negative rates across the three datasets.

Keywords

credit scoring, feature selection, data resampling, ensemble learning

1. Introduction

Credit scoring remains a crucial function in the financial sector. Its objective is to evaluate a borrower's credit history and other characteristics to anticipate the possibility of loan default. The importance of robust credit scoring models has increased dramatically due to an increasingly complex economic environment (Ala’raj & Abbod, 2016). Using these models, financial institutions are able to make more accurate lending decisions. These models provide a reliable framework for assessing creditworthiness, which helps reduce financial risks and support more inclusive lending practices. Financial organizations have historically depended on expert guidelines and basic statistical analysis. These traditional methods use variables such as credit history and income. However, due to the rapid growth of available data and advancements in the financial industry, several challenges have emerged in the field of credit scoring. One of the primary challenges is feature selection. As data availability increases and borrower profiles become more complex, identifying the most relevant features to enhance the accuracy of credit scoring models has become increasingly difficult. With so many potential factors to consider, it is hard to determine which ones are truly significant in predicting credit risk. Another significant challenge is high dimensionality, which can cause overfitting and reduce model performance (Leiva et al., 2019). Models may become very complex as the number of features rises, and such complex models can capture noise in the data instead of learning significant patterns. A third challenge is data imbalance. Imbalanced datasets, where defaulters are usually the minority class, can lead to performance degradation. Since detecting high-risk borrowers is often the main objective of credit scoring, this imbalance may result in models that perform poorly on exactly that task.

The ability to handle these difficulties has improved significantly due to advances in computing power. Credit scoring datasets are now of higher quality thanks to sophisticated methods for managing outliers, missing data, and data validation. In traditional banking settings, where data quality can vary greatly between systems and time periods, this development is very important. Furthermore, the reliability of model performance estimates has been enhanced by validation methods that take into consideration temporal dependencies in credit data, where past borrower behavior or evolving economic conditions influence future credit risk. For example, a borrower who has maintained timely payments for years but recently missed several payments might present a different risk profile than someone with consistent payment delays. This study addresses the issues of feature selection, high dimensionality, and data imbalance in credit scoring by offering a systematic and comprehensive evaluation of how different types of feature selection techniques (wrapper, embedded, and filter-based) interact with different machine learning models under multiple dataset conditions (original, oversampled, and undersampled) using three publicly available datasets. Unlike previous studies that typically examine individual aspects in isolation, this study makes the following main contributions:

–
A three-dimensional evaluation that simultaneously considers feature selection methods, classifiers, and resampling approaches. This integrated approach offers a more holistic analysis of model behavior.
–
An extensive experimental setup that evaluates different combinations of feature selection methods, classifiers, and resampling techniques, which provides a detailed performance analysis and ensures robustness and practical relevance of the findings.
–
Actionable recommendations on the best combinations of feature selection, resampling methods, and classifiers under different conditions. These can serve as a practical guide for industry experts and data scientists working on credit scoring applications.
–
An emphasis on how feature selection enhances model interpretability, especially for complex ensemble models, to provide a trade-off between accuracy and interpretability, which is crucial in regulated financial settings.
–
A comprehensive performance comparison of different models across various metrics, such as accuracy, precision, recall, and the area under the receiver operating characteristic curve (AUC-ROC), to provide a thorough evaluation of their effectiveness in credit scoring tasks.
–
The development of a clearly structured and reproducible framework that can be replicated or extended by other researchers who may wish to test new frameworks or hybrid credit scoring models.
Three publicly accessible datasets are used in this work to analyze credit scores. These datasets are the German credit scoring dataset, the customers default payments in Taiwan dataset, and the Australian credit approval dataset. These datasets offer a wide variety of borrower attributes and credit histories, enabling a thorough assessment of the suggested methods. Correlation-based and mutual information-based feature selection strategies are investigated; these strategies are combined with machine learning algorithms. The remainder of this paper is organized as follows: A thorough survey of the literature on credit scoring models is provided in Section 2. This section sets the scene for the current investigation and identifies the research gaps that this paper seeks to fill. The methods used in this paper are described in Section 3. This section explains the feature selection techniques and machine learning algorithms employed. The data resampling methods used to address the issue of imbalanced datasets are also explained in this section. The employed datasets and the experimental results are explained in Section 4. This section provides information on the efficacy of various models and feature selection strategies in credit scoring tasks by comparing their performance. Section 5 concludes the paper.
2. Literature Review

Many studies have focused on using advanced machine learning and ensemble techniques to improve credit scoring accuracy. A novel ensemble method that adapts to different imbalance ratios in credit rating data was introduced by He et al. (2018). Their approach, which combined multiple classifiers, such as decision trees (DT), RF, and support vector machines (SVMs), demonstrated improved performance in controlling class imbalance and considerable adaptability to various situations. However, the researchers noted significant drawbacks, including reduced interpretability due to the combination of multiple models, and higher computational complexity and resource intensity when executing multiple classifiers. Despite these limitations, the ensemble method's ability to improve the overall accuracy and robustness of credit scoring models is a noteworthy advancement in the field. Zhang and Chi (2021) proposed a heterogeneous ensemble model using adaptive classifier selection to address the issue of imbalanced data. This model made use of base learners such as linear SVM, multivariate discriminant analysis (MDA), k-nearest neighbors (KNN), logistic regression (LR), and DT. The ensemble model performed better than the base classifiers. However, it is more complicated to implement and interpret than simpler models. Moreover, selecting the optimal base classifier can require a lot of resources when dealing with large datasets. Laborda and Ryoo (2021) focused their investigation on the critical element of feature selection in credit scoring. They compared filtering-based, wrapper-based, and embedded methods across LR, DT, RF, SVMs, and naive Bayes (NB) algorithms. Their study demonstrated how important it is to choose features carefully in order to increase the accuracy and efficacy of models. Tripathi et al. (2022) compared the performance of multiple credit scoring models that included nine ensemble learning techniques and five classification techniques. Examples of the ensemble techniques included bagging, adaboost, multiboost, and dagging, while the classification techniques included LR, radial basis function neural network (RBFN), and sequential minimal optimization (SMO) (Ellis et al., 2022). These models were trained and evaluated on six benchmark credit scoring datasets, such as the Taiwan, bank marketing, and German datasets. The best ensemble models were multiboost and bagging. Mokheleli and Museba (2023) addressed the issues of concept drift, verification latency, and class imbalance by developing a model that combined SVM with XGBoost. Liao et al. (2024) investigated data augmentation techniques in order to address the issue of data representativeness. They suggested two techniques that pseudo-label denied loan applications according to confidence levels. The proposed model was compared with a supervised model and a conventional reject inference method with fuzzy augmentation. The results demonstrated that a self-training model with calibrated probabilities increased loan approval rates by 2.6% without increasing the default rate. Table 1 summarizes these studies.

Table 1.
A Comparison of Related Studies on Credit Scoring Models.

Study	Dataset	Algorithms	Performance metrics
He et al. (2018)	– The Japanese dataset (690 samples, 383 negative, 307 positive, 15 features). – The Australian dataset (690 samples, 383 negative, 307 positive, 14 features). – The German dataset (1,000 samples, 700 positive, 300 negative, 20 features). – The DefaultData (30,000 samples, 23,364 positive, 6,636 negative, 24 features). – The PPDaiData (55,596 samples, 48,413 positive, 7,183 negative, 29 features). – The LC2017Q1Data (94,414 positive, 1,219 negative, 72 features).	KNN, LR, LDA, SVM, DT, RF, GBDT, AdaBoost, XGBoost	– AUC: RF (0.79198) – H-measure: RF (0.32247) – Kolmogorov-Smirnov (KS): RF (0.46433) – F-measure: RF (0.84957) – Geometric mean (G-mean): LR (0.66144) – Log-loss: SVM (0.49446) – Average rank (AvgR): RF (2.5)
Zhang and Chi (2021)	– The German dataset. – The DefaultData dataset. – The Chilean dataset (6,569 instances, 5,227 positive, 1,342 negative, 20 features). – The GMSC dataset (120,269 instances, 111,912 positive, 8,357 negative).	LSVM, MDA, KNN, DT, LR, Bagging applied to each.	– G-mean: 0.658 – F-measure: 0.552 – Matthews correlation coefficient (MCC): 0.369 – AUC: 0.684 – AvgR: 1.2
Laborda and Ryoo (2021)	Data from Chung Hua University in Taiwan (30,000 observations, 25 features).	LR, KNN, SVM, RF	Best accuracy achieved with forward stepwise selection.
Tripathi et al. (2022)	– The Taiwan dataset. – The German dataset. – The Japanese datasets. – The Bank-marketing dataset (4,521 samples, 4,000 positive, 521 negative, 16 features).	DT, RBFN, SMO	– Accuracy on the Taiwan dataset: 83.76% – Accuracy on the German dataset: 79.98%
Mokheleli and Museba (2023)	– The Australian dataset. – The Japanese dataset. – The German datasets. – The PPDai dataset (55,596 instances, 48,413 good, 7,183 defaults, 29 features).	DAHE, KNN, RF, XGBoost, LR, SVM	For the German dataset, the best performance values for each metric are as follows: – Accuracy: 87.7 – Precision: 0.89 – Recall: 0.92 – F1-score: 0.90 – G-mean: 0.860 – AUC-ROC: 0.862
Liao et al. (2024)	Intuit’s lending dataset (2017-2020): It includes around 20,000 loan applications.	Good/bad model, fuzzy augmentation, self-training with trust score, self-training with calibrated probability, weak supervision.	– AUC: 0.740 – KS: 0.381 – Approval rate: 0.543 at 2.5% bad rate

Recent studies show that credit scoring techniques have advanced significantly, especially in resolving issues with feature selection, model performance, and data imbalance. Advanced machine learning techniques and ensemble approaches have demonstrated encouraging outcomes in handling complicated data interactions and increasing accuracy. Nevertheless, these developments frequently result in higher processing requirements and possible difficulties with interpretability. In a recent study, Tu and Wu (2025) proposed a model that balances interpretability and predictive performance. Their model, based on the optimal classification tree with hyperplane splits (OCT-H), combines the simplicity of decision trees with the expressive power of hyperplane-based splits. Esenogho et al. (2022) proposed a model that combines a long short-term memory (LSTM) network with adaptive boosting (AdaBoost). The authors employed a hybrid data resampling method using the synthetic minority oversampling technique and edited nearest neighbors (SMOTE-ENN) to handle data imbalance. The proposed LSTM-AdaBoost ensemble demonstrated superior performance compared to traditional algorithms like SVM, DT, and standalone LSTM. Rofik et al. (2024) proposed a model that employs SMOTE for handling data imbalance and ensemble classification techniques to improve credit scoring performance. A federated learning approach was proposed by Wang et al. (2024). In this model, knowledge distillation and fine-tuning were employed to extract both generic and specific knowledge for improved performance. Koc et al. (2023) investigated the impact of feature selection, data scaling, and hyperparameter optimization on the performance of credit scoring models. Results showed that combining appropriate feature selection and scaling methods with advanced ML algorithms significantly improves accuracy and reduces error rates compared to traditional approaches. Emmanuel et al. (2024) proposed a model that combines a stacked ensemble classifier with a filter-based feature selection method based on information gain. Evaluation on three datasets showed that the stacked model outperforms traditional classifiers such as DT and KNN. Talaat et al. (2024) proposed a model that combines deep learning with explainable artificial intelligence (XAI) techniques to improve model interpretability. The model identifies payment delays and outstanding bill amounts as features with a high impact on default risk. Kwon et al. (2025) proposed a multitask learning technique based on Siamese neural networks to improve both predictive power and stability in credit scoring models. The results demonstrated that the proposed model outperforms both classical machine learning and deep learning models in accuracy and stability.

3. Methods

This study employs a comprehensive approach to evaluate and improve credit scoring models using various techniques. The methodology encompasses several stages that include data preprocessing, feature selection, model development, and performance evaluation. The basic processing steps of the proposed methodology are illustrated in Figure 1.

Figure 1.

The Proposed Methodology for Credit Scoring: Multiple Classification Models are Combined with Different Feature Selection Techniques and Different Data Resampling Methods and Evaluated Using Three Different Datasets.

3.1. Data Preprocessing

Data preprocessing is typically a crucial step in many data science projects, transforming raw data into a format suitable for analysis. In general, preprocessing may involve:

–
Cleaning: Removing missing values and noisy data.
–
Integration: Combining data from multiple sources.
–
Discretization: Converting continuous variables into categorical ones.
–
Standardization: Scaling numerical features to a standard range.
These steps are often necessary to ensure data quality and prepare datasets for effective modeling. While the datasets selected for analysis in this paper were already pre-processed and structured, additional preprocessing steps were implemented to further enhance data quality. These steps included removing outliers with values exceeding 40% of the interquartile range (IQR) and imputing missing values, using the median for numeric data and constant values for categorical data. Furthermore, feature scaling, standardization, and categorical variable encoding were applied to prepare the data for modeling.
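
The imputation and standardization steps described above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in the study: the helper names, the "missing" placeholder, and the toy columns are assumptions, and the outlier rule is omitted because its exact form is not specified here.

```python
from statistics import mean, median, pstdev

def impute_median(values):
    """Replace missing (None) numeric entries with the median of the observed ones."""
    fill = median([v for v in values if v is not None])
    return [fill if v is None else v for v in values]

def impute_constant(values, fill="missing"):
    """Replace missing (None) categorical entries with a constant placeholder."""
    return [fill if v is None else v for v in values]

def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0] * len(values)  # a constant column carries no information
    return [(v - mu) / sigma for v in values]

# Hypothetical toy columns, not taken from any of the three datasets.
income = [2500, None, 4000, 3100, None, 5200]
housing = ["own", None, "rent", "own", "rent", None]

income_filled = impute_median(income)
housing_filled = impute_constant(housing)
income_scaled = standardize(income_filled)
```

The median is used for numeric columns because it is less sensitive to extreme values than the mean, which matters for skewed variables such as income.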
3.2. Feature Selection

The use of feature selection strategies to determine which features are most pertinent to credit scoring is examined in this study. Three distinct strategies are used:

–
Filtering-based methods.
–
Wrapper-based methods.
–
Embedded methods.

3.2.1. Filtering-Based Feature Selection Techniques

Filtering-based feature selection chooses a subset of pertinent features from a large candidate pool. By concentrating on the most informative features, it seeks to increase the predictive model’s accuracy (Akogul, 2023). The procedure entails:

–
Calculating relevance scores: Statistical methods such as chi-square tests, mutual information, correlation coefficients, or statistical significance tests are used to evaluate each feature’s relevance to the target variable.
–
Selecting features: A subset of the most pertinent features is chosen after they are sorted according to their ratings.
Among the many benefits of filtering-based feature selection techniques are their ease of use, computational efficiency, and ability to manage high-dimensional data. However, they might overlook complex relationships between features and the target variable. In this study, the chi-square statistical test is used to evaluate the dependence between each feature and the target variable. This method offers several key advantages:

–
Statistical robustness: Using statistical testing, it offers a trustworthy indicator of feature importance.
–
Computational efficiency: A simple and efficient approach that works especially well with large datasets.
–
Interpretability: Produces easily comprehensible statistical significance values.
–
Effectiveness: Determines which features are most informative by calculating their dependency on the target variable.
Larger chi-square scores denote a stronger dependence between a feature and the target variable. It is noteworthy that chi-square testing works best with non-negative feature values and categorical data. In this study, chi-square feature selection identified and retained the most pertinent features while preserving statistical reliability and computational efficiency.
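
To make the scoring step concrete, the following sketch computes the classical Pearson chi-square statistic from the contingency table of one categorical feature and the target, then ranks features by that score. It is an illustration under stated assumptions: the function name and the toy data are hypothetical, and library implementations of chi-square feature scoring may compute the statistic differently.

```python
from collections import Counter

def chi_square_score(feature, target):
    """Pearson chi-square statistic of the feature/target contingency table."""
    n = len(feature)
    observed = Counter(zip(feature, target))
    feature_totals = Counter(feature)
    target_totals = Counter(target)
    score = 0.0
    for f_value, f_count in feature_totals.items():
        for t_value, t_count in target_totals.items():
            expected = f_count * t_count / n  # count expected under independence
            score += (observed[(f_value, t_value)] - expected) ** 2 / expected
    return score

# Hypothetical toy data: 1 = default, 0 = no default.
default = [1, 1, 1, 1, 0, 0, 0, 0]
features = {
    "housing": ["rent", "rent", "rent", "rent", "own", "own", "own", "own"],
    "branch": ["a", "b", "a", "b", "a", "b", "a", "b"],
}

scores = {name: chi_square_score(col, default) for name, col in features.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
```

In this toy example "housing" determines the target completely and receives the maximum score for eight samples, whereas "branch" is independent of the target and scores zero, so a top-k filter would keep "housing" first.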
3.2.2. Wrapper-Based Feature Selection Techniques

The “wrapper feature selection” technique utilizes a feature selection algorithm to “wrap” around the machine learning model. It assesses various feature subsets according to how they affect the model’s performance. The objective is to find the smallest optimal subset that maximizes the model’s accuracy (Belete & Manjaiah, 2020).

–
Forward feature selection (FFS): FFS starts with an empty set of features. It iteratively adds features one by one to the set, selecting those that significantly improve classification accuracy. The process continues until a stopping criterion, such as reaching a maximum number of features or a performance plateau, is met.
–
Backward feature selection (BFS): BFS starts with all available features and gradually removes those that have little or no impact on classification accuracy. Unlike FFS, BFS works by elimination rather than addition. The process continues until removing features no longer improves performance. In this paper, BFS is used to assess the impact of feature removal on the classification system’s performance.
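
The backward elimination loop can be sketched as follows. This is a generic greedy version written for illustration, not the exact procedure used in this paper: the function name, the stopping rule (stop as soon as every possible removal lowers the score), and the toy scoring function standing in for a cross-validated accuracy are all assumptions.

```python
def backward_selection(features, score_fn, min_features=1):
    """Greedy backward elimination driven by a model-based scoring function."""
    selected = list(features)
    best = score_fn(selected)
    while len(selected) > min_features:
        # Score every subset obtained by removing exactly one feature.
        candidates = [
            (score_fn([f for f in selected if f != drop]), drop)
            for drop in selected
        ]
        cand_score, drop = max(candidates, key=lambda c: c[0])
        if cand_score < best:
            break  # every single removal hurts the score, so stop
        selected.remove(drop)  # removal had little or no negative impact
        best = cand_score
    return selected, best

# Hypothetical stand-in for validation accuracy (in percentage points):
# two informative features add to the score, every other feature costs one point.
GAIN = {"income": 10, "payment_delay": 15}

def toy_score(subset):
    noise = sum(1 for f in subset if f not in GAIN)
    return 60 + sum(GAIN.get(f, 0) for f in subset) - noise

all_features = ["income", "payment_delay", "zip_digit", "row_id"]
kept, kept_score = backward_selection(all_features, toy_score)
```

In practice `score_fn` would train and validate the wrapped classifier on each candidate subset, which is why wrapper methods cost far more computation than filter methods.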

3.2.3. Embedded Feature Selection Techniques

Embedded methods offer an intermediate solution between filter and wrapper methods, combining the qualities of both approaches. Like filter methods, embedded methods are computationally efficient, while still interacting with the classifier and incorporating its bias into feature selection, which tends to produce better classifier performance. In this study, RF was implemented as the embedded method for feature selection. RF is an ensemble learning method that combines multiple decision trees, and feature selection is integrated directly into the classifier algorithm. As the classifier trains, it derives an importance score for each feature from how much the feature contributes to the trees’ splits, so feature selection and model construction are performed in a single step. RF was specifically chosen for its:

–
Ability to handle noisy features effectively.
–
High accuracy and robustness in feature selection.
–
Built-in feature importance metrics.
–
Ability to manage high-dimensional data.
–
Seamless handling of both numerical and categorical features.
This approach enabled efficient feature selection while maintaining model interpretability and performance.
3.3. Data Split and Validation

The employed datasets are split into training and testing sets, with 85% allocated for training and 15% for testing. A fixed random seed was applied to maintain reproducibility of the results. This split was chosen due to the relatively small size of the datasets, ensuring sufficient data for robust model development while preserving a meaningful portion for unbiased performance evaluation (Gholamy et al., 2018). For smaller datasets, a higher training percentage helps reduce overfitting and gives the model enough data to capture relevant patterns.
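
A minimal sketch of an 85/15 split with a fixed seed is shown below. The function name and the seed value 42 are assumptions for illustration (the seed used in the experiments is not stated), and the split is a plain shuffle without stratification.

```python
import random

def train_test_split_indices(n_samples, test_fraction=0.15, seed=42):
    """Shuffle row indices with a fixed seed and cut off the test portion."""
    rng = random.Random(seed)  # dedicated generator, so the split is reproducible
    indices = list(range(n_samples))
    rng.shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]

# 1,000 rows, matching the size of the German dataset.
train_idx, test_idx = train_test_split_indices(1000)
```

Calling the function again with the same seed returns the same partition, which is what makes results comparable across feature selection methods and classifiers.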

3.4. Data Resampling

Imbalanced data can degrade the accuracy and overall performance of a model, which motivates the use of data resampling techniques. The methods listed below can be applied to deal with an imbalanced dataset, and they apply only to classification problems (Song et al., 2021). Two data resampling techniques are examined in this work:

–
Random oversampling: This method randomly samples the minority class (that is, the class with fewer samples) with replacement until its sample count equals that of the majority class. Although this method is straightforward and efficient, duplicating minority samples could result in overfitting.
–
Random undersampling: This method randomly samples the majority class without replacement until its sample count equals that of the minority class. Although this method is quick and easy, it discards majority-class information and might be ineffective for imbalanced datasets with few samples.
Each data resampling technique has advantages and disadvantages; thus, the choice should depend on the particulars of the dataset and the problem the study is attempting to address. To make sure that oversampling does not result in overfitting or other problems, it is also critical to assess how well the machine learning models trained on the oversampled dataset perform on the held-out test data.
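
Both resampling techniques can be expressed compactly, as the following sketch shows. The function names, the seed, and the toy data are assumptions made for illustration and do not reproduce the code used in the experiments.

```python
import random
from collections import Counter

def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen minority rows until all classes are equally large."""
    rng = random.Random(seed)
    counts = Counter(y)
    target_size = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, count in counts.items():
        pool = [row for row, lab in zip(X, y) if lab == label]
        extra = rng.choices(pool, k=target_size - count)  # sampling with replacement
        X_out.extend(extra)
        y_out.extend([label] * len(extra))
    return X_out, y_out

def random_undersample(X, y, seed=0):
    """Keep a random subset of each class as large as the smallest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    target_size = min(counts.values())
    X_out, y_out = [], []
    for label in counts:
        pool = [row for row, lab in zip(X, y) if lab == label]
        X_out.extend(rng.sample(pool, target_size))  # sampling without replacement
        y_out.extend([label] * target_size)
    return X_out, y_out

# Hypothetical toy data: seven non-defaulters (0) and three defaulters (1).
X = [[i] for i in range(10)]
y = [0] * 7 + [1] * 3

X_over, y_over = random_oversample(X, y)
X_under, y_under = random_undersample(X, y)
```

The oversampled set grows to 14 rows, with every added row a copy of an existing defaulter, while the undersampled set shrinks to 6 rows and drops four non-defaulters.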
3.5. Classification Algorithms

For this paper, seven classification algorithms were chosen because of their extensive use and demonstrated efficacy in credit scoring. These algorithms include DT, RF, XGBoost, categorical boosting (CatBoost), Voting Classifier, light gradient-boosting machine (LightGBM), and bootstrap aggregating (Bagging). Previous studies have shown the effectiveness of these models in predicting credit risk, which supports this choice. Apart from DT, which is a single classifier, every algorithm was selected due to its distinct method of ensemble learning: RF uses random feature sampling, XGBoost and CatBoost use gradient boosting with various optimization techniques, LightGBM uses gradient-based one-side sampling, and Bagging and the Voting Classifier use different strategies for integrating multiple models (Tripathi et al., 2022). This variety of approaches makes it possible to thoroughly assess which ones work best for credit scoring applications.

3.5.1. DT

DTs are supervised learning models that split data into branches based on feature values, forming a tree-like structure. At each node, the algorithm selects the feature that best separates the data according to a criterion like Gini impurity or information gain. They are easy to interpret and visualize but can overfit, especially with complex datasets.
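
As a brief illustration of the split criterion, the sketch below computes the Gini impurity of a node and the weighted impurity of a candidate split. The function names and toy labels are assumptions for illustration only.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_gini(left, right):
    """Impurity of a split: child impurities weighted by child size."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# Hypothetical node with two defaulters (1) and two non-defaulters (0).
parent = [0, 0, 1, 1]
pure_split = split_gini([0, 0], [1, 1])     # separates the classes perfectly
useless_split = split_gini([0, 1], [0, 1])  # leaves both children mixed
```

The tree-growing algorithm prefers the candidate split with the lowest weighted impurity, so the first split above (impurity 0) would be chosen over the second (impurity 0.5, no better than the parent node).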

3.5.2. RF

The RF algorithm builds a classification model from an ensemble of decision trees. Each tree is trained on a random sample of the data and considers a random subset of features at each split; this injected randomness is what distinguishes RF from a single decision tree. The final prediction aggregates the votes of the individual trees.

3.5.3. XGBoost

XGBoost is an extremely effective gradient boosting technique that is optimized for scalability and performance. It expands on the idea of boosting, in which several weak learners—typically decision trees—are trained one after the other, with each new tree fixing mistakes committed by the ones before it (Chen & Guestrin, 2016). XGBoost’s salient characteristics include:

–
Parallelization: XGBoost can parallelize tree construction, which greatly accelerates the training process in contrast to conventional gradient boosting.
–
Regularization: L1 and L2 regularization are incorporated, which enhances model generalization and helps manage overfitting.
–
Scalability: Large datasets can be handled effectively using XGBoost, which is made for distributed computing environments like Hadoop and Spark.
–
Handling sparse data: XGBoost uses a sparsity-aware technique to optimize performance for datasets with missing or sparse values.

3.5.4. CatBoost

CatBoost is a powerful gradient boosting technique that can handle categorical data natively; therefore, it does not require one-hot encoding or other intensive preprocessing. It is known for its accuracy, scalability, and fast performance in both classification and regression tasks. To reduce overfitting and prevent target leakage, CatBoost employs a special ordered boosting technique that enhances the model’s reliability. In addition to supporting GPU acceleration for even quicker training on big datasets, it works well right out of the box and requires little hyperparameter tuning (Prokhorenkova et al., 2018).

3.5.5. Voting Classifier

The Voting Classifier is an ensemble learning method that combines the predictions of several separate classifiers to enhance model performance. To improve overall accuracy, this technique makes use of a variety of models, including DT, LR, and SVM. The two ways to aggregate predictions are hard voting and soft voting. In hard voting, each classifier votes for its predicted class, and the class with the most votes wins. Soft voting averages the predicted probabilities and typically produces superior results. Voting classifiers can perform better on unseen data and tend to be more resistant to overfitting than their individual members. They have broad applications in several fields, such as natural language processing, healthcare, and finance (Re & Valentini, 2012).
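
The difference between hard and soft voting can be seen in a small sketch. The function names and the three probability vectors below are hypothetical and serve only to illustrate the two aggregation rules.

```python
from collections import Counter

def hard_vote(predicted_labels):
    """Return the class predicted by the largest number of classifiers."""
    return Counter(predicted_labels).most_common(1)[0][0]

def soft_vote(predicted_probas):
    """Average the per-class probabilities and return the most probable class."""
    n_models = len(predicted_probas)
    n_classes = len(predicted_probas[0])
    averaged = [
        sum(proba[c] for proba in predicted_probas) / n_models
        for c in range(n_classes)
    ]
    return max(range(n_classes), key=averaged.__getitem__)

# Hypothetical outputs of three classifiers for one applicant,
# given as [P(no default), P(default)].
probas = [[0.55, 0.45], [0.60, 0.40], [0.10, 0.90]]
labels = [max(range(2), key=p.__getitem__) for p in probas]  # each model's own vote

hard_result = hard_vote(labels)
soft_result = soft_vote(probas)
```

Two weakly confident models outvote one highly confident model under hard voting (class 0), whereas soft voting lets the confident prediction dominate the average (class 1). This sensitivity to confidence is one reason soft voting often performs better when the base classifiers produce well-calibrated probabilities.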

3.5.6. LightGBM

LightGBM is an efficient and scalable gradient boosting technique designed for large-scale machine learning applications. By grouping continuous feature values into discrete bins using a histogram-based learning technique, it drastically lowers computational complexity and speeds up training. LightGBM uses a leaf-wise tree growth technique, choosing the leaf with the highest delta loss for growth, which frequently produces a more accurate tree structure. There is no need for intensive preprocessing because the system can handle categorical features directly. Furthermore, LightGBM supports GPU and parallel learning, which increases its speed and scalability and makes it appropriate for a range of uses, such as marketing and financial prediction modeling (Ke et al., 2017).

3.5.7. Bagging

Bagging is an ensemble learning technique that generates numerous copies of a base model by training each version on a distinct bootstrap sample of the original dataset. These samples are produced by selecting observations at random with replacement, which means that some observations might show up more than once and others might not show up at all. Usually, the final prediction is determined by obtaining a majority vote (for classification) or averaging the predictions (for regression) across all base models.
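
The bootstrap-and-vote procedure can be sketched as follows. This is an illustration under stated assumptions: the base learner is a hypothetical one-feature threshold rule chosen for brevity, and the function names, the number of models, the seed, and the toy data are not taken from the study.

```python
import random
from collections import Counter

def bootstrap_indices(n_samples, rng):
    """Draw n_samples row indices at random with replacement."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]

def fit_stump(xs, ys):
    """Hypothetical base learner: threshold halfway between the two class means."""
    pos = [x for x, label in zip(xs, ys) if label == 1]
    neg = [x for x, label in zip(xs, ys) if label == 0]
    if not pos or not neg:  # the bootstrap sample contained a single class
        constant = 1 if pos else 0
        return lambda x: constant
    threshold = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x: int(x > threshold)

def bagging_fit(xs, ys, n_models=25, seed=0):
    """Train one base model per bootstrap sample of the training data."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        idx = bootstrap_indices(len(xs), rng)
        models.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return models

def bagging_predict(models, x):
    """Classify x by majority vote over all base models."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: months of payment delay and default flag.
delay_months = [1, 2, 3, 4, 6, 7, 8, 9]
defaulted = [0, 0, 0, 0, 1, 1, 1, 1]
models = bagging_fit(delay_months, defaulted)
```

Because each model sees a slightly different sample, the individual thresholds vary, and the majority vote averages out that variation.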

4. Experimental Results and Discussion

Multiple experiments were conducted to explore the performance of the proposed models on three publicly available datasets (the German (Asuncion & Newman, 2007a), Taiwan (Yeh & Lien, 2009), and Australian (Asuncion & Newman, 2007b) credit scoring datasets) across three different scenarios: the original dataset, an oversampled dataset, and an undersampled dataset. In each scenario, the proposed feature selection methods combined with different classification algorithms were evaluated using multiple performance metrics: accuracy, AUC, false positive rate (FPR), false negative rate (FNR), recall, precision, and F1-score. In the first scenario, the proposed models were evaluated using the original samples from the three employed datasets. In the second scenario, the proposed models were evaluated using an oversampled dataset. In oversampling, additional minority-class samples are added by duplicating existing ones. However, this technique can lead to overfitting, as the model may memorize duplicated samples rather than generalizing well. In the third scenario, the proposed models were evaluated using an undersampled dataset. In undersampling, a subset of the majority class is selected so that its size matches the number of samples in the minority class. Since this technique removes samples from the majority class, it could negatively affect performance due to the loss of valuable information.

4.1. Dataset Description

This study makes use of three popular, preprocessed credit rating datasets:

– The German credit data: With 21 variables (20 input predictors and one outcome variable) and 1,000 cases, the German credit scoring dataset separates defaulters from non-defaulters.
– The Taiwan credit card default dataset: This dataset covers 23 features such as credit details, payment history, and demographic data. It is a useful tool for classification problems in machine learning research since it seeks to forecast default payments in the upcoming month.
– The Australian credit approval: This dataset contains 690 Australian credit card applications with 14 anonymized features. It includes both numerical and categorical data used to predict credit approval.

These datasets are publicly available and often utilized in related studies, offering a solid foundation for comparing and evaluating models. Since these datasets provide a variety of attributes pertaining to consumer demographics, financial data, and credit behaviors, they were selected for their applicability to credit scoring and their ease of use.

4.2. Evaluation Metrics

The performance of each model is assessed using the test dataset. The following metrics are used for evaluation:

4.2.1. Accuracy

Accuracy in credit scoring is a straightforward metric that measures the overall proportion of borrowers correctly classified as either low-risk or high-risk. It is calculated by summing the true positives (TP) and true negatives (TN) and dividing them by the total number of borrowers. A higher accuracy score indicates better overall classification performance (Labatut & Cherifi, 2012). It is defined by equation (1):

Accuracy = \frac{TP + TN}{Total number of borrowers}

(1)

where the total number of borrowers is the sum of TP, TN, false positives (FP), and false negatives (FN).

4.2.2. FPR

The FPR in credit scoring is the percentage of low-risk borrowers who are mistakenly classified as high-risk. A high FPR causes losses for the business because creditworthy borrowers are rejected. FPR is computed by equation (2). Minimizing the FPR improves business decision-making by balancing the identification of high-risk borrowers against the rejection of low-risk borrowers (Galdi & Tagliaferri, 2019).

FPR = \frac{FP}{FP + TN}

(2)

4.2.3. FNR

The percentage of high-risk borrowers that are mistakenly classified as low-risk is known as the FNR in credit scoring. FNR is defined by equation (3). Minimizing the FNR is essential to accurately identify high-risk borrowers (Galdi & Tagliaferri, 2019).

FNR = \frac{FN}{FN + TP}

(3)

4.2.4. AUC-ROC

The AUC-ROC assesses how well the model distinguishes between positive and negative examples (Carter et al., 2016). The ROC curve shows the true positive rate (TPR) versus the FPR. The closer the AUC-ROC value is to 1, the higher the model’s performance.
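One way to make this measure concrete is through its ranking interpretation: the AUC-ROC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative example, with ties counted as one half. The sketch below computes this quantity directly; the function name and the toy scores are illustrative assumptions, not values from this study.

```python
def auc_roc(y_true, scores):
    """AUC-ROC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC-ROC needs at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example: three of the four positive/negative pairs are ordered correctly.
toy_auc = auc_roc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])  # 0.75
```

A value of 0.5 corresponds to random ranking, and a value of 1 means every positive example is scored above every negative example.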

4.2.5. Precision

Precision is a performance metric that is especially important for binary classification tasks such as credit scoring. It measures the percentage of the model’s positive predictions that are actually positive, and thus assesses how reliable those predictions are. It is defined by equation (4):

Precision = \frac{TP}{TP + FP}

(4)

A high precision score means that the model is more likely to be right when it predicts a borrower as high-risk (positive), which means that fewer low-risk borrowers are incorrectly labeled as high-risk. A credit scoring algorithm may reject a decent applicant, resulting in lost money and unhappy clients, if it incorrectly labels a low-risk borrower as high-risk (FP). Therefore, increasing precision becomes essential while trying to reduce such errors.

4.2.6. Recall

Recall (also known as sensitivity or TPR) is an important performance metric in the context of credit scoring, particularly for detecting high-risk borrowers. It calculates the percentage of real positive examples that the model accurately detects. It is computed mathematically by equation (5):

Recall = \frac{TP}{TP + FN}

(5)

In situations like credit scoring, where failing to identify a high-risk borrower can lead to a sizable financial loss, a high recall means that the model misses few positive cases (false negatives). However, as recall increases, the model may begin to incorrectly identify low-risk borrowers as high-risk, which can decrease precision. As a result, depending on the objectives of the credit scoring model, recall and precision frequently must be traded off.
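The trade-off between recall and precision can be illustrated by varying the decision threshold applied to a model’s scores. The sketch below uses made-up labels and scores (they are assumptions for illustration only) and shows that a lower threshold raises recall while lowering precision.

```python
def precision_recall_at(y_true, scores, threshold):
    """Precision and recall when every score at or above the threshold is predicted positive."""
    pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(p == 1 and t == 1 for p, t in zip(pred, y_true))
    fp = sum(p == 1 and t == 0 for p, t in zip(pred, y_true))
    fn = sum(p == 0 and t == 1 for p, t in zip(pred, y_true))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical labels (1 = positive) and model scores.
y_true = [1, 1, 1, 0, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]

strict = precision_recall_at(y_true, scores, 0.65)   # (1.0, 0.75): no false positives, one positive missed
lenient = precision_recall_at(y_true, scores, 0.35)  # (0.8, 1.0): all positives found, one false positive
```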

4.2.7. F1-score

When assessing credit scoring models on unbalanced datasets, the F1-score is very helpful because it integrates precision and recall into a single value. Precision and recall are given equal weight in the F1-score, which is the harmonic mean of the two measures. A higher F1-score shows that the model properly classifies high-risk borrowers while striking a solid balance between recall and precision. In contrast to accuracy alone, the F1-score takes into consideration the class imbalance that is frequently present in credit scoring datasets, where there are generally more low-risk borrowers than high-risk borrowers. Consequently, the F1-score offers a more reliable assessment of the model’s performance, particularly when it comes to precisely identifying the comparatively uncommon high-risk cases (Galdi & Tagliaferri, 2019). It is defined by equation (6):

F1-score = \frac{2 \times Precision \times Recall}{Precision + Recall}

(6)
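Equations (1) to (6) can all be computed from the four confusion-matrix counts, as in the sketch below. The counts in the example are hypothetical: the source does not report confusion matrices, and these values were chosen only because they yield rates close to those listed for RF with backward selection in Table 2. The function name is likewise an assumption, and the sketch assumes that no denominator is zero.

```python
def credit_metrics(tp, tn, fp, fn):
    """Evaluate equations (1) to (6) from confusion-matrix counts (all denominators assumed nonzero)."""
    total = tp + tn + fp + fn
    precision = tp / (tp + fp)          # equation (4)
    recall = tp / (tp + fn)             # equation (5)
    return {
        "accuracy": (tp + tn) / total,  # equation (1)
        "fpr": fp / (fp + tn),          # equation (2)
        "fnr": fn / (fn + tp),          # equation (3)
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),  # equation (6)
    }

# Hypothetical counts for a 150-sample test set (not reported in the source).
m = credit_metrics(tp=98, tn=26, fp=20, fn=6)
```

Because FNR and recall share the same denominator, they always sum to 1, which offers a quick consistency check when reading the result tables.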

4.3. Results on the German Credit Scoring Dataset

The performance of the proposed models is first explored on the German credit scoring dataset across three different scenarios: the original dataset, an oversampled dataset, and an undersampled dataset. Table 2 lists the results of the various feature selection strategies and classification models on the original German credit scoring dataset, without applying any data resampling techniques. On the original dataset, RF consistently demonstrates superior and balanced performance, particularly when combined with the backward selection method. This combination achieves the highest accuracy (82.66%), the highest recall (94.23%, shared with RF under chi-squared filtering), and the highest F1-score (88.28%), and it maintains strong performance across other metrics, including a moderate FPR (43.48%) and the lowest FNR (5.77%) in this imbalanced credit risk assessment scenario. XGBoost with the embedded method is also competitive, with the highest AUC-ROC (76%, shared with the Voting classifier under chi-squared filtering) and strong precision (84.07%). The Voting classifier performs notably well in precision, particularly with the filtering method (84.54%), showing its effectiveness in reducing false positives in credit risk assessment. LightGBM achieves the lowest FPR (34.78% with backward selection) but suffers from higher FNR values and lower F1-scores, suggesting that it may not be the optimal choice for achieving balanced performance in credit risk prediction.

Table 2.
Evaluation Metrics of the Models Using Original Dataset (The German Dataset).

Measure type	Method type	Method name	DT	RF	XGBoost	CatBoost	Voting classifier	LightGBM	Bagging
Accuracy	Embedded	RF	73.33%	80.00%	82.00%	80.66%	77.33%	72.66%	80.00%
	Filtering	Chi-squared	74%	80.66%	77.33%	76.66%	81.33%	76.66%	80.66%
	Wrapper	Backward selection	73.33%	82.66%	78.00%	78.00%	78.00%	75.33%	79.33%
		Forward selection	68%	76.00%	73.33%	74.66%	75.33%	74.66%	76.00%
	Without feature selection	Without feature selection	73.33%	78.66%	78.00%	76.00%	79.33%	75.33%	74.00%
AUC-ROC	Embedded	RF	61%	72%	76%	75%	72%	69%	72%
	Filtering	Chi-squared	66%	72%	72%	69%	76%	73%	73%
	Wrapper	Backward selection	64%	75%	71%	71%	73%	73%	72%
		Forward selection	63%	69%	65%	65%	68%	71%	68%
	Without feature selection	Without feature selection	61%	69%	73%	70%	74%	71%	67%
FPR	Embedded	RF	69.57%	47.83%	39.13%	41.30%	41.30%	41.30%	47.83%
	Filtering	Chi-squared	54.35%	50.00%	43.48%	50.00%	36.96%	36.96%	45.65%
	Wrapper	Backward selection	60.87%	43.48%	45.65%	45.65%	39.13%	34.78%	47.83%
		Forward selection	50%	47.83%	56.52%	58.70%	50.00%	36.96%	54.35%
	Without feature selection	Without feature selection	69.57%	54.35%	41.30%	45.65%	39.13%	39.13%	52.17%
FNR	Embedded	RF	7.69%	7.69%	8.65%	9.62%	14.42%	21.15%	7.69%
	Filtering	Chi-squared	13.46%	5.77%	13.46%	11.54%	10.58%	17.31%	7.69%
	Wrapper	Backward selection	11.54%	5.77%	11.54%	11.54%	14.42%	20.19%	8.65%
		Forward selection	24.04%	13.46%	13.46%	10.58%	13.46%	20.19%	10.58%
	Without feature selection	Without feature selection	7.69%	6.73%	13.46%	14.42%	12.50%	18.27%	14.42%
Recall	Embedded	RF	92.30%	92.30%	91.34%	90.38%	85.57%	78.84%	92.30%
	Filtering	Chi-squared	86.53%	94.23%	86.53%	88.46%	89.42%	82.69%	92.30%
	Wrapper	Backward selection	88.46%	94.23%	88.46%	88.46%	85.57%	78.80%	91.34%
		Forward selection	75.96%	86.53%	86.53%	89.42%	86.53%	79.80%	89.42%
	Without feature selection	Without feature selection	92.30%	93.26%	86.53%	85.57%	87.50%	81.73%	85.57%
Precision	Embedded	RF	75%	81.35%	84.07%	83.18%	82.40%	81.18%	81.35%
	Filtering	Chi-squared	78.26%	80.99%	81.81%	80.00%	84.54%	83.49%	82.05%
	Wrapper	Backward selection	76.66%	83.05%	81.41%	81.41%	83.17%	83.83%	81.19%
		Forward selection	77.45%	80.35%	77.58%	77.50%	79.64%	83.00%	78.81%
	Without feature selection	Without feature selection	75%	79.50%	82.56%	80.90%	83.48%	82.52%	78.76%
F1-score	Embedded	RF	82.75%	86.48%	87.55%	86.63%	83.96%	80.00%	86.48%
	Filtering	Chi-squared	82.19%	87.11%	84.11%	84.01%	86.91%	83.09%	86.87%
	Wrapper	Backward selection	82.14%	88.28%	84.79%	84.79%	84.36%	81.77%	85.97%
		Forward selection	76.69%	83.33%	81.81%	83.03%	82.94%	81.37%	83.78%
	Without feature selection	Without feature selection	82.75%	85.84%	84.50%	83.17%	85.44%	82.12%	82.02%

Table 3 lists the results of the various feature selection strategies and classification models on the oversampled German credit scoring dataset. The evaluation shows that oversampling is most effective when combined with feature selection. Chi-squared filtering with RF achieves the highest accuracy (81.33%, matched by backward selection with XGBoost) and the highest F1-score (87.15%), which suggests that balancing the dataset benefits RF. On the other hand, forward selection results in lower accuracy (around 71%), lower recall, and the lowest F1-scores, which suggests that removing too many features harms model performance. In general, oversampling combined with chi-squared filtering or backward selection improves model performance, while forward selection leads to poor results.

Table 3.

Evaluation Metrics of the Models Using Oversampled Dataset (The German Dataset).

Measure type	Method type	Method name	DT	RF	XGBoost	CatBoost	Voting classifier	LightGBM	Bagging
Accuracy	Embedded	RF	71.33%	74.66%	76.66%	76.66%	77.33%	76.00%	78.66%
	Filtering	Chi-squared	70.66%	81.33%	76.00%	79.33%	78.66%	76.66%	78.66%
	Wrapper	Backward selection	68%	80.00%	81.33%	78.00%	80.66%	78.66%	80.00%
		Forward selection	72.66%	72.00%	70.00%	71.33%	72.00%	73.33%	71.33%
	Without feature selection	Without feature selection	74.66%	74.66%	74.66%	76.66%	78.00%	76.00%	73.33%
AUC-ROC	Embedded	RF	64%	69%	73%	69%	73%	71%	73%
	Filtering	Chi-squared	67%	75%	71%	74%	73%	72%	72%
	Wrapper	Backward selection	64%	76%	77%	74%	76%	74%	75%
		Forward selection	68%	66%	64%	64%	66%	67%	64%
	Without feature selection	Without feature selection	67%	68%	69%	71%	73%	71%	67%
FPR	Embedded	RF	54.35%	45.65%	34.78%	50.00%	39.13%	41.30%	41.30%
	Filtering	Chi-squared	43.48%	41.30%	41.30%	41.30%	41.30%	39.13%	45.65%
	Wrapper	Backward selection	47.83%	34.78%	32.61%	36.96%	36.96%	39.13%	36.96%
		Forward selection	45.65%	47.83%	52.17%	54.35%	50.00%	50.00%	54.35%
	Without feature selection	Without feature selection	52.17%	47.83%	45.65%	43.48%	41.30%	43.48%	47.83%
FNR	Embedded	RF	17.31%	16.35%	18.27%	11.54%	15.38%	16.35%	12.50%
	Filtering	Chi-squared	23.08%	8.65%	16.35%	11.54%	12.50%	16.35%	10.58%
	Wrapper	Backward selection	25%	13.46%	12.50%	15.38%	11.54%	13.46%	12.50%
		Forward selection	19.23%	19.23%	20.19%	17.31%	18.27%	16.35%	17.31%
	Without feature selection	Without feature selection	13.46%	15.38%	16.35%	14.42%	13.46%	15.38%	17.31%
Recall	Embedded	RF	82.69%	83.65%	81.73%	88.46%	84.61%	83.65%	87.50%
	Filtering	Chi-squared	76.92%	91.34%	83.65%	88.46%	87.50%	83.65%	89.42%
	Wrapper	Backward selection	75%	86.53%	87.50%	84.61%	88.46%	86.53%	87.50%
		Forward selection	80.76%	80.76%	79.80%	82.69%	81.73%	83.65%	82.69%
	Without feature selection	Without feature selection	83.53%	84.61%	83.65%	85.57%	86.53%	84.61%	82.69%
Precision	Embedded	RF	77.47%	80.55%	84.15%	80.00%	83.01%	82.07%	82.72%
	Filtering	Chi-squared	80%	83.33%	82.07%	82.88%	82.72%	82.85%	81.57%
	Wrapper	Backward selection	78%	84.90%	85.84%	83.80%	84.40%	83.33%	84.25%
		Forward selection	80%	79.24%	77.57%	77.47%	78.70%	79.09%	77.74%
	Without feature selection	Without feature selection	78.94%	80.00%	80.55%	81.65%	82.56%	81.48%	79.62%
F1-score	Embedded	RF	80%	82.07%	82.92%	84.01%	83.80%	82.85%	85.04%
	Filtering	Chi-squared	78.43%	87.15%	82.85%	85.58%	85.04%	83.25%	85.32%
	Wrapper	Backward selection	76.47%	85.71%	86.66%	84.21%	86.38%	84.90%	85.84%
		Forward selection	80.38%	80.00%	78.67%	80.00%	80.18%	81.30%	80.00%
	Without feature selection	Without feature selection	82.56%	82.24%	82.07%	83.56%	84.50%	83.01%	81.13%

Table 4 lists the results of the various feature selection strategies and classification models on the undersampled German credit scoring dataset.

Table 4.

Evaluation Metrics of the Models Using Undersampled Dataset (The German Dataset).

Measure type	Method type	Method name	DT	RF	XGBoost	CatBoost	Voting classifier	LightGBM	Bagging
Accuracy	Embedded	RF	67.33%	74.00%	76.00%	73.33%	76.66%	72.66%	71.33%
	Filtering	Chi-squared	70%	74.66%	70.66%	74.66%	73.33%	73.33%	73.33%
	Wrapper	Backward selection	68.66%	76.66%	78.66%	77.33%	78.00%	74.66%	69.33%
	Wrapper	Forward selection	64.66%	71.33%	68.66%	72.00%	70.00%	71.33%	63.33%
	Without feature selection	Without feature selection	68.66%	72.66%	73.33%	72.00%	75.33%	73.33%	70.66%
AUC-ROC	Embedded	RF	66%	71%	74%	72%	74%	71%	70%
	Filtering	Chi-squared	69%	74%	69%	74%	72%	72%	72%
	Wrapper	Backward selection	73%	75%	78%	77%	76%	73%	68%
	Wrapper	Forward selection	68%	70%	66%	71%	68%	69%	60%
	Without feature selection	Without feature selection	68%	72%	71%	73%	73%	72%	70%
FPR	Embedded	RF	36.96%	36.96%	32.61%	30.43%	32.61%	32.61%	34.78%
	Filtering	Chi-squared	32.61%	28.26%	36.96%	28.26%	32.61%	30.43%	32.61%
	Wrapper	Backward selection	17.39%	28.26%	23.91%	23.91%	28.26%	32.61%	34.78%
	Wrapper	Forward selection	23.91%	32.61%	39.13%	30.43%	36.96%	36.96%	47.83%
	Without feature selection	Without feature selection	32.61%	30.43%	34.78%	26.09%	32.61%	32.61%	32.61%
FNR	Embedded	RF	30.77%	21.15%	20.19%	25.00%	19.23%	25.00%	25.96%
	Filtering	Chi-squared	28.85%	24.04%	25.96%	24.04%	24.04%	25.00%	24.04%
	Wrapper	Backward selection	37.50%	21.15%	20.19%	22.12%	19.23%	22.12%	28.85%
	Wrapper	Forward selection	40.38%	26.92%	27.88%	26.92%	26.92%	25.00%	31.73%
	Without feature selection	Without feature selection	30.77%	25.96%	23.08%	28.85%	21.15%	24.04%	27.88%
Recall	Embedded	RF	69.23%	78.84%	79.80%	75.00%	80.76%	75.00%	74.03%
	Filtering	Chi-squared	71.15%	75.96%	74.03%	75.96%	75.96%	75.00%	75.96%
	Wrapper	Backward selection	62.5%	78.84%	79.80%	77.88%	80.76%	77.88%	71.15%
	Wrapper	Forward selection	59.61%	73.07%	72.11%	73.07%	73.07%	75.00%	68.26%
	Without feature selection	Without feature selection	69.23%	74.03%	76.92%	71.15%	78.84%	75.96%	72.11%
Precision	Embedded	RF	80.89%	82.82%	84.69%	84.78%	84.84%	83.87%	82.79%
	Filtering	Chi-squared	83.14%	85.86%	81.91%	85.86%	84.04%	84.78%	84.04%
	Wrapper	Backward selection	89.04%	86.31%	88.29%	88.04%	86.59%	84.37%	82.22%
		Forward selection	84.93%	83.51%	80.64%	84.44%	81.72%	82.10%	76.34%
	Without feature selection	Without feature selection	82.75%	84.61%	83.33%	86.04%	84.53%	84.04%	83.33%
F1-score	Embedded	RF	74.61%	80.78%	82.17%	79.59%	82.75%	79.18%	78.17%
	Filtering	Chi-squared	76.68%	80.61%	77.77%	80.61%	79.79%	79.59%	79.79%
	Wrapper	Backward selection	73.44%	82.41%	83.83%	82.65%	83.58%	81.00%	76.28%
		Forward selection	70.05%	77.94%	76.14%	78.35%	77.15%	78.39%	72.08%
	Without feature selection	Without feature selection	75.39%	78.97%	80.00%	77.89%	81.59%	79.79%	77.31%

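The evaluation of classification models on the undersampled German credit scoring dataset shows that backward selection with XGBoost achieves the best accuracy (78.66%) and AUC-ROC (78%), together with high precision (88.29%) and a low FPR (23.91%). The highest precision (89.04%) and the lowest FPR (17.39%) are obtained by DT with backward selection, although with the lowest recall among the backward selection models (62.5%). On the other hand, forward selection leads to poor results across most metrics, which suggests that excessive feature elimination negatively affects model performance when working with undersampled data.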

The evaluation of the different models on the German credit scoring dataset under the three scenarios (original, oversampled, and undersampled) shows that the original dataset provides a baseline performance. Oversampling improves model accuracy and recall, while undersampling reduces FPR. Backward selection combined with RF and XGBoost leads to the best performance, while forward selection leads to poor results.

4.4. Results on the Taiwan Credit Scoring Dataset

In this section, the performance of the proposed models is explored on the Taiwan credit scoring dataset across the three scenarios: the original dataset, an oversampled dataset, and an undersampled dataset. Table 5 lists the results of the various feature selection strategies and classification models on the original Taiwan credit scoring dataset, without applying any data resampling techniques.

Table 5.
Evaluation Metrics for the Models Using Original Dataset (The Taiwan Dataset).

Measure type	Method type	Method name	DT	RF	XGBoost	CatBoost	Voting classifier	LightGBM	Bagging
Accuracy	Embedded	RF	82.06%	81.97%	81.95%	81.91%	81.95%	75.15%	80.93%
	Filtering	Chi-squared	82.06%	82.17%	81.75%	81.82%	82.11%	74.53%	82.35%
	Wrapper	Backward selection	82.02%	82.37%	81.80%	82.08%	82.04%	76.22%	81.62%
		Forward selection	82.02%	82.04%	81.64%	81.62%	81.62%	76.31%	80.04%
	Without feature selection	Without feature selection	82%	81.95%	81.75%	81.82%	82.11%	74.53%	81.97%
AUC-ROC	Embedded	RF	66%	65%	66%	66%	68%	71%	65%
	Filtering	Chi-squared	65%	66%	66%	66%	68%	71%	66%
	Wrapper	Backward selection	66%	65%	65%	66%	67%	71%	66%
		Forward selection	66%	66%	66%	65%	68%	71%	65%
	Without feature selection	Without feature selection	65%	65%	66%	66%	68%	71%	66%
FPR	Embedded	RF	5.03%	4.75%	4.89%	4.95%	6.58%	21.59%	6.21%
	Filtering	Chi-squared	4.69%	4.69%	5.29%	5.09%	6.81%	22.68%	4.78%
	Wrapper	Backward selection	4.98%	4.03%	4.60%	4.58%	5.98%	19.53%	5.43%
		Forward selection	4.80%	4.83%	5.43%	5.29%	7.09%	19.19%	7.06%
	Without feature selection	Without feature selection	4.69%	4.66%	5.29%	5.09%	6.81%	22.68%	5.09%
FNR	Embedded	RF	62.91%	64.31%	63.91%	63.91%	58.03%	36.19%	63.91%
	Filtering	Chi-squared	64.41%	63.61%	63.41%	63.81%	56.53%	35.19%	62.51%
	Wrapper	Backward selection	63.32%	65.00%	65.60%	64.41%	59.72%	38.58%	63.51%
		Forward selection	63.91%	63.71%	63.41%	64.01%	57.73%	39.38%	63.31%
	Without feature selection	Without feature selection	64.41%	64.71%	63.41%	63.81%	56.53%	35.19%	63.11%
Recall	Embedded	RF	37.08%	35.69%	36.09%	36.09%	41.97%	63.80%	36.09%
	Filtering	Chi-squared	35.59%	36.39%	36.59%	36.19%	43.46%	64.80%	37.48%
	Wrapper	Backward selection	36.68%	34.99%	34.39%	35.59%	40.27%	61.41%	36.49%
		Forward selection	36.09%	36.29%	36.59%	35.99%	42.27%	60.61%	36.68%
	Without feature selection	Without feature selection	35.59%	35.29%	36.59%	36.19%	43.46%	64.80%	36.88%
Precision	Embedded	RF	67.88%	68.32%	67.91%	67.66%	64.66%	45.87%	62.52%
	Filtering	Chi-squared	68.52%	68.99%	66.48%	67.09%	64.68%	45.04%	69.24%
	Wrapper	Backward selection	67.89%	71.34%	68.18%	69.05%	65.90%	47.42%	65.82%
		Forward selection	68.30%	68.29%	65.88%	66.11%	63.09%	47.53%	59.83%
	Without feature selection	Without feature selection	68.52%	68.47%	66.48%	67.09%	64.68%	45.04%	67.51%
F1-score	Embedded	RF	47.96%	46.88%	47.13%	47.07%	50.90%	53.37%	45.76%
	Filtering	Chi-squared	46.85%	47.65%	47.20%	47.02%	51.99%	53.14%	48.64%
	Wrapper	Backward selection	47.63%	46.95%	45.72%	46.97%	50.00%	53.51%	46.95%
		Forward selection	47.22%	47.39%	47.05%	46.61%	50.62%	53.28%	45.48%
	Without feature selection	Without feature selection	46.85%	46.57%	47.20%	47.02%	51.99%	53.14%	47.71%

Table 6 lists the results of the various feature selection strategies and classification models on the oversampled Taiwan credit scoring dataset. The results in Table 6 show that the accuracy of all models remains high but is slightly lower than the accuracy on the original dataset. Among the feature selection methods, the highest accuracy is achieved when chi-squared filtering and backward selection are employed, which indicates that these methods are the most suitable choices when dealing with an oversampled dataset. In contrast, the FNR decreases for most models compared to the original dataset, while the FPR increases. Therefore, oversampling can be effective in reducing the number of missed defaulters, at the cost of more false alarms.

Table 6.

Evaluation Metrics of the Models Using Oversampled Dataset (The Taiwan Dataset).

Measure type	Method type	Method name	DT	RF	XGBoost	CatBoost	Voting classifier	LightGBM	Bagging
Accuracy	Embedded	RF	72.64%	77.93%	75.51%	75.91%	77.64%	75.95%	77.33%
	Filtering	Chi-squared	72.46%	79.08%	78.13%	78.28%	79.15%	75.40%	80.28%
	Wrapper	Backward selection	73%	79.55%	77.40%	78.35%	78.91%	75.91%	79.97%
		Forward selection	71.8%	78.57%	74.20%	75.53%	76.97%	75.84%	74.53%
	Without feature selection	Without feature selection	71.97%	79.28%	78.06%	78.75%	79.28%	75.40%	80.33%
AUC-ROC	Embedded	RF	62%	69%	65%	67%	69%	71%	65%
	Filtering	Chi-squared	61%	70%	66%	66%	69%	70%	67%
	Wrapper	Backward selection	62%	70%	66%	66%	68%	70%	67%
		Forward selection	60%	69%	64%	65%	67%	70%	64%
	Without feature selection	Without feature selection	60%	70%	66%	66%	69%	71%	67%
FPR	Embedded	RF	19.04%	15.27%	16.19%	16.84%	15.33%	20.39%	13.07%
	Filtering	Chi-squared	18.16%	13.24%	12.30%	11.75%	12.78%	20.65%	8.98%
	Wrapper	Backward selection	17.82%	12.67%	13.04%	11.90%	12.70%	19.36%	9.21%
		Forward selection	19.02%	13.38%	17.64%	15.87%	15.24%	19.45%	17.16%
	Without feature selection	Without feature selection	18.76%	13.30%	12.35%	11.15%	12.70%	21.02%	8.69%
FNR	Embedded	RF	56.33%	45.76%	53.44%	49.35%	46.86%	36.79%	56.13%
	Filtering	Chi-squared	60.22%	47.66%	55.23%	56.43%	48.95%	38.38%	57.13%
	Wrapper	Backward selection	59.02%	47.56%	55.93%	55.63%	50.35%	40.58%	57.73%
		Forward selection	60.22%	49.45%	54.24%	54.44%	50.15%	40.58%	54.44%
	Without feature selection	Without feature selection	60.32%	46.56%	55.33%	56.43%	48.65%	37.09%	57.93%
Recall	Embedded	RF	43.66%	54.23%	46.56%	50.64%	53.14%	63.21%	43.86%
	Filtering	Chi-squared	39.78%	52.34%	44.76%	43.56%	51.04%	61.61%	42.87%
	Wrapper	Backward selection	40.97%	52.44%	44.06%	44.36%	49.56%	59.42%	42.27%
		Forward selection	39.78%	50.54%	45.76%	45.56%	49.85%	59.42%	45.56%
	Without feature selection	Without feature selection	39.68%	53.43%	44.66%	43.56%	51.34%	62.91%	42.07%
Precision	Embedded	RF	39.67%	50.46%	45.20%	46.30%	49.85%	47.06%	49.05%
	Filtering	Chi-squared	38.58%	53.13%	51.08%	51.53%	53.38%	46.11%	57.79%
	Wrapper	Backward selection	39.74%	54.28%	49.22%	51.68%	52.86%	46.81%	56.83%
		Forward selection	37.5%	52.00%	42.65%	45.15%	48.40%	46.70%	43.23%
	Without feature selection	Without feature selection	37.76%	53.54%	50.90%	52.84%	53.70%	46.19%	58.12%
F1-score	Embedded	RF	41.57%	52.28%	45.87%	48.38%	51.44%	53.95%	46.31%
	Filtering	Chi-squared	39.17%	52.73%	47.71%	47.21%	52.19%	52.75%	49.22%
	Wrapper	Backward selection	40.35%	53.34%	46.50%	47.74%	51.20%	52.37%	48.48%
		Forward selection	38.60%	51.26%	44.15%	45.35%	49.11%	52.30%	44.36%
	Without feature selection	Without feature selection	38.69%	53.49%	47.58%	47.75%	52.49%	53.27%	48.81%

Table 7 lists the results of the various feature selection strategies and classification models on the undersampled Taiwan credit scoring dataset. The results in Table 7 show that accuracy on the undersampled dataset is lower than the accuracy on the oversampled dataset. However, RF and XGBoost models that integrate feature selection methods, especially forward selection, achieve high accuracy. The results of these models indicate that feature selection can effectively compensate for the data reduction caused by undersampling. Another important observation regarding the results in Table 7 is that the FNR is significantly lower than the FNR on the oversampled dataset, which means the models are better at detecting defaulters. This was confirmed by applying the paired t-test to the FNR values across all model-feature selection combinations. The results showed a statistically significant difference between the two sampling methods. The paired t-test resulted in a p-value of $8.32 \times 10^{-12}$, which is lower than the standard significance level of 0.05.
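The paired t-test mentioned above can be sketched as follows. For brevity, this illustration uses only the seven FNR values from the "Without feature selection" rows of Tables 6 and 7, whereas the reported test covers all model-feature selection combinations, so it is not expected to reproduce the reported p-value. The function name is an assumption, and only the test statistic is computed; the p-value would be obtained from the t distribution with n - 1 degrees of freedom (for example with scipy.stats.ttest_rel, if SciPy is available).

```python
import math
import statistics

def paired_t_statistic(a, b):
    """Paired t-test statistic for two equally long samples measured on the same items."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two samples of equal length with at least two pairs")
    diffs = [x - y for x, y in zip(a, b)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation of the differences
    t_stat = mean_diff / (sd_diff / math.sqrt(len(diffs)))
    return t_stat, len(diffs) - 1, mean_diff

# FNR (%) without feature selection: DT, RF, XGBoost, CatBoost, Voting classifier, LightGBM, Bagging.
fnr_oversampled = [60.32, 46.56, 55.33, 56.43, 48.65, 37.09, 57.93]   # Table 6
fnr_undersampled = [42.47, 36.79, 35.39, 36.89, 35.89, 35.49, 35.69]  # Table 7

t_stat, dof, mean_diff = paired_t_statistic(fnr_oversampled, fnr_undersampled)
```

A positive statistic here indicates that, for these seven models, the FNR is higher on the oversampled data than on the undersampled data.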

Table 7.

Evaluation Metrics of the Models Using Undersampled Dataset (The Taiwan Dataset).

Measure type	Method type	Method name	DT	RF	XGBoost	CatBoost	Voting classifier	LightGBM	Bagging
Accuracy	Embedded	RF	76.06%	76.84%	74.04%	74.80%	75.15%	74.62%	71.77%
	Filtering	Chi-squared	76.73%	74.62%	74.24%	74.84%	74.31%	73.62%	73.42%
	Wrapper	Backward selection	70.95%	73.80%	74.20%	74.88%	73.86%	73.95%	72.20%
		Forward selection	72.31%	77.04%	76.15%	75.68%	76.64%	75.75%	70.80%
	Without feature selection	Without feature selection	76.73%	74.88%	74.55%	75.71%	74.11%	73.44%	73.24%
AUC-ROC	Embedded	RF	69%	71%	70%	71%	71%	71%	69%
	Filtering	Chi-squared	70%	70%	71%	71%	71%	71%	70%
	Wrapper	Backward selection	69%	70%	71%	71%	70%	70%	68%
		Forward selection	69%	71%	70%	70%	71%	71%	68%
	Without feature selection	Without feature selection	70%	71%	71%	71%	71%	70%	70%
FPR	Embedded	RF	18.53%	18.07%	22.88%	22.05%	21.33%	22.45%	26.28%
	Filtering	Chi-squared	17.76%	22.05%	22.88%	21.99%	22.79%	23.96%	23.99%
	Wrapper	Backward selection	27.25%	23.13%	22.91%	21.65%	23.11%	23.08%	24.68%
		Forward selection	25.14%	17.76%	19.13%	20.13%	18.62%	20.10%	27.28%
	Without feature selection	Without feature selection	17.76%	21.76%	22.59%	20.67%	23.02%	23.99%	24.19%
FNR	Embedded	RF	42.77%	40.88%	36.69%	36.19%	37.09%	35.59%	35.00%
	Filtering	Chi-squared	42.47%	36.99%	35.00%	36.19%	35.79%	34.80%	35.59%
	Wrapper	Backward selection	35.29%	36.89%	35.89%	37.19%	36.69%	36.39%	38.68%
		Forward selection	36.59%	41.08%	40.28%	38.88%	39.88%	38.68%	35.89%
	Without feature selection	Without feature selection	42.47%	36.79%	35.39%	36.89%	35.89%	35.49%	35.69%
Recall	Embedded	RF	57.22%	59.12%	63.31%	63.80%	62.91%	64.40%	65.00%
	Filtering	Chi-squared	57.52%	63.01%	64.20%	63.80%	64.20%	65.20%	64.40%
	Wrapper	Backward selection	64.70%	63.11%	64.10%	62.81%	63.31%	63.60%	61.31%
		Forward selection	63.40%	58.92%	59.72%	61.11%	60.11%	61.31%	64.10%
	Without feature selection	Without feature selection	57.52%	63.21%	64.60%	63.11%	64.10%	64.50%	64.30%
Precision	Embedded	RF	46.97%	48.40%	44.25%	45.35%	45.82%	45.14%	41.50%
	Filtering	Chi-squared	48.16%	45.04%	44.59%	45.42%	44.69%	43.83%	43.50%
	Wrapper	Backward selection	40.51%	43.89%	44.52%	45.42%	44.00%	44.15%	41.61%
		Forward selection	41.98%	48.76%	47.23%	46.54%	48.08%	46.66%	40.26%
	Without feature selection	Without feature selection	48.16%	45.44%	45.06%	46.68%	44.40%	43.53%	43.25%
F1-score	Embedded	RF	51.59%	53.23%	52.09%	53.02%	53.02%	53.08%	50.66%
	Filtering	Chi-squared	52.43%	52.53%	52.63%	53.06%	52.70%	52.42%	51.92%
	Wrapper	Backward selection	49.82%	51.77%	52.55%	52.71%	51.92%	52.12%	49.57%
		Forward selection	50.51%	53.36%	52.75%	52.84%	53.43%	52.99%	49.46%
	Without feature selection	Without feature selection	52.43%	52.87%	53.09%	53.66%	52.46%	51.98%	51.72%

The evaluation of the different models on the Taiwan credit scoring dataset under the three scenarios (original, oversampled, and undersampled) shows that oversampling improves model accuracy and recall but increases FPR. Undersampling improves recall significantly but reduces accuracy and precision. Feature selection methods, especially forward selection, improve accuracy in the undersampling scenario.
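To make the two resampling scenarios concrete, the sketch below balances a toy label set by random undersampling and random oversampling. It is an illustration under stated assumptions: this section does not say which resampling algorithm was used, so plain random resampling is assumed here, and the function names and toy data are hypothetical. In practice, resampling is usually applied to the training split only, so that test metrics reflect the original class distribution.

```python
import random
from collections import Counter


def group_by_class(rows, labels):
    """Map each class label to the list of rows carrying that label."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(label, []).append(row)
    return groups


def random_undersample(rows, labels, seed=0):
    """Randomly drop rows so every class shrinks to the smallest class size."""
    rng = random.Random(seed)
    groups = group_by_class(rows, labels)
    target = min(len(group) for group in groups.values())
    sampled = [(row, label) for label, group in groups.items()
               for row in rng.sample(group, target)]
    return [row for row, _ in sampled], [label for _, label in sampled]


def random_oversample(rows, labels, seed=0):
    """Randomly duplicate rows so every class grows to the largest class size."""
    rng = random.Random(seed)
    groups = group_by_class(rows, labels)
    target = max(len(group) for group in groups.values())
    sampled = []
    for label, group in groups.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        sampled.extend((row, label) for row in group + extra)
    return [row for row, _ in sampled], [label for _, label in sampled]


# Toy data (hypothetical): 8 non-defaulters (0) and 2 defaulters (1).
rows = list(range(10))
labels = [0] * 8 + [1] * 2
_, under_labels = random_undersample(rows, labels)
_, over_labels = random_oversample(rows, labels)
print(Counter(under_labels))  # both classes reduced to 2 rows
print(Counter(over_labels))   # both classes raised to 8 rows
```

Undersampling discards majority-class rows, which is one reason accuracy can fall, while oversampling keeps all rows and repeats minority-class ones.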

4.5. Results on the Australian Credit Scoring Dataset

In this section, the performance of the proposed models is explored on the Australian credit scoring dataset across the three scenarios: the original dataset, an oversampled dataset, and an undersampled dataset. Table 8 lists the results of the various feature selection strategies and classification models on the original Australian credit scoring dataset without applying any data resampling techniques. The results in Table 8 show that RF is the most balanced model. It achieves high accuracy and reasonable FPR and FNR. In contrast, models like XGBoost and CatBoost offer slightly less stable results.

Table 8.
Evaluation Metrics for the Models Using Original Dataset (The Australian Dataset).

Measure type	Method type	Method name	DT	RF	XGBoost	CatBoost	Voting classifier	LightGBM	Bagging
Accuracy	Embedded	RF	86.53%	83.65%	86.53%	84.61%	84.60%	83.65%	85.57%
	Filtering	Chi-squared	85.57%	83.65%	85.57%	83.65%	84.61%	83.65%	85.57%
	Wrapper	Backward selection	84.61%	82.69%	84.61%	85.57%	82.69%	83.65%	84.60%
		Forward selection	87.50%	81.73%	81.73%	86.53%	82.69%	82.69%	83.65%
	Without feature selection	Without feature selection	87.50%	82.96%	84.61%	88.46%	84.61%	84.61%	84.61%
AUC-ROC	Embedded	RF	85%	82%	85%	83%	83%	82%	85%
	Filtering	Chi-squared	84%	82%	85%	82%	84%	82%	85%
	Wrapper	Backward selection	83%	85%	84%	84%	82%	84%	83%
		Forward selection	85%	80%	81%	85%	82%	82%	82%
	Without feature selection	Without feature selection	86%	81%	83%	87%	83%	84%	83%
FPR	Embedded	RF	10.45%	13.43%	10.45%	11.94%	10.45%	13.43%	13.43%
	Filtering	Chi-squared	10.45%	13.43%	11.94%	13.43%	13.43%	12.43%	11.94%
	Wrapper	Backward selection	11.94%	19.40%	13.43%	10.45%	16.42%	16.42%	11.94%
		Forward selection	7.46%	13.43%	16.42%	10.45%	14.93%	14.93%	11.94%
	Without feature selection	Without feature selection	8.96%	11.94%	11.94%	8.69%	11.94%	13.43%	10.45%
FNR	Embedded	RF	18.92%	21.62%	18.92%	21.62%	24.32%	21.62%	16.22%
	Filtering	Chi-squared	21.60%	21.62%	18.92%	21.62%	18.92%	21.62%	18.92%
	Wrapper	Backward selection	21.62%	13.51%	18.92%	21.62%	18.92%	16.22%	21.62%
		Forward selection	21.62%	27.03%	21.62%	18.92%	21.62%	21.62%	24.32%
	Without feature selection	Without feature selection	18.92%	27.03%	21.62%	16.22%	21.62%	18.92%	24.32%
Recall	Embedded	RF	81.08%	78.37%	81.08%	78.37%	75.67%	78.37%	83.78%
	Filtering	Chi-squared	78.37%	78.37%	81.08%	78.37%	81.08%	78.37%	81.08%
	Wrapper	Backward selection	78.37%	86.48%	81.08%	78.37%	81.08%	83.78%	78.37%
		Forward selection	78.37%	72.97%	78.37%	81.08%	78.37%	78.37%	75.67%
	Without feature selection	Without feature selection	81.08%	72.97%	78.37%	83.78%	78.37%	81.08%	75.67%
Precision	Embedded	RF	81.08%	76.31%	81.08%	78.37%	80%	76.31%	77.50%
	Filtering	Chi-squared	80.55%	76.31%	78.94%	76.31%	76.92%	76.31%	78.94%
	Wrapper	Backward selection	78.37%	71.11%	76.92%	80.55%	73.17%	73.80%	78.37%
		Forward selection	85.29%	75%	72.50%	81.08%	74.35%	74.35%	77.77%
	Without feature selection	Without feature selection	83.33%	77.14%	78.37%	83.78%	78.37%	76.92%	80%
F1 Score	Embedded	RF	81.08%	77.33%	81.08%	78.37%	77.77%	77.33%	80.51%
	Filtering	Chi-squared	79.45%	77.33%	80%	76.31%	78.94%	77.33%	80%
	Wrapper	Backward selection	78.37%	78.04%	78.94%	79.45%	76.92%	78.48%	78.37%
		Forward selection	81.69%	73.97%	75.32%	81.08%	76.31%	76.31%	76.71%
	Without feature selection	Without feature selection	82.19%	75%	78.37%	83.78%	78.37%	78.94%	77.77%
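The metrics reported in Table 8 are all functions of the four confusion-matrix counts, and the sketch below shows that relationship. The function name and the counts are assumptions: the source does not report confusion matrices, and the counts used here (30 true positives, 7 false negatives, 7 false positives, 60 true negatives) are only an inference chosen because they reproduce one row of the table.

```python
def metrics_from_counts(tp, fn, fp, tn):
    """Compute evaluation metrics (in percent) from confusion-matrix counts.

    Assumes every denominator is non-zero, i.e. both classes are present
    and at least one positive prediction was made.
    """
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    return {
        "accuracy": 100 * (tp + tn) / (tp + fn + fp + tn),
        "fpr": 100 * fp / (fp + tn),   # false positives among actual negatives
        "fnr": 100 * fn / (fn + tp),   # false negatives among actual positives
        "recall": 100 * recall,        # equals 100 - FNR
        "precision": 100 * precision,
        "f1": 100 * 2 * precision * recall / (precision + recall),
    }


# Hypothetical counts (inferred for illustration, not reported in the source).
m = metrics_from_counts(tp=30, fn=7, fp=7, tn=60)
for name, value in m.items():
    print(f"{name}: {value:.2f}%")
```

With these assumed counts, the computed values agree with the DT row under embedded (RF) feature selection in Table 8 to within about 0.01 percentage points. They also illustrate that FNR and recall always sum to 100%.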

Table 9 lists the results of the various feature selection strategies and classification models on the oversampled Australian credit scoring dataset. The results show notable improvements in certain metrics. For example, the FNR is reduced to 18.92%, while recall reaches 86.48% for RF. These results indicate that oversampling enables the model to identify defaulting customers more effectively. On the other hand, oversampling has a slight impact on accuracy and precision.

Table 9.

Evaluation Metrics of the Models Using Oversampled Dataset (The Australian Dataset).

Measure type	Method type	Method name	DT	RF	XGBoost	CatBoost	Voting classifier	LightGBM	Bagging
Accuracy	Embedded	RF	85.57%	82.69%	85.57%	86.53%	84.61%	86.53%	85.57%
	Filtering	Chi-squared	86.53%	82.69%	85.57%	84.61%	85.57%	86.53%	86.53%
	Wrapper	Backward Selection	85.57%	86.53%	85.57%	86.53%	83.65%	81.73%	84.61%
		Forward Selection	86.53%	82.69%	83.65%	82.69%	81.73%	82.69%	81.73%
	Without Filter Selection	Without Filter Selection	85.57%	82.69%	85.57%	84.61%	85.57%	84.61%	85.57%
ROC AUC	Embedded	RF	85%	84%	85%	86%	84%	86%	85%
	Filtering	Chi-squared	87%	84%	85%	84%	85%	86%	86%
	Wrapper	Backward Selection	84%	87%	85%	85%	82%	81%	83%
		Forward Selection	85%	81%	81%	81%	79%	82%	80%
	Without Filter Selection	Without Filter Selection	85%	84%	84%	83%	85%	84%	85%
FPR	Embedded	RF	11.94%	20.90%	11.94%	11.94%	13.43%	11.94%	13.43%
	Filtering	Chi-squared	13.43%	20.90%	11.94%	13.43%	11.94%	11.94%	11.94%
	Wrapper	Backward Selection	10.45%	13.43%	11.94%	8.96%	13.43%	16.42%	11.94%
		Forward Selection	10.45%	20.90%	8.96%	11.94%	11.94%	14.93%	14.93%
	Without Filter Selection	Without Filter Selection	11.94%	20.90%	10.45%	11.94%	11.94%	13.43%	13.43%
FNR	Embedded	RF	18.92%	10.81%	18.92%	16.22%	18.92%	16.22%	16.22%
	Filtering	Chi-squared	13.51%	10.81%	18.92%	18.92%	18.92%	16.22%	16.22%
	Wrapper	Backward Selection	21.62%	13.43%	18.92%	21.62%	21.62%	21.62%	21.62%
		Forward Selection	18.92%	10.81%	29.73%	27.03%	29.73%	21.62%	24.32%
	Without Filter Selection	Without Filter Selection	18.92%	10.81%	21.62%	21.62%	18.92%	18.92%	16.22%
Recall	Embedded	RF	81.08%	89.18%	81.08%	83.78%	81.08%	83.78%	83.78%
	Filtering	Chi-squared	86.48%	89.18%	81.08%	81.08%	81.08%	83.78%	83.78%
	Wrapper	Backward Selection	78.37%	75.67%	81.08%	78.37%	78.37%	78.37%	78.37%
		Forward Selection	81.08%	75.67%	70.27%	72.97%	70.27%	78.37%	75.67%
	Without Filter Selection	Without Filter Selection	81.08%	89.18%	78.37%	78.37%	81.08%	81.08%	83.78%
Precision	Embedded	RF	78.94%	70.21%	78.94%	79.48%	76.92%	79.48%	77.5%
	Filtering	Chi-squared	78.04%	70.21%	78.94%	76.92%	78.94%	79.48%	79.48%
	Wrapper	Backward Selection	80.55%	75.67%	78.94%	82.85%	76.31%	72.5%	78.37%
		Forward Selection	81.08%	75.67%	81.25%	77.14%	76.47%	74.35%	73.68%
	Without Filter Selection	Without Filter Selection	78.94%	70.21%	80.55%	78.37%	78.94%	76.92%	77.5%
F1 Score	Embedded	RF	80%	78.57%	80%	81.57%	78.94%	81.57%	80.51%
	Filtering	Chi-squared	82.05%	78.57%	80%	78.94%	80%	81.57%	81.57%
	Wrapper	Backward Selection	79.45%	78.04%	80%	80.55%	77.33%	75.32%	78.37%
		Forward Selection	81.08%	75.67%	75.36%	75%	73.23%	76.31%	74.66%
	Without Filter Selection	Without Filter Selection	80%	78.57%	79.45%	78.37%	80%	78.94%	80.51%

Table 10 lists the results of the various feature selection strategies and classification models on the undersampled Australian credit scoring dataset. The results show a decrease in accuracy and precision for many models. Additionally, the FPR for RF is slightly higher compared to the oversampled data. However, recall remains high, which indicates that the models are able to identify defaulting customers using undersampled data. Overall, oversampling offers a better balance between recall, FNR, and accuracy than undersampling.

Table 10.

Evaluation Metrics of the Models Using Undersampled Dataset (The Australian Dataset).

Measure type	Method type	Method name	DT	RF	XGBoost	CatBoost	Voting classifier	LightGBM	Bagging
Accuracy	Embedded	RF	85.57%	82.69%	84.61%	84.61%	83.65%	82.69%	83.65%
	Filtering	Chi-squared	86.53%	82.69%	83.65%	84.61%	83.65%	82.69%	83.65%
	Wrapper	Backward Selection	82.69%	82.69%	83.65%	85.57%	82.69%	83.65%	82.69%
		Forward Selection	83.65%	84.61%	80.76%	81.73%	82.69%	81.73%	81.73%
	Without Filter Selection	Without Filter Selection	86.53%	82.69%	84.61%	87.5%	84.61%	84.61%	84.61%
ROC AUC	Embedded	RF	86%	84%	84%	84%	84%	83%	84%
	Filtering	Chi-squared	87%	84%	83%	84%	84%	83%	84%
	Wrapper	Backward Selection	82%	84%	83%	85%	82%	84%	82%
		Forward Selection	83%	86%	78%	82%	82%	80%	80%
	Without Filter Selection	Without Filter Selection	86%	84%	84%	87%	84%	84%	84%
FPR	Embedded	RF	14.93%	20.90%	14.93%	14.93%	16.42%	17.91%	17.91%
	Filtering	Chi-squared	13.43%	20.90%	14.93%	14.93%	16.42%	17.91%	16.42%
	Wrapper	Backward Selection	16.42%	19.40%	14.93%	11.94%	16.42%	16.42%	16.42%
		Forward Selection	14.93%	19.40%	13.43%	17.91%	14.93%	14.93%	14.93%
	Without Filter Selection	Without Filter Selection	11.94%	20.90%	13.43%	11.94%	14.93%	14.93%	14.93%
FNR	Embedded	RF	13.51%	10.81%	16.22%	16.22%	16.22%	16.22%	13.51%
	Filtering	Chi-squared	13.51%	10.81%	18.92%	16.22%	16.22%	16.22%	16.22%
	Wrapper	Backward Selection	18.92%	13.51%	18.92%	18.92%	18.92%	16.22%	18.92%
		Forward Selection	18.92%	8.11%	29.73%	18.92%	21.62%	24.32%	24.32%
	Without Filter Selection	Without Filter Selection	16.22%	10.81%	18.92%	13.51%	16.22%	16.22%	16.22%
Recall	Embedded	RF	86.48%	89.18%	83.78%	83.78%	83.78%	83.78%	86.48%
	Filtering	Chi-squared	86.48%	89.18%	81.08%	83.78%	83.78%	83.78%	83.78%
	Wrapper	Backward Selection	81.08%	86.48%	81.08%	81.08%	81.08%	83.78%	81.08%
		Forward Selection	81.08%	91.89%	70.27%	81.08%	78.37%	75.67%	75.67%
	Without Filter Selection	Without Filter Selection	83.78%	89.18%	81.08%	86.48%	83.78%	83.78%	83.78%
Precision	Embedded	RF	76.19%	70.21%	75.60%	75.60%	73.80%	72.09%	72.72%
	Filtering	Chi-squared	78.04%	70.21%	75%	75.60%	73.80%	72.09%	73.80%
	Wrapper	Backward Selection	73.17%	71.11%	75%	78.94%	73.17%	73.80%	73.17%
		Forward Selection	75%	72.34%	74.28%	71.42%	74.35%	73.68%	73.68%
	Without Filter Selection	Without Filter Selection	79.48%	70.21%	76.92%	80%	75.60%	75.60%	75.60%
F1 Score	Embedded	RF	81.01%	78.57%	79.48%	79.48%	78.48%	77.49%	79.01%
	Filtering	Chi-squared	82.05%	78.57%	77.92%	79.48%	78.48%	77.49%	78.48%
	Wrapper	Backward Selection	76.92%	78.04%	77.92%	80%	76.92%	78.48%	76.92%
		Forward Selection	77.92%	80.95%	72.22%	75.94%	76.31%	74.66%	74.66%
	Without Filter Selection	Without Filter Selection	81.57%	78.57%	78.94%	83.11%	79.48%	79.48%	79.48%

The experiments conducted using different models show that feature selection not only improves credit scoring performance but also enhances the interpretability of such models. By reducing the number of input features, feature selection methods can simplify the structure of complex classifiers, which makes their decisions easier to understand. For example, applying feature selection to simple models such as DT can result in more concise and interpretable decision rules. Even for ensemble models such as XGBoost and RF, which are typically less interpretable, reducing the feature space may lead to more explainable models. Therefore, it is recommended to employ feature selection not only as a performance enhancement tool but also as a mechanism for improving the interpretability of machine learning models in credit scoring applications.
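As an illustration of how a wrapper method shrinks the feature space, the sketch below implements a generic greedy forward selection. It is not the authors' implementation: the feature names, the `TOY_GAIN` values, and the `toy_score` function are hypothetical stand-ins. In a real pipeline the score would typically be a cross-validated performance measure of the chosen classifier.

```python
def forward_selection(features, score, max_features=None):
    """Greedy forward selection: repeatedly add the feature that most improves score."""
    selected = []
    best_score = float("-inf")
    remaining = list(features)
    limit = len(remaining) if max_features is None else max_features
    while remaining and len(selected) < limit:
        # Score every subset formed by adding one more candidate feature.
        candidates = [(score(selected + [f]), f) for f in remaining]
        top_score, top_feature = max(candidates, key=lambda pair: pair[0])
        if top_score <= best_score:
            break  # no remaining feature improves the score
        selected.append(top_feature)
        remaining.remove(top_feature)
        best_score = top_score
    return selected, best_score


# Hypothetical per-feature gains; a stand-in for real model evaluation.
TOY_GAIN = {"credit_history": 0.30, "income": 0.20, "age": 0.05, "phone": 0.00}


def toy_score(subset):
    # Sum of gains minus a small penalty per feature, so weak features are rejected.
    return sum(TOY_GAIN[f] for f in subset) - 0.06 * len(subset)


selected, best = forward_selection(TOY_GAIN, toy_score)
print(selected, round(best, 2))
```

In this toy run only the two most useful features are kept, which is the kind of reduction that leads to shorter decision rules in a model such as DT.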

5. Conclusion

In this paper, a framework for credit scoring is proposed. The framework consists of three main steps. It starts with data preprocessing. Then, multiple feature selection techniques, such as filtering-based and embedded methods, are examined. Finally, seven classifiers, including RF, XGBoost, and a voting classifier, are evaluated on the German, Taiwan, and Australian credit scoring datasets across three scenarios: the original dataset, an oversampled dataset, and an undersampled dataset. For the original datasets, the best-performing model is RF with backward selection. For the oversampled datasets, XGBoost with backward selection performed best. This paper highlights that feature selection methods, particularly backward selection, improve model performance. RF and XGBoost are the most effective models, and oversampling is the best approach for handling imbalanced datasets. For future research, investigating deep learning models and advanced data enhancement techniques could be a promising avenue for further improving credit scoring models. Furthermore, the proposed models should be validated on proprietary datasets to evaluate their applicability and robustness in real-world contexts. Additionally, exploring the integration of nontraditional data sources, such as social media activity, consumer reviews, or loan application transcripts, could potentially offer additional insights into an individual’s creditworthiness and enable more personalized credit scoring approaches.

Footnotes

ORCID iDs

Mahmoud Abdelsalam

Samir Abdelrazek

Islam R Abdelmaksoud

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

Declaration of Conflicting Interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
