Abstract
Credit scoring has become increasingly important for financial institutions. With the advancement of artificial intelligence, machine learning methods, especially ensemble learning methods, have become increasingly popular for credit scoring. However, the problems of imbalanced data distribution and underutilized feature information have not been sufficiently addressed. To make the credit scoring model more adaptable to imbalanced datasets, the original model-based synthetic sampling method is extended herein to balance the datasets by generating appropriate minority samples to alleviate class overlap. To enable the credit scoring model to extract inherent correlations from features, a new bagging-based feature transformation method is proposed, which transforms features using a tree-based algorithm and selects features using the chi-square statistic. Furthermore, a two-layer ensemble method that combines the advantages of dynamic ensemble selection and stacking is proposed to improve the classification performance of the proposed multi-stage ensemble model. Finally, four standardized datasets are used to evaluate the performance of the proposed ensemble model using six evaluation metrics. The experimental results confirm that the proposed ensemble model is effective in improving classification performance and is superior to other benchmark models.
Introduction
With the development of the economy and society, assessing applicants’ credit accurately and efficiently to decrease the proportion of bad loans is becoming increasingly important for financial institutions. Credit scoring is a typical binary classification problem that predicts whether an applicant is trustworthy using a variety of information. Therefore, an effective credit scoring model that can increase the income of financial institutions and reduce unnecessary losses is required.
The financial classification field has been widely explored in previous works. Crook et al. [14] studied the recent development of credit risk assessment using simple individual models. With the continuous advancement of artificial intelligence in the last decade, ensemble learning models that integrate the advantages of different individual models have been proposed and have gradually become a popular research topic. However, the classification performance of ensemble learning methods can be influenced by several factors, in particular, the distribution of the datasets, the selection of features, and the selection of classifiers.
In general, credit datasets are imbalanced, which significantly influences the performance of classification models. For example, in the datasets owned by financial institutions, the number of bad credit customers is usually much smaller than the number of good credit customers. However, the loss from bad credit customers is far greater than the gain from good credit customers. If a dataset consists of 99% positive samples and 1% negative samples, the classification results produced by machine learning algorithms will be biased toward the positive samples and neglect the negative samples. Therefore, a balanced dataset is necessary for financial institutions to train an effective classification model. Researchers have proposed imposing constraints to ensure the fairness of classification models [2]; however, a more popular and simpler approach is to balance the datasets through sampling to reduce unfairness [34]. Moreover, for classifiers trained using imbalanced datasets, the class overlap between samples is one of the main factors influencing classification performance. Overlap between samples from different classes makes classification harder, and overlap between samples from the same class can result in classifier overfitting [38].
The feature processing of the dataset also has an important role in improving the performance of classification models. Researchers have proposed selecting useful features conditionally for classifier training [37]. For example, Ji et al. [26] proposed a Min-Max ensemble feature selection method to select features efficiently for large-scaled, high-dimension and imbalanced problems. Although these feature selection methods can exploit useful information efficiently, inherent feature correlations are neglected.
The composition of individual classifiers in ensemble models has been explored in previous studies; however, the majority of them only combine some base classifiers statically, including gradient boosting decision tree (GBDT) [20], random forest (RF) [8], extreme gradient boosting (XGBoost) [13], light gradient boosting machine (LightGBM) [27], support vector machine (SVM) [12], and logistic regression (LR) [22]. Recently, the dynamic ensemble selection (DES) method [28], which is an extension of the traditional dynamic classifier selection method [25], has demonstrated superior performance in the field of machine learning. An ensemble of classifiers built through dynamic selection has been shown to be more robust than a single classifier.
Therefore, the motivation of this study is to propose an effective ensemble model for credit scoring by balancing the dataset and transforming the features. The highlights of this study are mainly listed as follows:
A novel multi-stage ensemble model based on synthetic sampling method and feature transformation method is proposed to achieve superior classification performance and computational efficiency.
The model-based synthetic sampling method [31] is extended to alleviate class overlap by selecting minority samples away from the decision boundary and make the model adaptable to imbalanced datasets.
A new bagging-based feature transformation method combined with bootstrap aggregation and feature transformation is proposed to transform existing features into new features that contain more inherent correlations.
A DES-based two-layer ensemble method, which adopts the DES method in the first layer and the stacking method in the second layer, is proposed to compose the base classifiers into an effective ensemble model.
To verify the performance of the proposed model, four standardized datasets are used, and the experimental results are evaluated using six evaluation metrics.
The remainder of this paper is organized as follows. Section 2 presents related work on imbalanced learning approaches, feature processing approaches, and ensemble learning approaches. Section 3 describes the details of the proposed multi-stage ensemble model. Section 4 provides a description of the datasets, evaluation metrics, and parameter settings. Section 5 presents an analysis of the experimental results. In Section 6, the conclusions and future work are presented.
Related work
This section discusses previous research on credit scoring models and elaborates on three aspects: imbalanced learning approaches, feature processing approaches, and ensemble learning approaches.
Imbalanced learning approaches
The problem of imbalanced learning is difficult to address because datasets have class overlap, and the sample size of one class can be considerably larger than that of other classes. For example, the sample size of “no default” in credit scoring datasets is typically significantly larger than that of “default”. The most common method to address the problem of imbalanced datasets is data sampling, which includes under-sampling methods and over-sampling methods.
Under-sampling methods mainly rebalance the datasets by discarding selected majority samples [29]. The simplest under-sampling method is the random under-sampling method, which randomly selects samples from the majority samples to generate a balanced dataset. However, this method may discard informative samples and hence has been extended by researchers. For example, Sun et al. [40] proposed a cluster-based under-sampling method that converts an imbalanced dataset into multiple balanced subsets and then builds a number of classifiers on these multiple subsets. Maglietta et al. [33] proposed a parallel selective sampling method that selects the majority samples based on the distance between samples. In our previous work, He et al. [23] proposed a supervised under-sampling approach called the extended balance cascade approach, which constructs adjustable datasets adaptively based on the data imbalance rate. However, under-sampling methods can cause information loss, which has not been well addressed.
Over-sampling methods mainly rebalance the datasets by adding new minority samples. The simplest over-sampling method is the random over-sampling method that replicates the original minority samples. However, this method can result in overfitting and hence has been extended by researchers. For example, Babu and Ananthanarayanan [4] proposed an enhanced minority over-sampling technique to adjust the distribution of dataset by over-sampling the nearest neighbors of minority samples. Liu and Hsieh [31] proposed a model-based synthetic sampling method that trains regression models to predict the features of new samples. However, the existing over-sampling methods add the generated samples to the dataset for training the classification model directly, without considering the noise of the generated samples and the class overlap.
To overcome the drawbacks of under-sampling and over-sampling methods, the original model-based synthetic sampling method is extended in this study by generating diverse minority samples away from the decision boundary and addressing the noise of the generated samples to alleviate the class overlap.
Feature processing approaches
Feature processing is defined as the processing of feature subsets based on certain criteria [32], which mainly includes feature selection and feature transformation methods.
Feature selection is the process of finding an appropriate subset of features to describe the target object from the original feature sets [41] and has been widely used in machine learning. For example, Bryll et al. [11] presented an attribute bagging method that could improve the performance and robustness of classifier ensembles by randomly selecting subsets of features. Das et al. [16] proposed a bi-objective genetic algorithm-based feature selection method that selects informative features from a dataset using boundary region analysis and multivariate mutual information as the objective functions. However, feature selection methods tend to neglect the inherent correlations among features, resulting in information loss.
Feature transformation methods transform features in a specific manner such that certain information hidden in the features can be extracted. For example, Abedin et al. [1] found that logarithmic and square-root feature transformations could improve the performance of tax default prediction. Parisi and Ravichandran [35] proposed a two-stage genetic algorithm-based feature transformation method, which improved the classification performance and computational efficiency by applying dimension reduction and augmenting the relevant features. In our previous work, Zhang et al. [46] proposed an enhanced multi-population niche genetic algorithm to select competent features and classifiers by enhancing the selection, crossover, and mutation steps, and adding niche and migration steps in the traditional genetic algorithm. Zhang et al. [47] proposed a GBDT-based feature transformation method combined with one-hot encoding and dimension reduction; the experimental results proved the outstanding performance of the above methods. However, the existing feature transformation methods can result in dimension explosion and increase the time cost.
Based on the superior performance of the attribute bagging method and tree-based feature transformation method, a new bagging-based feature transformation method is proposed in this study that combines the advantages of the attribute bagging and feature transformation methods to reduce the time cost and extract the inherent correlations of features.
Ensemble learning approaches
Ensemble learning approaches aim to combine the learning results of multiple base classifiers to improve the performance and generalization of the classification model. Traditional ensemble learning approaches include bootstrap aggregation (Bagging) [7], boosting [19], and stacking [42]. Li and Chen [30] compared the three ensemble approaches mentioned above, demonstrating that ensemble learning yields higher performance than individual learners. In our previous work, Yang et al. [45] proposed a multi-stage ensemble model that combined different classifier selection and composition methods to achieve higher classification accuracy.
With the continuous progress and deepening of research, researchers are no longer satisfied with the performance of static ensemble learning. A more flexible ensemble method called the DES method has been proposed to integrate the base classifiers dynamically. It can automatically integrate the most competent base classifiers by estimating the real-time classification performance of each classifier from a classifier pool [15]. Zyblewski et al. [48] proposed a framework to integrate data preprocessing and DES methods for imbalanced data stream classification, demonstrating that the DES methods combined with data preprocessing outperform other ensemble methods.
Considering the advantages of the DES and stacking methods, a DES-based two-layer ensemble method that adopts the DES method in the first layer and stacking method in the second layer is proposed herein to further improve the performance of the final ensemble model.
Model
The multi-stage ensemble model proposed in this study aims to classify applicants in imbalanced credit datasets. The original credit dataset is divided into training and testing data; the training data are further divided into training and validation sets. The framework of the proposed model is presented in Fig. 1. The model is divided into three stages. In Stage 1, the original model-based synthetic sampling method is extended to generate synthetic samples that are used to balance the training set. In Stage 2, to extract the inherent correlations among the features, a new bagging-based feature transformation method is proposed to transform the features and reduce the feature dimensions. In Stage 3, a DES-based two-layer ensemble method is proposed that integrates the advantages of the DES and stacking methods. The three stages are described in detail in the following subsections.

Framework of proposed multi-stage ensemble model.
The original model-based synthetic sampling method adopts regression models to generate synthetic samples as new minority samples to balance the dataset. The generated synthetic samples need to retain the characteristics of the original dataset [31]. However, all the generated synthetic samples are added to the training set without considering the noise of the synthetic samples, making it difficult to alleviate the class overlap effectively. Therefore, the original model-based synthetic sampling method is extended herein by selecting appropriate generated samples and removing the noise of the synthetic samples. The selected synthetic samples form a normal synthetic sample subset and are added to the training set to alleviate class overlap.
The extended model-based synthetic sampling method balances the dataset through the following processes (Fig. 2). First, the training set is divided into majority and minority samples. Then, a random sampling method with replacement is performed on the minority samples to generate temporary samples. For example, F1, F2, ... , Fn represent the features of the samples. The first feature F1 of each temporary sample is assigned a value that is sampled with replacement from the F1 values of the minority samples. The second feature F2 of each temporary sample is assigned a value that is sampled with replacement from the F2 values of the minority samples. Then, the same process is applied to assign values to the other features of each temporary sample.

Schematic diagram of extended model-based synthetic sampling method.
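The construction of the temporary samples can be illustrated with a minimal sketch. The function name `make_temporary_samples`, the parameter `n_samples`, and the toy minority rows are hypothetical and only serve to show that each feature is drawn independently, with replacement, from the minority values of that same feature.

```python
import random

def make_temporary_samples(minority, n_samples, seed=0):
    """Build temporary samples by drawing each feature independently.

    minority:  list of minority-class rows, each a list of n feature values.
    n_samples: number of temporary samples to create (hypothetical parameter).
    """
    rng = random.Random(seed)
    n_features = len(minority[0])
    # columns[j] holds every observed minority value of feature F(j+1).
    columns = [[row[j] for row in minority] for j in range(n_features)]
    # Each feature of each temporary sample is sampled with replacement
    # from the values of the same feature in the minority samples.
    return [[rng.choice(columns[j]) for j in range(n_features)]
            for _ in range(n_samples)]

# Toy minority class with three features F1, F2, F3.
minority_rows = [[1.0, 10.0, 0.1], [2.0, 20.0, 0.2], [3.0, 30.0, 0.3]]
temporary = make_temporary_samples(minority_rows, n_samples=5)
```

Because the features are drawn independently, a temporary sample may combine values that never occur together in a real minority sample, which is why the regression step described next is applied before the samples are used.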
Subsequently, the regression models are trained to generate synthetic samples from temporary samples as new minority samples. For example, Regression model 1 is trained using the original training set, with the first feature F1 regarded as the label and the remaining features (i.e., F2–Fn) regarded as the input feature vectors. Similarly, Regression model 2 is trained using the original training set, with the second feature F2 regarded as the label and the remaining features (i.e., F1, F3–Fn) as the input feature vectors. The same scheme is applied to train the remaining Regression models 3–n. Then, the temporary samples and regression models are employed to generate the synthetic samples. For example, the F1 value of each synthetic sample is obtained from the regression result of Regression model 1, which uses the remaining features of the corresponding temporary sample except F1 (i.e., F2–Fn) as the input feature vector. The F2 value of each synthetic sample is obtained from the regression result of Regression model 2, which uses the remaining features of the corresponding temporary sample except F2 (i.e., F1, F3–Fn) as the input feature vector. The same process is followed to obtain the values of the remaining features (i.e., F3–Fn) of each synthetic sample.
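The per-feature regression scheme can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: ordinary least squares with an intercept (via NumPy) stands in for the unspecified regression model, and the function names `fit_feature_regressors` and `generate_synthetic` are hypothetical.

```python
import numpy as np

def _design(X, j):
    """All features except column j, plus a constant column for the intercept."""
    others = np.delete(X, j, axis=1)
    return np.hstack([others, np.ones((others.shape[0], 1))])

def fit_feature_regressors(X_train):
    """Fit Regression model j: feature F_j is the label, the rest are inputs."""
    models = []
    for j in range(X_train.shape[1]):
        coef, *_ = np.linalg.lstsq(_design(X_train, j), X_train[:, j], rcond=None)
        models.append(coef)
    return models

def generate_synthetic(temporary, models):
    """Predict each feature of a synthetic sample from the remaining
    features of the corresponding temporary sample."""
    synthetic = np.empty_like(temporary, dtype=float)
    for j, coef in enumerate(models):
        synthetic[:, j] = _design(temporary, j) @ coef
    return synthetic

# Toy training set in which F2 = 2 * F1 exactly.
X_train = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]])
models = fit_feature_regressors(X_train)
# The temporary sample (F1=1, F2=4) violates that relation; the regressions
# map it to F1 = 4 / 2 and F2 = 2 * 1.
synthetic = generate_synthetic(np.array([[1.0, 4.0]]), models)
```

In this toy case the inconsistent temporary sample is pulled back onto the relation between the features that holds in the training set, which is the sense in which the synthetic samples retain the characteristics of the original data.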
The extended model-based synthetic sampling method generates synthetic samples that are similar to the minority samples. Therefore, once the synthetic samples are generated, three widely recognized base classifiers (i.e., XGBoost, LightGBM, and GBDT) are selected and trained using the original training set, and then integrated to predict the labels of the synthetic samples. Different from the base classifiers used to obtain the final classification results in Stage 3, these trained classifiers are used to select the competent minority samples from all generated synthetic samples. Synthetic samples classified as minorities are regarded as normal samples, whereas synthetic samples classified as majorities are regarded as noise samples. The noise samples are removed from the synthetic samples; the normal synthetic samples are aggregated as a normal synthetic sample subset and subsequently added to the training set. If the current imbalance ratio (IR) of the training set is larger than the IR setting, i.e., r, the normal synthetic sample subsets are generated again until a balanced training set is obtained.
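The filtering loop can be sketched as below. The way the three classifiers are integrated is not specified in the text, so majority voting is assumed here; the classifiers are replaced by simple threshold functions, and `filter_synthetic`, `balance_minority`, `generate_batch`, and `max_rounds` are hypothetical names.

```python
import math

def filter_synthetic(synthetic, classifiers, minority_label=1):
    """Keep synthetic samples that most classifiers label as minority
    (normal samples); the others are treated as noise and dropped."""
    normal = []
    for sample in synthetic:
        votes = sum(1 for clf in classifiers if clf(sample) == minority_label)
        if 2 * votes > len(classifiers):
            normal.append(sample)
    return normal

def balance_minority(n_majority, minority, generate_batch, classifiers,
                     r=1.0, max_rounds=100):
    """Add normal synthetic subsets until n_majority / n_minority <= r."""
    minority = list(minority)
    target = math.ceil(n_majority / r)  # minority size at which IR <= r
    for _ in range(max_rounds):
        if len(minority) >= target:
            break
        batch = filter_synthetic(generate_batch(), classifiers)
        minority.extend(batch[:target - len(minority)])
    return minority

# Stand-ins for the trained XGBoost, LightGBM, and GBDT classifiers.
classifiers = [lambda s: int(s[0] > 0.5),
               lambda s: int(s[0] > 0.7),
               lambda s: int(s[0] > 0.4)]
batch = [[0.9], [0.6], [0.1]]
balanced = balance_minority(6, [[0.8], [0.7]], lambda: batch, classifiers)
```

With r = 1, the loop stops once the minority class has as many samples as the majority class; the `max_rounds` guard is an added safeguard for the case in which every generated sample is rejected.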
To extract the inherent correlations of the features and prevent dimension explosion, a new bagging-based feature transformation method is proposed in this study; it combines the advantages of the attribute bagging method and feature transformation method.
As depicted in Fig. 3, the attribute bagging method is first adopted to select approximately half of the features of the balanced training set, through random feature bootstrap sampling, resulting in multiple feature subsets (i.e. Feature subset 1, Feature subset 2, etc.). Based on He et al. [24], a tree algorithm (e.g., XGBoost) is used to transform the feature subsets and generate new features automatically such that the inherent correlations among the features can be extracted. The XGBoost algorithms with different parameters (i.e., XGBoost 1, XGBoost 2, etc.) are trained using the corresponding feature subsets. Each sample with selected features is regarded as the root node of the XGBoost algorithm for tree growth. The leaf nodes of the tree contain the transformed feature information of the feature subsets and are regarded as new features.

Schematic diagram of bagging-based feature transformation method.
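The two ingredients of this stage can be sketched in a few lines. The sketch assumes that each subset contains half of the features drawn without repetition, and that the leaf reached in each tree is turned into binary features by one-hot encoding; neither detail is stated in the text. The leaf indices are assumed to come from an already trained tree model, and the names `attribute_bagging` and `one_hot_leaves` are hypothetical.

```python
import random

def attribute_bagging(n_features, n_subsets, seed=0):
    """Draw n_subsets random feature subsets, each with about half the features."""
    rng = random.Random(seed)
    size = max(1, n_features // 2)
    return [sorted(rng.sample(range(n_features), size)) for _ in range(n_subsets)]

def one_hot_leaves(leaf_ids):
    """leaf_ids[i][t] is the leaf that sample i reaches in tree t.
    Every distinct leaf of every tree becomes one binary feature."""
    n_trees = len(leaf_ids[0])
    vocab = [sorted({row[t] for row in leaf_ids}) for t in range(n_trees)]
    encoded = []
    for row in leaf_ids:
        features = []
        for t in range(n_trees):
            features.extend(1 if row[t] == leaf else 0 for leaf in vocab[t])
        encoded.append(features)
    return encoded

subsets = attribute_bagging(n_features=10, n_subsets=3)
# Three samples, two trees: tree 1 has leaves {3, 4}, tree 2 has leaves {1, 2}.
new_features = one_hot_leaves([[3, 1], [4, 1], [3, 2]])
```

The number of new features grows with the number of trees and leaves, which illustrates why a dimension reduction step is needed after the transformation.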
To prevent dimension explosion, the chi-square statistic [36] is employed to reduce the feature dimensions through feature selection, resulting in the selected feature subsets. Multiple selected feature subsets are then aggregated. Considering that the dimension of aggregated features may remain large, the chi-square statistic is adopted again to further reduce the final feature dimensions, resulting in the feature transformed training set used for the next stage.
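A minimal sketch of one chi-square selection step is given below. It assumes non-negative (for example, one-hot) features and uses the common count-based form of the statistic, in which the observed per-class feature totals are compared with the totals expected under independence; the exact variant used in the study is not specified. The names `chi2_scores` and `select_top_k` are hypothetical.

```python
import numpy as np

def chi2_scores(X, y):
    """Chi-square statistic of each non-negative feature against the class label."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    Y = (y[:, None] == classes[None, :]).astype(float)   # samples x classes
    observed = Y.T @ X                                   # classes x features
    expected = np.outer(Y.mean(axis=0), X.sum(axis=0))   # under independence
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = (observed - expected) ** 2 / expected
    # Features that are zero everywhere give 0/0 and are scored as 0.
    return np.nan_to_num(terms).sum(axis=0)

def select_top_k(X, y, k):
    """Indices of the k features with the largest chi-square statistic."""
    order = np.argsort(chi2_scores(X, y))[::-1][:k]
    return np.sort(order)

X = [[1, 0, 1], [1, 0, 0], [0, 1, 1], [0, 1, 0]]
y = [1, 1, 0, 0]
scores = chi2_scores(X, y)     # features 0 and 1 follow the label, feature 2 does not
kept = select_top_k(X, y, k=2)
```

In the two-step scheme described above, a function such as `select_top_k` would be applied once to each transformed feature subset and once more to the aggregated features.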
To combine the respective advantages of different individual classifiers, an effective ensemble method is required. Therefore, a DES-based two-layer ensemble method that combines both the DES and stacking methods is proposed to integrate the base classifiers trained using the feature transformed training sets.
The DES method assumes that different base classifiers perform best in different local regions of the sample space. When test samples are given, the DES method can automatically select the best-performing base classifier to classify each test sample according to the local region in which the test sample is located [15]. In this study, the classifier selected through the DES method is denoted the DES classifier.
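The idea of selecting a classifier by its local performance can be sketched as follows. The specific DES rule used in the study is not stated; the sketch assumes a simple local-accuracy rule, in which the region of competence is the k nearest validation samples, and it is not the implementation provided by the "deslib" module. The name `select_by_local_accuracy` and the toy classifiers are hypothetical.

```python
def select_by_local_accuracy(x, val_X, val_y, classifiers, k=3):
    """Return the index of the classifier that is most accurate on the
    k validation samples nearest to the test sample x."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    # Region of competence: indices of the k nearest validation samples.
    nearest = sorted(range(len(val_X)), key=lambda i: sq_dist(x, val_X[i]))[:k]

    def local_accuracy(clf):
        return sum(clf(val_X[i]) == val_y[i] for i in nearest) / len(nearest)

    scores = [local_accuracy(clf) for clf in classifiers]
    return scores.index(max(scores))

val_X = [[0.0], [0.1], [0.2], [1.0], [1.1], [1.2]]
val_y = [0, 0, 0, 1, 1, 1]
# Two deliberately one-sided classifiers: each is correct in only one region.
classifiers = [lambda s: 0, lambda s: 1]
left = select_by_local_accuracy([0.05], val_X, val_y, classifiers)
right = select_by_local_accuracy([1.1], val_X, val_y, classifiers)
```

Although neither toy classifier is accurate overall, each is selected in the region where it is correct, which is the behavior that a static combination cannot provide.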
Stacking is also a powerful ensemble method that combines the classification results of several homogeneous or heterogeneous classifiers to achieve superior classification performance. It trains a meta-classifier using the predictive results of individual classifiers as input features. LR has been proven to be a superior choice as a meta-classifier to improve stacking performance [43]. Therefore, LR is used as the meta-classifier in this study.
As shown in Fig. 4, n base classifiers (i.e. Clf 1, Clf 2, etc.) from the base classifier pool are trained using a feature transformed training set and further compared based on the area under the receiver operating characteristic (ROC) curve (AUC) [18]. The m base classifiers (i.e. Sclf 1, Sclf 2, etc.) with higher AUC values on the validation set are selected as the competent classifiers. In the first layer of the DES operation, k DES classifiers (i.e. DESclf 1, DESclf 2, etc.) are generated through the permutation and combination of m selected competent base classifiers. To improve the diversity of the DES classifiers, the DES classifiers that produce correlations higher than the average are filtered, and the remaining classifiers are used for the next layer. In the second layer for stacking integration, the predictive results of the selected DES classifiers are further integrated into the final result by the meta-classifier. Finally, the ensemble model with the best predictive accuracy is selected as the proposed ensemble model.

Schematic diagram of DES-based two-layer ensemble method.
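The diversity filter of the first layer can be sketched as below. The text does not define how the correlation of a DES classifier is measured, so the sketch assumes the mean Pearson correlation between its validation outputs and those of every other DES classifier, and removes the classifiers whose mean correlation exceeds the overall average. The names `pearson` and `filter_correlated` and the toy outputs are hypothetical.

```python
def pearson(a, b):
    """Pearson correlation of two equally long, non-constant sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5

def filter_correlated(outputs):
    """outputs[name] holds the predicted probabilities of one DES classifier
    on the validation set. Classifiers whose mean correlation with the
    others is above the overall average are removed."""
    names = list(outputs)
    mean_corr = {}
    for a in names:
        others = [pearson(outputs[a], outputs[b]) for b in names if b != a]
        mean_corr[a] = sum(others) / len(others)
    average = sum(mean_corr.values()) / len(mean_corr)
    return [name for name in names if mean_corr[name] <= average]

outputs = {
    "DESclf 1": [0.1, 0.2, 0.8, 0.9],
    "DESclf 2": [0.1, 0.2, 0.8, 0.9],   # identical to DESclf 1
    "DESclf 3": [0.9, 0.1, 0.8, 0.2],
}
remaining = filter_correlated(outputs)
```

Under this rule, both members of a highly correlated pair are removed. A variant that keeps one representative of each correlated group would also be consistent with the stated aim of increasing diversity.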
In this section, the experimental settings including datasets, data processing, evaluation metrics, and parameter settings, used to evaluate the performance of the proposed multi-stage ensemble model are introduced.
Datasets and data processing
In this experiment, three datasets (i.e., Australian, German, and Japanese) from the UCI machine learning library [3] and one dataset (i.e., Hmeq) from Kaggle [5] are used to verify the performance of the proposed model. The details of the datasets are presented in Table 1.
Detail of datasets
The Australian dataset contains 690 samples, comprising 307 positive samples and 383 negative samples. Its imbalance ratio is 1.25, and its total feature dimension is 15. The German dataset contains 1000 samples, comprising 700 positive samples and 300 negative samples. Its imbalance ratio is 2.33, and its total feature dimension is 21. The Japanese dataset has the same sample size and imbalance ratio as the Australian dataset, and its feature dimension is 16. The Hmeq dataset contains 5960 samples, comprising 1189 positive samples and 4771 negative samples. Its imbalance ratio is 4.01, and its total feature dimension is 13.
Six evaluation metrics are used in this study to evaluate the performance of the proposed model. These include accuracy (ACC) [39], balanced accuracy (BA) [10], AUC, F-score, Log loss [6], and Brier score [9]. The evaluation metrics are calculated using true positive (TP), true negative (TN), false positive (FP), and false negative (FN), which originate from the confusion matrix (Table 2).
Confusion matrix
ACC is a widely used evaluation metric that indicates the proportion of correctly classified samples to all samples. ACC can be calculated using Equation (1).
BA is commonly used to evaluate the performance of imbalanced datasets. The formula for BA is given in Equation (2), where the true positive rate (TPR) and true negative rate (TNR) can be calculated using Equation (3) and Equation (4), respectively.
AUC is also a popular evaluation metric; it is defined as the area under the ROC curve. The higher the AUC, the better the performance of the classifier model.
F-score is a harmonic average value that combines precision and recall. The precision and recall are defined in Equation (5) and Equation (6), respectively. The F-score is defined in Equation (7).
Log loss quantifies the accuracy of the classifiers; a smaller Log loss means a higher accuracy of the classifier. The definition of the Log loss is given in Equation (8), where y_i represents the true class of sample i and p_i represents the probability that the model predicts sample i as positive.
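For reference, the standard forms of these metrics, written with the confusion-matrix counts, are sketched below. The numbering follows the equation references in the text; taking the F-score to be the balanced F1 score and N to be the number of samples are assumptions.

```latex
\begin{align}
\mathrm{ACC} &= \frac{TP + TN}{TP + TN + FP + FN} \tag{1}\\
\mathrm{BA} &= \frac{\mathrm{TPR} + \mathrm{TNR}}{2} \tag{2}\\
\mathrm{TPR} &= \frac{TP}{TP + FN} \tag{3}\\
\mathrm{TNR} &= \frac{TN}{TN + FP} \tag{4}\\
\mathrm{Precision} &= \frac{TP}{TP + FP} \tag{5}\\
\mathrm{Recall} &= \frac{TP}{TP + FN} \tag{6}\\
\text{F-score} &= \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{7}\\
\text{Log loss} &= -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \right] \tag{8}
\end{align}
```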
Brier score is a score function that measures the accuracy of probabilistic predictions. This can be considered a loss function. Similar to Log loss, a lower Brier score indicates better classification performance.
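The metrics described above can be computed with a short sketch using only the Python standard library. It assumes the balanced F1 form of the F-score and the usual mean squared difference form of the Brier score; the function names and the example counts are hypothetical.

```python
import math

def confusion_metrics(tp, tn, fp, fn):
    """ACC, BA, and F-score from the confusion-matrix counts."""
    tpr = tp / (tp + fn)          # true positive rate (recall)
    tnr = tn / (tn + fp)          # true negative rate
    precision = tp / (tp + fp)
    return {
        "ACC": (tp + tn) / (tp + tn + fp + fn),
        "BA": (tpr + tnr) / 2,
        "F-score": 2 * precision * tpr / (precision + tpr),
    }

def log_loss(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood; p is clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)

def brier_score(y_true, p_pred):
    """Mean squared difference between predicted probability and true class."""
    return sum((p - y) ** 2 for y, p in zip(y_true, p_pred)) / len(y_true)

metrics = confusion_metrics(tp=40, tn=45, fp=5, fn=10)
```

For ACC, BA, AUC, and F-score, higher values are better, whereas for Log loss and Brier score, lower values are better; this difference in direction matters when results are compared across tables.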
The original data were divided into two parts: 20% of the original data were used as testing data and 80% were used as training data. The training data were further divided into training and validation sets, with proportions of 80% and 20%, respectively. In the data preprocessing stage, the standardization and normalization methods were imported from the Python module “sklearn.” In the extended model-based synthetic sampling method, the classifiers (i.e., XGBoost, LightGBM, and GBDT) used to classify the synthetic samples were imported from the Python modules “xgboost,” “lightgbm,” and “sklearn,” respectively. The regression model was imported from the Python module “sklearn.” The final IR, i.e., r, was set to one. In the bagging-based feature transformation method, the default XGBoost, attribute bagging algorithm, and chi-square statistic were adopted, and the final feature dimension was set to 200. In the DES-based two-layer ensemble method, seven base classifiers (i.e., XGBoost, GBDT, AdaBoost, RF, LR, SVM, and LightGBM) were selected to generate the base classifier pool. XGBoost was imported from the Python module “xgboost”; GBDT, AdaBoost, RF, LR, and SVM were imported from the Python module “sklearn”; and LightGBM was imported from the Python module “lightgbm.” The DES methods were imported from the Python module “deslib.” All classifier parameters were set to their defaults for a fair comparison.
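The nested split can be sketched as follows. Whether the split was stratified by class is not stated, so a plain random split is assumed; the function name `split_indices` and its parameters are hypothetical.

```python
import random

def split_indices(n_samples, test_frac=0.2, val_frac=0.2, seed=0):
    """Split sample indices into training, validation, and testing parts."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    n_test = round(n_samples * test_frac)
    test, rest = idx[:n_test], idx[n_test:]
    # The validation set is taken from the remaining training data.
    n_val = round(len(rest) * val_frac)
    val, train = rest[:n_val], rest[n_val:]
    return train, val, test

train, val, test = split_indices(1000)
```

For a dataset of 1000 samples, this yields 640 training, 160 validation, and 200 testing samples, i.e., 64%, 16%, and 20% of the original data.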
Experimental analysis
In this study, four datasets and six evaluation metrics were used to evaluate the performance of the proposed multi-stage ensemble model. To ensure experimental robustness, each model was trained and tested ten times, and the average value of each evaluation metric was reported as the result. All experiments were run on a computer with a 2.6 GHz Intel Core i7 processor, 16 GB of RAM, and the Microsoft Windows 10 operating system using Python 3.7. The experimental analysis is presented in the following subsections.
Baseline performance
In this study, to evaluate the performance of the proposed model, seven base classifiers (i.e., XGBoost, GBDT, AdaBoost, RF, LR, SVM, and LightGBM) were adopted as the baselines for comparison. The performance of the baselines on four real-world credit datasets (i.e., Australian, German, Japanese, and Hmeq) is presented in Table 3.
Baseline results
To demonstrate the performance of the extended model-based synthetic sampling method, the performance of the base classifiers with the extended model-based synthetic sampling method is presented in Table 4. Results superior to the baseline results in Table 3 are highlighted in bold. The table indicates that the majority of the evaluation metrics improve after the extended model-based synthetic sampling method is integrated with the base classifiers, which confirms that the extended model-based synthetic sampling method can increase the adaptability of the base classifiers to imbalanced datasets and improve their classification ability.
Performance evaluation of extended model-based synthetic sampling method
Most evaluation metrics indicate that most of the base classifiers achieve better performance after the extended model-based synthetic sampling method is applied. To show the improvement more clearly, the proportion of superior results (in bold) among all results is calculated. It shows that 64.9% of the evaluation metrics are improved, demonstrating that the extended model-based synthetic sampling method can deal with imbalanced data effectively.
To demonstrate the performance of the bagging-based feature transformation method, the performance of the base classifiers with both the extended model-based synthetic sampling method and bagging-based feature transformation method is presented in Table 5. Results superior to the evaluation results of the base classifiers with only the extended model-based synthetic sampling method in Table 4 are highlighted in bold. The table indicates that the majority of the evaluation metrics improve after the bagging-based feature transformation method is integrated with the base classifiers, which confirms that the bagging-based feature transformation method can explore inherent correlations between features and improve the classification ability of the base classifiers.
Performance evaluation of bagging-based feature transformation method
The comparison of evaluation metrics between Table 4 and Table 5 indicates that most of the base classifiers achieve better performance after the bagging-based feature transformation method is applied. To show the improvement more clearly, the proportion of superior results (in bold) among all results is calculated. It shows that 85.1% of the evaluation metrics are improved, demonstrating that the bagging-based feature transformation method can effectively exploit the inherent correlations among the features.
In the DES-based two-layer ensemble method, seven base classifiers are selected conditionally to generate the DES classifiers. LR and SVM are both linear classification models, XGBoost, GBDT, RF, and LightGBM are tree-based classification models, and AdaBoost is an aggregation of tree classification models. To select competent base classifiers, all base classifiers are tested on the validation set. A base classifier whose AUC is higher than the average AUC of all base classifiers is considered a competent classifier for the corresponding dataset. As indicated in Table 6, for example, XGBoost, AdaBoost, and LR are selected as competent base classifiers for the Australian dataset.
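The above-average AUC rule can be sketched with the standard library only. The AUC is computed here from pairwise comparisons of positive and negative scores, which is equivalent to the area under the ROC curve; the function names and the toy validation scores are hypothetical.

```python
def auc(y_true, scores):
    """Probability that a random positive is scored above a random negative
    (ties count as one half)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def select_competent(y_val, val_scores):
    """Keep the classifiers whose validation AUC exceeds the pool average."""
    aucs = {name: auc(y_val, s) for name, s in val_scores.items()}
    mean_auc = sum(aucs.values()) / len(aucs)
    return [name for name, value in aucs.items() if value > mean_auc]

y_val = [1, 1, 0, 0]
val_scores = {
    "Clf 1": [0.9, 0.8, 0.2, 0.1],   # AUC = 1.00
    "Clf 2": [0.9, 0.3, 0.4, 0.1],   # AUC = 0.75
    "Clf 3": [0.2, 0.1, 0.9, 0.8],   # AUC = 0.00
}
competent = select_competent(y_val, val_scores)
```

Because the threshold is the pool average, the number m of competent classifiers varies from dataset to dataset rather than being fixed in advance.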
Selected competent base classifiers for each dataset
Then, the DES classifiers generated through the permutation and combination of selected competent base classifiers are filtered according to the correlations among them. Figure 5 displays the correlation among 15 DES classifiers in the form of a heatmap for different datasets, where the darker color indicates a higher correlation. The DES classifiers with higher correlations are subsequently removed to increase the diversity of the DES classifiers.

Heatmap of DES classifiers correlations on different datasets.
To verify the superiority of the proposed ensemble model, the final ensemble results are compared with those of the best-performing base classifiers with both the extended model-based synthetic sampling method and bagging-based feature transformation method. In Table 7, the proposed ensemble model is denoted as Eclf, and the best-performing base classifier is denoted as Clf. For each dataset, the ensemble results superior to the evaluation results of the best-performing base classifier are highlighted in bold. The table indicates that the majority of the evaluation metrics (i.e., 58.3%) of the proposed ensemble model are superior to those of the best-performing base classifiers, which confirms that the DES-based two-layer ensemble method can select and compose the competent base classifiers dynamically to achieve superior classification performance.
Performance evaluation of DES-based two-layer ensemble method
The performance of the proposed ensemble model is compared with that of other recent benchmark ensemble models proposed by He et al. [23], Zhang et al. [46], García et al. [21], Xiao et al. [44], and Liu and Hsieh [31]. In Table 8, the evaluation results of the proposed ensemble model that are superior to those of the other benchmark models are highlighted in bold. Certain evaluation metrics are not adopted in the benchmark works; hence, the corresponding indicators are marked as “/” in this table. The table indicates that the proposed ensemble model demonstrates the best performance for the majority of the evaluation metrics. To make the comparison clearer, Fig. 6(a)-(d) shows the AUC of the proposed model and the benchmark models on the Australian, German, Japanese, and Hmeq datasets, respectively.

Comparison of AUC results among different models on four datasets.
Performance comparisons between the proposed model and benchmark models
Beyond the above benchmark ensemble models, it is worth noting that the proposed model performs worse than another ensemble model, proposed by Zhang et al. [47], on noisy, small credit datasets, although the two are not directly comparable. The proposed model and the other benchmark ensemble models handle outliers (noise) by removing them, whereas Zhang et al. [47] handle outliers by boosting them, which tends to bias the model toward the outlier samples [17], cause overfitting to the outliers, and increase computational complexity. Although the model of Zhang et al. [47] performs better on noisy, small credit datasets, it cannot be taken for granted that it performs well on large-scale credit datasets with little noise. Because the two models rely on different noise-processing methods, future work can explore large-scale credit datasets with varying levels of noise by integrating the two ideas for handling outliers.
Ensemble models have attracted great attention in various application fields including credit scoring. This study proposes a novel multi-stage ensemble model based on synthetic sampling and feature transformation through the extended model-based synthetic sampling method, a new bagging-based feature transformation method and a DES-based two-layer ensemble method. The proposed ensemble model is verified on four standardized credit datasets using six evaluation metrics. The experimental results demonstrate the superior performance of the proposed ensemble model over other benchmark models.
However, this study has limitations. First, the influence of sensitive features on the classification fairness of the model is not considered. Second, only one feature transformation method is used in the feature transformation stage; integrating several heterogeneous feature transformation methods could improve the robustness of the model. Finally, the performance of the proposed model could be improved by integrating different outlier-handling methods. These issues will be addressed in future work.
Acknowledgments
This work has been supported by the National Natural Science Foundation of China (No. 51875503, No. 51975512), the Zhejiang Natural Science Foundation of China (No. LZ20E050001), the Zhejiang Key R&D Project of China (No. 2021C03153), and the Humanities and Social Sciences Research Project of the Education Ministry of China (No. 20YJC870003).
