A novel multi-stage ensemble model for credit scoring based on synthetic sampling and feature transformation

Abstract

Credit scoring has become increasingly important for financial institutions. With the advancement of artificial intelligence, machine learning methods, especially ensemble learning methods, have become increasingly popular for credit scoring. However, the problems of imbalanced data distribution and underutilized feature information have not been sufficiently addressed. To make the credit scoring model more adaptable to imbalanced datasets, the original model-based synthetic sampling method is extended herein to balance the datasets by generating appropriate minority samples to alleviate class overlap. To enable the credit scoring model to extract inherent correlations from features, a new bagging-based feature transformation method is proposed, which transforms features using a tree-based algorithm and selects features using the chi-square statistic. Furthermore, a two-layer ensemble method that combines the advantages of dynamic ensemble selection and stacking is proposed to improve the classification performance of the proposed multi-stage ensemble model. Finally, four standardized datasets are used to evaluate the performance of the proposed ensemble model using six evaluation metrics. The experimental results confirm that the proposed ensemble model is effective in improving classification performance and is superior to other benchmark models.

Keywords

Ensemble learning, credit scoring, synthetic sampling, feature transformation

1 Introduction

With the development of the economy and society, assessing applicants’ credit accurately and efficiently to decrease the proportion of bad loans is becoming increasingly important for financial institutions. Credit scoring is a typical binary classification problem that predicts whether an applicant is trustworthy using a variety of information. Therefore, an effective credit scoring model that can increase the income of financial institutions and reduce unnecessary losses is required.

The financial classification field has been widely explored in previous works. Crook et al. [14] studied the recent development of credit risk assessment using simple individual models. With the continuous advancement of artificial intelligence in the last decade, ensemble learning models that integrate the advantages of different individual models have been proposed and have gradually become a popular research topic. However, the classification performance of ensemble learning methods can be influenced by several factors, in particular, the distribution of the datasets, the selection of features, and the selection of classifiers.

Credit datasets are generally imbalanced, which significantly influences the performance of classification models. For example, in the datasets owned by financial institutions, the number of bad credit customers is usually much smaller than the number of good credit customers. However, the loss from bad credit customers is far greater than the gain from good credit customers. If a dataset consists of 99% positive samples and 1% negative samples, the classification results produced by machine learning algorithms will be biased toward the positive samples and will neglect the negative samples. Therefore, a balanced dataset is necessary for financial institutions to train an effective classification model. Researchers have proposed imposing constraints to ensure the fairness of classification models [2]; however, a more popular and simpler approach is to use balanced datasets through sampling to reduce unfairness [34]. Moreover, for classifiers trained using imbalanced datasets, the class overlap between samples is one of the main factors influencing classification performance. The overlap between samples from different classes makes classification harder, and the overlap between samples from the same class can result in classifier overfitting [38].

The feature processing of the dataset also plays an important role in improving the performance of classification models. Researchers have proposed selecting useful features conditionally for classifier training [37]. For example, Ji et al. [26] proposed a Min-Max ensemble feature selection method to select features efficiently for large-scale, high-dimensional, and imbalanced problems. Although these feature selection methods can exploit useful information efficiently, inherent feature correlations are neglected.

The composition of individual classifiers in ensemble models has been explored in previous studies; however, the majority of them only combine some base classifiers statically, including gradient boosting tree (GBDT) [20], random forest (RF) [8], extreme gradient boosting (XGBoost) [13], light gradient boosting machine (LightGBM) [27], support vector machine (SVM) [12], and logistic regression (LR) [22]. Recently, the dynamic ensemble selection (DES) method [28], which is an extension of the traditional dynamic classifier selection method [25], has demonstrated superior performance in the field of machine learning. It has been shown that an ensemble of classifiers formed through dynamic selection is more robust than a single classifier.

Therefore, the motivation of this study is to propose an effective ensemble model for credit scoring by balancing the dataset and transforming the features. The highlights of this study are mainly listed as follows:

A novel multi-stage ensemble model based on a synthetic sampling method and a feature transformation method is proposed to achieve superior classification performance and computational efficiency.

The model-based synthetic sampling method [31] is extended to alleviate class overlap by selecting minority samples away from the decision boundary and to make the model adaptable to imbalanced datasets.

A new bagging-based feature transformation method combined with bootstrap aggregation and feature transformation is proposed to transform existing features into new features that contain more inherent correlations.

A DES-based two-layer ensemble method, which adopts the DES method in the first layer and the stacking method in the second layer, is proposed to compose the base classifiers into an effective ensemble model.

To verify the performance of the proposed model, four standardized datasets are used, and the experimental results are evaluated using six evaluation metrics.

The remainder of this paper is organized as follows. Section 2 presents related work on imbalanced learning approaches, feature processing approaches, and ensemble learning approaches. Section 3 describes the details of the proposed multi-stage ensemble model. Section 4 provides a description of the datasets, evaluation metrics, and parameter settings. Section 5 presents an analysis of the experimental results. In Section 6, the conclusions and future work are presented.

2 Related work

This section discusses previous research on credit scoring models and elaborates on three aspects: imbalanced learning approaches, feature processing approaches, and ensemble learning approaches.

2.1 Imbalanced learning approaches

The problem of imbalanced learning is difficult to address because datasets have class overlap, and the sample size of one class can be considerably larger than that of other classes. For example, the sample size of “no default” in credit scoring datasets is typically significantly larger than that of “default”. The most common method to address the problem of imbalanced datasets is data sampling, which includes under-sampling methods and over-sampling methods.

Under-sampling methods mainly rebalance the datasets by discarding selected majority samples [29]. The simplest under-sampling method is the random under-sampling method, which randomly selects samples from the majority samples to generate a balanced dataset. However, this method may discard informative samples and hence has been extended by researchers. For example, Sun et al. [40] proposed a cluster-based under-sampling method that converts an imbalanced dataset into multiple balanced subsets and then builds a number of classifiers on these multiple subsets. Maglietta et al. [33] proposed a parallel selective sampling method that selects the majority samples based on the distance between samples. In our previous work, He et al. [23] proposed a supervised under-sampling approach called the extended balance cascade approach, which constructs adjustable datasets adaptively based on the data imbalance rate. However, under-sampling methods can cause information loss, which has not been well addressed.

Over-sampling methods mainly rebalance the datasets by adding new minority samples. The simplest over-sampling method is the random over-sampling method that replicates the original minority samples. However, this method can result in overfitting and hence has been extended by researchers. For example, Babu and Ananthanarayanan [4] proposed an enhanced minority over-sampling technique to adjust the distribution of dataset by over-sampling the nearest neighbors of minority samples. Liu and Hsieh [31] proposed a model-based synthetic sampling method that trains regression models to predict the features of new samples. However, the existing over-sampling methods add the generated samples to the dataset for training the classification model directly, without considering the noise of the generated samples and the class overlap.

To overcome the drawbacks of under-sampling and over-sampling methods, the original model-based synthetic sampling method is extended in this study by generating diverse minority samples away from the decision boundary and addressing the noise of the generated samples to alleviate the class overlap.

2.2 Feature processing approaches

Feature processing is defined as the processing of feature subsets based on certain criteria [32], which mainly includes feature selection and feature transformation methods.

Feature selection is the process of finding an appropriate subset of features to describe the target object from the original feature sets [41] and has been widely used in machine learning. For example, Bryll et al. [11] presented an attribute bagging method that could improve the performance and robustness of classifier ensembles by randomly selecting subsets of features. Das et al. [16] proposed a bi-objective genetic algorithm-based feature selection method that selects informative features from a dataset using boundary region analysis and multivariate mutual information as the objective functions. However, feature selection methods tend to neglect the inherent correlations among features, resulting in information loss.

Feature transformation methods transform features in a specific manner such that certain information hidden in the features can be extracted. For example, Abedin et al. [1] found that logarithmic and square-root feature transformations could improve the performance of tax default prediction. Parisi and Ravichandran [35] proposed a two-stage genetic algorithm-based feature transformation method, which improved the classification performance and computational efficiency by applying dimension reduction and augmenting the relevant features. In our previous work, Zhang et al. [46] proposed an enhanced multi-population niche genetic algorithm to select competent features and classifiers by enhancing the selection, crossover, and mutation steps, and adding niche and migration steps in the traditional genetic algorithm. Zhang et al. [47] proposed a GBDT-based feature transformation method combined with one-hot encoding and dimension reduction; the experimental results proved the outstanding performance of the above methods. However, the existing feature transformation methods can result in dimension explosion and increase the time cost.

Based on the superior performance of the attribute bagging method and tree-based feature transformation method, a new bagging-based feature transformation method is proposed in this study that combines the advantages of the attribute bagging and feature transformation methods to reduce the time cost and extract the inherent correlations of features.

2.3 Ensemble learning approaches

Ensemble learning approaches aim to combine the learning results of multiple base classifiers to improve the performance and generalization of the classification model. Traditional ensemble learning approaches include bootstrap aggregation (Bagging) [7], boosting [19], and stacking [42]. Li and Chen [30] compared the three ensemble approaches mentioned above, demonstrating that ensemble learning yields higher performance than individual learners. In our previous work, Yang et al. [45] proposed a multi-stage ensemble model that combined different classifier selection and composition methods to achieve higher classification accuracy.

With the continuous progress and deepening of research, researchers are no longer satisfied with the performance of static ensemble learning. A more flexible ensemble method called the DES method has been proposed to integrate the base classifiers dynamically. It can automatically integrate the most competent base classifiers by estimating the real-time classification performance of each classifier from a classifier pool [15]. Zyblewski et al. [48] proposed a framework to integrate data preprocessing and DES methods for imbalanced data stream classification, demonstrating that the DES methods combined with data preprocessing outperform other ensemble methods.

Considering the advantages of the DES and stacking methods, a DES-based two-layer ensemble method that adopts the DES method in the first layer and stacking method in the second layer is proposed herein to further improve the performance of the final ensemble model.

3 Model

The multi-stage ensemble model proposed in this study aims to predict imbalanced credit datasets. The original credit dataset is divided into training and testing data; the training data are further divided into training and validation sets. The framework of the proposed model is presented in Fig. 1. The model is divided into three stages. In Stage 1, the original model-based synthetic sampling method is extended to generate synthetic samples that are used to balance the training set. In Stage 2, to extract the inherent correlations among the features, a new bagging-based feature transformation method is proposed to transform the features and reduce the feature dimensions. In Stage 3, a DES-based two-layer ensemble method is proposed that integrates the advantages of the DES and stacking methods. The three stages are described in detail in the following subsections.

Fig. 1

Framework of proposed multi-stage ensemble model.

3.1 Extended model-based synthetic sampling method

The original model-based synthetic sampling method adopts regression models to generate synthetic samples as new minority samples to balance the dataset. The generated synthetic samples need to retain the characteristics of the original dataset [31]. However, all the generated synthetic samples are added to the training set without considering the noise of the synthetic samples, making it difficult to alleviate the class overlap effectively. Therefore, the original model-based synthetic sampling method is extended herein by selecting appropriate generated samples and removing the noise of the synthetic samples. The selected synthetic samples form a normal synthetic sample subset and are added to the training set to alleviate class overlap.

The extended model-based synthetic sampling method balances the dataset through the following processes (Fig. 2). First, the training set is divided into majority and minority samples. Then, a random sampling method with replacement is performed on the minority samples to generate temporary samples. For example, F1, F2, ... , Fn represent the features of the samples. The first feature F1 of each temporary sample is assigned a value that is sampled with replacement from the F1 values of the minority samples. The second feature F2 of each temporary sample is assigned a value that is sampled with replacement from the F2 values of the minority samples. Then, the same process is applied to assign values to the other features of each temporary sample.

Fig. 2

Schematic diagram of extended model-based synthetic sampling method.

Subsequently, the regression models are trained to generate synthetic samples from temporary samples as new minority samples. For example, Regression model 1 is trained using the original training set, with the first feature F1 regarded as the label and the remaining features (i.e., F2–Fn) regarded as the input feature vectors. Similarly, Regression model 2 is trained using the original training set, with the second feature F2 regarded as the label and the remaining features (i.e., F1, F3–Fn) as the input feature vectors. The same scheme is applied to train the remaining Regression models 3–n. Then, the temporary samples and regression models are employed to generate the synthetic samples. For example, the F1 value of each synthetic sample is obtained from the regression result of Regression model 1, which uses the remaining features of the corresponding temporary sample except F1 (i.e., F2–Fn) as the input feature vector. The F2 value of each synthetic sample is obtained from the regression result of Regression model 2, which uses the remaining features of the corresponding temporary sample except F2 (i.e., F1, F3–Fn) as the input feature vector. The same process is followed to obtain the values of the remaining features (i.e., F3–Fn) of each synthetic sample.
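The two steps above (per-feature resampling of the minority samples, then one regression model per feature) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `LinearRegression` is an assumed choice (the text only says that the regression model comes from "sklearn"), the helper name `generate_synthetic` is hypothetical, and the data are random toy values.

```python
import numpy as np
from sklearn.linear_model import LinearRegression


def generate_synthetic(X_train, X_minority, n_samples, rng):
    """Hypothetical helper: build temporary samples, then synthetic samples."""
    n_features = X_train.shape[1]
    # Temporary samples: each feature Fj is drawn independently, with
    # replacement, from the Fj values of the minority samples.
    temp = np.column_stack([
        rng.choice(X_minority[:, j], size=n_samples, replace=True)
        for j in range(n_features)
    ])
    synthetic = np.empty_like(temp, dtype=float)
    for j in range(n_features):
        others = [c for c in range(n_features) if c != j]
        # Regression model j: feature Fj is the label, all other features
        # are the inputs; it is trained on the original training set.
        model = LinearRegression().fit(X_train[:, others], X_train[:, j])
        # Fj of each synthetic sample is predicted from the remaining
        # features of the corresponding temporary sample.
        synthetic[:, j] = model.predict(temp[:, others])
    return temp, synthetic


rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 4))              # toy training set
y_train = np.array([0] * 170 + [1] * 30)         # 1 = minority class (toy labels)
X_minority = X_train[y_train == 1]
temp, synthetic = generate_synthetic(X_train, X_minority, n_samples=50, rng=rng)
```

Each column of `temp` contains only values observed for that feature among the minority samples, whereas `synthetic` contains regression outputs and therefore new values.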

The extended model-based synthetic sampling method generates synthetic samples that are similar to the minority samples. Therefore, once the synthetic samples are generated, three widely recognized base classifiers (i.e., XGBoost, LightGBM, and GBDT) are selected and trained using the original training set, and then integrated to predict the labels of the synthetic samples. Unlike the base classifiers used to obtain the final classification results in Stage 3, these trained classifiers are used only to select the competent minority samples from all generated synthetic samples. Synthetic samples classified as minorities are regarded as normal samples, whereas synthetic samples classified as majorities are regarded as noise samples. The noise samples are removed from the synthetic samples; the normal synthetic samples are aggregated as a normal synthetic sample subset and subsequently added to the training set. If the current imbalance ratio (IR) of the training set is larger than the IR setting, i.e., r, normal synthetic sample subsets are generated again until a balanced training set is obtained.
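The filtering and rebalancing loop described above might look like the sketch below. Several parts are assumptions made only for illustration: the three classifiers are scikit-learn stand-ins for XGBoost, LightGBM, and GBDT; they are "integrated" by majority vote, which the text does not specify; candidates come from a simple jittering stand-in rather than the regression-based generator; and the last batch is truncated so that the IR does not drop below r.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression


def filter_normal_samples(classifiers, candidates, minority_label=1):
    """Keep candidates that a majority of classifiers label as minority."""
    votes = np.array([clf.predict(candidates) == minority_label
                      for clf in classifiers])
    keep = votes.sum(axis=0) > len(classifiers) / 2
    return candidates[keep]            # the rejected rows are the noise samples


def imbalance_ratio(y, minority_label=1):
    n_min = int(np.sum(y == minority_label))
    return (len(y) - n_min) / n_min


rng = np.random.default_rng(0)
X_maj = rng.normal(0.0, 1.0, size=(180, 2))      # toy majority samples
X_min = rng.normal(4.0, 1.0, size=(20, 2))       # toy minority samples
X_train = np.vstack([X_maj, X_min])
y_train = np.array([0] * 180 + [1] * 20)

# Filtering classifiers are trained on the original training set.
classifiers = [
    GradientBoostingClassifier(random_state=0).fit(X_train, y_train),
    RandomForestClassifier(random_state=0).fit(X_train, y_train),
    LogisticRegression().fit(X_train, y_train),
]

r = 1.0                                           # target IR
X_bal, y_bal = X_train.copy(), y_train.copy()
for _ in range(20):                               # guard against endless looping
    if imbalance_ratio(y_bal) <= r:
        break
    # Stand-in candidate generator (assumption): jittered minority samples.
    base = X_min[rng.integers(0, len(X_min), size=100)]
    candidates = base + rng.normal(0.0, 0.1, size=base.shape)
    normal = filter_normal_samples(classifiers, candidates)
    n_maj = int(np.sum(y_bal == 0))
    n_min = int(np.sum(y_bal == 1))
    needed = int(np.ceil(n_maj / r)) - n_min      # samples still missing
    normal = normal[:needed]
    X_bal = np.vstack([X_bal, normal])
    y_bal = np.concatenate([y_bal, np.ones(len(normal), dtype=int)])
```

The round limit is a design choice of this sketch: if the classifiers rejected every candidate, the loop would otherwise never terminate.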

3.2 Bagging-based feature transformation method

To extract the inherent correlations of the features and prevent dimension explosion, a new bagging-based feature transformation method is proposed in this study; it combines the advantages of the attribute bagging method and feature transformation method.

As depicted in Fig. 3, the attribute bagging method is first adopted to select approximately half of the features of the balanced training set through random feature bootstrap sampling, resulting in multiple feature subsets (i.e., Feature subset 1, Feature subset 2, etc.). Based on He et al. [24], a tree algorithm (e.g., XGBoost) is used to transform the feature subsets and generate new features automatically such that the inherent correlations among the features can be extracted. XGBoost models with different parameters (i.e., XGBoost 1, XGBoost 2, etc.) are trained using the corresponding feature subsets. Each sample with the selected features enters at the root node of each tree and is routed down to a leaf node. The leaf nodes of the trees contain the transformed feature information of the feature subsets and are regarded as new features.

Fig. 3

Schematic diagram of bagging-based feature transformation method.

To prevent dimension explosion, the chi-square statistic [36] is employed to reduce the feature dimensions through feature selection, resulting in the selected feature subsets. Multiple selected feature subsets are then aggregated. Considering that the dimension of aggregated features may remain large, the chi-square statistic is adopted again to further reduce the final feature dimensions, resulting in the feature transformed training set used for the next stage.
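A compact sketch of this stage is given below, under explicit assumptions: scikit-learn's `GradientBoostingClassifier` stands in for XGBoost, the leaf indices are one-hot encoded before the chi-square test, feature subsets are drawn without replacement, and the subset count and dimensions (`n_subsets`, `k_subset`, `k_final`) are toy values rather than the settings used in the experiments. The function name is hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import OneHotEncoder


def bagging_feature_transform(X, y, n_subsets=3, k_subset=10, k_final=20, seed=0):
    """Hypothetical helper: attribute bagging, leaf encoding, chi-square selection."""
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    blocks = []
    for i in range(n_subsets):
        # Attribute bagging: pick roughly half of the features at random.
        cols = rng.choice(n_features, size=max(1, n_features // 2), replace=False)
        trees = GradientBoostingClassifier(n_estimators=10, max_depth=3,
                                           random_state=i)
        trees.fit(X[:, cols], y)
        # Leaf index reached by every sample in every tree.
        leaves = trees.apply(X[:, cols]).reshape(X.shape[0], -1)
        # One binary column per (tree, leaf) pair: the transformed features.
        encoded = OneHotEncoder().fit_transform(leaves).toarray()
        # First chi-square pass: reduce each transformed subset.
        k = min(k_subset, encoded.shape[1])
        blocks.append(SelectKBest(chi2, k=k).fit_transform(encoded, y))
    # Aggregate the selected subsets, then apply the chi-square test again.
    aggregated = np.hstack(blocks)
    k = min(k_final, aggregated.shape[1])
    return SelectKBest(chi2, k=k).fit_transform(aggregated, y)


X, y = make_classification(n_samples=300, n_features=12, random_state=0)
X_new = bagging_feature_transform(X, y)
```

The one-hot columns are non-negative, which is what `chi2` requires; on real data the same fitted trees, encoders, and selectors would also have to be applied to the validation and testing data.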

3.3 DES-based two-layer ensemble method

To combine the respective advantages of different individual classifiers, an effective ensemble method is required. Therefore, a DES-based two-layer ensemble method that combines both the DES and stacking methods is proposed to integrate the base classifiers trained using the feature transformed training sets.

The DES method assumes that different base classifiers perform best in different local regions of the sample space. When test samples are given, the DES method can automatically select the best-performing base classifiers to classify each test sample according to its location in the sample space [15]. In this study, the classifier selected through the DES method is denoted the DES classifier.

Stacking is also a powerful ensemble method that combines the classification results of several homogeneous or heterogeneous classifiers to achieve superior classification performance. It trains a meta-classifier using the predictive results of individual classifiers as input features. LR has been proven to be a superior choice as a meta-classifier to improve stacking performance [43]. Therefore, LR is used as the meta-classifier in this study.

As shown in Fig. 4, n base classifiers (i.e., Clf 1, Clf 2, etc.) from the base classifier pool are trained using a feature transformed training set and further compared based on the area under the receiver operating characteristic (ROC) curve (AUC) [18]. The m base classifiers (i.e., Sclf 1, Sclf 2, etc.) with the highest AUC values on the validation set are selected as the competent classifiers. In the first layer, which performs the DES operation, k DES classifiers (i.e., DESclf 1, DESclf 2, etc.) are generated through the permutation and combination of the m selected competent base classifiers. To improve the diversity of the DES classifiers, the DES classifiers whose correlations are higher than the average are filtered out, and the remaining classifiers are used for the next layer. In the second layer, which performs the stacking integration, the predictive results of the selected DES classifiers are integrated into the final result by the meta-classifier. Finally, the ensemble model with the best predictive accuracy is selected as the proposed ensemble model.
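The two-layer structure can be illustrated with the simplified sketch below. It is not the "deslib" implementation used in the experiments: `LocalAccuracySelector` is a hypothetical, hand-written dynamic selector based on local accuracy, the "combinations" are taken to be all subsets of at least two competent classifiers, the correlation-based filtering step is omitted, and the data split used to fit the meta-classifier is an assumption.

```python
from itertools import combinations

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier, NearestNeighbors
from sklearn.tree import DecisionTreeClassifier


class LocalAccuracySelector:
    """Per query, average the pool members with the highest accuracy
    on the k nearest validation samples (simplified dynamic selection)."""

    def __init__(self, pool, X_val, y_val, k=7):
        self.pool = pool
        self.nn = NearestNeighbors(n_neighbors=k).fit(X_val)
        # correct[i, j]: does pool member j classify validation sample i correctly?
        self.correct = np.column_stack(
            [clf.predict(X_val) == y_val for clf in pool])

    def predict_proba_pos(self, X):
        _, idx = self.nn.kneighbors(X)                     # (n, k)
        local_acc = self.correct[idx].mean(axis=1)         # (n, pool size)
        best = local_acc == local_acc.max(axis=1, keepdims=True)
        probs = np.column_stack(
            [clf.predict_proba(X)[:, 1] for clf in self.pool])
        return (probs * best).sum(axis=1) / best.sum(axis=1)


X, y = make_classification(n_samples=800, n_features=10, random_state=0)
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
X_train, X_hold, y_train, y_hold = train_test_split(
    X_rest, y_rest, test_size=0.4, stratify=y_rest, random_state=0)
X_val, X_meta, y_val, y_meta = train_test_split(
    X_hold, y_hold, test_size=0.5, stratify=y_hold, random_state=0)

pool = [LogisticRegression(max_iter=1000),
        DecisionTreeClassifier(max_depth=4, random_state=0),
        GaussianNB(),
        KNeighborsClassifier()]
for clf in pool:
    clf.fit(X_train, y_train)

# Keep the m base classifiers with the highest validation AUC.
m = 3
aucs = [roc_auc_score(y_val, clf.predict_proba(X_val)[:, 1]) for clf in pool]
competent = [pool[i] for i in np.argsort(aucs)[::-1][:m]]

# First layer: one dynamic selector per combination of competent classifiers.
selectors = [LocalAccuracySelector(list(combo), X_val, y_val)
             for size in range(2, m + 1)
             for combo in combinations(competent, size)]


def first_layer(X_in):
    return np.column_stack([s.predict_proba_pos(X_in) for s in selectors])


# Second layer: LR meta-classifier stacked on the first-layer outputs.
meta = LogisticRegression().fit(first_layer(X_meta), y_meta)
final_proba = meta.predict_proba(first_layer(X_test))[:, 1]
test_auc = roc_auc_score(y_test, final_proba)
```

Fitting the meta-classifier on a split that the base classifiers and selectors have not seen keeps the second layer from simply memorizing first-layer outputs; the text does not state which split the authors used for this purpose.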

Fig. 4

Schematic diagram of DES-based two-layer ensemble method.

4 Experimental settings

In this section, the experimental settings used to evaluate the performance of the proposed multi-stage ensemble model are introduced, including the datasets, data processing, evaluation metrics, and parameter settings.

4.1 Datasets and data processing

In this experiment, three datasets (i.e., Australian, German, and Japanese) from the UCI machine learning library [3] and one dataset (i.e., Hmeq) from Kaggle [5] are used to verify the performance of the proposed model. The details of the datasets are presented in Table 1.

Table 1
Detail of datasets

Datasets	Sample size	Positive samples	Negative samples	Imbalance ratio	Total features
Australian	690	307	383	1.25	15
German	1000	700	300	2.33	21
Japanese	690	383	307	1.25	16
Hmeq	5960	1189	4771	4.01	13

The Australian dataset contains 690 samples, comprising 307 positive samples and 383 negative samples. Its imbalance ratio is 1.25, and its total feature dimension is 15. The German dataset contains 1000 samples, comprising 700 positive samples and 300 negative samples. Its imbalance ratio is 2.33, and its total feature dimension is 21. The Japanese dataset has the same sample size and imbalance ratio as the Australian dataset, with 383 positive samples and 307 negative samples, and its feature dimension is 16. The Hmeq dataset contains 5960 samples, comprising 1189 positive samples and 4771 negative samples. Its imbalance ratio is 4.01, and its total feature dimension is 13.
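The imbalance ratios in Table 1 are consistent with dividing the size of the larger class by the size of the smaller class. That definition is inferred from the table rather than stated in the text; the short check below reproduces the reported values from the class counts.

```python
# (positive, negative) sample counts taken from Table 1
datasets = {
    "Australian": (307, 383),
    "German": (700, 300),
    "Japanese": (383, 307),
    "Hmeq": (1189, 4771),
}


def imbalance_ratio(positive, negative):
    # Assumed definition: majority class size divided by minority class size
    return max(positive, negative) / min(positive, negative)


ratios = {name: round(imbalance_ratio(p, n), 2)
          for name, (p, n) in datasets.items()}
```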

4.2 Evaluation metrics

Six evaluation metrics are used in this study to evaluate the performance of the proposed model. These include accuracy (ACC) [39], balanced accuracy (BA) [10], AUC, F-score, Log loss [6], and Brier score [9]. The evaluation metrics are calculated using true positive (TP), true negative (TN), false positive (FP), and false negative (FN), which originate from the confusion matrix (Table 2).

Table 2
Confusion matrix

		Predict	
		Positive	Negative
Real	Positive	True positive	False negative
	Negative	False positive	True negative

ACC is a widely used evaluation metric that indicates the proportion of correctly classified samples to all samples. ACC can be calculated using Equation (1).

$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$ (1)

BA is commonly used to evaluate classification performance on imbalanced datasets. The formula for BA is given in Equation (2), where the true positive rate (TPR) and true negative rate (TNR) can be calculated using Equation (3) and Equation (4), respectively.

$\mathrm{Balanced\ Accuracy} = \frac{TPR + TNR}{2}$ (2)

$TPR = \frac{TP}{TP + FN}$ (3)

$TNR = \frac{TN}{FP + TN}$ (4)

AUC is also a popular evaluation metric; it is defined as the area under the ROC curve. The higher the AUC, the better the performance of the classifier model.

F-score is the harmonic mean of precision and recall. The precision and recall are defined in Equation (5) and Equation (6), respectively. The F-score is defined in Equation (7).

$\mathrm{Precision} = \frac{TP}{TP + FP}$ (5)

$\mathrm{Recall} = \frac{TP}{TP + FN}$ (6)

$\mathrm{F\text{-}score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$ (7)
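Equations (1) to (7) can be evaluated directly from the four confusion-matrix counts. The sketch below does so for an invented example (the counts are illustrative and do not come from the experiments); the function name is hypothetical.

```python
def confusion_metrics(tp, tn, fp, fn):
    """Compute ACC, BA, precision, recall, and F-score from the four counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)        # Equation (1)
    tpr = tp / (tp + fn)                         # Equation (3)
    tnr = tn / (fp + tn)                         # Equation (4)
    ba = (tpr + tnr) / 2                         # Equation (2)
    precision = tp / (tp + fp)                   # Equation (5)
    recall = tp / (tp + fn)                      # Equation (6)
    f_score = 2 * precision * recall / (precision + recall)  # Equation (7)
    return {"ACC": acc, "BA": ba, "Precision": precision,
            "Recall": recall, "F-score": f_score}


# Illustrative counts: 70 real positives and 30 real negatives
metrics = confusion_metrics(tp=60, tn=20, fp=10, fn=10)
```

With these counts ACC is 0.80 whereas BA is about 0.76, because BA weights the smaller negative class equally and is therefore more sensitive to errors on it.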

Log loss quantifies the accuracy of the probabilistic predictions of a classifier; a smaller Log loss means a higher accuracy of the classifier. The definition of the Log loss is given in Equation (8), where n is the number of samples, y_i represents the true class of the i-th sample, and p_i represents the probability that the model predicts the i-th sample as positive.

$\mathrm{Log\ loss} = - \frac{1}{n} \sum_{i = 1}^{n} \left( y_{i} \times \log (p_{i}) + (1 - y_{i}) \times \log (1 - p_{i}) \right)$ (8)

Brier score is a score function that measures the accuracy of probabilistic predictions. This can be considered a loss function. Similar to Log loss, a lower Brier score indicates better classification performance.
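Both probabilistic metrics can be computed in a few lines. The sketch below follows Equation (8) for the Log loss and takes the Brier score in its usual binary form, the mean squared difference between the predicted probability and the true class, which the text does not spell out. The clipping constant `eps` is an implementation assumption that avoids taking the logarithm of zero, and the labels and probabilities are invented.

```python
import math


def log_loss(y_true, p_pred, eps=1e-15):
    """Equation (8); probabilities are clipped to [eps, 1 - eps] (assumption)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)


def brier_score(y_true, p_pred):
    """Mean squared difference between predicted probability and true class."""
    return sum((p - y) ** 2 for y, p in zip(y_true, p_pred)) / len(y_true)


y_true = [1, 0, 1, 0]               # illustrative labels
p_pred = [0.9, 0.1, 0.8, 0.3]       # illustrative predicted probabilities
ll = log_loss(y_true, p_pred)
bs = brier_score(y_true, p_pred)
```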

4.3 Parameter settings

The original data were divided into two parts: 20% of the original data were used as testing data and 80% were used as training data. The training data were further divided into training and validation sets, with proportions of 80% and 20%, respectively. In the data preprocessing stage, the standardization and normalization methods were imported from the Python module “sklearn.” In the extended model-based synthetic sampling method, the classifiers (i.e., XGBoost, LightGBM, and GBDT) used to classify the synthetic samples were imported from the Python modules “xgboost,” “lightgbm,” and “sklearn,” respectively. The regression model was imported from the Python module “sklearn.” The final IR, i.e., r, was set to one. In the bagging-based feature transformation method, the default XGBoost, attribute bagging algorithm, and chi-square statistic were adopted, and the final feature dimension was set to 200. In the DES-based two-layer ensemble method, seven base classifiers (i.e., XGBoost, GBDT, AdaBoost, RF, LR, SVM, and LightGBM) were selected to generate the base classifier pool. XGBoost was imported from the Python module “xgboost”; GBDT, AdaBoost, RF, LR, and SVM were imported from the Python module “sklearn”; and LightGBM was imported from the Python module “lightgbm.” The DES methods were imported from the Python module “deslib.” All classifier parameters were set to their default values for a fair comparison.
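The nested 80/20 split can be reproduced with `train_test_split` from "sklearn", as sketched below on toy data with the class counts of the German dataset. Stratified sampling and the fixed `random_state` are assumptions added for reproducibility; the text does not state whether the splits were stratified.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                   # toy features
y = np.array([1] * 700 + [0] * 300)              # toy labels

# 80% training data, 20% testing data
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Training data split again: 80% training set, 20% validation set
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.2, stratify=y_trainval, random_state=0)
```

The resulting training, validation, and testing sets hold 64%, 16%, and 20% of the original samples, respectively.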

5 Experimental analysis

In this study, four datasets and six evaluation metrics were used to evaluate the performance of the proposed multi-stage ensemble model. To ensure experimental robustness, each model was trained and tested ten times, and the average value of each evaluation metric was reported as the result. All experiments were run on a computer with a 2.6 GHz Intel Core i7 processor, 16 GB of RAM, and the Microsoft Windows 10 operating system using Python 3.7. The experimental analysis is presented in the following subsections.

5.1 Baseline performance

In this study, to evaluate the performance of the proposed model, seven base classifiers (i.e., XGBoost, GBDT, AdaBoost, RF, LR, SVM, and LightGBM) were adopted as the baselines for comparison. The performance of the baselines on four real-world credit datasets (i.e., Australian, German, Japanese, and Hmeq) is presented in Table 3.

Table 3
Baseline results

Datasets	Classifiers	ACC	BA	AUC	F-score	Log loss	Brier score
Australian	XGBoost	0.84638	0.84803	0.92853	0.83226	0.34813	0.10424
	GBDT	0.84493	0.84623	0.92751	0.83095	0.35225	0.10593
	AdaBoost	0.86304	0.86414	0.93386	0.85054	0.53017	0.17091
	RF	0.85870	0.85895	0.92978	0.84479	0.38880	0.10348
	LR	0.85507	0.85746	0.92212	0.84456	0.35905	0.10730
	SVM	0.84710	0.85206	0.90854	0.83924	0.38321	0.11743
	LightGBM	0.85217	0.85209	0.92424	0.83518	0.42236	0.11305
German	XGBoost	0.74700	0.66597	0.77045	0.82737	0.52505	0.17349
	GBDT	0.74150	0.66260	0.76687	0.82284	0.53212	0.17602
	AdaBoost	0.73850	0.63211	0.76206	0.82800	0.59920	0.20423
	RF	0.74800	0.65630	0.77321	0.83060	0.51322	0.17108
	LR	0.74800	0.66819	0.77048	0.82780	0.52398	0.17271
	SVM	0.74800	0.65507	0.77313	0.83099	0.51400	0.17077
	LightGBM	0.74100	0.66945	0.75870	0.82039	0.63334	0.19043
Japanese	XGBoost	0.85507	0.85494	0.92644	0.86672	0.35575	0.10488
	GBDT	0.85000	0.85018	0.92178	0.86122	0.36479	0.10619
	AdaBoost	0.84928	0.85060	0.93100	0.86109	0.53418	0.17279
	RF	0.86522	0.86390	0.92603	0.87808	0.38254	0.10785
	LR	0.83986	0.83969	0.91091	0.85290	0.40173	0.11645
	SVM	0.84275	0.84423	0.90947	0.85465	0.38130	0.11724
	LightGBM	0.86304	0.86239	0.92541	0.87491	0.42447	0.10898
Hmeq	XGBoost	0.89782	0.80308	0.93284	0.72358	0.25519	0.07519
	GBDT	0.90201	0.81337	0.93500	0.73823	0.25064	0.07342
	AdaBoost	0.88800	0.77808	0.91775	0.68730	0.49320	0.15333
	RF	0.90831	0.83109	0.96031	0.76095	0.21852	0.06488
	LR	0.86904	0.73576	0.85109	0.61801	0.35202	0.10080
	SVM	0.89639	0.79371	0.92390	0.71351	0.26640	0.07676
	LightGBM	0.91460	0.83950	0.95788	0.77660	0.20909	0.06263

5.2 Performance of extended model-based synthetic sampling method

To evaluate the extended model-based synthetic sampling method, Table 4 reports the performance of the base classifiers after the method is applied. Results superior to the corresponding baseline results in Table 3 are highlighted in bold. The majority of the evaluation metrics improve once the extended model-based synthetic sampling method is integrated with the base classifiers, which indicates that the method increases the adaptability of the base classifiers to imbalanced datasets and improves their classification ability.

Table 4
Performance evaluation of extended model-based synthetic sampling method

Datasets	Classifiers	ACC	BA	AUC	F-score	Log loss	Brier score
Australian	XGBoost	0.86377	0.86397	0.93034	0.85299	0.34742	0.10009
	GBDT	0.86739	0.86775	0.93047	0.85760	0.34753	0.09962
	AdaBoost	0.86087	0.86136	0.92974	0.85084	0.52605	0.16894
	RF	0.86087	0.86126	0.92325	0.84865	0.44394	0.10590
	LR	0.86377	0.86545	0.92401	0.85618	0.36420	0.10675
	SVM	0.84420	0.84696	0.90913	0.83572	0.38615	0.11748
	LightGBM	0.86087	0.86082	0.92608	0.84895	0.42911	0.10721
German	XGBoost	0.75600	0.67456	0.77788	0.83454	0.50566	0.16674
	GBDT	0.75350	0.68228	0.77156	0.83071	0.51685	0.16971
	AdaBoost	0.75300	0.65435	0.76989	0.83603	0.61199	0.21017
	RF	0.75950	0.65839	0.77715	0.84098	0.51639	0.16497
	LR	0.73150	0.71257	0.75773	0.79933	0.59549	0.19480
	SVM	0.73850	0.65803	0.75879	0.82155	0.54518	0.18013
	LightGBM	0.75600	0.67247	0.76979	0.83542	0.60604	0.18141
Japanese	XGBoost	0.85072	0.85069	0.93095	0.86555	0.33819	0.10207
	GBDT	0.85362	0.85396	0.92584	0.86679	0.35201	0.10489
	AdaBoost	0.85217	0.85233	0.93232	0.86697	0.53224	0.17182
	RF	0.86884	0.86767	0.92988	0.88248	0.36748	0.10304
	LR	0.84565	0.84908	0.91454	0.85741	0.40843	0.11787
	SVM	0.83478	0.83841	0.90706	0.84815	0.39012	0.12100
	LightGBM	0.86087	0.85930	0.92974	0.87632	0.40938	0.10874
Hmeq	XGBoost	0.89883	0.80835	0.93000	0.72941	0.25918	0.07625
	GBDT	0.90185	0.81912	0.93088	0.74219	0.25560	0.07479
	AdaBoost	0.88884	0.78524	0.91198	0.69566	0.54071	0.17523
	RF	0.90956	0.83329	0.96035	0.76419	0.22376	0.06510
	LR	0.84513	0.78824	0.83320	0.65126	0.42248	0.12051
	SVM	0.89656	0.80804	0.92180	0.72632	0.27942	0.07985
	LightGBM	0.91477	0.84034	0.95738	0.77763	0.20870	0.06220

After the extended model-based synthetic sampling method is applied, most base classifiers perform better on most evaluation metrics. To quantify the improvement, the proportion of bold (superior) results among all results is calculated: 64.9% of the evaluation results improve, demonstrating that the extended model-based synthetic sampling method deals with imbalanced data effectively.

5.3 Performance of bagging-based feature transformation method

To evaluate the bagging-based feature transformation method, Table 5 reports the performance of the base classifiers with both the extended model-based synthetic sampling method and the bagging-based feature transformation method applied. Results superior to those of the base classifiers with only the extended model-based synthetic sampling method (Table 4) are highlighted in bold. The majority of the evaluation metrics improve once the bagging-based feature transformation method is integrated with the base classifiers, which indicates that the method can explore inherent correlations between features and improve the classification ability of the base classifiers.

Table 5
Performance evaluation of bagging-based feature transformation method

Datasets	Classifiers	ACC	BA	AUC	F-score	Log loss	Brier score
Australian	XGBoost	0.87319	0.87472	0.93884	0.85985	0.32369	0.09336
	GBDT	0.86957	0.87108	0.93315	0.85580	0.34272	0.09678
	AdaBoost	0.86812	0.86975	0.93772	0.85496	0.50917	0.16134
	RF	0.86812	0.87057	0.93507	0.85578	0.58837	0.09668
	LR	0.86812	0.87073	0.94062	0.85575	0.32990	0.09369
	SVM	0.86594	0.87242	0.93261	0.85871	0.33120	0.09966
	LightGBM	0.86884	0.86984	0.93553	0.85482	0.41189	0.10372
German	XGBoost	0.75900	0.68459	0.79180	0.83409	0.50433	0.16587
	GBDT	0.76500	0.70088	0.79073	0.83606	0.50984	0.16689
	AdaBoost	0.75650	0.67066	0.78701	0.83508	0.61329	0.21072
	RF	0.73900	0.61752	0.79047	0.83192	0.50987	0.16986
	LR	0.75550	0.67817	0.79335	0.83245	0.49946	0.16413
	SVM	0.75350	0.65291	0.79683	0.83674	0.50161	0.16470
	LightGBM	0.75500	0.68051	0.78498	0.83125	0.59691	0.17843
Japanese	XGBoost	0.87391	0.87454	0.93675	0.88305	0.32469	0.09410
	GBDT	0.87536	0.87620	0.93548	0.88459	0.32806	0.09473
	AdaBoost	0.86594	0.86652	0.93569	0.87549	0.51284	0.16297
	RF	0.87899	0.88142	0.93681	0.88668	0.53778	0.09346
	LR	0.86594	0.86731	0.93187	0.87479	0.34851	0.10205
	SVM	0.85725	0.86418	0.92111	0.85978	0.34678	0.10495
	LightGBM	0.87391	0.87421	0.93363	0.88381	0.41046	0.09920
Hmeq	XGBoost	0.90570	0.81522	0.93053	0.73503	0.24498	0.07090
	GBDT	0.90772	0.81989	0.93243	0.74187	0.24051	0.06927
	AdaBoost	0.89891	0.79788	0.92024	0.70937	0.53626	0.17303
	RF	0.90814	0.82476	0.96039	0.74468	0.21533	0.06485
	LR	0.89195	0.79558	0.90101	0.69592	0.28440	0.08257
	SVM	0.89983	0.80462	0.91036	0.71602	0.28432	0.07969
	LightGBM	0.92089	0.84641	0.95837	0.78212	0.19789	0.05871

The comparison between Table 4 and Table 5 indicates that most base classifiers perform better after the bagging-based feature transformation method is applied. To quantify the improvement, the proportion of bold (superior) results among all results is calculated: 85.1% of the evaluation results improve, demonstrating that the bagging-based feature transformation method effectively exploits the inherent correlations between features.

5.4 Performance of DES-based two-layer ensemble method

In the DES-based two-layer ensemble method, competent classifiers are selected from the seven base classifiers to generate the DES classifiers. LR and SVM are linear classification models; XGBoost, GBDT, RF, and LightGBM are tree-based classification models; and AdaBoost is an aggregation of tree classification models. To select competent base classifiers, each base classifier is tested on the validation set. A base classifier whose AUC is higher than the average AUC of all base classifiers is considered competent for the corresponding dataset. As indicated in Table 6, for example, XGBoost, AdaBoost, and LR are selected as the competent base classifiers for the Australian dataset.

Table 6
Selected competent base classifiers for each dataset

	XGBoost	GBDT	AdaBoost	RF	LR	SVM	LightGBM
Australian	√		√		√
German	√				√	√
Japanese	√	√	√	√			√
Hmeq	√	√		√			√
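
The selection rule described above (keep a base classifier when its AUC exceeds the mean AUC of all base classifiers) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the paper computes the AUC on a validation set, whereas the sketch reuses the AUC values reported in Table 5 as stand-in inputs, and the names `AUC` and `select_competent` are hypothetical. With these inputs, the rule happens to return the same selections as Table 6.

```python
from statistics import mean

# AUC per base classifier, copied from Table 5.
# Assumption: used here only as stand-ins for the validation-set AUCs.
AUC = {
    "Australian": {"XGBoost": 0.93884, "GBDT": 0.93315, "AdaBoost": 0.93772,
                   "RF": 0.93507, "LR": 0.94062, "SVM": 0.93261, "LightGBM": 0.93553},
    "German": {"XGBoost": 0.79180, "GBDT": 0.79073, "AdaBoost": 0.78701,
               "RF": 0.79047, "LR": 0.79335, "SVM": 0.79683, "LightGBM": 0.78498},
    "Japanese": {"XGBoost": 0.93675, "GBDT": 0.93548, "AdaBoost": 0.93569,
                 "RF": 0.93681, "LR": 0.93187, "SVM": 0.92111, "LightGBM": 0.93363},
    "Hmeq": {"XGBoost": 0.93053, "GBDT": 0.93243, "AdaBoost": 0.92024,
             "RF": 0.96039, "LR": 0.90101, "SVM": 0.91036, "LightGBM": 0.95837},
}


def select_competent(auc_by_classifier):
    """Return the classifiers whose AUC is strictly above the mean AUC."""
    threshold = mean(auc_by_classifier.values())
    return [name for name, auc in auc_by_classifier.items() if auc > threshold]


selected = {dataset: select_competent(scores) for dataset, scores in AUC.items()}
for dataset, names in selected.items():
    print(dataset, names)
```

Because the threshold is the mean over the pool, at least one classifier is always excluded unless all AUC values are identical, so the competent set adapts to each dataset without a hand-tuned cut-off.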

Then, the DES classifiers generated through the permutation and combination of selected competent base classifiers are filtered according to the correlations among them. Figure 5 displays the correlation among 15 DES classifiers in the form of a heatmap for different datasets, where the darker color indicates a higher correlation. The DES classifiers with higher correlations are subsequently removed to increase the diversity of the DES classifiers.

Fig. 5

Heatmap of DES classifiers correlations on different datasets.
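
The two steps above, combining competent base classifiers into candidate pools and dropping highly correlated DES classifiers, can be sketched as follows. This is an illustrative sketch under stated assumptions: this section does not specify the exact combination scheme, the correlation measure, or the removal threshold, so the sketch enumerates all non-empty subsets, uses the Pearson correlation between predicted scores, and applies a hypothetical threshold of 0.9 with a greedy filter. The names `nonempty_subsets`, `pearson`, `filter_correlated`, and the toy score vectors are all hypothetical.

```python
from itertools import combinations
from math import sqrt


def nonempty_subsets(names):
    """Enumerate every non-empty combination of the competent base classifiers."""
    return [combo for r in range(1, len(names) + 1) for combo in combinations(names, r)]


def pearson(x, y):
    """Pearson correlation between two equally long score vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y) if norm_x and norm_y else 0.0


def filter_correlated(outputs, threshold):
    """Greedily keep a classifier only if it is not highly correlated with any kept one."""
    kept = []
    for name, scores in outputs.items():
        if all(abs(pearson(scores, outputs[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Candidate pools built from the competent classifiers of the Australian dataset (Table 6).
pools = nonempty_subsets(["XGBoost", "AdaBoost", "LR"])

# Toy predicted scores of three hypothetical DES classifiers on four validation samples.
outputs = {
    "des_a": [0.90, 0.20, 0.80, 0.10],
    "des_b": [0.88, 0.22, 0.79, 0.12],  # nearly identical to des_a
    "des_c": [0.60, 0.70, 0.20, 0.40],
}
diverse = filter_correlated(outputs, threshold=0.9)
print(len(pools), diverse)
```

The greedy filter depends on the iteration order, so in practice the candidates would typically be ordered by validation performance first; that ordering is likewise an assumption of this sketch.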

To verify the superiority of the proposed ensemble model, the final ensemble results are compared with those of the best-performing base classifier for each dataset, with both the extended model-based synthetic sampling method and the bagging-based feature transformation method applied. In Table 7, the proposed ensemble model is denoted as Eclf and the best-performing base classifier as Clf. For each dataset, the ensemble results superior to those of the best-performing base classifier are highlighted in bold. The majority of the evaluation results (i.e., 58.3%) of the proposed ensemble model are superior to those of the best-performing base classifiers, which indicates that the DES-based two-layer ensemble method can select and compose the competent base classifiers dynamically to achieve superior classification performance. The gains are concentrated in ACC, AUC, and F-score, whereas the log loss and Brier score of the ensemble are in most cases worse than those of the best-performing base classifier.

Table 7

Performance evaluation of DES-based two-layer ensemble method

Datasets	Classifiers	ACC	BA	AUC	F-score	Log loss	Brier score
Australian	Clf 1	0.87319	0.87472	0.93884	0.85985	0.32369	0.09336
	Eclf 1	0.87754	0.88018	0.93996	0.86628	0.43280	0.10330
German	Clf 2	0.76500	0.70088	0.79073	0.83606	0.50984	0.16689
	Eclf 2	0.77200	0.69907	0.81119	0.84432	0.66886	0.17990
Japanese	Clf 3	0.87899	0.88142	0.93681	0.88668	0.53778	0.09346
	Eclf 3	0.87971	0.87959	0.93714	0.89255	0.42469	0.10069
Hmeq	Clf 4	0.92089	0.84641	0.95837	0.78212	0.19789	0.05871
	Eclf 4	0.91745	0.84797	0.96359	0.78461	0.36140	0.07173
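
The 58.3% figure can be reproduced directly from Table 7 with a direction-aware comparison, as in the following sketch. ACC, BA, AUC, and F-score are treated as higher-is-better, while log loss and Brier score are treated as lower-is-better; the names `TABLE7`, `HIGHER_IS_BETTER`, and `is_better` are hypothetical helpers introduced only for this illustration.

```python
# Direction of each metric: True means a larger value is better.
HIGHER_IS_BETTER = {"ACC": True, "BA": True, "AUC": True,
                    "F-score": True, "Log loss": False, "Brier score": False}
METRICS = list(HIGHER_IS_BETTER)

# (Clf row, Eclf row) for each dataset, copied from Table 7.
TABLE7 = {
    "Australian": ((0.87319, 0.87472, 0.93884, 0.85985, 0.32369, 0.09336),
                   (0.87754, 0.88018, 0.93996, 0.86628, 0.43280, 0.10330)),
    "German": ((0.76500, 0.70088, 0.79073, 0.83606, 0.50984, 0.16689),
               (0.77200, 0.69907, 0.81119, 0.84432, 0.66886, 0.17990)),
    "Japanese": ((0.87899, 0.88142, 0.93681, 0.88668, 0.53778, 0.09346),
                 (0.87971, 0.87959, 0.93714, 0.89255, 0.42469, 0.10069)),
    "Hmeq": ((0.92089, 0.84641, 0.95837, 0.78212, 0.19789, 0.05871),
             (0.91745, 0.84797, 0.96359, 0.78461, 0.36140, 0.07173)),
}


def is_better(metric, new, old):
    """True when `new` beats `old` under the direction of `metric`."""
    return new > old if HIGHER_IS_BETTER[metric] else new < old


# Count the cells in which the ensemble (Eclf) beats the best base classifier (Clf).
wins = sum(
    is_better(metric, eclf_value, clf_value)
    for clf_row, eclf_row in TABLE7.values()
    for metric, clf_value, eclf_value in zip(METRICS, clf_row, eclf_row)
)
total = len(TABLE7) * len(METRICS)
share = wins / total
print(f"{wins}/{total} = {share:.1%}")  # 14/24 = 58.3%
```

The same comparison applies to the proportions reported for Tables 4 and 5, where each table contributes 4 datasets × 7 classifiers × 6 metrics = 168 cells.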

5.5 Performance comparisons between the proposed ensemble model and benchmark ensemble models

The performance of the proposed ensemble model is compared with that of recent benchmark ensemble models proposed by He et al. [23], Zhang et al. [46], García et al. [21], Xiao et al. [44], and Liu and Hsieh [31]. In Table 8, the evaluation results of the proposed ensemble model that are superior to those of the benchmark models are highlighted in bold. Certain evaluation metrics are not reported in the benchmark works; the corresponding entries are marked as “/” in the table. The table indicates that the proposed ensemble model achieves the best performance for the majority of the evaluation metrics. For a clearer comparison, Fig. 6(a)-(d) shows the AUC of the proposed model and the benchmark models on the Australian, German, Japanese, and Hmeq datasets, respectively.

Fig. 6

Comparison of AUC results among different models on four datasets.

Table 8

Performance comparisons between the proposed model and benchmark models

Datasets	Ensemble models	ACC	BA	AUC	F-score	Log loss	Brier score
Australian	He et al. [23]	/	/	0.93404	0.85020	0.33193	/
	Zhang et al. [46]	0.87540	/	0.93700	/	/	0.09380
	García et al. [21]	/	/	0.93600	/	/	/
	Xiao et al. [44]	0.86890	/	0.91280	/	/	/
	The proposed model	0.87754	0.88018	0.93996	0.86628	0.43280	0.10330
German	He et al. [23]	/	/	0.80021	0.84439	0.49369	/
	Zhang et al. [46]	0.76820	/	0.80290	/	/	0.16030
	García et al. [21]	/	/	0.79400	/	/	/
	Xiao et al. [44]	0.73760	/	0.75610	/	/	/
	The proposed model	0.77200	0.69907	0.81119	0.84432	0.66886	0.17990
Japanese	He et al. [23]	/	/	0.93058	0.87004	0.33957	/
	Zhang et al. [46]	0.87200	/	0.93870	/	/	0.09470
	García et al. [21]	/	/	0.93600	/	/	/
	The proposed model	0.87971	0.87959	0.93714	0.89255	0.42469	0.10069
Hmeq	Liu and Hsieh [31]	/	/	0.92330	/	/	/
	The proposed model	0.91745	0.84797	0.96359	0.78461	0.36140	0.07173

Besides the above benchmark ensemble models, it is worth noting that the proposed model performs worse than the ensemble model proposed by Zhang et al. [47] on small, noisy credit datasets, although the two models are not directly comparable because they process noise differently. The proposed model and the other benchmark ensemble models handle outliers (noise) by removing them, whereas Zhang et al. [47] handle outliers by boosting them, which tends to bias the model toward the outlier samples [17], cause overfitting to the outliers, and increase the computational complexity. Although the model of Zhang et al. [47] performs better on small, noisy credit datasets, it cannot be taken for granted that it performs equally well on large-scale credit datasets with little noise. Future work can therefore explore large-scale credit datasets with different noise levels by integrating the two ideas for handling outliers.

6 Conclusion and future work

Ensemble models have attracted great attention in various application fields including credit scoring. This study proposes a novel multi-stage ensemble model based on synthetic sampling and feature transformation through the extended model-based synthetic sampling method, a new bagging-based feature transformation method and a DES-based two-layer ensemble method. The proposed ensemble model is verified on four standardized credit datasets using six evaluation metrics. The experimental results demonstrate the superior performance of the proposed ensemble model over other benchmark models.

However, this study has limitations. First, the influence of sensitive features on the classification fairness of the model is not considered. Second, only one feature transformation method is used in the feature transformation stage; integrating several heterogeneous feature transformation methods could improve the robustness of the model. Finally, the performance of the proposed model could be improved by integrating different outlier handling methods. These issues will be studied in future work.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (Nos. 51875503 and 51975512), the Zhejiang Natural Science Foundation of China (No. LZ20E050001), the Zhejiang Key R&D Project of China (No. 2021C03153), and the Humanities and Social Sciences Research Project of the Education Ministry of China (No. 20YJC870003).

References

1. Abedin M.Z., Chi G.T., Uddin M.M., Satu M.S., Khan and Hajek, Tax default prediction using feature transformation-based machine learning, IEEE Access 9 (2020), 19864–19881.
2. Agarwal, Beygelzimer, Dudík, Langford and Wallach H.M., A reductions approach to fair classification. In Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden 80 (2018), pp. 60–69, July 10-15, 2018.
3. Asuncion and Newman, UCI Machine Learning Repository. Irvine, CA: School of Information and Computer Science, University of California. (2007). http://www.ics.uci.edu/~mlearn/MLRepository.html.
4. Babu and Ananthanarayanan N.R., EMOTE: Enhanced minority oversampling technique, Journal of Intelligent & Fuzzy Systems 33(1) (2017), 67–78.
5. Baesens, Roesch and Scheule, Credit Risk Analytics: Measurement Techniques, Applications and Examples in SAS. John Wiley & Sons, Hoboken, New Jersey. (2016).
6. Bishop C.M., Pattern Recognition and Machine Learning. New York: Springer. (2006).
7. Breiman, Bagging predictors, Machine Learning 24(2) (1996), 123–140.
8. Breiman, Random forests, Machine Learning 45(1) (2001), 5–32.
9. Brier G.W., Verification of forecasts expressed in terms of probability, Monthly Weather Review 78(1) (1950), 1–3.
10. Brodersen K.H., Ong C.S., Stephan K.E. and Buhmann J.M., The balanced accuracy and its posterior distribution. In Proceedings of the 20th International Conference on Pattern Recognition, Istanbul, Turkey (2010), pp. 3121–3124, August 23-26, 2010.
11. Bryll, Gutierrez-Osuna and Quek, Attribute bagging: Improving accuracy of classifier ensembles by using random feature subsets, Pattern Recognition 36(6) (2003), 1291–1302.
12. Chen F.L. and Li F.C., Combination of feature selection approaches with SVM in credit scoring, Expert Systems with Applications 37(7) (2010), 4902–4909.
13. Chen T.Q. and Guestrin, XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, USA (2016), pp. 785–794, August 13-17, 2016.
14. Crook J.N., Edelman D.B. and Thomas L.C., Recent developments in consumer credit risk assessment, European Journal of Operational Research 183(3) (2007), 1447–1465.
15. Cruz R.M.O., Sabourin, Cavalcanti G.D.C. and Ren T.I., META-DES: A dynamic ensemble selection framework using meta-learning, Pattern Recognition 48(5) (2015), 1925–1935.
16. Das A.K., Das and Ghosh, Ensemble feature selection using bi-objective genetic algorithm, Knowledge-Based Systems 123 (2017), 116–127.
17. Domingues, Filippone, Michiardi and Zouaoui, A comparative evaluation of outlier detection algorithms: Experiments and analyses, Pattern Recognition 74 (2018), 406–421.
18. Fawcett, ROC graphs: Notes and practical considerations for researchers, Pattern Recognition Letters 31(8) (2004), 1–38.
19. Freund and Schapire R.E., Experiments with a new boosting algorithm. In Proceedings of the 13th International Conference on Machine Learning, Bari, Italy (1996), pp. 148–156, July 3-6, 1996.
20. Friedman J.H., Greedy function approximation: A gradient boosting machine, The Annals of Statistics 29(5) (2001), 1189–1232.
21. García, Marqués A.I. and Sánchez J.S., Exploring the synergetic effects of sample types on the performance of ensembles for credit risk and corporate bankruptcy prediction, Information Fusion 47 (2019), 88–101.
22. Hand D.J. and Kelly M.G., Superscorecards, IMA Journal of Management Mathematics 4 (2002), 273–281.

23. He H.L., Zhang W.Y. and Zhang, A novel ensemble method for credit scoring: Adaption of different imbalance ratios, Expert Systems with Applications 98 (2018), 105–117.
24. He X.R., Pan J.F., Jin, Xu T.B., Liu, Xu, Shi Y.X., Atallah, Herbrich, Bowers and Candela J.Q., Practical lessons from predicting clicks on ads at Facebook. In Proceedings of the 8th International Workshop on Data Mining for Online Advertising, New York, NY, USA (2014), pp. 1–9, August 24-27, 2014.
25. Ho T.K., Hull J.J. and Srihari S.N., Decision combination in multiple classifier systems, IEEE Transactions on Pattern Analysis and Machine Intelligence 16(1) (1994), 66–75.
26. […], Huang Y.X., Qiang B.H. and Li, Min-max ensemble feature selection, Journal of Intelligent & Fuzzy Systems 33(6) (2017), 3441–3450.
27. Ke G.L., Meng, Finley, Wang T.F., Chen, Ma W.D., Ye Q.W. and Liu T.Y., LightGBM: A highly efficient gradient boosting decision tree. In Proceedings of Annual 2017 Conference on Neural Information Processing Systems, California, USA (2017), pp. 3146–3154, December 4-9, 2017.
28. Ko A.H.R., Sabourin and Britto A.S. Jr., From dynamic classifier selection to dynamic ensemble selection, Pattern Recognition 41(5) (2008), 1718–1731.
29. Kumar N.S., Rao K.N., Govardhan, Reddy K.S. and Mahmood A.M., Undersampled k-means approach for handling imbalanced distributed data, Progress in Artificial Intelligence 3(1) (2014), 29–38.
30. Li Y.H. and Chen W.D., A comparative performance assessment of ensemble learning for credit scoring, Mathematics 8(10) (2020), 1756–1775.

31. Liu C.L. and Hsieh P.Y., Model-based synthetic sampling for imbalanced data, IEEE Transactions on Knowledge and Data Engineering 32(8) (2020), 1543–1556.
32. Liu and Motoda, Feature Selection for Knowledge Discovery and Data Mining. Kluwer Academic Publishers, Norwell, MA. (1998).
33. Maglietta, Rosalia, D’Addabbo and Annarita, Parallel selective sampling method for imbalanced and large data classification, Pattern Recognition Letters 62 (2015), 61–67.
34. Ntoutsi, Fafalios, Gadiraju, Iosifidis, Nejdl, Vidal, Ruggieri, Turini, Papadopoulos, Krasanakis, Kompatsiaris, Kinder-Kurlanda, Wagner, Karimi, Fernandez, Alani, Berendt, Kruegel, Heinze, Broelemann, Kasneci, Tiropanis and Staab, Bias in data-driven AI system-an introductory survey, WIREs Data Mining and Knowledge Discovery 10(3) (2020), e1356.
35. Parisi and Ravichandran, Evolutionary feature transformation to improve prognostic prediction of hepatitis, Knowledge-Based Systems 200 (2020), 106012.
36. Pearson K., On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling, The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science 50(302) (1900), 157–175.
37. Rogers and Gunn, Ensemble algorithms for feature selection. In Deterministic and Statistical Methods in Machine Learning, Lecture Notes in Computer Science, J. Winkler, M. Niranjan and N. Lawrence (Eds.), Springer Berlin/Heidelberg 3635 (2005), 180–198.
38. Smith M.R., Martinez and Giraud-Carrier, An instance level analysis of data complexity, Machine Learning 95(2) (2014), 225–256.
39. Stehman S.V., Selecting and interpreting measures of thematic classification accuracy, Remote Sensing of Environment 62(1) (1997), 77–89.
40. Sun, Song, Zhu, Sun, Xu and Zhou, A novel ensemble method for classifying imbalanced data, Pattern Recognition 48(5) (2015), 1623–1637.
41. Tripathi, Edla D.R. and Cheruku, Hybrid credit scoring model using neighborhood rough set and multi-layer ensemble classification, Journal of Intelligent & Fuzzy Systems 34(3) (2018), 1543–1549.
42. Wolpert D.H., Stacked generalization, Neural Networks 5(2) (1992), 241–259.
43. Xia, Liu, Da and Xie, A novel heterogeneous ensemble credit scoring model based on bstacking approach, Expert Systems with Applications 93 (2018), 182–199.
44. Xiao, Zhou, Zhong, Xie, Gu and Liu, Cost-sensitive semi-supervised selective ensemble model for customer credit scoring, Knowledge-Based Systems 189 (2020), 105118.
45. Yang D.Q., Zhang W.Y., Wu, Ablanedo-Rosas J.H., Yang L.X. and Yu W.Z., A novel multi-stage ensemble model with fuzzy clustering and optimized classifier composition for corporate bankruptcy prediction, Journal of Intelligent & Fuzzy Systems 43(3) (2021), 4169–4185.
46. Zhang W.Y., He H.L. and Zhang, A novel multi-stage hybrid model with enhanced multi-population niche genetic algorithm: An application in credit scoring, Expert Systems with Applications 121 (2019), 221–232.
47. Zhang W.Y., Yang D.Q., Zhang, Ablanedo-Rosas J.H., Wu and Lou, A novel multi-stage ensemble model with enhanced outlier adaptation for credit scoring, Expert Systems with Applications 165 (2021), 113872.
48. Zyblewski, Sabourin and Woźniak, Preprocessed dynamic classifier ensemble selection for highly imbalanced drifted data streams, Information Fusion 66 (2020), 138–154.
