A novel multi-stage ensemble model with multiple K-means-based selective undersampling: An application in credit scoring

Abstract

With the advancement of machine learning, credit scoring can be performed more accurately. As one of the widely recognized machine learning methods, ensemble learning has demonstrated significant improvements in predictive accuracy over individual machine learning models for credit scoring. This study proposes a novel multi-stage ensemble model with multiple K-means-based selective undersampling for credit scoring. First, a new multiple K-means-based undersampling method is proposed to deal with imbalanced data. Then, a new selective sampling mechanism is proposed to adaptively select the better-performing base classifiers. Finally, a new feature-enhanced stacking method is proposed to construct an effective ensemble model by composing the shortlisted base classifiers. In the experiments, four datasets and four evaluation indicators are used to evaluate the performance of the proposed model, and the experimental results demonstrate the superiority of the proposed model over the benchmark models.

Keywords

Credit scoring, ensemble model, imbalanced learning, K-means, stacking

1 Introduction

The credit market has developed rapidly over the past few decades, and financial institutions have faced severe challenges. Previously, financial institutions relied on the experiences of managers or simple statistical methods to evaluate borrowers and decide whether to issue loans [41]. However, traditional statistical methods gradually lost their effectiveness due to the increasingly complex characteristics of borrowers, causing financial institutions to incur losses. Therefore, machine learning methods, which have advantages such as predictive accuracy and stability, have been applied widely in credit scoring to identify potential behavior patterns from past data and assess the credit scores of borrowers before issuing loans [40].

In complex real-world applications for credit scoring, it is helpful to select, organize, and optimize various machine learning methods well [9, 50]. One promising scheme is to employ ensemble learning to improve the effectiveness of the credit scoring model [39]. Ensemble learning integrates multiple weak classifiers to form a strong classifier, which can effectively improve the stability of the model and accuracy of the prediction [20]. Various base classifiers, such as multilayer perceptron (MLP) [35], decision tree (DT) [37], AdaBoost [12], gradient boosting (GBDT) [13], random forest (RF) [4], extreme gradient boosting (XGBoost) [8], linear discriminant analysis (LDA) [27], and light gradient boosting machine (LGBM) [23], can be used to compose an ensemble model.

Although ensemble learning has many advantages, similar to all machine learning methods, it is affected by imbalanced data [15, 28]. Further, in the real world of financial institutions, the negative samples (users with good credit) significantly outnumber the positive samples (users with bad credit), and most real datasets can be regarded as imbalanced. Failure to consider the imbalanced data problem may result in the credit scoring model being overwhelmed by the negative samples and ignoring the positive samples. Therefore, this study is motivated by solving the imbalanced data problem and building an effective credit scoring model. The main contributions of this study can be summarized as follows:

A novel multi-stage credit scoring model is constructed to obtain a superior prediction result from real datasets with different distributions. The proposed methods can enhance the robustness of the credit scoring model by dealing with imbalanced data and composing the better-performing base classifiers effectively.

A new multiple K-means-based undersampling (MKBUS) method is proposed to address the imbalanced data problem. It can not only alleviate class overlapping and small disjuncts problems but also reduce information loss during sampling.

A new selective sampling mechanism (SSM) is proposed. It can adaptively select the better-performing ones between the candidate base classifiers trained using the original imbalanced data and sampled data, respectively, to constitute the candidate trained base classifiers for the proposed model.

A new feature-enhanced stacking (FES) method is proposed to compose the shortlisted base classifiers into an effective classifier ensemble.

Four imbalanced credit scoring datasets were used to verify the effectiveness of the proposed model. Inspired by the statistical method proposed by Demšar [10], the Friedman test [14] and the Nemenyi test [32] were used to validate the reliability of the experiments further.

The remainder of this paper is organized as follows. Section 2 presents a literature review on imbalanced learning and classifier ensemble. Section 3 introduces the methodology of the proposed model. Section 4 details the experimental design. Section 5 presents the experimental results and experimental analysis. Section 6 provides conclusions and directions for future studies.

2 Literature review

Ensemble models for credit scoring have been an active research topic over the last few decades. In this study, the model performance was improved by addressing the imbalanced data problem using a novel multi-stage ensemble model with multiple K-means-based selective undersampling. In this section, the related works on two aspects, namely imbalanced learning and classifier ensemble, are reviewed.

2.1 Imbalanced learning

In the field of machine learning, learning from imbalanced data is a relatively new challenge that has been receiving increasing attention from both academia and industry [17]. Imbalanced data are characterized by unequal numbers of positive and negative samples in the training data. Galar et al. [15] reported that class overlapping and small disjuncts in imbalanced data increase the difficulty in training effective classifiers. In the past two decades, to alleviate the impact of imbalanced data on classifiers, many scholars have proposed imbalanced learning approaches, in particular the widely used undersampling methods, which balance the data by reducing the number of samples in the majority class.

Random undersampling (RUS) is one of the simplest undersampling approaches; it balances data by randomly removing some of the negative samples. Recently, Lin et al. [30] proposed a clustering-based undersampling approach in which the K-means clustering technique is used to cluster the negative samples, and the cluster centers are used to replace the original negative samples. However, no evidence has been provided to show that the centers of negative clusters have the same effects as negative samples on the classifiers; this method may introduce extra noisy data and eliminate useful information. Ofek et al. [33] proposed a fast clustering-based undersampling method in which K-means is used to cluster the positive samples, after which each positive cluster is assigned as many negative samples as it has positive samples. The assignment criterion is to assign each negative sample to its closest positive cluster. However, the negative samples that cause class overlapping remain because they are usually close to the positive clusters. Furthermore, the K-means-based undersampling methods mentioned above cluster only the positive samples or only the negative samples and ignore the complex relationships between the negative and positive samples that tend to result in class overlapping. Additionally, K-means clustering is highly volatile because its result depends on the initial centers, which makes K-means-based undersampling significantly volatile as well. Therefore, a single K-means scheme cannot divide all samples correctly [2].

Although the existing imbalanced learning approaches, including RUS and K-means clustering technique, can alleviate the effect of imbalanced data on most classifiers to a certain degree, the undersampling methods are highly volatile and tend to cause information loss. Few studies have addressed data imbalance by effectively resolving class overlapping and small disjuncts problems.

To overcome the abovementioned limitations, a new imbalanced learning method called MKBUS is proposed in this study, which handles undersampling by combining multiple K-means schemes [22] with different initial centers. MKBUS deals with imbalanced data by removing the negative samples from clusters that are dominated by positive samples and by retaining all the samples in clusters that are dominated by negative samples. Compared to traditional sampling approaches, MKBUS not only alleviates class overlapping and small disjuncts problems but also reduces information loss during sampling.

Although the undersampling approaches can improve the performance of most classifiers, it is still possible that a few classifiers perform better when trained using the original imbalanced data than when trained using the sampled data. It has been reported in some studies that most classifiers performed better when sampled data were used to train the models, but a few classifiers still performed worse when sampled data were used [7, 49]. The types of classifiers that perform better on the original imbalanced data are less affected by data imbalance and can be considered adaptable to imbalanced data. However, in the existing literature on ensemble learning, sampled data were primarily used to train all the base classifiers, and few studies considered the influence of classifier adaptability on the performance of ensemble models.

To overcome the limitation mentioned above, a new selective sampling mechanism, i.e., SSM, is proposed in this study to adaptively select the better-performing ones between the candidate base classifiers trained using the original imbalanced data and sampled data, respectively.

2.2 Classifier ensemble

In the domain of credit scoring, machine learning has the potential to outperform traditional statistical methods. However, Wolpert & Macready [46] suggested that no single complex classifier in machine learning can effectively solve various problems. Therefore, the ensemble of multiple base classifiers has been employed widely to solve credit scoring problems. Ensemble methods, such as stacking [45], boosting [38], and bagging [3], can be used to combine multiple base classifiers to construct ensemble models that outperform any single base classifier. Bagging and boosting are often used to construct homogeneous models from the same type of base classifiers, and stacking is suitable for constructing both heterogeneous models (with different types of classifiers) and homogeneous models (with the same type of classifiers).

Recently, the superior performance and robustness of heterogeneous models have been demonstrated and have drawn considerable attention [42–44]. The stacking method integrates base classifiers by learning the prediction results of base classifiers through a meta-classifier. However, Lessmann et al. [26] evaluated different types of ensemble models for credit scoring and argued that stacking was sometimes less effective than boosting and bagging. Xia et al. [47] discovered that a simple stacking structure design and poorly performing base classifiers are the major factors affecting the performance of the ensemble model. Therefore, in this study, the design of the stacking structure is improved through a new FES method that adaptively selects and combines the features of training data, after which the shortlisted base classifiers are combined into an effective ensemble model for credit scoring.

3 Methodology

The model proposed in this study consists of three stages: the MKBUS method for addressing imbalanced data, the SSM method for selecting the better-performing base classifiers, and the FES method for combining the shortlisted base classifiers into an effective ensemble model. The framework of the proposed model is shown in Fig. 1.

Fig. 1

Framework of the proposed model.

3.1 Multiple K-means-based undersampling (MKBUS) method

Traditional imbalanced learning approaches via clustering-based undersampling have high volatility and tend to cause information loss. To overcome these drawbacks, a new imbalanced learning method, i.e., MKBUS, is proposed in this study, with the following steps demonstrated using an illustrative example shown in Fig. 2.

Fig. 2

Process of the MKBUS.

Step 1: Divide the original imbalanced training data into Z clusters several times using multiple K-means schemes [22] with different initial centers. Then, inspired by Fred [11], use a voting method to integrate the decisions of these K-means schemes and obtain the Z ensemble clusters.

Step 2: Calculate the negative imbalance coefficient, $\theta_{a}$, of the $a$th ensemble cluster using Equation (1), where $a$ belongs to $[1, Z]$, and $X_{neg}^{a}$ and $X_{pos}^{a}$ represent the numbers of negative and positive samples in the $a$th ensemble cluster, respectively. $\theta_{a} = \frac{X_{neg}^{a}}{X_{pos}^{a} + X_{neg}^{a}}$ (1)

Step 3: Compare the negative imbalance coefficient of each ensemble cluster with a given threshold, R. If the negative imbalance coefficient of an ensemble cluster is lower than R, all positive samples in this cluster are retained and all negative samples in this cluster are removed. Otherwise, all samples in this cluster are retained.

Step 4: Combine Z ensemble clusters to form the sampled training data.

From these steps, it can be observed that MKBUS first integrates the decisions of multiple K-means schemes to obtain ensemble clusters. Next, the imbalanced data are handled by selectively removing the negative samples from the ensemble clusters that are dominated by positive samples. This is unlike the traditional RUS, which randomly removes some negative samples, and unlike existing clustering-based undersampling approaches, which cluster only the positive or only the negative samples. The K-means algorithm organizes samples with similar features into the same clusters. Samples of different classes (positive or negative) in the same ensemble cluster may cause class overlapping and small disjuncts, which increase the difficulty of effectively training classifiers [15]. Hence, MKBUS removes the negative samples from the ensemble clusters that are dominated by positive samples. Conversely, MKBUS retains the negative samples in the clusters that are dominated by negative samples and hence reduces the information loss arising from undersampling. Compared to traditional clustering-based undersampling approaches, MKBUS clusters samples with less volatility and avoids removing too much useful information.
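The four steps can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the implementation used in the experiments: the voting scheme (relabeling each K-means run against the first run with the Hungarian algorithm and then taking a majority vote) is only one possible reading of the voting in Step 1, and the function names `ensemble_kmeans_labels` and `mkbus`, the number of runs, and the default threshold are hypothetical. Labels are assumed to be 1 for positive and 0 for negative samples.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import KMeans


def ensemble_kmeans_labels(X, n_clusters, n_runs=5, seed=0):
    """Step 1 (assumed voting scheme): majority vote over several K-means runs."""
    runs = [
        KMeans(n_clusters=n_clusters, n_init=1, random_state=seed + r).fit_predict(X)
        for r in range(n_runs)
    ]
    reference = runs[0]
    votes = np.zeros((X.shape[0], n_clusters), dtype=int)
    for labels in runs:
        # Contingency table between this run's clusters and the reference clusters.
        overlap = np.zeros((n_clusters, n_clusters), dtype=int)
        np.add.at(overlap, (labels, reference), 1)
        # Relabel this run so that it agrees with the reference as much as possible.
        rows, cols = linear_sum_assignment(-overlap)
        mapping = np.empty(n_clusters, dtype=int)
        mapping[rows] = cols
        votes[np.arange(X.shape[0]), mapping[labels]] += 1
    return votes.argmax(axis=1)


def mkbus(X, y, n_clusters=4, threshold=0.5, n_runs=5, seed=0):
    """Return a boolean mask of the samples kept after undersampling."""
    clusters = ensemble_kmeans_labels(X, n_clusters, n_runs, seed)
    keep = np.ones(len(y), dtype=bool)
    for a in range(n_clusters):
        in_cluster = clusters == a
        n_total = in_cluster.sum()
        if n_total == 0:
            continue
        # Step 2: negative imbalance coefficient of cluster a, Equation (1).
        theta = (y[in_cluster] == 0).sum() / n_total
        # Step 3: drop the negatives only where positives dominate the cluster.
        if theta < threshold:
            keep[in_cluster & (y == 0)] = False
    # Step 4: the mask selects the union of the retained samples of all clusters.
    return keep
```

The sampled training data are then `X[keep], y[keep]`. Returning a mask rather than copies keeps the original data available for the comparison against classifiers trained without sampling.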

3.2 Selective sampling mechanism (SSM) method

The sampling approaches are useful for most classifiers that are less adaptable to imbalanced data. However, they change the distribution of the original data, affecting the performance of classifiers that are adaptable to imbalanced data. It is therefore necessary to consider the influence of classifier adaptability on the performance of ensemble models and to identify, for each classifier, whether the original imbalanced data or the sampled data yield better performance in training.

However, the existing ensemble models mainly use sampled data to train all the base classifiers and ignore the influence of classifier adaptability. To overcome these shortcomings, a new SSM method is proposed in this study to adaptively select the better-performing ones between the candidate base classifiers trained using the original imbalanced data and sampled data, respectively.

As depicted in Fig. 3, Clf_i represents the ith one of I candidate base classifiers, and Sclf_i and Oclf_i represent the ith candidate base classifiers trained using the sampled data and original imbalanced data, respectively.

Fig. 3

Framework of the SSM.

First, all the candidate base classifiers are trained using the sampled data and original imbalanced data respectively. Next, the validation data is adopted for pairwise comparison between the candidate base classifiers trained using sampled data and original imbalanced data, where the performance of each trained base classifier is measured based on the area under the receiver operating characteristic curve (AUC) [19]. If the AUC of Sclf_i is higher than that of Oclf_i, this candidate base classifier is less adaptable to imbalanced data, and Sclf_i is selected as one of the candidate trained base classifiers. Otherwise, this candidate base classifier is adaptable to imbalanced data, and Oclf_i is selected as one of the candidate trained base classifiers. Finally, the performances of the candidate trained base classifiers are evaluated and ranked by AUC values to form the shortlisted base classifiers that will be composed into a classifier ensemble in the next stage.
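The selection logic described above can be sketched as follows. This is an illustrative sketch, not the authors' code: the function name `selective_sampling`, its argument layout, and the default shortlist size of four are assumptions, and every candidate is assumed to expose `predict_proba`.

```python
from sklearn.base import clone
from sklearn.metrics import roc_auc_score


def selective_sampling(candidates, original, sampled, validation, n_shortlist=4):
    """Pick, per classifier, the better of Oclf_i and Sclf_i by validation AUC.

    candidates: dict mapping a name to an unfitted scikit-learn classifier.
    original, sampled, validation: (X, y) tuples.
    Returns a list of (auc, name, source, fitted_classifier), best first.
    """
    X_val, y_val = validation
    selected = []
    for name, clf in candidates.items():
        o_clf = clone(clf).fit(*original)  # Oclf_i: trained on the imbalanced data
        s_clf = clone(clf).fit(*sampled)   # Sclf_i: trained on the sampled data
        o_auc = roc_auc_score(y_val, o_clf.predict_proba(X_val)[:, 1])
        s_auc = roc_auc_score(y_val, s_clf.predict_proba(X_val)[:, 1])
        if s_auc > o_auc:
            selected.append((s_auc, name, "sampled", s_clf))
        else:
            selected.append((o_auc, name, "original", o_clf))
    # Rank the candidate trained base classifiers by AUC and keep the shortlist.
    selected.sort(key=lambda item: item[0], reverse=True)
    return selected[:n_shortlist]
```

Ties go to the classifier trained on the original data, which mirrors the "otherwise" branch in the text.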

3.3 Feature-enhanced stacking (FES) and ensemble

Although the stacking method has been proven to improve the performance of the credit scoring model, the model robustness tends to be affected by the simple design of the stacking structure and underperforming base classifiers. Therefore, in this study, the architecture of stacking is improved by implementing a new FES method comprising three modules: primitive feature generation, synthetic feature generation, and feature selection and combination. Owing to its excellent performance as the meta-classifier in the stacking method, logistic regression (LR) is employed as a meta-classifier in the FES method. Figure 4 illustrates an example of the FES framework according to the following steps.

Fig. 4

Framework of the FES.

The implementation process of FES is as follows:

Step 1: Through the primitive feature generation in the first layer, the shortlisted K base classifiers obtained through MKBUS and SSM are employed to predict the validation data and testing data, and the corresponding prediction results are expressed as V_k (k = 1, 2, ... K) and T_k (k = 1, 2, ... K), respectively. Each V_k and T_k can be used as primitive features for training and testing the meta-classifier, respectively, in the second layer.

Step 2: Through the synthetic feature generation in the first layer, the validation results, V_k (k = 1, 2, ... K), are grouped into all possible combinations to produce the synthetic features, SV_m (m = 1, 2, ... M), each of which is the simple average of the combined V_k. M is determined using Equation (2), which is equivalent to $M = 2^{K} - 1$. The synthetic features, SV_m, can also be used for training the meta-classifier in the second layer. Similarly, the testing results, T_k, are combined in the same way to produce the synthetic features, ST_m (m = 1, 2, ... M), which can be used for testing the meta-classifier in the second layer. $M = \sum_{k = 1}^{K} \frac{K!}{k! \times (K - k)!}$ (2)

Step 3: In the second layer, the primitive and synthetic features are selected and combined to train and test the meta-classifier. This is achieved by discovering an optimum feature subset (i.e., salient features) through recursive elimination of useless features [18]. The meta-classifier outputs more accurate and robust predictions owing to the more comprehensive information provided by the salient primitive and synthetic features. The final ensemble model, which outputs the predictions on the testing data as the final result, is composed of the shortlisted base classifiers with enhanced features.
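The three FES steps can be sketched as below. This is a hedged illustration: the names `synthetic_features` and `fit_fes_meta` and the number of retained features are hypothetical, and scikit-learn's `RFE` is used as one possible stand-in for the recursive feature elimination of [18]. Following Equation (2) literally, the sum starts at k = 1, so the single-classifier combinations reproduce the primitive features and the returned matrix already contains both feature types.

```python
from itertools import combinations

import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression


def synthetic_features(P):
    """Average the predictions of every non-empty subset of base classifiers.

    P: array of shape (n_samples, K) holding V_k (or T_k) column by column.
    Returns an array of shape (n_samples, M) with M = 2**K - 1, Equation (2).
    """
    K = P.shape[1]
    columns = [
        P[:, list(subset)].mean(axis=1)
        for k in range(1, K + 1)
        for subset in combinations(range(K), k)
    ]
    return np.column_stack(columns)


def fit_fes_meta(V, y_val, n_features=5):
    """Select salient features recursively and fit the LR meta-classifier on them."""
    features = synthetic_features(V)
    selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=n_features)
    return selector.fit(features, y_val)
```

For K = 4 shortlisted base classifiers this yields 15 candidate features. Predictions on the testing data would be obtained with `meta.predict_proba(synthetic_features(T))[:, 1]`, where `meta` is the fitted selector and `T` stacks the T_k column by column.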

4 Experimental design

4.1 Credit datasets

To verify the effectiveness of the proposed model, four popular real-world credit datasets were employed in the experiment, namely, the Australian, Japanese, and German datasets, which are credit datasets from the UCI machine learning repository [1], and the AER dataset contributed by Greene [16]. The details of these datasets are listed in Table 1.

Table 1
Details of the four datasets

Dataset	Total samples	Positive samples	Negative samples	Numeric features	Nominal features	Total features
Australian	690	307	383	8	7	15
Japanese	690	307	383	5	11	16
German	1000	300	700	7	14	21
AER	1319	296	1023	6	6	12

The Australian dataset contains 690 samples, including 307 positive samples and 383 negative samples. The dimension of the input features, including the class label, is 15, with eight numeric features and seven nominal features. The Japanese dataset contains 690 samples, including 307 positive samples and 383 negative samples. The dimension of the input features, including the class label, is 16, with five numeric features and 11 nominal features. The German dataset contains 1000 samples, including 300 positive samples and 700 negative samples. The dimension of the input features, including the class label, is 21, with seven numeric features and 14 nominal features. The AER dataset contains 1319 samples, with 296 positive samples and 1023 negative samples. The dimension of the input features, including the class label, is 12, with six numeric features and six nominal features.

Data preprocessing plays an important role in ensemble modeling. In the data preprocessing stage, the missing values of the numeric features were filled with the mean values, and the missing values of the nominal features were filled with the mode values. To handle the different orders of magnitude of the numerical features, standardization and normalization were applied to the numerical features by removing the mean and scaling to unit variance. Because the values of a nominal feature have no meaningful distance relationships when encoded as integers, dummy coding was applied to transform the nominal feature values into binary attributes [5].
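A preprocessing pipeline of this kind could be assembled with scikit-learn as sketched below. The helper name `build_preprocessor` and the column lists are hypothetical, and `OneHotEncoder` is used here as a stand-in for dummy coding (it produces one binary column per category rather than one fewer).

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def build_preprocessor(numeric_cols, nominal_cols):
    """Mean-impute and standardize numeric columns; mode-impute and encode nominal ones."""
    numeric = Pipeline([
        ("impute", SimpleImputer(strategy="mean")),
        ("scale", StandardScaler()),  # removes the mean, scales to unit variance
    ])
    nominal = Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ])
    return ColumnTransformer([
        ("num", numeric, numeric_cols),
        ("nom", nominal, nominal_cols),
    ])
```

Fitting the transformer on the training data only and then applying it to the validation and testing data avoids leaking statistics across the splits; whether the original experiments did so is not stated.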

4.2 Evaluation indicators

In this study, four evaluation indicators were adopted: the AUC, geometric mean (G-Mean) [24], Kolmogorov-Smirnov statistic (KS) [29], and balanced accuracy (BACC) [6]. These evaluation indicators are calculated based on the numbers of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). For all four indicators, a higher value indicates a better classifier. The specific calculation rules of these indicators are as follows.

AUC is a classical evaluation indicator in classification problems, and is defined as the area under the receiver operating characteristic curve, that is, the area between the curve and the horizontal (FPR) axis.

G-Mean is a comprehensive indicator and is widely used to measure the accuracy of models in imbalanced learning. The definition of G-Mean is presented in Equations (3–5). $G\text{-}Mean = \sqrt{Sensitivity \times Specificity}$ (3) $Sensitivity = TPR = \frac{TP}{TP + FN}$ (4) $Specificity = \frac{TN}{TN + FP}$ (5)

KS is used to measure a classifier’s ability to identify the samples correctly. The definition of KS is presented in Equation (7), where t represents the number of cumulative quantiles. The values of the true positive rate (TPR) and false positive rate (FPR) are represented by TPR(t) and FPR(t), respectively, when the cumulative quantile accumulates until t. The definitions of the TPR and FPR are provided in Equations (4) and (6), respectively. $FPR = \frac{FP}{FP + TN}$ (6) $KS = \max (| TPR (t) - FPR (t) |)$ (7)

BACC can more accurately reflect the classifier’s actual performance in imbalanced learning. It is defined in Equation (9), and the true negative rate (TNR) is defined in Equation (8). $TNR = \frac{TN}{TN + FP}$ (8) $BACC = \frac{TPR + TNR}{2}$ (9)
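Equations (3) to (9) can be computed as in the following sketch. The function name `credit_metrics` and the 0.5 cutoff used to turn scores into class predictions are assumptions, since the cutoff is not stated here; KS is taken over the thresholds returned by scikit-learn's `roc_curve`.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score, roc_curve


def credit_metrics(y_true, y_score, cutoff=0.5):
    """Return AUC, G-Mean, KS, and BACC for binary labels (1 = positive)."""
    y_pred = (np.asarray(y_score) >= cutoff).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    tpr = tp / (tp + fn)  # Equation (4), sensitivity
    tnr = tn / (tn + fp)  # Equations (5) and (8), specificity
    fpr_curve, tpr_curve, _ = roc_curve(y_true, y_score)
    return {
        "AUC": roc_auc_score(y_true, y_score),
        "G-Mean": float(np.sqrt(tpr * tnr)),                 # Equation (3)
        "KS": float(np.max(np.abs(tpr_curve - fpr_curve))),  # Equation (7)
        "BACC": float((tpr + tnr) / 2),                      # Equation (9)
    }
```

For example, with labels `[0, 0, 1, 1]` and scores `[0.1, 0.4, 0.35, 0.8]`, the function returns an AUC of 0.75, a G-Mean of about 0.707, a KS of 0.5, and a BACC of 0.75.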

4.3 Experimental parameter settings

The raw dataset was divided into training data, validation data, and testing data at a ratio of 0.64, 0.16, and 0.2, respectively, using the “train_test_split” function in Python’s scikit-learn (sklearn) module. In the data preprocessing stage, the standardization and normalization of the numerical features were performed using the preprocessing package in Python’s sklearn module.
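One way to obtain the 0.64/0.16/0.2 ratio is to call "train_test_split" twice, first holding out 20% for testing and then 20% of the remaining 80% for validation (0.8 × 0.2 = 0.16, leaving 0.8 × 0.8 = 0.64 for training). The sketch below assumes this two-step procedure; the helper name `split_64_16_20` and the fixed seed are hypothetical.

```python
from sklearn.model_selection import train_test_split


def split_64_16_20(X, y, seed=0):
    """Split into training (64%), validation (16%), and testing (20%) data."""
    X_rest, X_test, y_rest, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    X_train, X_val, y_train, y_val = train_test_split(
        X_rest, y_rest, test_size=0.2, random_state=seed
    )
    return (X_train, y_train), (X_val, y_val), (X_test, y_test)
```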

The candidate base classifiers, RF, DT, AdaBoost, MLP, GBDT, and LDA, were imported from the sklearn module. XGBoost and LGBM were imported from Python’s XGBoost module and LightGBM module, respectively. To facilitate a fair comparison, the default hyperparameters were adopted for all the candidate base classifiers.

5 Experimental results and analysis

In this study, eight candidate base classifiers, RF, XGBoost, AdaBoost, DT, GBDT, LGBM, MLP, and LDA, were adopted. Their performances were evaluated on four datasets using four evaluation indicators. The evaluation results of these base classifiers were used as benchmarks for comparison.

To reduce the bias of any single run and enhance the credibility of the experimental results, each experiment was performed 10 times, and the average values of the evaluation indicators were used for comparison. All the experiments were performed using Python version 3.7.5 on a PC with a 3.8 GHz Intel Core i7-10700K processor, 32 GB RAM, and Windows 10 operating system.

5.1 Baseline results

The baseline results, that is, the performances of the eight candidate base classifiers on the four datasets (i.e., Australian, Japanese, German, and AER), are shown in Table 2.

Table 2
Baseline results on the four datasets

Dataset	Classifier	AUC	BACC	G-mean	KS
Australian	AdaBoost	0.9049	0.8434	0.8423	0.7213
	LGBM	0.9204	0.8501	0.8490	0.7371
	DT	0.8044	0.8044	0.8033	0.6088
	RF	0.9263	0.8642	0.8637	0.7718
	XGBoost	0.9173	0.8547	0.8539	0.7420
	GBDT	0.9254	0.8664	0.8655	0.7740
	MLP	0.9223	0.8532	0.8529	0.7489
	LDA	0.9317	0.8660	0.8655	0.7732
Japanese	AdaBoost	0.8918	0.8490	0.8487	0.7278
	LGBM	0.9212	0.8549	0.8546	0.7420
	DT	0.8104	0.8104	0.8099	0.6207
	RF	0.9230	0.8621	0.8613	0.7593
	XGBoost	0.9182	0.8558	0.8556	0.7414
	GBDT	0.9186	0.8568	0.8561	0.7433
	MLP	0.9014	0.8376	0.8370	0.7044
	LDA	0.8851	0.8351	0.8340	0.7054
German	AdaBoost	0.7438	0.6613	0.6374	0.4283
	LGBM	0.7642	0.6692	0.6422	0.4412
	DT	0.6190	0.6190	0.6055	0.2381
	RF	0.7565	0.6260	0.5547	0.4112
	XGBoost	0.7607	0.6642	0.6369	0.4460
	GBDT	0.7622	0.6592	0.6250	0.4460
	MLP	0.7677	0.6693	0.6390	0.4248
	LDA	0.7553	0.6639	0.6354	0.4307
AER	AdaBoost	0.9421	0.8311	0.8244	0.7609
	LGBM	0.9458	0.8472	0.8414	0.7622
	DT	0.8030	0.8030	0.7964	0.6060
	RF	0.9356	0.8375	0.8298	0.7627
	XGBoost	0.9428	0.8476	0.8422	0.7666
	GBDT	0.9474	0.8448	0.8390	0.7658
	MLP	0.9278	0.7908	0.7760	0.7358
	LDA	0.9062	0.7229	0.6793	0.6854

5.2 Evaluation of MKBUS

To verify the effectiveness of the proposed MKBUS, the sampled data with MKBUS were used to train eight candidate base classifiers, and their performances were evaluated based on the four evaluation indicators. The experimental results are presented in Table 3. The values in bold indicate that compared with the baseline results on the same dataset, applying MKBUS improved the performance of the base classifier.

Table 3
Performance of candidate base classifiers after the application of MKBUS

Dataset	Classifier	AUC	BACC	G-mean	KS
Australian	AdaBoost	0.9145	0.8573	0.8566	0.7401
	LGBM	0.9260	0.8679	0.8673	0.7688
	DT	0.8194	0.8194	0.8189	0.6388
	RF	0.9270	0.8732	0.8717	0.7750
	XGBoost	0.9219	0.8694	0.8689	0.7630
	GBDT	0.9238	0.8622	0.8611	0.7614
	MLP	0.9352	0.8736	0.8708	0.7760
	LDA	0.9307	0.8665	0.8648	0.7763
Japanese	AdaBoost	0.8949	0.8365	0.8354	0.7075
	LGBM	0.9235	0.8603	0.8597	0.7567
	DT	0.8299	0.8299	0.8294	0.6598
	RF	0.9217	0.8635	0.8623	0.7645
	XGBoost	0.9184	0.8583	0.8578	0.7465
	GBDT	0.9209	0.8520	0.8509	0.7465
	MLP	0.9060	0.8479	0.8471	0.7302
	LDA	0.8796	0.8349	0.8339	0.7083
German	AdaBoost	0.7385	0.6658	0.6461	0.4186
	LGBM	0.7689	0.6676	0.6450	0.4462
	DT	0.6185	0.6185	0.6045	0.2369
	RF	0.7567	0.6502	0.6025	0.4150
	XGBoost	0.7592	0.6686	0.6483	0.4326
	GBDT	0.7655	0.6763	0.6515	0.4538
	MLP	0.7788	0.6782	0.6557	0.4529
	LDA	0.7543	0.6693	0.6493	0.4326
AER	AdaBoost	0.9431	0.8348	0.8294	0.7622
	LGBM	0.9453	0.8419	0.8368	0.7654
	DT	0.8155	0.8155	0.8105	0.6310
	RF	0.9376	0.8524	0.8471	0.7632
	XGBoost	0.9419	0.8478	0.8430	0.7604
	GBDT	0.9478	0.8538	0.8491	0.7691
	MLP	0.9275	0.8112	0.8014	0.7300
	LDA	0.9055	0.7257	0.6850	0.6826

According to a majority of the evaluation indicators, almost all of the candidate base classifiers performed better after MKBUS was applied. As explained in Section 3.2, the sampling approaches are effective in improving the performance of classifiers that are less adaptable to imbalanced data but are less effective for classifiers that are adaptable to imbalanced data. This justifies the application of the SSM to enhance the effectiveness of the sampling approaches in ensemble models. The effectiveness of the SSM will be verified in the next subsection.

To compare the effectiveness of the proposed MKBUS with that of traditional undersampling methods, such as the selection of the near-miss samples (NearMiss) [48] and clustering-based undersampling (CBUS) [30], the sampled data obtained using NearMiss and CBUS were each used to train the eight candidate base classifiers on the same datasets. The evaluation results are presented in Table 4. If NearMiss or CBUS outperformed MKBUS for a base classifier, the value of the evaluation indicator is in bold. The comparison showed that NearMiss and CBUS outperformed MKBUS on only a few evaluation indicators for a few candidate base classifiers. A likely explanation is that the proposed MKBUS clusters samples with less volatility and avoids removing too much useful information.

Table 4

Performance of candidate base classifiers after the application of NearMiss and CBUS

		NearMiss				CBUS
Dataset	Classifier	AUC	BACC	G-mean	KS	AUC	BACC	G-mean	KS
Australian	AdaBoost	0.8981	0.8411	0.8402	0.7235	0.8987	0.8309	0.8300	0.7095
	LGBM	0.9156	0.8544	0.8533	0.7375	0.9198	0.8542	0.8531	0.7364
	DT	0.7917	0.7917	0.7915	0.5834	0.7976	0.7976	0.7965	0.5951
	RF	0.9212	0.8599	0.8593	0.7557	0.9275	0.8696	0.8690	0.7674
	XGBoost	0.9126	0.8559	0.8551	0.7372	0.9189	0.8583	0.8576	0.7484
	GBDT	0.9167	0.8573	0.8566	0.7518	0.9205	0.8667	0.8660	0.7666
	MLP	0.9120	0.8393	0.8391	0.7250	0.9325	0.8717	0.8711	0.7686
	LDA	0.9240	0.8678	0.8672	0.7620	0.9311	0.8676	0.8669	0.7712
Japanese	AdaBoost	0.8846	0.8245	0.8234	0.6916	0.8809	0.8244	0.8237	0.6850
	LGBM	0.9167	0.8514	0.8511	0.7362	0.9202	0.8592	0.8589	0.7493
	DT	0.7922	0.7922	0.7918	0.5845	0.7901	0.7901	0.7896	0.5803
	RF	0.9210	0.8708	0.8699	0.7661	0.9228	0.8656	0.8650	0.7502
	XGBoost	0.9086	0.8420	0.8417	0.7222	0.9145	0.8539	0.8536	0.7423
	GBDT	0.9162	0.8485	0.8476	0.7329	0.9162	0.8501	0.8496	0.7380
	MLP	0.9012	0.8309	0.8303	0.7009	0.8987	0.8333	0.8326	0.6951
	LDA	0.8835	0.8350	0.8339	0.7005	0.8832	0.8322	0.8311	0.7014
German	AdaBoost	0.6755	0.3593	0.6530	0.3593	0.7438	0.6613	0.6374	0.4283
	LGBM	0.7060	0.3600	0.6582	0.3557	0.7598	0.6577	0.6266	0.4348
	DT	0.6233	0.2467	0.6233	0.2467	0.6190	0.6190	0.6055	0.2381
	RF	0.6825	0.3224	0.6277	0.3248	0.7560	0.6296	0.5634	0.4100
	XGBoost	0.7101	0.3645	0.6617	0.3645	0.7607	0.6642	0.6369	0.4460
	GBDT	0.7101	0.3767	0.6567	0.3798	0.7619	0.6599	0.6256	0.4474
	MLP	0.6782	0.3424	0.6501	0.3360	0.7744	0.6770	0.6474	0.4429
	LDA	0.7014	0.3688	0.6555	0.3688	0.7553	0.6639	0.6354	0.4307
AER	AdaBoost	0.8977	0.8233	0.8228	0.6978	0.9421	0.8311	0.8244	0.7609
	LGBM	0.9120	0.8357	0.8355	0.6893	0.9458	0.8472	0.8414	0.7622
	DT	0.7494	0.7494	0.7483	0.4989	0.8030	0.8030	0.7964	0.6060
	RF	0.9038	0.8179	0.8178	0.6709	0.9356	0.8375	0.8298	0.7627
	XGBoost	0.9001	0.8095	0.8091	0.6611	0.9428	0.8476	0.8422	0.7666
	GBDT	0.9061	0.8192	0.8188	0.6838	0.9474	0.8448	0.8390	0.7658
	MLP	0.8932	0.7881	0.7851	0.6619	0.9278	0.7908	0.7760	0.7358
	LDA	0.8670	0.7788	0.7773	0.6303	0.9062	0.7229	0.6793	0.6854

5.3 Evaluation of FES and SSM

To verify the effectiveness of the proposed FES and SSM methods, two configurations were evaluated using the four evaluation indicators: first, the FES ensemble model consisting of four shortlisted base classifiers with MKBUS applied; second, the FES ensemble model consisting of four shortlisted base classifiers with both MKBUS and SSM applied. The experimental results are presented in Table 5. For the FES ensemble model with MKBUS applied, values in bold indicate that it outperformed the single best-performing candidate base classifier with only MKBUS applied. For the FES ensemble model with both MKBUS and SSM applied, values in bold indicate that it outperformed the FES ensemble model with only MKBUS applied.

Table 5
Performance of FES ensemble models with MKBUS with/without SSM applied

Dataset	Method	AUC	BACC	G-mean	KS
Australian	MKBUS + FES	0.9388	0.8737	0.8722	0.7761
	MKBUS + SSM + FES	0.9398	0.8684	0.8667	0.7784
Japanese	MKBUS + FES	0.9235	0.8663	0.8652	0.7650
	MKBUS + SSM + FES	0.9245	0.8677	0.8664	0.7647
German	MKBUS + FES	0.7789	0.6972	0.6881	0.4474
	MKBUS + SSM + FES	0.7796	0.7013	0.6997	0.4484
AER	MKBUS + FES	0.9498	0.8725	0.8714	0.7709
	MKBUS + SSM + FES	0.9502	0.8736	0.8727	0.7776

The above comparison shows that the FES ensemble model performs better than the single best-performing candidate base classifier for most evaluation indicators. This is because the FES integrates the advantages of multiple base classifiers, and its internal meta-classifier makes more accurate and robust predictions owing to the more comprehensive input obtained by enhancing the salient primitive and synthetic features. This supports the effectiveness of the FES. Additionally, after the SSM is applied, the performance of the FES ensemble model is further enhanced for most evaluation indicators (13 of the 16 dataset-indicator pairs in Table 5; the exceptions are BACC and G-mean on the Australian dataset and KS on the Japanese dataset). This is because the SSM helps the ensemble model to adaptively select the better-performing classifiers from the candidate base classifiers trained on the original imbalanced data and on the sampled data. This supports the effectiveness of the SSM.
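
The claim that SSM helps on most indicators can be checked directly against Table 5. The following is a minimal sketch that only re-reads the table values; the names `TABLE5`, `INDICATORS`, and `ssm_gains` are illustrative and do not come from the original study.

```python
# Values copied from Table 5, in the order (AUC, BACC, G-mean, KS).
INDICATORS = ("AUC", "BACC", "G-mean", "KS")
TABLE5 = {
    "Australian": {
        "MKBUS + FES": (0.9388, 0.8737, 0.8722, 0.7761),
        "MKBUS + SSM + FES": (0.9398, 0.8684, 0.8667, 0.7784),
    },
    "Japanese": {
        "MKBUS + FES": (0.9235, 0.8663, 0.8652, 0.7650),
        "MKBUS + SSM + FES": (0.9245, 0.8677, 0.8664, 0.7647),
    },
    "German": {
        "MKBUS + FES": (0.7789, 0.6972, 0.6881, 0.4474),
        "MKBUS + SSM + FES": (0.7796, 0.7013, 0.6997, 0.4484),
    },
    "AER": {
        "MKBUS + FES": (0.9498, 0.8725, 0.8714, 0.7709),
        "MKBUS + SSM + FES": (0.9502, 0.8736, 0.8727, 0.7776),
    },
}


def ssm_gains(table):
    """Return {(dataset, indicator): score with SSM minus score without SSM}."""
    gains = {}
    for dataset, rows in table.items():
        without_ssm = rows["MKBUS + FES"]
        with_ssm = rows["MKBUS + SSM + FES"]
        for name, before, after in zip(INDICATORS, without_ssm, with_ssm):
            # Round to the table's precision to avoid floating-point noise.
            gains[(dataset, name)] = round(after - before, 4)
    return gains


gains = ssm_gains(TABLE5)
improved = [pair for pair, delta in gains.items() if delta > 0]
worse = [pair for pair, delta in gains.items() if delta < 0]

print(f"{len(improved)} of {len(gains)} dataset-indicator pairs improved with SSM")
print("pairs that did not improve:", worse)
```

The gains are small in absolute terms, so the count alone does not say whether the differences are statistically meaningful; that question is taken up in Section 5.4.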

5.4 Statistical test results

To further enhance the reliability of the experiments, statistical tests were applied following Demšar [10]. First, the Friedman test [14] was used to test whether the methods differ significantly; when the null hypothesis was rejected, the Nemenyi test [32] was applied. In the Friedman test, 34 classification models were ranked on each evaluation indicator. These models comprised eight base classifiers before MKBUS was applied, eight base classifiers after MKBUS was applied, eight base classifiers after NearMiss was applied, eight base classifiers after CBUS was applied, the FES ensemble model with MKBUS applied, and the proposed model (i.e., the FES ensemble model with both MKBUS and SSM applied). Next, the average ranking of each method was calculated by averaging the rankings of its classification models. Finally, the average method rankings were used to calculate the Friedman test statistic. Table 6 summarizes the significance test results based on the average method rankings.

Table 6
Significance test results of the average method ranking using the Friedman test

Dataset	Method	AUC	BACC	G-mean	KS
Australian	Baseline	5	5	5	4
	NearMiss	6	6	6	6
	CBUS	4	4	4	5
	MKBUS	3	3	3	3
	MKBUS + FES	2	1	1	2
	MKBUS + SSM + FES	1	2	2	1
Japanese	Baseline	3	3	3	4
	NearMiss	6	6	6	6
	CBUS	5	5	5	5
	MKBUS	4	4	4	3
	MKBUS + FES	2	2	2	1
	MKBUS + SSM + FES	1	1	1	2
German	Baseline	4	4	5	5
	NearMiss	6	6	3	6
	CBUS	5	5	6	4
	MKBUS	3	3	4	3
	MKBUS + FES	2	2	2	2
	MKBUS + SSM + FES	1	1	1	1
AER	Baseline	3	4	4	4
	NearMiss	6	6	6	6
	CBUS	3	4	4	4
	MKBUS	5	3	3	3
	MKBUS + FES	2	2	2	2
	MKBUS + SSM + FES	1	1	1	1
Results		10.7705	11.7368	5.8421	14.1429
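
The text does not say which statistic the "Results" row of Table 6 reports. As a hedged reading, the sketch below computes the Friedman chi-square from the method rankings in Table 6 and then applies the Iman-Davenport correction discussed by Demšar [10], treating the four datasets as N = 4 blocks and the six methods as k = 6 treatments. Under these assumptions the corrected F values agree with the "Results" row to four decimals. The names `RANKS`, `friedman_chi2`, and `iman_davenport_f` are illustrative.

```python
# Method rankings copied from Table 6. Each inner list is one dataset
# (Australian, Japanese, German, AER); columns follow the table order:
# Baseline, NearMiss, CBUS, MKBUS, MKBUS + FES, MKBUS + SSM + FES.
RANKS = {
    "AUC": [[5, 6, 4, 3, 2, 1], [3, 6, 5, 4, 2, 1],
            [4, 6, 5, 3, 2, 1], [3, 6, 3, 5, 2, 1]],
    "BACC": [[5, 6, 4, 3, 1, 2], [3, 6, 5, 4, 2, 1],
             [4, 6, 5, 3, 2, 1], [4, 6, 4, 3, 2, 1]],
    "G-mean": [[5, 6, 4, 3, 1, 2], [3, 6, 5, 4, 2, 1],
               [5, 3, 6, 4, 2, 1], [4, 6, 4, 3, 2, 1]],
    "KS": [[4, 6, 5, 3, 2, 1], [4, 6, 5, 3, 1, 2],
           [5, 6, 4, 3, 2, 1], [4, 6, 4, 3, 2, 1]],
}


def friedman_chi2(ranks):
    """Friedman chi-square computed from an N x k matrix of ranks."""
    n = len(ranks)      # number of datasets (blocks)
    k = len(ranks[0])   # number of methods (treatments)
    rank_sums = [sum(row[j] for row in ranks) for j in range(k)]
    sum_sq = sum(r * r for r in rank_sums)
    return 12.0 / (n * k * (k + 1)) * sum_sq - 3.0 * n * (k + 1)


def iman_davenport_f(ranks):
    """Iman-Davenport F statistic derived from the Friedman chi-square."""
    n = len(ranks)
    k = len(ranks[0])
    chi2 = friedman_chi2(ranks)
    return (n - 1) * chi2 / (n * (k - 1) - chi2)


for indicator, ranks in RANKS.items():
    chi2 = friedman_chi2(ranks)
    f_stat = iman_davenport_f(ranks)
    print(f"{indicator}: chi2 = {chi2:.4f}, F = {f_stat:.4f}")
```

With k = 6 and N = 4, the F statistic has 5 and 15 degrees of freedom, for which the critical value at the 0.05 level is about 2.90. All four computed values exceed it, which is consistent with the rejection of the null hypothesis reported in the text.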

According to the Friedman test results, the null hypothesis was rejected. The Nemenyi test was then applied to compare the performance of the proposed method with that of the other methods. A graphical representation of the Nemenyi test results is shown in Fig. 5. In the figure, the critical distance (CD) is the minimum difference in mean ranking at which two methods are considered significantly different. The farther to the right a method lies on the coordinate axis, the better the performance of the classifier using that method. It can be seen from Fig. 5 that the proposed model, MKBUS + SSM + FES, was superior to all other models.

Fig. 5

Graphical representation of the Nemenyi test results.
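
For readers who want to see how a critical distance is obtained, the sketch below uses the formula given by Demšar [10], CD = q_alpha * sqrt(k(k + 1) / (6N)). The choice of k = 6 methods and N = 4 datasets is an assumption taken from the layout of Table 6; the text does not state the values used to draw Fig. 5. The q_alpha constants are the two-tailed Nemenyi critical values at the 0.05 level as tabulated by Demšar, and the function names are illustrative.

```python
import math

# Nemenyi critical values q_alpha at alpha = 0.05, indexed by the number of
# compared methods k (as tabulated by Demsar).
Q_ALPHA_005 = {2: 1.960, 3: 2.343, 4: 2.569, 5: 2.728, 6: 2.850}


def critical_distance(k, n, q_table=Q_ALPHA_005):
    """Critical distance for k methods compared over n datasets."""
    return q_table[k] * math.sqrt(k * (k + 1) / (6.0 * n))


def significantly_different(mean_rank_a, mean_rank_b, cd):
    """Two methods differ significantly if their mean ranks differ by at least cd."""
    return abs(mean_rank_a - mean_rank_b) >= cd


# Assumed setting: six methods, four datasets (see Table 6).
cd = critical_distance(k=6, n=4)
print(f"CD = {cd:.4f}")

# Mean AUC ranks over the four datasets in Table 6:
# MKBUS + SSM + FES is ranked 1 on every dataset, NearMiss is ranked 6.
print(significantly_different(1.0, 6.0, cd))
```

Because the critical distance shrinks with the square root of N, a comparison over only four datasets needs large rank gaps to reach significance, so the result of such a test depends strongly on how many datasets and methods enter it.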

5.5 Performance comparison between the proposed model and benchmark models

To further verify the effectiveness of the proposed model, several existing ensemble models were used as benchmarks on the three standard credit scoring datasets, that is, the Australian, Japanese, and German datasets. The selected benchmark ensemble models were those proposed by Jadhav et al. [21], Rajaleximi et al. [36], Lan et al. [25], and Nanni & Lumini [31]. Because the source code of these benchmark ensemble models was not publicly available, and most of the references reported only the AUC on some of the datasets, the AUC values reported in the references were cited directly for comparison with those of the proposed model. For the Australian and German datasets, the proposed model was compared with the models of Jadhav et al. [21], Rajaleximi et al. [36], and Lan et al. [25], because the references applied these models to these two datasets. For the Japanese dataset, the proposed model was compared with the models of Rajaleximi et al. [36] and Nanni & Lumini [31] for the same reason. For the AER dataset, no comparison was made because none of the references applied the benchmark ensemble models to it.

The comparison results are shown in Fig. 6, with the models differentiated by filling pattern. On the Australian, Japanese, and German datasets, the proposed model outperformed the benchmark models in terms of AUC. For the Australian dataset, the proposed model outperformed the runner-up, the benchmark ensemble model proposed by Rajaleximi et al. [36], with a 2.25% improvement in the AUC value. For the Japanese dataset, the proposed model outperformed the runner-up, the benchmark ensemble model proposed by Nanni & Lumini [31], with a 0.10% improvement in the AUC value. For the German dataset, the proposed model outperformed the runner-up, the benchmark ensemble model proposed by Jadhav et al. [21], with a 0.79% improvement in the AUC value.

Fig. 6

Comparison of AUC values of the proposed model and benchmark models on three datasets.

6 Conclusion & future work

Credit scoring plays a crucial role in financial business development. In this study, a novel multi-stage ensemble model that integrates the MKBUS, SSM, and FES methods was proposed to achieve a superior classification performance for credit scoring. The proposed ensemble model was verified on four credit scoring datasets based on four comprehensive evaluation indicators. The experimental results demonstrated the superior performance of the proposed ensemble model over the four benchmark ensemble models.

However, this study has some limitations. For example, to reduce the computational complexity, the hyperparameters of the base classifiers were not sufficiently optimized. This can be performed in future studies. Additionally, various sampling algorithms can be combined through weighting to address the imbalanced data problem more comprehensively. Finally, the multi-label classification problem will also be considered in future studies, which can bring the research closer to the real world.

Supporting information

The raw data for this paper have been uploaded to Figshare (https://doi.org/10.6084/m9.figshare.12928418.v2). The raw data are divided into four parts: the Australian credit dataset (Australian.csv), the German credit dataset (German.csv), the Japanese credit dataset (JapanData.csv), and the AER dataset (AER.csv).

Conflicts of interest

The authors declare that there is no conflict of interest regarding the publication of this article.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (No. 51875503, No. 51975512) and the Zhejiang Natural Science Foundation of China (No. LZ20E050001).

References

1. Asuncion and Newman, UCI Machine Learning Repository, Irvine, CA: School of Information and Computer Science, University of California (2007). http://www.ics.uci.edu/mlearn/MLRepository.html.
2. Ayad H.G. and Kamel M.S., On voting-based consensus of cluster ensembles, Pattern Recognition 43(5) (2010), 1943–1953.
3. Breiman, Bagging predictors, Machine Learning 24 (1996), 123–140.
4. Breiman, Random forests, Machine Learning 45 (2001), 5–32.
5. Breiman, Friedman, Stone C.J. and Olshen R.A., Classification and Regression Trees, CRC Press (1984).
6. Brodersen K.H., Ong C.S., Stephan K.E. and Buhmann J.M., The balanced accuracy and its posterior distribution, In Proceedings of the 20th International Conference on Pattern Recognition, Istanbul, Turkey, (2010), 3121–3124.
7. Brown and Mues, An experimental comparison of classification algorithms for imbalanced credit scoring data sets, Expert Systems with Applications 39(3) (2012), 3446–3453.
8. Chen T.Q. and Guestrin, Xgboost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, USA, (2016), 785–794.
9. Davis R.H., Edelman D.B. and Gammerman A.J., Machine-learning algorithms for credit-card applications, IMA Journal of Management Mathematics 4(1) (1992), 43–51.
10. Demšar, Statistical comparisons of classifiers over multiple data sets, Journal of Machine Learning Research (2006), 1–30.
11. Fred, Finding consistent clusters in data partitions. In Multiple Classifier Systems, Cambridge, United Kingdom, (2001), 309–318.
12. Freund and Schapire R.E., Experiments with a new boosting algorithm. In Proceedings of the 13th International Conference on Machine Learning, Bari, Italy, (1996), 148–156.
13. Friedman J.H., Greedy function approximation: A gradient boosting machine, Annals of Statistics 29(5) (2001), 1189–1232.
14. Friedman, A comparison of alternative tests of significance for the problem of m rankings, The Annals of Mathematical Statistics 11(1) (1940), 86–92.
15. Galar, Fernandez, Barrenechea, Bustince and Herrera, A review on ensembles for the class imbalance problem: Bagging-, boosting-, and hybrid-based approaches, IEEE Transactions on Systems Man and Cybernetics, Part C 42(4) (2011), 463–484.
16. Greene W.H., Econometric Analysis, Pearson Education India (2003).
17. Guo H.X., Li Y.J., Shang, Gu M.Y., Huang Y.Y. and Bing, Learning from class-imbalanced data: Review of methods and applications, Expert Systems with Applications 73 (2017), 220–239.
18. Guyon, Weston, Barnhill and Vapnik, Gene selection for cancer classification using support vector machines, Machine Learning 46 (2002), 389–422.
19. Hanley J.A. and McNeil B.J., The meaning and use of the area under a receiver operating characteristic (ROC) curve, Radiology 143(1) (1982), 29–36.
20. Hung and Chen J.H., A selective ensemble based on expected probabilities for bankruptcy prediction, Expert Systems with Applications 36(3) (2009), 5297–5303.
21. Jadhav, He H.M. and Jenkins, Information gain directed genetic algorithm wrapper feature selection for credit rating, Applied Soft Computing 69 (2018), 541–553.
22. Jain A.K. and Dubes R.C., Algorithms for Clustering Data, Prentice Hall (1988).
23. G.L., Meng, Finley, Wang T.F., Chen, Ma W.D., Ye Q.W. and Liu T.Y., LightGBM: A highly efficient gradient boosting decision tree. In Proceedings of Annual 2017 Conference on Neural Information Processing Systems, California, USA, (2017), 3146–3154.
24. Kubat and Matwin, Addressing the curse of imbalanced training data sets: One-sided selection. In Proceedings of the 4th International Conference on Machine Learning, Nashville, USA, (1997), 170–186.
25. Lan Q.J., Xu X.J., Ma H.J. and Li, Multivariable data imputation for the analysis of incomplete credit data, Expert Systems with Applications 141 (2020), 112926.
26. Lessman, Baesens, Seow H.V. and Thomas L.C., Benchmarking state-of-the-art classification algorithms for credit scoring: An update of research, European Journal of Operational Research 247(1) (2015), 124–136.
27. and Yuan B.Z., 2D-LDA: A statistical linear discriminant analysis for image matrix, Pattern Recognition Letters 26(5) (2005), 527–532.
28. Y.J., Wei J.S., Kang and Wu Z.Y., An efficient noise-filtered ensemble model for customer churn analysis in aviation industry, Journal of Intelligent & Fuzzy Systems 37(2) (2019), 2575–2585.
29. Lilliefors H.W., On the Kolmogorov-Smirnov test for normality with mean and variance unknown, Journal of the American Statistical Association 62(318) (1967), 399–402.
30. Lin W.C., Tsai C.F., Hu Y.H. and Jhang J.S., Clustering-based undersampling in class-imbalanced data, Information Sciences 409 (2017), 17–26.
31. Nanni and Lumini, An experimental comparison of ensemble of classifiers for bankruptcy prediction and credit scoring, Expert Systems with Applications 36(2) (2009), 3028–3033.
32. Nemenyi P.B., Distribution Free Multiple Comparisons. PhD Thesis, Princeton University (1963).
33. Ofek, Rokach, Stern and Shabtai, Fast-CBUS: A fast clustering-based undersampling method for addressing the class imbalance problem, Neurocomputing 243 (2017), 88–102.
34. Orriols-Puig and Bernadó-Mansilla, Evolutionary rule-based systems for imbalanced data sets, Soft Computing 13 (2009), 213–225.
35. Pal S.K. and Mitra, Multilayer perceptron, fuzzy sets, and classification, IEEE Transactions on Neural Networks 3(5) (1992), 683–697.
36. Rajaleximi, Ahmed and Alenezi, Feature selection using optimized multiple rank score model for credit scoring, International Journal of Intelligent Engineering and Systems 12(2) (2019), 74–84.
37. Safavian S.R. and Landgrebe, A survey of decision tree classifier methodology, IEEE Transactions on Systems Man and Cybernetics 21(3) (1991), 660–674.
38. Schapire R.E., The strength of weak learnability, Machine Learning 5 (1990), 197–227.
39. Sun, Lang, Fujita and Li, Imbalanced enterprise credit evaluation with DTE-SBD: Decision tree ensemble based on SMOTE and bagging with differentiated sampling rates, Information Sciences 425 (2018), 76–91.
40. Thomas L.C., A survey of credit and behavioural scoring: Forecasting financial risk of lending to consumers, International Journal of Forecasting 16(2) (2000), 149–172.
41. Tripathi, Edla D.R. and Cheruku, Hybrid credit scoring model using neighborhood rough set and multi-layer ensemble classification, Journal of Intelligent & Fuzzy Systems 34(3) (2018), 1543–1549.
42. Tsai C.F., Hsu Y.F. and Yen D.C., A comparative study of classifier ensembles for bankruptcy prediction, Applied Soft Computing 24 (2014), 977–984.
43. Wang, Hao J.X., Ma and Jiang H.B., A comparative assessment of ensemble learning for credit scoring, Expert Systems with Applications 38(1) (2011), 223–230.
44. Wei, Yang D.Q., Zhang W.Y. and Zhang, A novel noise-adapted two-layer ensemble model for credit scoring based on backflow learning, IEEE Access 7 (2019), 99217–99230.
45. Wolpert D.H., Stacked generalization, Neural Networks 5(2) (1992), 241–259.
46. Wolpert D.H. and Macready W.G., No free lunch theorems for optimization, IEEE Transactions on Evolutionary Computation 1(1) (1997), 67–82.
47. Xia Y.F., Liu C.Z., Da B.W. and Xie F.M., A novel heterogeneous ensemble credit scoring model based on bstacking approach, Expert Systems with Applications 93 (2018), 182–199.
48. Zhang J.P. and Mani, KNN approach to unbalanced data distributions: A case study involving information extraction. In Proceeding of Workshop on Learning from Imbalanced Datasets, Washington DC, USA, (2003), 1–7.
49. Zhu, Guo and Xue J.H., Adjusting the imbalance ratio by the dimensionality of imbalanced data, Pattern Recognition Letters 133 (2020), 217–223.
50. Zhu X.H., Ni Z.W., Zhang G.R., Jin F.F., Cheng M.Y. and Li J.M., Combining weak-link coevolution binary artificial fish swarm algorithm and complementarity measure for ensemble pruning, Journal of Intelligent & Fuzzy Systems 35(2) (2018), 1431–1444.
