A hybrid model with novel feature selection method and enhanced voting method for credit scoring

Abstract

With the in-depth application of artificial intelligence technology in the financial field, credit scoring models constructed by machine learning algorithms have become mainstream. However, the high-dimensional and complex attribute features of the borrower pose challenges to the predictive competence of the model. This paper proposes a hybrid model with a novel feature selection method and an enhanced voting method for credit scoring. First, a novel feature selection combined method based on a genetic algorithm (FSCM-GA) is proposed, in which different classifiers are used to select features in combination with a genetic algorithm and combine them to generate an optimal feature subset. Furthermore, an enhanced voting method (EVM) is proposed to integrate classifiers, with the aim of improving the classification results in which the prediction probability values are close to the threshold. Finally, the predictive competence of the proposed model was validated on three public datasets and five evaluation metrics (accuracy, AUC, F-score, Log loss and Brier score). The comparative experiment and significance test results confirmed the good performance and robustness of the proposed model.

Keywords

Credit scoring hybrid model feature selection machine learning ensemble learning

1 Introduction

The gradual maturity of artificial intelligence has promoted the development of Internet finance. Credit defaults have always existed in the financial field, and as the main type of financial risk, have caused serious economic losses. Hence, credit risk management has become the core component of the development strategy of financial institutions. Based on the above background, artificial intelligence technology and machine learning algorithms were used to construct a credit scoring model with superior predictive power, which can effectively mitigate risk and reduce economic losses for financial institutions. Previous studies [1 –3] have demonstrated the contribution of artificial intelligence toward the construction of a credit evaluation model. To a large extent, the predictive performance of a credit evaluation model depends on the quality of the datasets, and is reflected in its attribute characteristics.

In today’s era of big data, the datasets trained by machine learning usually have multidimensional features, including many non-functional and redundant features. These features reduce the learning capacity of the machine learning model and thus fail to achieve good predictive performance. Therefore, the exfoliation of an effective feature selection method is essential for a credit evaluation model. In recent years, this has been the focus of many researchers. Some researchers [4 –8] have made specific contributions to the exploration of feature selection methods. However, this subject is still worthy of further in-depth research. In recent years, the ensemble method has been generally considered an advanced method to improve prediction performance, and it can be divided into two types, homogeneous ensemble and heterogeneous ensemble [9]. Adaptive boosting (AdaBoost) [10], random forest (RF) [11], gradient boosting decision tree (GBDT) [11], XGBoost [12], and a light gradient boosting machine (LightGBM) [13] are typical homogeneous ensemble models. The superior performance of heterogeneous ensemble combining different types of classifiers has been gradually proven [14, 15]. The most effective approach to ensemble modeling has yet to be determined, and this is still being explored by researchers. Furthermore, the predictive capability of base classifiers immediately affects the forecast results of the ensemble model. Therefore, exploring an effective classifier selection method is important for enhancing the predictive capability of ensemble models.

Because the economic loss and impact caused by the default risk on the financial industry are significant, constructing a credit scoring model by applying artificial intelligence technology plays a pivotal role in the development of this industry. Herein, a hybrid model is proposed with a novel feature selection method and an enhanced voting method to acquire a conspicuous improvement in credit scoring. This hybrid model consists of three stages: classifier selection stage, feature selection stage, and classifier ensemble stage. First, a feature selection combined method based on a genetic algorithm (FSCM-GA) is proposed to acquire the optimal feature subset. It uses the optimal classifiers obtained through the classifier selection stage as the evaluation function of the GA to acquire their respective optimal feature subsets and combines these subsets as the optimal feature subset of the prediction model. Furthermore, an enhanced voting method (EVM) is proposed that improves the classification of the average prediction probability within the interval set and is applied to integrate the classifiers to output the final prediction results. Finally, the proposed model is employed to test and verify its predictive competence on credit datasets.

The organizational structure of this research is as follows. Section 2 reviews the research and conclusions associated with the proposed model. The proposed hybrid model is described in Section 3. Section 4 presents a review of the experimental design, which mainly includes dataset description and preprocessing, selected evaluation indicators, and parameter settings. In Section 5, the experimental results and the statistical significance test of the proposed model are discussed. Section 6 elaborates the conclusions and future work.

2 Related work

Our proposed model involves three aspects: classifier selection, feature selection, and ensemble modeling. In recent research, the focus has been on how to combine them to construct a more effective credit scoring model. This is consistent with our own research. Previous researches are introduced in this section.

2.1 Genetic algorithm

Nature-inspired algorithms are constructed based on natural laws and human experience, which can provide a feasible solution to the problem at an acceptable cost. In general, the degree of deviation between the feasible solution and optimal solution cannot be controlled in advance [16]. Nature-inspired algorithms are designed to solve optimization problems by the following steps:

Randomly generate the initial solution.

Calculate the target function value based on the initial solution.

Based on the information obtained, the population (feasible solution set) is constantly updated through selection, crossover, and other operators.

Repeat until the iteration ends.

Nature-inspired algorithms are widely employed to solve nonlinear optimization problems, such as studies [17 –19] and [20]. As a typical nature-inspired algorithm, the genetic algorithm [21] is a computational model of the biological evolution process that simulates the natural selection and genetic mechanism of Darwinian biological evolution. Each chromosome represents a solution as an individual, and all individuals constitute a population as the set of possible solutions. First, the population is initialized to randomly generate some solutions. A specific evaluation function is then selected to assess the fitness of each individual, and excellent individuals are then selected for crossover and mutation steps to update the population till the end of the iteration. Finally, the GA outputs the optimal individual as the optimal solution. In studies [22 –24] and [25], the authors demonstrated the powerful search capabilities of the GA.

However, the programming of the GA is complex, and its search speed is relatively slow. Furthermore, the quality of the solution obtained by the GA is directly affected by the selection, crossover, and mutation operator parameter settings.

The GA is widely applied to combine classification models and improve the prediction accuracy. In [26], the authors combined a GA with an ensemble classifier to form a new hybrid algorithm application with three different benchmark datasets. Th authors in [27] applied a genetic algorithm for feature selection to verify its performance. In [7], the authors proposed a hybrid GA with neural networks, which was applied to feature selection. The authors in [28] proposed a hybrid structure that fuses statistical theory and genetic algorithms for the selection of critical ratios. The performance of the GA for the combinatorial optimization of classifiers was proved in [29]. In addition, extended genetic algorithms have gradually attracted the attention of researchers. The authors in [8] proposed a hybrid method that combines the filtering method and multi-population genetic algorithm, and is applied to find the optimal feature subset. An enhanced genetic algorithm was proposed in [30] and employed for classifier selection and feature selection. In this study, the GA is employed to form a novel feature selection combined method to select the features.

2.2 Feature selection methods

Feature selection refers to the process of selecting some of the most effective features from the original features and reduce the feature dimensions of the raw dataset. It is indispensable in data preprocessing to increase the predictive precision of the classification model [31]. Through experience and generalization, feature selection methods include three types, i.e., filtering approaches, wrapper approaches, and embedded approaches.

Filter approaches mainly include the variance selection method, correlation coefficient method, and mutual information. The wrapper approach considers the performance of the classifier as an evaluation criterion for selecting the features. Therefore, such a method must determine a base classifier to select the features. Embedded methods use machine learning models for feature selection, which are automatically carried out during the learner training process. Embedded approaches have the merits of superior speed and efficiency. However, profound prior knowledge is required to adjust the model. To obtain a better feature subset, researchers have attempted to create different combinations of the above three methods. In [4], the authors proposed four feature selection methods combined with an SVM, involving the traditional LDA, DT, rough set, and F-score methods as feature preprocessing steps. The experimental results demonstrate that the hybrid feature selection method is robust and effective in determining the optimal feature subset. It is also proved that the effective feature selection method can enhance the predictive competence of the classifier. The authors in [32] proposed a fast filter approach, which demonstrates the effectiveness of the feature selection approach in terms of dimensionality reduction, eliminating irrelevant data, improving the learning accuracy, and improving the comprehensibility of the results. Five different feature selection methods combined with a new voting method were used for feature selection in [33]. In [34], the authors discussed the effectiveness of feature selection approaches combined with six different models for application to bankruptcy and credit datasets, and the experimental results proved that feature selection does not always enhance the predictive accuracy of the model. However, the GA-based feature selection approach can enhance the capability of the prediction model. The authors in [30] proposed an enhanced multi-population niche GA and initialized the population with prior knowledge obtained by filtering methods for feature selection. Currently, the development of feature selection methods that combine multiple methods is a research hotspot. In this study, FSCM-GA is proposed, which uses the predictive performance of different classifiers as the fitness function of the GA to obtain their respective feature subsets and combines them to form the optimal feature subset.

2.3 Ensemble methods

It is expected that a classifier will be trained with good learning capability in machine learning algorithms. The ensemble model has been shown to have superior learning results and robustness compared to a single weak classifier in previous studies. In addition, it can be divided into two types: homogeneous ensemble and heterogeneous ensemble. A homogeneous ensemble contains only the same type of base classifiers. Heterogeneous ensembles train different types of classifiers and synthesize the respective output results of the base classifiers, compensate for their deficiencies, and obtain a better result [15]. Bagging [35], boosting [36], stacking [37] and voting are widely applied classifier ensemble methods. Herein, an enhanced voting method is proposed to integrate the classifiers.

The idea behind voting is to combine conceptually different machine learning classifiers and use majority voting (hard voting) or the average predicted probabilities (soft voting) to predict the class labels. Such a classifier can be useful for a set of equally well-performing models to balance out their individual weaknesses. In the hard voting ensemble model, for the same data, the output results of the ensemble model are the prediction results of the majority of the base classifiers. In contrast to hard voting, soft voting returns the class label by calculating the average prediction probability of the base classifiers. Specific weights are assigned to each classifier using the weight parameter. When weights are provided, the predicted class probabilities for each classifier are collected, multiplied by the classifier weight, and averaged. The final prediction result is then derived from the class label with the highest average probability.

In recent years, the effectiveness of these different ensemble methods has been demonstrated by many researchers. The authors in [38] proved the superior performance of bagging and boosting, and explained why ensembles can frequently outperform any single classifier. In [39], the authors proved the superior performance of voting methods for multiple classification problems. In [40], the authors compared the performance of these integration approaches with the classifier itself. With the support of statistical analysis, the ensemble technique proved to be outstanding in solving the classification problem. In [14], the authors also proved the superior performance of a heterogeneous ensemble model compared with a single classifier. The above studies confirm the superior capability of the ensemble models. However, the current study mainly focuses on the selection of the most appropriate ensemble approach based on a given combination of base classifiers. Therefore, more effort needs to be put into research on the optimal combination approach.

In this study, the EVM is proposed to integrate the classifiers and output the final result. Section 3 describes the EVM in detail.

3 Modeling

The proposed hybrid model with a novel feature selection method and enhanced voting method is introduced in this section, which includes three aspects: classifier selection, feature selection, and classifier ensemble. The detailed process is as follows. First, a candidate classifier repository is constructed. Then, the data are preprocessed and divided into training and test data, and candidate classifiers in the classifier repository are sorted by calculating the average accuracy of their 5-fold cross-validation in the training data. The top-performing classifiers form an optimal classifier subset. Subsequently, the proposed FSCM-GA is employed to acquire the optimal feature subset based on the classifiers obtained in the previous stage. Finally, the two subsets obtained above are used as the base classifier and the input data of the proposed EVM to output the final prediction result. The framework of the proposed hybrid model is presented in Fig. 1.

Fig. 1

Framework of the proposed hybrid model.

3.1 Classifier selection

Previous studies have demonstrated the superior competence of ensemble models. In addition, the prediction competence of the base classifiers directly affects the prediction results of the ensemble model. Hence, the core task of the classifier ensemble is to find base classifiers with a high predictive competence. In this study, accuracy [41] was employed to assess the predictive performance of the classifiers, considering that it is an accepted performance metric in machine learning. Furthermore, a five-fold cross-validation were selected to evaluate the classification models because such a validation is widely recognized as a more appropriate way to verify the robustness of the machine learning algorithms [42].

Initially, a classifier repository is constructed, which contains several commonly used machine learning models. The predictive competence of the candidate classifiers in the classifier repository is evaluated by calculating its average accuracy based on a five-fold cross-validation. A higher accuracy indicates a better candidate classifier with superior predictive performance. These candidate classifiers are ranked according to their accuracy and the optimal classifier subset is composed of the top candidate classifiers.

3.2 FSCM-GA

Real credit datasets are usually multi-dimensional and complex, which has significant impact on the construction of the model. Hence, the selection of the most helpful features and a reduction of the data dimension is a vital issue that has consistently been the subject of research.

The GA is a typical heuristic algorithm widely used in machine learning. This is a general algorithm for resolving search optimization problems. Compared with some conventional optimization algorithms, the GA can usually achieve better optimization results at a faster speed. In previous studies, the GA has also been shown to perform well in a data dimensionality reduction.

The feature selection procedure of the GA is divided into the following parts: feature encoding, initial population, fitness evaluation, selection, crossover, mutation, and optimal individual decoding. Feature coding refers to the generation of a chromosome with a matching length based on the feature dimension. In general, binary encoding was applied. The initial population consisted of randomly generated chromosomes. The fitness function is applied to assess each individual, for which the fitness value represents the degree of excellence. In this study, the average accuracy of the 5-fold cross-validation in the training data was applied to evaluate each individual. Because the discrepancy in fitness between the majority of the individuals is insignificant, more dominant individuals must be retained in the selection operation to find the optimal individual. Hence, a rank selection operator is employed in the FSCM-GA to select superior individuals. To increase the probability of obtaining a superior individual during each iteration, a crossover and mutation are indispensable steps. A crossover occurs when two pairs of chromosomes exchange genes in a way that creates two new individuals. In this study, a one-point crossover is selected as the crossover operator, which refers to the random setting of a crossover point in the coding string of the individual, and the point then exchanges a portion of the chromosomes of two paired individuals. A mutation refers to the fact that some genes in the coding string of an individual chromosome will be replaced by other alleles of the gene with a certain probability, thus generating a new individual. When the GA terminates, the optimal individual gene is decoded and output as the optimal solution for this operation.

The proposed FSCM-GA is a novel feature selection combined method based on the GA, the main process of which is as follows: The classifiers in the optimal classifier subset are, in turn, sent to the GA to evaluate individual fitness and obtain their respective feature subsets. The optimal feature subset is then obtained by combining the above feature subsets. The combination process is as follows: Those features that are selected by any classifier are selected. Namely, features not selected by any classifier are deleted. The framework of the FSCM-GA is presented in Fig. 2.

Fig. 2

Framework of FSCM-GA.

The average accuracy of the classifier obtained by a 5-fold cross-validation on the training data was used as the evaluation function, of which 80%of the training data were used as the training set to train the classifier and the remaining 20%were used as the validation set to verify the performance. Therefore, the optimal feature subset obtained by combining each feature subset has the advantages of superior robustness and efficiency.

3.3 EVM

To achieve the effectiveness of a prediction, an effective ensemble approach is essential. The voting method has been proven to enhance the performance [39], and hard or majority voting is based on a binary decision rule that selects a class with a major vote. Soft voting outputs the results by combining the predicted probabilities of the classifiers. Therefore, the voting ensemble method has the merit of complementing the shortcomings of each base classifier to obtain a satisfactory performance. In this study, the focus was on disposing the classifications where the predicted probabilities of classifiers are within a set interval. Hence, the EVM was proposed to obtain a superior predictive performance for a binary classification.

Compared with the normal voting methods, the proposed EVM pays more attention to the classification of the average prediction probabilities of the base classifiers within the interval set and enhances the classification results within the interval, where the calculation formula for the minimum value of the interval is given by Equation (1), and the maximum value was calculated using Equation (2).

M i n i m u m v a l u e = \frac{⌊ \frac{N}{2} ⌋}{N}

(1)

$M a x i m u m v a l u e = \frac{⌊ \frac{N}{2} ⌋}{N}$ (2) where the symbols ⌊· ⌋ and ⌈· ⌉ represent “floor” and “ceiling” operations, respectively, and N represents the number of base classifiers, which means that the interval varies with the number of base classifiers.

The detailed procedure for the EVM is described below. For each sample in the dataset, if the average prediction probability of the base classifiers is within the interval set as mentioned above, the classification result is then determined based on the majority prediction results of the base classifiers. By contrast, when the average prediction probability of the base classifiers is not within the interval and the value is greater than 0.5, the classification result of the EVM is 1; otherwise, it is 0. The framework of the EVM is shown in Fig. 3.

Fig. 3

Framework of the EVM.

4 Experimental design

The experimental design of the proposed hybrid model is presented in this section, and its prediction competence and robustness are evaluated. The following four parts are described in detail: dataset description, performance evaluation indicators, data preprocessing, and experiment parameter settings.

4.1 Dataset description

In this study, three public credit datasets from the UCI Machine Learning Repository [43] were adopted to validate the prediction competence of the proposed credit scoring model, including the Australian, German, and Japanese datasets. The detailed introduction of the datasets is shown in Table 1.

Table 1
Description of the datasets

Dataset Sample size Good samples Bad samples Total features (numerical /categorical)

Australian 690 307 383 14 (6/8)

German 1000 700 300 20 (7/13)

Japanese 690 307 383 15 (6/9)

Dataset	Sample size	Good samples	Bad samples	Total features (numerical /categorical)
Australian	690	307	383	14 (6/8)
German	1000	700	300	20 (7/13)
Japanese	690	307	383	15 (6/9)

The Australian dataset involves a total of 690 samples, of which the number of samples labeled as non-defaulting (good samples) is 307, and the number of defaulting samples (bad samples) is 383. There are14 attribute feature dimensions for each sample, excluding class labels, 8 of which are numerical features and 6 of which are categorical features. The German dataset involves a total of 1000 samples, of which 700 samples are labeled as non-defaulting (good samples), and 300 are defaulting samples (bad samples). There are 20 attribute feature dimensions for each sample, excluding the class labels, 7 of which are numerical features and 13 of which are categorical features. The Japanese dataset involves the same number of samples as the Australian dataset, except that 383 samples of the total samples are good samples and 307 samples are bad samples. There are 15 attribute feature dimensions for each sample, excluding the class labels, 6 of which are numerical features and 9 are categorical features.

4.2 Performance metrics

In this research, five recognized evaluation metrics in machine learning were used to verify the prediction competence of the credit scoring model proposed, including the accuracy, AUC, F-score, Log loss [44], and Brier score [45]. The accuracy, AUC, and F-score can be calculated using TP, TN, FP, and FN values.

Where TP represents the number of actual positive values predicted by the model are positive, TN represents the number of actual positive values predicted by the model are negative, FP represents the number of actual negative values that the model predicts are positive, and FN represents the number of actual negative values predicted by the model.

Accuracy is an important metric for evaluating the predictive capability of the classification models, and can be represented by Equation (3). The higher the accuracy, the better the prediction capability of the classification model. $Accuracy = \frac{TP + TN}{TP + FP + TN + FN}$ (3)

AUC is widely applied to evaluate the classification problems, which can effectively demonstrate the performance of the classifiers. The area enclosed by the ROC curve and the coordinate axis is defined as the AUC value. The higher the AUC value, the better the prediction competence of the classification model.

The F-score can be calculated using Equation (6). For a more intuitive representation of the F-score, the defined formulas for Precision and Recall are shown in Equation (4) and Equation (5). Log loss is a significant classification measurement based on probability. As defined in Equation (7), n represents the number of samples; y_i and p_i represent the actual value and the probability prediction, respectively; and y_i ∈ {0, 1}. $Precision = \frac{TP}{TP + FP}$ (4) $Recall = \frac{TP}{TP + FP}$ (5) $F - score = \frac{2 \times Precision \times Recall}{Precision + Recall}$ (6)

$\begin{matrix} Log loss \\ = \frac{- 1}{n} \sum_{i = 1}^{n} (y_{i} \times log (p_{i})) + (1 - y_{i}) \times \log (1 - p_{i})) \end{matrix}$ (7)

Brier score has also been frequently used to evaluate machine learning models in recent years. The Brier score that measures the mean squared deviation between the predicted probability and the actual label. The lower the Brier score is, the better the performance of the model. This is defined in Equation (8) $Brier score = \frac{1}{n} \sum_{i = 1}^{n} {(p_{i} - y_{i})}^{2}$ (8) where n is the number of samples, p_i represents the probability prediction, and y_i represents the actual label value of sample i.

4.3 Data preprocessing

Before training the proposed credit scoring model, data preprocessing is indispensable for improving the interpretability of the data. In general, there are often missing values for the acquired datasets. Therefore, to remove the adverse impacts of the missing values, the average value is used to fill in the numerical features, as well as the mode that has yet to appear to fill in the categorical features. Numeric features were standardized by removing the mean and scaling to the unit variance. Subsequently, one-hot encoding was employed to extend the categorical features. After the above process, to improve the learning effect of the proposed hybrid model, a feature correlation analysis was used to delete one of the two explanatory features with a correlation of greater than 0.97.

4.4 Experiment parameter settings

The original dataset was divided into training data and test data according to the standard ratio of 0.8 to 0.2. In addition, the training data were further divided into training set and validation set through a five-fold cross-validation, of which 80%of the training data were used as the training set and the remaining 20%were used as the validation set. The authors in [46] proved that a cross-validation might lead to a higher average performance than a traditional holdout validation, and can also reduce the risk of a poor performance. Hence, the original dataset is divided into training set, validation set, and test set, with a proportion of 0.64:0.16:0.20. This is the data division approach commonly used by researchers, such as in [30, 47], and [29].

The classifier repository is composed of LR, SVM, DT, KNN, XGBoost, GBDT, RF, and LightGBM, and default parameters are used. In this research, the number of classifiers selected during the classifier selection stage was set to four. In the proposed FSCM-GA, the number of iterations was set to 10, and the size of the population was set to 10. The crossover probability and mutation probability are set to 0.8 and 0.2 respectively. In the proposed EVM, because four classifiers are selected in this study, the minimum and maximum values of the interval are calculated as 0.25 and 0.75, respectively. The classifiers LR, SVM, DT, KNN, XGBoost, GBDT, RF, and LightGBM are imported from the specified Python package. In the next section, the experimental results and analysis are discussed.

5 Experimental results and comparative analysis

The candidate classifiers for the classifier repository are evaluated in this section, for which the high-performing classifiers in each run are selected to determine the optimal feature subset. Three public credit datasets and five performance indicators are applied to validate the prediction competence and effectiveness of the candidate baseline classifiers as well as the proposed credit scoring model. To decrease the effect of randomness on the experiments, each experiment was run 20 times and the average results were evaluated. All experiments were conducted on a PC with a Microsoft Windows 10 operating system using Python Version 3.8.

5.1 Baseline classifier results

To demonstrate the effectiveness of the proposed hybrid model, the candidate classifiers were evaluated on three datasets and five performance indicators, and their performance was used as the baseline results. The following eight classifiers, LR, SVM, DT, KNN, XGBoost, GBDT, RF, and LightGBM were tested as baseline classifiers, as listed in Table 2. The classifiers with the highest performance are bolded for each evaluation indicator.

Table 2
Performance of baseline classifiers in different datasets

Dataset Base classifiers Accuracy AUC F-score Log loss Brier score

Australian LR 0.8551 0.9221 0.8446 0.3591 0.1073

SVM 0.8471 0.9085 0.8392 0.3832 0.1174

DT 0.7946 0.7947 0.7632 7.0956 0.2054

KNN 0.8283 0.8864 0.8469 1.5828 0.1304

XGBoost 0.8464 0.9285 0.8323 0.3481 0.1042

GBDT 0.8449 0.9275 0.8310 0.3523 0.1059

RF 0.8587 0.9298 0.8448 0.3888 0.1035

LightGBM 0.8522 0.9242 0.8352 0.4224 0.1131

German LR 0.7480 0.7705 0.8278 0.5240 0.1727

SVM 0.7480 0.7731 0.8310 0.5140 0.1708

DT 0.6815 0.6223 0.7720 10.8194 0.3185

KNN 0.7180 0.6932 0.8116 1.8166 0.1983

XGBoost 0.7470 0.7705 0.8274 0.5251 0.1735

GBDT 0.7415 0.7669 0.8228 0.5321 0.1760

RF 0.7480 0.7732 0.8306 0.5132 0.1711

LightGBM 0.7410 0.7587 0.8204 0.6333 0.1904

Japanese LR 0.8399 0.9109 0.8529 0.4017 0.1165

SVM 0.8428 0.9095 0.8547 0.3813 0.1172

DT 0.8087 0.8071 0.7850 6.6075 0.1913

KNN 0.8217 0.8766 0.8312 1.7361 0.1379

XGBoost 0.8551 0.9264 0.8667 0.3558 0.1049

GBDT 0.8500 0.9218 0.8612 0.3648 0.1062

RF 0.8652 0.9260 0.8781 0.3825 0.1079

LightGBM 0.8630 0.9254 0.8749 0.4245 0.1090

Dataset	Base classifiers	Accuracy	AUC	F-score	Log loss	Brier score
Australian	LR	0.8551	0.9221	0.8446	0.3591	0.1073
	SVM	0.8471	0.9085	0.8392	0.3832	0.1174
	DT	0.7946	0.7947	0.7632	7.0956	0.2054
	KNN	0.8283	0.8864	0.8469	1.5828	0.1304
	XGBoost	0.8464	0.9285	0.8323	0.3481	0.1042
	GBDT	0.8449	0.9275	0.8310	0.3523	0.1059
	RF	0.8587	0.9298	0.8448	0.3888	0.1035
	LightGBM	0.8522	0.9242	0.8352	0.4224	0.1131
German	LR	0.7480	0.7705	0.8278	0.5240	0.1727
	SVM	0.7480	0.7731	0.8310	0.5140	0.1708
	DT	0.6815	0.6223	0.7720	10.8194	0.3185
	KNN	0.7180	0.6932	0.8116	1.8166	0.1983
	XGBoost	0.7470	0.7705	0.8274	0.5251	0.1735
	GBDT	0.7415	0.7669	0.8228	0.5321	0.1760
	RF	0.7480	0.7732	0.8306	0.5132	0.1711
	LightGBM	0.7410	0.7587	0.8204	0.6333	0.1904
Japanese	LR	0.8399	0.9109	0.8529	0.4017	0.1165
	SVM	0.8428	0.9095	0.8547	0.3813	0.1172
	DT	0.8087	0.8071	0.7850	6.6075	0.1913
	KNN	0.8217	0.8766	0.8312	1.7361	0.1379
	XGBoost	0.8551	0.9264	0.8667	0.3558	0.1049
	GBDT	0.8500	0.9218	0.8612	0.3648	0.1062
	RF	0.8652	0.9260	0.8781	0.3825	0.1079
	LightGBM	0.8630	0.9254	0.8749	0.4245	0.1090

Note: significant values are in bold.

Table 2 shows that RF outperforms the other base classifiers in three datasets for most evaluation indicators. It is worth mentioning that LR and SVM also have a superior performance in the German dataset. However, the performance of the DT is poor in comparison to the other candidate classifiers in the three datasets.

5.2 Performance of FSCM-GA

To validate the effectiveness of the FSCM-GA, several typical classifiers (GBDT, LightGBM, and LR) were combined with FSCM-GA and a normal GA feature selection method to compare the performance on the same dataset and evaluation metrics. The prediction competence of the classifiers combined with the normal GA feature selection method was evaluated on three datasets and five performance indicators. The results are presented in Table 3, where “N-GBDT,” “N-LGBM,” “N-LR” represent the GBDT, LightGBM, and LR combined with the normal GA feature selection method, respectively. For the same datasets and evaluation metrics, the evaluation metric values of the selected classifiers are shown in bold if the normal GA feature selection method is employed to improve the performance in comparison to the baseline results. As shown in Table 3, most of the five evaluation indicators of the three selected classifiers improved after the normal GA feature selection method was employed. This result proved the effectiveness of the genetic algorithm in enhancing the predictive precision of the classification models.

Table 3
Performance of normal GA feature selection method using different datasets

Dataset Classifier Accuracy AUC F-score Log loss Brier score

Australian N-GBDT 0.8609 0.9291 0.8433 0.3366 0.0993

N-LGBM 0.8678 0.9300 0.8501 0.3929 0.1016

N-LR 0.8623 0.9158 0.8504 0.3558 0.1055

German N-GBDT 0.7468 0.7720 0.8304 0.5157 0.1700

N-LGBM 0.7480 0.7652 0.8286 0.6004 0.1834

N-LR 0.7483 0.7707 0.8301 0.5140 0.1704

Japanese N-GBDT 0.8558 0.9262 0.8707 0.3510 0.1054

N-LGBM 0.8601 0.9278 0.8751 0.3981 0.1072

N-LR 0.8565 0.9150 0.8614 0.3539 0.1065

Dataset	Classifier	Accuracy	AUC	F-score	Log loss	Brier score
Australian	N-GBDT	0.8609	0.9291	0.8433	0.3366	0.0993
	N-LGBM	0.8678	0.9300	0.8501	0.3929	0.1016
	N-LR	0.8623	0.9158	0.8504	0.3558	0.1055
German	N-GBDT	0.7468	0.7720	0.8304	0.5157	0.1700
	N-LGBM	0.7480	0.7652	0.8286	0.6004	0.1834
	N-LR	0.7483	0.7707	0.8301	0.5140	0.1704
Japanese	N-GBDT	0.8558	0.9262	0.8707	0.3510	0.1054
	N-LGBM	0.8601	0.9278	0.8751	0.3981	0.1072
	N-LR	0.8565	0.9150	0.8614	0.3539	0.1065

Note: bold indicates significant values.

To demonstrate the superiority of the FSCM-GA over the normal GA feature selection method, the FSCM-GA was employed to combine the classifiers to test the same datasets and evaluation metrics. The test results are listed in Table 4, where “F-GBDT,” “F-LGBM,” and “F-LR” represent the GBDT, LightGBM, and LR combined with FSCM-GA, respectively. Compared with the normal GA feature selection method, the bold fonts of the evaluation metrics indicate that the performance of the selected classifiers has been improved after the FSCM-GA is employed. The results in Table 4 prove that the proposed FSCM-GA has a better effect on improving the classification model performance compared to the normal GA feature selection method.

Table 4

Performance of FSCM-GA using different datasets

Dataset	Classifier	Accuracy	AUC	F-score	Log loss	Brier score
Australian	F-GBDT	0.8685	0.9372	0.8512	0.3219	0.0989
	F-LGBM	0.8714	0.9313	0.8550	0.3916	0.0994
	F-LR	0.8634	0.9237	0.8498	0.3463	0.1023
German	F-GBDT	0.7548	0.7761	0.8354	0.5104	0.1687
	F-LGBM	0.7490	0.7679	0.8302	0.6006	0.1824
	F-LR	0.7498	0.7775	0.8305	0.5092	0.1687
Japanese	F-GBDT	0.8587	0.9338	0.8713	0.3326	0.1008
	F-LGBM	0.8696	0.9325	0.8778	0.3843	0.1018
	F-LR	0.8620	0.9206	0.8651	0.3477	0.1047

Note: bold indicates significant values.

Figure 4 shows the performance trend of each classifier after a normal GA and FSCM-GA are applied as feature selection methods. Three evaluation indicators, the accuracy, AUC, and F-score, were selected and applied to three datasets to illustrate the results. The time cost of the GA in different datasets are showed in Fig. 5.

Fig. 4

Performance comparison between normal GA and FSCM-GA in three datasets.

Fig. 5

Time cost of the GA in three datasets.

5.3 Performance of the EVM

In this section, the proposed EVM is compared with the soft voting method using five evaluation metrics to demonstrate its effectiveness. Similarly, SVM, DT, KNN, GBDT, RF, LightGBM, and LR were selected as the base classifiers, and different combinations were tested. The corresponding performances under the datasets are presented in Table 5. If one of the EVM and soft voting methods performs better, the evaluation metric values are indicated in bold. The results in Table 5 indicate that the overall performance of the five evaluation measures obtained by the EVM is better than that of compared to the soft voting method.

Table 5
Performance of the soft voting and EVM in different datasets

Dataset Classifier Ensemble method Accuracy AUC F-score Log loss Brier score

Australian SVM+DT+GBDT Soft voting 0.8449 0.9200 0.8288 0.3758 0.1157

EVM 0.8536 0.9227 0.8429 0.3679 0.1127

GBDT+LGBM+LR Soft voting 0.8598 0.9318 0.8483 0.3293 0.0989

EVM 0.8623 0.9320 0.8496 0.3292 0.0989

SVM+DT+LR+KNN Soft voting 0.8573 0.9156 0.8672 0.3684 0.1108

EVM 0.8609 0.9159 0.8700 0.3672 0.1108

DT+LR+GBDT+RF Soft voting 0.8554 0.9253 0.8360 0.3415 0.1040

EVM 0.8634 0.9278 0.8463 0.3365 0.1023

German SVM+DT+GBDT Soft voting 0.7295 0.7668 0.8148 0.5212 0.1757

EVM 0.7568 0.7682 0.8359 0.5208 0.1755

GBDT+LGBM+LR Soft voting 0.7593 0.7991 0.8385 0.4826 0.1590

EVM 0.7600 0.7996 0.8378 0.4816 0.1588

SVM+DT+LR+KNN Soft voting 0.7478 0.7722 0.8307 0.5089 0.1693

EVM 0.7520 0.7718 0.8331 0.5095 0.1697

DT+LR+GBDT+RF Soft voting 0.7403 0.7710 0.8226 0.5161 0.1736

EVM 0.7523 0.7709 0.8335 0.5162 0.1734

Japanese SVM+DT+GBDT Soft voting 0.8442 0.9190 0.8575 0.3761 0.1156

EVM 0.8522 0.9187 0.8611 0.3768 0.1157

GBDT+LGBM+LR Soft voting 0.8587 0.9283 0.8707 0.3427 0.1021

EVM 0.8573 0.9287 0.8691 0.3425 0.1020

SVM+DT+LR+KNN Soft voting 0.8453 0.9081 0.8574 0.3808 0.1142

EVM 0.8536 0.9086 0.8662 0.3789 0.1130

DT+LR+GBDT+RF Soft voting 0.8500 0.9224 0.8632 0.3545 0.1079

EVM 0.8554 0.9219 0.8700 0.3550 0.1077

Dataset	Classifier	Ensemble method	Accuracy	AUC	F-score	Log loss	Brier score
Australian	SVM+DT+GBDT	Soft voting	0.8449	0.9200	0.8288	0.3758	0.1157
		EVM	0.8536	0.9227	0.8429	0.3679	0.1127
	GBDT+LGBM+LR	Soft voting	0.8598	0.9318	0.8483	0.3293	0.0989
		EVM	0.8623	0.9320	0.8496	0.3292	0.0989
	SVM+DT+LR+KNN	Soft voting	0.8573	0.9156	0.8672	0.3684	0.1108
		EVM	0.8609	0.9159	0.8700	0.3672	0.1108
	DT+LR+GBDT+RF	Soft voting	0.8554	0.9253	0.8360	0.3415	0.1040
		EVM	0.8634	0.9278	0.8463	0.3365	0.1023
German	SVM+DT+GBDT	Soft voting	0.7295	0.7668	0.8148	0.5212	0.1757
		EVM	0.7568	0.7682	0.8359	0.5208	0.1755
	GBDT+LGBM+LR	Soft voting	0.7593	0.7991	0.8385	0.4826	0.1590
		EVM	0.7600	0.7996	0.8378	0.4816	0.1588
	SVM+DT+LR+KNN	Soft voting	0.7478	0.7722	0.8307	0.5089	0.1693
		EVM	0.7520	0.7718	0.8331	0.5095	0.1697
	DT+LR+GBDT+RF	Soft voting	0.7403	0.7710	0.8226	0.5161	0.1736
		EVM	0.7523	0.7709	0.8335	0.5162	0.1734
Japanese	SVM+DT+GBDT	Soft voting	0.8442	0.9190	0.8575	0.3761	0.1156
		EVM	0.8522	0.9187	0.8611	0.3768	0.1157
	GBDT+LGBM+LR	Soft voting	0.8587	0.9283	0.8707	0.3427	0.1021
		EVM	0.8573	0.9287	0.8691	0.3425	0.1020
	SVM+DT+LR+KNN	Soft voting	0.8453	0.9081	0.8574	0.3808	0.1142
		EVM	0.8536	0.9086	0.8662	0.3789	0.1130
	DT+LR+GBDT+RF	Soft voting	0.8500	0.9224	0.8632	0.3545	0.1079
		EVM	0.8554	0.9219	0.8700	0.3550	0.1077

Note: bold indicates significant values.

5.4 Performance of the proposed model

Because the performance of the candidate classifiers is affected by the datasets, their performance rankings vary by dataset. The final prediction result of the proposed credit scoring model was obtained using the ensemble model composed of the selected optimal classifier subset and the feature set on the test set, as shown in Table 6. Compared to the baseline results mentioned above, the proposed credit scoring model showed significant improvements in all evaluation metrics on the three datasets. This proves that the proposed credit scoring model has a high prediction competence and provides considerable help for credit risk management of credit institutions to reduce economic losses, bring in higher returns, and help to better develop under fierce competition conditions.

Table 6
Performance of the proposed model in different datasets

Dataset Accuracy AUC F-score Log loss Brier score

Australian 0.8804 0.9466 0.8643 0.2982 0.0880

German 0.7748 0.8050 0.8505 0.4792 0.1566

Japanese 0.8797 0.9369 0.8926 0.3130 0.0908

Dataset	Accuracy	AUC	F-score	Log loss	Brier score
Australian	0.8804	0.9466	0.8643	0.2982	0.0880
German	0.7748	0.8050	0.8505	0.4792	0.1566
Japanese	0.8797	0.9369	0.8926	0.3130	0.0908

5.5 Comparison and analysis of the performance of the proposed model

The prediction competence of the model proposed in this paper is compared with the model proposed in previous studies [30 , 47–50], as shown in Table 7. Each model was tested on the datasets applied in this study with different evaluation indicators.

Table 7
Performance of the proposed model compared with other models

Dataset Ensemble models Accuracy AUC F-score Log loss Brier score

Australian [48] 0.8798 0.9404 / / 0.0920

[49] / 0.9340 0.8502 0.3319 /

[30] 0.8754 0.9370 / / 0.0938

[50] 0.8590 0.9140 / / /

[47] / 0.9398 / / /

The proposed model 0.8804 0.9466 0.8643 0.2982 0.0880

German [48] 0.7772 0.8023 / / 0.1577

[49] / 0.8002 0.8444 0.4937 /

[30] 0.7682 0.8029 / / 0.1603

[50] 0.6760 0.6860 / / /

[47] / 0.7796 / / /

The proposed model 0.7748 0.8050 0.8505 0.4792 0.1566

Japanese [48] 0.8788 0.9328 / / 0.0946

[49] / 0.9306 0.8700 0.3396 /

[30] 0.8720 0.9387 / / 0.0947

[47] / 0.9245 / / /

The proposed model 0.8797 0.9369 0.8926 0.3130 0.0908

Dataset	Ensemble models	Accuracy	AUC	F-score	Log loss	Brier score
Australian	[48]	0.8798	0.9404	/	/	0.0920
	[49]	/	0.9340	0.8502	0.3319	/
	[30]	0.8754	0.9370	/	/	0.0938
	[50]	0.8590	0.9140	/	/	/
	[47]	/	0.9398	/	/	/
	The proposed model	0.8804	0.9466	0.8643	0.2982	0.0880
German	[48]	0.7772	0.8023	/	/	0.1577
	[49]	/	0.8002	0.8444	0.4937	/
	[30]	0.7682	0.8029	/	/	0.1603
	[50]	0.6760	0.6860	/	/	/
	[47]	/	0.7796	/	/	/
	The proposed model	0.7748	0.8050	0.8505	0.4792	0.1566
Japanese	[48]	0.8788	0.9328	/	/	0.0946
	[49]	/	0.9306	0.8700	0.3396	/
	[30]	0.8720	0.9387	/	/	0.0947
	[47]	/	0.9245	/	/	/
	The proposed model	0.8797	0.9369	0.8926	0.3130	0.0908

Note: bold indicates significant values; “/” indicates that the corresponding evaluation indicators are not presented in the previous work.

Here, “/” indicates an evaluation metric that is not used in previous studies, and bold indicates a specific evaluation metric of the proposed model that outperformed the other models on the same dataset. The comparison results presented in Table 7 demonstrate the superior prediction competence of the proposed model.

5.6 Significance test results

In this section, the Friedman test [51] applied to prove the significance of the proposed model is described. Seventeen different models were tested, including candidate baseline classifiers, classifiers combining normal GA feature selection methods, and classifiers applying the proposed FSCM-GA, the ensemble model with soft voting, the classifier ensemble model with the proposed EVM, and the proposed model.

The Friedman test is a typical non-parametric statistical test, used in this study to compare the predictive competence of the models for each evaluation metric. The ranking of each model for each evaluation metric on all datasets is calculated, and the average ranking of all evaluation metrics was further calculated according to [14], as shown in Table 8. This indicates the comprehensive ranking results for each model. The significant test results indicate that the proposed model is better than the other models, and its rankings are shown in bold. In addition, both the statistical value and the p-value of each evaluation metric were calculated using the Friedman test and placed at the end of Table 8.

Table 8
Classifier ranking results of Friedman test

Method Classifier Accuracy AUC F-score Log loss Brier score AvgRank

Baseline LR 11.7 12.3 12.7 11.3 12.3 12

SVM 12.7 12.3 10.7 9.3 12.7 11.5

DT 17 17 17 17 17 17

KNN 16 16 13.7 16 16 15.5

XGBoost 12.7 9 12.7 8.7 9.3 10.5

GBDT 14 11.3 14.3 9.7 11 12.1

RF 7.7 7.3 6 9.7 10 8.1

LightGBM 10.3 12 11.3 15 14 12.5

Normal GA feature selection method GBDT 10.7 8 9 6.7 6.3 8.1

LGBM 6 8.7 6.7 13.3 10.7 9.1

LR 7.7 12 8.7 8 9.3 9.1

FSCM-GA GBDT 5 3.3 4.3 3 2.8 3.7

LGBM 3.3 5.3 4.7 13 7.3 6.7

LR 5 9.3 8 5 6.3 6.7

Soft voting method GBDT+LGBM+LR 6.3 4.3 6 3.7 3.5 4.8

EVM GBDT+LGBM+LR 6 3.3 6.3 2.7 3.3 4

The proposed model 1 1.3 1 1 1 1.1

Statistics of the Friedman test 26.94 29.06 30.12 26.94 29.06

p-value 0.000 0.000 0.000 0.000 0.000

Method	Classifier	Accuracy	AUC	F-score	Log loss	Brier score	AvgRank
Baseline	LR	11.7	12.3	12.7	11.3	12.3	12
	SVM	12.7	12.3	10.7	9.3	12.7	11.5
	DT	17	17	17	17	17	17
	KNN	16	16	13.7	16	16	15.5
	XGBoost	12.7	9	12.7	8.7	9.3	10.5
	GBDT	14	11.3	14.3	9.7	11	12.1
	RF	7.7	7.3	6	9.7	10	8.1
	LightGBM	10.3	12	11.3	15	14	12.5
Normal GA feature selection method	GBDT	10.7	8	9	6.7	6.3	8.1
	LGBM	6	8.7	6.7	13.3	10.7	9.1
	LR	7.7	12	8.7	8	9.3	9.1
FSCM-GA	GBDT	5	3.3	4.3	3	2.8	3.7
	LGBM	3.3	5.3	4.7	13	7.3	6.7
	LR	5	9.3	8	5	6.3	6.7
Soft voting method	GBDT+LGBM+LR	6.3	4.3	6	3.7	3.5	4.8
EVM	GBDT+LGBM+LR	6	3.3	6.3	2.7	3.3	4
	The proposed model	1	1.3	1	1	1	1.1
Statistics of the Friedman test		26.94	29.06	30.12	26.94	29.06
p-value		0.000	0.000	0.000	0.000	0.000

Note: bold indicates the ranking values of the proposed model; the alpha value is 0.05; the Chi-square critical value is 5.99.

The test results in Table 8 indicate that the statistical values of each evaluation metric calculated using the Friedman test were higher than the Chi-square critical value, and the p-values were all lower than the alpha value (0.05). Hence, the hypothesis of the statistical significance test is invalid, and the prediction competence and robustness of the proposed model are verified.

6 Conclusions and future work

The credit scoring system has always been an important tool for the development of financial institutions. The gradual maturity of artificial intelligence technology has brought immeasurable assistance to the financial sector, and various credit scoring models have been constructed to reduce the risk of credit defaults. Researchers have conducted a variety of important explorations in classifier selection, feature selection, and ensemble methods. Despite research conducted on the combination of the above methods, there still retains the problem of an optimal ensemble that needs further exploration. Herein, a hybrid model is proposed with a new feature selection method and an enhanced voting method to achieve a conspicuous improvement in credit scoring, which includes the classifier selection stage, feature selection stage, and classifier ensemble stage. Based on the performance of each classifier used in the training set, a subset of the optimal classifiers was obtained. Furthermore, FSCM-GA is proposed to obtain a subset of optimal features, using the optimal classifiers obtained through the classifier selection stage as an evaluation function. Finally, the proposed EVM is applied to integrate the classifiers to output the final result with the two subsets mentioned above. The predictive competence of the proposed model was proven on credit datasets, and five performance indicators were applied as the performance measures. The experimental results indicate that the prediction competence of the proposed FSCM-GA is better than that of the normal GA feature selection method. Moreover, compared with the soft voting method applying several different combinations of base classifiers, the proposed EVM achieves a superior prediction performance. The Friedman test results prove that the performance of the proposed model is significantly enhanced compared to the benchmark models, which will inspire researchers for future research on machine learning models.

This study has certain limitations that need to be addressed. For classifier selection, only one evaluation indicator was evaluated to obtain the optimal classifier subset. In the future, an attempt will be made to add more evaluation indicators for selecting the classifiers. In addition, more credit datasets need to be used to validate the effects of the model. Moreover, it is important to validate the efficiency of the heuristic algorithm. Finally, this study can be applied to other fields of financial forecasting.

Footnotes

Acknowledgments

The research in this article has been supported by the National Natural Science Foundation of China under Grant No. 71801187.

References

Antanas

, Zivile

and Marija

, causkieneAdas, Gelzinis, Hybrid and ensemble-based soft computing techniques in bankruptcy prediction: a survey, Soft Computing A Fusion of Foundations Methodologies & Applications, (2010).

Kirkos

, Assessing methodologies for intelligent bankruptcy prediction, Artificial Intelligence Review 43 (2015), 83–123.

Alaka

H.A.

, Oyedele

L.O.

, Owolabi

H.A.

, Kumar

, Ajayi

S.O.

, Akinade

O.O.

and Bilal

, Systematic review of bankruptcy prediction models: Towards a framework for tool selection, Expert Systems with Applications 94 (2018), 164–184.

Chen

F.L.

and Li

F.C.

, Combination of feature selection approaches with SVM in credit scoring, Expert Systems with Applications 37 (2010), 4902–4909.

Hajek

and Michalak

, Feature selection in corporate credit rating prediction, Knowledge-Based Systems 51 (2013), 72–84.

Maldonado

, Pérez

and Bravo

, Cost-based feature selection for Support Vector Machines: An application in credit scoring, European Journal of Operational Research 261 (2017), 656–665.

Oreski

and Oreski

, Genetic algorithm-based heuristic for feature selection in credit risk assessment, Expert Systems with Applications 41 (2014), 2052–2064.

Wang

, Zhang

, Bai

and Mao

, A hybrid system with filter approach and multiple population genetic algorithm for feature selection in credit scoring, Journal of Computational and Applied Mathematics (2017), S0377042717302078.

Sagi

and Rokach

, Ensemble learning: A survey, WIREs Data Mining and Knowledge Discovery 8 (2018).

10.

Freund

, Experiments with a new boosting algorithm, icml, (1996).

11.

Friedman

J.H.

, Greedy Function Approximation: A Gradient Boosting Machine, Annals of Statistics 29 (2001), 1189–1232.

12.

Chen

and Guestrin

, XGBoost: A Scalable Tree Boosting System, the 22nd ACM SIGKDD International Conference, 2016.

13.

Meng

, LightGBM:AHighly Efficient Gradient Boosting Decision Tree, (2018).

14.

Lessmann

, Baesens

B.U.

, Seow

H.V.

and Thomas

, Benchmarking state-of-the-art classification algorithms for credit scoring: A ten-year update, elsevier, (2015).

15.

Xia

, Liu

and Bowen

, and Fangming, A novel heterogeneous ensemble credit scoring model based on bstacking approach, Expert Systems with Application 93 (2018), 182–199.

16.

Singh

, Sharma

and Singh

, Nature-inspired algorithms for Wireless Sensor Networks: A comprehensive survey, Computer Science Review 39 (2021).

17.

Elsonbaty

, Sabir

, Ramaswamy

and Adel

, Dynamical analysis of a novel discrete fractional sitrs model for covid-19, Fractals (2021), 2140035.

18.

Snchez

Y.G.

, Sabir

and Guirao

J.L.G.

, Design of a nonlinear sitr fractal model based on the dynamics of a novel coronavirus (COVID-19), Fractals, (2020).

19.

Umar

, Sabir

, Raja

, Aguilar

and Shoaib

, Neuro-swarm intelligent computing paradigm for nonlinear HIV infection model with CD4+ T-cells, Mathematics and Computers in Simulation 188 (2021).

20.

Sabir

, Raja

, Le

D.N.

and Aly

A.A.

, A neuroswarming intelligent heuristic for second-order nonlinear Lane–Emden multi-pantograph delay differential system, Complex & Intelligent Systems (2021), 1–14.

21.

Holland

, Adaptation in natural and artificial systems : an introductory analysis with application to biology, Control & Artificial Intelligence, (1975).

22.

Sabir

, Khalique

C.M.

, Raja

and Baleanu

, Evolutionary computing for nonlinear singular boundary value problems using neural network, genetic algorithm and active-set algorithm, European Physical Journal Plus 136 (2021).

23.

Nisar

, Sabir

, Raja

M.A.Z.

, Ibrahim

A.A.A.

and Rawat

D.B.

, Evolutionary Integrated Heuristic with Gudermannian Neural Networks for Second Kind of Lane–Emden Nonlinear Singular Models, Applied Sciences 11 (2021), 4725.

24.

Umar

, Sabir

, Raja

M.A.Z.

, Baskonus

H.M.

, Yao

S.-W.

and Ilhan

, A novel study of Morlet neural networks to solve the nonlinear HIV infection system of latently infected cells, Results in Physics 25 (2021).

25.

Nisar

, Sabir

, Raja

, Ibrahim

and Rawat

D.B.

, Design of Morlet Wavelet Neural Network for Solving a Class of Singular Pantograph Nonlinear Differential Models, IEEE Access PP (2021), 1–1.

26.

Zhang

and Yang

, An ensemble of classifiers with genetic algorithmBased Feature Selection, IEEE Intelligent Informatics Bulletin 9 (2008), 18–24.

27.

F.N.K. A, H.S. B and M.K. C, A hybrid data mining model of feature selection algorithms and ensemble learning classifiers for credit scoring, Journal of Retailing and Consumer Services 27 (2015), 11–23.

28.

Chou

C.H.

, Hsieh

S.C.

and Qiu

C.J.

, Hybrid genetic algorithm and fuzzy clustering for bankruptcy prediction, Applied Soft Computing 56 (2017), 298–316.

29.

Yang

, Zhang

, Wu

, Ablanedo-Rosas

J.H.

, Yang

and Yu

, A novel multi-stage ensemble model with fuzzy clustering and optimized classifier composition for corporate bankruptcy prediction, Journal of Intelligent & Fuzzy Systems 40 (2021), 4169–4185.

30.

Zhang

, He

and Zhang

, A novel multi-stage hybrid model with enhanced multi-population niche genetic algorithm: An application in credit scoring, Expert Systems with Applications 121 (2018), 221–232.

31.

Tripathi

, Edla

D.R.

, Cheruku

, Thampi

S.M.

, El-Alfy

E.-S.M.

, Mitra

and Trajkovic

, Hybrid credit scoring model using neighborhood rough set and multi-layer ensemble classification, Journal of Intelligent & Fuzzy Systems 34 (2018), 1543–1549.

32.

Lei

and Liu

, Feature Selection for High-Dimensional Data: A Fast Correlation-Based Filter Solution, Machine Learning, Proceedings of the Twentieth International Conference (ICML 2003), August 21-24, 2003, Washington, DC, USA, 2003.

33.

Nalić

, Martinović

and Žagar

, New hybrid data mining model for credit scoring based on feature selection algorithm and ensemble classifiers, Advanced Engineering Informatics 45 (2020).

34.

D.L. A, C.F.T. B and H.T.W, The effect of feature selection on financial distress prediction, Knowledge-Based Systems 73 (2015), 289–297.

35.

Breiman

, Bagging predictors” Machine Learning, Machine Learning 24 (1996).

36.

Schapire

R.E.

, The strength of weak learnability, Proceedings of the Second Annual Workshop on Computational Learning Theory 5 (1989), 197–227.

37.

Wolpert

D.H.

, Stacked Generalization, Neural Networks 5 (1992), 241–259.

38.

Dietterich

T.G.

, Ensemble Methods in Machine Learning, InternationalWorkshop on Multiple Classifier Systems, 2000.

39.

Mencía

, Park

S.H.

and Fürnkranz

, Efficient voting prediction for pairwise multilabel classification, Neurocomputing 73 (2010), 1164–1176.

40.

Galar

, Fernández

, Barrenechea

, Bustince

and Herrera

, An overview of ensemble methods for binary classifiers in multi-class problems: Experimental study on one-vs-one and one-vs-all schemes, Pattern Recognition 44 (2011), 1761–1776.

41.

Stehman

S.V.

, Selecting and interpreting measures of thematic classification accuracy, Remote Sensing of Environment 62 (1997), 77–89.

42.

Cawley

G.C.

and Talbot

, On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation, Journal of Machine Learning Research 11 (2010), 2079–2107.

43.

Asuncion

, UCI Machine Learning Repository, University of California, Irvine, School of Information and Computer Sciences, http://www.ics.uci.edu/ mlearn/MLRepository.html, (2007).

44.

Abramson

, Braverman

D.J.

and Sebestyen

G.S.

, Pattern Recognition and Machine Learning, Publications of the American Statistical Association 103 (2006), 886–887.

45.

Brier

G.W.

, Verification of Forecasts Expressed in Terms of Probability,” Monthly Weather Review 78:1-3, Monthly Weather Review 78 (1950), 1–3.

46.

Schaffer

, Selecting a classification method by cross-validation, Machine Learning 13 (1993), 135–143.

47.

Jin

, Liu

, Zhang

and Lou

, A novel multi-stage ensemble model with multiple K-means-based selective undersampling: An application in credit scoring, Journal of Intelligent & Fuzzy Systems 40 (2021), 9471–9484.

48.

Ala’raj

and Abbod

M.F.

, A new hybrid ensemble credit scoring model based on classifiers consensus system approach, Expert Systems with Applications 64 (2016), 36–55.

49.

Hongliang

and Wenyu

, A novel ensemble method for credit scoring: Adaption of different imbalance ratios, Expert Systems with Application (2018).

50.

Lan

, Xu

, Ma

and Li

, Multivariable data imputation for the analysis of incomplete credit data, Expert Systems with Applications 141 (2020).

51.

Friedman

, A Comparison of Alternative Tests of Significance for the Problem of $m$ Rankings, Annals of Mathematical Statistics 11 (1940), 86–92.

A hybrid model with novel feature selection method and enhanced voting method for credit scoring

Abstract

Keywords

1 Introduction

2 Related work

2.1 Genetic algorithm

2.2 Feature selection methods

2.3 Ensemble methods

3 Modeling

3.2 FSCM-GA

4.1 Dataset description

Table 1 Description of the datasets Dataset Sample size Good samples Bad samples Total features (numerical /categorical) Australian 690 307 383 14 (6/8) German 1000 700 300 20 (7/13) Japanese 690 307 383 15 (6/9)

4.4 Experiment parameter settings

5 Experimental results and comparative analysis

5.1 Baseline classifier results

Table 6 Performance of the proposed model in different datasets Dataset Accuracy AUC F-score Log loss Brier score Australian 0.8804 0.9466 0.8643 0.2982 0.0880 German 0.7748 0.8050 0.8505 0.4792 0.1566 Japanese 0.8797 0.9369 0.8926 0.3130 0.0908

Footnotes

Acknowledgments

References

Table 1
Description of the datasets

Dataset Sample size Good samples Bad samples Total features (numerical /categorical)

Australian 690 307 383 14 (6/8)

German 1000 700 300 20 (7/13)

Japanese 690 307 383 15 (6/9)

Table 6
Performance of the proposed model in different datasets

Dataset Accuracy AUC F-score Log loss Brier score

Australian 0.8804 0.9466 0.8643 0.2982 0.0880

German 0.7748 0.8050 0.8505 0.4792 0.1566

Japanese 0.8797 0.9369 0.8926 0.3130 0.0908