A novel multi-stage ensemble model with fuzzy clustering and optimized classifier composition for corporate bankruptcy prediction

Abstract

With the rapid development of commercial credit mechanisms, credit funds have become fundamental in promoting the development of manufacturing corporations. However, large-scale, imbalanced credit application information poses a challenge to accurate bankruptcy predictions. A novel multi-stage ensemble model with fuzzy clustering and optimized classifier composition is proposed herein by combining the fuzzy clustering-based classifier selection method, the random subspace (RS)-based classifier composition method, and the genetic algorithm (GA)-based classifier compositional optimization method to predict corporate bankruptcy accurately. To overcome the inherent inflexibility of traditional hard clustering methods, a new fuzzy clustering-based classifier selection method is proposed based on the mini-batch k-means algorithm to obtain the best performing base classifiers for generating classifier compositions. The RS-based classifier composition method was applied to enhance the robustness of candidate classifier compositions by randomly selecting several subspaces in the original feature space. The GA-based classifier compositional optimization method was applied to optimize the parameters of the promising classifier composition through the iterative mechanism of the GA. Finally, six real-world datasets were tested with four evaluation indicators to assess the performance of the proposed model. The experimental results showed that the proposed model outperformed the benchmark models with higher predictive accuracy and efficiency.

Keywords

Bankruptcy prediction; ensemble learning; fuzzy mini-batch clustering; heterogeneous model construction; genetic algorithm

1 Introduction

Corporates are cells of the national economy. They exhibit characteristics of substantial production cost inputs, long investment return cycles, and numerous jobs, and are fundamental in driving the transition of human society to the industrial age. With the development of the credit system and the improvement in capital utilization efficiency, credit funds from financial institutions have become the engine of rapid corporate expansion. However, credit funds sometimes become a huge source of risk causing losses to financial institutions and banks worldwide [37]—the 2008 world financial crisis is a prime example. Measuring the risk and probability of default wherein corporate customers are unable to repay their debts has thus garnered the attention of all risk management levels, bank legislators, and regulators [51], making it critical to be able to predict bankruptcy in corporations.

Bankruptcy prediction is a typical binary classification problem. It divides companies that apply for loans into two categories: those that are well run and those that have potentially high bankruptcy risks. Generally, experience-based and audit-based credit review modes are labor intensive and require considerable material resources; even after extensive effort in these modes, credit fraud prevention remains poor [28]. Hence, classic bankruptcy prediction models—such as the Z-score [4]—have been used to measure a corporation’s financial health. The Basel Committee on Banking Supervision [40] has recommended that banks estimate risks during the entire loan period. Nevertheless, the accuracy of medium- and long-term bankruptcy predictions remains difficult to improve.

In general, the bankruptcy prediction model predicts corporate bankruptcy based on the corporate’s relevant information (e.g. financial statements of corporates) at a given time [16]. Various bankruptcy prediction models have been designed using artificial intelligence technology, including gradient boosting decision tree (GBDT) [22], linear discriminant analysis (LDA) [20], random forest (RF) [9], extreme gradient boosting (XGBoost) [12], and light gradient boosting machine (LightGBM) [29] to support the financial institutions in making the correct business decisions [43]. In particular, the ensemble model has been recognized as a powerful tool for predicting corporate bankruptcy [53]; it has gradually garnered attention, thus becoming more and more mainstream in recent times.

Herein, a novel multi-stage ensemble model with fuzzy clustering and optimized classifier composition is proposed to achieve good corporate bankruptcy prediction. First, a new fuzzy clustering method based on the mini-batch k-means algorithm [42] is proposed to generate amplified clusters, which supports the selection of the best performing base classifiers for generating classifier compositions. Eleven candidate base classifiers were trained and validated on the amplified clusters, and the five best performing classifiers were obtained at this stage. Subsequently, candidate classifier compositions were generated by permutation and combination based on the five best performing classifiers. Random subspace (RS) [7] was applied to train candidate classifier compositions by randomly selecting several subspaces in the original feature space to enhance the robustness of the candidate classifier compositions, which is a prelude to developing an effective ensemble model. This helped to achieve good performance on the candidate classifier compositions under different feature dimensions. A promising classifier composition was obtained from the candidate classifier compositions after evaluation on the validation set. Finally, the genetic algorithm (GA)-based classifier compositional optimization method was applied to optimize the parameters of the promising classifier composition through the iterative mechanism of the GA [38]. The optimal classifier composition was output by the GA-based classifier compositional optimization method to form a stacking-based heterogeneous ensemble model. The performance of the GA in feature selection has been shown to be effective [35]. In this study, the GA was also shown to perform well in the parametric optimization of classifier composition, thus facilitating the construction of an ensemble model with improved performance.

The construction process of the proposed model is adaptive. The selected five best performing base classifiers, the intermediate promising classifier composition, and the final optimal classifier composition are not fixed but depend on the characteristics of different datasets, considering that different base classifiers, different combinations, or different parameterizations yield different classification performances on different datasets.

The remainder of the paper is organized as follows: Section 2 reviews previous literature related to the proposed model; Section 3 elaborates on the main characteristics of the proposed model; Section 4 introduces the experimental datasets, evaluation indicators, and parameter setting; Section 5 analyzes the experimental results; and Section 6 presents the conclusion and discusses future research directions.

2 Related work

In recent years, the prediction of corporate bankruptcy has garnered considerable attention. Consequently, numerous studies regarding the relevant technologies and methods used to predict corporate bankruptcy have been conducted. Recent literature reviews on bankruptcy prediction are provided in [3, 47]. The most popular methods, including feature engineering, RS, and ensemble learning, are briefly reviewed in this section.

2.1 Feature engineering method

Feature engineering integrates a series of engineering methods to screen better data features from the original data and improve the training effect of the classification model. Reliable data and features are prerequisites for models and algorithms to achieve improved performance. In recent years, the focus on feature processing for corporate bankruptcy prediction has increased. Feature processing typically includes data preprocessing, feature selection, dimension reduction, and other processes. Tsai [46] developed a hybrid ensemble model based on combining hard clustering techniques such as self-organizing maps [32] with feature engineering to predict financial distress. Feuerriegel & Gordon [19] presented a methodology that extends lag variables with unstructured data in the form of financial news that suggests a projection of words onto latent semantic structures as a means of feature engineering. The implementation of the model and the analysis of the most descriptive variables provide insightful information regarding the most critical features of distressed banks relative to non-distressed banks.

Kim et al. [30] used a GA-based optimization approach to select appropriate features, which improved predictive performance through information extraction. Hu [27] developed a multivariate gray prediction model for bankruptcy prediction by sifting the relevant features that had the strongest relationship with the class feature. However, methods to extract informative patterns from similar sample points and the analysis of how sample points could be endowed with abundant information to facilitate reliable predictions have rarely been considered. Extracting filtered and optimized information from features has proven to be effective for improving the performance of classification models.

In our previous work, Zhang et al. [53] proposed a multi-stage hybrid model that combined feature selection and classifier selection to obtain the optimal feature and classifier subsets. Subsequently, a classifier ensemble was used to improve the predictive performance based on these two optimal subsets. Feature extraction can reduce computational complexity and simultaneously cause feature information loss. Hence, clustering feature engineering can be applied to organize the sample points according to the principle of information similarity, which not only reduces computing time but also enriches feature information. Therefore, in the current study, a new fuzzy clustering-based classifier selection method is proposed, which demonstrates a good clustering feature engineering effect on classification problems.

2.2 Random subspace method

The RS method improves predictive accuracy by synthesizing the ability of multiple classifiers. This method is based on the stochastic discriminant theory and relies on a pseudo-random process to generate an attribute subset. The generation of each sub-model is independent. Abellán & Mantas [1] used the RS-based ensemble method to improve the performance of classifiers for bankruptcy and credit scoring. Zhu et al. [54] proposed a hybrid ensemble learning approach by incorporating two classic ensemble learning approaches, RS and MultiBoosting [49], to improve the accuracy in forecasting small-medium enterprises’ credit risk.

Previous studies seldom involved the optimization effect of the RS method on classifier composition. Ekinci & Erdal [18] proposed an attribute-based ensemble method combining various RSs for bank failure prediction. Wang et al. [48] proposed an approach for bankruptcy prediction that incorporates sentiment and textual information into the RS method. García et al. [23] investigated the potential relationship between the performance of classifier sets including RSs and positive sample types based on different types of empirical samples.

When building an ensemble model, a parallel combination of classifiers based on RS not only expedites the learning process and reduces the operation time, but also enhances the independence between base classifiers. In the proposed model, the concept of RS is primarily applied to eliminate the mutual interference between base classifiers and enhance the robustness of candidate classifier compositions.

2.3 Ensemble models

In corporate bankruptcy predictions, large datasets pose major challenges to traditional classification prediction models. In recent studies, ensemble learning has been proven to attain higher accuracy and stability than base classifiers on large datasets. Le et al. [33] proposed squared logistics loss with graphics processing unit (GPU)-based extreme gradient boosting for bankruptcy forecasting. Tripathi et al. [45] developed a hybrid model that combined feature selection and a multilayer ensemble classifier framework to improve the predictive performance of credit scoring.

However, the single ensemble learning algorithm is disadvantageous in terms of generalization and stability. Hence, Tsai [46] developed a hybrid ensemble model based on classification techniques such as logistic regression (LR) [10], multilayer perceptron (MLP) [6], and the decision tree (DT) [34]. Du Jardin [17] proposed a hybrid financial distress model based on the clustering and ensemble method to estimate the decision boundary between failed and non-failed firms. Kim et al. [30] examined the effectiveness of a hybrid ensemble method by combining the clustering technique and GA based on an artificial neural network model to balance the proportion between minority and majority classes. Zięba et al. [55] proposed an approach for bankruptcy prediction based on extreme gradient boosting (EXGB). Choi et al. [13] proposed voting-based ensemble models that predicted the financial distress of contractors two and three years ahead of the prediction point using the finance-based definition of pecuniary distress.

Previous studies have examined classifier composition as an ensemble method, but few have investigated classifier composition through parametric optimization, especially by applying heuristic-based optimization algorithms. In our previous work, He et al. [25] developed a stacking-based ensemble model for credit scoring by adapting the model to different imbalanced ratio datasets and obtained superior predictive performance. Nevertheless, stacking-based ensemble models can be improved by addressing the challenges in base classifier screening and exploring the compositional optimization of base classifiers. The concept of applying a heuristic algorithm to perform parametric optimization on classifier composition is helpful. Therefore, the GA was applied to the proposed model to optimize the parameters of the promising classifier composition through the iterative mechanism of the GA. The optimal classifier composition was output by the GA-based classifier compositional optimization method to form a stacking-based heterogeneous ensemble model.

3 Modeling

In this work, a novel multi-stage ensemble model with fuzzy clustering and optimized classifier composition for corporate bankruptcy prediction is proposed. It comprises three main stages: fuzzy clustering-based classifier selection, RS-based classifier composition, and GA-based classifier compositional optimization. The architecture of the proposed model is shown in Fig. 1.

Fig. 1

Architecture of the proposed model.

3.1 Fuzzy clustering-based classifier selection

Considering that extremely imbalanced data in datasets could affect the performance of the model, the synthetic minority oversampling technique (SMOTE) [11] algorithm was applied to increase the minority class samples; this could result in an increase in noise data in the minority class samples. Increase in noise data is detrimental to the model performance, but clustering technology can be used to divide the entire dataset into several subsets so that the classifiers can learn multiple times using small batches. In this study, to manage the possible negative impact of the increase in noise data on the model performance, the clustering technique was adopted to aggregate the data, including noise data, into different clusters to improve the learning efficiency of classifiers in handling different types of data.

The mini-batch clustering technique [42] was applied in the proposed model to obtain good clustering results on large datasets at a lower computational cost. Compared with the classic k-means algorithm [2], the mini-batch k-means algorithm reduces the computation cost by an order of magnitude [42] and is more suitable for large datasets. After the mini-batch k-means algorithm was executed, the generated clusters were combined into “amplified clusters” based on the core concept of “k-nearest neighbors” (KNN) [15]. To form an amplified cluster, the Euclidean distances between all pairs of clusters were calculated; subsequently, each cluster was merged with its two nearest clusters, so that every amplified cluster was the union of three original clusters. The amplified clusters had a larger sample size and richer information than the clusters generated directly by the mini-batch clustering technique, and therefore provided more sample points for training.
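The merging step can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes that the distance between clusters is measured between cluster centers and that each cluster is merged with its two nearest neighbors, as stated in the parameter settings of Section 4.3. The function name `amplified_clusters` and the toy centers are hypothetical.

```python
import math


def amplified_clusters(centers, n_neighbors=2):
    """For each cluster, return the indices of the clusters merged with it.

    centers: list of cluster-center coordinates (one tuple per cluster).
    n_neighbors: number of nearest clusters merged with each cluster
                 (assumed to be 2, giving unions of three clusters).
    """
    merged = []
    for i, center in enumerate(centers):
        # Euclidean distance from center i to every other center.
        others = [(math.dist(center, other), j)
                  for j, other in enumerate(centers) if j != i]
        others.sort()
        nearest = [j for _, j in others[:n_neighbors]]
        merged.append(sorted([i] + nearest))
    return merged


# Toy example with four one-dimensional cluster centers (hypothetical values).
centers = [(0.0,), (1.0,), (2.5,), (10.0,)]
groups = amplified_clusters(centers)
print(groups)
```

In practice the centers would come from the fitted mini-batch k-means model, and the sample points of the grouped clusters would be pooled to form each amplified cluster.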

The amplified clusters were separated into sub-training and sub-validation sets: 80% of the amplified clusters were used as the sub-training set, and the remaining 20% were used as the sub-validation set. With the amplified clusters formed, the sub-training set was enlarged for each candidate base classifier, which was beneficial for reducing the misclassification rate of each cluster during training. The candidate base classifiers were trained by the sub-training set composed of amplified clusters instead of the clusters generated by the mini-batch clustering technique directly.

Subsequently, the fuzzy membership degree of the sample points is allocated to the clusters. Each sample point x belongs to distinct clusters with different degrees. The sample point x is assigned to n clusters, and the distance from the sample point x to each clustering center is measured using the Euclidean distance. The fuzzy membership M_i of each sample point x is defined in Equation (1) as follows:

$M_{i} = \frac{\lVert x - C_{i} \rVert}{\sum_{j = 1}^{n} \lVert x - C_{j} \rVert} \times 100\%$ (1)

where C_i represents the center of a cluster to which point x belongs (i = 1, 2, . . . , n), and ∥x - C_i∥ represents the Euclidean distance from sample point x to the clustering center C_i. The sum of the fuzzy membership degrees of each sample point x is one, which is represented as $\sum_{i = 1}^{n} M_{i} = 1$. The fuzzy membership degree of each sample point is used as an additional feature to provide more information for the prediction.
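As a small worked example, the membership degrees of Equation (1) can be computed as below. The sketch follows the equation literally and returns fractions rather than percentages; the function name `fuzzy_membership`, the guard against a zero denominator, and the toy coordinates are assumptions made for illustration.

```python
import math


def fuzzy_membership(x, centers):
    """Membership of point x in each cluster, following Equation (1).

    Each value is the distance to one center divided by the sum of the
    distances to all centers, so the values sum to one.
    """
    distances = [math.dist(x, c) for c in centers]
    total = sum(distances)
    if total == 0:
        # Guard added for the sketch only: x coincides with every center.
        raise ValueError("all distances are zero")
    return [d / total for d in distances]


# Toy point and three toy cluster centers (hypothetical values).
memberships = fuzzy_membership((0.0, 0.0), [(3.0, 4.0), (0.0, 5.0), (6.0, 8.0)])
print(memberships)  # [0.25, 0.25, 0.5]
```

The resulting values can be appended to the feature vector of the sample point as additional columns.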

The selection of base classifiers affects the performance of the candidate classifier compositions. Eleven popular base classifiers, namely KNN, LR, RF, GBDT, LDA, support vector machines (SVM) [14], MLP, AdaBoost, DT, XGBoost, and LightGBM, were evaluated. At this stage, the 11 candidate base classifiers were trained on the sub-training set and validated on the sub-validation set, both composed of amplified clusters, and the five best performing classifiers were subsequently obtained.

The five best performing classifiers (i.e., GBDT, AdaBoost, DT, XGBoost, and LightGBM) with higher predictive accuracy were selected based on the area under the receiver operating characteristics (ROC) curve (AUC) [24] and the performance of each base classifier in the first stage. The fuzzy clustering-based classifier selection process was both adaptive and flexible. The selected five best performing base classifiers were not fixed but depended on the characteristics of the different datasets, considering that different base classifiers yielded different performances on different datasets.

3.2 RS-based classifier composition method

In heterogeneous ensemble models, composing the best performing base classifiers does not directly result in the best performing heterogeneous ensemble model; it may even cause model overfitting [25]. To decrease the risk of overfitting and provide a contingency classifier composition, RS was applied to train candidate classifier compositions by randomly selecting several subspaces in the original feature space to enhance their robustness.

To maintain the diversity of candidate classifier compositions in a heterogeneous ensemble model, the candidate classifier compositions were formed as combinations of the five best performing classifiers output from the last stage. Theoretically, the more best performing base classifiers selected to compose the ensemble model, the more candidate classifier compositions are produced, which improves composition performance but also increases computational complexity. Therefore, to balance this tradeoff, the number of best performing classifiers output from the last stage was set to five through a trial run. For example, a candidate classifier composition can be composed of two or three classifiers belonging to the best performing classifiers. The total number of candidate classifier compositions is 26 and can be represented as $\sum_{i = 2}^{5} C_{5}^{i}$ (i.e., 10 + 10 + 5 + 1). The RS-based classifier composition method applies the candidate classifier compositions to the randomly selected part of the sub-training set. Therefore, good performance can be achieved for the candidate classifier compositions under different feature dimensions. This implies that the candidate classifier compositions will be equipped to address different feature combinations. The candidate classifier compositions are configured through the stacking method [51], which combines the prediction of several base classifiers through the soft voting strategy [41] to improve the predictive accuracy of the ensemble model. As shown in Fig. 2, the candidate classifier compositions were trained and validated using the RS-based classifier composition method in the amplified clusters. A promising classifier composition was obtained from the candidate classifier compositions after evaluation on the validation set.
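The count of 26 candidate compositions and the random selection of feature subspaces can be illustrated with a short sketch. The classifier names are those listed in Section 3.1; the helper `random_subspaces`, the number of subspaces, the subspace size, and the feature count are hypothetical, because the text does not specify them.

```python
import random
from itertools import combinations

best_classifiers = ["GBDT", "AdaBoost", "DT", "XGBoost", "LightGBM"]

# Every composition containing between two and five of the selected classifiers.
candidates = [
    combo
    for size in range(2, len(best_classifiers) + 1)
    for combo in combinations(best_classifiers, size)
]
print(len(candidates))  # 26


def random_subspaces(n_features, n_subspaces, size, seed=0):
    """Draw feature-index subsets without replacement (illustrative sizes)."""
    rng = random.Random(seed)
    return [sorted(rng.sample(range(n_features), size))
            for _ in range(n_subspaces)]


# Four subspaces of six features each, out of ten features (assumed values).
subspaces = random_subspaces(n_features=10, n_subspaces=4, size=6)
```

Each candidate composition would then be trained on the columns indexed by each subspace and scored on the validation set.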

Fig. 2

Schematic diagram of the RS-based classifier composition method.

The RS-based classifier composition process is adaptive. The output promising classifier composition is not fixed but depends on the characteristics of the different datasets, considering that different combinations of classifiers yield different performances on different datasets. The promising classifier composition can be further enhanced using the GA-based classifier compositional optimization method in the next stage.

3.3 GA-based classifier compositional optimization method

In heterogeneous ensemble models, the instability of model performance caused by randomly selecting the parameters of the involved base classifiers must be avoided. In the GA-based classifier compositional optimization method, the GA is applied to optimize the parameters of the promising classifier composition by simulating the natural selection of Darwinian biological evolution based on chromosome representation, selection operation, crossover operation, and mutation operation. The fitness function was set to the AUC of the promising classifier composition. The main steps of the GA are as follows: initially, the fitness of individuals corresponding to each chromosome was measured. Next, the two individuals with the best fitness were selected as the parents to produce offspring through the crossover operation; subsequently, the chromosomes of the offspring were mutated; and finally, the aforementioned operations were repeated until a new population comprising the optimal individual was generated. As shown in Fig. 3, Parent 1 and Parent 2 are the individuals corresponding to the promising classifier composition output from the second stage with different parametric settings. Offspring 1 and Offspring 2 are the offspring generated by the crossover of Parent 1 and Parent 2, and Offspring 2 is the offspring generated by the mutation of Parent 1.

Fig. 3

Schematic diagram of the GA-based classifier compositional optimization method.

The promising classifier composition is composed of several base classifiers. Each base classifier is determined by several parameters, including the number of estimators and the learning rate [26]. Consequently, the number of estimators corresponding to each base classifier influences the predictive performance of the promising classifier composition. The learning rate determines the loss function of the promising classifier composition [39]. The number of estimators is a positive integer, and the learning rate is a real number between zero and one. The parameters corresponding to all base classifiers of the promising classifier composition form a chromosome. For example, $P_{1}^{1}$ and $P_{2}^{1}$ represent the parameters in a set of parameters of the promising classifier composition. The GA-based classifier compositional optimization method aims to maintain the diversity of the ensemble model and optimize the parameters of the promising classifier composition. The optimal classifier composition is output by the GA-based classifier compositional optimization method to form a stacking-based heterogeneous ensemble model.
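The GA loop over such chromosomes can be sketched as follows. This is an illustrative toy rather than the authors' implementation. The population size, number of generations, crossover rate, and mutation rate are taken from Section 4.3; the gene ranges, the uniform crossover, the retention of both parents in the next generation, and the stand-in fitness function `toy_fitness` (used here in place of the validation AUC) are assumptions.

```python
import random

N_CLASSIFIERS = 3  # hypothetical size of the promising composition
POP_SIZE, GENERATIONS = 30, 30
CROSSOVER_RATE, MUTATION_RATE = 0.8, 0.05


def random_gene(rng):
    # One gene per base classifier: (number of estimators, learning rate).
    return (rng.randint(10, 500), rng.uniform(0.01, 1.0))


def toy_fitness(chromosome):
    # Stand-in for the validation AUC (assumption): peaks at (200, 0.1) per gene.
    return -sum(((n - 200) / 500) ** 2 + (lr - 0.1) ** 2 for n, lr in chromosome)


def evolve(fitness, seed=0):
    rng = random.Random(seed)
    population = [[random_gene(rng) for _ in range(N_CLASSIFIERS)]
                  for _ in range(POP_SIZE)]
    history = []
    for _ in range(GENERATIONS):
        # Selection: the two fittest individuals become the parents.
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        parent1, parent2 = population[0], population[1]
        children = [parent1, parent2]  # parents are kept (an assumption)
        while len(children) < POP_SIZE:
            if rng.random() < CROSSOVER_RATE:
                # Uniform crossover: each gene comes from either parent.
                child = [rng.choice(pair) for pair in zip(parent1, parent2)]
            else:
                child = list(parent1)
            # Mutation: occasionally replace a gene with a fresh random one.
            child = [random_gene(rng) if rng.random() < MUTATION_RATE else gene
                     for gene in child]
            children.append(child)
        population = children
    return max(population, key=fitness), history


best, history = evolve(toy_fitness)
```

In a real run, the fitness function would train the composition with the decoded parameters and return its AUC on the validation set, which makes each evaluation far more expensive than in this toy.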

The GA-based classifier compositional optimization process is adaptive. The output optimal classifier composition is not fixed but depends on the characteristics of the different datasets, considering that different parameterizations of classifiers yield different performances on different datasets.

As a powerful ensemble method, stacking is applied to the proposed model, combining the prediction of best performing base classifiers via a meta-classifier. Experimentally, LR proved to be the better choice for the meta-classifier in the proposed model.
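A stacking ensemble with an LR meta-classifier can be assembled with scikit-learn as sketched below. This is only an illustration on synthetic data: the two base classifiers, their settings, the 5-fold internal cross-validation, and the data generator are assumptions and do not reproduce the optimal composition described here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (a stand-in for a bankruptcy dataset).
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

stack = StackingClassifier(
    estimators=[
        ("gbdt", GradientBoostingClassifier(n_estimators=50, random_state=0)),
        ("dt", DecisionTreeClassifier(max_depth=3, random_state=0)),
    ],
    final_estimator=LogisticRegression(),  # LR as the meta-classifier
    stack_method="predict_proba",          # base outputs are probabilities
    cv=5,
)
stack.fit(X_train, y_train)
proba = stack.predict_proba(X_test)
```

The meta-classifier is fitted on out-of-fold predictions of the base classifiers, which limits the leakage of training labels into the second level.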

4 Experiment

4.1 Dataset description and data preprocessing

In this experiment, the datasets were obtained from the UC Irvine (UCI) machine learning repository [55] and contain the real-world financial indicators of Polish manufacturing corporations from 2007 to 2011. The datasets were separated into five parts (each part representing a fiscal year) that described the period from the 1st year (2007 fiscal year) to the 5th year (2011 fiscal year), which corresponded to five different bankruptcy cycles. The class labels (“0” for operating and “1” for bankruptcy) of the datasets were determined using the bankruptcy status of the enterprise in 2012. Furthermore, another larger real-world dataset (i.e., the Creator dataset), published in 2019 by a Chinese intelligent government services provider called Creator Information Technology Co., Ltd, was also adopted. The Creator dataset included company management information for 35960 Chinese companies.

Data preprocessing is crucial in the model. Some basic data preprocessing techniques, including dummy coding, data normalization, and correlation analysis, were applied to process the original datasets and yield an effective classification. Numerical features were standardized by removing the mean and scaling to unit variance. Subsequently, dummy coding and polynomial processing were applied to optimize the classification characteristics. Dummy coding can transform a continuous input variable into several dichotomous features, and polynomial processing can increase the diversity of features. Feature correlation analysis was also applied. For any two explanatory features whose correlation was greater than 0.97, only one was retained. The datasets were standardized and normalized to scale the data to within a unified, specified range.
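The correlation filter can be sketched as follows. The 0.97 threshold comes from the text; the use of the Pearson coefficient, the rule of keeping the first feature of a correlated pair, and the toy feature values are assumptions. The sketch also assumes that no feature is constant, since a constant column would make the denominator zero.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def drop_correlated(columns, threshold=0.97):
    """Keep a feature only if its correlation with every kept feature
    does not exceed the threshold."""
    kept = []
    for name, values in columns.items():
        if all(pearson(values, columns[k]) <= threshold for k in kept):
            kept.append(name)
    return kept


# Toy features (hypothetical values); f2 is nearly a multiple of f1.
features = {
    "f1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "f2": [2.1, 4.0, 6.2, 8.1, 9.9],
    "f3": [5.0, 1.0, 4.0, 2.0, 3.0],
}
kept = drop_correlated(features)
print(kept)  # ['f1', 'f3']
```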

Owing to the extreme imbalance between the number of corporations with financial risks and that of well-run corporations in the market economy, the SMOTE algorithm was applied first to increase the number of minority class samples. As shown in Table 1, six imbalanced datasets existed (named 1st-year, 2nd-year, 3rd-year, 4th-year, 5th-year, and Creator). Well-run corporations were abbreviated as WRCs, bankrupt corporations as BCs, and the imbalance ratio as IR. After the SMOTE process, the datasets became balanced; subsequently, the proportion between the positive and negative samples was 1:1.

Table 1
Description of datasets with imbalance ratios

Dataset	Sample size	WRCs	BCs	IR
1st-year	7027	6756	271	24.93
2nd-year	10173	9773	400	24.43
3rd-year	10503	10008	495	20.22
4th-year	9792	9277	515	18.01
5th-year	5910	5500	410	13.41
Creator	35960	21959	14001	1.59

4.2 Evaluation indicators

In this study, four evaluation indicators were adopted, including accuracy (ACC), AUC, F-score, and logistic loss [8]. These indicators reflected the performance of models well, each holding different emphases. The evaluation indicators of the model, such as ACC and AUC, were determined by the value of true positive (TP), true negative (TN), false positive (FP), and false negative (FN) indicators. The confusion matrix shown in Table 2 is the foundation for various evaluation indicators typically used in classification prediction. The predictive accuracy is defined in Equation (2).

Table 2
Confusion matrix

		Predicted results
		Positive	Negative
Real result	Positive	True positives (TP)	False negatives (FN)
	Negative	False positives (FP)	True negatives (TN)

$Accuracy = \frac{TP + TN}{TP + FP + TN + FN}$ (2)

AUC summarizes the discriminative ability of a classifier and is typically used in binary classification tasks. In general, real datasets are imbalanced, with more positive samples than negative samples (or vice versa), and AUC is less sensitive to such imbalance than accuracy. Even when the predictive accuracies of classification models are similar, those with a higher AUC value have better classification ability.
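AUC can be read as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative sample, with ties counted as one half. The sketch below computes it directly from that pairwise definition; the function name `auc` and the toy scores are illustrative, and the quadratic loop is intended for small examples only.

```python
def auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    # A correctly ordered pair counts 1, a tie counts 0.5.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


value = auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(value)  # 0.75
```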

The F-score (also called F-measure) is defined in Equation (3). Precision is defined in Equation (4); it represents the proportion of TP samples among all samples that the classifier predicts as positive. Recall is the proportion of TP samples among all actual positive samples, TP / (TP + FN). The F-score is the harmonic mean of precision and recall, in which the best value (perfect precision and recall) is one and the worst value is zero.

$F\text{-}score = \frac{2 \times Precision \times Recall}{Precision + Recall}$ (3)

$Precision = \frac{TP}{TP + FP}$ (4)
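The indicators derived from the confusion matrix can be computed as in the sketch below, which follows Equations (2) to (4) together with the standard definition of recall. The function name, the convention of returning zero when a denominator is zero, and the toy counts are assumptions.

```python
def classification_metrics(tp, fn, fp, tn):
    """Accuracy, precision, recall, and F-score from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    # Returning 0.0 for an empty denominator is a convention chosen here.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_score": f_score,
    }


# Toy counts (hypothetical): 50 actual positives and 150 actual negatives.
metrics = classification_metrics(tp=40, fn=10, fp=20, tn=130)
```

With these counts, accuracy is 0.85, precision is 40/60, recall is 0.8, and the F-score is 80/110. A classifier that predicts only the majority class has TP = 0 and therefore an F-score of zero even when its accuracy is high.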

Log loss, also known as the cross-entropy loss function, is a loss measurement of the classification model. As shown in Equation (5), i represents the index of the predicted sample, with i ∈ {1, . . . , n}; n represents the number of samples; y_i and p_i represent the real value and the predicted probability, respectively; and y_i ∈ {0, 1}.

$L_{logistic} = - \frac{1}{n} \sum_{i = 1}^{n} (y_{i} \log (p_{i}) + (1 - y_{i}) \log (1 - p_{i}))$ (5)
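Equation (5) can be evaluated with a few lines of code. The sketch assumes the natural logarithm and clips the probabilities away from zero and one to avoid taking the logarithm of zero; the clipping constant `eps` and the toy inputs are assumptions.

```python
import math


def log_loss(y_true, p_pred, eps=1e-15):
    """Average binary cross-entropy, following Equation (5)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clipping is an assumption of the sketch
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)


loss = log_loss([1, 0, 1, 0], [0.9, 0.1, 0.8, 0.3])
```

A prediction of 0.5 for every sample gives a loss of log 2 (about 0.693), which serves as a reference point for an uninformative model on balanced data.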

4.3 Experimental parameter settings

The raw dataset was divided as follows: 20% of the total data was used as the test data; the remaining 80% was used as the training data, which was further separated into two parts, where 80% was used as the training set and 20% as the validation set. Therefore, the ratio of the training set, validation set, and test data was 0.64:0.16:0.20. In each dataset, default parameters were utilized for the base classifiers before the GA-based classifier compositional optimization method was applied. The base classifiers were imported from the Python module “sklearn.” The SMOTE algorithm was imported from the Python module “imblearn.” In the mini-batch k-means algorithm, the number of clustering centers was set to eight, and the size of the mini-batches was set to 80. The number of adjacent clusters for each cluster was set to two (so that each amplified cluster was a union of three original clusters). When the GA was applied to optimize the parameters of the promising classifier composition, the number of genetic iterations was set to 30, the number of individuals in each generation to 30, the crossover rate to 0.8, the mutation rate to 0.05, and the fitness function to the AUC value. Experimental results and comparative analysis are elaborated in the next section.
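The nested split can be checked with a short sketch. Only the 20% and 80% proportions come from the text; the shuffling, the rounding rule, and the function name `split_indices` are assumptions, and stratification by class is not shown.

```python
import random


def split_indices(n_samples, seed=0):
    """Split sample indices into training, validation, and test parts."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    n_test = round(0.20 * n_samples)              # 20% of all data
    n_val = round(0.20 * (n_samples - n_test))    # 20% of the remaining 80%
    test = indices[:n_test]
    val = indices[n_test:n_test + n_val]
    train = indices[n_test + n_val:]
    return train, val, test


train, val, test = split_indices(1000)
print(len(train), len(val), len(test))  # 640 160 200
```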

5 Experimental analysis

Four indicators were adopted to evaluate the performance of the baseline classifiers and ensemble models. To enhance the robustness of the experiments and reduce fortuity, each experiment was performed 10 times, and the average values were calculated for evaluation.

5.1 Baseline results

To verify the performance of the proposed model, several baseline results are presented first for comparison. In the baseline experiment, 11 popular base classifiers were applied: KNN, LR, RF, GBDT, LDA, SVM, MLP, AdaBoost, DT, XGBoost, and LightGBM. The performance was evaluated based on the AUC and F-score. For the same dataset, the values of the performance indicators are shown in bold if the base classifier performs better with SMOTE processing, as shown in Table 3. In Table 3, imbalanced datasets without SMOTE processing are represented as “raw dataset,” and balanced datasets with SMOTE processing are represented as “balanced dataset.” SMOTE processing improves the base classifiers on both indicators, and the comparison before and after SMOTE processing remains fair because SMOTE is applied only to the training set, not to the validation and test sets. It is noteworthy that on the raw datasets without SMOTE processing, the F-score values of many classifiers were approximately zero. This is because a severely imbalanced raw dataset biases the classifiers toward predicting the majority class. The experimental results indicate that the base classifiers performed better on the balanced datasets.
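The near-zero F-scores on the raw datasets can be reproduced with a toy example. The sketch below scores a classifier that always predicts the majority (non-bankrupt) class on a hypothetical sample with a 95:5 class ratio. The class ratio, the function names, and the use of hard class labels to compute the AUC are assumptions of this illustration rather than details of the experiments.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels, with 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def f_score(y_true, y_pred):
    """F1 score; defined as 0.0 when no positive is predicted or present."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def auc_from_labels(y_true, y_pred):
    """AUC of a hard-label classifier: the mean of sensitivity and specificity."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    tpr = tp / (tp + fn) if tp + fn else 0.0
    tnr = tn / (tn + fp) if tn + fp else 0.0
    return (tpr + tnr) / 2


# Hypothetical imbalanced sample: 5 bankrupt (1) and 95 non-bankrupt (0) firms.
y_true = [1] * 5 + [0] * 95
y_pred = [0] * 100  # always predict the majority class
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy, auc_from_labels(y_true, y_pred), f_score(y_true, y_pred))
```

The accuracy is high because the majority class dominates, yet the AUC stays at 0.5 and the F-score at 0, which matches the pattern of several raw-dataset rows in Table 3.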

Table 3
Baseline results

Dataset	Base classifiers	Raw dataset AUC	Raw dataset F-score	Balanced dataset AUC	Balanced dataset F-score
1st-year	KNN	0.5146	0.0588	0.8248	0.7915
	LR	0.5150	0.0597	0.7179	0.6385
	RF	0.5000	0.0000	0.8682	0.8449
	GBDT	0.6836	0.5111	0.9734	0.9708
	LDA	0.5143	0.0580	0.6504	0.5431
	SVM	0.5000	0.0000	0.6648	0.5087
	MLP	0.5315	0.1176	0.8897	0.8714
	Adaboost	0.5964	0.3200	0.9562	0.9505
	DT	0.6699	0.3622	0.9228	0.9094
	XGBoost	0.6836	0.5111	0.9778	0.9756
	LightGBM	0.6675	0.4773	0.9777	0.9756
2nd-year	KNN	0.4997	0.0000	0.8072	0.7835
	LR	0.4997	0.0000	0.7186	0.6731
	RF	0.5000	0.0000	0.8142	0.7944
	GBDT	0.5613	0.2020	0.9671	0.9640
	LDA	0.4997	0.0000	0.6964	0.6483
	SVM	0.5000	0.0000	0.5861	0.3600
	MLP	0.4990	0.0000	0.8379	0.8185
	Adaboost	0.4997	0.0000	0.9363	0.9290
	DT	0.5970	0.2143	0.8713	0.8565
	XGBoost	0.6203	0.3689	0.9709	0.9686
	LightGBM	0.6133	0.3462	0.9695	0.9669
3rd-year	KNN	0.4993	0.0000	0.8333	0.8189
	LR	0.4978	0.0000	0.7417	0.7152
	RF	0.5000	0.0000	0.8299	0.8245
	GBDT	0.6469	0.3972	0.9764	0.9751
	LDA	0.4978	0.0000	0.7105	0.6718
	SVM	0.5000	0.0000	0.7514	0.7235
	MLP	0.4988	0.0000	0.9133	0.9077
	Adaboost	0.5531	0.1869	0.9538	0.9508
	DT	0.5996	0.2120	0.9089	0.9030
	XGBoost	0.6593	0.4380	0.9776	0.9763
	LightGBM	0.6704	0.4638	0.9793	0.9783
4th-year	KNN	0.5067	0.0313	0.8388	0.8325
	LR	0.5028	0.0159	0.7549	0.7466
	RF	0.5000	0.0000	0.8252	0.8296
	GBDT	0.7004	0.5355	0.9742	0.9743
	LDA	0.5023	0.0156	0.7102	0.6999
	SVM	0.4997	0.0000	0.7540	0.7489
	MLP	0.5418	0.1507	0.9115	0.9131
	Adaboost	0.6189	0.3718	0.9561	0.9563
	DT	0.6679	0.3797	0.9021	0.9040
	XGBoost	0.7014	0.5475	0.9825	0.9826
	LightGBM	0.6928	0.5281	0.9836	0.9837
5th-year	KNN	0.5420	0.1524	0.8759	0.8936
	LR	0.5194	0.0825	0.8413	0.8662
	RF	0.5906	0.3030	0.8718	0.8966
	GBDT	0.7990	0.6897	0.9769	0.9800
	LDA	0.5424	0.1538	0.7302	0.7901
	SVM	0.5000	0.0000	0.8393	0.8674
	MLP	0.5851	0.2703	0.9254	0.9357
	Adaboost	0.7301	0.5652	0.9664	0.9710
	DT	0.7779	0.5537	0.8951	0.9098
	XGBoost	0.7647	0.6567	0.9739	0.9780
	LightGBM	0.7624	0.6331	0.9804	0.9835
Creator	KNN	0.6396	0.5439	0.7181	0.6253
	LR	0.7350	0.7220	0.9292	0.9026
	RF	0.7356	0.7345	0.9921	0.9909
	GBDT	0.7351	0.7340	0.9850	0.9837
	LDA	0.5743	0.4692	0.6775	0.5366
	SVM	0.7339	0.7115	0.9011	0.8690
	MLP	0.7353	0.7271	0.9276	0.9022
	Adaboost	0.7356	0.7342	0.9815	0.9801
	DT	0.7323	0.7314	0.9636	0.9614
	XGBoost	0.7354	0.7345	0.9843	0.9830
	LightGBM	0.7356	0.7345	0.9917	0.9903

Note: In the same dataset, the value of performance indicators is shown in bold if the base classifier performs better with SMOTE processing.

5.2 Performance evaluation of the fuzzy clustering-based classifier selection method

To assess the effectiveness of the fuzzy clustering-based classifier selection method on the balanced datasets, the ACC, AUC, F-score, and log loss results on the six datasets were compared before and after the method was applied, as shown in Table 4. In Table 4, the balanced datasets without the fuzzy clustering-based classifier selection method applied are represented as “balanced dataset,” and the balanced datasets with the method applied are represented as “clustered dataset.” For the same dataset, the values of performance indicators are shown in bold if the base classifiers perform better or the same after the fuzzy clustering-based classifier selection method is applied. As shown in Table 4, for the base classifiers other than LDA, most of the ACC, AUC, F-score, and log loss values improved after the method was applied. LDA is a variance-sensitive algorithm suited to normally distributed data [20]; its performance degraded after the fuzzy clustering-based classifier selection method was applied because fuzzy clustering weakens the normality of the distribution of the sample points.

Table 4
Performance evaluation of the fuzzy clustering-based classifier selection method

Dataset	Base classifiers	Balanced dataset ACC	Balanced dataset AUC	Balanced dataset F-score	Balanced dataset log loss	Clustered dataset ACC	Clustered dataset AUC	Clustered dataset F-score	Clustered dataset log loss
1st-year	KNN	0.8397	0.8248	0.7915	5.5363	0.8487	0.8376	0.8079	5.2246
	LR	0.7503	0.7179	0.6385	8.6236	0.7542	0.7167	0.6291	8.4900
	RF	0.8754	0.8682	0.8449	4.3044	0.8758	0.8680	0.8449	4.2896
	GBDT	0.9764	0.9734	0.9708	0.8163	0.9764	0.9736	0.9708	0.8163
	LDA	0.6854	0.6504	0.5431	10.864	0.6210	0.5564	0.3000	13.091
	SVM	0.7211	0.6648	0.5087	9.6329	0.7254	0.6701	0.5199	9.4844
	MLP	0.8977	0.8897	0.8714	3.5326	0.9132	0.9068	0.8916	2.9982
	Adaboost	0.9600	0.9562	0.9505	1.3804	0.9600	0.9573	0.9508	1.3804
	DT	0.9261	0.9228	0.9094	2.5530	0.9287	0.9260	0.9128	2.4639
	XGBoost	0.9802	0.9778	0.9756	0.6828	0.9811	0.9792	0.9767	0.6531
	LightGBM	0.9802	0.9777	0.9756	0.6828	0.9820	0.9799	0.9778	0.6234
2nd-year	KNN	0.8109	0.8072	0.7835	6.5314	0.8380	0.8334	0.8123	5.5954
	LR	0.7287	0.7186	0.6731	9.3698	0.6701	0.6333	0.4615	11.394
	RF	0.8144	0.8142	0.7944	6.4093	0.8206	0.8198	0.8000	6.1957
	GBDT	0.9685	0.9671	0.9640	1.0886	0.9744	0.9733	0.9708	0.8851
	LDA	0.7060	0.6964	0.6483	10.153	0.6068	0.5566	0.2252	13.581
	SVM	0.6271	0.5861	0.3600	12.879	0.6150	0.5658	0.2519	13.296
	MLP	0.8406	0.8379	0.8185	5.5039	0.8730	0.8685	0.8522	4.3848
	Adaboost	0.9376	0.9363	0.9290	2.1568	0.9434	0.9412	0.9350	1.9533
	DT	0.8719	0.8713	0.8565	4.4255	0.9040	0.9025	0.8910	3.3166
	XGBoost	0.9726	0.9709	0.9686	0.9461	0.9773	0.9759	0.9740	0.7834
	LightGBM	0.9711	0.9695	0.9669	0.9970	0.9806	0.9790	0.9777	0.6714
3rd-year	KNN	0.8359	0.8333	0.8189	5.6683	0.8465	0.8443	0.8315	5.3032
	LR	0.7455	0.7417	0.7152	8.7899	0.7484	0.7423	0.7073	8.6895
	RF	0.8272	0.8299	0.8245	5.9695	0.8277	0.8294	0.8224	5.9513
	GBDT	0.9770	0.9764	0.9751	0.7941	0.9810	0.9804	0.9794	0.6572
	LDA	0.7164	0.7105	0.6718	9.7940	0.6942	0.6845	0.6223	10.560
	SVM	0.7558	0.7514	0.7235	8.4340	0.7566	0.7516	0.7220	8.4066
	MLP	0.9133	0.9133	0.9077	2.9939	0.9273	0.9264	0.9214	2.5101
	Adaboost	0.9545	0.9538	0.9508	1.5700	0.9598	0.9593	0.9567	1.3874
	DT	0.9088	0.9089	0.9030	3.1491	0.9168	0.9160	0.9103	2.8752
	XGBoost	0.9781	0.9776	0.9763	0.7576	0.9802	0.9797	0.9786	0.6846
	LightGBM	0.9799	0.9793	0.9783	0.6937	0.9826	0.9819	0.9811	0.6024
4th-year	KNN	0.8383	0.8388	0.8325	5.5858	0.8491	0.8495	0.8453	5.2134
	LR	0.7544	0.7549	0.7466	8.4812	0.7210	0.7215	0.7126	9.6356
	RF	0.8253	0.8252	0.8296	6.0327	0.8278	0.8275	0.8332	5.9489
	GBDT	0.9741	0.9742	0.9743	0.8937	0.9755	0.9755	0.9757	0.8472
	LDA	0.7097	0.7102	0.6999	10.026	0.6881	0.6886	0.6796	10.771
	SVM	0.7536	0.7540	0.7489	8.5091	0.7542	0.7546	0.7485	8.4905
	MLP	0.9116	0.9115	0.9131	3.0536	0.9261	0.9262	0.9265	2.5509
	Adaboost	0.9561	0.9561	0.9563	1.5175	0.9542	0.9543	0.9544	1.5827
	DT	0.9022	0.9021	0.9040	3.3794	0.9054	0.9053	0.9074	3.2677
	XGBoost	0.9825	0.9825	0.9826	0.6051	0.9822	0.9823	0.9823	0.6144
	LightGBM	0.9836	0.9836	0.9837	0.5679	0.9863	0.9863	0.9864	0.4748
5th-year	KNN	0.8793	0.8759	0.8936	4.1699	0.8824	0.8794	0.8961	4.0626
	LR	0.8467	0.8413	0.8662	5.2962	0.8513	0.8454	0.8709	5.1353
	RF	0.8793	0.8718	0.8966	4.1699	0.8797	0.8721	0.8970	4.1565
	GBDT	0.9775	0.9769	0.9800	0.7777	0.9783	0.9775	0.9807	0.7509
	LDA	0.7453	0.7302	0.7901	8.7957	0.7279	0.7071	0.7834	9.3991
	SVM	0.8463	0.8393	0.8674	5.3096	0.8467	0.8394	0.8680	5.2962
	MLP	0.9274	0.9254	0.9357	2.5073	0.9328	0.9319	0.9401	2.3196
	Adaboost	0.9674	0.9664	0.9710	1.1263	0.9682	0.9671	0.9717	1.0995
	DT	0.8979	0.8951	0.9098	3.5263	0.9165	0.9141	0.9262	2.8827
	XGBoost	0.9752	0.9739	0.9780	0.8581	0.9810	0.9801	0.9831	0.6570
	LightGBM	0.9814	0.9804	0.9835	0.6436	0.9825	0.9817	0.9845	0.6034
Creator	KNN	0.7640	0.7181	0.6253	8.1497	0.7910	0.7835	0.7391	7.2181
	LR	0.9145	0.9292	0.9026	2.9535	0.9833	0.9861	0.9794	0.5763
	RF	0.9928	0.9921	0.9909	0.2497	0.9971	0.9926	0.9963	0.1009
	GBDT	0.9872	0.9850	0.9837	0.4418	0.9961	0.9964	0.9951	0.1345
	LDA	0.7392	0.6775	0.5366	9.0093	0.8170	0.7806	0.7239	6.3200
	SVM	0.8806	0.9011	0.8690	4.1253	0.9547	0.9625	0.9459	1.5656
	MLP	0.9153	0.9276	0.9022	2.9247	0.9875	0.9890	0.9844	0.4322
	Adaboost	0.9844	0.9815	0.9801	0.5379	0.9968	0.9974	0.9960	0.1105
	DT	0.9704	0.9636	0.9614	1.0229	0.9942	0.9939	0.9926	0.2017
	XGBoost	0.9867	0.9843	0.9830	0.4610	0.9971	0.9976	0.9963	0.1009
	LightGBM	0.9924	0.9917	0.9903	0.2641	0.9971	0.9976	0.9963	0.1009

Note: In the same dataset, the value of performance indicators is shown in bold if the base classifiers perform better or at least the same after the fuzzy clustering-based classifier selection method is applied.

5.3 Performance comparison between RS- and GA-based optimized classifier composition methods

To assess the effectiveness of the RS- and GA-based optimized classifier composition methods, Tables 5 and 6 report the results on the six datasets after the RS-based classifier composition and GA-based classifier compositional optimization methods were applied; Table 6 compares the ACC, AUC, F-score, and log loss. Through RS-based classifier composition, the promising classifier composition for each dataset was selected from the 26 candidate classifier compositions generated by permutation and combination of the five best performing classifiers, rather than by simply combining the best performing classifiers, as elaborated in Subsection 3.2. Table 5 shows the promising classifier composition output by the RS-based classifier composition method for each dataset, where Com 1 represents the promising classifier composition corresponding to the 1st-year dataset, Com 2 represents the promising classifier composition corresponding to the 2nd-year dataset, and so on.
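The count of 26 candidates is consistent with taking every subset of at least two of the five best performing classifiers, since C(5,2) + C(5,3) + C(5,4) + C(5,5) = 10 + 10 + 5 + 1 = 26. The sketch below enumerates these subsets. The interpretation of the 26 candidates as subsets of size two or more is an assumption of this example, and the classifier names are taken from the column headers of Table 5.

```python
from itertools import combinations

# Five best performing base classifiers (column headers of Table 5).
BEST_FIVE = ["GBDT", "DT", "AdaBoost", "XGBoost", "LightGBM"]


def candidate_compositions(classifiers, min_size=2):
    """Enumerate every subset of the classifiers with at least min_size members."""
    return [
        combo
        for size in range(min_size, len(classifiers) + 1)
        for combo in combinations(classifiers, size)
    ]


candidates = candidate_compositions(BEST_FIVE)
print(len(candidates))  # 26
```

Each candidate would then be trained and scored on the validation set so that the best one becomes the promising classifier composition for that dataset.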

Table 5
Promising classifier composition for each dataset


Dataset	Classifier composition	GBDT	DT	Adaboost	XGBoost	LightGBM
1st-year	Com 1	•			•
2nd-year	Com 2	•	•		•	•
3rd-year	Com 3				•	•
4th-year	Com 4				•	•
5th-year	Com 5	•			•	•
Creator	Com 6	•		•	•

After the promising classifier composition of each dataset was obtained, the GA-based classifier compositional optimization method was applied to optimize its parameters. In Table 6, Optimal com 1 represents the optimal classifier composition corresponding to the 1st-year dataset, Optimal com 2 represents the optimal classifier composition corresponding to the 2nd-year dataset, and so on. The values of the performance indicators are shown in bold if the optimal classifier composition performed better than the promising classifier composition on each dataset. The results show that the optimal classifier composition always performed slightly better than the promising classifier composition with default parameters. The improvement, expressed in thousandths (‰), is given in parentheses with a plus sign in Table 6; for log loss, where lower values are better, the plus sign marks a reduction. Although the improvement from GA-based classifier compositional optimization alone was marginal, the combined improvement from fuzzy clustering-based classifier selection, RS-based classifier composition, and GA-based classifier compositional optimization is considerably larger, as shown by the comparison with the benchmark model in Subsection 5.4.
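The iterative mechanism of the GA can be sketched with the settings reported in Subsection 4.3 (30 generations, 30 individuals, crossover rate 0.8, mutation rate 0.05). Everything else in the sketch is an assumption made for illustration: the real-valued encoding of parameters in [0, 1], tournament selection, single-point crossover, elitism, and above all `toy_fitness`, which is a hypothetical stand-in for the validation AUC of a classifier composition.

```python
import random

N_GENERATIONS = 30     # number of genetic iterations (Subsection 4.3)
POP_SIZE = 30          # individuals per generation (Subsection 4.3)
CROSSOVER_RATE = 0.8   # crossover rate (Subsection 4.3)
MUTATION_RATE = 0.05   # mutation rate (Subsection 4.3)


def toy_fitness(genes):
    """Hypothetical stand-in for the AUC; it peaks at 1.0 when every gene is 0.5."""
    return 1.0 - sum((g - 0.5) ** 2 for g in genes) / len(genes)


def tournament(population, fitness, rng, k=3):
    """Pick the fittest of k randomly drawn individuals (assumed selection scheme)."""
    return max(rng.sample(population, k), key=fitness)


def run_ga(fitness, n_genes=4, seed=0):
    rng = random.Random(seed)
    population = [[rng.random() for _ in range(n_genes)] for _ in range(POP_SIZE)]
    for _ in range(N_GENERATIONS):
        # Elitism: carry the current best individual over unchanged.
        next_gen = [max(population, key=fitness)[:]]
        while len(next_gen) < POP_SIZE:
            parent_a = tournament(population, fitness, rng)
            parent_b = tournament(population, fitness, rng)
            child = parent_a[:]
            if rng.random() < CROSSOVER_RATE:
                cut = rng.randrange(1, n_genes)  # single-point crossover
                child = parent_a[:cut] + parent_b[cut:]
            # Each gene is resampled with probability MUTATION_RATE.
            child = [rng.random() if rng.random() < MUTATION_RATE else g
                     for g in child]
            next_gen.append(child)
        population = next_gen
    return max(population, key=fitness)


best = run_ga(toy_fitness)
print(round(toy_fitness(best), 4))
```

In an actual run, each gene would be decoded into a classifier parameter and the fitness would be obtained by training the composition and measuring its AUC, which accounts for the longer running times of this stage.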

Table 6

Performance comparison after GA-based classifier compositional optimization method is applied

Dataset	Classifier compositions	ACC	AUC	F-score	Log loss
1st-year	Optimal com 1	0.9860 (+2.2‰)	0.9852 (+2.5‰)	0.9838 (+2.6‰)	0.4822 (+153.9‰)
	Com 1	0.9838	0.9827	0.9812	0.5564
2nd-year	Optimal com 2	0.9812 (+0.8‰)	0.9792 (+0.9‰)	0.9776 (+0.9‰)	0.6485 (+39.2‰)
	Com 2	0.9804	0.9783	0.9767	0.6739
3rd-year	Optimal com 3	0.9824 (+1.3‰)	0.9821 (+1.4‰)	0.9812 (+1.4‰)	0.6047 (+75.4‰)
	Com 3	0.9811	0.9807	0.9798	0.6503
4th-year	Optimal com 4	0.9804 (+2.0‰)	0.9803 (+2.0‰)	0.9800 (+2.1‰)	0.6749 (+69.8‰)
	Com 4	0.9784	0.9783	0.9779	0.7447
5th-year	Optimal com 5	0.9839 (+2.8‰)	0.9840 (+3.0‰)	0.9859 (+2.5‰)	0.5530 (+103.4‰)
	Com 5	0.9811	0.9810	0.9834	0.6535
Creator	Optimal com 6	0.9971 (+0.2‰)	0.9976 (+0.2‰)	0.9963 (+0.2‰)	0.1008 (+47.6‰)
	Com 6	0.9969	0.9974	0.9961	0.1056

Note: The value of performance indicators is shown in bold if the optimal classifier composition performs better than the promising classifier composition on each dataset.

Figs. 4 and 5 compare the performance on the six datasets after the GA-based classifier compositional optimization method was applied, in terms of the ACC, AUC, and F-score (Fig. 4) and the log loss (Fig. 5). They show that the optimal classifier compositions performed better than the promising classifier compositions on the same dataset.

Fig. 4

Performance comparison through the ACC, AUC, and F-score on six datasets.

Fig. 5

Performance comparison through the log loss on six datasets.

5.4 Performance comparison between the optimal classifier composition and the benchmark ensemble model

As shown in Table 7, the optimal classifier composition for each of the five yearly datasets (1st-year to 5th-year) was compared with a benchmark ensemble model for bankruptcy prediction, the EXGB model proposed by Zięba et al. [55]. Zięba et al. [55] reported only the AUC, whereas ACC, AUC, F-score, and log loss were adopted for model evaluation in this study. The AUC value is shown in bold if the optimal classifier composition performed better than the benchmark EXGB model on a dataset. The optimal classifier composition achieved a higher AUC than the EXGB model on every dataset compared. The AUC increase, expressed in thousandths (‰), is given in parentheses with a plus sign in Table 7.
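The increases reported in Table 7 correspond to the difference between the two AUC values multiplied by 1000. The short sketch below recomputes them from the tabulated values; the function name `gain_per_mille` and the dictionary layout are choices made for this example only.

```python
def gain_per_mille(new, baseline):
    """Absolute increase of an indicator, expressed in thousandths."""
    return round((new - baseline) * 1000, 1)


# AUC values from Table 7: (optimal classifier composition, EXGB).
table7_auc = {
    "1st-year": (0.9852, 0.9590),
    "2nd-year": (0.9792, 0.9440),
    "3rd-year": (0.9821, 0.9400),
    "4th-year": (0.9803, 0.9410),
    "5th-year": (0.9840, 0.9550),
}

for dataset, (ours, exgb) in table7_auc.items():
    print(f"{dataset}: +{gain_per_mille(ours, exgb)}‰")
```

The largest increase is on the 3rd-year dataset and the smallest on the 1st-year dataset.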

Table 7
Performance comparison between the optimal classifier composition and the benchmark ensemble model

Dataset	Model	ACC	AUC	F-score	Log loss
1st-year	Optimal com 1	0.9860	0.9852 (+26.2‰)	0.9838	0.4822
	EXGB	/	0.9590	/	/
2nd-year	Optimal com 2	0.9812	0.9792 (+35.2‰)	0.9776	0.6485
	EXGB	/	0.9440	/	/
3rd-year	Optimal com 3	0.9824	0.9821 (+42.1‰)	0.9812	0.6047
	EXGB	/	0.9400	/	/
4th-year	Optimal com 4	0.9804	0.9803 (+39.3‰)	0.9800	0.6749
	EXGB	/	0.9410	/	/
5th-year	Optimal com 5	0.9839	0.9840 (+29.0‰)	0.9859	0.5530
	EXGB	/	0.9550	/	/

Note: The value of AUC is shown in bold if the optimal classifier composition performs better than the benchmark EXGB model on each dataset. “/” indicates that the corresponding evaluation indicators are not presented in the study by Zięba et al. [55].

A performance comparison of the AUC between the optimal classifier composition and the benchmark EXGB model on five datasets is shown in Fig. 6. This shows that the optimal classifier composition performed better than the benchmark EXGB model on the same dataset.

Fig. 6

Performance comparison of the AUC between the optimal classifier composition and the EXGB model on five datasets.

Table 8 lists the average running times (in seconds) and their standard deviations for the fuzzy clustering-based classifier selection method, the RS-based classifier composition method, and the GA-based classifier compositional optimization method on each dataset. The standard deviations are small relative to the averages, indicating that the running times varied little across the 10 runs.

Table 8

The average running times of the proposed model for each dataset

Dataset	Method	Average running time (seconds)	Time_std (seconds)
1st-year	Fuzzy clustering-based classifier selection method	772.55	36.65
	RS-based classifier composition method	885.14	44.26
	GA-based classifier compositional optimization method	1596.15	101.65
2nd-year	Fuzzy clustering-based classifier selection method	811.39	72.69
	RS-based classifier composition method	954.08	30.86
	GA-based classifier compositional optimization method	2569.75	198.35
3rd-year	Fuzzy clustering-based classifier selection method	760.98	78.01
	RS-based classifier composition method	920.26	55.16
	GA-based classifier compositional optimization method	2467.14	204.63
4th-year	Fuzzy clustering-based classifier selection method	735.811	38.30
	RS-based classifier composition method	858.47	55.18
	GA-based classifier compositional optimization method	2374.88	120.46
5th-year	Fuzzy clustering-based classifier selection method	637.09	57.19
	RS-based classifier composition method	731.47	39.17
	GA-based classifier compositional optimization method	1731.46	98.24
Creator	Fuzzy clustering-based classifier selection method	681.91	49.51
	RS-based classifier composition method	114.49	19.64
	GA-based classifier compositional optimization method	2237.49	113.94

6 Conclusions & future work

With the development of the international credit system and the improvements in capital efficiency, the credit funds of financial institutions have become the engine of the rapid expansion of manufacturing corporates. To make credit issuance more secure and to identify high-risk credit requests, the development of an efficient classification model has been encouraged. However, large-scale imbalanced data remains a barrier to bankruptcy prediction.

In this study, a novel multi-stage ensemble model with fuzzy clustering and optimized classifier composition to achieve good corporate bankruptcy prediction was proposed, combining the fuzzy clustering-based classifier selection method, the RS-based classifier composition method, and the GA-based classifier compositional optimization method. The proposed model was adaptive and outperformed benchmark models across all indicators studied. The performance of the proposed model was evaluated using four evaluation indicators: ACC, AUC, F-score, and log loss. The experimental results demonstrated the superior performance of the proposed model.

In future studies, it is essential to reduce the overall complexity and computational cost of the model. Additionally, large corporate financial data, real-time transmissions, and highly concurrent data flow pose major challenges to real-time analysis and predictive ability [44]. Therefore, a more effective and efficient financial prediction model of corporate bankruptcy that exhibits high performance and low time complexity must be developed.

Data availability statement

The experimental data used to support the findings of this study have been deposited in the Figshare repository (https://doi.org/10.6084/m9.figshare.12911336).

Conflicts of interest

The authors declare that there is no conflict of interest regarding the publication of this article.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (No. 51875503, No. 51975512), the National Social Science Foundation of China (No. 18AJY013), and the Zhejiang Natural Science Foundation of China (No. LZ20E050001).

References

Abellán

and Mantas

C.J.

, Improving experimental studies about ensembles of classifiers for bankruptcy prediction and credit scoring, Expert Systems with Applications 41(8) (2014), 3825–3830.

Aha

D.W.

, Kibler

and Albert

M.K.

, Instance-based learning algorithms, Machine Learning 6(1) (1991), 37–66.

Alaka

H.A.

, Oyedele

L.O.

, Owolabi

H.A.

, Kumar

, Ajayi

S.O.

, Akinade

O.O.

and Bilal

, Systematic review of bankruptcy prediction models: Towards a framework for tool selection, Expert Systems with Applications 94 (2018), 164–184.

Altman

E.I.

, Financial ratios, discriminant analysis and the prediction of corporate bankruptcy, The Journal of Finance 23(4) (1968), 589–609.

Appiah

K.O.

, Chizema

and Arthur

, Predicting corporate failure: a systematic literature review of methodological issues, International Journal of Law and Management 57(5) (2015), 461–485.

Back

, Laitinen

and Sere

, Neural networks and genetic algorithms for bankruptcy predictions, Expert Systems with Applications 11(4) (1996), 407–413.

Barandiaran

, The random subspace method for constructing decision forests, IEEE Transactions on Pattern Analysis and Machine Intelligence 20(8) (1998), 832–844.

Bishop

C.M.

, Pattern Recognition and Machine Learning. New York: Springer. (2006).

Breiman

, Random forests, Machine Learning 45(1) (2001), 5–32.

10.

Cessie

S.L.

and Houwelingen

J.C.V.

, Ridge estimators in logistic regression, Journal of the Royal Statistical Society 41(1) (1992), 191–201.

11.

Chawla

N.V.

, Bowyer

K.W.

, Hall

L.O.

and Kegelmeyer

W.P.

, SMOTE: Synthetic minority over-sampling technique, Journal of Artificial Intelligence Research 16 (2002), 321–357.

12.

Chen

T.Q.

and Guestrin

, XGBoost:Ascalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining San Francisco, USA, (2016), 785–794.

13.

Choi

H.C.

, Son

H.J.

and Kim

C.W.

, Predicting financial distress of contractors in the construction industry using ensemble learning, Expert Systems with Applications 110 (2018), 1–10.

14.

Cortes

and Vapnik

, Support-vector networks, Machine Learning 20(3) (1995), 273–297.

15.

Cover

and Hart

, Nearest neighbor pattern classification, IEEE Transactions on Information Theory 13(1) (1967), 21–27.

16.

K.W.

European Journal of Operational Research 285(2) (2020), 612–630 Bock, K. Coussement and S. Lessmann, Cost-sensitive business failure prediction when misclassification costs are uncerta, A heterogeneous ensemble selection aroach.

17.

Du Jardin

, A two-stage classification technique for bankruptcy prediction, European Journal of Operational Research 254(1) (2016), 236–252.

18.

Ekinci

and Erdal

H.İ.

, Forecasting bank failure: Base learners, ensembles and hybrid ensembles, Computational Economics 49(4) (2017), 677–686.

19.

Feuerriegel

and Gordon

, News-based forecasts of macroeconomic indicators: A semantic path model for interpretable predictions, European Journal of Operational Research 272(1) (2019), 162–175.

20.

Fisher

R.A.

, The use of multiple measurements in taxonomic problems, Annals of Human Genetics 7(2) (2012), 179–188.

21.

Freund

and Schapire

R.E.

, Experiments with a new boosting algorithm. In Proceedings of the 13th International Conference on Machine Learning, Bari, Italy, (1996), 148–156.

22.

Friedman

J.H.

, Greedy function approximation: A gradient boosting machine, The Annals of Statistics 29(5) (2001), 1189–1232.

23.

García

, Marqués

A.I.

and Sánchez

J.S.

, Exploring the synergetic effects of sample types on the performance of ensembles for credit risk and corporate bankruptcy prediction, Information Fusion 47 (2019), 88–101.

24.

Hanley

J.A.

and McNeil

B.J.

, The meaning and use of the area under a receiver operating characteristic (ROC) curve, Radiology 143(1) (1982), 29–36.

25.

H.L.

, Zhang

W.Y.

and Zhang

, A novel ensemble method for credit scoring: Adaption of different imbalance ratios, Expert Systems with Applications 98 (2018), 105–117.

26.

Hewitt

, Sprague

, Yearout

, Lisnerski

and Sparks

, The effects of unequal relearning rates on estimating forgetting parameters associated with performance curves, International Journal of Industrial Ergonomics 10(3) (1992), 217–224.

27.

Y.C.

, A multivariate grey prediction model with grey relational analysis for bankruptcy prediction problems, Soft Computing 24(6) (2020), 4259–4268.

28.

Jain

, Alzubi

J.A.

, Jain

and Joshi

, Assessing risk in life insurance using ensemble learning, Journal of Intelligent & Fuzzy Systems 37(2) (2019), 2969–2980.

29.

G.L.

, Meng

, Finley

, Wang

T.F.

, Chen

, Ma

W.D.

, Ye

Q.W.

and Liu

T.Y.

, LightGBM: A highly efficient gradient boosting decision tree. In Proceedings of Annual 2017 Conference on Neural Information Processing Systems, California, USA, (2017), 3146–3154.

30. Kim H.J., Jo N.O. and Shin K.S., Optimization of cluster-based evolutionary undersampling for the artificial neural networks in corporate bankruptcy prediction, Expert Systems with Applications 59 (2016), 226–234.

31. Kirkos, Assessing methodologies for intelligent bankruptcy prediction, Artificial Intelligence Review 43(1) (2015), 83–123.

32. Kohonen, Self-organized formation of topologically correct feature maps, Biological Cybernetics 43(1) (1982), 59–69.

33. Le, Vo, Fujita, Nguyen N.T. and Baik S.W., A fast and accurate approach for bankruptcy forecasting using squared logistics loss with GPU-based extreme gradient boosting, Information Sciences 494 (2019), 294–310.

34. Li, Ying W.Y., Tuo J.Y., Li and Liu W.H., Applications of classification trees to consumer credit scoring methods in commercial banks. In Proceedings of IEEE International Conference on Systems, Man and Cybernetics, Hague, Netherlands (2004), 4112–4117.

35. Lim M.K. and Sohn S.Y., Cluster-based dynamic scoring model, Expert Systems with Applications 32(2) (2007), 427–431.

36. Lin W.Y., Hu Y.H. and Tsai C.F., Machine learning in financial crisis prediction: A survey, IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 42(4) (2011), 421–436.

37. Maldonado, Peters and Weber, Credit scoring using three-way decisions with probabilistic rough sets, Information Sciences 507 (2020), 700–714.

38. Mitchell, An Introduction to Genetic Algorithms, MIT Press (1998).

39. Murphy K.P., Machine Learning: A Probabilistic Perspective, MIT Press (2012).

40. Peihani, Basel committee on banking supervision, Brill Research Perspectives in International Banking & Securities Law 89(1) (2016), 335–347.

41. Schapire, A brief introduction to boosting. In Proceedings of the 16th International Joint Conference on Artificial Intelligence, Stockholm, Sweden (1999), 1401–1406.

42. Sculley, Web-scale k-means clustering. In Proceedings of the 19th International Conference on World Wide Web, North Carolina, USA (2010), 1177–1178.

43. Sivasankar, Selvi and Mahalakshmi, Rough set-based feature selection for credit risk prediction using weight-adjusted boosting ensemble method, Soft Computing 24(6) (2020), 3975–3988.

44. Su, Shang C.J. and Shen, A hierarchical fuzzy cluster ensemble approach and its application to big data clustering, Journal of Intelligent & Fuzzy Systems 28(6) (2015), 2409–2421.

45. Tripathi, Edla D.R., Cheruku and Kuppili, A novel hybrid credit scoring model based on ensemble feature selection and multilayer ensemble classification, Computational Intelligence 35(2) (2019), 371–394.

46. Tsai C.F., Combining cluster analysis with classifier ensembles to predict financial distress, Information Fusion 16 (2014), 46–58.

47. Verikas, Kalsyte, Bacauskiene and Gelzinis, Hybrid and ensemble-based soft computing techniques in bankruptcy prediction: a survey, Soft Computing 14(9) (2010), 995–1010.

48. Wang, Chen and Chu, A new random subspace method incorporating sentiment and textual information for financial distress prediction, Electronic Commerce Research and Applications 29 (2018), 30–49.

49. Webb G.I., Multiboosting: A technique for combining boosting and wagging, Machine Learning 40(2) (2000), 159–196.

50. Wei, Yang D.Q., Zhang W.Y. and Zhang, A novel noise-adapted two-layer ensemble model for credit scoring based on backflow learning, IEEE Access 7 (2019), 99217–99230.

51. Wolpert, Stacked generalization, Neural Networks 5(2) (1992), 241–259.

52. Zhang H.T., He H.L. and Zhang W.Y., Classifier selection and clustering with fuzzy assignment in ensemble model for credit scoring, Neurocomputing 316 (2018), 210–221.

53. Zhang W.Y., He H.L. and Zhang, A novel multi-stage hybrid model with enhanced multi-population niche genetic algorithm: An application in credit scoring, Expert Systems with Applications 121 (2019), 221–232.

54. Zhu, Zhou, Xie, Wang G.J. and Nguyen T.V., Forecasting SMEs’ credit risk in supply chain finance with an enhanced hybrid ensemble machine learning approach, International Journal of Production Economics 211 (2019), 22–33.

55. Zięba, Tomczak S.K. and Tomczak J.M., Ensemble boosted trees with synthetic features generation in application to bankruptcy prediction, Expert Systems with Applications 58 (2016), 93–101.


Table 1
Description of datasets with imbalance ratios

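| Dataset | Sample size | WRCs | BCs | IR |
|---|---|---|---|---|
| 1st-year | 7027 | 6756 | 271 | 24.93 |
| 2nd-year | 10173 | 9773 | 400 | 24.43 |
| 3rd-year | 10503 | 10008 | 495 | 20.22 |
| 4th-year | 9792 | 9277 | 515 | 18.01 |
| 5th-year | 5910 | 5500 | 410 | 13.41 |
| Creator | 35960 | 21959 | 14001 | 1.59 |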
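
The imbalance ratio (IR) column of Table 1 can be illustrated with a short sketch. It assumes that IR is the WRC count divided by the BC count, an assumption that reproduces the values listed for the five yearly datasets. The function name `imbalance_ratio` and the dictionary `yearly_counts` are hypothetical and introduced only for illustration.

```python
def imbalance_ratio(majority: int, minority: int) -> float:
    """Return majority/minority class ratio (assumed definition of IR)."""
    if minority <= 0:
        raise ValueError("minority class count must be positive")
    return majority / minority


# (WRCs, BCs) counts copied from the yearly rows of Table 1.
yearly_counts = {
    "1st-year": (6756, 271),
    "2nd-year": (9773, 400),
    "3rd-year": (10008, 495),
    "4th-year": (9277, 515),
    "5th-year": (5500, 410),
}

for name, (wrcs, bcs) in yearly_counts.items():
    # Round to two decimals, as in the table.
    print(f"{name}: IR = {imbalance_ratio(wrcs, bcs):.2f}")
```

A larger IR means fewer bankrupt samples per well-run sample, which is why the yearly datasets are considerably more imbalanced than the Creator dataset.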

Table 2
Confusion matrix

| Real result | Predicted positive | Predicted negative |
|---|---|---|
| Positive | True positives (TP) | False negatives (FN) |
| Negative | False positives (FP) | True negatives (TN) |
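
As a minimal sketch of how the four cells of Table 2 turn into evaluation indicators, the code below derives several standard metrics from TP, FN, FP, and TN. This section does not list the four indicators used in the experiments, so the metrics chosen here (accuracy, precision, recall, specificity) are an assumption, and the class name `ConfusionMatrix` and the example counts are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionMatrix:
    tp: int  # positives predicted as positive
    fn: int  # positives predicted as negative
    fp: int  # negatives predicted as positive
    tn: int  # negatives predicted as negative

    @staticmethod
    def _ratio(num: int, den: int) -> float:
        # Return 0.0 for an empty denominator instead of raising.
        return num / den if den else 0.0

    def accuracy(self) -> float:
        total = self.tp + self.fn + self.fp + self.tn
        return self._ratio(self.tp + self.tn, total)

    def precision(self) -> float:
        return self._ratio(self.tp, self.tp + self.fp)

    def recall(self) -> float:
        # Also called sensitivity or true positive rate.
        return self._ratio(self.tp, self.tp + self.fn)

    def specificity(self) -> float:
        # True negative rate.
        return self._ratio(self.tn, self.tn + self.fp)


# Made-up counts, used only to exercise the formulas.
cm = ConfusionMatrix(tp=40, fn=10, fp=5, tn=45)
print(cm.accuracy(), cm.precision(), cm.recall(), cm.specificity())
```

On imbalanced data such as the yearly datasets, accuracy alone can look high even when most minority-class samples are misclassified, so recall and specificity are usually read alongside it.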

Table 5
Promising classifier composition for each dataset

Dataset Classifier composition GBDT DT Adaboost XGBoost LightGBM

1st-year Com 1 • •

2nd-year Com 2 • • • •

3rd-year Com 3 • •

4th-year Com 4 • •

5th-year Com 5 • • •

Creator Com 6 • • •