A multi-layer multi-view stacking model for credit risk assessment

Abstract

Credit risk assessment plays a key role in determining the banking policies and commercial strategies of financial institutions. Ensemble learning approaches have been shown to be more competitive than individual classifiers and statistical techniques for default prediction. However, most studies have focused on improving overall prediction accuracy rather than on identifying actual defaulted loans. In addition, model interpretability has received little attention in previous studies. To fill these gaps, we propose a Multi-layer Multi-view Stacking Integration (MLMVS) approach to predict default risk in the P2P lending scenario. As its main innovation, our proposal uses multi-view learning and soft probability outputs to build a multi-layer integration based on stacking. LIME, an interpretable artificial intelligence tool, is embedded to interpret the prediction results. We perform a comprehensive analysis of MLMVS on the Lending Club dataset and compare it with a number of well-known individual classifiers and ensemble classification methods; the results demonstrate the superiority of MLMVS.

Keywords

Credit risk assessment, ensemble learning, stacking, multi-view learning, interpretability

1. Introduction

With the informatization of the personal consumer finance market, a large amount of data on personal credit performance has been collected, which plays an important role in promoting business innovation and development. The accumulated data also provides a good foundation for applying data-driven techniques to credit risk assessment. More and more scholars are paying attention to the social value of personal credit data and are conducting research on credit risk assessment.

Credit risk assessment is a procedure that calculates the risk associated with credit products from an applicant’s credentials (such as annual income, job status and residential status). It aims to develop models that reduce financial risk and increase the related profits, which is critical for the survival of financial and non-financial institutions. In recent years, the number of personal credit defaults has increased rapidly, mainly due to the emergence of social lending platforms, known as Peer-to-Peer (P2P) lending [1]. The P2P lending industry, as a form of financial innovation, has shown a booming and global trend, providing a convenient way for individuals to borrow and invest directly online without complicated procedures. However, it faces great challenges. The inherent information asymmetry, in which lenders know little about borrowers while borrowers know considerably more about their own risk levels, attracts riskier borrowers and misleads lenders into funding them, leading to higher default rates than for bank loans [2].

Constructing an effective credit risk assessment model to predict the probability of loan default has become a crucial task in the P2P lending field [3]. The statistical and machine learning methods applied to credit risk assessment have reached a period of maturity. Recently, ensemble learning has increasingly been selected to address the problem because of its superior performance. In general, an ensemble method consists of three sequential steps: pool generation, model selection and result combination. As a result, most of the existing literature seeks to improve credit risk ensemble models from these three aspects, but remains limited on imbalanced datasets and in generalization ability [4]. Resampling is a widely used method to address the class imbalance problem [5]; it changes the distribution of instances among classes by randomly or purposefully taking samples from the original training set. In addition, most studies focus on improving the accuracy of overall predictions rather than the ability to identify actual defaulted loans, and some features significantly associated with default rates have not been exploited in previous studies. An interpretable model is essential in credit risk assessment, because it allows lending platforms to justify credit denials. Further, an interpretable model could reduce the suspicion that lenders hold toward statistical techniques. In this paper, we propose a novel stacking integration model named Multi-layer Multi-view Stacking Integration (MLMVS) to perform credit risk assessment, which is by nature an imbalanced classification problem. Our integrated fusion strategy focuses on two aspects, i.e., accuracy and interpretability. Extensive experiments are carried out to demonstrate the generalization ability of MLMVS in identifying default loans and its superiority over other popular methods. The proposed integration model is verified to be able to handle default prediction. For clarity, the contributions of this study are summarized as follows:

C1
We design a multi-layer integration framework to generate an effective learner. The soft probabilities output by the first-layer ensemble learners are integrated into the second layer, which makes full use of the implicit information hidden in the decision-making process of the first layer.
C2
Multi-view learning is introduced to depict instances from different perspectives, promote diversity in the outermost layer of the integration strategy and reduce the complexity of the integrated model. To capture the nuances of individual consumers, each view generates different base learners through a dynamic weight selection strategy. In addition, it paves the way for the interpretability of our model.
C3
An interpretable credit risk assessment visualization based on LIME plots is implemented to provide interpretable analysis and data-driven decision-making support for finance administrators.

The rest of the paper is structured as follows. Section 2 conducts a critical literature review about credit risk ensemble models. Section 3 presents the details of the proposed interpretable MLMVS approach. Section 4 describes data preparation, experimental design, experimental results and analysis in detail. Section 5 concludes the paper and proposes future work.

2. Related work

The main idea of credit risk assessment is to build a quantitative model based on a set of explanatory variables to estimate the creditworthiness of an applicant [6]. Credit risk classification is often described as a binary classification problem that distinguishes between good credit, which indicates that the borrower can repay the debt, and bad credit, which indicates default or another unwanted status. Estimating the probability of default is the core mission, and it can be formulated as a general classification task. Various individual classification algorithms based on traditional statistical methods or machine learning methods have been used for credit risk assessment in the last decades. In particular, ensemble learning is more flexible in representing various functions and is widely used in the field of credit risk assessment [7, 8, 9, 10, 11, 12]. In this section, we review recent ensemble strategies and discuss approaches in terms of improving accuracy and interpretability.

We first review existing approaches to improving the accuracy of ensemble learning in terms of class imbalance and diversity. Because the losses caused by bad clients far outweigh the gains from good ones [11], learning from imbalanced data has become a vital issue in credit classification. Niu et al. [13] proposed in 2020 a novel resampling integration model based on data distribution for imbalanced credit risk assessment in P2P lending. Shen et al. [14] proposed an improved SMOTE method for imbalanced data processing and combined an LSTM network and the AdaBoost algorithm in an integrated framework. Meanwhile, increasing the diversity of ensembles has been shown to be an effective way to significantly improve the predictive power of ensemble methods on imbalanced datasets [15].

The most common strategy to enhance diversity is to induce multiple data partitions, train different models with one learning algorithm and subsequently combine the models; both instance-partitioning methods and feature-partitioning methods can be used. Joseph [16] presents a variety of approaches to optimizing ensemble combinations and shows that there is a natural tension between the pairwise diversity of ensemble members and their individual accuracy. To guarantee the performance of the ensemble, Campos et al. [17] point out that classifiers should complement each other to some degree. Another notable issue is that increasing the diversity of base classifiers may decrease accuracy [18]. Therefore, as described in Bhowan et al. [19], both accuracy and diversity should be taken into account when generating the ensemble in order to gain an advantage over base classifiers. Xia et al. [20] develop an overfitting-cautious ensemble selection strategy that fully considers the overfitting problem in the selection phase. As a part of data preprocessing, feature selection algorithms have been shown to improve the performance metrics of data mining models. Nalic et al. [21] propose a new method combining five different feature selection algorithms. Diversity can also be induced by using different algorithms trained on the same dataset, or by using a single algorithm with different parameters applied to the same data. Many classifiers such as decision tree (DT), support vector machine (SVM), Naive Bayes (NB) and neural network (NN) based classifiers have thus been proposed. Diwakar [22] combines feature selection and a multilayer integrated classifier framework to develop a hybrid model and shows that individual classifiers usually perform well only for a specific dataset. As a consequence, using an ensemble classifier is a strong approach to getting close to the optimal classifier for any dataset.

Diversity can also be enhanced in the ensemble stage of base classifiers: the more diverse the merged learners are, the higher the accuracy and the smaller the complexity of the final model. Xia et al. [23] propose a novel heterogeneous integration model with majority voting fusion for credit assessment, which shows effectiveness and accuracy on the P2P-B dataset. Xie et al. [24] design a hybrid model that combines deep learning with a stacking integration strategy. The traditional algorithms are fused using stacking to compensate for the shortcomings of a single algorithm and achieve better results.

In addition to accuracy, the interpretability of models is also crucial in the field of credit risk assessment. Some studies have attempted to reconcile interpretability with high classification accuracy in credit risk assessment. Due to the complexity of the models, these studies have mainly focused on estimating variable importance scores rather than extracting true knowledge. Practitioners are aware of this issue and need an effective tool to solve it. Therefore, developing an interpretable integration model with high accuracy is one of the most important future research topics in credit risk assessment. The proposal by Hsieh and Hung [25] is one of the first that deals with the balance between accuracy and interpretability in ensemble models. They propose a multi-stage model that hierarchically combines two sources of diversity, bagging and multi-classifier systems. Tomczak and Zieba [26] use a variant of the Boltzmann Machine to generate weights for binary feature inputs to a simple relevance-based rating scale. The Classification Restricted Boltzmann Machine (ClassRBM) is first trained as a stand-alone classifier that can predict credit status but does not include an interpretable structure. To obtain an understandable model, the ClassRBM is used to evaluate the relevance of each binary feature, and these values are then used to create a rating scale (score card). The authors explicitly report that the generated score cards are more interpretable than more complex models. To exploit the potential of machine learning in the credit scoring domain, the issue of interpretability must be addressed. Giorgi et al. [27] proposed in 2022 to apply LIME on top of a well-performing black-box algorithm. This approach preserves the enhanced predictive power of machine learning while providing meaningful explanations to the applicants involved as well as to the regulators.

3. Methodology

Most ensemble credit risk prediction models include two main steps, i.e., base learner generation and base model fusion. Previous studies have applied ensemble models to the credit risk identification problem, and they have become the most popular machine learning method in the field of credit risk prediction. However, for a practical problem that attracts such extensive attention, combining multiple algorithms with different hypotheses to form a better hypothesis and make good predictions is not enough; we should also attach importance to the interpretability of the model, which is the weakness of current ensemble learning. Consequently, the fundamental issues considered in our research include (1) how to generate the base model pool, (2) how to select the base learners and the meta-classifier, (3) how to make better predictions by fusing different base models, and (4) how to improve the interpretability of the integrated model. The framework of the proposed method is illustrated in Fig. 1.

Figure 1. Framework of the proposed model.

3.1 Multi-view partitioning

Data quality is crucial for model training. Previous credit risk prediction models concatenate the different types of borrower data collected from different sources into a single view and feed it directly to machine learning algorithms. The obvious disadvantage of this process is that it ignores the statistical characteristics of the data from different views; moreover, selecting too many features as inputs to the underlying model degrades the quality of the trained model.

Credit loan datasets are typical multi-view datasets, in which features fall into several groups [28]. Each feature category gives a detailed description of loan records or loan applicants from a specific perspective. This characteristic of credit loan data provides a firm foundation for applying multi-view learning to credit risk assessment tasks. Hence, we propose the idea of multi-view partitioning, aiming to make full use of the data features from different views. For credit loan data $D$, we suppose that the features can be divided into $V$ views and denote the data matrix of $D$ as $X_{n\times d}=\{X^{1},X^{2},\ldots,X^{V}\}$, in which $n$ is the sample size and $d$ is the data dimension. Similarly, let $X^{v}_{n\times d(v)}$ be the data matrix of the $v$-th view, which has $d(v)$ dimensions. Note that credit loan datasets from different financial institutions are not exactly alike, so the views may also differ in name. The multi-view partitioning results for the dataset used in this study are shown in Section 4.2.
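As a minimal sketch of this partitioning, assuming each loan record is a dictionary of features, the snippet below splits the records into one feature matrix per view. The view names, the grouping of features and the helper `partition_views` are hypothetical and only illustrate the notation $X^{v}_{n\times d(v)}$; they are not the partition used in the paper.

```python
# Hypothetical view names and feature groupings, for illustration only;
# the partition actually used in the study is reported in its Section 4.2.
VIEWS = {
    "loan": ["funded_amnt", "installment"],
    "borrower": ["emp_length", "annual_inc"],
}


def partition_views(rows, views):
    """Split each record (a dict) into one feature matrix per view.

    Returns {view name: list of rows}; each row keeps only the d(v)
    features assigned to that view, in the listed order.
    """
    return {
        name: [[row[col] for col in cols] for row in rows]
        for name, cols in views.items()
    }


rows = [
    {"funded_amnt": 10000, "installment": 320.5, "emp_length": 3, "annual_inc": 52000},
    {"funded_amnt": 5000, "installment": 160.1, "emp_length": 10, "annual_inc": 87000},
]
X_by_view = partition_views(rows, VIEWS)
```

Each entry of `X_by_view` can then be passed to its own pool of base learners, so that no single model has to handle all $d$ features at once.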

At the core of multi-view learning are multiple independent feature subspaces, which provide compatible and complementary information and depict instances from different perspectives, making it an effective way to promote diversity when building integrated models [29]. We therefore divide the data into four complementary subspaces, train different models on them and verify this effect experimentally. In addition, enhancing diversity has been shown to improve the ability of ensemble learning models to deal with imbalanced classification problems. Based on this, we exploit diversity through multi-view learning to address the imbalanced ensemble classification issue.

3.2 Multi-layer ensemble model

To further improve the prediction performance of our model, we propose a method for constructing a multi-layer ensemble classifier. The framework aggregates multiple heterogeneous classifiers into one classifier through multiple layers, as shown in Fig. 2. The base classifiers generated in phase 2 are located in the first layer, where the best classifiers, i.e., those with high accuracy and high diversity, are selected from multiple heterogeneous classifiers according to stacking integration rules. Logistic regression acts as the meta-learner in the second layer; it aggregates the results obtained in the first layer and outputs them in the form of soft probabilities. Note that we do not use the test dataset to estimate any parameters for ensemble pruning; we only combine, for each test sample, the classification results that the ranked and trained classifiers of the four views produce on the test dataset. Finally, the meta-classifier outputs from phase 2 are fed into the third layer for multi-view integration. Overall, the same training dataset is used to train different base classifiers, and the outputs of these classifiers are then aggregated to form the final predicted output for each sample.

To sum up, the model is constructed as a multi-layer ensemble classifier. The following subsections explain the critical phases of the multi-layer ensemble model.

Figure 2. Multi-layer ensemble model.

3.2.1 The base model pool training

The basic principles and optimization functions of machine learning methods are diverse. Therefore, considering both diversity and accuracy, this study employs multiple machine learning methods to generate the base classifiers for credit risk assessment. For each view, we build a set of different base classifiers, denoted as $C_{1},C_{2},\ldots,C_{n}$, by training multiple classification models with different parameters or training datasets. The data characteristics, data quantity and data attributes differ between views. We train $n$ base learners for each feature subspace, where $n\geqslant 2$, and compare them with models trained without view partitioning. To ensure the diversity of the base learners, we employ eight different algorithms as base classifiers: Logistic Regression (LR), K-Nearest Neighbor (KNN), Multi-layer Perceptron (MLP), Gaussian Naive Bayes (GNB), Decision Tree (DT), Random Forest (RF), Extreme Gradient Boosting Decision Tree (XGBoost) and Light Gradient Boosting Machine (LightGBM).

•
LR: Logistic regression [30] is a supervised machine learning model. As a discriminative model, it distinguishes between different classes (or categories).
•
KNN: Neighbors-based classification is a type of instance-based learning or non-generalizing learning. K-Nearest Neighbor [31] implements learning based on the $k$ nearest neighbors of each query point, where $k$ is an integer value specified by the user.
•
MLP: A multilayer perceptron [32] is a fully connected class of feedforward artificial neural network (ANN) with a large number of network nodes interconnected, linking feature values in various ways through linear and non-linear combinations to obtain the final recognition result.
•
GNB: Gaussian Naive Bayes is a classification technique for machine learning based on probabilistic methods and Gaussian distributions. The category of the sample to be tested is determined by maximizing the posterior probability.
•
DT: Decision Trees [33] are a non-parametric supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features.
•
RF: Random forests [34] is an ensemble learning method for classification, regression and other tasks that operates by constructing a multitude of decision trees at training time. For classification tasks, the output of the random forest is the class selected by most trees. Random forests correct for decision trees’ habit of overfitting to their training set.
•
XGBoost: Extreme Gradient Boosting Decision Tree [35] is an implementation of machine learning algorithms in the Gradient Boosting framework, providing parallel tree boosting that can solve many data science problems quickly and accurately.
•
LightGBM: Light Gradient Boosting Machine [36] is also an improved implementation of the gradient boosting decision tree (GBDT) framework, which can be seen as an optimization of XGBoost. Its optimization mainly improves the training speed of the model; it can process high-dimensional big data and improves efficiency and scalability.

The base learners we use all have hyper-parameters that significantly influence model performance. Therefore, grid search must be carefully performed to optimize the hyper-parameters effectively. The grid search iteratively selects the current optimal hyper-parameters and evaluates them until a pre-set number of iterations is reached. In this study, to deal with the imbalanced loan data, the simplest undersampling and oversampling techniques (random resampling until both classes contain the same number of samples) are applied when training the base learners for each view.
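The random resampling described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's code: the function names, the `seed` argument and the convention that label 1 marks the minority (default) class are all hypothetical.

```python
import random


def _split_indices(y, minority):
    """Return (minority indices, majority indices) for a label list."""
    mino = [i for i, t in enumerate(y) if t == minority]
    majo = [i for i, t in enumerate(y) if t != minority]
    return mino, majo


def random_oversample(X, y, minority=1, seed=0):
    """Duplicate random minority samples until both classes are equal."""
    rng = random.Random(seed)
    mino, majo = _split_indices(y, minority)
    extra = [rng.choice(mino) for _ in range(len(majo) - len(mino))]
    idx = majo + mino + extra
    rng.shuffle(idx)
    return [X[i] for i in idx], [y[i] for i in idx]


def random_undersample(X, y, minority=1, seed=0):
    """Keep a random subset of the majority class of minority size."""
    rng = random.Random(seed)
    mino, majo = _split_indices(y, minority)
    idx = rng.sample(majo, len(mino)) + mino
    rng.shuffle(idx)
    return [X[i] for i in idx], [y[i] for i in idx]
```

Resampling should only be applied to the training portion of the data, so that validation and test sets keep the original class distribution.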

3.2.2 Ensemble based on stacking

Phase 1 yields a set of trained base classifiers; the next phase combines these base learners into an aggregated classifier using an appropriate ensemble strategy. In this subsection, we aim to obtain, for each view, a group of base classifiers (the first layer) and a meta-classifier (the second layer) based on the corresponding sub-dataset.

Base classifiers are selected according to a ranking criterion built from the average accuracy, sensitivity, specificity and precision obtained in phase 1. Details of the metrics are given in Section 4.3. In experiments without resampling, we found that the ratio of the sensitivity and specificity scores is about 5:1, which motivates the weights of the ranking criterion shown in Eq. (1). We obtain a score for each base classifier on the entire validation set and select the set of base classifiers with the highest scores as candidates. Moreover, the stacking model is a heterogeneous ensemble, which requires the base learners to be both accurate and different from each other. Therefore, model diversity should also be considered.

$\displaystyle\textit{Score}=\textit{Accuracy}+\textit{Precision}+\frac{1}{6}\textit{Sensitivity}+\frac{5}{6}\textit{Specificity}$ (1)
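A minimal sketch of Eq. (1), computed from confusion-matrix counts, is given below. It assumes that the default class is treated as the positive class (the metric definitions themselves are in Section 4.3), and the helper names `ranking_score` and `select_top_k` are hypothetical.

```python
def ranking_score(tp, fp, tn, fn):
    """Eq. (1): Accuracy + Precision + Sensitivity/6 + 5*Specificity/6.

    Assumes the default class is the positive class.
    """
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return accuracy + precision + sensitivity / 6 + 5 * specificity / 6


def select_top_k(confusions, k):
    """Rank classifiers by Eq. (1) and return the names of the k best.

    confusions maps a classifier name to its (tp, fp, tn, fn) counts
    on the validation set.
    """
    ranked = sorted(confusions, key=lambda name: ranking_score(*confusions[name]),
                    reverse=True)
    return ranked[:k]
```

This sketch ranks by score only; the diversity requirement mentioned above would need an additional check when choosing among the top-ranked candidates.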

In the first layer, we select $k_{v}$ base classifiers for each view based on the criterion, and each base classifier is evaluated using repeated 10-fold cross-validation. Stacking is more prone to overfitting than other ensemble learning methods because of the complex nonlinear transformations applied to the extracted features. To reduce the risk of overfitting, we use a simple model (logistic regression) as the meta-classifier. The meta-classifier is trained on the dataset obtained from the first layer to produce the final prediction score of the second stage.
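The two layers can be illustrated with a small, self-contained sketch. It is written under strong simplifying assumptions and is not the paper's implementation: the base learners are toy class-centroid scorers standing in for the selected classifiers, folds are assigned by index rather than by repeated randomized 10-fold cross-validation, the data are synthetic, and all function names are hypothetical. What it does show is the mechanism: out-of-fold soft probabilities from the first layer become the inputs of a logistic-regression meta-learner.

```python
import math
import random


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def fit_centroid(X, y, j):
    """Toy base learner: compares feature j with the two class means."""
    mu1 = sum(r[j] for r, t in zip(X, y) if t == 1) / sum(y)
    mu0 = sum(r[j] for r, t in zip(X, y) if t == 0) / (len(y) - sum(y))
    return lambda r: sigmoid(abs(r[j] - mu0) - abs(r[j] - mu1))


def out_of_fold(X, y, fitters, k=10):
    """First layer: n x len(fitters) matrix of out-of-fold soft probabilities."""
    n = len(X)
    Z = [[0.0] * len(fitters) for _ in range(n)]
    for fold in range(k):
        train = [i for i in range(n) if i % k != fold]
        held_out = [i for i in range(n) if i % k == fold]
        for c, fit in enumerate(fitters):
            model = fit([X[i] for i in train], [y[i] for i in train])
            for i in held_out:
                Z[i][c] = model(X[i])  # predicted by a model that never saw sample i
    return Z


def fit_logistic(Z, y, lr=0.5, epochs=300):
    """Second layer: logistic-regression meta-learner via gradient descent."""
    w, b, n = [0.0] * len(Z[0]), 0.0, len(Z)
    for _ in range(epochs):
        errs = [sigmoid(b + sum(wj * zj for wj, zj in zip(w, z))) - t
                for z, t in zip(Z, y)]
        b -= lr * sum(errs) / n
        w = [wj - lr * sum(e * z[j] for e, z in zip(errs, Z)) / n
             for j, wj in enumerate(w)]
    return lambda z: sigmoid(b + sum(wj * zj for wj, zj in zip(w, z)))


# Synthetic data: feature 0 is informative, feature 1 is pure noise.
rng = random.Random(0)
y = [i % 2 for i in range(40)]
X = [[t + rng.gauss(0, 0.5), rng.gauss(0, 1)] for t in y]
fitters = [lambda X, y: fit_centroid(X, y, 0), lambda X, y: fit_centroid(X, y, 1)]
Z = out_of_fold(X, y, fitters)       # first-layer soft probabilities
meta = fit_logistic(Z, y)            # second-layer meta-classifier
soft = [meta(z) for z in Z]          # soft probability passed on per sample
```

Training the meta-learner only on out-of-fold predictions keeps it from seeing outputs that the base learners produced on their own training samples, which is one common way to limit the overfitting risk mentioned above.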

3.2.3 View integration using soft probability

Existing research typically uses hard probabilities (i.e., class labels) as the output form of each view classifier, which may cause some hidden information to be lost during the model ensemble process [37]. Therefore, we adopt soft probability as the output form of phase 2 and hard probability as the output of the final integration. In addition, we have to determine the final classifier because it affects the accuracy of the whole model; after several attempts, we chose random forest as the integrator for phase 3. Based on the accuracy of the phase 2 outputs, we assign each view a corresponding weight and feed the results into the phase 3 integrated classifier (the third layer). In a binary classification problem, after the probability of each category has been obtained, the final class of each instance needs to be decided. Most research on loan default judgment sets the threshold to 0.5. However, when dealing with an imbalanced dataset it is necessary to choose an optimal threshold based on precision. To tackle this issue, this study obtains the final classification results by comparing the probabilities of the different categories.
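The weighting and the final decision rule can be sketched as follows. Note the substitution: the paper uses a random forest as the phase 3 integrator, whereas this sketch replaces it with a plain accuracy-weighted average so that the roles of the view weights and of the class-probability comparison are easy to follow. The function name and the example numbers are hypothetical.

```python
def integrate_views(view_probs, view_accuracies):
    """Combine per-view soft default probabilities into one hard label.

    view_probs: soft probability of default from each view (phase 2).
    view_accuracies: phase 2 accuracy of each view, used as its weight.
    A weighted average stands in for the paper's random forest integrator.
    """
    total = sum(view_accuracies)
    weights = [a / total for a in view_accuracies]
    p_default = sum(w * p for w, p in zip(weights, view_probs))
    p_good = 1.0 - p_default
    # Final decision: compare the probabilities of the two categories.
    label = 1 if p_default > p_good else 0
    return p_default, label


# Four views, as in the study; the numbers are made up for illustration.
p, label = integrate_views([0.7, 0.4, 0.6, 0.2], [0.8, 0.6, 0.7, 0.9])
```

Because the views enter as soft probabilities, a confident view can outweigh several hesitant ones, which is exactly the information that hard labels would discard.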

3.3 Model interpretability

In recent years, machine learning models, especially ensemble learning models, have improved in prediction accuracy and running time. However, these methods have a major drawback: it is very difficult to understand on what grounds the algorithm makes a decision. Explaining the models with feature importance and simple variable analysis visualization alone is not enough to assess the trustworthiness of the whole model. To address this issue, we combine the LIME (Local Interpretable Model-agnostic Explanations) method with our proposed algorithm and check its stability.

LIME [38] is a method for explaining black-box models. It locally linearizes the machine learning model and uses the approximate linear model for local interpretation. Each explanation is specific to one input point: only in the neighborhood of that point can the prediction of the explainable model be expected to be very close to the prediction of the black-box model. In other words, we use simple models to explain complex models. The simple model, i.e., the explainable model that behaves very similarly to the original model near the input point, can be a linear model, because a linear model can be explained by inspecting its coefficients. LIME generates a new dataset by perturbing the specific sample to be explained, and a simple, easy-to-explain model is then trained on this new dataset. We want the predictions of this simple model on the new dataset to be similar to the predictions of the complex model on the same dataset. The problem can be formulated as follows:

$\displaystyle\textit{explanation}(x)=\mathop{\arg\min}_{g\in G}L(f,g,\pi_{x})+\Omega(g)$ (2)

where $f$ is the original black-box model that needs to be explained, $g$ is the explainable model chosen from the family $G$, and $L$ measures how poorly $g$ approximates $f$ in the neighborhood of $x$. ${\pi}_{x}$ is the weight assigned to each perturbed sample according to its proximity to $x$, and $\Omega(g)$ is the complexity of $g$.
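To make Eq. (2) concrete, the following sketch fits a local linear surrogate for a one-dimensional black box. It is an illustration under stated assumptions, not the LIME library: $L$ is taken to be the proximity-weighted squared error, $\pi_{x}$ is an exponential kernel, $G$ is the set of lines $g(z)=a+bz$, and $\Omega(g)$ is ignored because every candidate has the same complexity. The function name and default parameters are hypothetical.

```python
import math
import random


def explain_locally(f, x, n_samples=500, scale=1.0, kernel_width=0.75, seed=0):
    """Fit g(z) = a + b*z around x by proximity-weighted least squares.

    Minimizes sum_z pi_x(z) * (f(z) - g(z))**2 over perturbed samples z,
    with pi_x(z) = exp(-(z - x)**2 / kernel_width**2).
    Returns (a, b); b is the local effect of the feature on f.
    """
    rng = random.Random(seed)
    zs = [x + rng.gauss(0, scale) for _ in range(n_samples)]   # perturbed dataset
    ws = [math.exp(-((z - x) ** 2) / kernel_width ** 2) for z in zs]
    fs = [f(z) for z in zs]                                    # black-box predictions
    sw = sum(ws)
    z_bar = sum(w * z for w, z in zip(ws, zs)) / sw
    f_bar = sum(w * v for w, v in zip(ws, fs)) / sw
    cov = sum(w * (z - z_bar) * (v - f_bar) for w, z, v in zip(ws, zs, fs))
    var = sum(w * (z - z_bar) ** 2 for w, z in zip(ws, zs))
    slope = cov / var
    return f_bar - slope * z_bar, slope


def black_box(z):
    """Stand-in for a complex model: a logistic curve."""
    return 1.0 / (1.0 + math.exp(-z))


intercept, slope = explain_locally(black_box, x=0.0)
```

The returned slope plays the role of the coefficient shown in a LIME plot: its sign and size describe how the feature pushes the prediction near the explained instance, and it can differ from one instance to another.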

The disadvantage of LIME is that it is very sensitive to dataset dimensionality: when it is used to explain a model built with a large number of variables, the local explanation cannot distinguish between correlated and irrelevant features. In this study, although the dimension of the original data is very large, we build the models on sub-views. Provided that the data information is kept as complete as possible, the number of feature variables in each model therefore does not become too large.

4. Experimental study

4.1 Data preparation

We use data provided by Lending Club, the world’s largest P2P lending platform, to conduct experiments. The analysis focuses on loans issued from January 2015 to September 2018. Before 2015, Lending Club maintained an annualized revenue growth of roughly 100%. After that, due to the negative impact of merger failures and illegal operations, the company’s performance declined rapidly and its revenue growth rate decreased, which caused a huge loss. It was not until 2018 that corporate profitability began to show signs of recovery. In total, the original dataset includes 1,048,576 loan records and 151 attributes. Because of the complexity of the data, we carried out a series of data preparation steps, which are described in the following sections.

4.1.1 Data cleaning

Data cleaning deals with the missing values and the invalid samples of the dataset. There are 1,048,576 loan records in the raw data. These instances have seven final states: “Fully paid”, “Current”, “Default”, “Charged off”, “Late (16–30 days)”, “Late (31–120 days)” and “In grace period”. We removed samples with the “Current” loan status because their final outcome cannot be inferred. For value diagnosis, 12 blank records were deleted directly. In addition, 73,841 records have missing values to varying degrees. Because of the large amount of original data, we did not replace the missing values with mean or random values but discarded all samples with missing values, which would not affect the final result. We also eliminated records with obvious errors. After these operations, the number of records changes from 1,048,576 to 936,982, of which about 23.28% are default loan records.

4.1.2 Data transformation

Some attributes need further transformation to make the data suitable for algorithms that handle numerical variables. (a) “loan_condition” represents the final status of the issued loans. As this is a binary classification task, we converted the status “Fully Paid” to 0, which indicates good loans, and set all other categories to 1, which indicates bad loans. (b) “issue_d” indicates the date of the loan application. To make better use of this time information, we converted it to the number of days passed since application issuance and named the result “issueDays”. (c) “emp_length” and “emp_title” are likewise converted to numerical codes. The value of “emp_title” is converted to 0 or 1, indicating that the borrower has or has not filled in the occupational information, respectively. The value of “emp_length” is coded from 0 to 11, where 0 means that the borrower had been working for an employer for less than 1 year when he or she obtained the loan, 10 means that the borrower had not changed jobs for 10 years or more, and 11 indicates missing employment length information. (d) Furthermore, the values of attributes including “grade”, “sub_grade”, “addr_state”, “purpose”, “home_ownership” and “verification_status” are non-numerical and need to be converted to integers.
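The encodings in (a) to (c) can be sketched as below. This is a hedged illustration: the function names are hypothetical, the raw strings such as “< 1 year” and “10+ years” are assumed to follow the usual Lending Club export format, and the reference date used for “issueDays” is an assumption because the text does not state it.

```python
from datetime import date


def encode_loan_status(status):
    """(a) 0 for a fully paid (good) loan, 1 for every other status (bad)."""
    return 0 if status.strip().lower() == "fully paid" else 1


def days_since_issue(issue_d, reference):
    """(b) Days elapsed between the issue date and an assumed reference date."""
    return (reference - issue_d).days


def encode_emp_title(raw):
    """(c) 0 if occupational information was filled in, 1 if it was not."""
    return 0 if raw and raw.strip() else 1


def encode_emp_length(raw):
    """(c) 0 for < 1 year, 1..9 for whole years, 10 for 10+ years, 11 if missing."""
    if raw is None or raw.strip().lower() in {"", "n/a"}:
        return 11
    text = raw.strip().lower()
    if text.startswith("<"):
        return 0
    if text.startswith("10"):
        return 10
    return int(text.split()[0])
```

Keeping a separate code for missing employment length, instead of dropping or imputing the value, lets a model learn whether the absence of this information is itself related to default.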

4.1.3 Redundant feature removal

In this study, general approaches and domain knowledge gleaned from various sources are used to identify redundant features. The following features are excluded. (a) Features with a missing ratio of more than 40%, such as “revol_bal_joint” and “mths_since_last_record”, were deleted directly. (b) Attributes like “url”, which lack useful information, were removed. (c) According to the data dictionary provided by Lending Club, certain features contain the same or similar information. For example, “title” also represents the loan purpose, so we deleted it. (d) Sensitive personal or corporate information was removed to ensure data security. (e) Post-loan variables such as “total_pymnt” were deleted to prevent information about the loan outcome from leaking into the model. (f) Finally, based on field research experience, “settlement_term” and other meaningless features were deleted.
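Rule (a) is simple to automate. The sketch below assumes that records are dictionaries and that a missing value appears as `None` or an empty string; both the representation and the helper names are assumptions made for illustration.

```python
def missing_ratio(records, column):
    """Fraction of records in which the column is absent, None or empty."""
    missing = sum(1 for r in records if r.get(column) in (None, ""))
    return missing / len(records)


def drop_sparse_columns(records, threshold=0.4):
    """Remove columns whose missing ratio is strictly above the threshold.

    Returns the filtered records and the sorted list of kept columns.
    """
    columns = sorted({c for r in records for c in r})
    keep = [c for c in columns if missing_ratio(records, c) <= threshold]
    return [{c: r.get(c) for c in keep} for r in records], keep
```

The remaining rules (b) to (f) rely on the data dictionary and domain knowledge, so they are better expressed as an explicit, reviewed list of column names than as an automatic filter.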

4.1.4 Feature selection

In credit loan data, a borrower often has many attributes (hereinafter referred to as features), which can be roughly divided into three main types: relevant features, irrelevant features and redundant features. The main purpose of feature selection is to select features that are beneficial to the learning algorithm. Moreover, the curse of dimensionality often arises in practical applications. Building a model on only a subset of the features can greatly reduce the running time of the learning algorithm and increase the interpretability of the model. Previous studies have shown that a single feature selection method cannot handle all classifiers and datasets well. Therefore, following the idea of ensemble learning, this paper uses several different feature ranking methods, namely information value, the Pearson correlation coefficient and the feature-importance scores of XGBoost. The final ranking of features is calculated from three perspectives: filtering high-contributing features, merging redundant features and calculating feature importance. The three methods are as follows:

•
Information value: Information Value (IV), which is used to indicate the degree of contribution of the feature to the target prediction, i.e., the predictive power of the feature. In general, the higher the IV is, the stronger the predictive power of the feature and the higher the degree of information contribution will be.
•
Pearson correlation coefficient: Pearson product-moment correlation is used to select features that are highly correlated with the class label and have low correlation with other features. Here we only consider feature-to-feature correlations and the features with higher correlations are combined into one feature. The equation is as follows, where $X$ and $Y$ represent the features and $\bar{X}$ and $\bar{Y}$ represent the mean of the samples respectively. Meanwhile, we set the threshold to 0.7 because $pv>0.7$ is usually considered to be a strong correlation between features.

$\displaystyle\textit{pv}=\frac{\sum\limits_{i=1}^{n}(X_{i}-\bar{X})(Y_{i}-\bar{Y})}{\sqrt{\sum\limits_{i=1}^{n}(X_{i}-\bar{X})^{2}}\,\sqrt{\sum\limits_{i=1}^{n}(Y_{i}-\bar{Y})^{2}}}$ (3)
•
Feature-importance scores of XGBoost: Similar to other tree-based classifiers, XGBoost provides feature-importance scores that measure the average reduction in the objective obtained when a given variable is used for splitting. A variable with a higher score is more important in tree building. Given that XGBoost is one of the base learners employed in this study, we use its feature-importance scores as the third ranking criterion.

We first deleted the features with low IV values. Next, we calculated the Pearson correlation coefficients of the remaining 106 features and identified the pairs of features whose correlation coefficient exceeds the threshold of 0.7. For example, we combined “funded_amnt”, “funded_amnt_inv” and “installment”, which have a correlation of 1, into one feature. Finally, we obtained the importance score of each feature from a trained XGBoost model. Overall, we scored the features according to the above three criteria and retained 65 vital features from the original 151 features, comprising 32 categorical variables and 33 continuous variables.
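
The two filter steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the study: the function names (`information_value`, `pearson`, `correlated_groups`) are hypothetical, the IV computation assumes the feature has already been binned and that both classes are present, the label is assumed to be 1 for good and 0 for bad borrowers, and the absolute value of the correlation is compared with the 0.7 threshold. The XGBoost importance step is omitted.

```python
import math
from collections import Counter


def information_value(bins, labels, eps=1e-6):
    """IV of one already-binned feature against a binary label (1 = good, 0 = bad)."""
    good = Counter(b for b, y in zip(bins, labels) if y == 1)
    bad = Counter(b for b, y in zip(bins, labels) if y == 0)
    n_good, n_bad = sum(good.values()), sum(bad.values())
    iv = 0.0
    for b in set(bins):
        # Share of good and bad borrowers in this bin; eps avoids log(0).
        pg = max(good[b] / n_good, eps)
        pb = max(bad[b] / n_bad, eps)
        iv += (pg - pb) * math.log(pg / pb)
    return iv


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def correlated_groups(columns, threshold=0.7):
    """Greedily group features whose |correlation| with a group's first member exceeds threshold."""
    names = list(columns)
    groups, assigned = [], set()
    for i, a in enumerate(names):
        if a in assigned:
            continue
        group = [a]
        assigned.add(a)
        for b in names[i + 1:]:
            if b not in assigned and abs(pearson(columns[a], columns[b])) > threshold:
                group.append(b)
                assigned.add(b)
        groups.append(group)
    return groups


# Toy data (invented values) mimicking three perfectly correlated loan features.
columns = {
    "funded_amnt": [1000.0, 2000.0, 3000.0, 4000.0, 5000.0],
    "funded_amnt_inv": [1000.0, 2000.0, 3000.0, 4000.0, 5000.0],
    "installment": [30.0, 60.0, 90.0, 120.0, 150.0],
    "annual_inc": [50.0, 20.0, 90.0, 10.0, 60.0],
}
groups = correlated_groups(columns)
```

Each group would then be represented by a single feature, so the three loan-amount columns collapse into one while the weakly correlated income column stays separate.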
4.2 Multi-view partitioning

According to the official data dictionary and the professional analysis of practitioners, the features of the Lending Club data can be manually divided into four categories: Personal Information, Loan Description, Behavioral Information and Credit History. Each category is treated as a separate view, and each feature is assigned to exactly one view. Specifically, “Personal Information” (view 1) refers to the borrower’s job status, family status, asset status, etc., of which “annual income” and “debt income” reflect the borrower’s repayment ability. “Loan Description” (view 2) describes the loan itself and includes features such as “loan purpose” and “loan amount”. “Behavioral Information” (view 3) describes the consumer behavior of the loan applicant obtained from external sources. “Credit History” (view 4) refers to the borrower’s bad credit history over the past few years. After multi-view partitioning, we obtained four views containing 9, 14, 18 and 24 features, respectively. The features in each view are ranked by the final score obtained in Section 4.1.4. Details of the feature partition and feature importance are provided in Fig. 3, where the larger the area occupied by a feature, the more important it is.

Figure 3.

Multi-view partitioning.
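
The constraint that each feature belongs to exactly one view can be made explicit in code. The sketch below is illustrative only: the assignment of columns to views is a hypothetical subset chosen for the example, not the full 65-feature partition of Fig. 3, and `check_partition` and `split_by_view` are names introduced here.

```python
# Hypothetical assignment of a few Lending Club columns to the four views.
VIEWS = {
    "personal_information": ["annual_inc", "verification_status", "home_ownership"],
    "loan_description": ["purpose", "loan_amnt", "int_rate"],
    "behavioral_information": ["avg_cur_bal", "bc_open_to_buy"],
    "credit_history": ["pct_tl_nvr_dlq", "num_tl_90g_dpd_24m"],
}


def check_partition(views, all_features):
    """Return a feature -> view map, raising ValueError unless the views partition all_features."""
    seen = {}
    for view, features in views.items():
        for f in features:
            if f in seen:
                raise ValueError(f"{f!r} is in both {seen[f]!r} and {view!r}")
            seen[f] = view
    missing = set(all_features) - set(seen)
    if missing:
        raise ValueError(f"features without a view: {sorted(missing)}")
    return seen


def split_by_view(record, views):
    """Project one borrower record (a dict) onto each view."""
    return {view: {f: record[f] for f in features} for view, features in views.items()}


all_features = [f for features in VIEWS.values() for f in features]
feature_to_view = check_partition(VIEWS, all_features)
```

Each projected sub-record would then be passed to the base learners trained for that view.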

4.3 Evaluation metrics

Overall accuracy is commonly used to evaluate credit scoring models, and AUC to evaluate the quality of label prediction. However, overall accuracy alone is not appropriate for imbalanced credit data. To compare the performance of our model and the benchmarks comprehensively, six representative evaluation measures are used: accuracy, sensitivity, specificity, precision, F1-score and area under the ROC curve (AUC). We expect these metrics to provide a comprehensive evaluation of credit scoring models. In the following definitions, good borrowers are the positive class and bad borrowers are the negative class, and TP, TN, FP and FN denote the numbers of true positives, true negatives, false positives and false negatives, respectively. A brief explanation of each metric follows:

•
Accuracy: accuracy measures the overall validity of the model. It is the most natural classification metric, i.e., the percentage of correctly predicted outcomes in the total sample. Because of the sample imbalance, a high accuracy alone is not convincing, so it is also necessary to calculate the precision, sensitivity and specificity.

$\displaystyle\textit{ACC}=\frac{\textit{TN}+\textit{TP}}{\textit{TN}+\textit{FP}+\textit{FN}+\textit{TP}}$ (4)
•
Sensitivity: sensitivity (recall of positive samples) represents the proportion of actual good borrowers that are correctly classified as good borrowers.

$\displaystyle\textit{Sensitivity}=\frac{\textit{TP}}{\textit{FN}+\textit{TP}}$ (5)
•
Specificity: specificity (recall of negative samples) represents the proportion of bad borrowers that are correctly predicted as bad borrowers. Correctly identifying bad borrowers can greatly reduce the losses of lending institutions and banks. Therefore, to increase the revenue of financial institutions, it is necessary not only to improve the classification ability of the model, but also to reduce the misclassification cost of the model.

$\displaystyle\textit{Specificity}=\frac{\textit{TN}}{\textit{FP}+\textit{TN}}$ (6)
•
Precision: precision represents the proportion of borrowers predicted as good by the model that are actually good. Like accuracy, it can be used to measure the classification ability of the model. Higher precision means that an algorithm returns more relevant results than irrelevant ones.

$\displaystyle\textit{Precision}=\frac{\textit{TP}}{\textit{FP}+\textit{TP}}$ (7)
•
F1-score: the F1-score combines the precision and recall of a classifier into a single metric by taking their harmonic mean, where recall is the sensitivity defined in Eq. (5). It is primarily used to compare the performance of two classifiers. The F1-score of a classification model is calculated as follows:

$\displaystyle\textit{F1-score}=2\cdot\frac{\textit{Precision}\cdot\textit{Recall}}{\textit{Precision}+\textit{Recall}}$ (8)
•
AUC: AUC is a comprehensive evaluation metric, defined as the area under the receiver operating characteristic (ROC) curve. The larger the AUC, the better the model separates good borrowers from bad borrowers.
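
The six metrics can be computed directly from the confusion-matrix counts and the predicted scores. The following is a minimal sketch rather than the evaluation code of this study: the function names are hypothetical, the counts in the example are invented, and AUC is computed through its rank interpretation (the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one, with ties counted as one half).

```python
def classification_metrics(tp, fp, tn, fn):
    """Eqs. (4)-(8) from confusion-matrix counts (positive class = good borrowers)."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    sensitivity = tp / (tp + fn)   # recall of positive samples
    specificity = tn / (tn + fp)   # recall of negative samples
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }


def auc_from_scores(scores, labels):
    """Rank-based AUC over all positive/negative pairs; O(n^2), fine for small examples."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Invented, imbalanced-looking example: many good borrowers, few correctly caught bad ones.
metrics = classification_metrics(tp=90, fp=70, tn=30, fn=10)
auc = auc_from_scores([0.9, 0.8, 0.3, 0.2], [1, 0, 1, 0])
```

In this invented example the sensitivity is high while the specificity is low, which is the pattern that makes accuracy alone misleading on imbalanced credit data.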

4.4 Ablation experiment

Our model MLMVS combines multi-view learning with multi-layer integration, that is, an integration of ensembles. The integration strategy considers both soft probability and view weights. We conduct ablation tests, removing one component of MLMVS at a time, to demonstrate the effectiveness of each. Three key components are identified in MLMVS, namely multi-view partition, multi-layer integration and soft probability.

•
To verify the effect of the multi-view integration, MLMVS is compared with a single-view method named Multi-layer Single-view Ensemble (MLSVE), which performs no view partitioning when generating the base classifiers and feeds all features into the model, which is equivalent to skipping the soft-probability view integration step. Table 1 shows the performance of MLMVS, MLSVE and the four views with different base learner ensembles. To better illustrate the role of view integration, we also report the results of feeding each of the four feature subspaces into the model separately, without integration. In terms of accuracy, none of the four individual views performs as well as the single-view ensemble. A possible reason is that the single-view ensemble is trained on all 65 features, which benefits accuracy more than training on any one view alone. However, after integrating the four views, the accuracy, sensitivity, precision, F1-score and AUC of MLMVS over multiple experiments are higher than those of MLSVE. Although the specificity of MLMVS is slightly lower than that of MLSVE (0.80 versus 0.85), its sensitivity is much higher (0.89 versus 0.75). Specificity and sensitivity represent the recall of negative and positive samples, respectively, so the improvement in sensitivity comes at the expense of specificity. Meanwhile, we seek a model that balances accuracy and interpretability, and multi-view partition integration not only improves model performance but also meets the LIME requirement of not using too many input features. In addition, training the base classifiers from different feature perspectives retains more useful information and supports better decisions in terms of prediction accuracy and generalization ability. The prediction results are thus more convincing, which suggests that the joint positive effect of multi-view learning can improve the interpretability of integrated classification models.

Table 1
Comparison experiments of MLMVS and MLSVE

Method	Accuracy	Sensitivity	Specificity	Precision	F1-score	AUC
View 1	0.7398	0.73	0.75	0.74	0.74	0.8172
View 2	0.8064	0.82	0.79	0.81	0.81	0.8875
View 3	0.6926	0.68	0.70	0.69	0.69	0.7602
View 4	0.7880	0.85	0.72	0.80	0.79	0.8666
MLSVE	0.8193	0.75	0.85	0.81	0.79	0.9016
MLMVS	0.8471	0.89	0.80	0.85	0.87	0.9227

•
To verify the effectiveness of the multi-layer integration, we replace the first-layer ensemble with LightGBM, the classifier with the best training results in the base classifier pool, and name this model LightGBM-based Multi-view Stacking (lgb-MVS). LightGBM (Light Gradient Boosting Machine) is a fast, distributed, high-performance implementation of the GBDT framework based on decision trees, which handles high-dimensional big data with better efficiency and scalability than conventional GBDT. Table 2 summarizes the average over 10 trials of the proposed model (MLMVS) and its comparison method (lgb-MVS). MLMVS clearly outperforms lgb-MVS on all measures. Multi-layer integration not only combines the strengths of the base classifiers and algorithms, but also integrates the results of multiple views, paving the way for interpretability.

Table 2
Comparison experiments of MLMVS and lgb-MVS

Method	Accuracy	Sensitivity	Specificity	Precision	F1-score	AUC
lgb-MVS	0.8379	0.84	0.79	0.82	0.82	0.9191
MLMVS	0.8471	0.89	0.80	0.85	0.87	0.9227

•
To verify the effect of soft probability and weight assignment on the final result, experiments are conducted on MLMVS and on Multi-layer Multi-view Bagging Integration (MLMVB), which employs an integration strategy consisting of hard probability and majority voting. Comparing the two methods in Table 3, we note that MLMVS outperforms MLMVB, with evident improvements in accuracy, sensitivity, precision, F1-score and AUC, which reveals the advantages of soft probability and weight allocation over hard probability and majority voting. Similar to MLSVE, MLMVB achieves a higher specificity than MLMVS (0.85 versus 0.80), but its sensitivity is much lower (0.65 versus 0.89). Because specificity and sensitivity represent the recall of negative and positive samples, respectively, the improvement in sensitivity comes at the expense of specificity. Moreover, soft probability-based predictions better retain useful information and make decisions more convincing. Soft probability therefore has a positive impact on the model.

Table 3
Comparison experiments of MLMVS and MLMVB

Method	Accuracy	Sensitivity	Specificity	Precision	F1-score	AUC
MLMVB	0.7505	0.65	0.85	0.76	0.75	0.8426
MLMVS	0.8471	0.89	0.80	0.85	0.87	0.9227
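
The difference between the two integration strategies compared in Table 3 can be illustrated with a toy example. The sketch below is an assumption-laden illustration, not the integration rule of MLMVS: the function names, the equal view weights, the 0.5 threshold and the tie-breaking rule are all introduced here for the example.

```python
def soft_weighted_vote(view_probs, view_weights, threshold=0.5):
    """Weighted average of per-view probabilities of the positive (good borrower) class."""
    total = sum(view_weights)
    p = sum(w * q for w, q in zip(view_weights, view_probs)) / total
    return p, int(p >= threshold)


def hard_majority_vote(view_probs, threshold=0.5):
    """Each view casts a 0/1 vote; a tie is resolved to 0 (an arbitrary choice for this sketch)."""
    votes = [int(q >= threshold) for q in view_probs]
    return int(2 * sum(votes) > len(votes))


# Invented probabilities from four views: two confident "good", two weak "bad".
view_probs = [0.95, 0.90, 0.45, 0.40]
view_weights = [1.0, 1.0, 1.0, 1.0]  # hypothetical equal weights

soft_p, soft_label = soft_weighted_vote(view_probs, view_weights)
hard_label = hard_majority_vote(view_probs)
```

Here the hard vote discards how confident each view is and ends in a 2-2 tie, whereas the soft average keeps that information and labels the borrower as good. This is one way to read the claim that soft probabilities retain more useful information.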

4.5 Comparison experiments

In this section, we summarize and compare the results of the proposed method with benchmark methods trained on the original data, oversampled data and undersampled data. The classification results are shown in Tables 4 and 5, where the symbol “ $+$ ” signifies that the proposed method significantly outperforms the benchmark, “ $-$ ” signifies that our model is significantly worse and “ $=$ ” denotes no significant difference between our method and the benchmark. To demonstrate the validity and superiority of our method, we first compared it with several popular benchmark classification models without resampling. Three of these are ensemble learning methods: RF, XGBoost and LightGBM. The other five are individual classifiers: DT, LR, KNN, GNB and MLP. The number of sub-base learners was set to 10 for RF and to 50 for XGBoost and LightGBM.

Table 4
Result of our method and benchmark techniques without resampling

Method	Accuracy	Sensitivity	Specificity	Precision	F1-score	AUC
LR	0.7825( $+$ )	0.96( $=$ )	0.19( $+$ )	0.68( $-$ )	0.58( $+$ )	0.7462( $+$ )
DT	0.8042( $=$ )	0.98( $=$ )	0.21( $+$ )	0.78( $=$ )	0.62( $+$ )	0.7622( $+$ )
RF	0.8066( $=$ )	0.98( $=$ )	0.21( $+$ )	0.80( $=$ )	0.62( $+$ )	0.7763( $+$ )
XGB	0.8062( $=$ )	0.90( $+$ )	0.20( $+$ )	0.81( $-$ )	0.55( $+$ )	0.7710( $+$ )
LGBM	0.8086( $=$ )	0.98( $=$ )	0.23( $+$ )	0.79( $=$ )	0.62( $+$ )	0.7786( $+$ )
KNN	0.7640( $+$ )	0.96( $=$ )	0.10( $+$ )	0.61( $+$ )	0.51( $+$ )	0.6273( $+$ )
MLP	0.8048( $=$ )	0.97( $=$ )	0.26( $+$ )	0.75( $+$ )	0.63( $+$ )	0.7644( $+$ )
GNB	0.7967( $=$ )	0.99( $-$ )	0.15( $+$ )	0.79( $=$ )	0.57( $+$ )	0.7379( $+$ )
MLMVS	0.8082	0.96	0.29	0.78	0.65	0.7802

From Table 4, we can see that our method achieves the second-highest accuracy (0.8082, just behind LightGBM at 0.8086). Furthermore, MLMVS shows the best classification ability, as it achieves the highest AUC, with statistically significant differences from every benchmark, and the highest specificity (0.29), which reflects the identification rate of true defaults. In terms of precision, XGB achieves the best value of 0.81, but it also has the lowest sensitivity, indicating that it achieves the best precision at the cost of recalling fewer positive samples. It is worth noting that increasing the specificity (recall of negative samples) generally leads to a decrease in the sensitivity (recall of positive samples). For extremely imbalanced data with a minority of negative samples, improving the specificity may significantly reduce the accuracy, since correctly classifying more negative samples may come at the expense of misclassifying positive samples. High accuracy in absolute terms is therefore not the goal when dealing with highly imbalanced classification problems. What we pursue is good comprehensive prediction performance and satisfactory generalization ability.

Although the specificity is improved by our proposed method, we want more reliable evidence of the model's performance and to reduce the risk associated with the imbalance problem, so we further test the method using oversampling and undersampling. The results are shown in Table 5, where “OS” denotes SMOTE oversampling and “US” denotes undersampling. We obtain the following observations:

Table 5
Results of our method and benchmark methods with oversampling or undersampling

Method	Accuracy	Sensitivity	Specificity	Precision	F1-score	AUC
LR $+$ US	0.6915( $+$ )	0.72( $=$ )	0.66( $+$ )	0.69( $+$ )	0.69( $=$ )	0.7667( $+$ )
DT $+$ US	0.6852( $+$ )	0.72( $=$ )	0.65( $+$ )	0.69( $+$ )	0.69( $=$ )	0.7586( $+$ )
RF $+$ US	0.6963( $+$ )	0.70( $+$ )	0.69( $=$ )	0.70( $=$ )	0.70( $=$ )	0.7722( $+$ )
XGB $+$ US	0.6939( $+$ )	0.70( $+$ )	0.69( $=$ )	0.70( $=$ )	0.69( $=$ )	0.7689( $+$ )
LGBM $+$ US	0.6991( $+$ )	0.71( $+$ )	0.69( $=$ )	0.70( $=$ )	0.70( $=$ )	0.7763( $+$ )
KNN $+$ US	0.5966( $+$ )	0.63( $+$ )	0.57( $+$ )	0.60( $+$ )	0.60( $+$ )	0.6353( $+$ )
MLP $+$ US	0.6911( $+$ )	0.72( $=$ )	0.66( $+$ )	0.69( $+$ )	0.67( $+$ )	0.7638( $+$ )
GNB $+$ US	0.5690( $+$ )	0.99( $-$ )	0.15( $+$ )	0.74( $-$ )	0.48( $+$ )	0.7344( $+$ )
MLMVS $+$ US	0.7001	0.72	0.68	0.70	0.70	0.7779
LR $+$ OS	0.7300( $+$ )	0.72( $+$ )	0.74( $+$ )	0.73( $+$ )	0.73( $+$ )	0.8021( $+$ )
DT $+$ OS	0.7760( $+$ )	0.84( $+$ )	0.72( $+$ )	0.78( $+$ )	0.77( $+$ )	0.8678( $+$ )
RF $+$ OS	0.8236( $+$ )	0.86( $+$ )	0.79( $=$ )	0.83( $+$ )	0.83( $+$ )	0.9057( $+$ )
XGB $+$ OS	0.8262( $+$ )	0.86( $+$ )	0.79( $=$ )	0.83( $+$ )	0.83( $+$ )	0.9063( $+$ )
LGBM $+$ OS	0.8344( $+$ )	0.87( $=$ )	0.79( $=$ )	0.84( $=$ )	0.86( $=$ )	0.9130( $+$ )
KNN $+$ OS	0.7182( $+$ )	0.53( $+$ )	0.91( $-$ )	0.76( $+$ )	0.71( $+$ )	0.8325( $+$ )
MLP $+$ OS	0.7243( $+$ )	0.71( $+$ )	0.74( $+$ )	0.73( $+$ )	0.73( $+$ )	0.7927( $+$ )
GNB $+$ OS	0.5319( $+$ )	1.00( $-$ )	0.07( $+$ )	0.75( $+$ )	0.40( $+$ )	0.7600( $+$ )
MLMVS $+$ OS	0.8471	0.89	0.80	0.85	0.87	0.9227

•

The results indicate that MLMVS $+$ OS consistently outperforms all benchmark methods with oversampling and undersampling in terms of accuracy, precision, F1-score and AUC. Achieving the highest accuracy and the highest AUC, with statistically significant differences, shows that MLMVS $+$ OS has the best classification ability, and the best F1-score shows that our model strikes a better balance between precision and recall. Its higher precision means that it returns more relevant results than irrelevant ones.

•

Both oversampling and undersampling alleviate the data imbalance problem to a certain extent and greatly improve the specificity, with oversampling showing advantages in most respects. Although the specificity of MLMVS is not the highest among the methods, it improved markedly relative to the data without resampling: from 0.29 to 0.80 with oversampling, a relative gain of $(0.80-0.29)/0.29\approx 175.86\%$, and from 0.29 to 0.68 with undersampling, a relative gain of $(0.68-0.29)/0.29\approx 134.48\%$.

•

In terms of sensitivity, GNB showed the best results for both oversampling and undersampling, but obtained the worst specificity and accuracy. As is expected, the improvement of sensitivity comes at the cost of the reduced accuracy and specificity, but this does not affect our pursuit of good comprehensive prediction performance.

•

In our experiments, almost all metrics obtained with oversampling are higher than those obtained with undersampling. Across repeated attempts, oversampling gave more stable results for rebalancing credit data, with SMOTE performing best.
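
For readers unfamiliar with the two resampling strategies, the sketch below shows the core idea of each in plain Python. It is a simplified illustration under stated assumptions, not the resampling code used in the experiments: `smote_like` and `random_undersample` are hypothetical names, features are assumed to be numeric, and the synthetic points are generated by interpolating between a minority sample and one of its nearest minority neighbours, which is the central step of SMOTE. Established libraries such as imbalanced-learn provide complete implementations.

```python
import math
import random


def smote_like(minority, n_new, k=5, seed=0):
    """Create n_new synthetic minority samples by interpolation (needs at least 2 samples)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the chosen sample (Euclidean distance).
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


def random_undersample(majority, n_keep, seed=0):
    """Keep a random subset of the majority class without replacement."""
    rng = random.Random(seed)
    return rng.sample(majority, n_keep)


# Invented toy data: 3 minority (bad borrower) points and 6 majority points.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
majority = [[float(i), float(i)] for i in range(2, 8)]

new_points = smote_like(minority, n_new=4, k=2)
kept = random_undersample(majority, n_keep=3)
```

Oversampling keeps every original record and adds interpolated ones, whereas undersampling discards majority records, which may help explain why the oversampled results in Table 5 are generally higher.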

4.6 Analysis of model interpretability

We test LIME on several randomly selected data points, with the purpose of understanding the logic hidden in the stacking model employed. In Fig. 4, we report LIME explanations for the 20th user (point 20) as an example of a “good” borrower that has been correctly predicted by the MLMVS model. In the figure, “Intercept” indicates the intercept of the model after local linearization, “Prediction_local” is the prediction given by the locally linearized model for the explained point and “Right” is the prediction given by the original machine learning model for the explained point. Two LIME settings (kernel $=$ 3 and kernel $=$ 1.3) are employed to demonstrate how a wrong choice of parameters may yield inconsistent explanations and how the stability indices are able to spot the instability.

Figure 4.

Visual analysis by LIME for point 20.

To calculate the stability indices, LIME is applied 10 times in each view; the available implementations allow the number of repetitions to be set as required. We take point 20 (a good borrower chosen at random) as an example to explain our model using LIME. It is worth noting that the LIME results for the stable explanation make sense from an economic and financial standpoint. For example, in view 1, the key regressors are the verification status, namely whether the annual income is verified, and the home ownership status provided by the borrower at registration or obtained from a credit report. Specifically, the user is considered a good borrower because the annual income has been verified and the household has two members, i.e., married and childless. In view 2, the borrower’s loan interest rate is less than 0.18 and the borrower is working with a debt settlement company, so it is reasonable for the MLMVS model to consider the user a good borrower in terms of loan description. We also note that a “hardship_flag” of 0 signifies that the borrower is in a hardship program, so there is a 13% probability that the user is considered a bad borrower. Meanwhile, the main reason why MLMVS predicts this user to be a good borrower in terms of behavioral information in view 3 is that the current average balance of all accounts is greater than $16,969 and the total open-to-buy amount on revolving bankcards exceeds $12,288. Similarly, for credit information, the main reason why view 4 predicts this user to be a good borrower is that the percentage of trades that have never been delinquent was 0.8 and the number of accounts that were 90 days or more past due in the last 24 months was 2.
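
The quantities reported in Fig. 4 (“Intercept”, “Prediction_local” and “Right”) can be reproduced in spirit with a small local surrogate model. The NumPy sketch below is an illustration of the general idea behind LIME, not the LIME library and not the code used in this study: `local_surrogate` and `toy_model` are hypothetical, features are assumed to be numeric and standardized, perturbations are drawn from a unit Gaussian, and an exponential kernel with a `kernel_width` parameter weights the perturbed samples.

```python
import numpy as np


def local_surrogate(predict, x, kernel_width=3.0, n_samples=1000, seed=0):
    """Fit a distance-weighted linear model to a black-box predictor around x."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    # Perturb the instance and query the black box on the perturbed points.
    Z = x + rng.normal(size=(n_samples, x.size))
    y = predict(Z)
    # Exponential kernel: perturbations closer to x receive larger weights.
    d = np.linalg.norm(Z - x, axis=1)
    w = np.exp(-(d ** 2) / kernel_width ** 2)
    # Weighted least squares with an intercept column.
    A = np.hstack([np.ones((n_samples, 1)), Z])
    sw = np.sqrt(w)
    coef, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)
    intercept, weights = coef[0], coef[1:]
    prediction_local = intercept + weights @ x   # surrogate's prediction at x
    right = predict(x[None, :])[0]               # black box's own prediction at x
    return intercept, weights, prediction_local, right


def toy_model(Z):
    """Hypothetical black box: a logistic score on two standardized features."""
    return 1.0 / (1.0 + np.exp(-(0.8 * Z[:, 0] - 0.5 * Z[:, 1])))


# Compare the explanations obtained with the two kernel settings.
explanations = {
    kw: local_surrogate(toy_model, [0.5, -0.2], kernel_width=kw)
    for kw in (3.0, 1.3)
}
```

Repeating the fit with different seeds and comparing the resulting weights is one simple way to see whether an explanation is stable for a given kernel setting.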

5. Conclusion

In this study, a novel multi-layer multi-view stacking integration (MLMVS) classification method for credit risk assessment has been proposed. The model combines a diversity creation strategy grounded in multi-view learning, a soft-probability method for integrating the ensemble members and multi-layer integration based on the stacking strategy. In addition, we conducted an interpretability analysis of MLMVS based on LIME.

We concentrate on balancing accuracy and interpretability for credit risk prediction in social lending platforms, and verify the feasibility and interpretability of the model by evaluating it on real-world P2P loan data from Lending Club. Our experimental results show that MLMVS not only achieves good accuracy and generalization ability in identifying real default loans, that is, sensitivity, but also can be explained according to actual demand. With the rapid development of the credit card market, the accuracy and interpretability of credit risk assessment are critically important to financial institutions’ profitability. Therefore, the proposed interpretable ensemble classification method is an effective and promising method for credit risk assessment.

There are a number of avenues of future work that we would like to explore. Since our results are based on a single, albeit well-known, credit risk dataset, more datasets from other domains should be examined to assess the performance of the model. In addition, we will devote ourselves to improving the evaluation, investigating other strategies for addressing the data imbalance problem and further optimizing the ensemble learning mechanisms to achieve the best equilibrium between accuracy and interpretability. Finally, improving the robustness and efficiency of integration methods is a potential direction.

Acknowledgments

This work was supported by the National Natural Science Foundation of China under Grant No. 61873279 and Fundamental Research Funds for the Central Universities under Grant No. 20CX05003B.
