Abstract
Credit risk assessment plays a key role in determining the banking policies and commercial strategies of financial institutions. Ensemble learning approaches have been shown to be more competitive than individual classifiers and statistical techniques for default prediction. However, most studies focus on improving overall prediction accuracy rather than on identifying actual defaulted loans. In addition, model interpretability has received little attention in previous studies. To fill these gaps, we propose a Multi-layer Multi-view Stacking Integration (MLMVS) approach to predict default risk in the P2P lending scenario. As the main innovation, our proposal uses multi-view learning and soft probability outputs to build a multi-layer integration based on stacking. The interpretable artificial intelligence tool LIME is embedded to interpret the prediction results. We perform a comprehensive analysis of MLMVS on the Lending Club dataset and compare it with a number of well-known individual classifiers and ensemble classification methods; the results demonstrate the superiority of MLMVS.
Introduction
With the development of informatization in the personal consumer finance market, a large amount of data related to personal credit performance has been collected, which plays an important role in promoting business innovation and development. The accumulated big data also provides a good foundation for the application of data-driven techniques in credit risk assessment. More and more scholars are paying attention to the enormous social value of personal credit big data and are conducting research on credit risk assessment.
Credit risk assessment is a procedure that calculates the risk associated with credit products using applicants’ credentials (such as annual income, job status and residential status). It aims to develop models that reduce financial risks and increase the related profits, which is critical for the survival of financial and non-financial institutions. In recent years, the number of personal credit defaults has increased rapidly, mainly due to the emergence of social lending platforms, known as Peer-to-Peer (P2P) lending [1]. The P2P lending industry, as a form of financial innovation, has shown a booming global trend, providing a convenient way for individuals to borrow and invest directly online without complicated procedures. However, it faces great challenges. The inherent information asymmetry, in which lenders know little about borrowers while borrowers know considerably more about their own risk levels, attracts riskier borrowers and misleads lenders into funding them, leading to higher default rates compared with bank loans [2].
Constructing an effective credit risk assessment model to predict the probability of loan default has become a crucial task in the P2P lending field [3]. Statistical methods and machine learning methods for credit risk assessment have reached maturity. Recently, ensemble learning has increasingly been selected to address the problem due to its superior performance. In general, an ensemble method consists of three sequential steps: pool generation, model selection and result combination. As a result, most of the existing literature seeks to improve credit risk ensemble models from these three aspects but remains limited in handling imbalanced datasets and in generalization ability [4]. Resampling is a widely used method to address the class imbalance problem [5]; it changes the distribution of instances among classes by randomly or purposefully taking samples from the original training set. In addition, most studies focus on improving the accuracy of overall predictions rather than the ability to identify actual defaulted loans, and some features significantly associated with default rates have not been exploited in previous studies. An interpretable model is essential in credit risk assessment because it can provide explanations that allow lending platforms to justify credit denials. Further, a model with interpretability could reduce lenders’ suspicion of statistical techniques. In this paper, we propose a novel stacking integration model named Multi-layer Multi-view Stacking Integration (MLMVS) to perform credit risk assessment, which is an imbalanced classification problem in nature. In addition, our integrated fusion strategy focuses on two aspects, i.e., accuracy and interpretability. Extensive experiments are carried out to demonstrate MLMVS’s generalization ability to identify default loans and its superiority over other popular methods.
The proposed integration model for credit risk assessment is verified to be able to handle the default prediction. For clarity, the contributions of this study are summarized as follows:
(1) We design a multi-layer integration framework to generate an effective learner. Soft probabilities, as the output form of the first-layer ensemble learners, are integrated into the second layer, which makes full use of the implicit information hidden in the decision-making process of the first layer. (2) Multi-view learning is introduced to depict instances from different perspectives, promote diversity for the outermost integration of the integration strategy and reduce complexity in the integrated model. To capture the nuances of individual consumers, each view generates different base learners through a dynamic weight selection strategy. Besides, it paves the way for the interpretability of our model. (3) An interpretable credit risk assessment visualization based on LIME plots is realized to give interpretable analysis and provide data-driven decision-making support for finance administrators.
The rest of the paper is structured as follows. Section 2 conducts a critical literature review about credit risk ensemble models. Section 3 presents the details of the proposed interpretable MLMVS approach. Section 4 describes data preparation, experimental design, experimental results and analysis in detail. Section 5 concludes the paper and proposes future work.
The main idea of credit risk assessment is to build a quantitative model based on a set of explanatory variables to estimate the credit of an applicant [6]. Credit risk classification is often described as a binary classification problem that distinguishes between good credit, which indicates that the borrower can repay the debt, and bad credit, which indicates default or another unwanted status. Estimating the probability of default is the core mission, and it can be formulated as a general population classification task. Various individual classification algorithms based on traditional statistical methods or machine learning methods have been used for credit risk assessment in the last decades. In particular, ensemble learning is more flexible in representing various functions and is widely used in the field of credit risk assessment [7, 8, 9, 10, 11, 12]. In this section, a review of recent ensemble strategies is presented, discussing approaches in terms of improving accuracy and interpretability.
We review existing approaches to improving the accuracy of ensemble learning in terms of class imbalance and diversity. Considering that the losses from bad clients far outweigh the gains from good ones [11], researchers have recognized that data imbalance has become a vital issue in credit classification. Niu et al. [13] propose a novel resampling integration model based on data distribution for imbalanced credit risk assessment in P2P lending in 2020. Shen et al. [14] propose an improved SMOTE method for imbalanced data processing and integrate an LSTM network and the AdaBoost algorithm into a unified framework. Meanwhile, increasing the diversity of integrations has been shown to be an effective way to significantly improve the predictive power of integration methods for unbalanced datasets [15].
The most common strategy to enhance diversity is to induce multiple data partitions, train different models on them with one learning algorithm and combine the models subsequently; both instance-partitioning methods and feature-partitioning methods can be used. Joseph [16] presents a variety of approaches to optimize combinations based on integration and shows that there is a natural tension between the diversity of pairs of portfolio members and individual accuracy. In order to guarantee the performance of the ensemble, Campos et al. [17] point out that classifiers should remain complementary to each other to some degree. Another notable issue is that an increase in the diversity of base classifiers may cause a decrease in accuracy [18]. Therefore, as described in Bhowan et al. [19], both accuracy and diversity should be taken into account during the generation procedure of the ensemble to gain an advantage over base classifiers. Xia et al. [20] develop an overfitting-cautious integration selection strategy by fully considering the overfitting problem in the integration selection phase. As a part of data preprocessing, feature selection algorithms have been shown to improve the performance metrics of data mining models. Nalic et al. [21] propose a new method combining five different feature selection algorithms. Diversity can also be induced using different algorithms trained on the same dataset, or using a single algorithm with different parameters applied to the same data. Many classifiers such as DT, SVM, Naive Bayes (NB) and NN-based classifiers have thus been proposed. Diwakar [22] combines feature selection and a multilayer integrated classifier framework to develop a hybrid model and shows that classifiers usually perform well for a specific dataset. As a consequence, using an ensemble classifier is a strong approach to get close to the optimal classifier for any dataset.
Diversity could also be enhanced in the ensemble stage of base classifiers: the more diverse the merged learners are, the higher the accuracy and the smaller the complexity of the final model. Xia et al. [23] propose a novel heterogeneous integration model with majority voting fusion for credit assessment, which shows effectiveness and accuracy on the P2P-B dataset. Xie et al. [24] design a hybrid model that combines deep learning and a stacking integration strategy. The traditional algorithms are fused using the stacking integration strategy to compensate for the shortcomings of a single algorithm and achieve better results.
In addition to accuracy, the interpretability of models is also crucial in the field of credit risk assessment. Some studies have attempted to reconcile interpretability with high classification accuracy in credit risk assessment. Due to the complexity of the models, these studies have mainly focused on estimating variable importance scores rather than extracting true knowledge. Practitioners are aware of this issue and need an effective tool to solve it. Therefore, developing an interpretable integration model with high accuracy is one of the most important future research topics in credit risk assessment. The proposal by Hsieh and Hung [25] is one of the first to deal with the balance between accuracy and interpretability in ensemble models. They propose a multi-stage model that hierarchically combines two sources of diversity, bagging and multi-classifier systems. Tomczak and Zieba [26] use a variant of the Boltzmann Machine to generate weights for binary feature inputs to a simple relevance-based rating scale. The Classification Restricted Boltzmann Machine (ClassRBM) is first trained as a stand-alone classifier with the ability to predict credit status but without the inclusion of an interpretable structure. To obtain an understandable model, the ClassRBM is used to evaluate the relevance of each binary feature, and these values are then used to create a rating scale (scorecard). The superior interpretability of the generated scorecards relative to more complex models is explicitly reported. In order to exploit the potential of machine learning in the credit scoring domain, the issue of interpretability must be addressed. Giorgi et al. [27] in 2022 propose to apply LIME on top of a well-performing black box algorithm. This approach preserves the enhanced predictive power of machine learning while providing meaningful explanations to the applicants involved as well as to the regulators.
Methodology
Most ensemble credit risk prediction models mainly include two steps, i.e., base learner generation and base model fusion. Previous studies have shown that ensemble models can solve the credit risk identification problem, and they have become the most popular machine learning method in the field of credit risk prediction. However, for a practical problem that attracts extensive attention, combining multiple algorithms that process different hypotheses to form a better hypothesis and make good predictions is not enough; we should also attach importance to the interpretability of the model, which is the weakness of current ensemble learning. Consequently, the fundamental issues considered in our research include (1) how to generate the base model pool, (2) how to select base learners and the meta-classifier, (3) how to make better predictions by fusing different base models, and (4) how to improve the interpretability of the integrated model. The framework of the proposed method is illustrated in Fig. 1.
Framework of the proposed model.
Data quality is crucial for model training. Previous credit risk prediction models typically concatenate the different types of borrower data collected from different sources into a single view and feed it directly to machine learning algorithms as input. The obvious disadvantage of this process is that it ignores the statistical characteristics of the data from different views; moreover, selecting too many features as inputs to the underlying model degrades the quality of the trained model.
Credit loan datasets are typical multi-view datasets, in which features fall into several groups [28]. Each feature category represents a detailed description of loan records or loan applicants from a specific perspective. This characteristic of credit loan data provides a firm foundation for the application of multi-view learning in credit risk assessment tasks. Hence, we propose the idea of multi-view partitioning, aiming to make full use of data features from different views. For credit loan data
At the core of multi-view learning are multiple independent feature subspaces, which provide compatible and complementary information and depict instances from different perspectives, making it an effective method to promote diversity when building integrated models [29]. We therefore divide the data into four complementary subspaces to train different models and verify this effect experimentally. In addition, enhancing diversity has been shown to improve the ability of ensemble learning models to deal with imbalanced classification problems. Based on this, we utilize the idea of diversity and propose multi-view learning to solve the imbalanced ensemble classification issue.
Multi-layer ensemble model
To further improve the prediction performance of our model, we propose a method for constructing a multi-layer ensemble classifier. The framework aggregates multiple heterogeneous classifiers into one classifier through multiple layers, as shown in Fig. 2. The base classifiers generated in phase 2 are located on the first layer, where the classifiers with high accuracy and high diversity are selected from multiple heterogeneous candidates based on stacking integration rules. Logistic regression acts as a meta-learner in the second layer, which aggregates the results obtained in the first layer and outputs them in the form of soft probabilities. Note that we do not use the test dataset to estimate any parameters for ensemble pruning; instead, we combine each test sample from the four views according to the classification results of the ordered and trained classifiers on the test dataset. Finally, the meta-classifier outputs from phase 2 are fed into the third layer for multi-view integration. Overall, the same training dataset is used to train different base classifiers, and the outputs of these classifiers are then aggregated to form the final predicted output for each sample.
To sum up, we propose multi-layer ensemble classifier as the model construction. The following subsections explain the critical phase of the multi-layer ensemble model.
Multi-layer ensemble model.
The basic principles and optimization functions of the machine learning methods are diverse. Therefore, considering both the diversity and accuracy, this study employs multiple machine learning methods to generate the base classifiers for credit risk assessment. We build a set of different base classifiers for each view by using generation processes of multiple classification models with different parameters or training datasets, denoted as
LR: Logistic regression [30] is a supervised machine learning model. As a discriminative model, it distinguishes between different classes (or categories).
KNN: Neighbors-based classification is a type of instance-based learning or non-generalizing learning. K-Nearest Neighbor [31] implements learning based on the
MLP: A multilayer perceptron [32] is a fully connected class of feedforward artificial neural network (ANN) with a large number of interconnected network nodes, linking feature values in various ways through linear and non-linear combinations to obtain the final recognition result.
GNB: Gaussian Naive Bayes is a classification technique based on probabilistic methods and Gaussian distributions. The category of the sample to be tested is determined by maximizing the posterior probability.
DT: Decision Trees [33] are a non-parametric supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features.
RF: Random forests [34] are an ensemble learning method for classification, regression and other tasks that operates by constructing a multitude of decision trees at training time. For classification tasks, the output of the random forest is the class selected by most trees. Random forests correct for decision trees’ habit of overfitting to their training set.
XGBoost: Extreme Gradient Boosting Decision Tree [35] is an implementation of machine learning algorithms in the Gradient Boosting framework, providing parallel tree boosting that can solve many data science problems quickly and accurately.
LightGBM: Light Gradient Boosting Machine [36] is also an improved implementation of the GBDT framework, which can be seen as an optimization of XGBoost. Its optimization mainly improves the training speed of the model; it can process high-dimensional big data and improves efficiency and scalability.
The base learners we use all have hyper-parameters that have a significant influence on the performance of the model. Therefore, grid search must be carefully performed to optimize the hyper-parameters effectively. The grid search iteratively selects the current optimal hyper-parameters and evaluates them until a pre-set iteration number is reached. In this study, to deal with the imbalanced loan data, the simplest undersampling and oversampling techniques (randomly resampling the minority class until the numbers of both classes are equal) are chosen as the base algorithm to train base learners for each view.
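The random resampling mentioned above can be sketched with the standard library alone. This is a minimal illustration under assumptions, not the authors' implementation: the function names, the fixed seed and the list-based data layout are hypothetical, and a binary label set is assumed.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority rows until both classes are equal in size."""
    rng = random.Random(seed)
    # Assumes exactly two classes are present in `labels`.
    (_, maj_n), (min_label, min_n) = Counter(labels).most_common(2)
    minority_idx = [i for i, y in enumerate(labels) if y == min_label]
    extra = [rng.choice(minority_idx) for _ in range(maj_n - min_n)]
    idx = list(range(len(labels))) + extra
    rng.shuffle(idx)
    return [rows[i] for i in idx], [labels[i] for i in idx]


def random_undersample(rows, labels, seed=0):
    """Keep a random subset of majority rows of the same size as the minority class."""
    rng = random.Random(seed)
    (maj_label, _), (min_label, min_n) = Counter(labels).most_common(2)
    majority_idx = [i for i, y in enumerate(labels) if y == maj_label]
    minority_idx = [i for i, y in enumerate(labels) if y == min_label]
    idx = rng.sample(majority_idx, min_n) + minority_idx
    rng.shuffle(idx)
    return [rows[i] for i in idx], [labels[i] for i in idx]
```

Resampling would normally be applied to the training split only, so that validation and test data keep the original class distribution.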
Based on the work done in phase 1, a set of base classifiers is trained, and the next phase is to combine these base learners into an aggregated classifier using an appropriate ensemble strategy. In this subsection, we aim at obtaining a group of base classifiers (the first layer) and a meta-classifier (the second layer) for each view based on their corresponding sub-datasets.
Base classifiers were selected based on the average accuracy, sensitivity, specificity and precision produced in phase 1, which are used to construct the ranking criteria. Details on the metrics are shown in Section 4.3. In subsequent experiments without resampling, we found that the ratio of the sensitivity and specificity scores is about 5:1, whereby Eq. (1) shows the ranking criteria we constructed. Finally, we obtain a score for each base classifier based on the entire validation set and select the set of base classifiers with the highest scores as candidates. Moreover, the stacking model is a heterogeneous ensemble, which requires the base learners to be both accurate and diverse. Therefore, model diversity should also be considered.
In the first layer, we selected
According to existing research, hard probabilities are commonly used as the output form of each view classifier, which may cause some hidden information to be lost during the model ensemble process [37]. Therefore, we adopt soft probabilities as the output form of phase 2 and hard probabilities as the output of the final integration. In addition, the final classifier must be chosen carefully because it affects the accuracy of the whole model; after several attempts, we choose random forest as the integrator for phase 3. Based on the accuracy of the outputs of phase 2, we assign each view a corresponding weight and feed the results into phase 3 of the integrated classifier (the third layer). In a binary classification problem, after obtaining the probability of each category, the final classes of instances need to be decided. The threshold is commonly set to 0.5 in most research for loan default judgment. However, when dealing with an imbalanced dataset, it is necessary to choose an optimal threshold based on the precision. To tackle this issue, this study obtains the final classification results by comparing the probabilities of the different categories.
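The three-layer structure described above can be sketched with scikit-learn. This is a simplified illustration under assumptions, not the authors' pipeline: the data are synthetic, the four column groups standing in for views are hypothetical, only three base learners are used, and the accuracy-based view weights and base-learner ranking are omitted.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the loan data (assumption).
X, y = make_classification(n_samples=400, n_features=12, n_informative=6,
                           weights=[0.75, 0.25], random_state=0)
views = [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11]]  # hypothetical view partition
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)


def make_view_stack():
    """Layers 1 and 2 for one view: heterogeneous base learners plus an LR meta-learner."""
    base = [
        ("dt", DecisionTreeClassifier(max_depth=3, random_state=0)),
        ("gnb", GaussianNB()),
        ("lr", LogisticRegression(max_iter=1000)),
    ]
    # stack_method="predict_proba" passes soft probabilities to the meta-learner.
    return StackingClassifier(estimators=base,
                              final_estimator=LogisticRegression(max_iter=1000),
                              stack_method="predict_proba", cv=3)


train_meta, test_meta = [], []
for cols in views:
    stack = make_view_stack()
    # Out-of-fold soft probabilities keep training labels from leaking into layer 3.
    oof = cross_val_predict(stack, X_tr[:, cols], y_tr, cv=3,
                            method="predict_proba")[:, 1]
    stack.fit(X_tr[:, cols], y_tr)
    train_meta.append(oof)
    test_meta.append(stack.predict_proba(X_te[:, cols])[:, 1])

Z_tr = np.column_stack(train_meta)  # one soft-probability column per view
Z_te = np.column_stack(test_meta)

# Layer 3: a random forest fuses the per-view probabilities into hard labels.
fusion = RandomForestClassifier(n_estimators=100, random_state=0).fit(Z_tr, y_tr)
y_pred = fusion.predict(Z_te)
```

Using out-of-fold probabilities for the third layer is a design choice of this sketch; the text does not state how the training inputs of the final integrator are produced.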
Model interpretability
In recent years, machine learning models, especially ensemble learning models, have improved in prediction accuracy and elapsed time. However, these methods present a major drawback: it is very difficult to understand on what grounds the algorithm makes its decision. Explaining the models with feature importance and simple variable analysis visualization alone is not enough to assess the trust level of the whole model. To address this issue, we combine the LIME (Local Interpretable Model-agnostic Explanations) method with our proposed algorithm and check its stability.
LIME [38] is a method to explain a black box model. It locally linearizes the machine learning model and uses the approximate linear model for local interpretation. Each explanation model is specific to an input point: only in the neighborhood of that point can the prediction of the explainable model be guaranteed to be very close to the prediction of the black box model. Specifically, we want to use simple models to explain complex models. The simple model here, i.e., the explainable model that behaves very similarly to the original model, can be a linear model, because we can explain the model by checking the coefficients of the linear model. LIME generates a new dataset (obtained by perturbing the specific sample) and then trains a simple, easy-to-explain model on this new dataset. We want the predictions of this simple model on the new dataset to be similar to the predictions of the complex model on that dataset. We can formulate our problem as the following:
where f stands for the original black-box model, i.e., the model that needs to be explained, and g stands for the explainable model.
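For reference, the objective in the original LIME formulation [38] is commonly written as shown below. The symbols G (the family of explainable models), \pi_x (the proximity measure around the instance x), \mathcal{L} (the local fidelity loss) and \Omega (the complexity penalty) follow that formulation and are not defined in the text above, so the notation may differ from the one intended here.

```latex
\xi(x) = \operatorname*{arg\,min}_{g \in G} \; \mathcal{L}(f, g, \pi_x) + \Omega(g)
```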
The disadvantage of LIME is that it is very sensitive to dataset dimensionality, i.e., when it is used to explain a model built with a large number of variables, the local explanation cannot distinguish between correlated and irrelevant features. In this study, although the dimension of the original data is very large, we use the sub-view to build the model. On the premise of ensuring that the data information is as complete as possible, the dimension of the feature variable will not be too large.
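The local-surrogate idea behind LIME can be illustrated with NumPy alone. The sketch below is not the `lime` library and not the authors' code: `black_box` is a hypothetical stand-in for a trained classifier, and the Gaussian perturbation scale and exponential kernel width are assumed values chosen for illustration.

```python
import numpy as np


def black_box(X):
    """Hypothetical stand-in for a trained model's positive-class probability."""
    z = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.5 * X[:, 0] * X[:, 1]
    return 1.0 / (1.0 + np.exp(-z))


def local_surrogate(f, x, n_samples=2000, scale=0.3, kernel_width=0.75, seed=0):
    """Fit a proximity-weighted linear model to f in the neighborhood of x."""
    rng = np.random.default_rng(seed)
    # 1. Perturb the instance to generate a new dataset around it.
    Z = x + rng.normal(0.0, scale, size=(n_samples, x.size))
    # 2. Query the black box on the perturbed points.
    target = f(Z)
    # 3. Weight each perturbed point by its proximity to x.
    dist = np.linalg.norm(Z - x, axis=1)
    w = np.exp(-(dist ** 2) / kernel_width ** 2)
    # 4. Weighted least squares: scale rows by sqrt(w), then solve.
    A = np.column_stack([np.ones(n_samples), Z - x])
    sw = np.sqrt(w)
    coef, *_ = np.linalg.lstsq(A * sw[:, None], target * sw, rcond=None)
    # Intercept approximates f(x); the remaining entries are local feature effects.
    return coef[0], coef[1:]


x0 = np.array([0.0, 0.0])
intercept, effects = local_surrogate(black_box, x0)
```

The signs and relative magnitudes of `effects` play the role of the bars in a LIME plot: a positive coefficient pushes the prediction toward the positive class near `x0`, a negative one pushes it away.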
Data preparation
We use data provided by Lending Club, the world’s largest P2P lending platform, to conduct experiments. The main analysis is on loans issued from January 2015 to September 2018. Before 2015, Lending Club basically maintained a 100% annualized revenue growth. After that, due to the negative impact of merger failures and illegal operations, the company’s performance declined rapidly and the revenue growth rate decreased, which caused a huge loss. It was not until 2018 that corporate profitability began to show signs of recovery. In total, the original dataset includes 1,048,576 loan records and 151 attributes. Because of the complexity of the data, we carried out a series of data preparation steps. The detailed process is described in the following sections.
Data cleaning
Data cleaning identifies and removes missing values and invalid samples from the dataset. There are 1,048,576 loan records in the raw data. These instances have seven final states: “Fully paid”, “Current”, “Default”, “Charged off”, “Late (16–30 days)”, “Late (31–120 days)” and “In grace period”. We removed samples with “Current” loan status because their final outcome cannot be inferred. For value diagnosis, 12 blank records are deleted directly. In addition, 73,841 records have missing values to varying degrees. It is worth noting that, because of the large amount of original data, we did not replace the missing values with mean or random values but discarded all samples with missing values, which would not affect the final result. Besides, we eliminated records with obvious errors. After these operations, the number of dataset records changes from 1,048,576 to 936,982, of which about 23.28% are default loan records.
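The cleaning rules above can be sketched on a list of dictionaries. This is a minimal illustration under assumptions: the record layout, the `annual_inc` field in the usage example and the treatment of `None` and empty strings as "missing" are hypothetical, and the removal of records with obvious errors is not modeled.

```python
def clean_records(records, status_key="loan_condition"):
    """Drop 'Current' loans and every record that has a missing value."""
    cleaned = []
    for rec in records:
        if rec.get(status_key) == "Current":
            continue  # final outcome cannot be inferred
        if any(v is None or v == "" for v in rec.values()):
            continue  # discard instead of imputing
        cleaned.append(rec)
    return cleaned


def default_rate(records, status_key="loan_condition"):
    """Share of records whose status is anything other than fully paid."""
    bad = sum(1 for r in records if r[status_key].strip().lower() != "fully paid")
    return bad / len(records)


raw = [
    {"loan_condition": "Fully paid", "annual_inc": 50000},
    {"loan_condition": "Current", "annual_inc": 60000},
    {"loan_condition": "Charged off", "annual_inc": None},
    {"loan_condition": "Charged off", "annual_inc": 42000},
]
cleaned = clean_records(raw)
```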
Data transformation
Some attributes need further transformation to make the data suitable for algorithms that handle numerical variables. (a) “loan_condition” represents the final status of the issued loans. As this is a binary classification task, we converted the status “Fully Paid” to 0, which indicates good loans, and set the other categories to 1, which indicates bad loans. (b) “issue_d” indicates the date of loan application. To make better use of such time data, we converted it to the number of days passed since application issuance and named it “issueDays”. (c) In addition, “emp_length” and “emp_title” are also converted to numerical values. The value of “emp_title” is converted to 0 and 1, which indicate that the borrower has and has not filled in the occupational information, respectively. The value of “emp_length” is coded from 0 to 11, where 0 means that the borrower had been working for an employer for less than 1 year when he or she obtained this loan, and 10 means that the borrower had not changed jobs for 10 years or more. The value 11 indicates missing employment length information. (d) Furthermore, the values of attributes including “grade”, “sub_grade”, “addr_state”, “purpose”, “home_ownership” and “verification_status” are non-numerical and need to be converted to integers.
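The transformations (a) to (d) can be sketched as small pure-Python helpers. Several details are assumptions rather than facts from the text: the raw string formats ("< 1 year", "10+ years", "n/a", "Sep-2018"), the reference date used to count days, and the alphabetical integer coding of categories are all hypothetical choices for illustration.

```python
from datetime import date

MONTHS = {m: i + 1 for i, m in enumerate(
    ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
     "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"])}


def encode_loan_condition(status):
    """(a) 0 for fully paid (good) loans, 1 for every other status (bad loans)."""
    return 0 if status.strip().lower() == "fully paid" else 1


def issue_days(issue_d, reference=date(2018, 10, 1)):
    """(b) Days between the issue month and a reference date (assumed)."""
    mon, year = issue_d.split("-")
    return (reference - date(int(year), MONTHS[mon], 1)).days


def encode_emp_title(title):
    """(c) 0 if occupational information was filled in, 1 otherwise."""
    return 0 if title and title.strip() else 1


def encode_emp_length(raw):
    """(c) 0 = under one year, 1..9 = years, 10 = ten or more, 11 = missing."""
    if raw is None or raw.strip().lower() in ("", "n/a"):
        return 11
    text = raw.strip()
    if text.startswith("<"):
        return 0
    return min(int(text.split()[0].rstrip("+")), 10)


def encode_categories(values):
    """(d) Map each distinct category to an integer (alphabetical order, assumed)."""
    mapping = {v: i for i, v in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values]
```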
Redundant feature removal
In this study, general approaches and domain knowledge gleaned from various sources are used to identify redundant features. The following features are excluded. (a) Features with a missing ratio of more than 40%, such as “revol_bal_joint” and “mths_since_last_record”, were deleted directly. (b) Attributes like “url” that lack useful information were removed. (c) According to the data dictionary provided by Lending Club, certain features contain the same or similar information. For example, “title” also represents the loan purpose, so we deleted it. (d) Sensitive personal or corporate information was removed to ensure data security. (e) Post-loan variables such as “total_pymnt” were deleted to prevent information about the loan outcome from leaking into the model. (f) Finally, according to field research experience, “settlement_term” and other meaningless features were deleted.
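Rules (a) and (b) above lend themselves to a short sketch. The helper names, the record layout and the example values are hypothetical; only the 40% threshold and the feature names come from the text.

```python
def missing_ratio(records, feature):
    """Fraction of records in which the feature is None or an empty string."""
    missing = sum(1 for r in records if r.get(feature) in (None, ""))
    return missing / len(records)


def drop_sparse_features(records, threshold=0.40, always_drop=()):
    """Remove features whose missing ratio exceeds the threshold, plus a manual list."""
    features = list(records[0].keys())
    drop = {f for f in features if missing_ratio(records, f) > threshold}
    drop |= set(always_drop)  # e.g. uninformative or post-loan fields
    kept = [{k: v for k, v in r.items() if k not in drop} for r in records]
    return kept, sorted(drop)


sample = [
    {"loan_amnt": 1000, "revol_bal_joint": None, "url": "x"},
    {"loan_amnt": 2000, "revol_bal_joint": None, "url": "y"},
    {"loan_amnt": 3000, "revol_bal_joint": 50, "url": "z"},
    {"loan_amnt": None, "revol_bal_joint": None, "url": "w"},
    {"loan_amnt": 5000, "revol_bal_joint": 70, "url": "v"},
]
kept, dropped = drop_sparse_features(sample, always_drop=("url",))
```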
Feature selection
In credit loan data, a borrower often has many attributes (hereinafter referred to as features), and these features can be roughly divided into three main types: relevant features, irrelevant features and redundant features. The main purpose of feature selection is to select features that are beneficial to the learning algorithm. Moreover, in practical applications, the curse of dimensionality often occurs. Selecting only some of the features to build a model can greatly reduce the running time of the learning algorithm and increase the interpretability of the model. Previous studies have shown that a single feature selection method cannot handle all classifiers and datasets well. Therefore, this paper uses several different feature ranking methods based on the idea of ensemble learning, namely information value, the Pearson correlation coefficient method and the feature-importance scores of XGBoost. The final ranking of features is calculated from three perspectives: filtering high contributing features, merging redundant features and calculating feature importance. The three methods are as follows:
Information value: Information Value (IV) indicates the degree of contribution of a feature to the target prediction, i.e., the predictive power of the feature. In general, the higher the IV, the stronger the predictive power of the feature and the higher its information contribution.
Pearson correlation coefficient: Pearson product-moment correlation is generally used to select features that are highly correlated with the class label and have low correlation with other features. Here we only consider feature-to-feature correlations, and the features with higher correlations are combined into one feature. The equation is as follows, where
Feature-importance scores of XGBoost: Similar to other tree-based classifiers, XGBoost provides feature-importance scores that measure the average reduction in the objective obtained when a specific variable is used for splitting. A variable with a higher score has a higher importance in tree building. Given that XGBoost is one of the base learners employed in this study, we employ its feature-importance scores as one of the ranking rules.
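Of the ranking methods listed above, Information Value is straightforward to sketch. The version below uses the common weight-of-evidence form, IV = sum over bins of (good share minus bad share) times ln(good share / bad share), which is an assumption since the text does not give its formula; the function name, the pre-binned input and the `eps` floor for empty cells are likewise hypothetical.

```python
import math
from collections import defaultdict


def information_value(feature_bins, labels, eps=0.5):
    """IV of a binned feature; labels use 0 for good loans and 1 for bad loans."""
    good = defaultdict(float)
    bad = defaultdict(float)
    for b, y in zip(feature_bins, labels):
        (bad if y == 1 else good)[b] += 1
    n_good = sum(good.values())
    n_bad = sum(bad.values())
    iv = 0.0
    for b in set(good) | set(bad):
        # A small floor avoids log(0) when a bin contains only one class.
        p_good = max(good[b], eps) / n_good
        p_bad = max(bad[b], eps) / n_bad
        iv += (p_good - p_bad) * math.log(p_good / p_bad)
    return iv


# A bin layout that separates the classes carries more information than one that does not.
iv_flat = information_value(["a", "a", "b", "b"], [0, 1, 0, 1])
iv_skew = information_value(["a", "a", "a", "b", "b", "b"], [0, 0, 1, 1, 1, 0])
```

Continuous features would first have to be discretized into bins; the binning scheme is not specified in the text and is left out here.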
We first deleted the variables with low IV values. Next, we calculated the Pearson correlation coefficients of the remaining 106 features and selected the pairs of variables whose Pearson correlation coefficient exceeds the threshold of 0.7. For example, we combined “funded_amnt”, “funded_amnt_inv” and “installment”, which have a correlation of 1, into one feature. Finally, we obtained the feature importance score of each feature from a trained XGBoost model. Overall, we scored the features according to the above criteria and retained 65 vital features from the original 151 features, which include 32 categorical variables and 33 continuous variables.
According to the official data dictionary and professional analysis by practitioners, the features of the Lending Club data can be manually divided into four categories: Personal Information, Loan Description, Behavioral Information and Credit History. Each category is considered a separate view, and each feature can be assigned to only one view. Specifically, “Personal Information” (view 1) refers to the borrower’s job status, family status, asset status, etc., of which “annual income” and “debt income” reflect the borrower’s repayment ability. “Loan Description” (view 2), which describes the loan information, includes the “loan purpose” and “loan amount” features. “Behavioral Information” (view 3) describes the consumer behavior of the loan applicant from external sources. “Credit History” (view 4) refers to the borrower’s bad credit history over the past few years. Finally, after multi-view partitioning, we obtained the final four view partitions, which contain 9, 14, 18 and 24 features respectively. The feature importance in each view is ranked according to the final score obtained in Section 4.1.4. Details of the feature partition and feature importance are provided in Fig. 3, where the larger the area occupied by a feature, the more important it is.
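The correlation step can be sketched as follows. The 0.7 threshold comes from the text; the use of the absolute value of the coefficient, the union-find grouping and the example columns (including the values and the `dti` column) are assumptions made for illustration.

```python
import math


def pearson(x, y):
    """Pearson product-moment correlation of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def correlated_groups(columns, threshold=0.7):
    """Group features whose pairwise |r| exceeds the threshold (union-find)."""
    names = list(columns)
    parent = {n: n for n in names}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path compression
            n = parent[n]
        return n

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if abs(pearson(columns[a], columns[b])) > threshold:
                parent[find(a)] = find(b)

    groups = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    return [g for g in groups.values() if len(g) > 1]


cols = {
    "funded_amnt": [1000, 2000, 3000, 4000],
    "funded_amnt_inv": [1000, 2000, 3000, 4000],
    "installment": [30.5, 61.0, 91.5, 122.0],
    "dti": [10, 3, 8, 5],
}
groups = correlated_groups(cols)
```

Each returned group would then be merged into a single representative feature, as described for “funded_amnt”, “funded_amnt_inv” and “installment”.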
Multi-view partitioning.
Overall accuracy is commonly used to evaluate credit evaluation models, and AUC to evaluate the correctness of label prediction. However, overall accuracy alone is not appropriate for imbalanced credit data. To compare the performance of our model and the benchmarks comprehensively, six representative evaluation measures are used: accuracy, sensitivity, specificity, precision, F1-score and area under the ROC curve (AUC). We expect these metrics to provide a comprehensive evaluation of credit scoring models. They are briefly explained as follows:
Accuracy: accuracy measures the overall validity of the model. It is the most natural classification metric, i.e., the percentage of correctly predicted outcomes in the total sample. Because of the sample imbalance, a high accuracy alone is not convincing, so it is also necessary to calculate precision, sensitivity and specificity.
Sensitivity: sensitivity (recall of positive samples) represents the probability that an actually good borrower is classified correctly.
Specificity: specificity (recall of negative samples) represents the proportion of bad borrowers that are correctly predicted as bad borrowers. Correctly identifying bad borrowers can greatly reduce the losses of lending institutions and banks. Therefore, to increase the revenue of financial institutions, it is necessary not only to improve the classification ability of the model, but also to reduce the misclassification cost of the model.
Precision: precision is the proportion of borrowers predicted as good by the model that are actually good. Like accuracy, it can be used to measure the classification ability of the model. Higher precision means that an algorithm returns more relevant results than irrelevant ones.
F1-score: the F1-score combines the precision and recall of a classifier into a single metric by taking their harmonic mean. It is primarily used to compare the performance of two classifiers. The F1-score of a classification model is calculated as follows:
$$F_1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
AUC: AUC is a comprehensive evaluation metric. It is defined as the area under the receiver operating characteristic (ROC) curve. The larger the AUC value, the better the performance of the model.
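As a concrete illustration of the six measures, the sketch below computes them from scratch on a small imbalanced toy sample. It assumes, as in the definitions above, that good borrowers are the positive class (label 1) and bad borrowers the negative class (label 0). The function names and all data values are hypothetical, and AUC is computed with the standard pairwise-ranking formulation rather than by tracing the ROC curve.

```python
def _ratio(numerator, denominator):
    """Safe division that returns 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0

def classification_metrics(y_true, y_pred):
    """Confusion-matrix metrics with good borrowers (1) as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    sensitivity = _ratio(tp, tp + fn)   # recall of good borrowers
    specificity = _ratio(tn, tn + fp)   # recall of bad borrowers
    precision = _ratio(tp, tp + fp)
    return {
        "accuracy": _ratio(tp + tn, len(y_true)),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": _ratio(2 * precision * sensitivity, precision + sensitivity),
    }

def auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return _ratio(wins, len(pos) * len(neg))

# Hypothetical imbalanced sample: eight good borrowers and two bad borrowers.
y_true = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 1, 1, 1, 1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.45, 0.3, 0.65, 0.2]

metrics = classification_metrics(y_true, y_pred)
metrics["auc"] = auc(y_true, scores)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

On this toy sample the accuracy is 0.8 while the specificity is only 0.5, which shows why accuracy alone can be misleading when bad borrowers are the minority.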
Our model MLMVS combines multi-view learning with multi-layer integration construction, forming an integration of ensembles. We consider both soft probability and view weight in the integration strategy. To demonstrate the effectiveness of MLMVS, we conduct ablation tests that remove one of its components at a time. Three key components are identified in MLMVS, namely multi-view partition, multi-layer integration and soft probability.
To verify the effect of the multi-view integration, MLMVS is compared with a single-view method named Multi-layer Single-view Ensemble (MLSVE), which does not perform view partitioning when generating the base classifiers and puts all features into the model, equivalent to skipping the process of view integration with soft probability. Table 1 shows the performance of MLMVS, MLSVE and the four views with different base learner ensembles. To better illustrate the role of view integration, we also show the results of feeding the four feature subspaces into the model separately, without integration. In terms of accuracy, none of the four views performs as well as the single-view ensemble. A possible reason is that the single-view ensemble is trained on a dataset containing all 65 features, which benefits accuracy more than training the four views separately. However, after integrating the four views, the accuracy, sensitivity, precision, F1-score and AUC of MLMVS over multiple experiments are higher than those of MLSVE. Although the specificity of MLMVS is slightly lower than that of MLSVE, its sensitivity is much higher. The reason is that specificity and sensitivity represent the recall of negative and positive samples, respectively, and the improvement of sensitivity comes at the expense of specificity. Meanwhile, we seek a model that balances accuracy and interpretability, and multi-view partition integration not only improves model performance but also meets the LIME requirement of not using too many input features. In addition, more useful information can be retained by training the base classifiers from different feature perspectives, and better decisions can be made in terms of prediction accuracy and generalization ability. Thus the prediction results are more convincing, revealing that the joint positive effect of multi-view learning can improve the interpretability of integrated classification models.
Comparison experiments of MLMVS and MLSVE
To verify the effectiveness of the multi-layer integration, we replace the first-layer ensemble with LightGBM, the classifier with the best training results in the base training pool, and name this model LightGBM-based Multi-view Stacking (lgb-MVS). LightGBM (Light Gradient Boosting Machine) is an improved implementation of the GBDT framework. It is a fast, distributed and high-performance framework based on decision trees, which can handle high-dimensional big data with improved efficiency and scalability compared to GBDT. Table 2 summarizes the average over 10 trials of the proposed model (MLMVS) and its comparison method (lgb-MVS). MLMVS clearly outperforms lgb-MVS on all measures. Multi-layer integration not only combines the strengths of the base classifiers and algorithms, but also integrates the results of multiple views, paving the way for interpretability.
Comparison experiments of MLMVS and lgb-MVS
To verify the effect of soft probability and weight assignment on the final result, experiments are conducted on MLMVS and Multi-layer Multi-view Bagging Integration (MLMVB), which employs an integration strategy consisting of hard probability and majority voting. Comparing the two methods in Table 3, we note that MLMVS outperforms MLMVB, with an obvious improvement in accuracy, sensitivity, precision, F1-score and AUC, which reveals the advantages of soft probability and weight allocation over hard probability and majority voting strategies. Similar to MLSVE, MLMVB achieves a higher specificity than MLMVS, but its sensitivity is much lower than that of MLMVS. The reason is that specificity and sensitivity represent the recall of negative and positive samples, respectively, and the improvement of sensitivity comes at the expense of specificity. Moreover, soft probability-based predictions better retain useful information and make decisions more convincing. Soft probability therefore has a positive impact on the model.
Comparison experiments of MLMVS and MLMVB
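The difference between the two integration strategies compared above can be illustrated with a small sketch. It is not the integration rule of MLMVS itself, whose weighting scheme is not reproduced here. The function names, the per-view probabilities, the view weights and the 0.5 threshold are all hypothetical values chosen to show how a weighted average of soft probabilities can keep information that hard majority voting discards.

```python
def soft_weighted_vote(view_probs, view_weights, threshold=0.5):
    """Weighted average of per-view probabilities of the positive class."""
    total_weight = sum(view_weights)
    score = sum(p * w for p, w in zip(view_probs, view_weights)) / total_weight
    label = 1 if score >= threshold else 0
    return label, score

def hard_majority_vote(view_probs, threshold=0.5):
    """Each view casts a 0/1 vote; a strict majority of 1-votes is needed for label 1."""
    votes = [1 if p >= threshold else 0 for p in view_probs]
    return 1 if 2 * sum(votes) > len(votes) else 0

# Hypothetical outputs of four views for one borrower: one confident view,
# three views that fall just below the threshold.
view_probs = [0.95, 0.45, 0.48, 0.40]
view_weights = [0.4, 0.2, 0.2, 0.2]  # assumed weights, e.g. from validation performance

soft_label, soft_score = soft_weighted_vote(view_probs, view_weights)
hard_label = hard_majority_vote(view_probs)
print("soft:", soft_label, round(soft_score, 3), "hard:", hard_label)
```

In this toy case the hard vote ignores how confident the first view is and how close the other three are to the threshold, whereas the soft strategy takes both into account and reaches a different label.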
In this section, we summarized and compared the results of the proposed method with the benchmark methods that use original data, oversampling data and undersampling data. The classification results are shown in Tables 4 and 5, where symbol “
Result of our method and benchmark techniques without resampling
From Table 4, we can see that our method achieves the second highest accuracy, which represents the true default identification rate in credit risk assessment. Furthermore, MLMVS shows the best classification ability, as it achieves the highest AUC value with statistical significance. In terms of precision, XGB achieves the best value of 0.81 but has the lowest sensitivity, indicating that this method achieves the best precision at the cost of misclassifying negative samples. It is worth noting that increasing the specificity (recall of negative samples) inevitably decreases the sensitivity (recall of positive samples). For extremely imbalanced data with a minority of negative samples, improving specificity may significantly reduce accuracy, since correct classification of some positive samples may come at the expense of misclassifying negative samples. Very high accuracy in absolute terms is not desirable when dealing with highly imbalanced classification problems. What we pursue is good comprehensive prediction performance and satisfactory generalization ability.
Although the specificity is improved by our proposed method, we want to obtain more reliable performance and reduce the risk associated with the imbalance problem, so we further test the studied method using oversampling and undersampling. The results are shown in Table 5, where “OS” denotes SMOTE oversampling and “US” denotes undersampling. We obtain the following observations:
Results of our method and benchmark methods with oversampling or undersampling
The results indicate that, for MLMVS, both oversampling and undersampling can deal with the problem of data imbalance to a certain extent and greatly improve the specificity, with oversampling showing advantages in most aspects. The specificity, although not at the highest level, is significantly improved compared to the data without resampling, with improvements of 175.86% and 134.48% for oversampling and undersampling, respectively. In terms of sensitivity, GNB shows the best results for both oversampling and undersampling, but obtains the worst specificity and accuracy. As expected, the improvement of sensitivity comes at the cost of reduced accuracy and specificity, but this does not affect our pursuit of good comprehensive prediction performance. In our experiments, almost all metrics obtained with the oversampling method are higher than those obtained with the undersampling method. After several attempts, we find that oversampling gives more stable results for rebalancing credit data and that SMOTE works best.
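The idea behind SMOTE oversampling mentioned above is to create synthetic minority samples by interpolating between a minority sample and one of its nearest minority neighbours. The sketch below is a simplified, standard-library illustration of that idea and not the implementation used in the experiments: the function name `smote_like`, the neighbour count `k`, the fixed seed and the toy points are assumptions.

```python
import random

def smote_like(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic points by interpolating between minority samples.

    For each new point, a base sample is drawn at random, one of its k nearest
    minority neighbours is chosen, and the new point is placed at a random
    position on the line segment between the two.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Other minority samples, ordered by squared Euclidean distance to the base.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic

# Hypothetical two-feature minority class (bad borrowers) with four samples,
# to be raised to the size of a majority class of ten samples.
minority = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (3.0, 3.0)]
majority_size = 10
new_points = smote_like(minority, n_new=majority_size - len(minority))
balanced_minority = minority + new_points
print(len(balanced_minority), "minority samples after oversampling")
```

Because every synthetic point lies on a segment between two real minority samples, the oversampled class stays inside the region already covered by the minority data, in contrast to undersampling, which discards majority samples.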
We test LIME on several randomly selected data points in order to understand the logic hidden in the stacking model. In Fig. 4, we report LIME explanations for the 20th user (point 20) as an example of a “good” borrower that has been correctly predicted by the MLMVS model. In the figure, “Intercept” is the intercept of the model after local linearization, “Prediction_local” is the prediction given by the locally linearized model for the explained point and “Right” is the prediction given by the original machine learning model for the same point. Different LIME settings are employed kernel
Visual analysis by LIME for point 20.
To calculate the index, LIME is applied 10 times in each view, although the available implementations allow the number of repetitions to be set as required. We take point 20 (a good borrower chosen at random) as an example to explain our model using LIME. It is worth noting that the stable LIME explanations make sense from an economic and financial standpoint. For example, in view 1, the key regressors are the verification status, namely whether annual income is verified, and the home ownership status provided by the borrower at the time of registration or obtained from a credit report. Specifically, the user is considered a good borrower because the annual income has been verified and the household has two members, i.e., married and childless. In view 2, the borrower’s loan interest rate is less than 0.18 and the borrower is working with a debt settlement company, so it is reasonable for the MLMVS model to consider the user a good borrower in terms of loan description. We also note that a “hardship_flag” of 0 signifies that the borrower is in a hardship program, so there is a 13% probability that the user is considered a bad borrower. Meanwhile, the main reason why MLMVS predicts this user to be a good borrower in terms of behavioral information in view 3 is that the current average balance of all accounts is greater than $16,969 and the total amount available to spend on revolving bank cards exceeds $12,288. Similarly, for credit information, the main reason why view 4 predicts this user to be a good borrower is that the percentage of transactions that have never been delinquent is 0.8 and the number of accounts that were 90 days or more past due in the last 24 months is 2.
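The three quantities reported in the LIME figure (“Intercept”, “Prediction_local” and “Right”) can be illustrated with a simplified local surrogate. The sketch below is not the LIME library and omits several of its steps, such as feature discretization, feature selection and ridge regularization. It only shows the core idea: perturb the explained point, weight the perturbed samples by proximity, and fit a weighted linear model. The black-box function, the two features, the Gaussian perturbation and the exponential kernel are assumptions made for the example.

```python
import math
import random

def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def local_surrogate(predict, point, n_samples=500, scale=0.5, kernel_width=1.0, seed=0):
    """Fit a proximity-weighted linear model to `predict` around `point`."""
    rng = random.Random(seed)
    d = len(point)
    rows, targets, weights = [], [], []
    for _ in range(n_samples):
        z = [v + rng.gauss(0.0, scale) for v in point]       # perturbed sample
        dist2 = sum((a - b) ** 2 for a, b in zip(z, point))
        rows.append([1.0] + z)                               # leading 1 for the intercept
        targets.append(predict(z))
        weights.append(math.exp(-dist2 / kernel_width ** 2))  # closer samples weigh more
    # Weighted normal equations: (X^T W X) beta = X^T W y.
    xtwx = [[sum(w * r[i] * r[j] for r, w in zip(rows, weights))
             for j in range(d + 1)] for i in range(d + 1)]
    xtwy = [sum(w * r[i] * y for r, w, y in zip(rows, weights, targets))
            for i in range(d + 1)]
    beta = solve_linear_system(xtwx, xtwy)
    intercept, coefs = beta[0], beta[1:]
    prediction_local = intercept + sum(c * v for c, v in zip(coefs, point))
    return intercept, coefs, prediction_local

def black_box(z):
    """Hypothetical stand-in for the trained model: probability of a good borrower."""
    return 1.0 / (1.0 + math.exp(-(1.5 * z[0] - 2.0 * z[1])))

point = [0.8, 0.2]  # hypothetical scaled feature values of one borrower
intercept, coefs, prediction_local = local_surrogate(black_box, point)
right = black_box(point)
print("Intercept:", intercept)
print("Coefficients:", coefs)
print("Prediction_local:", prediction_local, "Right:", right)
```

The signs and magnitudes of the coefficients play the role of the per-feature contributions shown in the figure, and the gap between the local prediction and the original prediction indicates how faithful the linear approximation is near the explained point.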
In this study, a novel multi-layer multi-view stacking integration (MLMVS) classification method for credit risk assessment has been proposed. The model comprises a diversity creation strategy grounded in multi-view learning, a soft probability method for integrating the ensemble members and multi-layer integration based on a stacking strategy. In addition, we carried out an interpretability analysis of MLMVS based on LIME.
We concentrate on balancing accuracy and interpretability for credit risk prediction in social lending platforms, and verify the feasibility and interpretability of the model by evaluating it on real-world P2P loan data from Lending Club. Our experimental results show that MLMVS not only achieves good accuracy and generalization ability in identifying real default loans, that is, sensitivity, but can also be explained according to actual demand. With the rapid development of the credit card market, the accuracy and interpretability of credit risk assessment are critically important to financial institutions’ profitability. Therefore, the proposed interpretable ensemble classification method is an effective and promising method for credit risk assessment.
There are a number of avenues of future work that we would like to explore. Since our results are based on a single, albeit well-known, credit risk dataset, a large number of datasets from other domains should be examined to assess the performance of the model. In addition, we will devote ourselves to improving the evaluation, investigating other strategies for addressing the data imbalance problem and further optimizing the ensemble learning mechanisms to achieve the best equilibrium between accuracy and interpretability. Finally, improving the robustness and efficiency of integration methods is a potential direction.
Acknowledgments
This work was supported by the National Natural Science Foundation of China under Grant No. 61873279 and Fundamental Research Funds for the Central Universities under Grant No. 20CX05003B.
