Abstract
Credit scoring (CS) is an important process in both banking and finance. Lenders and creditors use CS to predict the probability that a borrower will default or become delinquent. CS is usually based on variables related to the applicant, such as age, payment history and behavior. This paper first proposes a new method for variable selection. The proposed method (VS-VNS) is based on the variable neighborhood search meta-heuristic and selects a set of significant variables for the data classification task. VS-VNS is then combined with a Bayesian network (BN) to build models for CS and select counterparties. Further, six search methods are studied for BN on different sets of variables. The different techniques and combinations are evaluated on well-known financial datasets. The numerical results are promising and show the benefits of the proposed approach (VS-VNS) for data classification and credit scoring.
Keywords
Introduction
Variable selection, also called attribute selection or feature selection, is the operation that selects a subset of relevant or significant variables to be used in the data classification task. It is a pre-processing step performed before the classification task is launched. In this work, we are interested in variable selection and classification for credit scoring (CS).
As shown by Mester (1997), CS is a crucial problem in financial institutions and banks. In order to select counterparties, financial institutions need reliable techniques to distinguish between "bad" and "good" counterparties and to decide whether credit will be granted. For example, in banks, lenders or creditors use CS to predict the probability that a borrower will default or become delinquent. CS is generally based on variables related to applicants in order to evaluate their creditworthiness. These variables can include the age of the applicant, payment history, guarantees, default rates and so on.
Various studies in finance and banking have shown the importance of CS. For instance, credit scores are among the most powerful predictors of risk, as shown by Miller (2003). CS provides an objective analysis of the applicant’s creditworthiness, which reduces discrimination and credit risk. It can be used to decide whether or not to grant credit to an applicant. Further, CS can be used in corporate and collection scorecards.
To handle CS, researchers have used several techniques. For example, Abdou (2009) proposed a genetic programming technique for CS. Desay et al. (1996) compared neural networks and linear scoring models in the credit union environment. Henley and Hand (1996) used a k-Nearest Neighbor (k-NN) classifier for assessing consumer credit risk. Bellotti and Crook (2009) proposed a support vector machine (SVM) approach for credit scoring and the discovery of significant features. Also, Sousaa et al. (2015) proposed a dynamic modeling framework for credit risk assessment.
Statistical methods have also been studied for CS. Examples include the linear regression proposed by Hand and Henley (1997), the decision trees proposed by Quinlan (1987) and the classification and regression trees (CART) studied by Breiman et al. (1984). Wiginton (1980) proposed methods based on discriminant analysis and logistic regression, which are among the most broadly established statistical techniques used to classify clients as “good” or “bad”. Also, Friedman et al. (1997) proposed Bayesian networks (BN) that may be used to build models for CS.
In this work, we study the impact of variable selection and search method on CS when combined with BN. This paper makes two main contributions. First, we propose a new variable selection method called VS-VNS, which is based on the variable neighborhood search meta-heuristic and selects a significant set of variables for the data classification task. This new technique is compared to two filtering methods to assess its performance. Second, we study the impact of different search methods on BN when combined with variable selection methods for CS.
Extensive experiments are conducted on four credit datasets to evaluate the performance of the different proposed combinations and techniques for CS. We perform the credit scoring task on the Australian, German, Japanese and the large "Give me some credit" datasets. We discuss the performance of the different combinations using various metrics.
The rest of this paper is organized as follows: Section 2 gives a background on BN and the search methods considered in this research. Section 3 details the proposed new technique for variable selection and the different combinations studied in this work. Section 4 presents the empirical studies on the four credit datasets. Finally Section 5 concludes and gives some perspectives.
Background
The aim of this section is to give an overview of some important concepts used in this study.
Bayes Networks (BN)
A Bayes network (BN), also called a Bayesian network or belief network, is a well-known machine learning technique. A BN is a statistical model that combines a directed acyclic graph of nodes and links with a set of conditional probability tables. As shown by Friedman et al. (1997) and John and Langley (1995), a BN is a probabilistic graphical model that represents a set of random variables and their conditional dependencies via a directed acyclic graph. The default BN uses the "K2" search method, which is a greedy algorithm. The latter is run several times with random orderings of the variables.
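As a reminder of the standard textbook definition (the notation below is ours and does not come from the paper), a BN with directed acyclic graph G over the variables X_1, ..., X_n factorizes the joint distribution into one conditional probability table per node, where Pa_G(X_i) denotes the parents of X_i in G. When the BN is used as a classifier with class variable C, the predicted class is the one with the highest posterior probability given the observed variables:

```latex
P(X_1, \dots, X_n) \;=\; \prod_{i=1}^{n} P\bigl(X_i \mid \mathrm{Pa}_G(X_i)\bigr),
\qquad
\hat{c} \;=\; \arg\max_{c} \; P\bigl(C = c \mid X_1 = x_1, \dots, X_n = x_n\bigr)
```

The search methods discussed in this paper differ in how they explore the space of graphs G; the factorization itself is the same for all of them.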
Search method and meta-heuristics
Meta-heuristics are computational search techniques that have been used successfully to solve optimization problems in several areas. Meta-heuristics can be divided into two main categories: the population-based methods and the single-solution-oriented methods (also called trajectory methods). The population-based methods maintain and evolve a population of solutions. The trajectory methods work on a single current solution. Among the population-based methods for optimization problems, we cite genetic algorithms used by Gen (2006) and evolutionary computation used by Li et al. (2011). Among the trajectory methods, we find local search proposed by Hansen (1986), Hoos and Boutilier (2000) and Boughaci et al. (2010), simulated annealing developed by Kirkpatrick et al. (1983) and tabu search proposed by Glover (1989).
In this work, we are interested in the trajectory methods, especially in hill climbing, tabu search and simulated annealing.
Proposed method and combinations
Variable selection is a pre-processing step that can be launched before any classification task. It selects the variables to be used for data classification by removing redundant variables and variables deemed irrelevant to the task. Several methods have been studied for variable selection. They can be divided into two main categories: the wrapper methods proposed by Kohavi and John (1996) and the filtering methods studied by Caruana and Freitag (1994).
For the filtering methods, we choose the best-first search proposed by mich21 and the information gain ranking filter proposed by Caruana and Freitag (1994), both from the WEKA package, which is available at Waikato (2017). In the rest of this section, we detail the new proposed variable selection method VS-VNS for data classification. We first give the variable vector representation used in VS-VNS. Then we detail the main components of the proposed method.
The variable vector representation
The aim of variable selection is to search for a significant set of variables to be used with the classifier in the classification task. The selection can be represented as a binary vector of length n, where n is the number of variables in the dataset. If a variable is selected, the value 1 is assigned to its position; otherwise the value 0 is assigned. For example, Fig. 1 represents an assignment for a dataset of seven variables where the second, the third and the sixth variables are selected.

Fig. 1. The variable vector representation.
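The binary encoding described above can be illustrated with a few lines of Python. This is only a minimal sketch: the helper names `encode_selection` and `decode_selection` are hypothetical and do not come from the paper's Java/WEKA implementation.

```python
from typing import List


def encode_selection(n_variables: int, selected: List[int]) -> List[int]:
    """Build a 0/1 vector of length n_variables.

    `selected` holds 1-based variable positions (hypothetical convention,
    chosen to match the "second, third and sixth" wording of the text).
    """
    mask = [0] * n_variables
    for position in selected:
        mask[position - 1] = 1
    return mask


def decode_selection(mask: List[int]) -> List[int]:
    """Return the 1-based positions of the selected variables."""
    return [i + 1 for i, bit in enumerate(mask) if bit == 1]


# The seven-variable example from the text: variables 2, 3 and 6 are selected.
mask = encode_selection(7, [2, 3, 6])
print(mask)                    # [0, 1, 1, 0, 0, 1, 0]
print(decode_selection(mask))  # [2, 3, 6]
```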
The proposed VS-VNS is a variable selection method based on variable neighborhood search (VNS). VNS is a local search meta-heuristic that works on a set of different neighborhoods. The basic idea is a systematic change among k neighborhoods combined with a local search, as shown by Mladenovic and Hansen (1997).
Neighbor solutions are generated by randomly adding or deleting a variable from the variable vector of size n. We use three neighborhood structures (k = 3), which are:
N1: where the neighbor solution x′ of the solution x is obtained by modifying only
N2: where the neighbor solution x′ of the solution x is obtained by modifying
N3: where the neighbor solution x′ of the solution x is obtained by modifying
To avoid exploiting the same region repeatedly, the proposed VS-VNS performs a certain number of local steps that combine intensification and diversification strategies to locate a good solution. The intensification step is applied with a fixed probability wp > 0 and the diversification step with probability (1 - wp), where wp is fixed empirically. Step 1 (intensification) selects the neighbor solution with the best objective function value. Step 2 (diversification) selects a random neighbor solution.
The proposed VS-VNS is combined with BN. The overall method starts with an initial solution that contains all the variables and then iteratively tries to find a good solution in the whole neighborhood. A BN classifier is built for each candidate solution constructed by VS-VNS. The solution is evaluated according to both accuracy and ROC (the area under the ROC curve). That is, the solution quality is measured by the objective function f = (Accuracy + ROC)/2. The objective is to find a subset of variables from the dataset that maximizes f. The VS-VNS process is repeated for a number of iterations max_iterations fixed empirically. The overall VS-VNS algorithm for variable selection is sketched in Algorithm 1.
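To make the interplay of neighborhoods, intensification and diversification more concrete, the following Python sketch shows one possible reading of the procedure. It is not the authors' implementation and rests on several assumptions that are not stated in the text: a neighbor in N_k is assumed to differ from the current solution in k randomly chosen positions; only a small random sample of each neighborhood is examined (`sample_size` is hypothetical); the neighborhood index returns to 1 after an improvement and advances otherwise, as in basic VNS; and `evaluate` is a hypothetical callable standing in for "train a BN on the selected variables and return (Accuracy + ROC)/2".

```python
import random
from typing import Callable, List, Optional

Mask = List[int]


def neighbor(x: Mask, k: int, rng: random.Random) -> Mask:
    """ASSUMPTION: a neighbor in N_k flips k randomly chosen positions of x."""
    y = x[:]
    for i in rng.sample(range(len(x)), min(k, len(x))):
        y[i] = 1 - y[i]
    if not any(y):  # keep at least one variable selected
        y[rng.randrange(len(y))] = 1
    return y


def vs_vns(n: int, evaluate: Callable[[Mask], float],
           max_iterations: int = 10, wp: float = 0.6, k_max: int = 3,
           sample_size: int = 5, seed: Optional[int] = None) -> Mask:
    """Sketch of a VNS-style variable selection loop (assumptions in lead-in)."""
    rng = random.Random(seed)
    current = [1] * n  # initial solution: all variables, as in the text
    best, best_score = current[:], evaluate(current)
    k = 1
    for _ in range(max_iterations):
        candidates = [neighbor(current, k, rng) for _ in range(sample_size)]
        if rng.random() < wp:
            # Step 1 (intensification): neighbor with the best objective value.
            current = max(candidates, key=evaluate)
        else:
            # Step 2 (diversification): a random neighbor.
            current = rng.choice(candidates)
        score = evaluate(current)
        if score > best_score:
            best, best_score = current[:], score
            k = 1                # improvement: return to the first neighborhood
        else:
            k = k % k_max + 1    # no improvement: move to the next neighborhood
    return best


def toy_objective(mask: Mask) -> float:
    """Made-up stand-in for (Accuracy + ROC)/2; rewards variables 2, 3 and 6."""
    relevant = (1, 2, 5)  # 0-based indices
    return sum(mask[i] for i in relevant) / 3 - 0.05 * sum(mask)


selected = vs_vns(7, toy_objective, seed=0)
print(selected, toy_objective(selected))
```

The default values of `max_iterations` and `wp` mirror the settings reported later in the experiments. In a real setting each call to `evaluate` trains a classifier, so caching the scores of already visited masks would avoid repeated training.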
Combination techniques
The second contribution of this paper is the study of the impact of the search method on BN when combined with variable selection methods. We study six variants of BN for CS: BN with K2, BN with hill climbing, BN with repeated hill climbing, BN with tabu search (TS), BN with simulated annealing (SA) and BN with tree-augmented naive Bayes (TAN). For each variant, we consider the three variable selection methods: the best-first search given by Kohavi and John (1996), the information gain ranking filter given by Caruana and Freitag (1994) and our new VS-VNS method. The numerical results are detailed in the next section.
Empirical study for credit scoring
All experiments were run on an Intel Core(TM) i5-2217U CPU@1.70 GHz with 6 GB of RAM under Windows 8, 64 bits, processor x64. The source code is written in Java under NetBeans IDE 8.2 and using the WEKA machine learning package.
Datasets description
We perform the credit scoring task on four financial datasets: the German, Australian and Japanese datasets available from the UCI (University of California at Irvine) Machine Learning Repository, and the “Give me some Credit” dataset from Kaggle. The description of the four datasets is given in Table 1.
Table 1. Description of the datasets used in the study
Tables 2 to 5 give the summary statistics of the quantitative variables of the Australian, German, Japanese and “Give me some credit” datasets, respectively. The columns Min, Max, Mean and stdDev give the minimum value, the maximum value, the average value and the standard deviation.
Table 2. Summary statistics of quantitative variables of the Australian dataset
Table 3. Summary statistics of quantitative variables of the German dataset
Table 4. Summary statistics of quantitative variables of the Japanese Credit dataset
Table 5. Summary statistics of quantitative variables of the Give me some Credit dataset
We used both a training/test split and 10-fold cross-validation to evaluate the models. The experiments (not reported here) showed that the split partition is more effective than cross-validation in our case. Consequently, the evaluation technique considered in this study is to run the BN classifier on the training data to get a model, and then to apply this model to the test data to predict the class. The examples are partitioned into training and test sets in the proportion of approximately 66.6% to 33.4%, respectively.
We use several metrics to evaluate the performance of credit scoring models. Table 6 gives the confusion matrix, where True Positives (TP) is the number of positive examples labeled as positive, False Positives (FP) is the number of negative examples labeled as positive, True Negatives (TN) is the number of negative examples labeled as negative, and False Negatives (FN) is the number of positive examples labeled as negative. The diagonal elements of the confusion matrix (TP and TN) represent the data properly classified by the classifier, while the off-diagonal elements (FN and FP) represent the misclassified data.
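The hold-out partition can be sketched as follows. This is a minimal illustration under assumptions: the text does not say whether the examples are shuffled or stratified before splitting, so the random shuffle, the `seed` parameter and the helper name `holdout_split` are all hypothetical.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def holdout_split(examples: Sequence[T], train_fraction: float = 0.666,
                  seed: int = 0) -> Tuple[List[T], List[T]]:
    """Shuffle the examples (ASSUMPTION) and cut them into training/test parts."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# With 1000 examples, 666 go to training and 334 to testing.
train, test = holdout_split(range(1000))
print(len(train), len(test))
```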
Table 6. Confusion matrix
We consider the following metrics presented by Powers (2011):
Recall, sensitivity or true positive rate (TPR): $\mathrm{Recall} = \mathrm{TPR} = \frac{TP}{TP + FN}$
Specificity or true negative rate (TNR): $\mathrm{TNR} = \frac{TN}{TN + FP}$
Precision or positive predictive value: $\mathrm{Precision} = \frac{TP}{TP + FP}$
False positive rate (FPR): $\mathrm{FPR} = \frac{FP}{FP + TN}$
The harmonic mean of precision and sensitivity (F-measure): $F = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$
Matthews correlation coefficient (MCC): $\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$
Accuracy (ACC): $\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN}$
PRC area: the area under the precision-recall curve (PRC), which plots precision versus recall.
ROC area: the area under the ROC curve (AUC), a common evaluation metric for binary classification problems. The ROC curve plots the recall against the FP rate at each FP rate considered.
We note that both ROC and PRC are important performance measures. However, ROC is more robust than PRC in the imbalanced-class case because ROC is independent of the fraction of the test population that belongs to class 0 or class 1.
In this section, we study the impact of the three variable selection methods on CS: the two filtering methods and our VS-VNS. We evaluate their performance when they are combined with search methods for BN. The same variable selection methods are used on the four datasets. The corresponding empirical studies are given in the following.
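The threshold-based metrics in this list follow the usual confusion-matrix definitions, so they can be computed directly from the four counts TP, FP, TN and FN. The sketch below is illustrative only: the function name and the example counts are made up and are not results from the paper, and the code assumes that no denominator is zero.

```python
import math
from typing import Dict


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Compute confusion-matrix metrics (assumes all denominators are non-zero)."""
    recall = tp / (tp + fn)              # sensitivity, true positive rate
    specificity = tn / (tn + fp)         # true negative rate
    precision = tp / (tp + fp)           # positive predictive value
    fpr = fp / (fp + tn)                 # false positive rate = 1 - specificity
    f_measure = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "recall": recall,
        "specificity": specificity,
        "precision": precision,
        "fpr": fpr,
        "f_measure": f_measure,
        "accuracy": accuracy,
        "mcc": mcc,
    }


# Made-up counts for illustration (100 test examples).
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

The ROC and PRC areas are not included because they require the classifier's scores over all thresholds rather than a single confusion matrix.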
Table 7 gives the number of selected variables with the three variable selection methods on the four datasets.
Table 7. Number of selected variables with the three variable selection methods
As already mentioned, we consider two filtering methods for variable selection. CFS is the correlation-based feature selection presented by Hall (1999), which selects the best set of variables under the assumption that variables are conditionally independent. However, CFS is not able to select all relevant variables when there are strong dependencies. For this reason, we also use the information gain based ranking filter and VS-VNS. The ranking filter finds the best subset of variables from the original dataset by using a score: the variables are weighted using a proxy measure rather than the error rate, as shown by Hall (1999) and Caruana and Freitag (1994). VS-VNS applies the process already given in Section 3.2 to select the relevant variables. The variables selected with CFS are as follows. For the Australian dataset, 7 variables are selected: A4, A5, A7, A8, A10, A13 and A14. For the German dataset, only 3 variables are chosen: A1, A2 and A3. For the Japanese dataset, 7 variables are selected: A4, A6, A8, A9, A11, A14 and A15. For the Give me some credit dataset, 4 variables are selected: A2, A3, A7 and A9.
The variables selected with the ranking filter are as follows. For the Australian dataset, all the variables are ranked and selected: A8, A10, A9, A14, A7, A5, A6, A3, A13, A4, A2, A12, A11 and A1. The corresponding rank values are respectively: 0.425709, 0.213511, 0.156286, 0.110235, 0.110022, 0.10916, 0.050189, 0.041099, 0.036708, 0.029603, 0.022884, 0.010036, 0.000721 and 0.000139. For the German dataset, there are 16 selected variables among 20. The selected variables are A1, A3, A2, A6, A4, A5, A12, A7, A15, A13, A14, A9, A20, A10, A17 and A19, with the following rank values respectively: 0.094739, 0.043618, 0.0329, 0.028115, 0.024894, 0.018709, 0.016985, 0.013102, 0.012753, 0.011278, 0.008875, 0.006811, 0.005823, 0.004797, 0.001337 and 0.000964. For the Japanese dataset, the selected variables, sorted by rank, are: A9, A11, A10, A15, A8, A6, A14, A7, A3, A5, A4, A2, A13, A12 and A1. The corresponding rank values are respectively: 0.425709, 0.213511, 0.156286, 0.110235, 0.110022, 0.107525, 0.05371, 0.049456, 0.041099, 0.02875, 0.02875, 0.021239, 0.010036, 0.000721 and 0.000423. For the Give me some credit dataset, the ranking filter selects all the variables, sorted by rank as follows: A1, A7, A3, A9, A2, A6, A4, A5, A8 and A10. The rank values are: 0.0529, 0.04621, 0.03813, 0.03171, 0.01085, 0.00556, 0.00473, 0.00405, 0.00299 and 0.00155.
The sets of variables selected with the proposed VS-VNS when SA is applied as the search method are as follows. For the Australian dataset, VS-VNS selects 12 variables; the unselected variables are A1 and A6. For the German dataset, there are 19 selected variables; the variable A5 is removed. For the Japanese dataset, there are 15 selected variables where the variable A12 is removed. For the Give me some credit dataset, all the variables are selected.
In this section, we present the results obtained when the six search methods are combined with the three variable selection techniques in BN (CFS, Ranking and VS-VNS). The max_iterations parameter of VS-VNS is fixed empirically to 10 and the wp value is equal to 0.6. The numerical results are reported in Tables 8 to 11.
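For readers unfamiliar with ranking by information gain, the sketch below computes the textbook quantity IG(C; A) = H(C) - H(C | A) for discrete variables and sorts the variables by it. It is a simplified illustration, not the WEKA implementation used in the study: it handles only discrete values, the function names are hypothetical, and the toy data are made up.

```python
import math
from collections import Counter
from typing import Dict, Hashable, List, Sequence, Tuple


def entropy(labels: Sequence[Hashable]) -> float:
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def information_gain(values: Sequence[Hashable],
                     labels: Sequence[Hashable]) -> float:
    """IG = H(class) - H(class | variable) for one discrete variable."""
    total = len(labels)
    groups: Dict[Hashable, List[Hashable]] = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def rank_variables(columns: Dict[str, Sequence[Hashable]],
                   labels: Sequence[Hashable]) -> List[Tuple[str, float]]:
    """Return (variable, score) pairs sorted by decreasing information gain."""
    scores = {name: information_gain(col, labels)
              for name, col in columns.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Made-up toy data: A1 determines the class exactly, A2 carries no information.
labels = ["good", "good", "bad", "bad"]
columns = {"A1": ["x", "x", "y", "y"], "A2": ["u", "v", "u", "v"]}
print(rank_variables(columns, labels))
```

A ranking filter then keeps the variables whose score exceeds a chosen threshold, or the top-ranked ones.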
Table 8. BN with variable selection and search methods on Australian Credit dataset
Table 9. BN with variable selection and search methods on German Credit dataset
Table 10. BN with variable selection and search methods on Japanese Credit dataset
Table 11. BN with variable selection and search methods on Give me some Credit dataset
According to the numerical results, BN with K2, BN with hill climbing, BN with repeated hill climbing and BN with TS are in general comparable on the four considered datasets. The numerical results show a slight advantage for BN with TAN when compared to BN with K2, BN with hill climbing, BN with repeated hill climbing and BN with TS. Moreover, BN obtains better results when combined with the SA search method than the other BN variants on all the considered datasets.
For example for Australian dataset, BN with SA gives a ROC% value equal to
When we compare the variable selection methods, we can see that VS-VNS provides good results compared to both CFS and Ranking methods. For example for Australian dataset, BN with SA gives a ROC% value equal to
In conclusion, promising results are obtained when combining BN with SA search method with our new VS-VNS technique. This improvement can be shown for all the considered datasets. The proposed method gives a ROC% value equal to

ROC% and PRC% found with variable selection with SA for Australian dataset.

ROC% and PRC% found with variable selection with SA for German dataset.

ROC% and PRC% found with variable selection with SA for Japanese dataset.

ROC% and PRC% found with variable selection with SA for Give me some credit dataset.
To assess the statistical significance of our results, we use the ANOVA (analysis of variance) statistical tool. In our case, we use the ROC% to compare the effect of the VS-VNS variable selection when combined with the SA search method in BN for CS. We compare all the variable selection methods (CFS, Ranking and VS-VNS) when used with SA in BN. We also compare with BN when all the variables are considered (ALL). Table 12 describes the four ANOVA tests, where the column df represents the degrees of freedom, the column SS is the sum of squares, the column MS is the mean square and the F-value is the F-statistic. The
Table 12. ANOVA test for BN with SA search method
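The quantities reported in an ANOVA table (df, SS, MS and the F-statistic) can be computed as in the sketch below. The text does not state the exact ANOVA design, so a one-way ANOVA with one group of ROC% values per variable selection method is assumed here. The function name is hypothetical and the numbers are made up for illustration; they are not results from the paper.

```python
from typing import Dict, Sequence


def one_way_anova(groups: Sequence[Sequence[float]]) -> Dict[str, float]:
    """One-way ANOVA table entries (ASSUMED design: one group per method)."""
    k = len(groups)                          # number of groups
    n = sum(len(g) for g in groups)          # total number of observations
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between, df_within = k - 1, n - k
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within
    return {
        "df_between": df_between, "df_within": df_within,
        "ss_between": ss_between, "ss_within": ss_within,
        "ms_between": ms_between, "ms_within": ms_within,
        "F": ms_between / ms_within,
    }


# Made-up observations for three hypothetical groups.
table = one_way_anova([[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]])
print(table)
```

The p-value is obtained by comparing F with the F distribution with (df_between, df_within) degrees of freedom, which requires a statistics library and is omitted from this sketch.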
Conclusion
Credit scoring is an important process in financial institutions and banks. It supports decision making and makes it possible to distinguish between ’bad’ and ’good’ counterparties. CS uses variables related to applicants to evaluate their creditworthiness. In this work, we studied the impact of variable selection and classification on credit scoring. We proposed a new variable selection method called VS-VNS for CS. The proposed method was compared to two filtering methods. Further, we explored different combinations of variable selection and various search methods for BN. We considered six search methods: K2, hill climbing, repeated hill climbing, TS, SA and TAN. The different techniques were combined with BN to build models for CS. The different combinations were evaluated on the German, Australian and Japanese datasets and on the large Kaggle dataset. The numerical results are promising and show the benefit of the proposed combinations. Further, the proposed VS-VNS obtains promising results compared to the other variable selection techniques, in particular when combined with SA in BN. We plan to combine several variable selection strategies to further enhance the performance. It would also be worthwhile to evaluate the considered combinations with other classifiers on other datasets.
