Abstract
Machine learning techniques have been used successfully in several areas, including banking and finance. These techniques are mainly used for prediction, classification and partitioning data into groups according to a common characteristic. In this work, we are interested in machine learning techniques for credit scoring and bankruptcy prediction in finance and banking. We evaluate and compare a range of machine learning techniques on several datasets obtained from banks and financial institutions, with the aim of selecting the most appropriate method for each dataset. We use several metrics to evaluate the performance of the obtained models. The empirical studies are conducted on the German, Australian, Japanese, Polish, Indian Qualitative Bankruptcy and Taiwan datasets. We also consider the large “Give Me Some Credit” dataset.
The machine learning methods produce scores for applicants and companies and support decision making. In other words, these methods allow us to distinguish between bad and good applicants or companies. The numerical study shows that no method consistently outperforms the others on all the datasets. Also, there are significant differences between the studied methods on some datasets. For the German and Give Me Some Credit datasets, the Bayes Net method produces good scores compared to the other studied methods. The LogitBoost method is competitive on both the Polish and Australian datasets, while the AdaBoost method is the most appropriate for the Japanese dataset. For the Taiwan dataset, the Random Forest method gives the best results compared to the other considered techniques. However, on the Indian Qualitative Bankruptcy dataset, almost all the methods are comparable due to the nature of this dataset.
Introduction
When a borrower or a loan applicant (also called “a counterparty”) applies for credit, lenders or creditors will check its credit score to determine whether it can get credit or not. A credit score is an evaluation that helps lenders and creditors to see how good the borrower is and indicates how likely the borrower is to pay back the owed debt, based on its past borrowing behavior.
Credit scoring (CS) is a process used to predict the probability that a borrower will default or become delinquent (Mester [20]). CS is the process of evaluating the creditworthiness of applicants in order to decide whether credit will be granted or not. The evaluation process is usually based on variables related to the applicants, such as historical payments, guarantees, default rates, etc.
CS has several advantages that accrue to both lenders and borrowers. Several studies in finance and banking have shown that credit scores are one of the most powerful predictors of risk (Miller [21]). CS enables banks and financial institutions to produce scores for applicants and helps them to decide whether to grant credit to the applicant or not. It can be used for avoiding risk as well as supporting retail, corporate and collection scorecards. CS also provides an objective analysis of the applicant’s creditworthiness which reduces discrimination and credit risk. Further, credit scoring allows the automation of the lending process which leads to increased speed and consistency of the loan application.
Over the last few years, several models and techniques for credit scoring have been proposed and discussed. Almost all of them are based on statistical techniques, machine learning or data mining. Among these techniques, we mention the following: linear regression, a statistical method (Hand et al. [15]) that permits us to analyze data and verify whether credit can be granted to a given applicant or not; discriminant analysis and logistic regression based methods, which are among the most broadly established statistical techniques used to classify applicants as good or bad (Wiginton [29]); decision trees (Vapnik [27], Quinlan [23]); Classification and Regression Trees (CART) (Breiman et al. [5]); and Bayesian network based models (Friedman [12]).
Computational intelligence based techniques are also investigated for developing credit scoring models. Among them, we cite: the neural networks (Desay et al. [10], Quinlan [24]), the k-Nearest Neighbor (k-NN) classifier (Henley et al. [16]), the support vector machines (SVM) (Bellotti and Crook [4]), the ensemble classifiers (Abellan and Mantas [3]), the genetic programming method (Abdou [2]), the evolution strategies (Li et al. [19]), the cooperative agents based system (Boughaci and Alkawaldeh [7]) and recently the new variable selection for credit scoring proposed in Boughaci and Alkawaldeh [8].
In this paper, we evaluate and compare the performance of various machine learning (ML) classifiers for CS and bankruptcy prediction. The empirical studies are conducted on seven well-known financial datasets: Australian, German, Japanese, Taiwan, Polish, Indian and the large “Give Me Some Credit” dataset from Kaggle. The Polish and Indian datasets are dedicated to bankruptcy prediction, while the other datasets are focused on CS.
The rest of this paper is organized as follows: Section 2 gives the problem formulation and then describes the set of the machine learning algorithms used in this study. Section 3 gives the considered datasets and the evaluation measures. Section 4 presents the empirical studies. Finally, Section 5 concludes and provides future work.
Classification and machine learning techniques
Classification is a supervised learning process that builds a predictive model from a dataset in which the correct class of each example is known. The process is called supervised learning because we have a set of input variables and an output variable (the label or class), and we use an algorithm to learn the mapping function from the input to the output. A supervised learning technique creates a model from a training dataset and uses this model to classify new data. When we have only input data and no corresponding output variables, we deal with unsupervised learning, of which clustering (or segmentation) is a typical example. Unlike classification, clustering is an unsupervised learning technique that may be used to split a large dataset into clusters or groups according to a common characteristic.
In this paper, we treat credit scoring as a classification problem. Let us consider a test sample, that is, a new data item to be classified. The problem is to predict to which of the considered classes the test sample belongs. The training data is a set of examples of the form
In the following, we give the problem formulation and the set of the machine learning algorithms used in this study.
Problem formulation
The CS problem can be stated as follows (Miline et al. [14]):
We consider a set of features that can be used to characterize a financial dataset of applicants. Each feature may be either numerical or categorical. The applicant's age and the interest rate are examples of numerical features, which are represented as numbers. Categorical features are qualitative features such as the credit history or a geographic region code.
In a CS problem, the financial dataset consists of a set of applicants where each applicant is represented by a set of features. The problem is to classify the applicants into two classes: “bad” (class label Y = 0, applicants who defaulted on their loans) or “good” (class label Y = 1).
Mathematically, the CS problem can be formulated as follows (Miline et al. [14]):
The credit data can be organized as a matrix D of m rows and n columns, where n is the number of features and m is the number of past applicants. The creditworthiness of a new applicant can be determined by using the data on past applicants recorded in the matrix D. The decision is then represented as a vector Y with m elements, where each element $y_i$ has two possible values, 0 or 1. An element $y_i$ receives the value 1 when applicant i is accepted, and 0 otherwise. The classification is then the problem of determining the decision vector Y that indicates the accepted applicants ($y_i = 1$) and the rejected ones ($y_i = 0$).
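As a minimal sketch of this formulation, the snippet below builds a tiny data matrix D with m = 4 past applicants and n = 3 features, together with its decision vector Y. The feature names and all values are invented purely for illustration and do not come from any of the datasets studied here.

```python
# Hypothetical toy data: m = 4 past applicants, n = 3 features each.
# Columns (illustrative only): age, interest rate (%), credit-history code.
D = [
    [35, 4.5, 1],
    [22, 9.0, 0],
    [48, 3.2, 1],
    [30, 7.5, 0],
]

# Decision vector Y: y_i = 1 if applicant i is accepted, 0 otherwise.
Y = [1, 0, 1, 0]

m = len(D)       # number of past applicants (rows of D)
n = len(D[0])    # number of features (columns of D)

# Sanity checks: every row has n features, and there is one label in {0, 1} per applicant.
assert all(len(row) == n for row in D)
assert len(Y) == m and set(Y) <= {0, 1}

# Indices of the accepted applicants (y_i = 1).
accepted = [i for i, y in enumerate(Y) if y == 1]
print(m, n, accepted)
```

A classifier trained on (D, Y) would then be asked to produce the label of a new row that is not in D.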
In this study, we evaluate a range of machine learning classifiers on some credit scoring and bankruptcy datasets. The set of the considered machine learning classifiers is given as follows:
Random Forest (RF): a learning method proposed by Tin Kam Ho [18] for both classification and regression tasks. Random Forest is an ensemble classifier built from a multitude of decision trees, where each tree gives a classification. The forest then chooses the classification that receives the most votes over all its trees.
k-Nearest Neighbor classifier (k-NN): this method classifies an instance by comparing its feature vector with those of the training examples and finding the k training examples that are closest to it according to a distance metric. The predicted class is the most frequent class among these k neighbors (Henley et al. [16]). The value of k is fixed empirically. In this study, we set k to 3 and use the Euclidean distance, which is standard in the field.
Naive Bayes classifier (NB): a simple probabilistic classifier based on Bayes' theorem with independence assumptions between the features or variables. In order to derive a conditional probability for the relationships between the feature values and the class, NB analyses the relationship between each feature and the class for each instance (Rennie et al. [25]).
Bayes Network classifier (BN): also known as a Bayesian network or belief network, it is a statistical model based on a combination of a directed acyclic graph of nodes and links and a set of conditional probabilities held in a table. It is a probabilistic graphical model that represents a set of random variables and their conditional dependencies via a directed acyclic graph (Friedman et al. [12]).
Multi-layer Perceptron (MLP): a feed-forward artificial neural network trained with the supervised back-propagation technique. An MLP is a set of layers of nodes in a directed graph where each node is a neuron with a nonlinear activation function. An MLP maps sets of input data onto a set of appropriate outputs (Frank et al. [26]).
Logistic classifier: a regression model, also known as logit regression, developed by the statistician David Cox in 1958 (Cox [9]). It is used to estimate the probability of a qualitative response based on one or more predictor (independent) variables.
Support Vector Machine (SVM): a machine learning technique proposed by Vladimir Vapnik for classification and regression (Friedman et al. [12]). The SVM classification method learns from a training dataset and attempts to generalize and make correct predictions on novel data. SVM is a kernel-based method where the kernel represents a similarity measure between training examples. Common kernel functions include the linear, polynomial, Laplacian, sigmoid and Gaussian kernels, the last also called the radial basis function (RBF) kernel (Friedman et al. [12]).
Bagging: bootstrap aggregating, an ensemble method used in classification and regression and designed to improve the stability and accuracy of machine learning algorithms. The method was proposed by Leo Breiman in 1994 to improve classification by combining the classifications obtained from randomly generated training sets (Breiman [6]).
AdaBoost: boosting is a process for increasing the performance of weak learning algorithms. It combines a set of classifiers produced by the learning algorithm over a number of distributions of the training data. AdaBoost is an adaptive boosting algorithm that may be used with many other types of learning algorithms to improve their performance. The outputs of the other learning algorithms are combined into a weighted sum that represents the final output of the boosted classifier (Freund et al. [11]).
OneR: a simple rule-based classifier. The method learns a one-level decision tree that generates a set of rules, all of which test one particular attribute (Holte [17]).
LogitBoost: a boosting algorithm that combines AdaBoost with logistic regression. The LogitBoost algorithm is obtained by considering AdaBoost as a generalized additive model and applying the cost function of logistic regression (Friedman [13]).
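To make one of these classifiers concrete, the following is a minimal sketch of the k-NN rule with k = 3 and the Euclidean distance, the settings stated above. It is written in plain Python for illustration only and is not the Weka implementation used in the experiments; the function names and the toy data are assumptions.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_X, train_y, query, k=3):
    """Predict the class of `query` by majority vote among its k nearest training examples."""
    # Sort training indices by distance to the query and keep the k closest.
    nearest = sorted(range(len(train_X)),
                     key=lambda i: euclidean(train_X[i], query))[:k]
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data (two numeric features per applicant), for illustration only.
train_X = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]]
train_y = [1, 1, 1, 0, 0, 0]

print(knn_predict(train_X, train_y, [1.1, 1.0]))  # query close to the class-1 group
print(knn_predict(train_X, train_y, [5.1, 5.0]))  # query close to the class-0 group
```

In practice, numerical features are usually rescaled before computing distances so that no single feature dominates; that step is omitted here for brevity.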
This study considers seven financial datasets. Six of them, the German, Australian, Japanese, Polish, Indian Qualitative Bankruptcy and Taiwan default of credit card clients datasets, are available on the UCI (University of California at Irvine) Machine Learning Repository. We also consider the Give Me Some Credit dataset from the Kaggle web site. The descriptions of the considered datasets are given in Tables 1 and 2.
Description of the datasets (case of customers) used in the study
Description of the datasets (case of companies) used in the study
Classifiers performances on German credit dataset
The example data are partitioned into training and testing sets, approximately in the proportion of 70% to 30%, respectively. Credit scoring models are constructed on the training data by using the considered machine learning algorithms and validated on the testing data. We train each classifier on the training data to obtain a model. Then, we apply this model to the test data to predict the class of each test example.
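The 70%/30% partition can be sketched as follows. The text does not say whether the split is random or stratified, so the random shuffle, the fixed seed and the function name below are assumptions made for illustration.

```python
import random

def train_test_split(examples, train_fraction=0.7, seed=0):
    """Shuffle the examples and split them into training and testing parts."""
    rng = random.Random(seed)    # fixed seed (an assumption) so the split is reproducible
    shuffled = list(examples)    # copy, so the caller's list is left untouched
    rng.shuffle(shuffled)
    cut = round(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

# 1000 placeholder examples standing in for the rows of a credit dataset.
examples = list(range(1000))
train, test = train_test_split(examples)
print(len(train), len(test))  # prints: 700 300
```

Every example ends up in exactly one of the two parts, so the model is never evaluated on data it was trained on.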
We use several metrics to evaluate the performance of the credit scoring models. True Positives (TP) is the number of positive examples labeled as positive. False Positives (FP) is the number of negative examples labeled as positive. True Negatives (TN) is the number of negative examples labeled as negative. False Negatives (FN) is the number of positive examples labeled as negative. We consider the following metrics (Powers [22]):
Recall, sensitivity or true positive rate (TPR): $\mathrm{TPR} = \frac{TP}{TP + FN}$.
Specificity or true negative rate (TNR): $\mathrm{TNR} = \frac{TN}{TN + FP}$.
Precision or positive predictive value: $\mathrm{Precision} = \frac{TP}{TP + FP}$.
False positive rate (FPR): $\mathrm{FPR} = \frac{FP}{FP + TN}$.
F-measure, the harmonic mean of precision and sensitivity: $F = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{TPR}}{\mathrm{Precision} + \mathrm{TPR}}$.
Matthews correlation coefficient (MCC): $\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$.
Accuracy (ACC): $\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN}$.
PRC area: the area under the precision-recall curve (PRC), which plots precision versus recall.
ROC area (AUC): the area under the ROC curve, a common evaluation metric for binary classification problems. The ROC curve plots the recall (TPR) against the false positive rate (FPR) at each decision threshold considered.
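The threshold-based metrics in this list follow directly from the four counts TP, FP, TN and FN, using their standard definitions. The sketch below computes them for a hypothetical confusion matrix; the counts are invented for illustration and are not results from this study, and the function assumes that no denominator is zero.

```python
import math

def classification_metrics(tp, fp, tn, fn):
    """Compute the threshold-based metrics from the four confusion-matrix counts."""
    tpr = tp / (tp + fn)            # recall / sensitivity
    tnr = tn / (tn + fp)            # specificity
    precision = tp / (tp + fp)      # positive predictive value
    fpr = fp / (fp + tn)            # false positive rate
    f_measure = 2 * precision * tpr / (precision + tpr)
    acc = (tp + tn) / (tp + fp + tn + fn)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "TPR": tpr, "TNR": tnr, "Precision": precision, "FPR": fpr,
        "F-measure": f_measure, "ACC": acc, "MCC": mcc,
    }

# Hypothetical counts, for illustration only.
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Note that FPR and TNR always sum to 1, so reporting either one determines the other.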
We note that the ROC measure is more robust than PRC in the case of imbalanced classes because the ROC curve is independent of the fraction of the test population that belongs to class 0 or class 1.
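The independence of the ROC area from the class proportions can be checked with a small sketch. It uses the standard rank interpretation of the AUC, namely the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The scores and labels are invented for illustration.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve as the probability that a positive outranks a negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    # Count correctly ordered positive/negative pairs; ties count one half.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical classifier scores and true labels, for illustration only.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]
auc = roc_auc(scores, labels)

# Make class 0 three times as frequent by adding two more copies of each negative.
# The class ratio changes, but the ROC area does not.
negatives = [s for s, y in zip(scores, labels) if y == 0]
scores_imb = scores + negatives * 2
labels_imb = labels + [0] * (2 * len(negatives))
auc_imb = roc_auc(scores_imb, labels_imb)

print(auc, auc_imb)
```

Precision, by contrast, depends on the number of negatives scored above the threshold, so a precision-recall curve generally shifts when the class ratio changes.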
Classifiers performances on Australian dataset
Classifiers performances on Give Me Some Credit dataset
All experiments were run on an Intel Core(TM) i5-2217U CPU@1.70 GHz with 6 GB of RAM under Windows 8 64 bits, processor x64.
We conduct extensive experiments using the Waikato Environment for Knowledge Analysis (Weka) (Waikato [28]), version 3.9. Weka is machine learning software developed by the University of Waikato. It supports data preprocessing, analysis and classification. Weka is available under the GNU General Public License. In the following, we give the numerical results obtained when applying the machine learning algorithms to the seven credit datasets. We compare the classifiers by using the ML performance metrics given in Section 3.1. To choose the best method, we first look at the ROC and PRC rates. Then we use the other performance metrics as a tie-breaker to confirm the decision based on the ROC and PRC rates. The best results are in bold font.
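One possible reading of this selection rule is a lexicographic comparison: ROC first, then PRC, then the remaining metrics as tie-breakers. The sketch below implements that reading. The ordering of the tie-breakers, the classifier names and all numbers are assumptions made for illustration; the numbers are placeholders and are not results from this study.

```python
def select_best(results):
    """Pick the classifier with the highest (ROC, PRC), breaking ties on F-measure then MCC."""
    return max(
        results,
        key=lambda name: (
            results[name]["roc"],
            results[name]["prc"],
            results[name]["f_measure"],
            results[name]["mcc"],
        ),
    )

# Placeholder numbers invented for illustration; they are NOT results from this study.
results = {
    "classifier_A": {"roc": 0.78, "prc": 0.77, "f_measure": 0.74, "mcc": 0.40},
    "classifier_B": {"roc": 0.78, "prc": 0.77, "f_measure": 0.75, "mcc": 0.42},
    "classifier_C": {"roc": 0.75, "prc": 0.79, "f_measure": 0.76, "mcc": 0.45},
}

print(select_best(results))  # prints: classifier_B
```

Here classifier_A and classifier_B tie on ROC and PRC, so the F-measure decides between them, while classifier_C is ruled out by its lower ROC despite its higher PRC.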
Comparison of ML methods on CS data
Tables 3 through 7 give the numerical results obtained when applying the machine learning algorithms to the CS datasets. As we can see from Table 3, the Bayes Net classifier gives a good classification with a PRC rate equal to
According to Table 4, Bayes Net (BN) gives the best results on the Australian dataset when we consider TPR%, FPR%, Precision%, Recall%, F-Measure% and MCC% as performance measures. With BN, we obtain a TPR rate equal to
When we compare the classifiers according to ROC area, we can conclude that the LogitBoost method is the best one on the Australian dataset. The LogitBoost method gives a ROC rate equal to
For the Give Me Some Credit dataset, Table 5 shows that the Bayes Net method gives a classification with a ROC rate equal to
From Table 6, we can see that the AdaBoost method performs well on the Japanese dataset. The ROC rate is equal to
Classifiers performances on Japanese dataset
For the Taiwan dataset, Table 7 shows that the Random Forest method achieves good results compared to the other classifiers, with a ROC rate equal to
Classifiers performances on Taiwan dataset
As shown in Tables 3 to 7, the Bayes Net, Random Forest, AdaBoost and LogitBoost methods are comparable on the considered CS datasets. These four methods give better results than the other classifiers in terms of ROC, PRC, TPR, FPR, Precision, Recall, F-Measure and MCC.
Tables 8 and 9 give the numerical results obtained when applying the machine learning algorithms to the two bankruptcy datasets. Table 8 shows the performance of the LogitBoost method on the Polish dataset. It gives a ROC rate equal to
ROC% versus PRC% on Polish dataset
Classifiers performances on Indian Credit dataset
From Table 9, which covers the Indian dataset, we remark that almost all the considered classifiers achieve good results, with PRC and ROC rates equal to
Credit scoring is a crucial problem for banks and financial institutions, especially after the 2007-2008 financial crisis. Financial institutions have to find good techniques to select counterparties. The German, Australian, Polish, Taiwan, Japanese, Indian and Give Me Some Credit datasets are used in this research. The paper evaluates eleven techniques to distinguish between bad and good counterparties. We evaluated the k-Nearest Neighbor (k-NN), Random Forest, Bayes Network, Naive Bayes, Multi-layer Perceptron (MLP), Logistic, Support Vector Machine (SVM), OneR, Bagging, AdaBoost and LogitBoost classifiers on the seven datasets to measure their performance. Various measures are used in this study. We considered both the ROC and PRC area measures. We also considered the TPR%, FPR%, Precision%, Recall%, F-Measure% and MCC% measures to compare the machine learning algorithms for credit scoring.
The numerical study showed that Bayes Net, Random Forest, AdaBoost and LogitBoost machine learning classifiers produce efficient models for credit scoring.
In order to better visualize the behavior of the four best machine learning classifiers (Bayes Net, Random Forest, AdaBoost and LogitBoost), we plot Figs 2 to 6. We compare the ROC and PRC performance measures of the four classifiers on the German, Australian, Japanese, Polish, Taiwan and Give Me Some Credit datasets. Figs 2 to 6 show that the four classifiers are comparable on the considered datasets. They produce consistent results in terms of both ROC and PRC, which confirms the effectiveness of these techniques in credit scoring.

ROC% versus PRC% on the German dataset.

ROC% versus PRC% on the Australian dataset.

ROC% versus PRC% on the Japanese dataset.

ROC% versus PRC% on the Taiwan dataset.

ROC% versus PRC% on the “Give Me Some Credit” dataset.

ROC% versus PRC% on the Polish dataset.
We conclude that the Bayes Net and Boosting classifiers in general are effective models for credit scoring. Their numerical results are competitive and demonstrate the benefit of the considered models for credit scoring and bankruptcy prediction.
As future work, it would be interesting to develop deep learning techniques to handle very large datasets and improve performance. Like machine learning, deep learning is a technique for data analysis that learns from data, identifies patterns and supports intelligent decisions. However, deep learning combines computing power with special kinds of neural networks, such as recurrent neural networks, that may be used to solve a problem end to end. Deep learning is used mainly to learn complicated patterns in large amounts of data. When data is very large, complex and diverse, traditional machine learning techniques may be unable to produce good results, hence the need for deep learning, which tends to be more accurate when trained on very large datasets. We note that “big data” refers to a very large dataset collected from a variety of sources. It describes terabytes or petabytes of complex and diverse data captured over time, which traditional machine learning techniques are inadequate to handle. In such situations, deep learning techniques are recommended.
