Abstract
Customer behavior prediction has recently been gaining importance in the banking sector, as in many other sectors. This study proposes a model to predict whether credit card users will pay their debts. Using the proposed model, potential unpaid risks can be predicted and necessary actions can be taken in time. To predict customers’ payment status for the coming months, we use Artificial Neural Network (ANN), Support Vector Machine (SVM), Classification and Regression Tree (CART) and C4.5, which are widely used artificial intelligence and decision tree algorithms. Our dataset includes the records of 10713 customers obtained from a well-known bank in Taiwan. These records consist of customer information such as the amount of credit, gender, education level, marital status, age, past payment records, invoice amount and amount of credit card payments. We apply cross validation and hold-out methods to divide our dataset into training and test sets. Then we evaluate the algorithms with the proposed performance metrics. We also optimize the parameters of the algorithms to improve prediction performance. The results show that the model built with the CART algorithm, one of the decision tree algorithms, predicts the customers’ payment status for the next month with high accuracy (about 86%). When the algorithm parameters are optimized, classification accuracy and performance increase.
Introduction
Nowadays, credit cards, which are indispensable for many consumers, have reached a tremendous volume in the market with the rapid advances in technology. A credit card is an alternative payment tool to cash and can be used for paying at shopping points with POS devices or for cash advances from bank ATMs. The expenses made by credit card must be paid to the bank on a monthly basis or in installments. For banks, there exist credit card risks, which can be defined as losses that may arise when credit card customers do not fulfill their liabilities in a timely manner. The management of unpaid credit card debt is gaining importance due to the increasing number of credit cards and the negative effects of credit card risks. Banks may suffer losses when their customers fail, partially or fully, to fulfill their contractual obligations. Moreover, a successive debt burden brings along adverse conditions for credit card users such as depression, suicide, divorce, violence and theft. Thus, banks should reduce the negative effects of unpaid credit card debts on themselves and on their credit card customers. For the management of such risks, predicting customer behavior is a significant subject in the banking sector. In recent years, researchers have studied this problem using statistical and mathematical techniques such as discriminant analysis, nearest neighbor, logistic regression, decision trees, neural networks and other machine learning methods.
The main purpose of this study is to propose a model that predicts with high accuracy whether credit card users will pay their debts. With the proposed model, potential unpaid risks can be predicted and necessary actions can be taken in time. The contribution of this study is to apply ANN and SVM, which are traditional artificial intelligence algorithms, and CART and C4.5, which are widely used decision tree algorithms, to forecast the customers’ payment status for the coming months. We compare the predictive power of these four methods using various performance criteria and identify the methods that are inferior to the others for the presented prediction problem. Moreover, we conduct an extensive literature review on parameter optimization and show the usefulness of parameter optimization for each method in terms of prediction accuracy.
The study is organized as follows. In the second section, we present a literature review on machine learning methods and risk prediction in the banking sector. In the third section, we present the theoretical fundamentals of the prediction methodologies used in this study. In the fourth section, we describe the dataset and variables. In the fifth section, we analyze the results of the proposed methods, and in the last section we discuss the findings and conclusions.
Literature review
Machine learning is used to extract structure from data and validate this structure using specialized algorithms. It is a discipline focused on two related questions: “How can computer systems evolve automatically with experience?” and “What are the basic statistics, calculation, knowledge and theory rules governing all learning systems, including computers, people and organizations?”. Machine learning is applied in many areas such as speech/voice recognition, computer vision, biological surveillance, robot control etc.
One of the most common aims of machine learning is to make predictions for the future by using historical data. Studies in the related literature that use machine learning techniques for forecasting in different sectors and areas can be summarized as follows: stock-market predictions have been made by multi-layer perceptron, dynamic and hybrid ANNs [10] as well as the Nonlinear Autoregressive Network with Exogenous inputs (NARX) neural network [6]; household electricity consumption is forecast by a hybrid Autoregressive Integrated Moving Average (ARIMA) ANN [45]; financial time series forecasting (e.g. exchange rates) is studied by a hybrid method combining SVM with ARIMA [26]; rainfall forecasting is addressed using an evolutionary hybrid of the adaptive neuro fuzzy inference system (ANFIS) with the Firefly Algorithm (ANFIS-FFA) [29]; a port’s container throughput is forecast by ANN and SVM [15]; the Remaining Useful Life (RUL) of aircraft engines is predicted by ARIMA and SVM [30]; tourist demand is forecast using an autoregressive neural network [51]; and personal consumption expenditure inflation is predicted using three machine learning models (ANN, k-Nearest Neighbor (k-NN) and SVR) [43].
In the banking sector, there also exist many studies that use machine learning algorithms for predicting bank failure, credit card risk assessment, fraud prediction and customer churn. In what follows, a brief review of the applications of machine learning techniques in the banking sector is presented. Boyacioğlu et al. [25] used four different neural network models, three multivariate statistical methods and SVM for the prediction of bank failures. Le and Viviani [14] compared the efficiency of discriminant analysis and logistic regression as traditional statistical techniques with ANN, SVM and k-NN as machine learning approaches for predicting the failure of banks. The results showed that the ANN and k-NN methods performed better in predicting the failure of banks. Another study on bank failure prediction was done by Gogas et al. [32] using SVM. Carmona et al. [31] applied the extreme gradient boosting method to bank failure prediction. To determine the important factors in predicting bank financial strength ratings, Öğüt et al. [16] used multiple discriminant analysis and ordered logistic regression and compared the model performances with SVM and ANN. Patil et al. [39] proposed three different machine learning algorithms, namely logistic regression, the Iterative Dichotomiser 3 (ID3) decision tree algorithm and the random forest decision tree algorithm, for credit card fraud detection. Their empirical analysis showed that the random forest decision tree algorithm performed better in terms of accuracy, precision and recall. Shirazia and Mohammadi [13] used CART for churn prediction. Nami and Sharaji [38] proposed a method that applied a dynamic random forest algorithm for detecting fraudulent payment card transactions. Smeureanu et al. [18] compared the prediction performance of neural networks and SVM techniques for customer segmentation of a commercial bank in Romania.
Both neural networks and SVM performed well for customer segmentation, but the SVM model with the radial basis kernel function performed better in the segmentation process than the multilayer perceptron. Serengil and Ozpinar [36] focused on workforce planning in bank operation centers using a hybrid multi-level machine learning algorithm. They compared a neural network with alternative exponential smoothing algorithms for the evaluation of results. For credit risk assessment, Zhang et al. [42] employed a multiple instance learning method based on the radial basis function. The above-mentioned studies that used machine learning methods for prediction are summarized in Table 1.
Summary of literature review
To avoid creating underfitting or overfitting models, the parameters must be optimized. However, there is no consensus in the literature on exact rules for finding the best parameter values for each machine learning model. In Table 2, a literature review on the parameter settings for the ANN, SVM, CART and C4.5 methods is presented. For each technique, the most important parameters, parameter settings and the optimization methods are listed.
Summary of literature review about parameter settings of SVM, C4.5, CART, ANN
We use ANN and SVM, which are traditional artificial intelligence algorithms, as well as CART and C4.5, which are widely used decision tree algorithms, to forecast the customers’ payment status for the coming months. The features, most frequently used parameters and advantages/disadvantages of the four machine learning techniques employed in this study are summarized as follows.
Support vector machine (SVM)
SVM is a supervised machine learning algorithm that can be used for classification or regression problems [44]. In supervised learning, the existing data set is called the training set and is composed of coupled input variables “X” and output variable “Y”, shown as $S = ((x_1, y_1), (x_2, y_2), \ldots, (x_l, y_l)) \in (\mathbb{R}^n \times Y)^l$, where $l$ is the size of the existing observations. The learning type differs according to the nature of the output space Y. If $y_i \in \{1, -1\}$, the problem is defined as a binary classification problem; if $y_i \in \{1, 2, 3, \ldots, m\}$, the problem is a multiple class classification problem. In binary classification, the overall aim is to find a function h(X) in $\mathbb{R}^n$ for predicting the output value of y using the input variables x.
When using a linear classifier, the dot product between two vectors, also referred to as an inner product or scalar product, is defined as $w^T x = \sum_i w_i x_i$. A linear classifier is based on a linear discriminant function of the form:
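In the standard notation, which we assume matches the definitions above, the linear discriminant function and the resulting binary decision rule can be written as follows. The bias term $b$ and the sign-based rule are conventional choices, not symbols taken from the text:

```latex
f(x) = w^{T} x + b, \qquad
h(x) = \operatorname{sign}\bigl(f(x)\bigr) \in \{1, -1\}
```

The hyperplane $f(x) = 0$ separates the two classes, and the sign of $f(x)$ determines the predicted label.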
However, in real-life applications a nonlinear classifier commonly provides better accuracy. Yet, linear classifiers have advantages, one of them being that they often have simple training algorithms. A simple way of making a nonlinear classifier out of a linear classifier is to map the data from the input space X to a feature space F using a nonlinear function. While achieving mapping of data to a high-dimensional feature space, kernel methods perform well in terms of memory usage and computational time.
Moreover, the concept of margin for an SVM classifier is defined differently. In SVM, support vectors are the points closest to the hyperplane separating the positive and negative examples. A margin is specified to separate the two classes using these examples, defined as support vectors. In SVM, in order to obtain a maximum-margin classifier, the following convex quadratic programming problem is solved:
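A common way to write this problem is the soft-margin form below. It is a textbook formulation assumed here for illustration; the slack variables $\xi_i$ and the bias $b$ are conventional symbols, and $C$ is the penalty factor discussed later in this section:

```latex
\min_{w,\, b,\, \xi} \quad \frac{1}{2}\lVert w \rVert^{2} + C \sum_{i=1}^{l} \xi_i
\quad \text{subject to} \quad
y_i \bigl( w^{T} x_i + b \bigr) \ge 1 - \xi_i, \qquad \xi_i \ge 0, \qquad i = 1, \ldots, l
```

Setting every $\xi_i = 0$ recovers the hard-margin problem, in which the margin width is $2 / \lVert w \rVert$.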
SVM has many key parameters that should be set properly. The classification accuracy of SVM is sensitive to two factors: (i) the algorithm used to solve the quadratic problem and (ii) the parameter settings. Generally, different parameter settings lead to different classification performance. One of the most important parameters is the kernel function. The most commonly used kernel functions are the linear, polynomial, normalized polynomial, Fourier, radial basis (RBF), sigmoid and MLP kernel functions. The formulation of some kernel functions is shown in Table 3.
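For orientation, the usual textbook forms of four of these kernels are given below. The parameter names ($\gamma$, $r$, $d$) follow a common convention and are assumptions; they may differ from the notation of Table 3:

```latex
\begin{aligned}
\text{Linear:} \quad & K(x, z) = x^{T} z \\
\text{Polynomial:} \quad & K(x, z) = \bigl(\gamma\, x^{T} z + r\bigr)^{d} \\
\text{RBF:} \quad & K(x, z) = \exp\bigl(-\gamma \lVert x - z \rVert^{2}\bigr) \\
\text{Sigmoid:} \quad & K(x, z) = \tanh\bigl(\gamma\, x^{T} z + r\bigr)
\end{aligned}
```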
In addition to the kernel function, SVM performance depends on the values of the kernel parameters and the penalty factor C. The penalty parameter C controls the trade-off between classification accuracy and acceptable misclassification errors [20].
Artificial neural network (ANN)
ANN is one of the most powerful machine learning algorithms and is used in many different fields such as estimation, classification, signal processing and pattern recognition. In this study, we use a multilayer perceptron neural network with back-propagation learning.
The ANN algorithm is performed in two steps: a forward step and a backward step. In the forward step, the signals are propagated from input to output through the computational units. In the backward step, the synaptic weights are updated according to an error signal using a supervised learning method.
The ANN model is presented below [35].
Forward Computation:
The output signal of neuron j in layer l is as follows:
Logistic Activation Function:
Hyperbolic Tangent Activation Function:
In the output layer (l = L):
and the error signal can be computed as follows:
Backward Computation:
Local gradients (δ) are calculated:
The synaptic weights of the network are updated according to the following equation:
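The equations for these steps can be written in the notation of the standard back-propagation algorithm for a multilayer perceptron. The block below is a sketch under that assumption: the symbols ($v$ for the induced local field, $\varphi$ for the activation function, $d_j$ for the desired output, $\eta$ for the learning rate, and the constants $a$, $b$) are conventional and may differ from the notation of [35]:

```latex
\begin{aligned}
\text{Forward:} \quad & v_j^{(l)} = \sum_i w_{ji}^{(l)}\, y_i^{(l-1)}, \qquad y_j^{(l)} = \varphi\bigl(v_j^{(l)}\bigr) \\
\text{Logistic:} \quad & \varphi(v) = \frac{1}{1 + e^{-a v}}, \quad a > 0 \\
\text{Hyperbolic tangent:} \quad & \varphi(v) = a \tanh(b v), \quad a, b > 0 \\
\text{Output layer:} \quad & o_j = y_j^{(L)}, \qquad e_j = d_j - o_j \\
\text{Local gradients:} \quad & \delta_j^{(L)} = e_j\, \varphi'\bigl(v_j^{(L)}\bigr), \qquad
\delta_j^{(l)} = \varphi'\bigl(v_j^{(l)}\bigr) \sum_k \delta_k^{(l+1)}\, w_{kj}^{(l+1)} \quad (l < L) \\
\text{Weight update:} \quad & w_{ji}^{(l)} \leftarrow w_{ji}^{(l)} + \eta\, \delta_j^{(l)}\, y_i^{(l-1)}
\end{aligned}
```

A momentum term is sometimes added to the weight update; it is omitted here for brevity.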
The main parameters of ANN are the number of hidden layers, the number of hidden neurons and the transfer function, all of which directly affect the ANN model’s reliability [3]. Hidden layers, which lie between the input and output layers of an ANN, provide the network’s nonlinear modelling capabilities [27].
Classification and regression tree (CART)
Decision trees are mainly based on the division of input data into groups by means of a clustering algorithm. In decision tree learning, a tree structure is created. The leaves of this tree show the class labels. The branches leading to these leaves represent the operations on the attributes. During decision tree learning, the information learned is modeled as a tree. All interior nodes of this tree represent inputs.
The CART model is presented as follows [33]:
First step: The candidate splits of each parameter are determined.
Second step: The probability of the left candidate split ($P_L$) and the probability of the right candidate split ($P_R$) are calculated.
Third step: The probability of the left candidate split for every class ($P(j|t_L)$) and the probability of the right candidate split for every class ($P(j|t_R)$) are calculated.
Fourth step: The value of $2 P_L P_R$ for the first candidate split is calculated.
Fifth step: The sum of all reductions is calculated.
Sixth step: θ(s|t) is calculated.
After the measure of conformity is obtained for each split, the maximum θ(s|t) is identified for the main node. The iterations then continue until the leaf nodes are reached and a complete decision tree is formed.
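The steps above can be sketched in a few lines of Python. This is a minimal illustration that assumes the common textbook form of the split measure, θ(s|t) = 2 P_L P_R Σ_j |P(j|t_L) − P(j|t_R)|, which may differ in detail from the formulation in [33]; the function names and the sample labels are hypothetical.

```python
from collections import Counter


def split_goodness(left_labels, right_labels):
    """Goodness of a candidate split s at node t.

    theta(s|t) = 2 * P_L * P_R * sum_j |P(j|t_L) - P(j|t_R)|
    (a common textbook form, assumed here).
    """
    n_left, n_right = len(left_labels), len(right_labels)
    n_total = n_left + n_right
    if n_left == 0 or n_right == 0:
        return 0.0  # a split that leaves one side empty separates nothing

    # Second step: share of the node's records sent to each child.
    p_left = n_left / n_total
    p_right = n_right / n_total

    # Third step: class proportions inside each child.
    left_counts = Counter(left_labels)
    right_counts = Counter(right_labels)
    classes = set(left_counts) | set(right_counts)

    # Fifth step: sum over classes of the absolute proportion differences.
    class_difference = sum(
        abs(left_counts[c] / n_left - right_counts[c] / n_right) for c in classes
    )

    # Fourth and sixth steps: weight the sum by 2 * P_L * P_R.
    return 2 * p_left * p_right * class_difference


def best_split(candidates):
    """Return the (name, goodness) pair with the largest goodness."""
    scored = [(name, split_goodness(l, r)) for name, (l, r) in candidates.items()]
    return max(scored, key=lambda item: item[1])
```

A split that sends all payers to one child and all defaulters to the other reaches the maximum value of 1 for two classes, while a split whose children have identical class proportions scores 0.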
To establish a reliable CART model, two main parameters of this algorithm must be adjusted: the minimum leaf size and the depth of the tree. The minimum leaf size specifies the minimum number of instances required to produce a leaf. The depth of the tree specifies the number of levels to which the tree is grown [37].
C4.5
In this study, we also apply one of the most widely used decision tree algorithms, the C4.5 tree algorithm. The C4.5 tree is an extension of ID3 and is used for classification problems. It grows the tree from the nodes by splitting the data on the feature with the highest information gain. The decision tree construction process has two main phases: the growth phase and the pruning phase. The basic difference between the C4.5 tree and CART arises during the growth phase. The central choice made by the ID3 algorithm is the selection of the attribute tested at each node so that samples are classified in the most useful way, whereas the C4.5 algorithm uses an entropy evaluation function as the selection criterion. For the calculation of the entropy evaluation function, five steps must be applied [22]. The purpose of these calculations is to determine the class of predictors that provides the highest information gain.
First step: To identify the class of the training set S, calculate Info(S).
Second step: For a feature X that partitions S, calculate the expected information value.
Third step: Calculate the information gained.
Fourth step: Calculate the partition information value.
Fifth step: Calculate the gain ratio.
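The five steps can be illustrated with a short Python sketch. It assumes the usual definitions (base-2 entropy for Info(S), and gain ratio as information gain divided by split information) and a categorical feature; the function names and sample data are hypothetical and are not taken from [22].

```python
import math
from collections import Counter


def entropy(labels):
    """First step: Info(S) = -sum_j p_j * log2(p_j) over class proportions p_j."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def gain_ratio(feature_values, labels):
    """Gain ratio of splitting `labels` by a categorical feature."""
    total = len(labels)

    # Group the class labels by feature value (the partition of S induced by X).
    partitions = {}
    for value, label in zip(feature_values, labels):
        partitions.setdefault(value, []).append(label)

    # Second step: expected information after the split, Info_X(S).
    info_x = sum(len(part) / total * entropy(part) for part in partitions.values())

    # Third step: information gained by the split.
    gain = entropy(labels) - info_x

    # Fourth step: partition (split) information, the entropy of the subset sizes.
    split_info = -sum(
        (len(part) / total) * math.log2(len(part) / total)
        for part in partitions.values()
    )

    # Fifth step: gain ratio; a feature with a single value cannot split S.
    return gain / split_info if split_info > 0 else 0.0
```

Dividing by the split information penalizes features with many distinct values, which would otherwise obtain a high information gain simply by fragmenting the data.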
The aim of the pruning stage is to remove the parts of the tree that do not contribute to classification in order to avoid overfitting the training data. Prior to the application of the C4.5 algorithm, two main parameters must be set: the minimum number of cases that reach a leaf, or minimum number of split-off cases (M), and the pruning confidence level factor (CF). In the pruning phase, the pruning confidence level parameter influences whether a node of the tree will be deleted or not. A small value of CF causes more aggressive pruning of tree nodes. The parameter that sets the minimum number of cases reaching a leaf (M) affects the size of the grown tree in the construction phase. If M has a small value, the tree can be very large and have many branches. Parameter settings for C4.5 are crucial for avoiding overfitting or underfitting and for achieving high classification accuracy. Quinlan [22], the developer of the C4.5 algorithm, suggests default values of 2 and 25% for M and CF, respectively. The major advantage of C4.5 is that it can work with both continuous and discrete data. C4.5 can also handle incomplete or missing data. Its prime weakness is that small variations in the data can produce different decision trees.
Although parameter setting is crucial to improving model accuracy and performance, most machine learning algorithms suffer from incorrect parameter settings. To avoid creating underfitting or overfitting models, the parameters must be optimized. However, there is no consensus on exact rules for finding the best parameter values for each machine learning model.
As with other machine learning algorithms, the parameter settings must be chosen before applying the ANN, SVM, CART and C4.5 methods. Using the review of parameter optimization approaches in Section 2, the most important parameters, parameter settings and optimization methods are identified for each method. The performance of SVM is highly related to its kernel function type, the kernel parameters gamma or sigma and the penalty factor C. For C4.5, the minimum number of cases (M) per leaf and the pruning confidence level (CF) affect the performance of the model. CART model accuracy and performance are related to the maximum depth, minimum samples per split and minimum samples per leaf. For ANNs, parameters such as the hidden layer size, activation functions, learning rate, regularization parameter alpha and number of iterations can be significant.
Selecting the proper parameters is of utmost importance for improving the accuracy of a machine learning model. Generally, a trial-and-error method is used, but there are also numerous optimization techniques that search for the best parameters. Grid search, genetic algorithms, particle swarm optimization and the SS-based approach are just some of them.
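Of these techniques, grid search is the simplest to state: every combination of candidate values is evaluated and the best-scoring one is kept. The sketch below illustrates the idea in plain Python. The parameter names, candidate values and the `toy_score` function are hypothetical stand-ins for training a model and measuring its accuracy; they are not the settings used in this study.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Evaluate every parameter combination and return the best one.

    param_grid maps a parameter name to the list of values to try;
    score_fn takes a dict of parameters and returns a score to maximize
    (for example, a cross-validated accuracy).
    """
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:  # ties keep the first combination found
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical stand-in for "train a tree and measure its accuracy":
# the score peaks at max_depth=5 and min_samples_leaf=10.
def toy_score(params):
    return -abs(params["max_depth"] - 5) - abs(params["min_samples_leaf"] - 10) / 10


grid = {"max_depth": [3, 5, 7], "min_samples_leaf": [1, 10, 50]}
best_params, best_score = grid_search(grid, toy_score)
```

The cost grows with the product of the list lengths (nine evaluations here), which is why heuristic searches such as genetic algorithms are considered for larger parameter spaces.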
Empirical analysis
In this section, we first describe the dataset that is used to train and test our model. Then we present the criteria used to evaluate the performance of the proposed model. After that, we present parameter selection and optimization. Finally, we report the experimental results using the predictions obtained.
Description of the dataset
In this study, we use data obtained from a major bank in Taiwan [17]. The dataset includes the records of 10713 customers. These records include customer information such as the amount of credit, gender, education level, marital status, age, past payment records, invoice amount and amount of credit card payments. We use these customer data as the input variables of our model. Details about the variables in the dataset are shown in Table 4.
Description of dataset
We use two different methods to divide the dataset into two parts for model training and model validation. The first method is cross validation and the second is the hold-out method.
k-fold Cross Validation: The steps of the k-fold cross validation method, which is a variant of the Monte-Carlo cross validation method, are as follows [4]:
First step: The elements of dataset X are divided into K sets of approximately equal size. The elements of the k-th set constitute the validation set $X_{val}$. The other sets constitute the learning dataset $X_{learn}$.
Second step: The model g is trained using $X_{learn}$ and the error $E_k(g)$ is calculated with the following equation:
Third step: The first and second steps are repeated for each value of k from 1 to K. The average error is calculated by the following equation:
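The three steps can be sketched as follows. This is a minimal illustration, not the procedure implemented in Weka: it assumes that the records are already in random order, that $E_k(g)$ is the misclassification rate on the k-th validation set, and that the average error is the plain mean of the K fold errors. All function names, including the `majority_class_trainer` baseline, are hypothetical.

```python
def k_fold_indices(n_samples, n_folds):
    """Split range(n_samples) into n_folds validation sets of near-equal size."""
    base, extra = divmod(n_samples, n_folds)
    folds, start = [], 0
    for k in range(n_folds):
        size = base + (1 if k < extra else 0)  # first `extra` folds get one more
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_validate(features, labels, n_folds, train_fn):
    """Return the per-fold errors E_k(g) and their average over the K folds."""
    fold_errors = []
    for validation_idx in k_fold_indices(len(labels), n_folds):
        held_out = set(validation_idx)
        learn_idx = [i for i in range(len(labels)) if i not in held_out]

        # Train g on X_learn only.
        model = train_fn([features[i] for i in learn_idx],
                         [labels[i] for i in learn_idx])

        # E_k(g): share of misclassified records in X_val.
        wrong = sum(model(features[i]) != labels[i] for i in validation_idx)
        fold_errors.append(wrong / len(validation_idx))

    return fold_errors, sum(fold_errors) / len(fold_errors)


def majority_class_trainer(train_features, train_labels):
    """Hypothetical baseline: always predict the most frequent training label."""
    most_common = max(set(train_labels), key=train_labels.count)
    return lambda x: most_common
```

Because every record is used for validation exactly once, the averaged error depends less on a single lucky or unlucky split than a one-off partition does.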
Hold-out: The hold-out method divides a dataset randomly into two groups at a given ratio. One group is used to train the model and the other group is used to test it [46].
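A hold-out split can be sketched in a few lines. The example below uses the dataset size reported in this study (10713 records) and an 80% training share; the function name, the fixed random seed and the use of integer record indices are assumptions made for illustration.

```python
import random


def hold_out_split(records, train_ratio=0.8, seed=0):
    """Randomly split records into a training group and a test group."""
    shuffled = list(records)               # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # seeded for a reproducible split
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


train, test = hold_out_split(range(10713), train_ratio=0.8)
```

With these inputs, 8570 records go to the training group and the remaining 2143 to the test group.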
In order to evaluate the performance of the proposed algorithms, performance criteria such as the kappa statistic, precision, recall, F-measure, ROC (Receiver Operating Characteristic) area, PRC (Precision-Recall Curve) area, mean absolute error and, especially, the correctly classified rate are used. The confusion matrix required for calculating the performance criteria is shown in Table 5:
Confusion matrix
Correctly Classified Rate: Correctly classified rate, which is one of the widely used performance criteria, shows the prediction success. To calculate the correctly classified rate, the following equation is used:
Kappa Statistics: The Kappa statistic, which is one of the performance criteria used to evaluate the proposed model, measures the degree of consistency between the predicted and observed values [1]. The Kappa statistic is obtained from the following equation:
Here $O_{ij}$ denotes the observed values and $E_{ij}$ the expected values.
Precision: Precision can be expressed as the probability that the detected structural change points are correct [24]. Precision is obtained from the following equation:
Recall: Recall measures how many of the true roles were extracted by the model [5]. Recall is calculated by the following equation:
F-Measure: The F-measure is defined as the harmonic mean of the precision and recall performance metrics [7]. To calculate the F-measure, the following equation is used:
ROC Area: The area under the ROC curve is one of the commonly used performance metrics to indicate overall discrimination. The ROC area varies between 0.5 (no discrimination) and 1.0 (perfect discrimination) [23].
PRC Area: The area under the precision-recall curve is a performance metric for understanding the quality of the model when the dataset contains imbalanced classes. A value of this metric closer to 1 indicates a good classifier [34].
Mean Absolute Error: Mean absolute error is a measure of the difference between the predicted value and the observed value and is calculated by the following equation. The closer this measure is to zero, the better the prediction.
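Most of these criteria follow directly from the four cells of a two-class confusion matrix. The sketch below uses the standard definitions, which we assume agree with the equations of this section; the cell names (`tp`, `fp`, `fn`, `tn`) and the example counts are made up for illustration and are not results from this study.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute the criteria above from the cells of a 2x2 confusion matrix.

    Assumes every denominator is non-zero (both classes occur and are predicted).
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total  # correctly classified rate
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)

    # Kappa: observed agreement corrected for the agreement expected by chance.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / total**2
    kappa = (accuracy - expected) / (1 - expected)

    return {"accuracy": accuracy, "precision": precision, "recall": recall,
            "f_measure": f_measure, "kappa": kappa}


def mean_absolute_error(predicted, observed):
    """Average absolute difference between predicted and observed values."""
    return sum(abs(p - o) for p, o in zip(predicted, observed)) / len(observed)


# Made-up counts for illustration only (not results from this study).
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=30)
```

For these counts the correctly classified rate is 0.7 while kappa is only 0.4, which shows how kappa discounts the agreement that a classifier would reach by chance.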
In this study, after executing the ANN and SVM artificial intelligence algorithms as well as the CART and C4.5 decision tree algorithms to predict the customers’ payment status for the coming months, we also optimize the parameters of each algorithm in order to improve prediction accuracy. We aim to show whether modifying the default values of the parameters has a valuable effect on the performance of each algorithm. We use the Weka 3.8 software to make predictions with the different machine learning algorithms and Weka’s CVParameterSelection to perform parameter selection for our four classification algorithms. Based on the literature review on parameter selection and optimization, we select the parameters shown in Table 6 to improve the performance of our proposed model.
Parameter selection
The algorithms are evaluated using the performance metrics for each training and test set. The results of the algorithms according to the performance criteria are shown in Tables 7 and 8. The ROC and PRC curves, which show the performance of the algorithms, are given in Figs. 1 and 2.
Results of algorithms with respect to performance criteria for 10-fold cross-validation and 80% split
Results of algorithms with respect to performance criteria for parameter optimization

ROC curve and PRC curve for 10-fold cross validation and 80% split.

ROC curve and PRC curve for 10-fold cross validation with optimized parameters.
As shown in Table 7, when the dataset is divided into training and test sets using 10-fold cross validation and the proposed algorithms are applied, the CART algorithm gives the best result, with a correct classification rate of 85.709%, which is one of our most important performance metrics. Although the SVM algorithm gives the best result for the mean absolute error and the ANN algorithm gives the best results for the ROC area and PRC area, the CART algorithm can be said to have the best forecasting success among the proposed algorithms when all performance criteria are considered.
When the dataset is divided into training and test sets using an 80% split, as seen in Table 7, the CART algorithm has the best result, with a correct classification rate of 84.2277%. While the SVM algorithm yields the best result for the mean absolute error and the ANN algorithm for the kappa statistic, the CART algorithm gives good results for all other performance measures.
After executing the algorithms with default parameter settings, we optimize some important parameters for each algorithm to improve the performance of the models. Because 10-fold cross validation has a positive effect on the results compared to the split method, we only optimize parameters for 10-fold cross validation. We execute the SVM, ANN, CART and C4.5 algorithms using the default parameter settings and using the selected parameters optimized separately and together. The results show that the prediction performance of the models improves after parameter optimization, as measured by the improvement in the correctly classified rate. The improvements for SVM, ANN and C4.5 are limited, whereas the improvement in prediction accuracy for CART is larger.
According to the optimized final results in Table 8, the CART algorithm with the minimum samples per leaf parameter setting has the best prediction result, with a correct classification rate of 85.803%. The SVM algorithm has the best result for the mean absolute error, and the ANN algorithm has the best results for the ROC area and PRC area. These results are in line with the 10-fold cross validation results. Moreover, parameter optimization improves the performance criteria for all algorithms.
Risk is a concept that is present in every field and every sector and seriously affects competition in today’s world. Predicting customers’ behavior is significant for managing risks in the banking sector, as in other sectors and fields. For this reason, we aim to propose models that predict with high accuracy whether credit card users will pay their debts.
We use the ANN and SVM algorithms as traditional artificial intelligence algorithms, and CART and C4.5, which are widely used decision tree algorithms, to forecast the customers’ payment status for the next month. We apply the cross validation and hold-out methods to divide the dataset into training and test sets, and we evaluate the algorithms with the proposed performance metrics. Moreover, we present a literature review on parameter optimization for these algorithms and compare the improvements obtained for each.
The results show that the model built with the CART algorithm, one of the most basic decision tree algorithms, provides the highest accuracy (about 86%) in forecasting the customers’ payment status for the next month. The model established by the CART algorithm has results comparable with those of the ANN and SVM algorithms, which are traditional artificial intelligence algorithms frequently used in prediction models. Furthermore, the CART algorithm can be considered an effective and preferable algorithm for this problem, since building and applying the model is simpler than for the other proposed models. In addition, the 10-fold cross validation method had a positive effect on the results compared to the split method, which suggests that cross validation provided better learning for this problem.
After executing the algorithms with default parameter settings, we optimize some important parameters for each algorithm to improve the performance of the models. Because we obtain higher accuracy using 10-fold cross-validation method compared to the split method, we only optimize parameters for 10-fold cross validation. Experimental results demonstrate that the optimized parameter setting for all algorithms gives better performance for customer’s payment status prediction model.
Future studies can be conducted in the following directions. First, a sensitivity analysis can be carried out for each parameter selected in the present study to ensure the robustness of the proposed models. Furthermore, different parameters of ANN, SVM, CART and C4.5 can be selected for optimization. In this study, we use Weka’s CVParameterSelection to perform parameter selection for our four classification algorithms. In future studies, different optimization techniques such as grid search, genetic algorithms, particle swarm optimization and the SS-based approach can be used, and the optimization performance of these techniques can be compared. Second, different machine learning methods or ensemble machine learning methods such as Bagging and AdaBoost can be used to establish a model for forecasting the customers’ payment status. Finally, the proposed model can be applied to other real-world problems to determine whether it can solve them efficiently.
