A novel SSA-CatBoost machine learning model for credit rating

Abstract

Categorical Boost (CatBoost) is a new approach in credit rating. When CatBoost is used for classification and prediction, parameter tuning and feature selection are two crucial steps that significantly affect its classification accuracy. This paper proposes a novel SSA-CatBoost model, which combines the Sparrow Search Algorithm (SSA) with CatBoost to improve classification and prediction accuracy for credit rating. For parameter tuning, SSA-CatBoost obtains the optimal parameters by iteratively updating the sparrows' positions and uses them to improve the accuracy of classification and prediction. For feature selection, a novel wrapper method called the Recursive Feature Elimination algorithm is adopted to reduce the adverse impact of noisy data on the results and to further improve computational efficiency. To evaluate the proposed SSA-CatBoost model, P2P lending datasets are employed to assess the prediction results, and the interpretable Shap package is then used to explain why the proposed model considers a sample good or bad. The experimental results, obtained by comparing the SSA-CatBoost model with the CatBoost model and other well-known machine learning models, show that SSA-CatBoost achieves ideal accuracy in classification and prediction for credit rating.

Keywords

CatBoost; sparrow search algorithm; parameter tuning; feature selection; credit rating

1 Introduction

Classification and prediction are the core tasks of machine learning in credit rating. In recent years, a variety of gradient-boosting methods have been developed for building classification and prediction models. Among them, Categorical Boost (CatBoost), which is based on the Gradient Boosting Decision Tree (GBDT), already holds a pivotal position in the field of machine learning thanks to its small number of parameters and superior algorithms. Prokhorenkova et al. [1] first proposed CatBoost in 2017. They pointed out that the most effective way to minimize the loss of information when dealing with categorical features is to use Target Statistics (TS) as a new numerical feature [2], and demonstrated that CatBoost outperforms leading GBDT packages. To reduce the gradient bias and speed up training and scoring, Dorogush et al. [3] modified the algorithm for selecting the tree structure, adopted oblivious trees to facilitate the calculation of leaf values, and used GPU acceleration to optimize the CatBoost model. Since then, CatBoost has enriched the family of machine learning models, and many researchers have put it into practical use. Jabeur et al. [7] compared CatBoost with eight machine learning models, including SVM and Random Forest, for rating companies' credit levels, and reported that CatBoost significantly improves classification performance. Li et al. [5] utilized CatBoost to measure credit risk in P2P lending and found that it was superior to SVM, Random Forest and other traditional machine learning models. Ibrahim et al. [8] discussed the performance of CatBoost, Random Forest and other machine learning models, and recommended CatBoost for better prediction of loan approvals. Izotova and Valiullin [6] used the classification algorithm in the CatBoost model to solve the problem of fraud detection.
The works mentioned above have obtained promising results, and numerous further studies on CatBoost have been reported in [9–12]. However, none of these papers addresses parameter tuning, an essential part of the classification and prediction process that directly affects the performance of a machine learning model. With appropriate parameters, the results in these papers could be improved further. The trial-and-error method is usually applied to find the optimal parameters, but it is inefficient, and the resulting parameters are most likely not optimal. Swarm intelligence algorithms are an excellent approach for optimizing the parameters of machine learning models: they significantly improve computational efficiency and adjust the parameters more accurately. Hence, many researchers use swarm-intelligence-based methods to optimize the parameters of machine learning models. Dong et al. [13] combined the Bat algorithm with the CatBoost model to predict pan evaporation in northwest China. Huang and Dun [14] combined the Particle Swarm Optimization (PSO) algorithm with the SVM model to perform feature selection and parameter optimization. Barmana and Choudhur [15] hybridized the Grey Wolf Optimizer (GWO) with SVM to predict the power system load in the Indian state of Assam. Many successful examples of combining machine learning methods with swarm intelligence algorithms have been published [16–27] and have achieved remarkable results. Instead of PSO and GWO, this study tries a more advanced swarm algorithm, the Sparrow Search Algorithm (SSA), which was proposed by Xue and Shen [4] in 2020 and is inspired by swarm wisdom. With its high computational efficiency and fast convergence speed, SSA has been widely used in machine learning and other fields.
SSA is derived from the predation and anti-predation behavior of sparrows and randomly selects producers from the entire population. When the surrounding danger is below the safety threshold, the producer continues to issue alarms to warn the surrounding sparrows. Otherwise, the producer takes anti-predation action and leads the sparrows to a safe position. Finally, the best fitness and the best position of the entire population are found through multiple iterations of the sparrow positions. By comparing SSA with GWO, PSO, and the Gravitational Search Algorithm (GSA), Xue and Shen [4] found that SSA has clear advantages in accuracy, stability, and other aspects.

Another crucial issue in improving the performance of a machine learning model is feature selection, which has a direct effect on classification and prediction accuracy. A classification problem usually needs abundant features so that the model can make correct judgments; however, some features in the dataset may be useless or redundant. If such features are retained and the dataset is used to train a machine learning model, classification accuracy will decrease. Feature selection is therefore required when dealing with complicated data, and many studies have been reported. Li et al. [24] proposed a feature selection method in which a chaotic search algorithm is embedded in the search iterations of GSA to optimize the feature subsets. Wang and Ku [31] employed a correlation-based filter method because it can generate a general feature subset; 22 of 28 features were selected, and an ideal result was achieved in corporate credit rating.

In this paper, motivated by Huang and Dun [14] and other studies [16–27], SSA is adopted as the optimization algorithm to tune the regularization parameter of CatBoost, forming a hybrid SSA-CatBoost model for a personal credit rating experiment. Furthermore, a feature selection algorithm named Recursive Feature Elimination is utilized to improve classification accuracy. By comparing SSA-CatBoost with CatBoost, XGBoost, and other machine learning models, we find that it has better classification and prediction performance in our experiment.

This paper is organized as follows: Section 2 describes the related works of CatBoost and SSA. Section 3 illustrates the specific process of the SSA-CatBoost model. Section 4 describes the source and processing of the data. Section 5 gives some experiments which verify the effectiveness of the SSA-CatBoost model. Conclusions are finally drawn in Section 6.

2 Related works

2.1 CatBoost

2.1.1 CatBoost classifier

Assume we observe a dataset of examples $D = \{(x_{i}, y_{i})\}_{i = 1..n}$, where $x_{i} = (x_{i}^{1}, \dots, x_{i}^{m})$ is a random vector of m features and y_i is a target. Following Prokhorenkova et al. [1], we build a sequence of approximations F^T iteratively, where T = 0, 1, ⋯. Each F^T is obtained from the previous approximation F^{T-1} in an additive manner: $F^{T} = F^{T - 1} + α h^{T}$, where α is the step size. The goal of the learning task is to train a function F, choosing each h^T from a family of functions H so as to minimize the expected value of the loss function L (f (x), y), and then to apply F to the test task. Therefore, h^T is given as follows.

$h^{T} = \underset{h \in H}{arg min} E L (F^{T - 1} (x) + h (x), y)$ (1)

There are two common approaches to minimizing the expected loss, the Newton method and the negative gradient method [1], both of which are gradient descent methods. Since the Newton method uses the second-order partial derivative to calculate the negative gradient, it improves the quality of the classification results, and we therefore focus on the Newton method in this paper. According to the documentation [28], CatBoost uses the following loss function: $\begin{matrix} L (f (x), y) = \sum_{i} w_{i} \cdot l (f (x_{i}), y_{i}) + J (f) \end{matrix}$ (2) where l (f (x_i), y_i) is the value of the loss function at point (x_i, y_i), w_i is the weight of the i-th object, and J (f) is the regularization part. To minimize the loss function L, a Taylor expansion is applied at the point (a^{t-1}, y), and the loss function is expressed in the following form: $\begin{matrix} L (a_{i}^{t - 1} + φ, y) \approx \sum w_{i} [l_{i} + l_{i}^{'} φ + \frac{1}{2} l_{i}^{''} φ^{2}] + \frac{1}{2} λ ∥ φ ∥_{2} \end{matrix}$ (3) where $l_{i} = l (a_{i}^{t - 1}, y_{i})$, and $l_{i}^{'} = - \frac{\partial l (a, y_{i})}{\partial a} |_{a = a_{i}^{t - 1}}$ and $l_{i}^{''} = - \frac{\partial^{2} l (a, y_{i})}{\partial a^{2}} |_{a = a_{i}^{t - 1}}$ are the first-order and second-order partial derivatives of the loss function with respect to a at the point $(a_{i}^{t - 1}, y)$, respectively, and λ is the L2 regularization parameter. Then the least-squares approximation is used and h^T is: $h^{T} = \underset{h \in H}{arg min} E {(l_{i}^{''} - h (x))}^{2}$ (4)

In the internal structure of CatBoost, trees are built sequentially, and each new tree is built to approximate the negative gradients $l_{i}^{''}$ of the loss function l at the predictions of the current ensemble. Thus, fitting $l_{i}^{''}$ amounts to a gradient descent optimization of the loss function L, and the quality of this gradient descent is measured by the score function, which is explained in detail in Section 2.1.2. According to Prokhorenkova et al. [1], CatBoost uses a binary decision tree as the predictor. The decision tree is built by recursively dividing the feature space into several disjoint areas (tree nodes) based on the values of some splitting attribute. Therefore, the estimated output described by Prokhorenkova et al. [1] is: $h (x) = \sum_{j = 1}^{J} C_{j} \mathbb{1}_{\{x \in R_{j}\}}$ (5)

Here h (x) is a decision tree function of the explanatory variables x, and R_j are the disjoint regions corresponding to the leaves of the tree.

2.1.2 Score function

The score function is a key component of CatBoost: it measures the quality of the gradient approximation. When a new tree is to be added to the ensemble, the score function is used to evaluate the gradient descent optimization of each candidate tree. Four score functions are available: L2, Cosine, NewtonL2, and NewtonCosine. Based on the Newton method in Section 2.1.1, NewtonL2 uses second-order derivatives in the calculation, which can improve the quality of the model results. To derive NewtonL2 [28], we start from L2: $\begin{matrix} \text{L2 scoring function} = - \sum_{i} w_{i} \cdot {(s_{leaf} - g_{i})}^{2} \end{matrix}$ (6) where s_leaf denotes the optimal value of the leaf and w_i denotes the weight of the i-th object, as in equation (2). Suppose the tree is divided into a left side and a right side by the boundary value 𝕓; the L2 scoring function then takes the following form:

$\begin{matrix} \text{L2 scoring function} = \\ - (\sum_{i = 0}^{𝕓} w_{i} {(s_{left} - g_{i})}^{2} + \sum_{i = 𝕓}^{N} w_{i} {(s_{right} - g_{i})}^{2}) \end{matrix}$ (7)

The next step is to find the index i^* of the optimal feature and the suitable boundary 𝕓^* of the tree. From equation (3), after regrouping by left leaves and right leaves, the loss function takes the following form: $\begin{matrix} L_{left} \approx \sum_{i = 0}^{𝕓} [(\sum_{i = 0}^{𝕓} w_{i} l_{i}^{'}) s_{left} + \frac{1}{2} (\sum_{i = 0}^{𝕓} w_{i} l_{i}^{''} + λ) s_{left}^{2}] \end{matrix}$ (8) $\begin{matrix} L_{right} \approx \sum_{i = 𝕓}^{N} [(\sum_{i = 𝕓}^{N} w_{i} l_{i}^{'}) s_{right} + \frac{1}{2} (\sum_{i = 𝕓}^{N} w_{i} l_{i}^{''} + λ) s_{right}^{2}] \end{matrix}$ (9) So the optimal values of s_left and s_right are: $s_{left}^{*} = \frac{\sum_{i = 0}^{𝕓} w_{i} l_{i}^{'}}{\sum_{i = 0}^{𝕓} w_{i} l_{i}^{''} + λ}, s_{right}^{*} = \frac{\sum_{i = 𝕓}^{N} w_{i} l_{i}^{'}}{\sum_{i = 𝕓}^{N} w_{i} l_{i}^{''} + λ} .$ (10)

As expressed in formula (10), the regularization parameter λ is in the denominator of the leaf value, which means that the larger the L2 regularization parameter, the smaller the leaf value. A value of λ that is too large shrinks the leaf values excessively and tends to cause under-fitting, whereas a value that is too small leaves the leaf values almost unregularized and tends to cause over-fitting. Hence, it is necessary to find a suitable regularization parameter with SSA to achieve the best classification accuracy of the CatBoost model.
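The shrinking effect of λ in formula (10) can be checked numerically. The following is a minimal sketch, not CatBoost code: `leaf_value` is a hypothetical helper, and the weights and derivative values are made up purely for illustration.

```python
def leaf_value(weights, first_derivs, second_derivs, lam):
    """Leaf value from formula (10): sum(w * l') / (sum(w * l'') + lambda)."""
    numerator = sum(w * g for w, g in zip(weights, first_derivs))
    denominator = sum(w * h for w, h in zip(weights, second_derivs)) + lam
    return numerator / denominator


# Made-up values for three objects that fall into the same leaf.
weights = [1.0, 1.0, 1.0]
first_derivs = [0.5, 0.25, 0.75]    # l'_i
second_derivs = [0.25, 0.25, 0.5]   # l''_i

# Evaluate the same leaf under increasingly strong regularization.
leaf_values = {lam: leaf_value(weights, first_derivs, second_derivs, lam)
               for lam in (0.0, 1.0, 10.0)}
for lam, value in leaf_values.items():
    print(f"lambda={lam:5.1f}  leaf value={value:.4f}")
```

With these illustrative numbers the leaf value decreases monotonically as λ grows, which is the behaviour the search over L2_leaf_reg has to balance.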

Expanding the brackets of equation (7) and replacing s_left and s_right with $s_{left}^{*}$ and $s_{right}^{*}$, respectively, i^* and 𝕓^* can be written as: $\begin{matrix} i^{*}, 𝕓^{*} = \underset{i, 𝕓}{arg max} \sum_{i = 0}^{𝕓} w_{i} \cdot {(s_{left}^{*})}^{2} + \sum_{i = 𝕓}^{N} w_{i} \cdot {(s_{right}^{*})}^{2} \end{matrix}$ (11)

Thus, the NewtonL2 score function is expressed as follows: $\begin{matrix} \text{NewtonL2 score function} = \\ - (\sum_{i = 0}^{𝕓^{*}} w_{i} {(s_{left}^{*} - g_{i})}^{2} + \sum_{i = 𝕓^{*}}^{N} w_{i} {(s_{right}^{*} - g_{i})}^{2}) \end{matrix}$ (12)

2.2 Sparrow search algorithm

Following [4], a detailed description of SSA is given. Suppose the matrix X_{i,j} represents the positions of the sparrows, where i is the number of sparrows and j is the dimension of the parameters to be optimized. The fitness of all sparrows can then be described by F_X: $F_{X} = \begin{bmatrix} f ([x_{1, 1} & x_{1, 2} & \dots & \dots & x_{1, j}]) \\ f ([x_{2, 1} & x_{2, 2} & \dots & \dots & x_{2, j}]) \\ ⋮ & ⋮ & ⋮ & ⋮ & ⋮ \\ f ([x_{i, 1} & x_{i, 2} & \dots & \dots & x_{i, j}]) \end{bmatrix}$ where each row of F_X gives the fitness value of one sparrow. The location of the producer is then updated as follows: $X_{i, j} = \begin{cases} X_{i, j} \cdot \exp (\frac{- i}{Rand \cdot MaxIter}) & \text{if } R < S \\ X_{i, j} + Q \cdot L & \text{if } R ⩾ S \end{cases}$ (13) where MaxIter is the maximum number of iterations, Rand ∈ (0,1] is a random number, R ∈ (0,1] and S ∈ (0.5,1] are the alarm value and the safety threshold, respectively, Q is a random number that obeys the normal distribution, and L represents a 1×D matrix in which each element is 1. When R < S, there is no danger around the sparrow. Otherwise, the producer discovers the danger and flies to other safe places.

Once the location of the producer is updated, the position of the scrounger also changes. If $i ⩽ \frac{n}{2}$, the fitness value of the i-th scrounger is such that it is unlikely to become prey, and the formula for its position movement is: $X_{p} + | X_{i, j} - X_{p} | \cdot A^{+} \cdot L$ where X_p is the optimal position occupied by the producer, A represents a matrix filled randomly with 1 and -1, and A⁺ = A^T (AA^T)^{-1}. After calculation, a more explicit formula for the position movement of the scrounger can be obtained, where X_worst is the current global worst position: $X_{i, j} = \begin{cases} Q \cdot \exp (\frac{X_{worst} - X_{i, j}}{i^{2}}) & \text{if } i > \frac{n}{2} \\ X_{p} + \frac{1}{D} \sum_{d = 1}^{D} (rand \{- 1, 1\} \cdot | X_{i, d} - X_{p} |) & \text{if } i ⩽ \frac{n}{2} \end{cases}$ (14) Based on the above, the position update for the sparrows that are aware of the danger is expressed as follows: $X_{i, j} = \begin{cases} X_{best} + β \cdot | X_{i, j} - X_{best} | & \text{if } f_{i} > f_{g} \\ X_{i, j} + K \cdot (\frac{| X_{i, j} - X_{worst} |}{(f_{i} - f_{w}) + ɛ}) & \text{if } f_{i} = f_{g} \end{cases}$ (15) where X_best is the current global optimal position, β is a random number that obeys a normal distribution, K ∈ [-1, 1] is a random number, f_i is the current fitness value of the sparrow, f_g and f_w are the current global best and worst fitness values, respectively, and ɛ is a small constant that prevents the denominator from being 0. When f_i > f_g, the sparrow is at the edge of the group. If f_i = f_g, the sparrow is in the middle of the group, is aware of the danger, and needs to move closer to other sparrows. Based on the above introduction, the basic steps of SSA can be summarized as the pseudocode shown in Algorithm 1.

Algorithm 1 Sparrow Search Algorithm
Input: pop, MaxIter, ST, PD, SD
Output: Fitness, Best Position
while i < MaxIter do
    Find the current best individual and the current worst individual.
    for j = 1, 2, …, PD do
        Using equation (13), update the producer's location
    end for
    for j = (PD + 1), …, pop do
        Using equation (14), update the scrounger's location
    end for
    for j = 1, 2, …, SD do
        Using equation (15), update the sparrow's location
    end for
    Get the current new location
    If the new location is better than before, update it
    i = i + 1
end while
return Fitness, Best Position
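Algorithm 1 can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: the function name `sparrow_search`, its default values, the clipping of positions to a box [lower, upper], the cap on the exponent in equation (14), and the random choice of danger-aware sparrows are all assumptions introduced here.

```python
import math
import random


def sparrow_search(objective, lower, upper, dim=1, pop=20, max_iter=50,
                   st=0.8, pd_frac=0.2, sd_frac=0.1, seed=0):
    """Minimize `objective` over [lower, upper]^dim with a basic SSA loop."""
    rng = random.Random(seed)
    eps = 1e-12

    def clip(vec):
        return [min(max(v, lower), upper) for v in vec]

    X = [[rng.uniform(lower, upper) for _ in range(dim)] for _ in range(pop)]
    fit = [objective(x) for x in X]
    n_prod = max(1, int(pd_frac * pop))    # number of producers
    n_aware = max(1, int(sd_frac * pop))   # number of danger-aware sparrows

    for _ in range(max_iter):
        # Rank the sparrows so that index 0 is the current best individual.
        order = sorted(range(pop), key=lambda k: fit[k])
        X = [X[k] for k in order]
        fit = [fit[k] for k in order]
        x_best, f_best = X[0], fit[0]
        x_worst, f_worst = X[-1], fit[-1]
        new_X = [x[:] for x in X]

        # Producers, equation (13).
        alarm = rng.random()
        for i in range(n_prod):
            if alarm < st:
                rand = 1.0 - rng.random()  # uniform in (0, 1]
                scale = math.exp(-(i + 1) / (rand * max_iter))
                new_X[i] = clip([v * scale for v in X[i]])
            else:
                q = rng.gauss(0.0, 1.0)    # Q, with L a vector of ones
                new_X[i] = clip([v + q for v in X[i]])
        # Best producer position X_p (costs extra objective evaluations).
        x_p = min(new_X[:n_prod], key=objective)

        # Scroungers, equation (14).
        for i in range(n_prod, pop):
            rank = i + 1
            if rank > pop / 2:
                q = rng.gauss(0.0, 1.0)
                # The exponent is capped only to avoid floating-point overflow.
                new_X[i] = clip([q * math.exp(min((w - v) / rank ** 2, 50.0))
                                 for v, w in zip(X[i], x_worst)])
            else:
                step = sum(rng.choice((-1.0, 1.0)) * abs(v - p)
                           for v, p in zip(X[i], x_p)) / dim
                new_X[i] = clip([p + step for p in x_p])

        # Sparrows that perceive danger, equation (15).
        for i in rng.sample(range(pop), n_aware):
            if fit[i] > f_best:
                beta = rng.gauss(0.0, 1.0)
                new_X[i] = clip([b + beta * abs(v - b)
                                 for v, b in zip(X[i], x_best)])
            else:
                k = rng.uniform(-1.0, 1.0)
                denom = (fit[i] - f_worst + eps) or eps
                new_X[i] = clip([v + k * abs(v - w) / denom
                                 for v, w in zip(X[i], x_worst)])

        # Greedy selection: keep a new location only if it is better.
        for i in range(pop):
            f_new = objective(new_X[i])
            if f_new < fit[i]:
                X[i], fit[i] = new_X[i], f_new

    best = min(range(pop), key=lambda k: fit[k])
    return fit[best], X[best]


# Toy usage: a one-dimensional quadratic with its minimum at 3.
best_fitness, best_position = sparrow_search(
    lambda x: (x[0] - 3.0) ** 2, lower=0.0, upper=10.0, max_iter=30)
print(best_fitness, best_position)
```

Because of the greedy selection step, the best fitness in this sketch never gets worse from one iteration to the next. For an expensive objective, the fitness of each producer would normally be cached instead of being re-evaluated.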

3 An SSA-based parameters optimization

3.1 Fitness definition

As mentioned in Section 2.1.2, an improper regularization parameter leads to over-fitting or under-fitting of the model. To find an appropriate regularization parameter L2_leaf_reg, SSA is used to optimize the regularization parameter of the CatBoost model. To combine SSA and CatBoost, the next step is to define the Fitness, which measures the prediction error of SSA-CatBoost in each iteration. First, the dataset is randomly divided into K subsets, and a CatBoost model is built for each subset. The Fitness is then defined as one minus the average prediction accuracy of the K CatBoost models, that is: $\begin{matrix} Fitness = 1 - \frac{1}{K} \sum_{k = 1}^{K} acc_{k} \end{matrix}$ (16) where k denotes the k-th subset and acc_k is the prediction accuracy of the CatBoost model built on that subset. Evidently, the smaller the Fitness, the higher the classification accuracy of the CatBoost model.
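Equation (16) can be written as a small function. The sketch below is illustrative only: `fitness`, `train_and_score`, and the fold representation are hypothetical names, and the stub scorer returns made-up accuracies instead of training a model. In practice the scorer would presumably fit a `catboost.CatBoostClassifier` with the given `l2_leaf_reg` on the training part of a subset and return its accuracy on the test part.

```python
def fitness(l2_leaf_reg, folds, train_and_score):
    """Equation (16): one minus the mean accuracy over the K subsets."""
    accuracies = [train_and_score(l2_leaf_reg, train, test)
                  for train, test in folds]
    return 1.0 - sum(accuracies) / len(accuracies)


# Stub scorer with made-up accuracies; it stands in for model training.
made_up_accuracy = {"subset_0": 0.90, "subset_1": 0.80, "subset_2": 0.85}


def stub_train_and_score(l2_leaf_reg, train, test):
    # `train` is unused here; a real scorer would fit a model on it.
    return made_up_accuracy[test]


folds = [("train_0", "subset_0"), ("train_1", "subset_1"), ("train_2", "subset_2")]
fitness_value = fitness(3.0, folds, stub_train_and_score)
print(f"Fitness = {fitness_value:.4f}")
```

Passing the scorer as a callable keeps the fitness definition independent of the model library, so the same function can be handed to any optimizer that minimizes its objective.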

3.2 The procedures of SSA-CatBoost model

This section gives a detailed description of the hybrid SSA-CatBoost model. For simplicity, the essential parameters are listed and explained in Table 1.

Table 1
Essential parameters of SSA-CatBoost

Model	Parameters	Explanations
CatBoost	Iterations	Number of iterations, i.e., the maximum number of trees built in the learning process
	Learning_rate	The learning rate, used for reducing the gradient step
	L2_leaf_reg	L2 regularization parameter, related to the cost function
	Score_function	Score that is used during tree construction to select the next tree split
	Task_type	During the training process, the L2 score function can only be used on the GPU
SSA	Pop	The number of sparrows
	MaxIter	The maximum number of iterations
	ST	The safety threshold, which is compared with the alarm value
	PD	The percentage of producers
	SD	The percentage of sparrows who perceive the danger

Next, the procedures of SSA-CatBoost are given as follows:

Step 1. Data preparation: The research dataset is randomly divided into K subsets, each of which contains a training set and a test set. The training set is used to build a basic CatBoost model. The test set is used to evaluate the classification accuracy of the CatBoost model.

Step 2. SSA parameter setting and initialization: Set the SSA parameters, including the number of sparrows, the safety threshold, the proportion of producers, the proportion of sparrows aware of the danger, and the maximum number of iterations. Then generate a random initial value of L2_leaf_reg via SSA.

Step 3. Set iteration i = i + 1.

Step 4. CatBoost model training and prediction: NewtonL2 is assigned to the score function, and L2_leaf_reg is given by SSA. The basic CatBoost model is built on the training set and then used to make predictions on the test set. The average prediction accuracy is then obtained.

Step 5. Fitness calculation: After the CatBoost model has been trained on the training set and has made predictions on the test set, formula (16) is used to calculate the Fitness, which is the global fitness of the sparrows in SSA.

Step 6. Sparrow position updating: All the sparrows move according to the fitness value and update their positions following formulas (13), (14) and (15). Each sparrow moves to a new position, and the new position produces a new L2 regularization parameter L2_leaf_reg.

Step 7. Parameter optimization results checking: The new L2 regularization parameter is employed to train the CatBoost model, and the trained model is used to obtain the prediction accuracy. Then, formula (16) is adopted to calculate the new Fitness value. If the new Fitness value is smaller than the former one, replace the Fitness value with the new one and keep the regularization parameter that corresponds to it; otherwise, discard the regularization parameter that corresponds to the new Fitness value and keep the former Fitness value while searching for a smaller one, until the condition of Step 8 is met.

Step 8. End condition checking: If the current number of iterations is less than the maximum number of iterations, return to Step 3; otherwise, proceed to the next step.

Step 9. End the SSA-CatBoost model.
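The accept-or-discard logic of Steps 3 to 8 can be sketched as a short loop. This is only an illustration of the control flow: `tune_regularization` is a hypothetical name, the fitness curve is made up, and the random perturbation used as `propose` is a plain stand-in for the SSA position update, not SSA itself.

```python
import random


def tune_regularization(fitness, propose, initial_reg, max_iter):
    """Keep the regularization value with the smallest Fitness seen so far."""
    best_reg = initial_reg
    best_fit = fitness(best_reg)
    history = [best_fit]
    for _ in range(max_iter):            # Steps 3 and 8: iteration control
        candidate = propose(best_reg)    # Step 6: new position (stand-in move)
        cand_fit = fitness(candidate)    # Steps 4 and 5: train, predict, score
        if cand_fit < best_fit:          # Step 7: keep only improvements
            best_reg, best_fit = candidate, cand_fit
        history.append(best_fit)
    return best_reg, best_fit, history


rng = random.Random(0)


def toy_fitness(reg):
    # Made-up error curve with its minimum at reg = 3.
    return 0.10 + 0.01 * (reg - 3.0) ** 2


def toy_propose(reg):
    # Random perturbation kept non-negative; NOT the SSA update rule.
    return max(0.0, reg + rng.uniform(-1.0, 1.0))


best_reg, best_fit, history = tune_regularization(
    toy_fitness, toy_propose, initial_reg=8.0, max_iter=50)
print(best_reg, best_fit)
```

The recorded `history` is non-increasing by construction, which is the shape one would expect of a fitness curve plotted against the iteration number.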

Based on the above, the flowchart of the SSA-CatBoost model is summarized in Fig. 1.

Fig. 1

Flowchart of the SSA-CatBoost model.

4 Numerical illustrations

4.1 Data description

In this paper, data from the US company Lending Club are employed to evaluate the performance of the SSA-CatBoost model, and 28339 transaction records enter our research. Before these data are used to build the SSA-CatBoost model, borrower features with missing values are filtered out to ensure that the data are authoritative and valid. The borrower's personal information is then summarized by 81 features. Among them, the feature 'grade' is divided into three categories, A, B, and C, and each category is further divided into five sub-categories. That is, Lending Club uses these 15 sub-categories to identify the credit ratings of borrowers. Figure 2 shows the percentage of each sub-category.

Fig. 2

Proportion of each sub-category.

4.2 Feature selection using Recursive Feature Elimination algorithm

Feature selection plays a key role in credit rating. Among all the features of a borrower's personal information, some contribute substantially to classification and prediction, while others may be minor or redundant. If the dataset contains minor or redundant features and is used to train the CatBoost model, computation takes longer and the accuracy of classification and prediction decreases. Feature selection methods fall into three groups: wrapper methods, embedded methods and filter methods. Wrapper methods use the weights of the features in the trained model to eliminate features with lower weight and retain features with higher weight. The selected subset consists of the features with higher weight and is used to train the machine learning model with the best performance [29]. Embedded methods eliminate minor and redundant features during the training of the model. Rodriguez-Galiano et al. [29] used wrapper and embedded methods to evaluate the predictive strength of every feature, and the results show that the wrapper method has a lower mean misclassification error than the embedded method, although it requires a longer computational time. Filter methods use the correlation between the features and the target variable to eliminate minor and redundant features. Rodriguez-Galiano et al. [29] also point out that the subset selected by a filter method may either increase or decrease the classification accuracy of the machine learning model.

As mentioned above, the wrapper method performs better than the other two feature selection methods. Among the wrapper methods, the Recursive Feature Elimination algorithm is more advanced than the others. It trains the CatBoost model to obtain the weight of every feature and eliminates noisy features over multiple iterations to complete the feature selection. Thus, the Recursive Feature Elimination algorithm is applied to select features in this paper. A detailed description of the algorithm is given below.

Using dataset $D$, the Recursive Feature Elimination algorithm trains a CatBoost model, obtains a score for every feature from the trained model, and eliminates the features with the worst scores. Three algorithms are available in Recursive Feature Elimination for calculating feature scores: Prediction Values Change, Loss Function Change and Shap Values. Prediction Values Change is the fastest algorithm for calculating the feature scores, but its accuracy is the lowest. Loss Function Change balances calculation speed and accuracy and achieves a better result than Prediction Values Change. The Shap Values algorithm takes the longest time but achieves the highest accuracy. Thus, the Shap Values algorithm is employed to calculate the feature scores. A brief description of the Shap Values algorithm is given as follows.

Recall that $x_{i} = (x_{i}^{1}, \dots, x_{i}^{j}, \dots, x_{i}^{m})$ is a vector of m features (see Section 2.1.1), where i (i = 1, 2, . . . , n) denotes the i-th borrower. The Shap Values algorithm calculates the contribution vector $v_{i} = (v_{i}^{1}, \dots, v_{i}^{j}, \dots, v_{i}^{m})$ for each x_i, where $v_{i}^{j}$ represents the contribution of the j-th feature to the classification result. Denote by a_i the contribution of all features for the i-th borrower, i.e. $\begin{matrix} a_{i} = \sum_{j = 1}^{m} v_{i}^{j} \end{matrix}$ (17) where m is the total number of features. To calculate the score of the j-th feature for the i-th borrower, we need the current loss value L_i and the loss value H_i obtained without the j-th feature. The loss function (as mentioned in Section 2.1.1) is employed to calculate L_i and H_i: $\begin{matrix} L_{i} = l (a_{i} - ɛ, y_{i}) \end{matrix}$ (18) $H_{i} = l (a_{i} - ɛ - v_{i}^{j}, y_{i})$ (19) where ɛ is the sum of the contributions of the eliminated features to the classification results. Thus, based on formulas (18) and (19), the score of the j-th feature for the i-th borrower can be expressed as follows: $\begin{matrix} S_{j} = H_{i} - L_{i} \end{matrix}$ (20) Then, the score of the j-th feature can be expressed as the sum of the scores over all borrowers: $\begin{matrix} Score_{j} = \sum_{i = 1}^{n} (H_{i} - L_{i}) \end{matrix} .$ (21) where n is the total number of borrowers. Table 3 ranks all the features by their scores obtained through the feature selection. According to these scores, we selected the 20 features with the highest scores. Table 2 explains the selected features.
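Equations (17) to (21) can be traced with a few lines of Python. The sketch below makes several assumptions that are not in the text: the loss l is taken to be the binary log loss of a raw score, ɛ is treated as a single constant, the contribution values are made up, and `log_loss` and `feature_scores` are hypothetical helper names.

```python
import math


def log_loss(a, y):
    """Binary log loss of raw score `a` for label y in {0, 1} (assumed loss)."""
    # log(1 + exp(a)) - y * a, written in a numerically stable form.
    return max(a, 0.0) + math.log1p(math.exp(-abs(a))) - y * a


def feature_scores(contributions, labels, eliminated=0.0):
    """Score_j = sum_i (H_i - L_i), following equations (17) to (21)."""
    scores = [0.0] * len(contributions[0])
    for v_i, y_i in zip(contributions, labels):
        a_i = sum(v_i)                                        # equation (17)
        l_i = log_loss(a_i - eliminated, y_i)                 # equation (18)
        for j, v_ij in enumerate(v_i):
            h_i = log_loss(a_i - eliminated - v_ij, y_i)      # equation (19)
            scores[j] += h_i - l_i                            # equations (20), (21)
    return scores


# Made-up contributions for 4 borrowers and 2 features:
# feature 0 pushes the score toward the true label, feature 1 contributes nothing.
contributions = [[2.0, 0.0], [-2.0, 0.0], [1.5, 0.0], [-1.0, 0.0]]
labels = [1, 0, 1, 0]

scores = feature_scores(contributions, labels)
ranking = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
print(scores, ranking)
```

A feature whose removal raises the loss receives a positive score, so in a recursive elimination loop the features at the end of `ranking` would be the first candidates to drop.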

Table 2

Description of selected features

Feature name	Description
total_rec_int	Interest received to date.
total_rec_prncp	Principal received to date.
loan_amnt	The listed amount of the loan applied for by the borrower.
out_prncp	Remaining outstanding principal for total amount funded.
installment	The monthly payment owed by the borrower if the loan originates.
funded_amnt_inv	The total amount committed by investors for that loan at that point in time.
funded_amnt	The total amount committed to that loan at that point in time.
fico_range_high	The upper boundary range the borrower’s FICO at loan origination belongs to.
total_pymnt	Payments received to date for total amount funded.
all_util	Balance to credit limit on all trades.
annual_inc	The self-reported annual income provided by the borrower during registration.
percent_bc_gt_75	Percentage of all bankcard accounts > 75% of limit.
total_pymnt_inv	Payments received to date for portion of total amount funded by investors.
out_prncp_inv	Remaining outstanding principal for portion of total amount funded by investors.
last_fico_range_high	The upper boundary range the borrower’s last FICO pulled belongs to.
fico_range_low	The lower boundary range the borrower’s FICO at loan origination belongs to.
bc_open_to_buy	Total open to buy on revolving bankcards.
last_fico_range_low	The lower boundary range the borrower’s last FICO pulled belongs to.
Term	The number of payments on the loan. Values are in months and can be either 36 or 60.
last_pymnt_amnt	Last total payment amount received.

Table 3

Features and their corresponding scores

Feature name	Score	Feature name	Score	Feature name	Score	Feature name	Score
total_rec_int	3.89946	total_il_high_credit_limit	0.95279	total_bc_limit	0.95237	inq_last_6mths	0.94992
total_rec_prncp	3.06676	emp_length	0.95279	inq_last_12m	0.95236	purpose	0.94961
loan_amnt	2.06661	num_actv_rev_tl	0.95278	open_il_24m	0.95229	loan_status	0.9493
out_prncp	1.90708	open_rv_12m	0.95276	open_rv_24m	0.95228	inq_fi	0.94882
Installment	1.74198	tot_coll_amt	0.95276	total_bal_ex_mort	0.95224	num_tl_120dpd_2m	0.94846
funded_amnt_inv	1.71041	collections_12_mths_ex_med	0.95274	num_bc_sats	0.95221	mort_acc	0.94792
funded_amnt	1.39575	total_acc	0.95272	verification_status	0.95212	num_sats	0.94716
fico_range_high	1.03804	mo_sin_rcnt_tl	0.95271	revol_bal	0.95208	open_il_12m	0.94542
total_pymnt	0.95635	home_ownership	0.95268	num_bc_tl	0.95201	acc_open_past_24mths	0.94419
all_util	0.95287	num_tl_op_past_12m	0.95268	num_il_tl	0.95197	num_tl_90g_dpd_24m	0.94294
annual_inc	0.95287	pub_rec_bankruptcies	0.95265	mo_sin_rcnt_rev_tl_op	0.95188	policy_code	0.94179
percent_bc_gt_75	0.95287	total_cu_tl	0.95264	open_act_il	0.95181	mths_since_rcnt_il	0.94035
total_pymnt_inv	0.95287	num_rev_accts	0.95261	tot_hi_cred_lim	0.95176	delinq_amnt	0.93833
out_prncp_inv	0.95287	tot_cur_bal	0.95259	application_type	0.95164	num_tl_30dpd	0.93831
last_fico_range_high	0.95287	max_bal_bc	0.95256	delinq_2yrs	0.95143	tax_liens	0.9347
fico_range_low	0.95287	num_rev_tl_bal_gt_0	0.95254	bc_util	0.95122	chargeoff_within_12_mths	0.9265
bc_open_to_buy	0.95287	num_op_rev_tl	0.95251	mo_sin_old_rev_tl_op	0.95098	open_acc	0.91613
last_fico_range_low	0.95286	num_actv_bc_tl	0.95251	mths_since_recent_bc	0.95071	total_bal_il	0.90226
Term	0.95285	pct_tl_nvr_dlq	0.95244	total_rev_hi_lim	0.95046	num_accts_ever_120_pd	0.88753
last_pymnt_amnt	0.95283	open_acc_6m	0.95243	pub_rec	0.95023	avg_cur_bal	0.86437
mo_sin_old_il_acct	0.95282

5 Experimental results

5.1 Experimental environment and evaluation indicators

The experimental environment is based on Python 3.6.3. The processor of the physical host is Intel Core i7-6700HQ, and the operating system is 64-bit Windows 10.

In this paper, four indicators are used to evaluate the experimental results: Accuracy score, Precision score, Recall score, and F1 score. They are defined as follows:

$Accuracy = \frac{TP + TN}{TP + TN + FP + FN}$ (22)

$Precision = \frac{TP}{TP + FP}$ (23)

$Recall = \frac{TP}{TP + FN}$ (24)

$F1 = \frac{2 \times Precision \times Recall}{Precision + Recall}$ (25)

where TP (true positives) is the number of cases in which the classifier predicts an event as true and it is actually true, TN (true negatives) is the number of cases in which the classifier predicts an event as not true and it is actually not true, FP (false positives) is the number of cases in which the classifier predicts an event as true but it is actually not true, and FN (false negatives) is the number of cases in which the classifier predicts an event as not true but it is actually true.
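As a minimal sketch of Eqs. (22)-(25), the four indicators can be computed directly from the confusion-matrix counts. The function name `classification_metrics` and the counts used below are hypothetical and serve only as an illustration; they are not taken from the experiments.

```python
def classification_metrics(tp, tn, fp, fn):
    """Return Accuracy, Precision, Recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)                 # Eq. (22)
    precision = tp / (tp + fp) if (tp + fp) else 0.0           # Eq. (23)
    recall = tp / (tp + fn) if (tp + fn) else 0.0              # Eq. (24)
    f1 = (2 * precision * recall / (precision + recall)       # Eq. (25)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts, chosen only for illustration.
metrics = classification_metrics(tp=80, tn=90, fp=10, fn=20)
print(metrics)
```

The guards against zero denominators are a design choice of this sketch; Eqs. (22)-(25) themselves leave those cases undefined.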

Another evaluation indicator is the Receiver Operating Characteristic (ROC) curve, which measures classification performance by plotting the true positive rate (TPR) on the Y-axis against the false positive rate (FPR) on the X-axis. FPR and TPR are defined as follows:

$FPR = \frac{FP}{FP + TN}$ (26)

$TPR = \frac{TP}{TP + FN}$ (27)

The ROC curve of an ideal machine learning model would coincide with the Y-axis (TPR reaches 1 while FPR is still 0), but no real model achieves this. The Area Under the Curve (AUC) is defined as the area enclosed by the ROC curve and the abscissa; it is always bounded between 0 and 1 and gives an intuitive measure of the classifier's performance. The larger the AUC, the stronger the classification performance of the machine learning model.
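To make the definitions of FPR, TPR and AUC concrete, the following sketch sweeps the decision threshold over a small set of scores and integrates the resulting ROC curve with the trapezoidal rule. The labels and scores are hypothetical, and the helper names `roc_points` and `auc` are introduced only for this illustration.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs obtained by sweeping the threshold over the scores."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    points = [(0.0, 0.0)]
    # Lower the threshold one distinct score at a time, highest first.
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 0)
        points.append((fp / neg, tp / pos))   # Eqs. (26) and (27)
    return points


def auc(points):
    """Area between the ROC curve and the abscissa (trapezoidal rule)."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical labels (1 = positive class) and predicted scores.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]
curve = roc_points(y_true, scores)
area = auc(curve)
print(curve, area)
```

In this toy example one negative sample is ranked above one positive sample, so 8 of the 9 positive-negative pairs are ordered correctly and the area is 8/9.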

5.2 Fitness value curve

The fitness curve computed according to (16) is shown in Fig. 3. The fitness value decreases gradually over the 200 iterations and becomes stable after about 25 iterations, which means that SSA keeps searching for the regularization parameter during the iteration process and finally finds the optimal one. The optimal regularization parameter is then used to train the CatBoost model, and the performance of SSA-CatBoost is compared with that of other machine learning models; full details are given in the next section.

Fig. 3

Fitness value curve of SSA-CatBoost model.
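The shape of such a fitness curve can be reproduced with a deliberately simplified, one-dimensional sparrow search. The sketch below is an assumption-laden illustration, not the implementation used in this paper: the function `ssa_minimize`, the search bounds and the quadratic `toy_fitness` are hypothetical stand-ins (the real fitness is defined by (16) and requires training CatBoost), while the values Pop = 20, MaxIter = 200, PD = 0.7, SD = 0.2 and ST = 0.6 follow Table 4.

```python
import math
import random


def ssa_minimize(fitness, lo, hi, pop=20, max_iter=200,
                 pd=0.7, sd=0.2, st=0.6, seed=0):
    """Simplified 1-D sparrow search; returns (best_x, best_f, best-so-far curve)."""
    rng = random.Random(seed)
    eps = 1e-12

    def clip(v):
        return min(max(v, lo), hi)

    xs = [rng.uniform(lo, hi) for _ in range(pop)]
    fs = [fitness(x) for x in xs]
    best = min(zip(fs, xs))              # (fitness, position) of the best sparrow
    n_prod = max(1, int(pd * pop))       # number of producers
    n_aware = max(1, int(sd * pop))      # number of danger-aware sparrows
    curve = []

    for _ in range(max_iter):
        # Rank the population so that index 0 is the current best sparrow.
        ranked = sorted(zip(fs, xs))
        fs = [f for f, _ in ranked]
        xs = [x for _, x in ranked]
        worst_x = xs[-1]

        # Producers: search locally if the alarm value is below ST, else jump.
        r2 = rng.random()
        for i in range(n_prod):
            if r2 < st:
                alpha = 1.0 - rng.random()                     # in (0, 1]
                xs[i] = xs[i] * math.exp(-(i + 1) / (alpha * max_iter))
            else:
                xs[i] = xs[i] + rng.gauss(0.0, 1.0)
            xs[i] = clip(xs[i])
        prod_best = min(xs[:n_prod], key=fitness)

        # Scroungers: the worse half relocates, the rest follow the best producer.
        for i in range(n_prod, pop):
            if i + 1 > pop / 2:
                xs[i] = rng.gauss(0.0, 1.0) * math.exp((worst_x - xs[i]) / (i + 1) ** 2)
            else:
                xs[i] = prod_best + abs(xs[i] - prod_best) * rng.choice((-1.0, 1.0))
            xs[i] = clip(xs[i])

        fs = [fitness(x) for x in xs]
        best = min(best, min(zip(fs, xs)))

        # Danger-aware sparrows move towards the best position or away from the worst.
        worst_f, worst_x = max(zip(fs, xs))
        for i in rng.sample(range(pop), n_aware):
            if fs[i] > best[0]:
                xs[i] = best[1] + rng.gauss(0.0, 1.0) * abs(xs[i] - best[1])
            else:
                k = rng.uniform(-1.0, 1.0)
                # abs() keeps the denominator positive (a simplification).
                xs[i] = xs[i] + k * abs(xs[i] - worst_x) / (abs(fs[i] - worst_f) + eps)
            xs[i] = clip(xs[i])
            fs[i] = fitness(xs[i])

        best = min(best, min(zip(fs, xs)))
        curve.append(best[0])            # best-so-far fitness, as in a convergence plot

    return best[1], best[0], curve


def toy_fitness(l2_leaf_reg):
    # Hypothetical stand-in for a validation error; NOT the fitness function (16).
    return (l2_leaf_reg - 1.1) ** 2 + 0.1


best_x, best_f, curve = ssa_minimize(toy_fitness, lo=0.1, hi=10.0)
print(best_x, best_f)
```

Because `curve` stores the best fitness found so far, it can only decrease or stay flat, which is the qualitative behaviour described for Fig. 3.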

5.3 Comparison experiments

Proper parameter settings can significantly improve classification accuracy, and SSA is an advanced algorithm for parameter optimization. In this paper, SSA is therefore employed to tune the regularization parameter L2_leaf_reg of CatBoost. To test the performance of the SSA-CatBoost model, the following comparison experiments are carried out:

Case I. Comparison between SSA-CatBoost and classical models such as XGBoost, LightGBM, SVM and CatBoost. In this case, CatBoost and the other models all use default parameters, with no manual hyperparameter tuning.

Case II. Comparison between SSA-CatBoost and other classical models with optimal parameter values, such as SSA-XGBoost, SSA-LightGBM and SSA-SVM.

5.3.1 Experiment for Case I

To evaluate the performance of the SSA-CatBoost model, we compare it with other machine learning models, namely XGBoost, LightGBM, SVM and CatBoost, all with default parameters. For SSA-CatBoost, we varied the number of sparrows from 0 to 100 in steps of 10. After several experiments, we found that the number of sparrows does not significantly affect the model's classification accuracy, because the sparrows always move towards the optimal position and the entire population eventually converges to it. However, a very large population significantly increases the running time of the program. Hence, the number of sparrows is set to 20 in the following experiments. The detailed parameters of SSA-CatBoost, CatBoost and the other machine learning models are listed in Table 4.

Table 4
Main parameters of the machine learning models

Model	Parameter values
SSA-CatBoost	Iterations = 100, Learning_rate = 0.3, L2_leaf_reg = 1.10486, PD = 0.7, SD = 0.2, Score_function = NewtonL2, Task_type = GPU, Pop = 20, MaxIter = 200, ST = 0.6
CatBoost	Iterations = 1000, Learning_rate = 0.03, L2_leaf_reg = 3, Score_function = NewtonL2, Task_type = GPU
XGBoost	eta = 0.3, max_depth = 6, gamma = 0, alpha = 0, lambda = 1, min_child_weight = 1
SVM	C = 0.7849469, gamma = 0.7849469
LightGBM	learning_rate = 0.1, n_estimators = 100, min_split_gain = 0, reg_alpha = 0, reg_lambda = 0
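For readers who want to reproduce the SSA-CatBoost row of Table 4, the settings can be written as keyword arguments, assuming the lowercase parameter names of the CatBoost Python package (`iterations`, `learning_rate`, `l2_leaf_reg`, `score_function`, `task_type`). The dictionary names and the keys of `ssa_settings` are hypothetical; PD, SD, ST, Pop and MaxIter belong to the SSA search rather than to CatBoost, so they are kept separate.

```python
# CatBoost settings from the SSA-CatBoost row of Table 4.
ssa_catboost_params = {
    "iterations": 100,
    "learning_rate": 0.3,
    "l2_leaf_reg": 1.10486,        # regularization value found by SSA
    "score_function": "NewtonL2",
    "task_type": "GPU",
}

# SSA search settings from the same row (key names are illustrative).
ssa_settings = {"pop": 20, "max_iter": 200, "pd": 0.7, "sd": 0.2, "st": 0.6}

# With the catboost package installed and a GPU available, the classifier
# could be built as: CatBoostClassifier(**ssa_catboost_params)
print(ssa_catboost_params, ssa_settings)
```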

After training these models on the training set and predicting probabilities on the test set, the evaluation indicators (Accuracy, F1, Recall and Precision scores) are obtained, and Fig. 4 compares the models on these four metrics. The SSA-CatBoost model, plotted with the blue bars, achieves the highest accuracy, which shows intuitively that it has better classification performance than the other classifiers. The corresponding values are recorded in Table 5: SSA-CatBoost achieves the highest Accuracy score of 0.90302, Precision score of 0.90150, Recall score of 0.91422 and F1 score of 0.90758, higher than those of CatBoost, XGBoost and LightGBM. The numbers of correctly classified samples (TP+TN) and misclassified samples (FP+FN) also show that the proposed model performs better than the other implemented models.

Fig. 4

Comparison for the models with bar chart.

Table 5

Performance comparison results of the five models

	Accuracy Score	Precision Score	Recall Score	F1 Score	TP+TN	FP+FN
SSA-CatBoost	0.90302	0.90150	0.91422	0.90758	25591	2748
XGBoost	0.88636	0.88072	0.90533	0.89275	25119	3220
LightGBM	0.87889	0.87455	0.89451	0.88386	24907	3432
CatBoost	0.89602	0.89027	0.89431	0.89174	25392	2947
SVM	0.53237	0.47339	0.87456	0.61019	15020	13319

In addition, Fig. 5 plots the ROC curves of the applied models. The SVM classifier has the lowest AUC value, 0.49522, while SSA-CatBoost achieves the highest AUC value, 0.97224, which means that the proposed SSA-CatBoost model performs better than the other classifiers.

Fig. 5

ROC curve for the models.

5.3.2 Experiment for Case II

To demonstrate the effectiveness of SSA-CatBoost, we construct SSA-XGBoost, SSA-LightGBM and SSA-SVM by applying SSA to XGBoost, LightGBM and SVM, and compare these hybrid models. As in Section 5.3.1, the accuracy of the four models (SSA-CatBoost, SSA-XGBoost, SSA-LightGBM and SSA-SVM) is shown in Fig. 6; the accuracy of XGBoost, LightGBM and SVM improves after optimization by SSA. Moreover, the SSA-CatBoost model still achieves the highest Accuracy, F1, Recall and Precision scores. Detailed values of the evaluation indicators are given in Table 6: SSA-CatBoost achieves an Accuracy score of 0.90302, an F1 score of 0.90758, a Recall score of 0.91422 and a Precision score of 0.90150.

Fig. 6

Bar chart for the models.

Table 6

Model performance comparison

	Accuracy Score	Precision Score	Recall Score	F1 Score	TP+TN	FP+FN
SSA-CatBoost	0.90302	0.9015	0.91422	0.90758	25591	2748
SSA-XGBoost	0.8916	0.88705	0.90782	0.89714	25267	3072
SSA-LightGBM	0.8793	0.87297	0.8979	0.88487	24918	3421
SSA-SVM	0.56601	0.365224	0.62501	0.45985	16040	12299
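As a quick arithmetic check of Table 6, the Accuracy score of each model can be recomputed from its TP+TN and FP+FN columns using Eq. (22). The sketch below only copies the reported values from the table; the variable names are illustrative.

```python
# (TP+TN, FP+FN, reported Accuracy score) copied from Table 6.
table6 = {
    "SSA-CatBoost": (25591, 2748, 0.90302),
    "SSA-XGBoost": (25267, 3072, 0.8916),
    "SSA-LightGBM": (24918, 3421, 0.8793),
    "SSA-SVM": (16040, 12299, 0.56601),
}

recomputed = {}
for model, (correct, wrong, reported) in table6.items():
    # Eq. (22): correctly classified samples divided by all samples.
    recomputed[model] = correct / (correct + wrong)
    print(f"{model}: recomputed={recomputed[model]:.5f}, reported={reported}")
```

Every row sums to the same test-set size of 28339 samples, and the recomputed accuracies agree with the reported ones up to rounding in the fifth decimal place.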

Note that the prediction accuracies of the SSA-CatBoost and SSA-XGBoost models do not differ much, but the AUC values of the two models are clearly different. This is because accuracy only measures the model's ability to classify samples, whereas AUC better measures the comprehensive performance of the model. Therefore, when comparing the performance of machine learning models, more attention should be paid to the AUC values. The ROC curves in Fig. 7 illustrate the AUC values of these models. Compared with Fig. 5, the machine learning models optimized by SSA show an improvement. The SSA-SVM classifier has the lowest AUC value, 0.5, while SSA-CatBoost achieves the highest AUC value, 0.97224, which means that SSA-CatBoost outperforms the others. In addition, the AUC score of SSA-XGBoost implies that its performance is better than that of LightGBM but inferior to that of SSA-CatBoost.

Fig. 7

ROC curve for the models.
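The difference between accuracy and AUC discussed above can be illustrated with a toy example in which two classifiers reach the same accuracy at a threshold of 0.5 but rank the samples very differently. The labels, scores and helper names below are hypothetical; AUC is computed here as the fraction of (positive, negative) pairs that are ranked correctly, which is equivalent to the area under the ROC curve.

```python
def accuracy(y_true, scores, threshold=0.5):
    """Share of samples classified correctly at a fixed threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, y_true)) / len(y_true)


def auc_pairwise(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels and scores of two classifiers.
y_true = [1, 1, 1, 0, 0, 0]
model_a = [0.9, 0.8, 0.4, 0.6, 0.2, 0.1]
model_b = [0.9, 0.55, 0.3, 0.95, 0.4, 0.2]

acc_a, acc_b = accuracy(y_true, model_a), accuracy(y_true, model_b)
auc_a, auc_b = auc_pairwise(y_true, model_a), auc_pairwise(y_true, model_b)
print(acc_a, acc_b, auc_a, auc_b)
```

Both toy classifiers misclassify two of six samples, yet the first orders 8 of 9 pairs correctly and the second only 5 of 9, so their AUC values differ although their accuracies coincide.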

5.4 The explainability of the SSA-CatBoost model

The interpretable machine learning package SHAP is used to explain the classification results of the SSA-CatBoost model, as described below.

The summary plot in Fig. 8 illustrates the importance of every feature to the model output. The figure ranks the 20 features by importance: the feature total_rec_int contains the most information and has the largest impact on the classification output, followed by the feature total_rec_prncp.

Fig. 8

The importance of every feature to the output.

Figure 9 summarizes the SHAP values of every feature. Each dot in a feature's row represents one sample, that is, one borrower in the dataset. The X-axis shows the SHAP value, which is the positive or negative impact of the feature on the model's prediction: a positive SHAP value (right side of the X-axis) increases the probability that the sample is considered a bad one, and a negative SHAP value makes the sample more likely to be a good one [30]. The Y-axis lists the features, ordered by their average absolute SHAP value. The color of each dot encodes the feature value: the colormap bar on the right side of Fig. 9 moves from pink to yellow as the feature value increases (yellow is high, pink is low). When high feature values lie on the positive side of the X-axis, the feature value is positively correlated with the SHAP value; the higher the feature value, the more likely the borrower is to be considered a bad one, and vice versa. For instance, for the first feature, total_rec_int, the color changes from yellow to pink as the feature value decreases, while the SHAP value increases from -8 to 6. The feature value and the SHAP value are therefore negatively correlated: the higher the feature value, the smaller the SHAP value and the higher the probability of a good borrower. In contrast, for the fifth feature, loan_amnt, the higher the feature value, the higher the SHAP value; this positive correlation makes a sample with a high feature value more likely to be considered a bad borrower.

Fig. 9

SHAP values for every feature.
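To illustrate how the sign of a SHAP value shifts a prediction, the sketch below adds per-feature SHAP values to a base value and maps the sum to a probability with the logistic function. It assumes that the SHAP values are expressed on the log-odds scale of the "bad" class, which is the usual case for gradient-boosted classifiers but is not stated explicitly in this paper; the base value and the SHAP values are hypothetical.

```python
import math


def probability_bad(base_value, shap_values):
    """Logistic function of the base value plus the sum of the SHAP values."""
    log_odds = base_value + sum(shap_values.values())
    return 1.0 / (1.0 + math.exp(-log_odds))


base_value = 0.0  # hypothetical average model output (log-odds)

# Hypothetical SHAP values of one borrower for two features.
borrower = {"total_rec_int": -2.0, "loan_amnt": 0.5}
p = probability_bad(base_value, borrower)

# The same borrower with a larger positive contribution from loan_amnt.
riskier = {"total_rec_int": -2.0, "loan_amnt": 1.5}
p_riskier = probability_bad(base_value, riskier)
print(p, p_riskier)
```

The negative contribution of total_rec_int pulls the probability of a bad borrower below 0.5, while a larger positive contribution of loan_amnt pushes it back up.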

The dependence plot of total_rec_int is shown in Fig. 10, where each dot represents a borrower. The feature values of total_rec_int and funded_amnt_inv are negatively correlated with the SHAP value: when the feature value is close to 0, the SHAP value is positive and the borrower is more likely to be predicted as a bad one, as shown by the pink dots at the beginning of the X-axis in Fig. 10. Conversely, as the feature values of total_rec_int and funded_amnt_inv increase, the dots turn from pink to yellow and the SHAP value gradually decreases until it becomes negative, as shown by the yellow and pink dots with negative SHAP values in Fig. 10. An extremely negative SHAP value leads the model to classify these borrowers as good without hesitation.

Fig. 10

Dependence plot of feature total_rec_int.

Similar to Fig. 10, Fig. 11 shows the relation between the feature value of loan_amnt and its SHAP value. As expected, there is a positive correlation between the feature values and the SHAP values. A borrower is more likely to be predicted as a bad one when loan_amnt and funded_amnt_inv are high, as shown by the yellow dots in Fig. 11. On the contrary, when the dots turn pink (at the beginning of the X-axis) and the feature value is close to 0, the negative SHAP value leads the model to predict these borrowers as good without hesitation.

Fig. 11

Dependence plot of feature loan_amnt.
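The direction of the relations read from Figs. 10 and 11 can be quantified with a Pearson correlation coefficient between feature values and SHAP values. The sketch below uses hypothetical (feature value, SHAP value) pairs that merely mimic the trends described above; they are not data from the experiments, and the helper `pearson` is introduced only for this illustration.

```python
def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


# Hypothetical feature values and SHAP values, for illustration only.
total_rec_int = [0, 500, 1000, 2000, 4000]
shap_total_rec_int = [5.0, 2.0, 0.5, -3.0, -7.0]
loan_amnt = [1000, 5000, 10000, 20000, 35000]
shap_loan_amnt = [-0.8, -0.3, 0.1, 0.6, 1.2]

r_int = pearson(total_rec_int, shap_total_rec_int)   # negative relation
r_loan = pearson(loan_amnt, shap_loan_amnt)          # positive relation
print(r_int, r_loan)
```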

6 Conclusion

In this paper, the main contributions are listed as follows:

A novel machine learning model called SSA-CatBoost is proposed, which combines SSA and CatBoost to improve classification accuracy. The SSA-CatBoost model optimizes the regularization parameter via SSA and uses the optimized parameter to enhance the accuracy of CatBoost.

In addition, to further improve accuracy when dealing with large-scale data, the Recursive Feature Elimination algorithm is used to eliminate useless features and reduce the impact of noisy data on the results; the 20 features with the highest scores are selected for our research.

After feature selection, the dataset from US Lending Club is divided into eight subsets to build the training model. The ROC curve and the Accuracy, Precision, Recall and F1 scores are then used to compare the prediction results of SSA-CatBoost with those of other machine learning models, such as XGBoost and LightGBM.

To understand why the SSA-CatBoost model predicts a sample as good or bad, the interpretable SHAP package is employed to illustrate how every feature relates to the model output.

The experimental results show that the SSA-CatBoost model obtains higher classification and prediction accuracy than XGBoost and the other machine learning models. This means that SSA is capable of finding the optimal CatBoost parameters and supports the conclusion that the proposed model is a useful classification method. However, this work still has two limitations. Firstly, the proposed model takes too long to run: nearly one day is needed to classify and predict. Secondly, the final prediction accuracy is only 90.3%, which leaves considerable room for improvement. Our future work will therefore optimize the SSA-CatBoost algorithm to improve its computational efficiency and try other swarm intelligence optimization algorithms to improve the accuracy of the model's classification and prediction.

Acknowledgments

We acknowledge the financial support from the National Natural Science Foundation of China (No. 72261028, 71761029), the Project of Collaborative Innovation Center on Research of China-Mongolia-Russia Economic Corridor (No. DS20210010), and the Project of Regional Digital Economy and Governance Research Center Project (No. szzl2022002). In addition, we thank Lending Club for providing the data for our academic research.

References

1. Prokhorenkova, Gusev, Vorobev, Dorogush A.V. and Gulin, CatBoost: unbiased boosting with categorical features (2017), arXiv preprint arXiv:1706.09516.

2. Dorogush A.V., Ershov and Gulin, CatBoost: gradient boosting with categorical features support (2018), arXiv preprint arXiv:1810.11363.

3. Cestnik, Estimating probabilities: A crucial task in machine learning, In Proc. 9th European Conference on Artificial Intelligence (1990).

4. Xue and Shen, A novel swarm intelligence optimization approach: sparrow search algorithm, Systems Science & Control Engineering 8(1) (2020), 22–34. doi: 10.1080/21642583.2019.1708830.

5. Li, Huang and Zheng, Research on Credit Risk of P2P Lending Based on CatBoost Algorithm (2019). doi: https://doi.org/10.12677/fin.2019.93015.

6. Izotova and Valiullin, Comparison of Poisson process and machine learning algorithms approach for credit card fraud detection, Procedia Computer Science 186 (2021), 721–726. doi: https://doi.org/10.1016/j.procs.2021.04.214.

7. Jabeur S.B., Gharib, Mefteh-Wali and Arfi W.B., CatBoost model and artificial intelligence techniques for corporate failure prediction, Technological Forecasting and Social Change 166 (2021), 120658. doi: https://doi.org/10.1016/j.techfore.2021.120658.

8. Ibrahim A.A., Ridwan R.L., Muhammed M.M., Abdulaziz R.O. and Saheed G.A., Comparison of the CatBoost classifier with other machine learning methods, International Journal of Advanced Computer Science and Applications 11(11) (2020). doi: 10.14569/ijacsa.2020.0111190.

9. Al Daoud, Comparison between XGBoost, LightGBM and CatBoost using a home credit dataset, International Journal of Computer and Information Engineering 13(1) (2019), 6–10. doi: https://doi.org/10.5281/zenodo.3607805.

10. Kang, Jang, Im, Kwon and Kim, Developing a new hourly forest fire risk index based on catboost in South Korea, Applied Sciences 10(22) (2020), 8213. doi: https://doi.org/10.3390/app10228213.

11. Postnikov E.B., Esmedljaeva D.A. and Lavrova A.I., A CatBoost machine learning for prognosis of pathogen’s drug resistance in pulmonary tuberculosis. In 2020 IEEE 2nd Global Conference on Life Sciences and Technologies (LifeTech), IEEE (2020), 86–87. doi: 10.1109/LifeTech48969.2020.1570619054.

12. Hancock J.T. and Khoshgoftaar T.M., CatBoost for big data: an interdisciplinary review, Journal of Big Data 7(1) (2020), 1–45. doi: 10.1186/s40537-020-00369-8.

13. Dong, Zeng, Wu, Lei, Chen, Srivastava A.K. and Gaiser, Estimating the Pan Evaporation in Northwest China by Coupling CatBoost with Bat Algorithm, Water 13(3) (2021), 256. doi: https://doi.org/10.3390/w13030256.

14. Huang C.L. and Dun J.F., A distributed PSO–SVM hybrid system with feature selection and parameter optimization, Applied Soft Computing 8(4) (2008), 1381–1391. doi: https://doi.org/10.1016/j.asoc.2007.10.007.

15. Barman and Choudhury N.B.D., A similarity based hybrid GWO-SVM method of power system load forecasting for regional special event days in anomalous load situations in Assam, India, Sustainable Cities and Society 61 (2020), 102311. doi: https://doi.org/10.1016/j.scs.2020.102311.

16. Sarafrazi and Nezamabadi-pour, Facing the classification of binary problems with a GSA-SVM hybrid system, Mathematical and Computer Modelling 57(1-2) (2013), 270–278. doi: https://doi.org/10.1016/j.mcm.2011.06.048.

17. Wei, Ni, Liu, Chen, Wang, Li, Cui and Ye, An improved grey wolf optimization strategy enhanced SVM and its application in predicting the second major, Mathematical Problems in Engineering (2017), 1–12. doi: https://doi.org/10.1155/2017/9316713.

18. Dong, Zheng, Huang, Pan and Liu, Time-shift multi-scale weighted permutation entropy and GWO-SVM based fault diagnosis approach for rolling bearing, Entropy 21(6) (2019), 621. doi: https://doi.org/10.3390/e21060621.

19. Avalos, GSA for machine learning problems: A comprehensive overview, Applied Mathematical Modelling 92 (2021), 261–280. doi: https://doi.org/10.1016/j.apm.2020.11.013.

20. Song, Yan, Ding, Gao and Lu, A steel property optimization model based on the XGBoost algorithm and improved PSO, Computational Materials Science 174 (2020), 109472. doi: https://doi.org/10.1016/j.commatsci.2019.109472.

21. Lucay F.A., Cisternas L.A. and Galvez E.D., An LS-SVM classifier based methodology for avoiding unwanted responses in processes under uncertainties, Computers & Chemical Engineering 138 (2020), 106860. doi: https://doi.org/10.1016/j.compchemeng.2020.106860.

22. Yan, Mu, Yi, Yang and Chen, Fault diagnosis of wind turbine based on PCA and GSA-SVM. In Prognostics and System Health Management Conference (PHM-Paris), IEEE (2019), 13–17. doi: 10.1109/PHM-Paris.2019.00010.

23. , Jiang, Qin, Zheng, Social December. Fault diagnosis of wind turbine based on PCA and GSA-SVM. In 2019 prognostics and system health management conference (phm-Paris). IEEE (2019), 13–17. doi: 10.1007/978-3-030-68851-6_28.

24. Li, An and Li, A chaos embedded GSA-SVM hybrid system for classification, Neural Computing and Applications 26(3) (2015), 713–721. doi: https://doi.org/10.1007/s00521-014-1757-z.

25. Cestnik, Estimating probabilities: A crucial task in machine learning, In Proc. 9th European Conference on Artificial Intelligence (1990).

26. Massaoudi, Refaat S.S., Abu-Rub, Chihi and Wesleti F.S., A hybrid Bayesian ridge regression-CWT-catboost model for PV power forecasting. In 2020 IEEE Kansas Power and Energy Conference (KPEC), IEEE (2020), 1–5. doi: 10.1109/KPEC47870.2020.9167596.

27. Zhang, Fleyeh and Bales, A hybrid model based on bidirectional long short-term memory neural network and Catboost for short-term electricity spot price forecasting, Journal of the Operational Research Society (2020), 1–25. doi: https://doi.org/10.1080/01605682.2020.1843976.

28. CatBoost, Score functions, Types of score functions. [online], Available: https://catboost.ai/en/docs/concepts/algorithm-score-functions#types-of-score-functions.

29. Rodriguez-Galiano V.F., Luque-Espinar J.A., Chica-Olmo and Mendes M.P., Feature selection approaches for predictive modelling of groundwater nitrate pollution: An evaluation of filters, embedded and wrapper methods, Science of the Total Environment 624 (2018), 661–672. doi: https://doi.org/10.1016/j.scitotenv.2017.12.152.

30. Torrent N.L., Visani and Bagli, PSD2 Explainable AI Model for Credit Scoring (2020), arXiv preprint arXiv:2011.10367.

31. Wang and Ku, Utilizing historical data for corporate credit rating assessment, Expert Systems with Applications 165 (2021), 113925. doi: https://doi.org/10.1016/j.eswa.2020.113925.