Application research of credit fraud detection based on distributed rotation deep forest

Abstract

Credit fraud is a common financial crime that causes significant economic losses to financial institutions. To address this issue, researchers have proposed various fraud detection methods. Recently, research on deep forests has opened up a new path for exploring deep models beyond neural networks. Deep forest combines the features of neural networks and ensemble learning, and has achieved good results in various fields. This paper studies the application of deep forests to fraud detection and proposes a distributed dense rotation deep forest algorithm (DRDF-spark) based on an improved RotBoost. The model has three main characteristics. First, by introducing RotBoost, it compensates for the fact that multi-granularity scanning cannot be applied to data that lack spatial correlation. Second, Spark is used for parallel construction to improve the speed and efficiency of data processing. Third, a pre-aggregation mechanism is added to the distributed algorithm to locally aggregate the statistical results of sub-forests on the same node in advance, which improves communication efficiency. The experiments show that DRDF-spark performs better than deep forest and some mainstream ensemble learning algorithms on the fraud dataset used in this paper, and that training is up to 3.53 times faster. Furthermore, if the number of nodes is further increased, the speedup ratio will continue to increase.

Keywords

Deep forest; credit fraud detection; ensemble learning; RotBoost; Spark

1. Introduction

Credit fraud detection is a crucial area in the financial industry, aimed at identifying and preventing individuals attempting to deceive financial institutions. These fraudulent activities include applying for fake loans, credit card fraud, identity theft, and more. Credit fraud detection can help financial institutions detect fraudulent behavior in a timely manner, reducing risks and losses. Traditional methods of credit fraud detection rely mainly on rules or statistical models, which require expert knowledge and have limited model accuracy. With the development of machine learning and deep learning technologies, more and more research is exploring the use of machine learning and deep learning models for credit fraud detection, which can improve the predictive performance of models by learning features from large amounts of data [1].

Inspired by deep neural networks and ensemble learning, Zhou et al. [2] proposed a deep learning model called deep forest or gcForest, which is an ensemble method of forests. Compared to deep neural networks, it does not use complex backpropagation algorithms and avoids the problem of tedious hyperparameter tuning. The introduction of gcForest also opened a new door for the construction of non-differentiable deep models. Currently, deep forest has been widely applied in different fields and has achieved good results. Specifically, deep forest has also been applied in the field of credit fraud detection. In 2019, Huang et al. [3] found a serious customer financial cash-out fraud problem in Ant Financial’s online credit financial company. They collaborated with Zhou’s team to optimize and improve the model for the dataset and implemented it in a distributed manner on the Kunpeng system to address the online cash-out fraud detection in the face of massive data, achieving good results.

Although deep forest performs well in the field of credit fraud detection, it also has some shortcomings. Firstly, when trained on data with spatial or temporal correlation, deep forest uses multi-granularity scanning layers to extract as much of the correlation between features as possible, which increases sample diversity and improves model performance. However, credit fraud data has no temporal or spatial relationship, making it impossible to use multi-granularity scanning layers to increase sample diversity. Additionally, as the amount of data continues to increase in the big data era, traditional single-machine machine learning algorithms face problems such as long processing times, limited computing and storage capacity, and low scalability when dealing with large-scale data. The distributed fraud detection algorithm of Huang et al. is implemented on the Kunpeng system and is therefore not generally applicable.

In order to maintain the advantages of deep forest in credit fraud classification tasks, we propose a dense rotation deep forest algorithm based on the improved RotBoost algorithm, called DRDF, to improve classification accuracy and stability and reduce the risk of financial platforms. Additionally, to adapt the DRDF algorithm to fraud detection scenarios in large data environments, we propose DRDF-spark, a Spark-based version of the algorithm that addresses the limited computing and storage capabilities and poor scalability of single-machine algorithms. DRDF-spark reduces the training time of the algorithm and makes it more versatile and applicable to scenarios with different data scales. The main contributions of this paper can be summarized as follows:

(1)
Since financial transaction data is tabular and has no logical relationships in time or space, multi-granularity scanning layers cannot be used to improve sample diversity. Therefore, this paper uses RotBoost, which is composed of rotation forest and AdaBoost, as the base classifier of the model to make up for the lack of a multi-granularity scanning structure. The core idea of the rotation forest is to use principal component analysis (PCA) to extract features from the training set of each base classifier, construct diversified classifiers, and maintain accuracy by retaining all principal components. In addition, RotBoost sets weights for samples, allowing different samples to contribute differently to the model during training.
(2)
Due to the long training time and high single-machine computing resource requirements of DRDF when applied to fraud detection classification tasks in large-scale data scenarios, a parallel deep rotation forest algorithm called DRDF-spark is proposed. First, in the process of constructing rotation matrices, rotation forest requires random class instance sampling, bootstrap sampling, and PCA calculation for each feature subset, which is very suitable for combining with distributed computing. Secondly, in order to find a balance between parallelism and communication overhead, DRDF-spark no longer constructs a unique rotation matrix for each rotation tree, but uses a single sub-forest to make the model more suitable for parallel construction.
(3)
Finally, to further reduce the network communication transmission consumption during the model’s parallel construction process, a pre-aggregation mechanism is introduced to perform local aggregation of the statistical results of sub-forests in the same node in advance, thereby reducing the network data transmission between nodes.

The structural arrangement of this article is as follows. Section 2 introduces related work. Section 3 presents the specific algorithm implementation of DRDF-spark. In Section 4, the performance of the model is validated through experimental results. Finally, Section 5 summarizes the algorithm and proposes suggestions for future research.

2. Related work

2.1 Credit fraud detection

Credit fraud detection is an important means to prevent credit risks and is one of the most important applications in the financial industry. In recent years, domestic and foreign researchers have conducted in-depth research on credit fraud detection models through large amounts of data and machine learning technologies, which are widely used in banks, internet finance platforms, and online payment companies, among others.

Supervised learning, unsupervised learning, and association analysis are effective tools to solve credit fraud problems, widely used in data preprocessing, feature extraction, and model training. Supervised learning mainly uses algorithms such as decision trees, random forests, support vector machines, neural networks, and Bayesian classifiers, usually evaluated with cross-validation, to model labeled positive and negative samples and predict whether unknown data exhibits fraudulent behavior. Unsupervised learning mainly uses algorithms such as clustering, anomaly detection, and density estimation to discover potential fraudulent behavior through data analysis and model establishment. Association analysis mainly uses graph data structures to discover the relationships between nodes and detect possible group fraudulent behavior [4]. The research on credit fraud detection has a long history, dating back to the late 1990s and early 21st century. In the 1960s, researchers used statistical methods to identify fraudulent behavior. With the development of computer technology, fraud detection began to use computer programs to identify fraudulent behavior. At the same time, the emergence of artificial intelligence technology has also provided more possibilities for fraud detection.

In 2012, Brown et al. [5] compared the performance of several classification algorithms on imbalanced datasets in the field of credit scoring and found that Random Forest performed best in both precision and sensitivity, making it more suitable for detecting credit fraud. In 2018, Roy et al. [6] proposed a classifier model based on multilayer perceptrons and autoencoders, which outperformed traditional algorithms in terms of performance and could effectively detect credit card fraud. In 2019, Monika et al. [7] proposed using heterogeneous ensemble learning methods for two-stage consumer credit risk modeling, which performed better than a single algorithm and could improve the accuracy of credit risk management and evaluation. Feng et al. [8] also verified the performance advantages of ensemble learning over single models and proposed a dynamic weighted ensemble classification model that can effectively improve the prediction accuracy of credit scoring. In 2023, Srivastava et al. [4] proposed a fraud detection method based on graph analysis, which effectively detected fraudulent behavior in distributed graph databases.

In recent years, more and more researchers have been using decision tree-based ensemble algorithms to handle binary classification problems. In 2017, Zhou et al. proposed Deep Forest, a decision tree-based ensemble method that uses a multi-granularity scanning layer to extract logical relationships in time or space from samples using a sliding window mechanism and a cascade layer to enhance feature processing layer by layer. In 2019, Ant Financial and Zhou’s team collaborated to optimize Deep Forest for imbalanced datasets, improving the accuracy of detecting cash-out fraud transactions [9]. In 2021, Huang et al. [10] proposed an improved Deep Forest model for identifying fraudulent online transactions, improving the security of transactions. In 2022, Wang Xiaoxiao et al. studied a credit risk assessment method for P2P online lending borrowers based on Deep Forest, which improved the accuracy and stability of borrower credit evaluation by collecting data and using Deep Forest algorithm for feature extraction and model training.

2.2 Deep forest

Deep Forest is a deep non-neural network model that is based on the ideas of random forests and deep learning. It consists of two important parts: the multi-grained scanning layer and the cascade layer. The multi-grained scanning layer is located at the beginning of the model and is used to process datasets with spatial or temporal relationships to improve the model’s performance. The multi-grained scanning layer introduces a sliding window scheme similar to convolutional neural networks, using three different window sizes to slide over the original data with a certain stride to extract features and increase sample diversity. Figure 1a illustrates the process of the multi-grained scanning layer in processing the MNIST dataset with spatial correlation. Suppose the original input consists of 60,000 images, each of which is a 28 $\times$ 28 panel with 784 image pixels. Three window sizes are used for multi-grained scanning: 7 $\times$ 7, 10 $\times$ 10, and 13 $\times$ 13, respectively. One of them, the 7 $\times$ 7 window, will generate 121 feature vectors (i.e., 121 7 $\times$ 7 panels). Instances that are extracted from windows of the same size will be used to train a random forest and a completely-random tree forest. Class vectors will be produced for each forest, and then concatenated to form transformed features for training in cascaded layers.
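The window counts in this example can be checked with a short calculation. The sketch below assumes square inputs, no padding, and the usual sliding-window formula; the stride value is an assumption, since the text only mentions "a certain stride". Under these assumptions, the figure of 121 vectors for the 7 $\times$ 7 window is consistent with a stride of 2, whereas a stride of 1 would give 484. The function names are illustrative only.

```python
def windows_per_axis(side: int, window: int, stride: int = 1) -> int:
    """Number of window positions along one axis, without padding."""
    if window > side:
        return 0
    return (side - window) // stride + 1


def num_patches(side: int, window: int, stride: int = 1) -> int:
    """Number of square patches extracted from a side x side image."""
    per_axis = windows_per_axis(side, window, stride)
    return per_axis * per_axis


# Patch counts for a 28 x 28 image and the three window sizes in the text.
for window in (7, 10, 13):
    print(window, num_patches(28, window, stride=1), num_patches(28, window, stride=2))
```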

The cascade layer is inspired by the layer-by-layer processing of deep neural networks, where each layer of the cascade receives the feature information processed by the previous layer and outputs its processing results to the next level. However, each level of the cascade is an ensemble of decision tree forests, i.e., an ensemble of ensembles. Additionally, the model uses different types of forests to encourage diversity, two random forests and two completely-random tree forests, respectively. Figure 1b illustrates the process of the cascade layer structure. After multi-grained scanning, the generated feature vectors arrive at the entrance of the cascade layer. The feature vectors pass through each layer and produce the estimated class distribution, forming a class probability vector. The class probability vector is concatenated with the original input feature vector as an enhanced feature and becomes the input feature vector for the next layer. With the cascaded layer being trained layer by layer, the model’s learning ability is improved as the number of layers increases. Additionally, to reduce the risk of overfitting, the model evaluates the accuracy of its predictions through k-fold cross-validation after each layer of the cascade layer. If the evaluation criteria meet the pre-set termination condition, the model will terminate the training process early.

Figure 1.

The overall procedure of Deep Forest trained using MNIST datasets.

Deep forest has opened a new door for building non-differentiable deep models and provided new ideas for solving classification problems. Compared to deep neural networks, deep forest does not require tuning a large number of hyperparameters, and the model consists of a set of tree-based classifiers, each of which can be viewed as a series of decision rules. In contrast, the decision process of deep neural networks is more difficult to explain. In addition, deep neural networks require a large amount of training data to avoid overfitting, while deep forest can achieve good classification performance even on small training sets. Currently, deep forest has been widely applied in various fields and has achieved good results. Guo et al. [11] improved deep forest and applied it to cancer subtype classification on small-scale biological datasets. Zhang et al. implemented deep forest in the Kunpeng system and applied it to credit card fraud detection. Gao et al. [12] proposed an improved version of deep forest called IMDF and applied it to imbalanced data. Yang et al. [13] extended deep forest to multi-label learning, and then Wang et al. [14] applied multi-label learning to the medical field and achieved good results. Wang et al. [15] proposed weakly labeled deep forest, which used the transitivity of cascade forest to improve performance layer by layer, achieving excellent results in the field of weak label learning. He et al. [16] proposed the Mondrian deep forest model, an improvement of deep forest that supports incremental learning. It can gradually learn and update models from data streams and is efficient for data stream processing and online learning.

2.3 Rotation forest

Rotation Forest is an ensemble algorithm that is based on the improvement of Random Forest. This algorithm focuses on improving the accuracy and diversity of base classifiers by utilizing the idea of feature transformation. The experimental results on 33 selected datasets from the UCI Machine Learning Repository showed that Rotation Forest outperforms the standard Random Forest algorithm to a large extent [17]. Consider a training set $\mathcal{D}=\{(\bm{X}_{i},Y_{i})\}_{i=1}^{N}$ with $N$ samples, where $\bm{X}_{i}=(x_{i1},x_{i2},\ldots,x_{ip})$ is the $i$ -th sample and $p$ is the number of feature dimensions. $Y_{i}$ is the label of the $i$ -th sample, with $Y_{i}\in\{1,2,\ldots,c\}$ , where $c$ is the number of categories in the dataset. The base classifiers in the forest are defined as $C_{t}(t=1,2,\ldots,T)$ , where $T$ is the total number of base classifiers. To encourage diversity, the input features of $C_{t}$ are obtained by applying principal component analysis (PCA) to the original features to produce spatial transformations. During the PCA process, all principal components are retained so that no useful information is lost, which preserves both model prediction accuracy and diversity. The following steps can be taken to construct the base classifier $C_{t}$ :

(1)
Randomly divide the feature space $\mathbb{P}$ of the training set into $K$ disjoint subfeature spaces $\mathbb{P}_{t,k}$ , where $k\in\{1,2,\ldots,\mathrm{K}\}$ , each of which contains $m=\frac{p}{K}$ features.
(2)
Use bootstrap sampling to randomly select 75% of the samples from the training set and select the corresponding $m$ features from each subfeature space $\mathbb{P}_{t,k}$ to form each subset.
(3)
Use PCA to obtain the eigenvectors $v^{t,k}=(v_{1}^{t,k},v_{2}^{t,k},\ldots,v_{m}^{t,k})$ of the covariance matrix for each subset and define a coefficient matrix using the following formula:

$\displaystyle R_{t}=\left[\begin{array}{cccc}v_{1}^{t,1},\ldots,v_{m}^{t,1}&\mathbf{0}&\cdots&\mathbf{0}\\ \mathbf{0}&v_{1}^{t,2},\ldots,v_{m}^{t,2}&\cdots&\mathbf{0}\\ \vdots&\vdots&\ddots&\vdots\\ \mathbf{0}&\mathbf{0}&\cdots&v_{1}^{t,K},\ldots,v_{m}^{t,K}\end{array}\right]$ (1)
(4)
Rearrange the columns of the coefficient matrix $R_{t}$ to obtain the final rotation matrix $R_{t}^{\prime}$ , and obtain the training set $\mathcal{D}_{t}=\mathcal{D}R_{t}^{\prime}$ .
(5)
Finally, train the classifier $C_{t}$ using the transformed training set $\mathcal{D}_{t}$ .
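The five steps above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes $p$ is divisible by $K$, computes PCA as the eigendecomposition of the sample covariance matrix, and omits the random class-subset selection that some rotation forest variants apply before bootstrapping. The function name `build_rotation_matrix` and its parameters are hypothetical.

```python
import numpy as np


def build_rotation_matrix(X, K, sample_ratio=0.75, rng=None):
    """Return a p x p rotation matrix built from K disjoint feature subsets."""
    rng = np.random.default_rng(rng)
    n, p = X.shape
    if p % K != 0:
        raise ValueError("this sketch assumes p is divisible by K")
    # Step 1: random disjoint feature subsets, each with m = p / K features.
    subsets = np.array_split(rng.permutation(p), K)
    R = np.zeros((p, p))
    for cols in subsets:
        # Step 2: bootstrap 75% of the samples, restricted to this subset.
        rows = rng.choice(n, size=int(sample_ratio * n), replace=True)
        sub = X[np.ix_(rows, cols)]
        # Step 3: PCA as eigenvectors of the covariance matrix; all are kept.
        cov = np.atleast_2d(np.cov(sub, rowvar=False))
        _, vecs = np.linalg.eigh(cov)
        # Step 4: writing the block at the original feature indices builds
        # R_t and rearranges it to the original feature order in one step.
        R[np.ix_(cols, cols)] = vecs
    return R


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))        # synthetic data, for illustration only
R = build_rotation_matrix(X, K=3, rng=1)
X_rot = X @ R                        # Step 5 would train C_t on X_rot
```

Because every block is orthogonal and the blocks act on disjoint features, the resulting matrix is itself orthogonal, so the transformation rotates the data without discarding information.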

2.4 RotBoost

Zhang et al. combined the ideas of rotation forest and another successful strong classifier algorithm, AdaBoost, and named the resulting algorithm RotBoost [18]. Similar to AdaBoost, RotBoost focuses on the samples misclassified in previous iterations and updates the weight distribution of the training data during the training process. Starting from the rotation forest construction, each sample of the reconstructed training set $\mathcal{D}_{t}$ is given an equal initial weight $W_{t,1}(i)=\frac{1}{N}(i=1,2,3\ldots N)$ . Using the weighted distribution of $\mathcal{D}_{t}$ , we train a weak classifier $C_{t,s}$ , where $s\in\{1,2,\ldots,S\}$ is the iteration number of the AdaBoost algorithm. Afterwards, we compute the error rate of classifier $C_{t,s}$ :

$\displaystyle\varepsilon_{t,s}=\mathrm{P}(C_{t,s}(\mathbf{x}_{i})\neq y_{i})=\sum_{i=1}^{N}I(C_{t,s}(\mathbf{x}_{i})\neq y_{i})W_{t,s}(i)$ (2)

The coefficient $\alpha_{t,s}$ of the weak classifier $C_{t,s}$ in the final classifier can be calculated based on the error rate, which indicates the importance of $C_{t,s}$ . As the error rate decreases, the contribution of the classifier increases. The coefficient can be calculated using the following formula:

$\displaystyle\alpha_{t,s}=\frac{1}{2}\ln\left(\frac{1-\varepsilon_{t,s}}{\varepsilon_{t,s}}\right)$ (3)

In the subsequent iterations, the weight distribution of the training set is adjusted by increasing the weights of the instances that were misclassified by previously trained classifiers and decreasing the weights of the correctly classified instances. This way, the subsequently trained classifiers can better predict these harder-to-classify instances. The weight distribution is updated using the following formula, where $Z_{s}$ is a normalization factor.

$\displaystyle W_{t,s+1}(i)=\frac{W_{t,s}(i)}{Z_{s}}\times\left\{\begin{array}{ll}e^{-\alpha_{t,s}},&\text{if }y_{i}=C_{t,s}(\mathbf{x}_{i})\\ e^{\alpha_{t,s}},&\text{if }y_{i}\neq C_{t,s}(\mathbf{x}_{i})\end{array}\right.$ (4)
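A small worked example may help make Eqs. (2) to (4) concrete. The sketch below applies one boosting round to five equally weighted samples, one of which is misclassified; the labels and predictions are invented for illustration, and the function name `adaboost_step` is hypothetical.

```python
import math


def adaboost_step(weights, y_true, y_pred):
    """One boosting round following Eqs. (2)-(4).

    Returns (error rate, classifier coefficient, updated weights).
    """
    miss = [yt != yp for yt, yp in zip(y_true, y_pred)]
    eps = sum(w for w, m in zip(weights, miss) if m)            # Eq. (2)
    if not 0.0 < eps < 1.0:
        raise ValueError("alpha is undefined for an error rate of 0 or 1")
    alpha = 0.5 * math.log((1.0 - eps) / eps)                   # Eq. (3)
    # Raise the weight of misclassified samples, lower the others.
    unnorm = [w * math.exp(alpha if m else -alpha) for w, m in zip(weights, miss)]
    z = sum(unnorm)                                             # Z_s
    return eps, alpha, [u / z for u in unnorm]                  # Eq. (4)


y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 0, 1]          # the second sample is misclassified
w0 = [1 / 5] * 5                  # equal initial weights 1/N
eps, alpha, w1 = adaboost_step(w0, y_true, y_pred)
print(eps, alpha, w1)
```

Here the error rate is 0.2 and the coefficient is $\frac{1}{2}\ln 4=\ln 2$ . After normalization, the single misclassified sample carries half of the total weight, which is the standard AdaBoost property that the next weak classifier sees misclassified and correctly classified samples with equal total weight.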

The ensemble model can enhance the representational power and generalization performance of individual models. Zhang et al. demonstrated through experiments that RotBoost not only reduces the bias and variance of individual trees, but also generates lower prediction error than Rotation Forest and AdaBoost.

Figure 2.

The flowchart of RotBoost-ICA training process. Suppose the input dataset contains N instances with p features. There are T decision trees, and within each decision tree, the rotation matrix ${R}_{t}^{\prime}$ is constructed using bootstrap sampling and ICA; the rotation matrix is used to perform a linear transformation on the original dataset, resulting in a new dataset ${D}_{t}$ . The bootstrap proportion is 75%, and the number of features extracted is $m=2$ .

3. The proposed approach

3.1 Single layer structure of cascade layer

Each layer of the cascade in the deep forest model is an ensemble of random forests and completely-random tree forests, which are both ensemble learning algorithms that integrate multiple decision trees. The difference is that completely-random tree forests are more random in the process of building the forest, and exhibit better generalization ability. When training data with spatial or sequential structures, deep forest utilizes a multi-granularity scanning structure to handle the relationships between features and encourage diversity. However, credit fraud datasets lack spatial or temporal structural relationships, which makes the multi-granularity scanning layer inapplicable and thus fails to fully leverage the superior performance of deep forest.

To address the aforementioned issues, DRDF uses RotBoost as a component of the cascade layer to rebuild the training set for each decision tree in the ensemble. To some extent, this takes over the role of the multi-granularity scanning structure in enriching sample diversity and enhances the model’s representation learning ability. Diversity is crucial for model building, and deep forest promotes it by introducing completely-random tree forests into the model. Similarly, in DRDF, RotBoost-ICA is introduced to increase the diversity of the model, and each layer of the cascade is ultimately an ensemble of RotBoost and RotBoost-ICA. In the process of constructing rotation matrices with RotBoost-ICA, ICA is used instead of PCA for feature transformation of data, with the aim of building accurate and diverse classifiers. The introduction of ICA also gives the model further advantages. First, it can better identify the concentration of data in the dimensional space. Second, it can find a basis that is not necessarily orthogonal, which may reconstruct the data better than PCA in the presence of noise. Figure 2 is the training flowchart of RotBoost-ICA. In order to better illustrate the process, Algorithm 3.1 shows the pseudocode of RotBoost-ICA.

Algorithm 3.1: RotBoost-ICA

Input:

• $\mathcal{D}$ : the training data set, $\mathcal{D}=\{(\bm{X}_{i},Y_{i})\}_{i=1}^{N}$ , in which $\bm{X}_{i}=(x_{i1},x_{i2},\ldots,x_{ip})$ and $Y_{i}\in\{1,2,\ldots,c\}$ , $p$ is the dimension of the feature and $c$ is the number of categories.
• $T$ : the number of trees in the classifier
• $S$ : the number of iterations of AdaBoost
• $K$ : the number of split sub-datasets
• $\mathbf{x}$ : a data point to be classified

Output:

• the class label for $\mathbf{x}$ predicted by the final ensemble classifier $C^{*}$ as: $C^{*}(\mathbf{x})=\underset{y\in\Phi}{\operatorname{argmax}}\sum_{t=1}^{T}I(C_{t}(\mathbf{x})=y)$ , where $\Phi$ denotes the set of class labels

Process:

For $t=1,2,3,\ldots,T$ :
1. Randomly split the feature space ${P}$ into $K$ disjoint feature subsets ${P}_{t,k}$ , each of which contains $m=\frac{p}{K}$ features.
2. For $k=1,2,3,\ldots,K$ :
   (a) Obtain the corresponding subset matrix ${X}_{t,k}$ based on the feature space ${P}_{t,k}$ .
   (b) Use the bootstrap algorithm to extract 75% of samples from ${X}_{t,k}$ to obtain $X_{t,k}^{\prime}$ .
   (c) Calculate the component vectors $v^{t,k}$ of $X_{t,k}^{\prime}$ using ICA, and fill them into the coefficient matrix $R_{t}$ in Eq. (1).
3. Rearrange the columns of $R_{t}$ to obtain the rotation matrix ${R}_{t}^{\prime}$ .
4. Compute the training set $\mathcal{D}_{t}$ for the $t$ -th decision tree as $\mathcal{D}_{t}=\mathcal{D}R_{t}^{\prime}$ .
5. Initialize the weight distribution $W_{t,1}$ of the training set $\mathcal{D}_{t}$ as $W_{t,1}(i)=\frac{1}{N}(i=1,2,3,\ldots N)$ .
6. For $s=1,2,3,\ldots,S$ :
   (a) Train the classifier $C_{t,s}$ with the weighted $\mathcal{D}_{t}$ and calculate the error rate $\varepsilon_{t,s}$ of $C_{t,s}$ using Eq. (2).
   (b) If $\varepsilon_{t,s}$ is greater than 0.5 or equal to 0, terminate the iteration process.
   (c) Calculate the coefficient $\alpha_{t,s}$ of $C_{t,s}$ using Eq. (3).
   (d) Update the weight distribution of $\mathcal{D}_{t}$ from $W_{t,s}$ to $W_{t,s+1}$ using Eq. (4).
7. Calculate $C_{t}(\mathbf{x})=\underset{y\in\Phi}{\operatorname{argmax}}\sum_{s=1}^{S}\alpha_{t,s}I(C_{t,s}(\mathbf{x})=y)$
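The two voting rules in the pseudocode, the coefficient-weighted vote inside one boosted tree (step 7) and the unweighted majority vote across the $T$ trees (the output), can be sketched as follows. The function names and the numeric values are hypothetical and chosen only for illustration; ties are broken in favor of the label seen first, which is an arbitrary choice of this sketch.

```python
from collections import defaultdict


def weighted_vote(predictions, alphas):
    """Step 7: argmax over labels of the summed coefficients alpha_{t,s}."""
    scores = defaultdict(float)
    for label, alpha in zip(predictions, alphas):
        scores[label] += alpha
    # max() returns the first label among ties (insertion order).
    return max(scores, key=scores.get)


def majority_vote(predictions):
    """Output rule: every tree C_t casts one vote of equal weight."""
    return weighted_vote(predictions, [1.0] * len(predictions))


# Three weak classifiers of one tree: the single accurate classifier
# (large alpha) outvotes the two weaker ones.
tree_label = weighted_vote([1, 0, 0], [1.2, 0.4, 0.5])

# Five trees of the ensemble vote on the final label.
final_label = majority_vote([1, 1, 0, 1, 0])
print(tree_label, final_label)
```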

RotBoost integrates the ideas of rotation forest and the strong classifier AdaBoost. In the construction of each decision tree, the rotation forest divides the features of the original dataset into several subsets and draws a bootstrap sample for each of them. After performing principal component analysis on each subset, a rotation matrix can be obtained, which is used to linearly transform the original dataset. Because the linear transformation applied to the training set differs from one decision tree to another, the diversity of the model is ensured. At the same time, this compensates for the inability to apply multi-granularity scanning layers. In addition, the AdaBoost algorithm introduces the idea of weights, updating the weights of the samples so that previously misclassified samples receive more attention in later iterations. The introduction of weights allows different samples to make different contributions to the model during the training process, a property that deep forest has not exploited.

Figure 3.
The overall structure of DRDF. Suppose there are two classes to predict, and the raw features are 14-dimensional. P.RotBoost (black) represents the original RotBoost algorithm that uses the PCA version, and I.RotBoost (red) represents RotBoost-ICA, which is the improved version that uses ICA.

3.2 Overall architecture of DRDF

Figure 3 shows the overall architecture of DRDF. Each layer of DRDF is an ensemble of RotBoost and RotBoost-ICA, and each classifier consists of 50 decision trees. For simplicity, P.RotBoost represents the original RotBoost algorithm and I.RotBoost represents the RotBoost-ICA algorithm. Assume the original input consists of N 14-dimensional data samples. For a binary classification task, each layer of DRDF generates N 4-dimensional class vectors, since each of the two classifiers outputs a 2-dimensional class distribution. These 4-dimensional class vectors are then concatenated with the input vectors of the same layer as enhanced features, forming 18-dimensional feature vectors that serve as the input of the next layer. As the number of cascaded layers increases, the dimension of the feature vector grows linearly until the last layer, at which point training is complete. After passing through the last layer of the model, each test sample obtains a 4-dimensional class vector. The class probabilities output by the classifiers are then averaged, and the class with the maximum average probability is taken as the final prediction result. In addition, the output class probability vectors of each layer are obtained through 5-fold cross-validation, and the model evaluates its predictive performance after each layer. If the performance does not improve for three consecutive layers, the training process is terminated early. Therefore, the complexity of the model can be determined adaptively. The algorithm pseudo-code for this process is shown in Algorithm 3.2.
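The dimension bookkeeping and the early-stopping rule described above can be illustrated with a few lines of Python. This is a sketch under two assumptions that the text does not spell out: each layer contains two classifiers (one RotBoost and one RotBoost-ICA), and "does not improve" means no strict improvement over the best validation score seen so far. The function names and the validation scores are invented for illustration.

```python
def layer_input_dims(raw_dim, n_classifiers, n_classes, n_layers):
    """Input dimension of each layer when class vectors are concatenated."""
    per_layer = n_classifiers * n_classes      # e.g. 2 * 2 = 4 in the text
    return [raw_dim + layer * per_layer for layer in range(n_layers)]


def stopping_layer(scores, patience=3):
    """Number of layers trained before early stopping is triggered.

    `scores` holds one cross-validated score per layer, in training order.
    """
    best = float("-inf")
    since_best = 0
    for depth, score in enumerate(scores, start=1):
        if score > best:
            best, since_best = score, 0
        else:
            since_best += 1
            if since_best >= patience:
                return depth
    return len(scores)


print(layer_input_dims(raw_dim=14, n_classifiers=2, n_classes=2, n_layers=4))
print(stopping_layer([0.90, 0.92, 0.93, 0.93, 0.92, 0.93, 0.95]))
```

With 14 raw features the layer inputs have 14, 18, 22, and 26 dimensions. In the second call, the score peaks at the third layer and fails to improve for the next three, so training stops after the sixth layer and the seventh score is never reached.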

Algorithm 3.2: DRDF

Input:

• ${I}_{0}$ : the initial training data set, ${I}_{0}=\{(\bm{X}_{i},Y_{i})\}_{i=1}^{N}$ , in which $\bm{X}_{i}=(x_{i1},x_{i2},\ldots,x_{ip})$ and $Y_{i}\in\{1,2,\ldots,c\}$ , $p$ is the dimension of the feature and $c$ is the number of categories.
• ${H}_{0}$ : the initial test data set.
• $l$ : the index of a layer in the DRDF, $l=(1,2,3,\ldots,L)$ .
• $F$ : the number of RotBoost/RotBoost-ICA classifiers at each layer.
• $T$ : the number of trees in each RotBoost classifier.
• $K$ : the number of cross-validation folds.
• $O_{(l,f,t)}$ : the class probability vector of the $t$ -th decision tree in the $f$ -th classifier in the $l$ -th layer.

Output:

• $P_{\textit{final}}$ : the final result.

Training Process:

1. Load the initial training data set ${I}_{0}$ .
2. For $l=1,2,3,\ldots,L$ :
   2.1. For $f=1,2,3,\ldots,F$ :
      (a) Use $I_{l-1}$ to train RotBoost/RotBoost-ICA by K-fold cross-validation.
      (b) Store the trained RotBoost model $M_{l,f}$ for the $f$ -th classifier in the $l$ -th layer.
      (c) Obtain the estimated class distribution at the $l$ -th layer as: $O_{(l,f)}=T^{-1}\sum_{t=1}^{T}O_{(l,f,t)}$
   2.2. Construct $I_{l}$ by using the equation: $I_{l}=[I_{l-1},O_{(l,1)},O_{(l,2)},\ldots,O_{(l,f)},\ldots,O_{(l,F)}]$

Test Process:

1. Load the initial test data set ${H}_{0}$ .
2. For $l=1,2,3,\ldots,L$ :
   2.1. For $f=1,2,3,\ldots,F$ :
      (a) Load the trained model $M_{l,f}$ .
      (b) Use $H_{l-1}$ to predict the probabilistic features by K-fold cross-validation.
   2.2. Define $H_{l}$ as $H_{l}=[H_{l-1},O_{(l,1)},O_{(l,2)},\dots,O_{(l,f)},\dots,O_{(l,F)}]$
3. Calculate the final prediction result as: $P_{\textit{final}}=\underset{j\in\{1,\ldots,c\}}{\operatorname{argmax}}\;F^{-1}\sum_{f=1}^{F}O_{j}^{(L,f)}$ , where $O_{j}^{(L,f)}$ is the $j$ -th component of the class vector output by the $f$ -th classifier in the last layer $L$ .
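The final prediction step of the test process, averaging the class vectors of the $F$ classifiers in the last layer and choosing the most probable class, can be sketched as below. The function name and the probability values are hypothetical, and class indices are 0-based here purely for coding convenience, whereas the paper numbers classes from 1.

```python
def final_prediction(class_vectors):
    """Average F class-probability vectors and return (label, averages).

    `class_vectors` holds one length-c probability vector per classifier
    of the last layer, all for the same test sample.
    """
    n_classifiers = len(class_vectors)
    n_classes = len(class_vectors[0])
    averages = [
        sum(vector[j] for vector in class_vectors) / n_classifiers
        for j in range(n_classes)
    ]
    # Index of the largest average probability (first one wins on ties).
    label = max(range(n_classes), key=averages.__getitem__)
    return label, averages


# One RotBoost and one RotBoost-ICA output for a single binary sample.
label, averages = final_prediction([[0.7, 0.3], [0.4, 0.6]])
print(label, averages)
```

In this example the two classifiers disagree, and the averaged distribution of roughly (0.55, 0.45) resolves the disagreement in favor of the first class.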

3.3 Parallel algorithm of DRDF based on Spark

In the big data era, the continuous increase in data volume means that traditional single-machine machine learning algorithms face problems such as long processing time, limited computing and storage capabilities, and low scalability when dealing with large-scale data. Therefore, distributed computing has become an important means of processing large-scale data. Spark, as a popular distributed computing framework, has the advantages of high performance, high scalability, and strong fault tolerance, and has been widely used in the fields of big data processing and machine learning. In this context, in order to make DRDF more universal, this section proposes a distributed deep rotation forest algorithm based on Spark, DRDF-spark, and applies it to the fraud detection scenario, aiming to improve the speed and efficiency of data processing.

Because the construction process of rotation forest is well suited to distributed computing, a distributed deep rotation forest algorithm based on Spark is proposed. A pre-aggregation mechanism is added to perform local aggregation of the statistical results of sub-forests in the same node in advance, thereby reducing network data transmission between nodes and improving communication efficiency. In the improved cascade layer structure, each rotation forest base classifier needs to perform multiple PCA calculations to obtain the rotation matrix and then rotate the training and test sets with it. These operations bring diversity to the model but also increase the runtime.

Based on the above observations, this section designs a parallel deep rotation forest model, which exploits some characteristics of the rotation forest construction process to improve the efficiency of parallel computation. As can be seen from Section 2.3, in the process of constructing the rotation matrix, each feature subset of the rotation forest needs to perform operations such as random class instance extraction, bootstrap sampling, and PCA calculation. Since these operations are independent across feature subsets, the construction process is well suited to distributed computing. Figure 4 shows the algorithm flow of parallel construction of the rotation matrix in the deep rotation forest model.

At the entrance of the model, there is a training set $\mathcal{D}=\{(X_{i},Y_{i})\}_{i=1}^{N}$ , where $Y_{i}$ is the label of the sample. Because the loan data pose a binary classification problem, $Y_{i}\in\{0,1\}$ . The feature dimension is $p$ and the number of instances is $N$ . The parallel training is divided into three stages. In the first stage, random category sampling is performed on the data after the training set has been partitioned according to the feature space. The partition of the feature space is also parallel, but it is not part of the rotation matrix construction and will be described in detail later. During this stage, the categories are selected in parallel: each subset of the feature space selects its own category instances and removes the instances of the other categories. In the second stage, each subset performs bootstrap sampling on its instances simultaneously, with the sampling ratio generally set to 75%. In the third and most important parallel stage, PCA is applied in parallel to each subset, and each processed subset retains all of its principal components, so that the model pursues higher accuracy while improving diversity. The retained principal components are then re-ordered to form the rotation matrix. In the Spark MLlib framework, the PCA algorithm is implemented with a parallel singular value decomposition (SVD) algorithm, which divides the dataset into several small data blocks, calculates their covariance matrices in parallel on each node, sends the results to the master node for accumulation, and finally performs eigenvalue decomposition on the master node to obtain the reduced dataset [19]. This distributed implementation improves the computational efficiency and scalability of the PCA algorithm and is suitable for processing large-scale data. This paper uses the PCA algorithm in the MLlib module, which makes it convenient to implement distributed deep rotation forests [20].

Figure 4.

Flowchart of the parallel rotation matrix construction algorithm. Suppose the input dataset contains $N$ instances with $p$ features, and randomly divide the feature space $\mathbb{P}$ of the training set into $K$ disjoint sub-feature spaces $\mathbb{P}_{k}$ , where $k\in\{1,2,\ldots,K\}$ , each of which contains $m=\frac{p}{K}$ features.

According to Fig. 4, multiple rotation matrices need to be constructed within each rotation forest. The rotation forest can be seen as a collection of rotation trees, each of which requires a unique rotation matrix, so each tree in the forest represents a parallel task. When the number of rotation trees in a rotation forest is $T$ and the number of forests in the cascade is $F$ , the parallelism of the cascade layer is $F*T$ , which is a relatively high degree of parallelism. In the Spark framework, a large number of tasks would be launched to process these parallel tasks simultaneously. However, since the construction of a single tree is short, launching the tasks becomes very expensive, and the time spent on task start-up can even exceed the time spent on task execution. Moreover, a large number of tasks must communicate over the network to merge their intermediate results into the final result, causing additional communication overhead [21]. Such a degree of parallelism is therefore poorly suited to the Spark framework.

To address these issues, DRDF-spark no longer builds a unique rotation matrix for each rotation tree but instead builds one for each sub-forest. As shown in Fig. 5, parallel training divides each rotation forest into $S$ sub-forests and constructs one rotation matrix per sub-forest. The decision trees within a sub-forest can share the same rotation matrix because, as in a random forest, each tree still draws its own random sample of instances with replacement, so diversity among the trees is preserved. Therefore, sharing rotation matrices does not affect the results. The parallelism of the cascade layer is then no longer $F*T$ but is reduced to $F*S$ , where $S$ is an adjustable parameter of the model. This reduces the additional overhead of network communication and enables the model to find a balance between parallelism and communication costs.

Assuming that each level of the cascade layer contains 4 rotation forests, each consisting of 100 decision trees, the parallelism of the cascade layer before improvement is 400. As mentioned above, higher parallelism is not always better, and sometimes it can have a negative effect. In the Spark framework, if there are many tasks, more resources can be utilized, and higher parallelism is better. However, if the number of tasks is too high and machine resources are insufficient, the machine will execute tasks in batches and only proceed to the next batch after completing the previous one, resulting in a decrease in parallel efficiency due to the time required to start and stop tasks. After improvement, the introduction of parameter $S$ , assuming $S$ equals 10, allows each rotation forest to have 10 sub-forests that can be trained in parallel, and the cascade layer can construct 40 sub-forests simultaneously, reducing the parallelism from 400 to 40 and effectively reducing communication and resource overhead.
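The task-count arithmetic above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the DRDF-spark implementation: the function names `split_into_subforests` and `cascade_parallelism` are hypothetical, and the even split of trees across sub-forests is an assumption.

```python
def split_into_subforests(num_trees, num_subforests):
    """Assign tree indices 0..num_trees-1 to sub-forests as evenly as possible.

    Hypothetical helper: the even split is an assumption for illustration.
    """
    base, extra = divmod(num_trees, num_subforests)
    subforests, start = [], 0
    for s in range(num_subforests):
        size = base + (1 if s < extra else 0)
        subforests.append(list(range(start, start + size)))
        start += size
    return subforests


def cascade_parallelism(num_forests, units_per_forest):
    """Number of parallel tasks (one rotation matrix each) in one cascade level."""
    return num_forests * units_per_forest


# Values from the example in the text: 4 forests, 100 trees each, S = 10.
F, T, S = 4, 100, 10
per_tree_tasks = cascade_parallelism(F, T)       # one rotation matrix per tree
per_subforest_tasks = cascade_parallelism(F, S)  # one rotation matrix per sub-forest
subforests = split_into_subforests(T, S)

print(per_tree_tasks, per_subforest_tasks, [len(sf) for sf in subforests])
```

With these values each sub-forest holds 10 trees that share one rotation matrix, so $S$ directly trades the number of launched tasks against the amount of work done per task.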

Algorithm 4 presents the pseudocode for the distributed algorithm of the cascade layer, showing the workflow of the l-th level of the cascade layer. The input of Algorithm 4 is the training set $D$ . Within the cascade layer, the rotation forests and the sub-forests within them are computed in parallel, as is the construction of the rotation matrices after the feature space is partitioned. Subsequently, the estimated values of the sub-forests are combined, and the result $V_{l}$ returned by Algorithm 4 is the estimated class distribution of the l-th level of the cascade layer.

$\displaystyle V_{l}=\text{concatenate}\left(S^{-1}\sum_{s=1}^{S}V_{1,s},S^{-1}\sum_{s=1}^{S}V_{2,s},\ldots,S^{-1}\sum_{s=1}^{S}V_{f,s}\right)$ (5)

Algorithm 4: Cascade Layer Distributed Algorithm

• $\mathcal{D}$ : the objects in the training data set, $\mathcal{D}=\{(\bm{X}_{\bm{i}},Y_{i})\}_{i=1}^{N}$ , in which $\bm{X}_{\bm{i}}=(x_{i1},x_{i2},\ldots,x_{ip})$ and $Y_{i}\in\{1,2,\ldots,c\}$ , $p$ is the dimension of the feature and $c$ is the number of categories.

• $F$ : the number of forests in the cascade layer

• $T$ : the number of trees in the rotation forest

• $S$ : the number of sub-forests into which each rotation forest is cut, equal to the number of rotation matrices

• $K$ : the number of split sub-datasets

• ${R}_{s}^{\prime}$ : the rotation matrix of the s-th sub-forest

• $V_{l}$ : the estimated class distribution of the l-th layer of the cascade layer

Process:

for $s=1,2,3,\ldots,S$ do (parallel computing)

    Randomly split the feature space ${P}$ into $K$ disjoint feature subsets ${P}_{s,k}$ .

    for $k=1,2,3,\ldots,K$ do (parallel computing)

        (a) Obtain the submatrix ${X}_{s,k}$ corresponding to the attributes in ${P}_{s,k}$

        (b) Randomly sample class instances ${X}_{s,k}^{\prime}$

        (c) Use the bootstrap algorithm to randomly draw 25% of the samples from ${X}_{s,k}^{\prime}$ to get ${X}_{s,k}^{\prime\prime}$

        (d) Calculate the eigenvectors $v^{s,k}$ of the covariance matrix of $X_{s,k}^{\prime\prime}$ using PCA, and fill in the coefficient matrix $R_{s}$ in Eq. (1)

    end for

    Construct the rotation matrix ${R}_{s}^{\prime}$ by rearranging the columns of ${R}_{s}$

    Calculate the training set $D_{s}=DR_{s}^{\prime}$ of the s-th sub-forest

    Initialize the s-th sub-forest model $M_{s}$ , and use the training set $D_{s}$ to train $M_{s}$

    Get the estimated class distribution $V_{s}$ of the sub-forest $M_{s}$

end for

The estimated class distribution produced by the l-th layer of the cascade layer is calculated by:

$V_{l}=\text{concatenate}\left(S^{-1}\sum_{s=1}^{S}V_{1,s},S^{-1}\sum_{s=1}^{S}V_{2,s},\ldots,S^{-1}\sum_{s=1}^{S}V_{f,s}\right)$

Figure 5.

The Flowchart of parallel construction rotation forest algorithm. Suppose there are S sub-forests, where ${R}_{s}$ represents the rotation matrix of the $s$ -th sub-forest and ${V}_{s}$ represents the estimated class distribution of the s-th sub-forest.

The estimated class distribution of the s-th sub-forest in the f-th rotation forest in the cascade layer is denoted as $V_{f,s}=\frac{S}{T}\sum_{i=1}^{t_{s}}p_{i}$ . Therefore, the formula for calculating the estimated class distribution of the f-th rotation forest is as follows:

$\displaystyle V_{f}=S^{-1}\sum_{s=1}^{S}V_{f,s}=S^{-1}\sum_{s=1}^{S}\frac{S}{T}\sum_{i=1}^{t_{s}}p_{i}=T^{-1}\sum_{s=1}^{S}\sum_{i=1}^{t_{s}}p_{i}$ (6)

Where $t_{s}$ is the number of trees in the s-th subforest, and $p_{i}$ is the estimated class distribution of the $i$ -th tree in the corresponding subforest. Then, the output of the l-th layer will be concatenated with the initial vector to form the input vector of the ( $l+1$ )-th layer, which will be transmitted to the next layer of the cascaded layer for training. The model repeats this process until the performance of the subsequent 3 layers in the cascaded layer does not improve, and then the training process will be terminated early.
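The identity in Eq. (6) can be checked numerically. The following is a minimal sketch with randomly generated per-tree class distributions; the sub-forest sizes, the seed, and all variable names are assumptions chosen for illustration only.

```python
import random

random.seed(0)
T, S, num_classes = 12, 3, 2


def random_distribution(c):
    """Return a random probability vector over c classes (stand-in for a tree's output)."""
    raw = [random.random() for _ in range(c)]
    total = sum(raw)
    return [x / total for x in raw]


def vector_sum(vectors):
    """Element-wise sum of a list of equally long vectors."""
    return [sum(col) for col in zip(*vectors)]


# p_i for all T trees of one rotation forest.
tree_probs = [random_distribution(num_classes) for _ in range(T)]

# Hypothetical sub-forest sizes t_s; they only need to sum to T.
sizes = [5, 4, 3]
subforests, start = [], 0
for t_s in sizes:
    subforests.append(tree_probs[start:start + t_s])
    start += t_s

# V_{f,s} = (S / T) * sum of the tree distributions in sub-forest s.
V_fs = [[S / T * x for x in vector_sum(sf)] for sf in subforests]

# Left-hand side of Eq. (6): average of the S sub-forest estimates.
V_f_from_subforests = [x / S for x in vector_sum(V_fs)]

# Right-hand side of Eq. (6): average over all T trees.
V_f_from_trees = [x / T for x in vector_sum(tree_probs)]

print(V_f_from_subforests, V_f_from_trees)
```

Both vectors agree up to floating-point rounding, which is why the forest-level estimate does not depend on how the trees are grouped into sub-forests.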

3.4 Pre-aggregation mechanism

When constructing the forest in a single-machine environment, the probability vectors of all decision trees are aggregated in memory to obtain the final result. In a distributed environment, however, the forest is divided into multiple sub-forests that are distributed across different nodes of the cluster. After the class vectors of all sub-forests are computed, each node sends its results directly to the host, which merges the intermediate results. When the intermediate result data is large, network communication may become a performance bottleneck [22]. Although the parallelism design already balances parallelism against communication costs, this section introduces a pre-aggregation mechanism to further improve the network communication efficiency of the distributed deep rotation forest. The mechanism locally aggregates the statistical results of the sub-forests on the same node before the results of all sub-forests in a forest are aggregated on the host, which reduces the network data transmission between nodes.

Figure 6.

The parallel computing pre-aggregation process flowchart.

Figure 6 shows the flow chart of a forest with the pre-aggregation mechanism added to the parallel process. In the cascade layer, assume that the rotation forest is divided into five sub-forests, i.e., the dataset $D$ is rotated five times to obtain $D_{1}$ to $D_{5}$ , and that the current forest is allocated to three worker nodes. The sub-forests are allocated according to the current cluster resource situation: sub-forests 2 and 5 are assigned to node 1, sub-forest 1 is assigned to node 2, and sub-forests 3 and 4 are assigned to node 3. After each sub-forest is trained, it generates its own result class vector. Before optimization, all five sub-forests send their results to the host for merging, so five class vectors are transmitted over the network. With the pre-aggregation mechanism, the results of sub-forests 2 and 5 are first merged in memory because they reside on the same node, and similar local aggregation is performed on the other nodes. Since each result vector produced by a sub-forest occupies little memory, performing pre-aggregation in memory is entirely feasible. After pre-aggregation, only three class vectors are transmitted over the network and merged, which reduces the workload of the host and improves the network communication efficiency.
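The five-sub-forest example can be mimicked in plain Python to show the reduction in transmitted vectors. This is a sketch under assumptions, not the Spark implementation: the class vectors are made-up values for a single sample, the function names are hypothetical, and sending a (partial sum, count) pair per node is one assumed way to keep the merged average exact.

```python
# Sub-forest id -> worker node, following the example in the text.
assignment = {1: "node2", 2: "node1", 3: "node3", 4: "node3", 5: "node1"}

# Hypothetical class vectors produced by each sub-forest for one sample.
class_vectors = {
    1: [0.9, 0.1],
    2: [0.7, 0.3],
    3: [0.6, 0.4],
    4: [0.8, 0.2],
    5: [0.5, 0.5],
}


def merge_without_preaggregation(vectors):
    """Every sub-forest sends its vector to the host: one message per sub-forest."""
    messages = list(vectors.values())
    n = len(messages)
    merged = [sum(col) / n for col in zip(*messages)]
    return merged, len(messages)


def merge_with_preaggregation(vectors, placement):
    """Each node first sums its local vectors, then sends (partial_sum, count)."""
    partial = {}
    for sub_id, vec in vectors.items():
        node = placement[sub_id]
        if node not in partial:
            partial[node] = ([0.0] * len(vec), 0)
        total, count = partial[node]
        partial[node] = ([a + b for a, b in zip(total, vec)], count + 1)
    messages = list(partial.values())  # one message per node
    n = sum(count for _, count in messages)
    merged = [sum(col) / n for col in zip(*(total for total, _ in messages))]
    return merged, len(messages)


direct, direct_messages = merge_without_preaggregation(class_vectors)
pre, pre_messages = merge_with_preaggregation(class_vectors, assignment)
print(direct_messages, pre_messages, direct, pre)
```

The merged class vector is the same in both cases, while the number of vectors crossing the network drops from five to three.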

4. Experiments

4.1 The experiments of DRDF

The experiment uses two datasets, Credit Approval and Lending Club, to evaluate the performance of DRDF. Lending Club is an online lending platform based in the United States that provides unsecured personal loans between $1,000 and $40,000 to borrowers. The platform evaluates applicants based on their personal information, personal credit situation, loan amount, loan purpose, and other factors before deciding whether to approve the loan. The loan approval decision involves two types of risks. Firstly, if an applicant is likely to repay the loan but the loan is not approved, the company loses business and suffers financial losses. Secondly, if an applicant engages in credit fraud and is unlikely to repay the loan, approving the loan could result in the applicant defaulting and lead to financial losses for the company. The Credit Approval dataset contains data on credit card applications. Due to the high confidentiality and privacy of bank data, all feature names and values have been replaced with meaningless symbols to protect the data. The processed data contains 690 samples and 15 features. The Lending Club dataset is loan data from Lending Club from 2016 to 2017, containing information about past loan applicants and whether or not they defaulted. Its purpose is to determine whether an applicant is at risk of default, which staff can use as a reference for rejecting loans, reducing loan amounts, or lending to risky applicants at higher interest rates. The loan data contains over 20 million records, including both approved and rejected applications. Only the approved application data is used in the experiment, with a total of 759,338 records and 72 features.

The performance of DRDF was validated on the preprocessed Lending Club and Credit Approval datasets, and compared with deep forest and some mainstream machine learning algorithms. The DF21 algorithm is a deep forest open-source library introduced by the Zhou Zhihua team in February 2021, which optimized and encapsulated gcForest. In addition, for ease of expression, random forest is abbreviated as RF. The experiment was run on a Windows 10 operating system using PyCharm Professional version 2022.3, with an Intel Core i7-8700 processor, 6 cores, and 16GB of RAM.

The experiment compared DRDF with common machine learning algorithms. Accuracy evaluates the overall classification performance of a classifier and is widely used to evaluate machine learning classification algorithms. However, since the dataset used in this experiment exhibits class imbalance, evaluating a model solely by classification accuracy is insufficient. Therefore, the experiment used accuracy, precision, recall, $F_{1}$ score, the ROC curve, and the AUC value to evaluate the performance of the algorithms from different aspects. These indicators may not directly reflect the economic impact of each model, but even slight improvements in them, at the level of decimal places, are important because they can avert significant economic losses. Since the general indicators are well known, only the ROC and AUC indicators are briefly introduced here.

(1) ROC Curve (receiver operating characteristic curve): a curve that plots the TPR (true positive rate) against the FPR (false positive rate) by continuously adjusting the classifier’s classification threshold, to evaluate the classification performance of a model. TPR is the recall rate, and FPR represents the proportion of negative samples that are misclassified. The closer the curve is to the upper left corner, the better the performance of the model.
(2) AUC (Area Under Curve): usually refers to the area below the ROC curve in a two-dimensional coordinate system where TPR is on the y-axis and FPR is on the x-axis, used to evaluate the classification performance of a model. The AUC value ranges from 0 to 1, where a value closer to 1 indicates better model performance. It also represents the probability that a randomly chosen positive sample will be ranked higher than a randomly chosen negative sample.
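The two readings of AUC given above, area under the ROC curve and probability that a positive sample outranks a negative one, can be compared on a toy example. The sketch below uses only the Python standard library; the labels and scores are invented for illustration, the function names are hypothetical, and ties are counted as half a win, which is a common convention rather than something stated in this paper.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) points obtained by lowering the threshold over the distinct scores."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / neg, tp / pos))
    return points


def auc_trapezoid(points):
    """Area under the piecewise-linear ROC curve (trapezoidal rule)."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


def auc_pairwise(labels, scores):
    """Fraction of (positive, negative) pairs in which the positive scores higher."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data: 1 = default (positive class), 0 = normal.
labels = [1, 0, 1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]

area = auc_trapezoid(roc_points(labels, scores))
rank_probability = auc_pairwise(labels, scores)
print(area, rank_probability)
```

On this toy data 11 of the 16 positive-negative pairs are ranked correctly, and the area under the ROC curve gives the same value.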

In the experiment, the original dataset was split into a training set (70% of the data) for model training and a testing set (30% of the data) for evaluating the model’s classification performance. Since the original dataset was imbalanced with a large difference in the number of normal and default samples, the experiment chose default samples as positive samples. Due to the small sample size of the Credit Approval dataset and the strong randomness of the results obtained in a single run, the experiment took the average of 20 runs as the final result.

The experiment compared MLP, SVM, Logistic, RF, RotBoost, gcForest, DF21, and the proposed DRDF algorithm on the original dataset using five metrics: accuracy, precision, recall, $F_{1}$ score, and AUC. Since the data was not image data and had no temporal or spatial logic relationship, CNN deep neural networks were not applicable, and MLP was used as the comparison classifier. In addition, the experiment skipped the multi-granularity scanning layer of the deep forest and directly used the cascade layer for training. The configuration of MLP was adjusted as suggested by Zhou and Feng.

Table 1
The performance of each algorithm on the Credit Approval dataset (%)

Classifier	Accuracy	Precision	Recall	F1-Score	AUC
MLP	75.29	75.48	75.31	75.10	82.43
SVM (linear)	82.61	84.53	83.71	82.58	90.46
SVM (rbf)	63.29	68.29	60.29	57.03	70.14
Logistic	84.54	85.61	85.39	84.54	91.58
RF	87.25	87.14	87.14	87.19	93.91
RotBoost	85.65	85.55	85.70	85.57	91.42
gcForest	87.50	87.50	87.63	87.49	92.69
DF21	87.27	87.15	87.40	87.21	94.04
DRDF	89.37	89.26	89.55	89.33	94.12

Table 2
The performance of each algorithm on the Lending Club dataset (%)

Classifier	Accuracy	Precision	Recall	F1-score	AUC
MLP	77.92	65.73	55.26	60.04	68.40
SVM (linear)	77.73	65.82	50.42	57.10	64.19
SVM (rbf)	77.94	67.69	51.79	58.68	64.77
Logistic	78.24	68.04	53.95	60.18	71.08
RF	78.12	67.27	53.70	59.72	70.19
RotBoost	77.91	65.69	55.52	60.18	71.16
gcForest	78.23	67.82	54.58	60.48	70.85
DF21	78.16	67.50	53.86	59.91	70.65
DRDF	78.60	69.14	55.60	61.64	73.20

Tables 1 and 2 show the test results of the five metrics for each algorithm on the two datasets. The best-performing algorithm in each dataset is highlighted in bold, and it can be seen that DRDF is competitive compared to other algorithms. In terms of accuracy, DRDF outperformed the original gcForest algorithm by 2.1% on the Credit Approval dataset and performed the best. Although DRDF’s accuracy on the Lending Club dataset was not significantly different from that of other algorithms, other metrics such as $F_{1}$ score, recall, and AUC would be more convincing in an imbalanced dataset.

Figure 7 shows the comparison of $F_{1}$ scores of each classifier, and it can be seen that DRDF’s $F_{1}$ score is higher than other classifiers. In Fig. 7a, DRDF had a 2% advantage over gcForest and DF21, and it was more advantageous compared to other common machine learning algorithms. In Fig. 7b, DRDF also had higher $F_{1}$ scores than other classifiers, with a 4.54% advantage over the lowest-scoring SVM and a 1.16% advantage over gcForest. Due to the imbalanced nature of the selected dataset, to better verify that the proposed DRDF algorithm is competitive compared to the original deep forest, Fig. 8 plots and compares gcForest, DF21, and DRDF from the three dimensions of precision, recall, and $F_{1}$ score, and it can be seen that DRDF had higher precision, recall, and $F_{1}$ score than the original deep forest algorithm and DF21.

Figure 7.
Comparison of $F_{1}$ scores of each classifier.

Figure 8.
Comparison of gcForest, DF21 and DRDF in three indicator dimensions.

In addition, according to business experience, the ROC curve is always the best choice for providing visual comparison, and decisions are always based on this metric, especially in the application field of imbalanced data. Therefore, improving this metric has a greater benefit for the model. The ROC curve shows the relationship between recall rate and FPR for each possible cutoff point. Generally, compared with curves with smaller coverage areas, curves with larger coverage areas are always considered better. Figure 9 shows the comparison of ROC curves and AUC values of various classifiers on the Lending Club dataset. In the ROC curve, the closer the curve is to the upper left corner, the better the performance. As shown in Figure 9, under the same FPR, the TPR of the DRDF classifier is higher than that of other classifiers. The corresponding AUC value shows that the DRDF method is 2.4% higher than the gcForest classifier and 9% higher than the lowest SVM, indicating that the DRDF algorithm has better performance than other classifiers on this dataset.

Figure 9.
Comparison of ROC curve and AUC value of each classifier on the Lending club dataset.

4.2 The experiments of DRDF-spark

To verify the performance of DRDF-spark, experiments evaluated the feasibility of the algorithm using evaluation metrics such as speedup ratio and scalability ratio. In the cluster environment of this experiment, Hadoop Distributed File System (HDFS) was used for file storage, Spark was used as the distributed computing component, and Hadoop Resource Manager (Yarn) was responsible for resource scheduling and management. The cluster was built using six nodes, with one node serving as the master node responsible for metadata management, cluster resource allocation, and task scheduling, and all other nodes serving as worker nodes responsible for data storage and task computation. All nodes were configured with Intel Xeon Cascade Lake 8255C, four-core CPUs, and 32 GB of memory.

The distributed experiments continued to use the Lending Club dataset from the DRDF experiment. Since the Credit Approval dataset is small and not suitable for a big data experiment environment, it was not used in this section. In addition, to better verify the performance of the distributed algorithm in different data environments, the Creditcard Transaction v2 dataset and the Lending Club All dataset were added. The Creditcard Transaction v2 dataset has a file size of 6.6 GB and contains over 20 million transactions generated by a multi-agent virtual world simulation executed by IBM, covering 2,000 consumers who reside in the United States but travel worldwide. The data also covers decades of purchase records, including multiple cards for many consumers. Data analysis shows that it matches real data reasonably well in many aspects, such as fraud rate, purchase amount, merchant category codes, and other indicators. Additionally, Lending Club All contains all lending data from 2007 to 2018, with over one million preprocessed records, making it suitable for big data experiments. The scale of the three datasets is shown in Table 3, with sample sizes ranging from hundreds of thousands to tens of millions and feature numbers ranging from low to high dimensions. The experiments evaluate the parallel performance on these three datasets of different scales, and the evaluation metrics are detailed as follows:

Table 3
Introduction of dataset size

Datasets	Number of samples	Number of features
Lending club	160985	111
Lending club all	1269492	118
Creditcard transaction v2	24386900	18

(1) Speedup ratio

The speedup ratio is a performance metric that evaluates the overall improvement of a parallel algorithm in horizontal scalability, and it is widely used to evaluate distributed algorithms in big data environments. Specifically, the speedup ratio is the ratio of the time consumed by running a task in a serial processing environment to the time consumed by running the same task in a parallel processing environment. It is calculated with Eq. (7).

$\displaystyle S_{p}=\frac{T_{s}}{T_{p}}$ (7)

Here $T_{s}$ is the running time of the algorithm under serial (single-machine) processing, and $T_{p}$ is the time taken by Spark to perform the parallel computation in a cluster environment with $p$ nodes. $S_{p}$ represents the speedup ratio achieved in the corresponding distributed environment with $p$ nodes, indicating the degree of performance improvement. A larger value of $S_{p}$ implies a greater improvement in job performance as the cluster size increases. However, as explained in the distributed algorithm design in Section 3, the DRDF algorithm cannot be fully parallelized, and certain parts can only be processed serially. Let $r$ denote the fraction of the work that can be parallelized, so that the serial fraction is ( $1-r$ ). The total running time under distributed processing is then given by Eq. (8).

$\displaystyle T_{p}=(1-r)\times T_{s}+\frac{r\times T_{s}}{p}$ (8)
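Substituting Eq. (8) into Eq. (7) gives $S_{p}=1/((1-r)+r/p)$ , which is bounded by $1/(1-r)$ however many nodes are added. The short sketch below evaluates this model; the value $r=0.8$ and the serial time of 100 are hypothetical numbers chosen for illustration and are not measurements from this paper, and the function names are likewise assumptions.

```python
def parallel_time(t_serial, r, p):
    """Eq. (8): running time on p nodes when a fraction r of the work parallelizes."""
    return (1 - r) * t_serial + r * t_serial / p


def model_speedup(t_serial, r, p):
    """Eq. (7) applied to Eq. (8): S_p = T_s / T_p = 1 / ((1 - r) + r / p)."""
    return t_serial / parallel_time(t_serial, r, p)


def speedup_limit(r):
    """Upper bound of S_p as the number of nodes grows without limit."""
    return 1 / (1 - r)


# Hypothetical inputs: serial time of 100 (arbitrary unit) and r = 0.8.
t_serial, r = 100.0, 0.8
for p in (1, 2, 4, 6, 12):
    print(p, round(model_speedup(t_serial, r, p), 3))
print("limit", round(speedup_limit(r), 3))
```

Under this model the gain from each additional node shrinks as $p$ grows, which matches the flattening speedup curves that a partly serial algorithm is expected to show.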

(2) Scalability

This metric, also known as cluster efficiency, is obtained by dividing the speedup ratio by the current number of nodes and belongs to one of the scalability metrics of Spark. The calculation formula is shown in Eq. (9).

$\displaystyle\text{Efficiency}_{p}=\frac{S_{p}}{p}$ (9)

Where $S_{{p}}$ is the speedup ratio with p nodes, and $\text{Efficiency}_{p}$ represents the scalability with $p$ nodes. If the speedup ratio increases linearly with the number of nodes, the value of $\text{Efficiency}_{p}$ will remain stable. However, if the speedup ratio does not increase linearly, the value of $\text{Efficiency}_{p}$ will decrease as the number of nodes increases. Therefore, better scalability is indicated by a slower decrease in the Efficiency value.
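Eqs. (7) and (9) can be applied directly to measured running times. As an illustration, the sketch below uses the Lending Club running times reported later in this paper (the single-machine time from Table 4 and the DRDF-spark times for one to six nodes from Table 5); the function and variable names are hypothetical.

```python
# Lending Club running times in seconds.
t_serial = 3013  # single-machine DRDF (Table 4)
t_parallel = {1: 2810, 2: 1853, 3: 1493, 4: 1314, 5: 1233, 6: 1201}  # Table 5


def speedup_ratio(t_s, t_p):
    """Eq. (7): serial running time divided by parallel running time."""
    return t_s / t_p


def efficiency(t_s, t_p, p):
    """Eq. (9): speedup ratio divided by the number of nodes."""
    return speedup_ratio(t_s, t_p) / p


speedups = {p: speedup_ratio(t_serial, t_p) for p, t_p in t_parallel.items()}
efficiencies = {p: efficiency(t_serial, t_p, p) for p, t_p in t_parallel.items()}

for p in sorted(t_parallel):
    print(p, round(speedups[p], 2), round(efficiencies[p], 2))
```

The speedup ratio grows with the number of nodes while the efficiency falls, which is the behavior the scalability metric is meant to expose.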

4.2.1 The experiment of bootstrap sample rate

During the construction of the original Rotation Forest, the bootstrap sampling technique was used to randomly select 75% of the samples from the training set in order to avoid obtaining the same principal components repeatedly and to maximize sample diversity. However, in the distributed deep rotation forest, there are multiple layers of parallel computing logic, and a 75% sampling ratio leads to significant memory usage. In fact, the sampling ratio can be lower, because the distributed deep rotation forest is applied in scenarios with massive amounts of data, where even a lower sampling ratio retains sufficient information. A lower sampling ratio also reduces memory usage and increases the training speed of the cascade layers.

As shown in Fig. 10, experiments were conducted on the Lending Club dataset using different bootstrap ratios to evaluate whether the size of the bootstrap ratio affects the accuracy, with 5 parameter samples selected for the sampling ratio ranging from 10% to 75%. The box plot in Fig. 10 indicates that the size of the bootstrap ratio has no significant effect on the accuracy of the rotation forest, except for a slight decrease in accuracy when the sampling ratio drops to 10%, in the big data environment. Therefore, for big data environments, faster training time and lower memory usage can be achieved by using a smaller sampling ratio, such as 25%.

Figure 10.

Experimental results graph showing the impact of bootstrap sample rate on accuracy.

4.2.2 The experiment of speedup ratio

Because the absolute speedup requires the running time of the algorithm in serial processing, i.e., the running time of the single-machine version of the algorithm, a series of performance benchmarks were first conducted on the single-machine version of DRDF. The running time and classification accuracy of DRDF are shown in Table 4. For the dataset with a small amount of data, the running time is measured in seconds, while for datasets with millions or tens of millions of records, it is measured in minutes because of the long running time of the algorithm. However, during the benchmark experiment, the newly added Creditcard Transaction v2 dataset had too many samples, and its original file occupied 6.6 GB of space, resulting in insufficient memory during execution on a single machine with 16 GB of RAM. Therefore, the running time and classification accuracy of the DRDF algorithm on this dataset could not be obtained, which is indicated by a dash in Table 4. In the subsequent calculation of the speedup ratio, this dataset is treated specially: the running time on the minimum runnable number of nodes is used in place of $T_{s}$ .

Table 4
DRDF algorithm stand-alone version performance

Datasets	Running time	Time unit	Accuracy
Lending club	3013	Seconds	78.60%
Lending club all	338	Minutes	80.65%
Creditcard transaction v2	Out of memory	–	–

Based on the benchmark experiment results, Table 5 shows the performance indicators of DRDF-spark, including runtime and speedup ratio, on three datasets of different scales tested in a cluster environment consisting of six nodes. It can be observed from the table that, in the limited cluster configuration environment, DRDF-spark is significantly faster than DRDF. With the accuracy essentially unchanged, training is up to 3.53 times faster, and the speedup ratio is expected to keep improving as further nodes are added.

The results for the Creditcard Transaction v2 dataset in Table 5 require a special explanation. This dataset ran out of memory during the single-machine benchmark experiment, so $T_{s}$ , the time consumed by the single-machine version of the algorithm under the limited resource configuration, could not be obtained for the speedup ratio formula. Similarly, the distributed deep rotation forest also ran out of memory when processing this dataset on a single node, and the problem was only resolved on a cluster of two nodes. Therefore, this paper uses $T_{2}$ , the runtime of DRDF-spark in a two-node cluster environment, in place of $T_{s}$ . Although this approach reduces the reported speedup ratio, it is a relatively good solution in a limited configuration environment.

It can be clearly seen from Table 5 that as the number of nodes increases, the running time decreases continuously and the corresponding speedup ratio increases. Figure 11 shows the change of the speedup ratio more intuitively. On the Lending Club dataset, the speedup ratio increases significantly before the fourth node is added. After the fourth node is added, the speedup ratio increases relatively slowly, indicating that the benefit obtained by adding cluster nodes keeps decreasing. This is reasonable, because the Lending Club dataset is not large and the degree of parallelism is already close to its upper limit. Moreover, increasing the number of nodes also increases the cost of network communication, which adds to the running time of the program. Because the Lending Club All dataset is somewhat larger, its speedup ratio is better overall than that of the Lending Club dataset, and there is still a slight, though not obvious, upward trend at the sixth node. This behavior was also verified on the much larger Creditcard Transaction v2 dataset: when the number of nodes increases to six, the speedup ratio still increases significantly, so adding further nodes can be expected to yield additional gains.

Table 5

The performance of DRDF-spark on each dataset

Dataset	Running time / speedup ratio, by number of nodes						Unit	Accuracy (%)
	1	2	3	4	5	6
Lending club all	272	160	131	117	110	105	Minute	80.68
	1.24	2.11	2.58	2.89	3.09	3.21	Ratio
Lending club	2810	1853	1493	1314	1233	1201	Second	78.55
	1.07	1.63	2.02	2.29	2.44	2.51	Ratio
Creditcard v2	–	1018	488	377	321	289	Minute	99.92
	–	–	2.12	2.70	3.17	3.53	Ratio

Table 6

The scalability ratio of DRDF-spark on each dataset

Dataset	Scalability ratio, by number of nodes
	1	2	3	4	5	6
Lending club all	1.24	1.06	0.86	0.72	0.62	0.53
Lending club	1.07	0.81	0.67	0.57	0.49	0.42
Creditcard v2	–	–	0.70	0.68	0.63	0.58

Figure 11.

The schematic diagram of DRDF-spark speedup ratio.

Figure 12.

The schematic diagram of DRDF-spark scalability ratio.

4.2.3 The experiment of scalability

To evaluate DRDF-Spark from a different perspective, the experiment next assesses its scalability. The detailed data and a visualization of the scalability ratio for the three datasets are shown in Table 6 and Fig. 12, respectively. Because the Creditcard Transaction v2 dataset has no speedup ratio data for one and two nodes, it has no scalability data for those node counts either, which is indicated by a dash in the table. As Fig. 12 shows, the scalability ratio of DRDF-Spark decreases as the number of computing nodes in the cluster increases, and the decline becomes more gradual at larger node counts. For this index, better scalability corresponds to a slower decline in the efficiency value. Fig. 12 shows that the scalability ratio of the large Creditcard Transaction v2 dataset decreases very slowly as the number of nodes increases, whereas the Lending Club dataset declines fastest, followed by the Lending Club All dataset. This suggests that the scalability of DRDF-Spark improves to some extent as the data size grows.
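The values in Table 6 are consistent with dividing each speedup ratio in Table 5 by the corresponding number of nodes, that is, with the usual parallel efficiency. The sketch below checks this under that assumption. The formula is inferred from the tabulated numbers rather than quoted from the definition given earlier in the paper, and all names are illustrative.

```python
def by_nodes(first_node, values):
    """Map consecutive node counts, starting at first_node, to values."""
    return {first_node + i: v for i, v in enumerate(values)}


# Speedup ratios from Table 5 (Creditcard v2 starts at three nodes).
SPEEDUP = {
    "Lending club all": by_nodes(1, [1.24, 2.11, 2.58, 2.89, 3.09, 3.21]),
    "Lending club": by_nodes(1, [1.07, 1.63, 2.02, 2.29, 2.44, 2.51]),
    "Creditcard v2": by_nodes(3, [2.12, 2.70, 3.17, 3.53]),
}

# Scalability ratios as printed in Table 6.
TABLE6 = {
    "Lending club all": by_nodes(1, [1.24, 1.06, 0.86, 0.72, 0.62, 0.53]),
    "Lending club": by_nodes(1, [1.07, 0.81, 0.67, 0.57, 0.49, 0.42]),
    "Creditcard v2": by_nodes(3, [0.70, 0.68, 0.63, 0.58]),
}


def scalability(speedup, nodes):
    """Assumed definition: speedup ratio divided by the node count."""
    return speedup / nodes


def max_deviation():
    """Largest absolute gap between recomputed and tabulated values."""
    return max(
        abs(scalability(s, n) - TABLE6[name][n])
        for name, per_node in SPEEDUP.items()
        for n, s in per_node.items()
    )


if __name__ == "__main__":
    print(f"largest deviation from Table 6: {max_deviation():.3f}")
```

Under this assumed definition, the recomputed values agree with Table 6 to within 0.01, a gap that is plausibly due to rounding of the published ratios.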

5. Conclusion

Credit fraud detection is an important means of preventing credit risk and one of the most important applications in the financial field. In recent years, with the continuous development of financial and internet technology, researchers at home and abroad have studied credit fraud detection models in depth using large amounts of data and machine learning techniques. These models have been widely applied in banks, internet finance platforms, online payment companies, and similar institutions, and have achieved good results.

This paper studies the application of deep forest to fraud detection. It proposes a dense rotation deep forest algorithm (DRDF) based on an improved RotBoost, which combines the advantages of deep forest with the characteristics of the application scenario in order to improve the accuracy and stability of classification and reduce the risk borne by financial platforms.

Firstly, gcForest faces a problem similar to that of deep neural network models when handling data without spatial or temporal correlation: it cannot use the multi-granularity scanning layer, with its sliding-window mechanism, to increase sample diversity. This paper therefore introduces the improved RotBoost algorithm into the cascade layer, which not only constructs diverse classifiers but also gives the model a notion of weighting, thereby enhancing its representation learning ability. Experimental results show that the proposed DRDF achieves better classification performance than other deep forest models and some mainstream ensemble algorithms. Secondly, in the era of big data, single-machine algorithms face limited computing and storage capacity and poor scalability. Because the construction process of the rotation forest is well suited to distributed computing, a distributed parallel DRDF algorithm based on Spark is proposed for credit fraud detection on massive data. Experimental results show that DRDF-Spark trains up to 3.53 times faster, and the trend on the largest dataset suggests that the speedup ratio would continue to increase with more nodes.

However, the parallel process of the DRDF-spark algorithm contains three parallel operations that require a large amount of memory. The largest dataset used in the experiments is 6.6 GB; if the data volume grows to the TB level, the algorithm may hit a memory bottleneck, which the pre-aggregation mechanism alone does not appear to solve. Memory optimization is therefore a direction for future research. In addition, although sharing one rotation matrix among multiple forests balances parallelism against communication overhead in the distributed deep rotation forest, the segmentation granularity, that is, the degree of parallelism, still has to be set manually. Because grid search is time-consuming, it is hoped that a linear adaptive algorithm can be found to determine the best segmentation granularity, which requires further research.

