Comparative study on credit card fraud detection based on different support vector machines

Abstract

Credit card fraud is a form of financial crime that has grown alongside the economy and causes billions of dollars in losses every year. It not only seriously harms the interests of cardholders and financial institutions, but also undermines the order of credit management. Moreover, fraudsters constantly explore new strategies, which drives the fraud rate up. A predictive model for credit card fraud detection is therefore essential to minimize losses. Because it can distinguish fraudulent from non-fraudulent transactions, machine learning is one of the most efficient solutions for detecting fraud. Support vector machines (SVM) have proven to be an algorithm with excellent performance. Nevertheless, the performance of SVM depends largely on the correct choice of the kernel function and of the model parameters (C and g); if they are not selected properly, the false positive rate can be very high. In this paper, based on real transaction data from the credit card business, we first identify the kernel function best suited to the data set. Second, we optimize the SVM parameters with the cuckoo search algorithm (CS), the genetic algorithm (GA) and the particle swarm optimization algorithm (PSO). Before parameter optimization, the Linear kernel function is the best kernel function, with an accuracy of 91.56%. After parameter optimization, the Radial basis function performs best: its accuracy improves from 42.86% to 98.05%, the highest of all kernels. Compared with CS-SVM and GA-SVM, PSO-SVM has the best overall performance.

Keywords

Credit card fraud, fraud detection, support vector machine, kernel function

1. Introduction

E-commerce has grown by leaps and bounds into a remarkably successful medium, with a visible impact on people's daily lives in many areas such as business, work, and entertainment. At the same time, however, credit card fraud cases of all kinds are increasing, which damages the trust of cardholders and causes great losses to banks [7]. According to Robertson [6], global credit card fraud losses increased from $7.6 billion in 2010 to $21.81 billion in 2015. By 2020, global credit card fraud losses are expected to reach $31.67 billion. In addition, credit card fraud is linked to organized crime, terrorist activities, and drug trafficking, which poses a threat to society [4].

As fraud detection technology develops, fraudsters keep improving the concealment of fraud to avoid being discovered. Credit card fraud detection methods therefore need continuous innovation to improve detection accuracy [18]. These methods fall into two categories: supervised and unsupervised. In supervised fraud detection, models are estimated from samples of fraudulent and legitimate transactions, and new transactions are classified as fraudulent or legitimate. In unsupervised fraud detection, outliers or unusual transactions are identified as potential fraud cases. Both kinds of method can predict the likelihood of fraud in any given transaction [21, 24]. In practice, credit card fraud identification models are actively used [22]. Most of them are based on case-based reasoning [20], neural networks [31], random forests [23], logistic regression [12], or support vector machines [26]. These studies focus on using machine learning models to predict fraud at the level of individual transactions. The recognition performance of different machine learning methods varies greatly across data sets.

Support Vector Machine (SVM) is a supervised machine learning algorithm for data classification problems. It is widely used in many fields, such as image recognition [25], credit evaluation [5] and public safety [16]. SVM can solve both linear and nonlinear binary classification problems by finding a hyperplane, determined by the support vectors, that separates the input data. The kernel function of the support vector machine and its parameters have a particularly important influence on the support vector space and on the classification result. For different types of data sets, different kernel functions perform differently. We select four common kernel functions for comparative study. Almost all artificial intelligence algorithms require careful tuning of parameters, and finding optimal parameters has become an important obstacle to improving algorithm performance. To improve efficiency, methods for automatically optimizing parameters, such as machine learning methods, have been proposed in recent years. In the SVM, the parameters of the kernel function directly affect the classification result. We use several metaheuristic algorithms to adjust the SVM parameters. Among them, three algorithms with better performance, namely Cuckoo Search (CS), Genetic Algorithm (GA) and Particle Swarm Optimization (PSO), are compared to select the most suitable optimization solution for this data set. This paper makes two contributions: the first is to address credit card fraud identification in high-dimensional, noisy data; the second is to find the optimal support vector machine for improving the accuracy of credit card fraud identification by comparing different optimization schemes.

The rest of the paper is organized as follows. The second section summarizes related work on SVM. The third section describes the four kinds of data mining technology used in this study. The fourth section discusses three improved support vector machine algorithms and their basic implementation process. The fifth section presents the experimental setup and the performance measures used in this comparative study. The last section summarizes the results and discusses issues for further research.

2. Related work

As discussed in the “Introduction” section, many SVM speed optimization techniques have been proposed, most of which address the problem through different approaches, including instance selection, parameter optimization and feature selection [1]. In this paper, we choose parameter optimization. Specifically, parameter tuning can be regarded as maximizing a black-box function whose independent variables are the model parameters and whose dependent variable is the generalization ability of the model. The maximum of this function is found by an optimization method, which yields a set of optimal parameters [8]. We choose nature-inspired heuristic algorithms to optimize the parameters. Existing nature-inspired techniques mainly include evolutionary algorithms (EA) [31], ant colony optimization (ACO) [17] and artificial immune systems (AIS) [15]. We choose particle swarm optimization, cuckoo search, and genetic algorithms to optimize support vector machines. We summarize the advantages of these methods in processing credit card fraud data, discuss their limitations, and try to find the best algorithm for improving the classification performance of support vector machines.

3. Materials and methods

3.1 Support vector machine

SVM is an effective machine learning tool for pattern classification and regression that minimizes both prediction error and model complexity [23]. It constructs a classification boundary that separates points with different labels while maximizing the margin to the closest data points. Different separating hyperplanes result in different support vectors.

The support vector machine was originally proposed to study the linearly separable problem. Assume a training set $({x_{i},y_{i}}),\ i=1,2,3,\cdots,l,\ x_{i}\in R^{n},\ y_{i}\in\{{+1,-1}\}$ , where $l$ is the number of samples and $n$ is the input dimension. When the data are linearly separable, the optimal classification hyperplane is:

$\displaystyle{\omega x}+{b}=0$ (1)

At this time, the classification interval is $\frac{2}{||\omega||}$ , and it is obvious that when $||\omega||$ takes the minimum value, the classification interval is the largest. Classification problems can be described as solving the following constrained optimization problems:

$\displaystyle\left\{{{\begin{array}[]{l}{\min\frac{||\omega||^{2}}{2}}\\ {s.t.\ y_{i}({{\omega}x_{i}+{b}})-1\geqslant 0,i=1,2,3,\cdots,l}\\ \end{array}}}\right.$ (2)

It is worth mentioning that the majority of samples in a data set may be linearly separable while only a few samples (possibly abnormal points) prevent the optimal classification hyperplane from being found. For such cases, the usual practice is to introduce non-negative slack variables $\xi_{i},i=1,2,3,\cdots,l$ , and to modify the optimization objective and constraints accordingly, namely:

$\displaystyle\left\{{{\begin{array}[]{l}{\min\frac{||\omega||^{2}}{2}+C\sum_{i=1}^{l}\xi_{i}}\\ {s.t.\left\{{{\begin{array}[]{l}{y_{i}({{\omega}x_{i}+{b}})\geqslant 1-\xi_{i}}\\ {\xi_{i}\geqslant 0}\\ \end{array}},i=1,2,3,\cdots,l}\right.}\\ \end{array}}}\right.$ (3)

In Eq. (3), $C$ is a penalty factor that controls how heavily misclassified samples are penalized, thus achieving a compromise between the proportion of misclassified samples and the complexity of the model. The larger $C$ is, the heavier the penalty for misclassification. Solving the above optimization problem by the Lagrange multiplier method gives the following optimal decision function:

$\displaystyle{f}({x})=\text{sgn}\left[{\sum_{i=1}^{l}y_{i}\alpha_{i}({x\cdot x_{i}})+b}\right]$ (4)

In Eq. (4), $\alpha_{i}$ is a Lagrange multiplier. For an input test sample $x$ , the category of $x$ is determined by Eq. (4). According to the Kuhn-Tucker (K-T) condition, the solution to the above optimization problem must satisfy:

$\displaystyle\alpha_{i}({y_{i}({{\omega}x_{i}+{b}})-1})=0$ (5)

Therefore, for most samples $\alpha_{i}$ takes the value zero; only for the support vectors is $\alpha_{i}$ nonzero, and they usually account for a small proportion of the total sample. In this way, only a small number of support vectors are needed to classify the samples correctly.

In the case of nonlinear classification problems, the support vector machine maps the samples to a high-dimensional space H by means of the kernel function ${K}(x_{i},x_{j})$ , and then solves the classification problem in H. Table 1 lists the four commonly used kernel functions. The process of finding the optimal classification hyperplane in the high-dimensional feature space is similar to the linearly separable SVM case, except that the dot product in the high-dimensional feature space is replaced by the kernel function, which greatly reduces the computational complexity. According to the Mercer condition, the corresponding optimal decision function becomes:

$\displaystyle{f}({x})=\text{sgn}\left[{\sum_{i=1}^{l}y_{i}\alpha_{i}K({x,x_{i}})+b}\right]$ (6)

Table 1

Four common kernel functions

No.	Kernel function	Expression
1	Linear kernel function	$K(x,x_{i})={x}\cdot x_{i}$
2	Polynomial kernel function	$K(x,x_{i})=({{x}\cdot x_{i}+1})^{d}$
3	Radial basis function	$K(x,x_{i})=\text{exp}\left({-\frac{\|x-x_{i}\|^{2}}{2\sigma^{2}}}\right)$
4	Sigmoid kernel function	$K(x,x_{i})=\text{tanh}({k({{x}\cdot x_{i}})+\theta})$
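The four kernels in Table 1 and the decision function of Eq. (6) can be sketched in a few lines of Python. This is only an illustration: the function names, the default values of `d`, `sigma`, `k` and `theta`, and the toy support vectors are assumptions and are not taken from the paper. Under the common parameterization $\exp(-g\|x-x_{i}\|^{2})$, the parameter g would correspond to $1/(2\sigma^{2})$; the paper does not state which convention it uses.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def linear_kernel(x, xi):
    return dot(x, xi)

def polynomial_kernel(x, xi, d=3):          # d is an assumed default
    return (dot(x, xi) + 1.0) ** d

def rbf_kernel(x, xi, sigma=1.0):           # sigma is an assumed default
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, xi))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def sigmoid_kernel(x, xi, k=1.0, theta=0.0):  # k, theta are assumed defaults
    return math.tanh(k * dot(x, xi) + theta)

def decide(x, support_vectors, labels, alphas, b, kernel):
    """Eq. (6): sign of the kernel expansion over the support vectors."""
    s = b + sum(y * a * kernel(x, sv)
                for sv, y, a in zip(support_vectors, labels, alphas))
    return 1 if s >= 0 else -1

# Hypothetical toy model with two support vectors (not from the paper).
svs, ys, alphas, bias = [[1.0, 1.0], [-1.0, -1.0]], [1, -1], [0.5, 0.5], 0.0
label_linear = decide([2.0, 2.0], svs, ys, alphas, bias, linear_kernel)
label_rbf = decide([-2.0, -1.0], svs, ys, alphas, bias, rbf_kernel)
```

Swapping the `kernel` argument is the only change needed to move between the rows of Table 1, which is what makes a kernel comparison of the kind reported in Section 5.3 straightforward to set up.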

3.2 Optimization algorithms

3.2.1 Cuckoo Search algorithm (CS)

The Cuckoo Search algorithm is an optimization algorithm proposed by Yang and Deb of the University of Cambridge in 2009 [28]. It simulates the brood parasitism of cuckoos: the parameters of the problem to be solved are encoded as a nest, and multiple nests constitute a population. The population is updated by selecting new nests through Lévy flights and discarding nests with a certain probability, and this is repeated over several iterations until the optimal solution is obtained. To simplify the description of CS, the following four idealized rules are used [27]:

Each cuckoo lays one egg at a time, which stands for a design solution, and dumps it in a nest chosen at random from the host nests.

The best nests with high quality eggs (better solution) will be passed to the next generation.

The number of available host nests is limited to n, and a host bird can recognize the egg of cuckoo bird with a probability $p_{a}\in[{0,1}]$ .

In this case, it can either throw the egg away or abandon the nest in order to build a completely new nest in a new location.

The path and location update formula for the cuckoo nest is as follows:

$\displaystyle x_{i}^{({t+1})}=x_{i}^{(t)}+\beta\oplus L(\lambda),i=1,2,3,\cdots,n$ (7)

In Eq. (7), $x_{i}^{(t)}$ represents the position of the i-th nest in the t-th generation; $\oplus$ represents entry-wise multiplication; $\beta$ is the step-size control factor, whose value obeys a normal distribution; $L(\lambda)$ represents the Lévy flight random search path, with $L\sim\zeta=t^{-\lambda}(1<\lambda\leqslant 3)$ , where $\zeta$ is the random step size obtained by the Lévy flight. After the position update, a random number $\gamma\in[{0,1}]$ is compared with $p_{a}$ . If $\gamma>p_{a}$ , then $x_{i}^{({t+1})}$ is changed; otherwise it is kept unchanged. Finally, the set of nest positions $y_{i}^{({t+1})}$ with the best test values is retained and is still denoted $x_{i}^{({t+1})}$ . Refining Eq. (7) gives Eq. (8):

$\displaystyle x_{i}^{t+1}=x_{i}^{t}+\textit{stepsize}\times({\delta-\textit{best}}),i=1,2,3,\cdots,n$ (8)

In Eq. (8), stepsize represents the step size produced by the Lévy flight, $\delta$ represents the position of a given nest, and best represents the best position among the current nests [30].
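The update of Eq. (8) can be illustrated with a minimal cuckoo search on a toy objective. The sketch below rests on several assumptions that the paper does not specify: the Lévy step is drawn with Mantegna's algorithm using a stability index of 1.5 (which under the usual relation would correspond to $\lambda=2.5$), $\delta$ is taken to be the nest being updated, nests with $\gamma>p_{a}$ are perturbed by the difference of two randomly chosen nests, a candidate replaces a nest only if it is better, and the objective is minimized. All function and parameter names are hypothetical.

```python
import math
import random

def levy_step(index=1.5, rng=random):
    """Heavy-tailed step length via Mantegna's algorithm (assumed choice)."""
    sigma_u = (math.gamma(1 + index) * math.sin(math.pi * index / 2)
               / (math.gamma((1 + index) / 2) * index
                  * 2 ** ((index - 1) / 2))) ** (1 / index)
    u = rng.gauss(0.0, sigma_u)
    v = rng.gauss(0.0, 1.0)
    return u / abs(v) ** (1 / index)

def cuckoo_search(objective, nests, lower, upper,
                  pa=0.25, scale=0.01, iterations=50, seed=0):
    """Minimize `objective`; returns (best_nest, best_cost)."""
    rng = random.Random(seed)
    nests = [list(n) for n in nests]
    cost = [objective(n) for n in nests]

    def clip(z):
        return min(max(z, lower), upper)

    def try_replace(i, candidate):
        # Greedy acceptance: keep the candidate only if it improves nest i.
        c = objective(candidate)
        if c < cost[i]:
            nests[i], cost[i] = candidate, c

    for _ in range(iterations):
        best = list(nests[cost.index(min(cost))])
        # Levy-flight move, Eq. (8): x <- x + stepsize * (x - best).
        for i in range(len(nests)):
            step = scale * levy_step(rng=rng)
            try_replace(i, [clip(x + step * (x - b))
                            for x, b in zip(nests[i], best)])
        # Discovery phase: nests with gamma > pa are changed (assumed walk).
        for i in range(len(nests)):
            if rng.random() > pa:
                j, k = rng.randrange(len(nests)), rng.randrange(len(nests))
                r = rng.random()
                try_replace(i, [clip(x + r * (a - b)) for x, a, b
                                in zip(nests[i], nests[j], nests[k])])
    i_best = cost.index(min(cost))
    return nests[i_best], cost[i_best]

def sphere(p):
    """Toy objective standing in for a classification error."""
    return sum(z * z for z in p)

start = [[3.0, -2.0], [1.5, 4.0], [-4.0, 0.5], [2.0, 2.0]]
best_nest, best_cost = cuckoo_search(sphere, start, -5.0, 5.0)
```

To maximize a fitness such as classification accuracy, the same sketch can be used with the negated fitness as the objective.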

3.2.2 Particle Swarm Optimization Algorithm (PSO)

In 1995, Kennedy and Eberhart proposed the PSO method [13], a bionic algorithm that simulates birds searching for food. It is a global search algorithm in which particles look for the optimum by combining their own historical information with the social information shared among them. The algorithm has many advantages, such as fast convergence and a simple concept. PSO has been used to solve the model selection problem of support vector machines [14], where each set of SVM parameters is treated as a particle.

In the PSO algorithm, each particle represents a possible solution and has a fitness value determined by the function being optimized, and all particles form a population (Swarm). The particles jointly determine their flight direction and speed according to their own flight experience and the experience of the population, in order to find the optimal solution. Assume that the problem is solved in a $d$-dimensional search space and that the population consists of $n$ particles:

$\displaystyle\textit{Swarm}=\left\{{x_{1}^{(k)},x_{2}^{(k)},\cdots,x_{n}^{(k)}}\right\}$ (9)

The velocity vector and position vector of the i-th particle at time $k$ are:

$\displaystyle v_{i}^{(k)}=\left\{{v_{i1}^{(k)},v_{i2}^{(k)},\cdots,v_{id}^{(k)}}\right\},i=1,2,3,\cdots,n$ (10)

$\displaystyle x_{i}^{(k)}=\left\{{x_{i1}^{(k)},x_{i2}^{(k)},\cdots,x_{id}^{(k)}}\right\},i=1,2,3,\cdots,n$ (11)

Equations (10) and (11) describe the motion of the particle in each dimension. In each iteration, each particle updates itself by tracking two best-known solutions. One is the best solution found by the particle itself (pBest), $P_{i}=\{{P_{i1},P_{i2},\cdots,P_{id}}\}$ . The other is the best solution found by the entire group (gBest), $P_{g}=\{{P_{g1},P_{g2},\cdots,P_{gd}}\}$ . The particles update their velocity and position according to the following formulas:

$\displaystyle v_{ij}^{({k+1})}=\omega v_{ij}^{(k)}+c_{1}r_{1}\left[{P_{ij}-x_{ij}^{(k)}}\right]+c_{2}r_{2}\left[{P_{gj}-x_{ij}^{(k)}}\right],$

$\displaystyle x_{ij}^{({k+1})}=x_{ij}^{(k)}+v_{ij}^{({k+1})},j=1,2,\cdots,d$ (12)

In Eq. (12), $\omega$ is the inertia weight, $c_{1}$ and $c_{2}$ are the acceleration factors of PSO, and $r_{1}$ and $r_{2}$ are random numbers in (0, 1).
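A single velocity and position update following Eq. (12) can be written as the short function below. It is a sketch: the function name is hypothetical, the defaults $\omega=1$ and $c_{1}=c_{2}=2$ are taken from Table 2, and drawing fresh $r_{1}$ and $r_{2}$ for every dimension is an assumption, since the paper does not say how often they are sampled.

```python
import random

def pso_step(position, velocity, p_best, g_best,
             w=1.0, c1=2.0, c2=2.0, rng=random):
    """Apply Eq. (12) once to one particle; returns (new_position, new_velocity)."""
    new_position, new_velocity = [], []
    for x, v, p, g in zip(position, velocity, p_best, g_best):
        r1, r2 = rng.random(), rng.random()   # fresh draws per dimension (assumed)
        v_next = w * v + c1 * r1 * (p - x) + c2 * r2 * (g - x)
        new_velocity.append(v_next)
        new_position.append(x + v_next)
    return new_position, new_velocity

# A particle that already sits on both pBest and gBest only keeps its inertia.
x_next, v_next = pso_step([1.0, 2.0], [0.5, -0.5], [1.0, 2.0], [1.0, 2.0])
```

In the usage line, both attraction terms vanish, so the particle moves by exactly $\omega v$, which is a convenient sanity check for an implementation.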

3.2.3 Genetic Algorithm (GA)

The Genetic Algorithm is a computational model that simulates the natural selection of Darwin's theory of biological evolution and the evolutionary process of genetic mechanisms. It evolves a population through selection, crossover, and mutation mechanisms [11, 29]. It is a powerful tool for obtaining the optimal kernel function parameters.

Population initialization

Since the genetic algorithm cannot directly deal with the parameters of the problem space, it is necessary to express the feasible solution of the required problem into a chromosome or an individual of the genetic space by coding.

Fitness function

The fitness function is a criterion for distinguishing the quality of individuals in a group. It is the only basis for natural selection and is transformed by the objective function.

Selection operation

The selection operation selects good individuals from the old population with a certain probability to form a new population, which then breeds the next generation of individuals. The probability that an individual is selected is related to its fitness value: the higher the fitness value, the greater the probability of being selected. Many selection methods exist for genetic algorithms, such as the roulette method and the tournament method. In this paper, the roulette method is used, that is, a selection strategy based on the fitness ratio, and the probability that individual $i$ is selected is:

$\displaystyle p_{i}=\frac{F_{i}}{\sum_{j=1}^{N}F_{j}}$ (13)

Where $F_{i}$ is the fitness value of individual $i$ and $N$ is the number of individuals in the population.
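Roulette selection according to Eq. (13) can be sketched as follows. The function names are hypothetical, and the sketch assumes non-negative fitness values with a positive sum.

```python
import random

def selection_probabilities(fitness):
    """Eq. (13): p_i = F_i / sum_j F_j."""
    total = sum(fitness)
    return [f / total for f in fitness]

def roulette_select(population, fitness, rng=random):
    """Pick one individual with probability proportional to its fitness."""
    threshold = rng.random() * sum(fitness)
    running = 0.0
    for individual, f in zip(population, fitness):
        running += f
        if running > threshold:
            return individual
    return population[-1]   # guard against floating-point round-off

probabilities = selection_probabilities([1.0, 1.0, 2.0])
chosen = roulette_select(["a", "b"], [0.0, 1.0], rng=random.Random(0))
```

An individual with zero fitness, such as `"a"` in the usage line, is never selected.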

Crossover operation

The crossover operation randomly selects two individuals from the population and passes the good characteristics of the parent strings on to the child strings by exchanging and combining the two chromosomes, thereby generating new, better individuals. Since the individuals use real-number coding, the crossover operation uses the real-valued crossover method. The crossover of the k-th chromosome $a_{k}$ and the l-th chromosome $a_{l}$ at position $j$ is:

$\displaystyle a_{kj}=a_{kj}({1-b})+a_{lj}b$ (14)

$\displaystyle a_{lj}=a_{lj}({1-b})+a_{kj}b$ (15)

Where $b$ is a random number in the interval [0, 1].
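The real-valued crossover of Eqs. (14) and (15) can be sketched as below, taking both right-hand sides to use the parents' genes $a_{kj}$ and $a_{lj}$. The function name is hypothetical, and returning new lists instead of modifying the parents in place is a design choice of this sketch.

```python
import random

def arithmetic_crossover(a_k, a_l, j, rng=random):
    """Blend gene j of two real-coded chromosomes; other genes are copied."""
    b = rng.random()                       # random number in [0, 1)
    child_k, child_l = list(a_k), list(a_l)
    child_k[j] = a_k[j] * (1 - b) + a_l[j] * b
    child_l[j] = a_l[j] * (1 - b) + a_k[j] * b
    return child_k, child_l

parent_k, parent_l = [1.0, 10.0], [3.0, 20.0]
child_k, child_l = arithmetic_crossover(parent_k, parent_l, j=1,
                                        rng=random.Random(0))
```

Because the two children are complementary convex combinations of the parents, the sum of the two genes at position $j$ is preserved and both children stay between the parent values.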

Mutation operation

The main purpose of the mutation operation is to maintain population diversity. The mutation operation randomly selects an individual from the population and selects one point in the individual to mutate to produce a better individual. The operation method of mutating the j-th gene $a_{ij}$ of the i-th individual is

$\displaystyle a_{ij}=\left\{{{\begin{array}[]{l}{a_{ij}+({a_{ij}-a_{\max}})\ast f(g),r\geqslant 0.5}\\ {a_{ij}+({a_{\min}-a_{ij}})\ast f(g),r<0.5}\\ \end{array}}}\right.$ (16)

Where $a_{\max}$ is the upper bound of the gene $a_{ij}$ and $a_{\min}$ is the lower bound of the gene $a_{ij}$ . ${f}({g})=r_{2}\left({1-\frac{g}{G_{\max}}}\right)^{2}$ , $r_{2}$ is a random number, $g$ is the current number of iterations, $G_{\max}$ is the maximum number of evolutions, and $r$ is a random number in the interval [0, 1].
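The mutation of Eq. (16) can be sketched directly from the formula as printed. The function name is hypothetical, and drawing $r_{2}$ uniformly from [0, 1) is an assumption, since the text only calls it a random number.

```python
import random

def mutate_gene(a, a_min, a_max, g, g_max, rng=random):
    """Eq. (16) as printed: the perturbation shrinks as generation g nears g_max."""
    r = rng.random()
    r2 = rng.random()                      # assumed uniform in [0, 1)
    f_g = r2 * (1 - g / g_max) ** 2
    if r >= 0.5:
        return a + (a - a_max) * f_g
    return a + (a_min - a) * f_g

early = mutate_gene(3.0, 0.0, 10.0, g=1, g_max=50, rng=random.Random(0))
final = mutate_gene(3.0, 0.0, 10.0, g=50, g_max=50, rng=random.Random(0))
```

Since $f(g)$ falls to zero at $g=G_{\max}$, the gene is left unchanged in the last generation. Note that, as printed, both branches give a non-positive increment for a gene inside its bounds; some formulations of this operator use $(a_{\max}-a_{ij})$ in the first branch so that the gene can also move upward.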

The crossover probability $p_{c}$ and the mutation probability $p_{m}$ of GA have a great influence on its performance. The larger $p_{c}$ is, the faster new individuals are produced. However, an excessive $p_{c}$ will quickly destroy individual structures with high fitness, while a small $p_{c}$ will make the search slow or even stagnant. If the mutation probability $p_{m}$ is too small, new individual structures are not easily generated; if $p_{m}$ is too large, GA becomes a purely random search algorithm. Therefore, an adaptive method must be used to make the crossover probability and the mutation probability change with the fitness:

$\displaystyle p_{c}=\left\{{{\begin{array}[]{l}{p_{c1}-\frac{p_{c2}({f_{c}-f_{\text{avg}}})}{f_{\max}-f_{\text{avg}}},f_{c}\geqslant f_{\text{avg}}}\\ {p_{c1},f_{c}<f_{\text{avg}}}\\ \end{array}}}\right.$ (17)

$\displaystyle p_{m}=\left\{{{\begin{array}[]{l}{p_{m1}-\frac{p_{m2}({f_{m}-f_{\text{avg}}})}{f_{\max}-f_{\text{avg}}},f_{m}\geqslant f_{\text{avg}}}\\ {p_{m1},f_{m}<f_{\text{avg}}}\\ \end{array}}}\right.$ (18)
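Equations (17) and (18) share one form, so a single helper covers both. The sketch below assumes the usual reading of the symbols, which the paper does not define: $f_{\text{avg}}$ and $f_{\max}$ are the average and maximum fitness of the population, $f_{c}$ is the larger fitness of the two individuals to be crossed, and $f_{m}$ is the fitness of the individual to be mutated. The numeric values of $p_{c1}$ and $p_{c2}$ in the usage lines are hypothetical, and returning the base probability when $f_{\max}=f_{\text{avg}}$ is an added guard against division by zero.

```python
def adaptive_probability(p1, p2, f, f_avg, f_max):
    """Shared form of Eqs. (17) and (18).

    p1, p2 -- base probability and reduction range (p_c1, p_c2 or p_m1, p_m2)
    f      -- fitness of the individual(s) concerned (f_c or f_m, assumed meaning)
    """
    if f < f_avg or f_max == f_avg:
        return p1
    return p1 - p2 * (f - f_avg) / (f_max - f_avg)

# Hypothetical values: below-average individuals keep the full probability,
# the best individual gets the lowest one.
pc_weak = adaptive_probability(0.9, 0.3, f=0.2, f_avg=0.5, f_max=1.0)
pc_best = adaptive_probability(0.9, 0.3, f=1.0, f_avg=0.5, f_max=1.0)
```

The effect is that fitter individuals are crossed and mutated less often, which protects good structures while weaker individuals continue to explore.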

Algorithm 1: CS-SVM
Input: Dataset X, step-size bounds $\textit{Step}_{\min}$ and $\textit{Step}_{\max}$ , and the number of iterations N.
Output: Credit card fraud results
Step 1: Set the discovery probability $p_{a}$ to 0.25 and calculate the fitness of each set of nest positions on the training set. Find the current best nest and record its position $x_{i}^{(t)}$ and the best fitness $F_{\max}$ .
Step 2: Retain the position $x_{i}^{(t)}$ of the best nest of the previous generation, calculate the Lévy flight step according to Eqs (7) and (8), use the Lévy flight to update the positions of the other nests to obtain a new set of nest positions, and calculate their fitness $F$ .
Step 3: According to the fitness $F$ , compare the new nest positions with the nest positions of the previous generation $p_{t-1}$ , and replace the worse positions with the better ones to obtain an improved set of nest positions $p_{t}$ .
Step 4: Compare the random number $\gamma$ with $p_{a}$ , keep the nests in $p_{t}$ that are less likely to be discovered and update those that are more likely to be discovered. Calculate the fitness of the new nests, compare it with the fitness of the nest positions in $p_{t}$ , and replace the worse positions with the better ones to obtain a new, better set of nest positions $p_{t}$ .
Step 5: Find the best nest position $x_{i}^{(t)}$ from Step 4 and determine whether its fitness $F$ satisfies the requirement. If so, stop the search and output the global best fitness $F_{\max}$ and the corresponding best nest $x_{i}^{(t)}$ ; otherwise, return to Step 2 and continue searching.

Algorithm 2: PSO-SVM
Input: Dataset X; the number of iterations, the number of particles, and the search range of C and g. The initial position $x_{i}^{0}$ and the initial velocity $v_{i}^{0}$ of each particle are set randomly within the allowable range.
Output: Credit card fraud results
Step 1: Fitness calculation. According to the fitness function, calculate the fitness value of each particle after each iteration and update the individual extremum.
Step 2: Iterative optimization. Update the velocity and the position of each particle according to Eq. (12).
Step 3: Evaluate the particle fitness. Compare the fitness value $F$ of the current position of each particle in the population with its historical optimal value pBest. If $F<$ pBest, then pBest $=F$ ; otherwise pBest remains unchanged.
Step 4: Compare the historical best fitness value pBest of each particle with the population optimal value gBest. If pBest $<$ gBest, then gBest $=$ pBest; otherwise gBest remains unchanged.
Step 5: Obtain the best combination of parameters and build the optimal model. If the end condition is met, stop the iteration; otherwise repeat Steps 1 to 5.
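The loop of Algorithm 2 can be sketched end to end as follows. Several parts are assumptions made for illustration: the fitness is treated as an error to be minimized (consistent with the comparisons $F<$ pBest in Steps 3 and 4), the search runs over $\log_{10}C$ and $\log_{10}g$, positions are clipped to the search range, and `toy_error` is a hypothetical stand-in for the cross-validation error of an SVM trained with the given parameters. The population size, iteration count, $c_{1}$, $c_{2}$ and $\omega$ follow Table 2.

```python
import random

def pso_search(fitness, bounds, n_particles=20, iterations=50,
               w=1.0, c1=2.0, c2=2.0, seed=0):
    """Minimize `fitness` over the box `bounds`; returns (best_point, best_value)."""
    rng = random.Random(seed)
    dims = len(bounds)
    xs = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vs = [[0.0] * dims for _ in range(n_particles)]
    p_best = [list(x) for x in xs]
    p_val = [fitness(x) for x in xs]
    g_idx = p_val.index(min(p_val))
    g_best, g_val = list(p_best[g_idx]), p_val[g_idx]
    for _ in range(iterations):
        for i in range(n_particles):
            for j, (lo, hi) in enumerate(bounds):
                r1, r2 = rng.random(), rng.random()
                # Velocity and position update of Eq. (12).
                vs[i][j] = (w * vs[i][j]
                            + c1 * r1 * (p_best[i][j] - xs[i][j])
                            + c2 * r2 * (g_best[j] - xs[i][j]))
                xs[i][j] = min(max(xs[i][j] + vs[i][j], lo), hi)
            f = fitness(xs[i])
            if f < p_val[i]:               # Step 3: update pBest
                p_best[i], p_val[i] = list(xs[i]), f
                if f < g_val:              # Step 4: update gBest
                    g_best, g_val = list(xs[i]), f
    return g_best, g_val

def toy_error(point):
    """Hypothetical error surface with its minimum at C = 10, g = 0.01."""
    log_c, log_g = point
    return (log_c - 1.0) ** 2 + (log_g + 2.0) ** 2

best_point, best_error = pso_search(toy_error, bounds=[(-2.0, 3.0), (-4.0, 1.0)])
C, g = 10 ** best_point[0], 10 ** best_point[1]
```

In an actual PSO-SVM run, `toy_error` would be replaced by a function that trains an SVM with the candidate C and g and returns its validation error.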

4. Modeling

Algorithm 3: GA-SVM
Input: Dataset X, number of generations $N$ , crossover probability $p_{c}$ , and mutation probability $p_{m}$ .
Output: Credit card fraud results
Step 1: Calculate the fitness values, select individuals by the roulette method, apply crossover and mutation, and obtain the optimized fitness values.
Step 2: Check the number of generations and evaluate the fitness. If the generation limit or the target fitness value is reached, output C and g; otherwise, go to Step 1.
Step 3: Produce a new population. Assign values to C and g and calculate the fitness of each individual after training.
Step 4: If the stopping condition for training is reached, output C and g and exit the optimization program; otherwise, go to Step 2.
Step 5: Determine whether the cross-validation accuracy meets the set condition. If so, proceed to the next step; otherwise, go to Step 1.

Figure 1.

Optimization principle.

In order to build an effective SVM model, the parameters (C and g) need to be selected in advance [19]. Determining C requires a trade-off between training error and model complexity. C is the penalty coefficient, which can be understood as the relative weight of the two objectives in the optimization (margin size and classification accuracy), that is, the tolerance for errors. The larger C is, the less errors are tolerated: the prediction accuracy on the training samples is higher, but over-fitting occurs easily. The smaller C is, the more easily the model under-fits. If C is too large or too small, the generalization ability deteriorates. g is the parameter of the RBF kernel function; it implicitly determines the distribution of the data after mapping to the new feature space. The larger the g value, the fewer the support vectors; the smaller the g value, the more the support vectors. The number of support vectors affects the speed of training and prediction. Therefore, the parameters (C and g) have a great impact on the efficiency and generalization of the SVM model. At present, parameter selection lacks mature theoretical guidance and is mainly based on experience. Grid search is the most commonly used method of searching for parameters, but it is time-consuming and does not always work well. Therefore, we choose three optimization algorithms, CS, GA, and PSO, to find the optimal parameters of the SVM model. The optimization principle is shown in Fig. 1.
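For comparison, the grid search baseline mentioned above can be sketched as an exhaustive loop over candidate values of C and g. The power-of-two grids are a commonly used choice and are an assumption here, not the paper's setting; `toy_accuracy` is a hypothetical stand-in for the cross-validation accuracy of an SVM.

```python
import itertools
import math

def grid_search(fitness, c_values, g_values):
    """Evaluate every (C, g) pair; returns ((score, C, g), n_evaluations)."""
    best = None
    evaluations = 0
    for c, g in itertools.product(c_values, g_values):
        score = fitness(c, g)
        evaluations += 1
        if best is None or score > best[0]:
            best = (score, c, g)
    return best, evaluations

def toy_accuracy(c, g):
    """Hypothetical score surface peaking at C = 2**3, g = 2**-5."""
    return -((math.log2(c) - 3) ** 2 + (math.log2(g) + 5) ** 2)

c_grid = [2.0 ** e for e in range(-5, 16, 2)]    # 11 candidate values of C
g_grid = [2.0 ** e for e in range(-15, 4, 2)]    # 10 candidate values of g
best, n_evaluations = grid_search(toy_accuracy, c_grid, g_grid)
```

The number of SVM trainings is the product of the two grid sizes, so refining the grid quickly becomes expensive, and the best point found can never lie between grid values. Both limitations motivate the search-based alternatives compared in this paper.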

5. Results and discussion

5.1 Preparing data for models

After organization, the data set contains 514 records, each with 30 attributes. Since the data were collected from a law enforcement department in China, they are subject to privacy restrictions. The data set is unbalanced. The 30 attributes are denoted V1, V2, …, V30. The data were randomly shuffled; 70% of the records (360 in total) were then selected as the training set and the remaining 30% (154 in total) as the test set. To verify the validity of the above models and the effectiveness of the algorithms, the CS-SVM, GA-SVM and PSO-SVM programs were written in MATLAB R2017 and run on the same computer. Experimental environment: Windows 10, Intel(R) Core(TM) i5-5200U CPU at 2.20 GHz, 4 GB of memory. Table 2 lists the model parameters.
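The shuffle and 70/30 split described above can be reproduced in outline as follows. This is a sketch in Python rather than the MATLAB code used in the study; the function name and the fixed seed are assumptions, and integer indices stand in for the 514 private records.

```python
import random

def shuffle_split(records, train_fraction=0.7, seed=0):
    """Shuffle the records and split them into training and test sets."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_train = round(train_fraction * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]

records = list(range(514))            # placeholder for the 514 transactions
train_set, test_set = shuffle_split(records)
```

With 514 records, rounding 70% gives 360 training samples and leaves 154 test samples, matching the sizes reported above.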

Table 2
Model parameters

Variable	Values
Maximum iteration	50
Number of experiments	10
Population size	20
Crossover probability $p_{c}$	0.9
Mutation probability $p_{m}$	0.01
Learning factor $c_{1},c_{2}$	2
Inertia weight	$1\leqslant\omega\leqslant 2$
$p_{a}$	0.25

5.2 Evaluation measures

Evaluation metrics include accuracy, precision, recall and F-Measure. They are derived from the confusion matrix, which summarizes the classification results and is part of model evaluation. The confusion matrix is described below [1].

The confusion matrix consists of the following measures:

True Positive (TP): A test result that detects the condition correctly when the condition is present. True Negative (TN): A test result that does not detect the condition when the condition is absent. False Positive (FP): A test result that detects the condition when the condition is absent. False Negative (FN): A test result that does not detect the condition when the condition is present.

The various evaluation measures are defined as follows:

Accuracy: It is the number of correct predictions made divided by the total number of predictions made.

$\displaystyle\text{Accuracy}=\frac{\text{TP}+\text{TN}}{\text{TP}+\text{FN}+\text{FP}+\text{TN}}$ (19)

Precision: It is the number of correct positive predictions divided by the total number of positive predictions made.

$\displaystyle\text{Precision}=\frac{\text{TP}}{\text{TP}+\text{FP}}$ (20)

Recall: It is the number of correct positive predictions divided by the number of positive class values in the test data.

$\displaystyle\text{Recall}=\frac{\text{TP}}{\text{TP}+\text{FN}}$ (21)

F-Measure: The F-Measure conveys the balance between the precision and the recall.

$\displaystyle\text{F-Measure}=\frac{2\times\text{Precision}\times\text{Recall}}{\text{Precision}+\text{Recall}}$ (22)

ROC curve: The receiver operating characteristic (ROC) curve plots the true positive rate (hit rate) on the vertical axis against the false positive rate on the horizontal axis as the decision criterion varies. AUC is the area under the ROC curve and usually lies between 0.5 and 1. For a perfect classifier, the AUC is 1. AUC serves as a single numerical value for evaluating the quality of a classifier: the larger the AUC, the better the classification performance.
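The measures in Eqs. (19) to (22) can be computed from the four confusion-matrix counts as sketched below. The counts in the usage line are not reported in the paper: they are an assumption, chosen as one set of values on a 154-sample test set that reproduces the CS-SVM and PSO-SVM percentages in Table 4. The function name is hypothetical.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F-Measure from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f_measure": f_measure}

# Assumed counts (not from the paper): 51 fraud cases, 3 missed, no false alarms.
metrics = classification_metrics(tp=48, tn=103, fp=0, fn=3)
percentages = {name: round(100 * value, 2) for name, value in metrics.items()}
```

On an unbalanced data set, accuracy alone can look high even when many fraud cases are missed, which is why recall and the F-Measure are reported alongside it.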

5.3 Kernel function evaluation

The four kernel functions are tested separately, and their classification accuracy and ROC curves are compared to find the most suitable kernel function. Figure 2a shows the classification accuracy of the four kernel functions, and Fig. 2b shows the corresponding ROC curves. It can be observed that the Linear kernel function has the highest accuracy, 91.56%, and an AUC of 0.98718, which is the closest to 1 among the four kernel functions. Therefore, the Linear kernel function performs best.

Figure 2.

The classification accuracy and ROC curve corresponding to the four kernel functions.

5.4 Improved SVM evaluation

The CS, GA, and PSO algorithms are applied to each of the four kernel functions to optimize the parameters, and the resulting classification accuracies are compared. A kernel function that classifies well before optimization does not necessarily perform best after optimization. For instance, the accuracy of the Linear kernel function is 91.56% before optimization and rises to 93.51% after optimization. The accuracy of the Radial basis function is 42.86% before optimization and 98.05% after optimization, which is the best value among the four optimized kernel functions and a relative increase of about 129%, since (98.05 - 42.86)/42.86 is approximately 1.29. The performance comparison of the different kernel functions after optimization is shown in Table 3.

Table 3
Comparison of different types of kernel functions after optimization

Kernel function	CS-SVM	GA-SVM	PSO-SVM
Linear kernel function	93.51%	93.51%	93.51%
Polynomial kernel function	92.86%	94.81%	94.16%
Radial basis function	98.05%	98.05%	98.05%
Sigmoid kernel function	64.29%	64.29%	92.21%

Table 4

Optimization algorithm evaluation data

Models	Accuracy	Precision	Recall	F-Measure	Time
CS-SVM	98.05%	100%	94.12%	96.97%	45.6s
GA-SVM	98.05%	98%	96.08%	97.03%	22.8s
PSO-SVM	98.05%	100%	94.12%	96.97%	20.6s

Figure 3.

Fitness curve.

Figure 4.

Fitness function comparison.

According to the optimization conclusions of different kernel functions in Table 3, the Radial basis function is selected as the kernel function, and the classification performance of the three optimization algorithms is evaluated in detail. Figure 3 presents the Fitness curve for different optimization algorithms. Figure 4 is the fitness function comparison. GA-SVM can find the best fitness in the 5th generation, CS-SVM can find the best fitness in the 15th generation, and PSO-SVM can find the best fitness in the 25th generation. But comparing the best fitness of the three optimization algorithms, we can find that PSO-SVM is the best of the three optimization algorithms.

Table 4 and Fig. 5 present the evaluation data for the optimization algorithms. CS-SVM, GA-SVM, and PSO-SVM all reach the same accuracy of 98.05%. In terms of Precision, CS-SVM and PSO-SVM both reach 100%, while GA-SVM reaches 98%. In terms of Recall, CS-SVM and PSO-SVM both reach 94.12%, while GA-SVM reaches 96.08%. The F-Measure follows the same pattern: 96.97% for CS-SVM and PSO-SVM, and 97.03% for GA-SVM. In terms of the ROC curve, the AUC value is 0.9969 for CS-SVM, 0.996 for GA-SVM, and 0.9967 for PSO-SVM. The AUC values are very close, which indicates that the three optimization algorithms have very similar classification performance. However, the running times differ conspicuously: 45.6 seconds for CS-SVM, 22.8 seconds for GA-SVM, and 20.6 seconds for PSO-SVM. PSO-SVM is therefore the most efficient in operation. Considered comprehensively, PSO-SVM is the best algorithm in terms of classification efficiency.
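
The F-Measure column of Table 4 can be reproduced from the Precision and Recall columns, assuming the usual definition of the F-Measure as the harmonic mean of the two. The sketch below uses the rounded values printed in Table 4, so agreement is expected only to two decimal places; the function and variable names are illustrative.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (the F1 score)."""
    return 2 * precision * recall / (precision + recall)

# (Precision, Recall) as fractions, taken from Table 4.
table4 = {
    "CS-SVM": (1.00, 0.9412),
    "GA-SVM": (0.98, 0.9608),
    "PSO-SVM": (1.00, 0.9412),
}

f_scores = {model: 100 * f_measure(p, r) for model, (p, r) in table4.items()}

for model, score in f_scores.items():
    print(f"{model}: F-Measure = {score:.2f}%")
```

GA-SVM trades a small loss in Precision for a gain in Recall, which gives it a marginally higher F-Measure than the other two models even though all three share the same accuracy.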

Figure 5.

Optimization algorithm evaluation.

6. Conclusion

In this paper, we take real credit card transaction data as an example. We first find the kernel function best suited to the transaction data, and then apply the cuckoo search algorithm, genetic algorithm, and particle swarm optimization algorithm to optimize the parameters of the support vector machine. Before optimization, the Linear kernel function is the best kernel function, with an accuracy of 91.56%. Optimization raises its accuracy to 93.51%, which is not the highest accuracy obtained. In contrast, optimizing the Radial basis function improves its accuracy from 42.86% to 98.05%, the highest accuracy. Compared with CS-SVM and GA-SVM, PSO-SVM reaches the same highest accuracy with the shortest running time. Thus, PSO-SVM demonstrates the best overall performance.

Data from different sources or with different structures require a suitable kernel function, and finding the optimal kernel function is the first step. Sometimes it is enough to find a kernel function that meets the requirements, without further optimization. If the chosen kernel function cannot meet the requirements, optimization algorithms such as CS, GA, and PSO can be applied. Our comparative study shows that PSO-SVM is the optimal algorithm for this data set. In the future, we will continue to look for new algorithms to optimize SVM and further improve its classification performance.

Acknowledgments

This work is supported by the National Key R&D Program of China (No. 2020YFC1522603), National Natural Science Foundation of China (71904194) and Beijing Advanced Discipline Construction Project of National Security Science, People’s Public Security University of China.
