An advancement in AdaSyn for imbalanced learning: An application to fraud detection in digital transactions

Abstract

Imbalanced learning is a significant issue in machine learning that degrades the performance and accuracy of binary and multi-class classification algorithms, especially in large-scale data handling and classification. Popular techniques for converting imbalanced data into balanced data include undersampling, undersampling with Tomek links, random oversampling, the synthetic minority oversampling technique (SMOTE), and adaptive synthetic generation (ADASYN). The ADASYN algorithm generates additional minority sample points to improve the imbalance ratio between majority and minority sample points, but in some cases the generated points conflict with decision boundary points and noisy points. This paper proposes a Refitted AdaSyn Algorithm (RAA) with Gaussian Distribution (GD), so that new minority samples are distributed much closer to the center of the minority samples to address these conflicts. Classification accuracy improves with RAA over standard ADASYN. The proposed work is examined on the imbalanced benchmark datasets European, BankSim, Payment card, and UCI credit card. A vanilla Generative Adversarial Network (GAN), a deep learning model, is used to classify fraud and non-fraud transactions; it demonstrates significant differences between balanced and imbalanced learning approaches and achieves an accuracy of 97.5% on dataset DS4.

Keywords

Imbalanced learning; synthetic minority oversampling technique; adaptive synthetic; refitted Adasyn algorithm (RAA)

1 Introduction

Nowadays, with the increasing use of the internet [1] and smart devices, the number of digital transactions has risen sharply. During the COVID-19 pandemic, governments and companies promoted digital transactions as the safer option, which increased their volume exponentially. The growth in digital payments has given digital fraudsters more opportunity. In India, FY 2021-22 saw 83,638 cases of banking fraud involving Rs 1.38 lakh crore, of which only Rs 1,031.31 crore has been recovered so far [2]. Most attempted fraudulent digital transactions in India are suspected to occur in Mumbai, Delhi, and Chennai. Worldwide, fraud in the banking sector is estimated to reach 71.7% in 2025.

The average time taken to detect a digital transaction fraud from the date of occurrence was 2 years. If the value of the fraud exceeds 100 crore, detection might take 4 years from the date of occurrence. With a greater emphasis on financial inclusion and customer safety, and in light of the recent surge in customer complaints about unauthorized transactions resulting in debits to their cards, the criteria for determining customer liability in these circumstances have been reviewed. Bank systems and procedures must be designed to make customers feel secure when conducting electronic banking transactions. To achieve this, banks are required to put in place appropriate procedures and systems that ensure the safety and security of digital banking transactions [3]. With recent advances in technology, many robust fraud detection mechanisms rely on data mining, machine learning, and deep learning algorithms that work on massive amounts of data [4]. Researchers continue to develop innovative methodologies to overcome fraud problems while also focusing on data availability. Good fraud prediction depends on model learning, which requires massive amounts of input data.

Imbalanced learning is a critical issue today [5] because most machine learning and deep learning algorithms assume that the classes are equally distributed. When there is a class imbalance, the classifier is biased towards the majority class, which results in misclassification of the minority class. The model is then unable to judge correctly whether a transaction is abnormal or normal. Our proposed model therefore concentrates on balancing the majority and minority sample data points. For distinct applications, researchers have proposed suitable approaches for imbalanced learning [6] to solve single-class and multi-class classification problems [7]. Different authors have addressed the imbalance problem from various angles, such as overlapping data, overfitting, noisy sample reduction, weighted entropy functions, evaluation metrics, appropriate weights in deep learning models, and cost-function-based approaches.

The proposed method is a renovated ADASYN algorithm that uses a Normal Distribution (ND) to balance the ratio between majority and minority samples. This work explores how to overcome the problems of noisy samples and of conflicts at decision boundary points that are more relevant to the majority samples. Here, the training dataset is imbalanced because of a lack of minority samples in the region. To increase the importance of the region of the decision boundary related to the minority class, the RAA generates synthetic instances for the minority class to balance the training dataset.

This work is organized in three stages, as shown in Fig. 1: a) hyper feature selection, b) RAA with GD, and c) classification.

Fig. 1

MLP: Overall framework of model.

Feature Selection: Feature selection extracts the essential features from the set of variables and also reduces the dimensionality of the feature space. The FireFly Feature Selection (FFS) method [8] is adopted as the feature selection technique; it is based on light intensity and information theory. Well-chosen features generally yield good performance.

RAA: This is the modified ADASYN algorithm, which uses the GD to better fit the distribution of minority samples in the region. The GD governs synthetic data generation (for the minority samples) through the mean and Standard Deviation (SD). The result is a well-balanced distribution of majority and minority samples, which is needed for classification. A balanced class distribution tends to improve the results of machine learning and deep learning algorithms.

Classification: Classification plays a vital role in assessing the performance of any model. It is a predictive modelling problem in which a class label is predicted for a given example of input data. Here, the process is performed on a given set of transactional data to determine whether each transaction is fraudulent or non-fraudulent.

This paper is organized as follows: Sect. 2 reviews related work and imbalanced learning techniques; Sect. 3 describes the related existing methods; Sect. 4 describes the proposed model RAA; Sect. 5 presents the evaluation metrics; Sect. 6 describes the results and compares them with other results; and Sect. 7 discusses the conclusion.

2 Literature survey

Imbalanced datasets arise for various reasons. In general, the imbalance is caused by incomplete data collection due to privacy-related concerns. Studies have begun to analyze specific characteristics of imbalanced data in order to solve the imbalance problem on real data. This section discusses some of these contributions. Learning from disproportionate data is a vital challenge for digital fraud prediction tasks. When data are of heterogeneous types, many authors have devoted themselves to designing effective data fusion approaches and proposing new algorithms.

C. Fabrizio et al. [9] proposed Stochastic Semi-Supervised Learning (SSSL), a method that adds a number of majority non-fraud samples to the training set. They achieved this with oversampling techniques and focused on the majority class (as the target class) rather than the minority class. To assess how SSSL compares with conventional oversampling, the authors considered two oversampling approaches, Random Oversampling (ROS) and SMOTE. G. Anjana et al. [10] compared different imbalanced learning techniques, namely SMOTE, ADASYN, Borderline-SMOTE, and Safe-Level SMOTE. They combined Borderline-SMOTE with the Tomek links technique, which reduces noisy data. Borderline-SMOTE was used to generate the synthesized positive samples (minority samples).

Rafiq et al. [11] applied the SMOTE approach to improve the balancing strategy for the binary classification problem on a credit card fraud dataset.

Sara Makki et al. [12] conducted an experimental study of approaches that handle the imbalanced classification problem. They demonstrated the weaknesses caused by imbalanced datasets and summarized results obtained on fraud-labeled imbalanced data. The paper showed that existing approaches produce a large number of false alarms, which are costly to financial institutions. W. W. Soh et al. [13] proposed a new oversampling-based preprocessing method called SAS, applied to imbalanced problems in which the bias introduced by the imbalance leads to incorrect predictions. Various data mining models were used to classify fraud and non-fraud data. S. Bagga et al. [14] took the highly skewed European dataset and tested an approach called pipelining combined with ensemble learning, which achieved good accuracy in comparison with supervised algorithms such as Logistic Regression (LR), K Nearest Neighbor (KNN), Random Forest (RF), Naive Bayes (NB), Multilayer Perceptron (MLP), AdaBoost, and ensemble learning.

L. Samorjit et al. [15] proposed GANs as a new way to generate artificial data and discussed the various kinds of GAN approaches available. The authors considered Wasserstein GAN with Gradient Penalty and Conditional GAN for credit card fraud detection. The GAN approach performed well both in balancing the data and in classification. A. Puri and M. K. Gupta [5] presented a new hybrid approach, KMeans-SMOTE-ENN, to tackle the noisy class imbalance problem. It proceeds in three steps: a) initial clustering, b) SMOTE applied within each cluster, and c) noise removal. In [5], KMeans-SMOTE is used as the oversampling technique and a modified Edited Nearest Neighbour (ENN) rule is used for noise removal. They tested the approach on different binary imbalanced data sets.

Kyoungok Kim [16] introduced a new technique called noise-avoidance SMOTE, which is efficient for imbalanced learning. The modified SMOTE was used with the ensemble methods NASBoost and NASBagging. SMOTE variants such as Borderline-SMOTE and Safe-Level-SMOTE address imbalanced learning by identifying noise samples in the minority class, which are then excluded from sampling. The experimental results showed an improvement when this ensemble process was applied to 16 different data sets. G. A. Pradipta [17] modified SMOTE with a safe radius parameter used for distance measurement. This method achieved good results on 5 of the 16 data sets examined. Here, the synthetic data are generated based on the radius calculated between minority samples. The proposed method divides samples into the categories safe and noise, and new synthetic data are generated from the safe category. The Euclidean distance formula is used to calculate the distance between a selected minority sample and all majority samples.

S. Wang [18] proposed an improved SMOTE based on the ND for imbalanced learning. Here, the minority samples are distributed towards the center point. With higher probability, the algorithm avoids marginalization of the expanded data. RF was used for classification. The experimental results showed a better classification rate. The Out of Bag (OOB) error was also considered, together with other metrics such as true positive rate and false positive rate. The method was applied to benchmark data sets such as Pima, WPDC, WDBC, Ionosphere, and Breast Cancer Wisconsin.

H. Zhu et al. [20] used Weighted Extreme Learning Machines (WELM) with a dandelion algorithm with probability-based mutation to solve the problem of imbalanced classification. They compared it with various intelligent optimization methods, such as the genetic algorithm, particle swarm optimization, and the bat algorithm, on 14 imbalanced datasets.

F. Itoo et al. [21] classified the fraud and non-fraud transactions of a credit card dataset by dividing the data set into 3 different ratios and compared the empirical results. They applied Random Under Sampling (RUS) and checked whether the method improved performance. LR, KNN, and NB models were used for testing. A. Singh [22] provided a comparative study of different supervised models on measures such as accuracy, error, and F1 score, and explained popular imbalanced learning techniques.

B. Baesens et al. [23] developed a robust and novel method for the imbalance problem called robROSE, a more robust ROSE algorithm that does not oversample minority outliers and takes the covariance structure of the data into consideration. Robust statistics is a useful tool for anomaly detection, that is, for detecting outliers. Outliers are samples that deviate from the bulk of the minority and majority samples. They applied this technique to customer churn prediction.

Z. Li et al. [24] addressed the problem of class imbalance with overlapping samples. They used a hybrid method called Dynamic Weighted Entropy (DWE), described in two steps following a divide-and-conquer scheme. They first identified anomalies in the overlapping structures (both majority and minority) and subsequently trained the classifier on the overlapping subset.

E. B. Fatima et al. [25] considered class-imbalanced data, a major issue faced in many different areas. They presented a solution using three feature selection algorithms, namely Reduce Overlapping with No-Sampling, Reduce Overlapping with SMOTE, and Reduce Overlapping with ADASYN. The simulation results showed that the false discovery rate was minimized.

Hadeel Ahmed et al. [26] proposed a new undersampling framework for imbalanced data that combines fuzzy c-means clustering with Similarity Based Selection (SBS). After clustering, SBS selects and combines the instances that have similar features according to the desired ratios 25:75, 50:50, and 34:66. The balanced training data are then given to ML algorithms such as ANN, LR, KNN, and NB to measure accuracy.

3 Description of related existing methods

3.1 Balanced technique: ADASYN

ADASYN stands for Adaptive Synthetic Minority Oversampling Technique. It is an adaptive method that facilitates learning from imbalanced data sets and is inspired by the success of synthetic approaches such as SMOTE, SMOTEBoost, and DataBoostIM. Generating synthetic data samples based on minority class density distributions, as shown in Fig. 2, improves the decision boundaries of the original data. In Fig. 2, filled circles in blue, red, and yellow denote majority (non-fraud), minority (fraud), and synthetic fraud samples, respectively. Minority (fraud) samples are labeled M_R1, M_R2, …, M_R10, and synthetic minority samples are labeled P_1, P_2, …, P_10. The balancing model equalizes the numbers of fraud and non-fraud samples.

Fig. 2

Pictorial representation of majority vs. minority vs. synthetic samples.

The density distribution function determines the number of artificial samples that need to be generated for each fraud sample. The goal is twofold: reducing the learning bias introduced by the original data set, and learning adaptively. Algorithm 1 describes the ADASYN algorithm [27] for the binary classification problem. The number of synthetic minority samples to generate is calculated by Equation (1), where M_J is the number of majority samples, M_R is the number of minority samples, and G is the number of synthetic samples. If β = 1, the resulting dataset is fully balanced.

$G = (M_{J} - M_{R}) \times β$ (1)
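As a small worked illustration of Equation (1), the sketch below computes G for the DS4 class counts listed in Table 1 (Sect. 4.1). The function name `synthetic_sample_count` and the check that β lies in [0, 1] are assumptions made for this example only.

```python
def synthetic_sample_count(n_majority: int, n_minority: int, beta: float = 1.0) -> int:
    """Number of synthetic minority samples, G = (M_J - M_R) * beta."""
    # Assumption: beta is restricted to [0, 1], with 1 meaning fully balanced.
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta is expected to lie in [0, 1]")
    return int(round((n_majority - n_minority) * beta))


# Class counts of DS4 (UCI Credit card): 23364 non-frauds, 6636 frauds.
g_full = synthetic_sample_count(23364, 6636, beta=1.0)  # fully balanced
g_half = synthetic_sample_count(23364, 6636, beta=0.5)  # half of the gap closed
print(g_full, g_half)
```

With β = 1 the minority class grows from 6636 to 23364 samples, which matches the majority count.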

Algorithm 1 illustrates how synthetic data are generated based on the density distribution. However, the algorithm may also create more border points that are difficult to classify, in which case the prediction does not meet expectations in discovering fraudulent transactions.

Algorithm 1 Adaptive Synthetic Generation (ADASYN):

Choose the number of nearest neighbors k and the parameter β, which denotes the class balance level after synthetic data generation.

1. Let M_J denote the number of non-fraud samples (majority class) and M_R the number of fraud samples (minority class). Calculate $G = (M_{J} - M_{R}) \times β$.

2. Let C_i, where i = 1, …, M_R, denote the observations belonging to the minority class, and let A denote the set of all C_i.

3. Compute the Euclidean distance between C_i and all other observations, and find the k nearest neighbors (kNN) of C_i.

4. Let S_ik denote the set of kNN of C_i.

5. Determine δ_i, the number of majority-class observations in the kNN region of C_i, and calculate the ratio $S_{i} = δ_{i} / k$, where i = 1, …, M_R.

6. Normalize S_i as $\hat{S}_{i} = S_{i} / \sum_{i = 1}^{M_{R}} S_{i}$, so that $\hat{S}_{i}$ is a probability (density) function.

7. Calculate $g_{i} = \hat{S}_{i} \times G$, the number of artificial observations to be produced for each C_i.

8. Randomly generate g_i synthetic samples, denoted C_ij with j = 1, …, g_i, from S_ik with replacement.

9. Stop the algorithm.
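The steps of Algorithm 1 can be sketched in NumPy as follows. This is a minimal illustration and not the implementation used in the experiments. It assumes, as in the original ADASYN formulation [27], that δ_i counts majority-class neighbours and that each synthetic point is a random interpolation between C_i and one of its minority neighbours. The function name `adasyn_sketch` and its parameters are hypothetical.

```python
import numpy as np


def adasyn_sketch(X, y, k=5, beta=1.0, minority_label=1, seed=0):
    """Generate synthetic minority samples along the lines of Algorithm 1."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    X_min = X[y == minority_label]
    m_r = len(X_min)
    m_j = len(X) - m_r
    if m_r < 2:
        raise ValueError("need at least two minority samples")
    G = int(round((m_j - m_r) * beta))           # Step 1, Equation (1)

    # Steps 3-5: share of majority points among the k nearest neighbours.
    ratios = np.zeros(m_r)
    for i, c in enumerate(X_min):
        dist = np.linalg.norm(X - c, axis=1)
        nn = np.argsort(dist)[1:k + 1]           # position 0 is the point itself
        ratios[i] = np.sum(y[nn] != minority_label) / k

    empty = np.empty((0, X.shape[1]))
    if G <= 0 or ratios.sum() == 0:
        return empty                             # nothing to generate
    density = ratios / ratios.sum()              # Step 6
    counts = np.rint(density * G).astype(int)    # Step 7 (rounded per sample)

    # Step 8 (assumed form): interpolate towards random minority neighbours,
    # drawn with replacement.
    k_min = min(k, m_r - 1)
    synthetic = []
    for i, c in enumerate(X_min):
        if counts[i] == 0:
            continue
        dist = np.linalg.norm(X_min - c, axis=1)
        nn = np.argsort(dist)[1:k_min + 1]
        picks = rng.choice(nn, size=counts[i], replace=True)
        lam = rng.random((counts[i], 1))         # interpolation weights in [0, 1)
        synthetic.append(c + lam * (X_min[picks] - c))
    return np.vstack(synthetic) if synthetic else empty
```

Because the per-sample counts are rounded, the total number of generated points can differ slightly from G. Minority points surrounded mostly by majority points receive the largest share, which is the behaviour that can produce the hard border points discussed above.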

3.2 Classifier: Generative adversarial networks (GAN’s)

GANs are one family of deep generative models. This model works well with normally distributed (ND) data. GANs take a different approach to the high-dimensional sampling problem: the idea is to sample from a simple tractable distribution (z ∼ N(0,1)) and then learn a complex transformation from this to the training distribution. In other words, take a sample Z ∼ N(0, 1) and learn a series of complex transformations on it, so that the output looks as if it came from the training distribution. Such a complex transformation of a normal distribution is learned by a neural network, namely the GAN [28]. It is trained as a two-player game between a Generator and a Discriminator.

Generator (G): The generator takes as input Z, which is normally distributed. It produces transaction data that look so natural that the discriminator believes the transactional data come from the real data distribution. The generator is parameterized by φ and written G_φ; it can be a feed-forward neural network or a convolutional neural network (CNN). It is a neural network that takes a noise vector z ∼ N(0,1) as input and produces the output shown in Equation (2).

$G_{φ} (Z) = X$ (2)

Discriminator (D): The discriminator distinguishes between original and fake transactional data. It is parameterized by θ and written D_θ. It is a neural network that takes as input a real X or a generated X = G_φ (Z) and classifies the input as real or fake; for a generated input, its score is given by Equation (3).

$D_{θ} (G_{φ} (Z))$ (3)

Given a sample generated by G as G_φ (Z), the discriminator assigns it a score D_θ (G_φ (Z)). The goal of G is to minimize the expected value of log(1 - D_θ (G_φ (Z))) over all possible values of Z, that is, to make generated data score highly. D has to assign a high score to real data and a low score to generated data. Combining the objective functions of G and D yields the minimax game in Equation (4).

$\min_{φ} \max_{θ} [E_{x \sim P_{data}} \log D_{θ} (x) + E_{z \sim P (z)} \log (1 - D_{θ} (G_{φ} (z)))]$ (4)
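To make the objective in Equation (4) concrete, the following sketch estimates its value from a batch of discriminator scores. It is an illustration only: the function name `gan_value` and the clipping constant `eps` are assumptions, and no network is trained here.

```python
import numpy as np


def gan_value(d_real, d_fake, eps=1e-12):
    """Monte Carlo estimate of E[log D(x)] + E[log(1 - D(G(z)))].

    d_real: discriminator scores in (0, 1) for real transactions.
    d_fake: discriminator scores in (0, 1) for generated transactions.
    """
    # Clipping avoids log(0); eps is an arbitrary small constant.
    d_real = np.clip(np.asarray(d_real, dtype=float), eps, 1.0 - eps)
    d_fake = np.clip(np.asarray(d_fake, dtype=float), eps, 1.0 - eps)
    return float(np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake)))


# An undecided discriminator scores everything 0.5, giving 2 * log(0.5).
undecided = gan_value([0.5, 0.5], [0.5, 0.5])
# A discriminator that separates real from generated data gets a higher value.
confident = gan_value([0.9, 0.95], [0.1, 0.05])
print(undecided, confident)
```

The discriminator tries to raise this value while the generator tries to lower it by pushing the scores of generated data towards 1.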

4 Proposed method

Imbalanced learning is a major challenge in many applications and has a large impact on fraud detection, especially for digital transactions. The MLP architecture combines DPP, RAA, and GAN, as shown in Fig. 1. MLP is described as learning about learning. The preprocessed information is given as input to the DPP, which is explained in detail in Sect. 4.2. The output of the DPP is fed to the RAA, which produces a balanced proportion of majority and minority samples by generating synthesized data. This balanced data set is forwarded to the GAN for classification. The performance of the RAA can be observed from the classification result.

4.1 DataSet description

The digital transaction fraud detection datasets from the Kaggle and UCI repositories consist of anonymized credit card and payment transactions labeled as fraudulent or genuine. Four imbalanced datasets, DS1 [29] (European), DS2 [30] (BankSim), DS3 [32] (Payment Type), and DS4 [31] (UCI Credit card), are applied to the proposed algorithm to balance them and improve the classification rate. All of these skewed datasets are highly imbalanced. Table 1 describes the skewed datasets and the ratios of majority to minority samples. Majority samples are non-fraud transactions and minority samples are fraud transactions.

Table 1
Description of the skewed datasets and the ratios of majority to minority samples

| Dataset | #Samples | #Features | #Reduced features | #Non-frauds | #Frauds | Ratio |
|---|---|---|---|---|---|---|
| DS1: European Dataset | 284807 | 31 | 10 | 284315 | 492 | 577.88 |
| DS2: BankSim Dataset | 594643 | 10 | 8 | 587443 | 7200 | 81.59 |
| DS3: Payment Type Dataset | 1048573 | 10 | 8 | 1047432 | 1141 | 917.99 |
| DS4: UCI Credit card | 30000 | 25 | 11 | 23364 | 6636 | 3.52 |

The European data set contains 284807 observations with 31 features, including the target feature, of which only 492 are fraud transactions; the ratio between non-fraud and fraud transactions is 577.88. The Kaggle BankSim data set contains 594643 observations, of which 7200 are considered fraud, giving a ratio of 81.59. The Payment Type data set contains 1048573 observations with 10 feature variables, of which 1141 are fraud samples. This is a very highly skewed data set with an imbalance ratio of 917.99. The final data set, UCI Credit card, has 25 features and 30000 samples, of which 6636 are fraudulent, giving a ratio of 3.52. Tables 7 to 10 show the various imbalanced learning techniques applied to the different highly imbalanced data sets.
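The imbalance ratios quoted above follow directly from the class counts in Table 1, as the short check below shows. The dictionary `class_counts` is introduced only for this illustration.

```python
# (non-fraud count, fraud count) per dataset, taken from Table 1.
class_counts = {
    "DS1": (284315, 492),
    "DS2": (587443, 7200),
    "DS3": (1047432, 1141),
    "DS4": (23364, 6636),
}

# Imbalance ratio = majority (non-fraud) count / minority (fraud) count.
ratios = {name: round(nf / f, 2) for name, (nf, f) in class_counts.items()}
totals = {name: nf + f for name, (nf, f) in class_counts.items()}

for name in class_counts:
    print(name, totals[name], ratios[name])
```

The totals reproduce the #Samples column, and the ratios reproduce the last column of Table 1.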

4.2 DataPreprocessing (DPP)

The input to any model must be reliable. Data cleaning is applied to the four fraud data sets related to payment transactions, which makes the data more suitable for our model. For the proposed model, preprocessing first handles missing values and then removes Tomek links [6] to eliminate noisy points. This technique identifies pairs of nearby majority and minority samples and removes the majority sample of each pair from the region. The process is known as undersampling with Tomek links. The most important remaining step is to reduce the number of features with a suitable feature selection algorithm. For this problem, the FFS technique is used, which works on the basis of light intensity and information theory.

4.3 Balanced technique: Refitted ADASYN algorithm (RAA)

The RAA is a modified ADASYN algorithm. It generates artificial synthetic samples of the fraud class to improve the imbalance ratio. During synthetic data generation, an algorithm such as SMOTE creates bridges between the two classes because fraud and non-fraud observations are embedded in one another.

Such bridges give rise to noisy samples, that is, fraud observations whose neighbours are mostly non-fraud observations. Decision boundary points are fraud observations whose neighbouring distance, measured as in [7], is equal for fraud and non-fraud observations. Meanwhile, SMOTE can work better when the observations are distributed over the region with the help of the ND, which avoids the marginalization problem [18]. It may therefore work well on some imbalanced datasets. The ND is thus recommended over the density distribution for marginalized data.

Algorithm 2 is the modified ADASYN with GD. Here, the mean, median, and mode are considered equal, so that the observations are well distributed. The GD formula is given in Equation (5) [19].

$P (X) = \frac{1}{σ \sqrt{2 π}} \exp [- \frac{1}{2} (\frac{x - μ}{σ})^{2}]$ (5)

The Gaussian function P(X) in Equation (5) equals its normalizing constant $\frac{1}{σ \sqrt{2 π}}$ when the exponential term equals 1, that is, at x = μ. The overall behavior of the function depends on the mean (μ), or expectation, of the distribution, the SD (σ), and the variance (σ²).

Equation (6) normalizes the actual data to avoid errors caused by different feature scales, where $S_{ij}^{'}$ is the normalized value of the i-th sample point under the j-th feature, S_ij is the corresponding value in the original data, and S_jmin and S_jmax are the minimum and maximum values of the j-th feature, respectively.

$S_{ij}^{'} = \frac{S_{ij} - S_{jmin}}{S_{jmax} - S_{jmin}}$ (6)

Thereafter, the center point $x_{Center}^{'}$ of the fraud observations is calculated using Equation (7).

$x_{Center}^{'} = (\frac{1}{n} \sum_{i = 1}^{n} {x^{'}}_{i 1}, \frac{1}{n} \sum_{i = 1}^{n} {x^{'}}_{i 2}, \dots \frac{1}{n} \sum_{i = 1}^{n} {x^{'}}_{ir})$ (7)

where n is the number of samples in the fraud observations and r is the number of features. Then the GD of the n × 1 dimensional normalized minority observations is estimated for each feature.

Let σ₀ denote the SD vector of the normalized minority samples, as shown in Equation (8).

$σ_{0} = (σ_{1}^{0}, σ_{2}^{0}, \dots, σ_{r}^{0})$ (8)

where σ⁰_i (i = 1, 2, 3, …, r) is the SD of the i-th feature. Synthetic data are generated with the interpolation formula shown in Equation (9).

$P_{i} = x_{i}^{'} + f (x) \cdot (x_{Center}^{'} - x_{i}^{'})$ (9)

where P_i (i = 1, 2, 3, …, r) is a newly generated artificial minority observation. According to Equation (9), f(x) is the main control part of the synthetic data generation. When f(x) equals 1, P_i is the center point $x_{Center}^{'}$ of the minority (fraud) samples. If f(x) takes values close to 1 with high probability, the expanded minority samples will lie close to the center point $x_{Center}^{'}$. Let f(x) be a random number that follows the ND with mean μ = 1 and SD σ.

All normal density curves satisfy the following property, which is often referred to as the empirical rule: 68-95-99.7%.

About 68% of the observations lie within one SD of the mean, i.e., between (μ - σ) and (μ + σ).

About 95% of the observations lie within two SDs of the mean, i.e., between (μ - 2σ) and (μ + 2σ).

About 99.7% of the observations lie within three SDs of the mean, i.e., between (μ - 3σ) and (μ + 3σ).

Thus, for a GD, almost all of the values fall within three SDs of the mean.
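The coverage of a normal distribution within k SDs of the mean can be checked numerically. The sketch below uses the identity P(|X - μ| ≤ kσ) = erf(k/√2); the helper name `within_k_sd` is introduced only for this example.

```python
import math


def within_k_sd(k: float) -> float:
    """Probability that a normal variable lies within k SDs of its mean."""
    return math.erf(k / math.sqrt(2.0))


coverage = {k: within_k_sd(k) for k in (1, 2, 3)}
for k, p in coverage.items():
    print(f"within {k} SD: {100 * p:.2f}%")
```

The computed coverage is about 68.27%, 95.45%, and 99.73% for one, two, and three SDs. For the RAA this means that a factor f(x) drawn from a normal distribution with mean 1 and SD σ falls between 1 - 3σ and 1 + 3σ in about 99.7% of draws.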

Algorithm 2 Refitted ADASYN Algorithm (RAA):

Input: Imbalanced dataset

Output: Balanced dataset with fraud and non-fraud transactions.

Initialization: M_J -> number of majority samples, M_R -> number of minority samples

Normalize the original data: $S_{ij}^{'} = \frac{S_{ij} - S_{jmin}}{S_{jmax} - S_{jmin}}$

Calculate the ratio of M_R to M_J: d = M_R / M_J

Calculate the quantity of synthetic data to generate: G = (M_J - M_R) × β, where β = 1

Identify the center point $x_{Center}^{'}$ of the minority samples: $x_{Center}^{'} = (\frac{1}{n} \sum_{i = 1}^{n} x_{i 1}^{'}, \frac{1}{n} \sum_{i = 1}^{n} x_{i 2}^{'}, \dots, \frac{1}{n} \sum_{i = 1}^{n} x_{ir}^{'})$

Estimate the GD of the n × 1 dimensional normalized minority samples under each feature:

$σ_{0} = (σ_{1}^{0}, σ_{2}^{0}, \dots, σ_{r}^{0})$

Synthesize samples based on the interpolation formula:

$P_{i} = x_{i}^{'} + f (x) \cdot (x_{Center}^{'} - x_{i}^{'})$

f(x) · G

Stop the algorithm when the desired balance ratio is reached.
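The steps of Algorithm 2 can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the base sample x'_i is drawn uniformly at random from the minority samples with replacement, and f(x) is drawn independently per feature from a normal distribution with mean 1 and the per-feature SD of Equation (8). The function name `raa_sketch` and its parameters are hypothetical.

```python
import numpy as np


def raa_sketch(X, y, beta=1.0, minority_label=1, seed=0):
    """Balance (X, y) by Gaussian interpolation towards the minority center."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)

    # Min-max normalization, Equation (6); constant features map to 0.
    x_min, x_max = X.min(axis=0), X.max(axis=0)
    span = np.where(x_max > x_min, x_max - x_min, 1.0)
    Xn = (X - x_min) / span

    X_mr = Xn[y == minority_label]
    m_r = len(X_mr)
    m_j = len(Xn) - m_r
    G = int(round((m_j - m_r) * beta))           # Equation (1)
    if G <= 0 or m_r == 0:
        return Xn, y                             # nothing to generate

    center = X_mr.mean(axis=0)                   # Equation (7)
    sigma0 = X_mr.std(axis=0)                    # Equation (8)

    # Assumption: base samples are picked uniformly with replacement.
    base = X_mr[rng.integers(0, m_r, size=G)]
    # Assumption: f(x) ~ N(1, sigma0_j), drawn separately for every feature j.
    f = rng.normal(loc=1.0, scale=sigma0, size=(G, X.shape[1]))
    synthetic = base + f * (center - base)       # Equation (9)

    X_bal = np.vstack([Xn, synthetic])
    y_bal = np.concatenate([y, np.full(G, minority_label, dtype=y.dtype)])
    return X_bal, y_bal
```

Since f(x) is concentrated around 1, each synthetic point lands near the minority center rather than on a line segment between two arbitrary minority points, which is the intended difference from plain ADASYN interpolation. The returned features are on the normalized scale of Equation (6).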

5 Evaluation metrics

In general, a classification model is developed in two phases, a training phase and a testing phase. The trained model is evaluated on the test data in the testing phase in order to assess and improve the performance of the model. The entries of the confusion matrix are used to calculate basic metrics such as accuracy, precision, True Positive Rate (TPR), False Positive Rate (FPR), Likelihood Ratio Positive (LR+), Likelihood Ratio Negative (LR-), and F1 score, as shown in Equations (10) to (16).

5.1 Confusion matrix

The confusion matrix [33] is an important tool for measuring the performance of a classification algorithm; it summarizes what the classification model estimates correctly and what types of errors it makes.

T_POS is the number of samples whose actual and predicted labels are both positive. T_NEG is the number of samples whose actual and predicted labels are both negative. F_POS is the number of samples whose actual label is negative but which are predicted as positive (Type I error). F_NEG is the number of samples whose actual label is positive but which are predicted as negative (Type II error).

LR+ is the probability that a fraudulent transaction tests positive divided by the probability that a non-fraudulent transaction tests positive. LR- is the probability that a fraudulent transaction tests negative divided by the probability that a non-fraudulent transaction tests negative; in Equation (15), FNR = 1 - TPR is the false negative rate and TNR = 1 - FPR is the true negative rate. FMR is the ratio between LR+ and LR-. The F1 score is the harmonic mean of precision and recall. The performance metrics, such as accuracy, precision, TPR, and FPR, are expressed in Equations (10) to (16).

$Accuracy = \frac{T_{POS} + T_{NEG}}{T_{POS} + T_{NEG} + F_{POS} + F_{NEG}}$ (10)

$Precision = \frac{T_{POS}}{T_{POS} + F_{POS}}$ (11)

$TPR\ or\ Recall = \frac{T_{POS}}{T_{POS} + F_{NEG}}$ (12)

$FPR = \frac{F_{POS}}{F_{POS} + T_{NEG}}$ (13)

$LR^{+} = \frac{TPR}{FPR}$ (14)

$LR^{-} = \frac{FNR}{TNR}$ (15)

$F1\ Score = \frac{2 \times Recall \times Precision}{Recall + Precision}$ (16)
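The metrics of this section can be computed from the four confusion matrix counts as sketched below. The sketch uses the standard definition FPR = F_POS / (F_POS + T_NEG), together with FNR = 1 - TPR and TNR = 1 - FPR. The function name `classification_metrics` and the example counts are hypothetical and do not correspond to any result reported in this paper.

```python
def classification_metrics(t_pos: int, t_neg: int, f_pos: int, f_neg: int) -> dict:
    """Compute accuracy, precision, TPR, FPR, LR+, LR- and F1 from the counts.

    Assumes that none of the denominators is zero.
    """
    accuracy = (t_pos + t_neg) / (t_pos + t_neg + f_pos + f_neg)
    precision = t_pos / (t_pos + f_pos)
    tpr = t_pos / (t_pos + f_neg)          # recall
    fpr = f_pos / (f_pos + t_neg)
    fnr = 1.0 - tpr                        # false negative rate
    tnr = 1.0 - fpr                        # true negative rate (specificity)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "tpr": tpr,
        "fpr": fpr,
        "lr_pos": tpr / fpr,
        "lr_neg": fnr / tnr,
        "f1": 2.0 * tpr * precision / (tpr + precision),
    }


# Hypothetical counts: 100 frauds (80 caught), 1000 non-frauds (100 false alarms).
m = classification_metrics(t_pos=80, t_neg=900, f_pos=100, f_neg=20)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

A larger LR+ and a smaller LR- both indicate a classifier that separates fraud from non-fraud transactions better.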

6 Experimental results & discussions

6.1 Results of Part:1

The empirical results are presented in three parts. In the first part, the balanced ratios of majority samples (green) and minority samples (red) are plotted as scatter plots, as shown in Figs. 3 to 6. In the second part, selected classic machine learning and deep learning models, namely SVM, RF [34], Extreme Learning Machine (ELM) [35], CNN [36], and GAN [28], are trained on the original data sets DS1 to DS4 for classification, and their results are compared. The classification process yields predictions in terms of detecting fraudulent transactions. In the third part, the original data samples are balanced with several undersampling and oversampling techniques and with the proposed oversampling technique. Subsequently, LR+ and LR- are used to demonstrate the efficiency of the proposed algorithm; these ratios show how the true positive rate increases and the false positive rate decreases after balancing the dataset. Finally, the accuracy before and after balancing is compared. All experimental results are presented in tables (Tables 2 to 5) and grouped bar graphs.

Fig. 3

DS1: Balanced pictorial representation of majority vs. minority.

Fig. 4

DS2: Balanced pictorial representation of majority vs. minority.

Fig. 5

DS3: Balanced pictorial representation of majority vs. minority.

Fig. 6

DS4: Balanced pictorial representation of majority vs. minority.

6.2 Results of Part:2

The various classification models are trained on the original data sets DS1 to DS4, so the classification rate for fraud identification can be observed directly. With an imbalanced data set, a model cannot predict fraudulent transactions correctly. In such cases, balancing techniques are required to even out the majority of non-fraud and the minority of fraud transactions. Established oversampling and undersampling techniques and the advanced RAA technique were used in these experiments. The parameters chosen for comparing results are accuracy, recall (TPR), FPR, F1-Score, LR+ and LR-. The results can be visualized from the bar graphs of the four data sets.

Tables 2 to 5 report the evaluation metrics obtained by applying the classical machine learning and deep learning models to the imbalanced data sets DS1, DS2, DS3 and DS4, respectively. On the original data sets, these classic models performed well in some cases, depending on feature selection; here, the firefly algorithm is used to extract the best features for identifying fraud. Across these models, GAN performs best, followed by RF. In some cases, however, a model with lower accuracy yields a higher positive likelihood ratio, as ELM does in Table 2. In Table 3, GAN performs well, followed by CNN and RF. DS3 is a very large data set in which the ratio of majority non-fraud to minority fraud transactions is drastically unequal. GAN performed well on DS3, with an accuracy of 84.13%.

Table 2

DS1: Comparison of accuracy on different classifiers

Classifiers	Accuracy	TPR	FPR	LR+	LR-
SVM	0.864497712	0.8953766	0.565262937	1.584000235	0.240658972
RF	0.916579151	0.9434499	0.416840508	2.263335538	0.096971866
ELM	0.88907576	0.9066669	0.3557511	2.548599056	0.144871155
CNN	0.878542311	0.8998935	0.455885795	1.973945129	0.183980597
GAN	0.916785753	0.9406461	0.394244037	1.07580312	0.223018024

Table 3

DS2: Comparison of accuracy on different classifiers

Classifiers	Accuracy	TPR	FPR	LR+	LR-
SVM	0.878493968	0.934619557	0.512820513	1.822508137	0.134201961
RF	0.896099123	0.929508338	0.735294118	1.26413134	0.266301833
ELM	0.886409763	0.939559122	0.520231214	1.806041423	0.12597918
CNN	0.902462486	0.933032921	0.735294118	1.268924773	0.252986741
GAN	0.904144167	0.933032921	0.698529412	1.335710287	0.2221347

Table 4

DS3: Comparison of accuracy on different classifiers

Classifiers	Accuracy	TPR	FPR	LR+	LR-
SVM	0.765891157	0.816583414	0.323523542	2.524030896	0.271135209
RF	0.817848032	0.87736014	0.323523542	2.711889637	0.181292133
ELM	0.785423074	0.84645273	0.299107001	2.829932859	0.219073768
CNN	0.809264955	0.843197339	0.244086723	3.454498991	0.207434723
GAN	0.841372339	0.954392366	0.398342139	2.395911137	0.075803271

Table 5

DS4: Comparison of accuracy on different classifiers

Classifiers	Accuracy	TPR	FPR	LR+	LR-
SVM	0.92	0.944370271	0.296442688	3.185675716	0.079069221
RF	0.936666667	0.962913514	0.296442688	3.248228255	0.052712814
ELM	0.92	0.943390886	0.460829493	2.047158222	0.104992972
CNN	0.933333333	0.957543164	0.460829493	2.077868667	0.078744729
GAN	0.946666667	0.968381113	0.455729167	2.124904842	0.058094032

Table 5 covers DS4, which has 30,000 transactions and the smallest ratio of non-fraud to fraud transactions. Even this small ratio affects accuracy when the classic models are trained on the data. Compared with the other imbalanced data sets, DS4 has the best class ratio. GAN, CNN and RF performed well, and GAN outperforms the other models across the different imbalanced data sets. The effectiveness of each model is shown in Fig. 12.
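As a worked check of Equations (14) and (15), take the SVM row of Table 5 (TPR = 0.944370271, FPR = 0.296442688), using the standard identities FNR = 1 - TPR and TNR = 1 - FPR:

$LR+ = \frac{TPR}{FPR} = \frac{0.944370271}{0.296442688} \approx 3.1857$

$LR- = \frac{FNR}{TNR} = \frac{1 - 0.944370271}{1 - 0.296442688} = \frac{0.055629729}{0.703557312} \approx 0.0791$

Both values agree with the LR+ and LR- columns reported for SVM in Table 5.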

6.3 Results of Part:3

This experiment examines how converting an imbalanced data set into a balanced one affects the classification rate. Several well-known balancing methods are considered: undersampling (US), oversampling (OS), a mixture of US and OS, SMOTE and ADASYN. Our proposed method, RAA, is analysed and compared with all of these methods. The experimental results are recorded in Tables 6 to 9 and visualized as grouped bar graphs in Figs. 7 to 10.
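For readers unfamiliar with the two simplest baselines, the sketch below shows one common way to implement random undersampling (US) and random oversampling (OS) with the Python standard library. It is an assumption about how these baselines are typically built, not the implementation used in the experiments; the function names and the toy data are hypothetical, and the majority class is assumed to be at least as large as the minority class.

```python
import random


def random_undersample(majority, minority, seed=0):
    """Drop majority samples at random until both classes have equal size."""
    rng = random.Random(seed)
    kept = rng.sample(majority, len(minority))  # sampling without replacement
    return kept, list(minority)


def random_oversample(majority, minority, seed=0):
    """Duplicate minority samples at random until both classes have equal size."""
    rng = random.Random(seed)
    n_extra = len(majority) - len(minority)
    extra = [rng.choice(minority) for _ in range(n_extra)]  # with replacement
    return list(majority), list(minority) + extra


# Toy data: 20 non-fraud rows and 4 fraud rows, each a single feature value.
non_fraud = [[float(i)] for i in range(20)]
fraud = [[100.0], [101.0], [102.0], [103.0]]

us_major, us_minor = random_undersample(non_fraud, fraud)
os_major, os_minor = random_oversample(non_fraud, fraud)
print(len(us_major), len(us_minor))
print(len(os_major), len(os_minor))
```

Undersampling discards majority information, while oversampling only repeats existing minority points; SMOTE, ADASYN and RAA instead generate new synthetic minority samples.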

Table 6

Balanced DS1: Accuracy values for imbalanced techniques

Classifiers	US	OS	US+OS	SMOTE	AdaSyn	RAA
SVM	0.86657631	0.867980773	0.867278543	0.870789693	0.871491923	0.898448896
RF	0.91713687	0.919693093	0.920648018	0.922403593	0.934060609	0.936125165
ELM	0.89652788	0.909482562	0.884904514	0.912993712	0.916504861	0.929102866
CNN	0.88139336	0.863837616	0.860326467	0.891926814	0.902460263	0.918569417
GAN	0.91872377	0.924416781	0.916123809	0.926653298	0.94433821	0.967313114

Table 7

Balanced DS2: Accuracy values for imbalanced techniques

Classifiers	US	OS	US+OS	SMOTE	ADASYN	RAA
SVM	0.88341093	0.88144476	0.87652829	0.88989355	0.89119286	0.897176962
RF	0.89722966	0.89977776	0.88486108	0.90457281	0.8983109	0.915915936
ELM	0.89357819	0.89650211	0.89311089	0.91134402	0.90898949	0.928528948
CNN	0.90563286	0.91235785	0.91458156	0.9265382	0.92981788	0.942582356
GAN	0.91874288	0.9342426	0.92448031	0.9352171	0.94195285	0.961661114

Table 8

Balanced DS3: Accuracy values for imbalanced techniques

Classifiers	US	OS	US+OS	SMOTE	AdaSyn	RAA
SVM	0.768341	0.78	0.80016438	0.83215679	0.8277188	0.886512648
RF	0.81964301	0.83093168	0.83653186	0.8583199	0.86179327	0.908447178
ELM	0.7965327	0.81168824	0.80234681	0.828327976	0.90274179	0.926567008
CNN	0.81778538	0.82612849	0.82688942	0.900411675	0.93185438	0.932289059
GAN	0.84452269	0.91294128	0.84863218	0.923117692	0.95211971	0.95613094

Table 9

Balanced DS4: Accuracy values for imbalanced techniques

Classifiers	US	OS	US+OS	SMOTE	AdaSyn	RAA
SVM	0.92332053	0.9245289	0.92162794	0.93551843	0.93681395	0.931666667
RF	0.93144132	0.94389605	0.93782748	0.94974503	0.93821953	0.963333333
ELM	0.92393198	0.92673222	0.93031216	0.9351702	0.94378319	0.946666667
CNN	0.93821073	0.93905321	0.9177218	0.94371889	0.93295442	0.943333333
GAN	0.94858288	0.9521482	0.9549931	0.96042165	0.96853104	0.975

Fig. 7

DS1: Evaluation metrics report of different classifiers on balanced dataset.

Fig. 8

DS2: Evaluation metrics report of different classifiers on balanced dataset.

Fig. 9

DS3: Evaluation metrics report of different classifiers on balanced dataset.

Fig. 10

DS4: Evaluation metrics report of different classifiers on balanced dataset.

The proposed balancing method, RAA, is evaluated by comparing its prediction rate with the state-of-the-art balancing techniques introduced above. RAA is implemented with ND. This distribution minimizes the number of decision boundary points and noisy points when distributing samples over the feature space.

Our model achieves the best accuracy on all four data sets when paired with GAN (Tables 6 to 9), because of its strong ability to learn the distribution of the data set. That is, it distributes samples better than the state-of-the-art methods for handling imbalanced learning.
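The general idea of placing synthetic minority samples close to the centre of the minority class with a normal distribution can be sketched as follows. This is a deliberately simplified illustration and not the RAA algorithm itself: it omits the ADASYN density weighting, and the function name `gaussian_oversample`, the `scale` parameter and the toy data are all hypothetical.

```python
import random
import statistics


def gaussian_oversample(minority, n_new, scale=0.5, seed=0):
    """Draw n_new synthetic points around the minority-class centroid.

    Feature j is sampled from a normal distribution with the minority mean
    of that feature and a standard deviation of scale * (minority std).
    A scale below 1 concentrates new points near the class centre, away
    from the decision boundary. Simplified sketch, not the authors' RAA.
    """
    rng = random.Random(seed)
    columns = list(zip(*minority))                      # one tuple per feature
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    return [
        [rng.gauss(mu, scale * sd) for mu, sd in zip(means, stds)]
        for _ in range(n_new)
    ]


# Toy minority class with two features (illustrative values only).
fraud = [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2], [1.1, 2.1]]
synthetic = gaussian_oversample(fraud, n_new=6)
for point in synthetic:
    print([round(x, 3) for x in point])
```

Sampling each feature independently ignores correlations between features; a fuller treatment would use the minority covariance matrix or fit the distribution locally around each minority sample.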

Tables 6 to 9 cover data sets DS1 to DS4 after balancing with US, OS, the US+OS mixture, SMOTE, ADASYN and RAA. The balanced data sets were then fed to the different models to improve the results. As shown in Table 6, classic undersampling and oversampling yield little improvement. The more advanced techniques SMOTE, ADASYN and RAA give better results, with GAN accuracies of 92.66%, 94.43% and 96.73%, respectively, followed by random forest. In Table 8, CNN also works well on the balanced data sets. The GAN classifier works well on imbalanced as well as balanced data sets. Table 9 shows that RF, CNN and GAN improve on the balanced data sets. ELM also gives a good result on the data set balanced with the ADASYN algorithm. For DS4, the proposed RAA reaches 96.33% with RF, second only to 97.5% with GAN. Overall, the vanilla GAN network handled most of the balanced data sets best, as shown in the bar graph in Fig. 10.

6.4 Results of Part:4

Part 4 describes how the proposed model compares with existing models. As part of this research, we identified several existing balancing techniques: VAE+GAN (Variational AutoEncoder) [37], OXGBoost+RandomSearchCV (optimized XGBoost) [38], DWE Overlapping+RF (Dynamic Weighted Entropy) [24] and robROSE+LR (robust Random OverSampling Examples) [23]; these are compared with the proposed RAA with GAN for effective classification. Table 10 compares the accuracy and F1 score of RAA+GAN with those of the existing algorithms on all balanced data sets. The proposed model gives an accuracy of 96.73% for DS1, 96.16% for DS2, 95.61% for DS3 and 97.50% for DS4, as shown in Fig. 11. To assess the performance of the proposed GAN, two further measures are adopted, LR+ and LR-, as given in Equations (14) and (15). Both values change drastically between the unbalanced data set and the balanced data set, as listed in Table 11. Figs. 12(a) and 12(b) visualize this comparison as line graphs of the likelihood ratios for GAN on the unbalanced data set and for GAN on the data set balanced by RAA. The line graphs show the ratio of the true positive rate to the false positive rate (LR+) and the ratio of the false negative rate to the true negative rate (LR-).

Fig. 11

Comparison of existing imbalanced models with proposed imbalanced technique RAA.

Fig. 12

Likelihood ratios.

Table 10

Comparison: Accuracy and F1 Score of existing imbalanced techniques and RAA

Author	Methodology	DS1		DS2		DS3		DS4	
		Acc	F1 Score	Acc	F1 Score	Acc	F1 Score	Acc	F1 Score
Tingfei et al. [37]	VAE+GAN	0.824443	0.903592	0.88758	0.937978	0.843448	0.909135	0.906977	0.949575
Priscilla et al. [38]	OXGBoost+RandomSearchCV	0.912221	0.954013	0.925417	0.959913	0.87793	0.931509	0.92691	0.960804
Li et al. [24]	DWE Overlapping+RF	0.945577	0.971978	0.908601	0.950336	0.907494	0.949105	0.961667	0.980096
Wang et al. [18]	SMOTE+ND+RF	0.964889	0.98197	0.927267	0.960946	0.950733	0.974477	0.968333	0.983691
Baesens et al. [23]	robROSE+LR	0.950844	0.974575	0.949768	0.973491	0.945641	0.970707	0.95	0.973605
Proposed (RAA)	RAA+GAN	0.967313	0.983179	0.961661	0.980118	0.956131	0.980095	0.975	0.987169

Table 11

LR+ and LR- values for GAN on unbalanced and balanced data sets

Dataset-Classifier	LR+				LR-
	DS1	DS2	DS3	DS4	DS1	DS2	DS3	DS4
Unbalanced-GAN	1.075803	1.3357	2.39591114	2.1249048	0.223018	0.22213	0.0758033	0.058094
Balanced-GAN	2.385949	1.7507	2.83069786	3.2103659	0.0979832	0.04557	0.0314162	0.0176152
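To quantify the change reported in Table 11, the short sketch below computes the ratio of the balanced value to the unbalanced value for each data set. The numbers are copied from Table 11; the variable and function names are hypothetical and the calculation is only an illustration of how the table can be read.

```python
# (unbalanced, balanced) pairs for GAN, copied from Table 11.
LR_PLUS = {
    "DS1": (1.075803, 2.385949),
    "DS2": (1.3357, 1.7507),
    "DS3": (2.39591114, 2.83069786),
    "DS4": (2.1249048, 3.2103659),
}
LR_MINUS = {
    "DS1": (0.223018, 0.0979832),
    "DS2": (0.22213, 0.04557),
    "DS3": (0.0758033, 0.0314162),
    "DS4": (0.058094, 0.0176152),
}


def fold_change(pair):
    """Return balanced / unbalanced for one (unbalanced, balanced) pair."""
    before, after = pair
    return after / before


for name in LR_PLUS:
    up = fold_change(LR_PLUS[name])     # above 1 means LR+ improved
    down = fold_change(LR_MINUS[name])  # below 1 means LR- improved
    print(f"{name}: LR+ x{up:.2f}, LR- x{down:.2f}")
```

A fold change above 1 for LR+ together with a fold change below 1 for LR- indicates that balancing improved both likelihood ratios for that data set.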

7 Conclusion & future scope

The major constraint of binary classification in digital fraud detection is the imbalanced distribution between the majority non-fraud and minority fraud observations. Because of this, many algorithms learn mainly from the majority class and may classify all observations as non-fraudulent, which yields a high overall accuracy but poor precision on the minority class of interest. This research therefore focused on the distribution of samples in the feature region. The popular ADASYN and SMOTE techniques generally follow a density distribution. The proposed modified ADASYN technique (RAA) uses ND, which overcomes the problems of decision boundary points as well as noisy points. RAA is therefore an appropriate solution for this imbalance problem. The disproportionate data sets (DS1 to DS4) were tested with the proposed RAA, and the performance of RAA was then demonstrated with a GAN network for efficient classification of fraud and non-fraud transactions.

Furthermore, the proposed RAA can be applied to any application that requires a balanced data set. This work has limitations that need to be addressed in future work. Other distribution techniques between fraud and non-fraud samples can be considered. RAA may not work effectively at finding overlapping outliers; hence, it needs to be improved with methods for maximum separation of data points.

References

1. Tero Pikkarainen, Kari Pikkarainen, Heikki Karjaluoto, Seppo Pahnila, Consumer acceptance of online banking: an extension of the technology acceptance model, Internet Research, (2004), Emerald Group Publishing Limited.

2. Fraud Description, https://www.businesstoday.in/union, (2021).

3. Lawal O.M., Vincent O.R., Agboola A.A.A., Folorunso, An improved hybrid scheme for e-payment security using elliptic curve cryptography, International Journal of Information Technology, 13 (2021), 139–153, Springer.

4. Tekkali C.G., Vijaya, A survey: Methodologies used for fraud detection in digital transactions, 2021 Second International Conference on Electronics and Sustainable Communication Systems (ICESC), 10 (2021), 1758–17653.

5. Puri, Manoj K.G., Improved hybrid bag-boost ensemble with K-means-SMOTE–ENN technique for handling noisy class imbalanced data, The Computer Journal, 65 (2022), 124–138, Oxford University Press.

6. Ning, Zhao, Ma, A novel method for identification of Glutarylation sites combining Borderline-SMOTE with Tomek links technique in imbalanced data, IEEE/ACM Transactions on Computational Biology and Bioinformatics, (2021), IEEE.

7. Sanjib Kumar Sahu, Pankaj Kumar, Amit Prakash Singh, Modified K-NN algorithm for classification problems with improved accuracy, International Journal of Information Technology, 10 (2018), 65–70, Springer.

8. Rufai K.I., Usman O.L., Muniyandi R.C., Oyinkanola, Modelling credit card payment fraud detection system for financial institutions in Nigeria using an improved firefly algorithm, Int J Inf Process Commun, 11 (2021), 9–25.

9. Fabrizio Carcillo, Yann-Aël Le Borgne, Olivier Caelen, Gianluca Bontempi, An assessment of streaming active learning strategies for real-life credit card fraud detection, 2017 IEEE International Conference on Data Science and Advanced Analytics (DSAA), (2017), pp. 631–639.

10. Gosain, Sardana, Handling class imbalance problem using oversampling techniques: A review, 2017 International Conference on Advances in Computing, Communications and Informatics (ICACCI), (2017), pp. 79–85.

11. Mohammed R.A., Wong, Shiratuddin M.F., Wang, Scalable machine learning techniques for highly imbalanced credit card fraud detection: A comparative study, PRICAI 2018: Trends in Artificial Intelligence, (2018), Springer International Publishing, pp. 237–246.

12. Makki, Assaghir, Taher, Haque, -Said Hacid, Zeineddine, An experimental study with imbalanced classification approaches for credit card fraud detection, IEEE Access, 7 (2019), 93010–93022, IEEE.

13. Wei Wen Soh, Rika Mohd Yusuf, Predicting credit card fraud on a imbalanced data, International Journal of Data Science and Advanced Analytics, 1 (2019), 12–17.

14. Bagga, Goyal, Gupta, Goyal, Credit card fraud detection using pipelining and ensemble learning, Procedia Computer Science, 173 (2020), 104–112, Elsevier.

15. Somorjit, Verma, Variants of generative adversarial networks for credit card fraud detection, Trends in Computational Intelligence, Security and Internet of Things, (2020), pp. 133–143, Springer International Publishing.

16. Kyoungok Kim, Noise avoidance SMOTE in ensemble learning for imbalanced data, IEEE Access, 9 (2021), 143250–143265, IEEE.

17. Gede Angga Pradipta, Retantyo Wardoyo, Aina Musdholifah, Nyoman Hariyasa Sanjaya, Radius-SMOTE: a new oversampling technique of minority samples based on radius distance for learning from imbalanced data, IEEE Access, 9 (2021), 74763–74777, IEEE.

18. Shujuan Wang, Yuntao Dai, Jihong Shen, Jingxue Xuan, Research on expansion and classification of imbalanced data based on SMOTE algorithm, Scientific Reports, 11 (2021), 1–11, Nature Publishing Group.

19. Adrian O’Hagan, Clustering with the multivariate normal inverse Gaussian distribution, Intelligent Systems Design and Applications, (2018), Computational Statistics & Data Analysis, pp. 18–30.

20. Honghao Zhu, Guanjun Liu, Mengchu Zhou, Yu Xie, Abdullah Abusorrah, Qi Kang, Optimizing weighted extreme learning machines for imbalanced classification and application to credit card fraud detection, Neurocomputing, 407 (2020), 50–62, Elsevier.

21. Itoo, Singh, and others, Comparison and analysis of logistic regression, Naïve Bayes and KNN machine learning algorithms for credit card fraud detection, International Journal of Information Technology, 13 (2021), 1503–1511, Springer.

22. Singh, Ranjan R.K., Tiwari, Credit card fraud detection under extreme imbalanced data: A comparative study of data-level algorithms, Journal of Experimental & Theoretical Artificial Intelligence (2021), pp. 1–28, Taylor & Francis.

23. Bart Baesens, Sebastiaan Höppner, Irene Ortner, Tim Verdonck, robROSE: A robust approach for dealing with imbalanced data in fraud detection, Statistical Methods & Applications, 30 (2021), 841–861, Springer.

24. Zhenchuan Li, Mian Huang, Guanjun Liu, Changjun Jiang, A hybrid method with dynamic weighted entropy for handling the problem of class imbalance with overlap in credit card fraud detection, Expert Systems with Applications, 175 (2021), 114750, Elsevier.

25. Boutkhoum Omar, Furqan Rustam, Arif Mehmood, Gyu Sang Choi and others, Minimizing the overlapping degree to improve class-imbalanced learning under sparse feature selection: application to fraud detection, IEEE Access, 9 (2021), 28101–28110, IEEE.

26. Hadeel Ahmad, Bassam Kasasbeh, et al., Class balancing framework for credit card fraud detection based on clustering and similarity-based selection (SBS), International Journal of Information Technology (2022), pp. 1–9, Springer.

27. Jakob Brandt, Lanz, Comparative review of SMOTE and ADASYN in imbalanced data classification, International Journal (2021).

28. Mondal I.A., Haque Md, Hassan Al-Maruf, Shatabdi, Handling imbalanced data for credit card fraud detection, 2021 24th International Conference on Computer and Information Technology (ICCIT), (2021), pp. 1–6, IEEE.

29. Rtayli, Enneya, Enhanced credit card fraud detection based on SVM-recursive feature elimination and hyper-parameters optimization, Journal of Information Security and Applications, 55 (2020), 102596, Elsevier.

30. Das Prusti, Rath, Credit card fraud detection technique by applying graph database model, Arab J Sci Eng, 46 (2021), 1–20, Springer.

31. Cao, Le-Khac, O’Neill, Nicolau, McDermott, Improving fitness functions in genetic programming for classification on unbalanced credit card data, (2016), pp. 35–45, Springer International Publishing.

32. Seera, Lim C.P., Kumar, Dhamotharan, Tan K.H., An intelligent payment card fraud detection system, Annals of Operations Research (2021), pp. 1–23, Springer.

33. Parul Singh, Virender Ranga, Attack and intrusion detection in cloud computing using an ensemble learning approach, International Journal of Information Technology, 13 (2021), 565–571, Springer.

34. Akib Mohi Ud Din Khanday, Syed Tanzeel Rabani, et al., Machine learning based approaches for detecting COVID-19 using clinical text data, International Journal of Information Technology, 12 (2020), 731–739, Springer.

35. Chengbo Lu, Haifeng Ke, Gaoyan Zhang, Ying Mei, Xu, An improved weighted extreme learning machine for imbalanced data classification, Memetic Computing, 11 (2019), 27–34, Springer.

36. Murugan, Vijayalakshmi, et al., Credit card fraud detection using CNN, Internet of Things and Connected Technologies (2022), pp. 194–204, Springer International Publishing.

37. Tingfei, Guangquan, Kuihua, Using variational auto encoding in credit card fraud detection, IEEE Access, 18 (2020), 149841–149853, IEEE.

38. Victoria Priscilla, Padma Prabha, Influence of optimizing XGBoost to handle class imbalance in credit card fraud detection, 2020 Third International Conference on Smart Systems and Inventive Technology (ICSSIT), (2020), pp. 1309–1315.

Datasets	#Samples	#Features	#Reduced Features	#Non-Frauds	#Frauds	#Ratio
DS1:European Dataset	284807	31	10	284315	492	577.88
DS2:BankSim Dataset	594643	10	8	587443	7200	81.59
DS3:Payment Type Dataset	1048573	10	8	1047432	1141	917.99
DS4:UCI Credit card	30000	25	11	23364	6636	3.52
