Oversampling method based on GAN for tabular binary classification problems

Abstract

Data imbalance is present in many applications. A large gap between the numbers of samples in different classes skews classifiers toward the majority class and thus degrades learning performance and the quality of the results. Most data-level imbalanced learning approaches generate new samples using only the information associated with the minority samples, through linear interpolation or data distribution fitting. Unlike these algorithms, we propose a novel oversampling method based on generative adversarial networks (GANs), named OS-GAN. In this method, the GAN learns the distribution characteristics of the minority class starting from selected majority samples rather than from random noise. As a result, samples produced by the trained generator carry information from both the majority and minority classes. Furthermore, central regularization keeps the distribution of the synthetic samples from being restricted to the domain of the minority class, which can improve the generalization of learning models or algorithms. Experimental results on 14 datasets and one high-dimensional dataset show that OS-GAN outperforms 14 commonly used resampling techniques in terms of G-mean, accuracy and F1-score.

Keywords

Oversampling, GAN, imbalanced learning

1. Introduction

Data imbalance is currently one of the main obstacles in various applications of machine learning. Imbalanced datasets are widespread [1] and skew machine learning algorithms or models toward the majority class [2]. However, the minority class is often the more meaningful one in practical problems [3]. Existing studies address this problem at two levels, namely the algorithmic level and the data level.

At the algorithmic level, one tends to adjust the model structure to improve the classification accuracy on imbalanced data. Ensemble learning, such as random forest [4] and XGBoost [5], is an effective approach that integrates several weak classifiers and avoids the drawbacks of a single classifier. In [6], a dynamic ensemble learning algorithm based on K-means obtains diverse base classifiers, and a distance-based dynamic ensemble creates a personalized combined result for each test sample. Another trend is to optimize the costs of prediction errors or other potential costs, using dedicated formulations or techniques, so that learning focuses on minority samples. The cost-sensitive model [7] penalizes the model heavily when it misclassifies a minority sample, which guides it to pay more attention to the minority class. CS-SVM optimizes the SVM by extending the standard loss function with a constructive procedure [8].

At the data level, resampling is popular and can be divided into three strategies, viz. undersampling, oversampling and hybrid sampling. Undersampling deletes some majority samples, which leads to information loss [9]. Oversampling and hybrid sampling have attracted more attention. The synthetic minority oversampling technique (SMOTE) [10] is a well-known and effective oversampling method. Borderline-SMOTE is one of the popular variants of SMOTE and pays more attention to the minority samples located near the classification boundary [11]. LR-SMOTE avoids the generation of biased samples [12]. SMOTE-NaN-DE is proposed for noisy and borderline samples [13]. Some studies also oversample based on the distribution of the minority samples [14].

We notice that most existing oversampling methods directly apply Euclidean distance to sequential data, especially high-dimensional data [15]. This may not approximate the underlying characteristics of the data well. In addition, many methods synthesize new samples based only on the minority samples. However, the minority class is generally small, so the information it can provide for oversampling is limited or insufficient, and it is difficult to determine the distribution of the minority class from so little information. Meanwhile, the many majority samples contain substantial information that is not fully utilized.

The generative adversarial network (GAN) has been a widely used image generator since it was introduced by Goodfellow et al. in 2014 [16]. It learns the underlying true data distribution from a limited set of available images and then uses the learned distribution to generate synthetic images. This naturally motivates investigating the effectiveness of GAN in oversampling minority samples for imbalanced datasets [17]. Mullick et al. employ adversarial oversampling in deep learning systems to mitigate the effects of class imbalance on image datasets [18]. Hao et al. propose an Annealing Genetic GAN to reproduce the distribution of the minority class using only limited data samples [19]. As an image generator, GAN has also been applied to industrial defect image generation [20]. Using synthetic images to balance the class distribution is a fairly recent topic that needs to be explored more widely and deeply.

However, most GAN-based oversampling methods are currently limited to image generation. As far as we know, very few research works address numerical imbalanced datasets. Oh et al. use GAN for oversampling by detecting and removing outlier samples from the majority class [21]. Another line of work, different from image generation, uses GAN for tabular data generation [22, 23]. Although research in this area is still in its infancy, GAN can generate high-quality realistic samples by learning the probability distributions of numerical datasets.

Inspired by the above discussion, in this paper we present a new oversampling method based on GAN (OS-GAN) for numerical imbalanced datasets. In this method, the GAN is responsible for extracting the salient distribution characteristics of the minority class starting from some majority samples, so that the intrinsic properties of the minority samples are exposed to oversampling.

Most existing oversampling methods, such as SMOTE and its variants, are interpolation methods that do not acquire any distribution information about the minority class, which is also the main challenge they face. As a class of machine learning frameworks, GAN provides an alternative solution for oversampling by distribution learning. In OS-GAN, the real inputs of the discriminator are all the minority samples. OS-GAN requires no explicit estimation of statistical models or parameters of the dataset; instead, the distribution of the minority class is learned by training the GAN, and this distribution information is then incorporated into the synthetic samples. This design aims to preserve the global properties (rather than the local properties, as other oversampling methods would) of the underlying probability density function of the minority class.

In a numerical imbalanced dataset, the scarcity of minority samples and the presence of small disjuncts make learning the distribution of the minority class very difficult. It has recently been observed that there may be a gray area between classes, especially in imbalanced datasets [24]. Samples from different classes in the gray area share the same space and similar characteristics. Therefore, the inputs of the generator in OS-GAN are not random noise but selected majority samples. A further advantage of this choice is that it introduces inductive information from the majority class into the end-to-end training of the GAN, beyond what the minority class provides.

OS-GAN does not select majority samples around the classification boundary for the generator, but those that belong to the cluster containing the center of the minority class. This strategy has the following advantages. Majority samples far away from the classification boundary differ obviously from the minority samples and can be considered unrelated to them. Majority samples around the classification boundary but far away from the center of the minority class not only provide limited distribution information for the GAN but may also mislead distribution learning. Majority samples near the center of the minority samples have relatively ambiguous classification characteristics with respect to the minority class. They can be considered as samples with a lot of majority class information and a little minority class information. Therefore, feeding these samples to the generator helps the GAN start learning the distribution of the minority class from these majority samples rather than from scratch.

The main contributions of this paper are as follows:

(1) GAN is used for imbalanced learning. We present OS-GAN as a minority sample generator that learns the distribution characteristics of the raw minority class. OS-GAN outperforms 14 commonly used resampling techniques on 14 imbalanced benchmark datasets.
(2) Strategies for generating numerical data with GAN. We design a comprehensive early stopping strategy for GAN on numerical data to prevent the generated data from concentrating or lining up. In addition, the inputs of the generator network are not random noise but some majority samples, which helps the GAN extract information from the majority class as well as the minority class.

The originality of OS-GAN is that it does not perform oversampling or undersampling, but uses GAN as a distribution learning tool to directly generate minority samples. The comprehensive integration of majority and minority classes through training GAN can generate new minority samples according to the characteristics of the entire dataset rather than individual samples.

The paper is organized as follows. Section 2 introduces the background of imbalanced learning and GAN. Section 3 reviews the related work. Section 4 describes the proposed method. Section 5 presents the experimental results and discussion, and the final section concludes the paper.

2. Background of imbalanced learning and GAN

2.1 Imbalanced learning

An imbalanced dataset is one in which one class has far more or far fewer samples than the other classes. At present, important techniques include ensemble learning, cost-sensitive learning, choosing the right performance metric, oversampling, undersampling, and the combination of the last two.

Ensemble learning, such as Bagging [25] and Boosting [26], combines the results of multiple base learners to achieve higher performance than a single classifier. This technique can construct a more stable model and produce better predictions than a single classifier. Cost-sensitive learning uses penalized learning algorithms in which the cost of misclassifying the minority class receives more attention. Penalized-SVM is one of the popular methods of this kind [27]. Performance metrics for imbalanced datasets not only assess the classification performance but also guide the classifier modeling [28]. Choosing the right metric is challenging for imbalanced classification problems; generally, precision, recall, G-mean, F1-score and ROC provide better insight. A widely used technique for imbalanced learning is resampling, i.e. oversampling, undersampling and their combination. Oversampling adds samples to the minority class, which does not lose any information but can cause overfitting or poor generalization to the test set [10]. Conversely, undersampling removes samples from the majority class, which decreases the run-time but loses some information [29].

2.2 Generative adversarial network (GAN)

Here we first provide a brief introduction to the GAN, shown in Fig. 1, which is well known for its ability to generate fake data from scratch. The operation of a GAN can be described as a game between two competitors, which is realized by training two neural networks alternately. One network generates real-like data and is referred to as the generator ($G$). The other identifies whether a sample is real or fake and is called the discriminator ($D$). The objective of $G$ is to generate fake data that $D$ cannot discern from real data. The training of the GAN stops when $D$ can no longer distinguish real samples from fake ones.

In the most basic GANs, random noise $z$ drawn from a uniform distribution $P_{g}(z)$ is fed into $G$. Then the fake data $G(z)$ generated by $G$ and the real data $x$ are input to $D$. The output of $D$ is the probability that its input is real data. When the probability given by $D$ is 0.5, $G$ generates real-like data that $D$ cannot distinguish from real data, and this state is called the Nash equilibrium [30]. The objective function of the game between $G$ and $D$ is expressed as follows [16]:

$\displaystyle\min_{G}\max_{D}\left(E_{x\sim P_{r}(x)}[\log D(x)]+E_{z\sim P_{g}(z)}[\log(1-D(G(z)))]\right)$ (1)

where $D(x)$ is the probability that $D$ assigns to a real sample $x$ being real, $D(G(z))$ is the probability that $D$ assigns to a fake sample $G(z)$ being real, and $E$ denotes the expectation over the corresponding distribution.
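
As a small numerical check of (1), the following sketch estimates the objective on finite batches of discriminator outputs. The function name `gan_value` and the probability values are illustrative assumptions and are not taken from the paper. When $D$ outputs 0.5 on every sample, the estimate equals $-2\log 2$, which is the value of the objective at the equilibrium described above.

```python
import math

def gan_value(d_real, d_fake):
    """Monte Carlo estimate of the objective in (1).

    d_real: discriminator outputs D(x) on real samples.
    d_fake: discriminator outputs D(G(z)) on generated samples.
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

# Hypothetical outputs of a confident discriminator:
# high scores on real data, low scores on fake data.
confident = gan_value([0.9, 0.95, 0.85], [0.1, 0.05, 0.2])

# A discriminator that cannot tell real from fake outputs 0.5 everywhere.
fooled = gan_value([0.5, 0.5, 0.5], [0.5, 0.5, 0.5])
```

The discriminator tries to push this value up (as in `confident`), while the generator tries to push it down toward the `fooled` case.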

Figure 1.

Basic structure of GAN.

3. Related work

3.1 Traditional tasks with GAN

GAN was first proposed for generating images. Although the basic GAN performs poorly on image tasks by today's standards, it remains an innovative milestone. Researchers have also used GANs for many interesting tasks, such as style transfer [31], which transfers the style of one work to another, and automatic matting for photos [32].

There are also GAN variants for speech generation, such as [33]. Music can be generated as well [34], and, as with images, the style of music can be transferred [35].

Some articles also propose GAN variants to address data imbalance. Ref. [36] designs experiments to investigate the effect of difficulty factors, such as data dimension and class overlap, on conditional GAN for the image oversampling task. Ref. [37] builds an entropy-weighted label vector as extra information to help Wasserstein GAN sample the minority class for high-dimensional datasets. However, GAN for tabular datasets with few samples (just several dozen or several hundred) is rare, and this is the setting we address.

3.2 Tabular data generation using GAN

GAN is generally applied to high-fidelity natural image synthesis, but it is rarely used for numerical data. It is extremely difficult to learn the complicated distribution of a high-dimensional numerical dataset, and the sparsity of data points in the feature space complicates parameter optimization.

On the other hand, GAN also exhibits some inherent problems in generating numerical data. The major obstacles are the sparsity of numerical data and mode collapse. When a GAN is trained on a discrete dataset, the networks may suffer from vanishing or exploding gradients and the samples generated by the generator become concentrated, which results in mode collapse [38]. To address these problems, a gradient penalty is applied to the discriminator in DRAGAN [39]. The method in [40] lets the generator consider both the current state of the discriminator and its state after several updates. The strategy we use in this paper is to stop the training early.

Another problem is that a GAN needs many training samples before it can generate data. Most tabular datasets contain only a few minority samples, which is not enough to train a GAN well. Therefore, we let the GAN learn the distribution of the minority class starting from majority samples, which offers a feasible plan.

4. The proposed oversampling method

Consider a binary imbalanced training dataset $\{(x_{i},y_{i})|x_{i}\in R^{n},y_{i}\in\{1,-1\}\}$. The label $y_{i}=1$ means that the sample $(x_{i},y_{i})$ belongs to the minority (positive) class $P$; otherwise it belongs to the majority (negative) class $N$. The flowchart of OS-GAN is shown in Fig. 2. It can be roughly described in four steps:

Step 1: Using the center $x_{0}$ of $P$, some negative samples $x^{N}_{j}$ are selected as the input of $G$, and all positive samples play the role of the real data in the GAN.
Step 2: Train the GAN and generate positive samples $x^{PN}_{j}$.
Step 3: Translate $x^{PN}_{j}$ into $x^{P}_{j}$ with central regularization so that they lie close to $x_{0}$.
Step 4: If necessary, use SMOTE to achieve balance.

The following subsections present details of every step.

Figure 2.
Flowchart of OS-GAN.

4.1 Selection of input samples for G

The fundamental difference between OS-GAN and other GAN-based oversampling methods is that the inputs of the generator $G$ are not random data but selected negative samples. At the same time, all positive samples play the role of real data. This trick helps $G$ transform negative samples into samples that tightly integrate the negative class and the positive class. Thus, $G$ learns the distribution of the positive samples starting from negative samples.

However, many of the negative samples are too far away from the positive samples, so that there is no substantial relationship between them and the positive class. We need to select some of the negative samples that have the greatest relationship with the positive samples. We first find the center of the positive samples as follows.

$\displaystyle x_{0}=\frac{\sum_{i=1}^{|P|}x_{i}^{P}}{|P|}$ (2)

where $|P|$ is the size of the positive class and $x_{i}^{P}$ is the $i$th positive sample. Then we add $x_{0}$ to the negative class and apply the k-means algorithm to the resulting set.

Figure 3.

Diagram of OS-GAN.

As shown in Fig. 3a, the green and blue points are the negative and positive samples, respectively. The black cross in Fig. 3b is the center $x_{0}$ of the positive class. The clustering result of k-means is simulated in Fig. 3c. The negative samples that belong to the cluster that includes $x_{0}$ are the selected samples for the generator of GAN and are denoted by $x_{j}^{N}$ .
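
The selection step of this subsection can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names `kmeans_labels` and `select_generator_inputs`, the plain Lloyd's k-means with random initialization, and the synthetic two-dimensional data are all hypothetical choices made for the example.

```python
import numpy as np

def kmeans_labels(points, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means; returns the cluster index of every point."""
    rng = np.random.default_rng(seed)
    # Initialize the centers with k distinct points (assumed initialization).
    centers = points[rng.choice(len(points), size=k, replace=False)]
    labels = np.zeros(len(points), dtype=int)
    for _ in range(n_iter):
        # Assign each point to its nearest center.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the mean of its members; empty clusters stay put.
        for c in range(k):
            members = points[labels == c]
            if len(members) > 0:
                centers[c] = members.mean(axis=0)
    return labels

def select_generator_inputs(x_pos, x_neg, k, seed=0):
    """Return x_0 and the negative samples sharing a cluster with x_0."""
    x0 = x_pos.mean(axis=0)                     # center of P, Eq. (2)
    stacked = np.vstack([x_neg, x0[None, :]])   # negatives plus the center
    labels = kmeans_labels(stacked, k, seed=seed)
    mask = labels[:-1] == labels[-1]            # same cluster as x_0
    return x0, x_neg[mask]

# Synthetic example data (not from the paper).
rng = np.random.default_rng(42)
x_neg = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
x_pos = rng.normal(loc=1.5, scale=0.3, size=(15, 2))
x0, selected = select_generator_inputs(x_pos, x_neg, k=4)
```

Here `selected` plays the role of $x_{j}^{N}$. Its size depends on $k$: a larger $k$ gives smaller clusters and therefore fewer inputs for the generator.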

4.2 Training GAN

Because we only consider numerical datasets, $D$ and $G$ are constructed as fully connected multi-layer perceptrons, each of which contains two rectified linear units (ReLU) [41]. According to the design of OS-GAN, the number of input neurons of $D$ and the numbers of input and output neurons of $G$ all equal the dimension of the training samples. A neuron with a sigmoid [42] function is used in $D$ to output a probability. All parameters of $D$ and $G$ are initialized with random values. The GAN training algorithm is taken from the original 2014 paper [16], i.e. (1) is the loss function and the backpropagation algorithm [43] is used to update the parameters of $G$ and $D$.

All $x_{j}^{N}$ are regarded as inputs to the generator $G$ to output fake samples. Then these samples and all positive samples are input to the discriminator of GAN. Through training GAN, samples are generated by the trained generator shown as yellow dots in Fig. 3d and are denoted by $x_{j}^{PN}$ .

Early stopping strategy

However, Fig. 3d shows only the ideal result. Many experiments we performed show that samples generated by GAN from real numerical datasets are concentrated or line up, as shown in Fig. 4a and b, where GANs are trained for 100 epochs on the glass4 and yeast6 datasets, respectively (the details of these two datasets are listed in Table 1).

Figure 4.

Fake samples generated by GAN on two real-world datasets glass4 and yeast6.

One reason for this phenomenon may be that the numerical positive samples are scattered in the feature space and so few in number that $D$ cannot capture their global distribution well. When the output of $G$ satisfies the local distribution, $D$ gives a positive judgement, which encourages $G$ to keep this state, and the fake data become more and more locally distributed to deceive $D$. Another possible reason may be that GAN learns the distribution of a dataset not by sampling the best-fit points but by minimizing the overall distance between the real and fake datasets.

Our goal is to learn the distribution characteristics of the positive class in order to generate positive samples rather than negative samples. What we want from the GAN is therefore samples that contain information from both the negative and positive classes, not a well-trained generator. So we stop the training early to prevent the over-concentration of fake samples. Early stopping also greatly reduces the time spent on training the GAN. Figure 4c and d show the fake samples after GANs are trained for 20 epochs on the glass4 and yeast6 datasets, respectively.

4.3 Central regularization

Although the early stopping strategy can prevent the samples generated by GAN from concentrating or lining up, it can also result in insufficient training of the GAN, which brings the risk of too large a gap between the generated samples and the raw positive samples. So, we use the following central regularization to force $x_{j}^{PN}$ to be close to the center $x_{0}$:

$\displaystyle x_{j}^{P}=x_{0}+\gamma\delta(x_{0}-x_{j}^{PN})$ (3)

shown as red dots in Fig. 3e, where $\delta$ is a random number from 0 to 1 and $\gamma$ is a parameter that determines the relative position of $x_{j}^{P}$ to $x_{0}$ .
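
A minimal sketch of (3) is given below. It assumes that a fresh $\delta$ is drawn uniformly from $[0,1)$ for every generated sample, which the text does not specify; the function name `central_regularization` and the example points are hypothetical. As the formula is written, each $x_{j}^{P}$ lies on the line through $x_{0}$ and $x_{j}^{PN}$, at a distance from $x_{0}$ equal to $\gamma\delta$ times the original distance.

```python
import numpy as np

def central_regularization(x_pn, x0, gamma=2.0, seed=0):
    """Apply Eq. (3): x_P = x0 + gamma * delta * (x0 - x_PN).

    Assumption: one delta in [0, 1) is drawn independently per sample.
    """
    rng = np.random.default_rng(seed)
    delta = rng.random(size=(len(x_pn), 1))   # one delta per generated sample
    return x0 + gamma * delta * (x0 - x_pn)

# Hypothetical center and GAN outputs in two dimensions.
x0 = np.array([1.0, 1.0])
x_pn = np.array([[3.0, 1.0], [1.0, -1.0], [0.0, 0.0]])
x_p = central_regularization(x_pn, x0, gamma=2.0)

# Distances to the center before and after the regularization.
before = np.linalg.norm(x_pn - x0, axis=1)
after = np.linalg.norm(x_p - x0, axis=1)
```

With this sketch, `after` never exceeds `gamma * before`, so $\gamma$ bounds how far a regularized sample can lie from $x_{0}$ relative to the raw GAN output.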

4.4 Reparative SMOTE

With the union of $P$ and $x_{j}^{P}$, the imbalance problem is slightly alleviated, but many experiments on real-world datasets show that the number of $x_{j}^{P}$ is generally too small to reach data balance. On the other hand, a totally balanced dataset is rarely the best choice from the viewpoint of effective problem-solving, as it is unlikely that the positive and negative classes reside on manifolds of the same size. These two factors drive us to consider how many synthetic samples should be generated.

To find a suitable number of synthetic samples, $S$, we use cross-validation-based parameter selection under some measures to find a suitable balanced level parameter, $\beta$, according to the following equation

$\displaystyle S=\max\{|x_{j}^{P}|,(|N|-|P|)\times\beta\}.$ (4)

If $S=|x_{j}^{P}|$, then the oversampling procedure is finished. Otherwise, SMOTE is applied on all positive samples and $x_{j}^{P}$ to generate $((|N|-|P|)\times\beta-|x_{j}^{P}|)$ samples. The samples generated by SMOTE, $x_{\textit{new}}$, are shown as black dots in Fig. 3f.
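
The bookkeeping in (4) can be illustrated with a short sketch. The function name `smote_budget`, the rounding of a fractional count to the nearest integer, and the number of GAN-generated samples used in the example are assumptions made for illustration; the class sizes are those of Glass4 in Table 1.

```python
def smote_budget(n_generated, n_neg, n_pos, beta):
    """Return (S, n_smote) following Eq. (4).

    n_generated: number of samples x_j^P produced by the GAN stage.
    n_smote: how many additional samples SMOTE still has to produce.
    """
    target = (n_neg - n_pos) * beta
    s = max(n_generated, target)
    # Rounding a fractional count is an assumption of this sketch.
    n_smote = max(0, round(s - n_generated))
    return s, n_smote

# Glass4: |P| = 13, |N| = 201. Suppose the GAN stage yields 40 samples.
s_small, extra_small = smote_budget(40, n_neg=201, n_pos=13, beta=0.75)

# If the GAN stage already yields more than the target, SMOTE is skipped.
s_large, extra_large = smote_budget(150, n_neg=201, n_pos=13, beta=0.75)
```

In the first case the target is $188\times 0.75=141$, so SMOTE has to supply 101 samples; in the second case $S=|x_{j}^{P}|$ and no SMOTE step is needed.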

Algorithm OS-GAN
Input: epoch $t$, learning rate $r$, positive samples $x^{P}_{i}$, negative samples $x^{N}_{i}$, number of clusters $k$, initial parameters of GAN $\theta$, position parameter $\gamma$, $\delta\in(0,1)$, balanced level parameter $\beta$
Output: new positive samples $x_{\textit{new}}$
$x_{0}\leftarrow\Sigma_{i=1}^{|P|}x_{i}^{P}/|P|$
$x_{j}^{N}\leftarrow\textit{kmeans}(x_{i}^{N},x_{0},k)$
$i\leftarrow 0$
while $i\leqslant t$ do
    $L\leftarrow\textit{loss}(x_{0},x_{j}^{N},x^{P}_{i}|\theta)$
    $\theta\leftarrow\theta-r*\partial L/\partial\theta$
    $i\leftarrow i+1$
end while
$x_{j}^{PN}\leftarrow\textit{GAN}(x_{j}^{N}|\theta)$
$x_{j}^{P}\leftarrow x_{0}+\gamma\delta(x_{0}-x_{j}^{PN})$
$S\leftarrow\max\{|x_{j}^{P}|,(|N|-|P|)*\beta\}$
while $S\geqslant|x_{j}^{P}|$ do
    $x_{\textit{new}}\leftarrow\textit{SMOTE}(x_{i}^{P},x_{i}^{N},x_{j}^{P})$
    $S\leftarrow S-|x_{\textit{new}}|$
    add $x_{\textit{new}}$
end while

4.5 Analysis of complexity of algorithm

To implement OS-GAN, most of the cost lies in training the GAN. The time complexity is $O(N*\textit{epoch})$. For space complexity, because the structure of the GAN is fixed, the space used to store its parameters is fixed, which means the space complexity of the GAN is $O(1)$. However, k-means and SMOTE need space dynamically, which is linear in the number of majority samples and in the gap between the two classes. The same holds for the regularization and the center calculation. So the space complexity of OS-GAN is $O(kN)$, i.e. linear.

5. Experimental studies and discussion

In this section, OS-GAN is compared with 14 commonly used resampling methods in terms of three metrics on 14 datasets. Because there are only a few positive samples in some datasets, five-fold cross validation is used in our experiments. All results are reported as the average of five-fold cross validation over ten independent runs.

5.1 Datasets and experimental settings

The 14 datasets are collected from KEEL Tool [44] and UCI Repository [45] to demonstrate the performance of the proposed method. The size of these datasets varies from 214 to 1622. The number of features is between 5 and 18. The imbalance ratio (IR) is from 3.25 to 49.69. The description of datasets is shown in Table 1.

Table 1
Details of datasets

ID	Dataset	Positive	Negative	IR
1	Vehicle0	199	647	3.25
2	New-thyroid1	35	180	5.14
3	New-thyroid2	35	180	5.14
4	Ecoli2	52	284	5.46
5	Glass6	29	185	6.38
6	Ecoli3	35	301	8.6
7	Yeast-2vs4	51	463	9.08
8	Yeast-1_vs_7	30	429	14.3
9	Glass4	13	201	15.46
10	Yeast-2vs8	20	462	23.1
11	Yeast5	44	1440	32.73
12	Abalone-21_vs_8	14	567	40.5
13	Yeast6	35	1449	41.4
14	Abalone-19_vs_10-11-12-13	32	1590	49.69
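
The IR column of Table 1 agrees with the ratio of negative to positive samples. The sketch below recomputes it for a few rows, assuming IR is defined as $|N|/|P|$ (the text does not state the definition explicitly); the helper name `imbalance_ratio` is illustrative.

```python
# (dataset, positives, negatives, IR as listed in Table 1)
TABLE_1_ROWS = [
    ("Vehicle0", 199, 647, 3.25),
    ("Glass4", 13, 201, 15.46),
    ("Yeast5", 44, 1440, 32.73),
    ("Abalone-19_vs_10-11-12-13", 32, 1590, 49.69),
]

def imbalance_ratio(n_pos, n_neg):
    """Imbalance ratio, assumed here to be majority size over minority size."""
    return n_neg / n_pos

# Recomputed IR and total dataset size for each listed row.
computed = {name: imbalance_ratio(p, n) for name, p, n, _ in TABLE_1_ROWS}
sizes = {name: p + n for name, p, n, _ in TABLE_1_ROWS}
```

The recomputed values match the listed ones to two decimal places, and the totals for Glass4 (214) and Abalone-19_vs_10-11-12-13 (1622) match the dataset size range quoted above.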

Three classifiers, namely support vector machine (SVM) [46], multilayer perceptron (MLP) [47] and random forest (RF) [4], are used to assess the efficiency of resampling methods. From a large number of experimental results, the coefficient $k$ in K-means is set to 4, 5, 5, 2, 3, 10, 11, 20, 2, 8, 20, 8, 20 and 5 for each dataset, respectively. We use SVM with linear kernel and the parameter $\gamma$ in (3) is set to 2 in all numerical experiments.

Figure 5 shows the sensitivity of OS-GAN to the balanced level parameter $\beta$ in (4) on all the 14 datasets in terms of Gm of SVM, which guides the selection of $\beta$ for OS-GAN on each dataset. The candidate values of $\beta$ are set to 0.5, 0.75, 1.0 and 1.25. As we can see, the balanced level indeed affects the performance of OS-GAN to different degrees. A similar tendency is visible for other resampling methods but is not plotted here because of limited space. This suggests that the sizes of the positive and negative classes in the rebalanced dataset need not be the same. In all following experiments, we choose the best $\beta$ parameter according to the elbow point of the Gm line for OS-GAN and other approaches.

For the GAN, in the following experiments, the learning rates of $D$ and $G$ are both 0.002. As described in Section 4, the inputs of $G$ are the selected negative samples and the real data for $D$ are all the positive samples. The batch size for $D$ is the number of the selected positive samples. Both $D$ and $G$ contain two hidden layers. The numbers of neurons of the second and third layers, namely the fully connected layers, are 5 and 3, respectively.

All experiments are implemented in Python 3.6 and run on a computer with an i7-8750H CPU, 8.00 GB of RAM and a 64-bit operating system. To evaluate the performance of resampling methods, three metrics, accuracy (Acc), F1-score (Fm) and G-mean (Gm) [48], are introduced based on the confusion matrix shown in Table 2.

Table 2

Confusion matrix

	Predicted positive	Predicted negative	Total
Positive sample	TP	FN	TP $+$ FN
Negative sample	FP	TN	FP $+$ TN
Total	TP $+$ FP	FN $+$ TN	T

Figure 5.

The sensitivity of OS-GAN to the balanced level parameter $\beta$ in terms of Gm of SVM.

Acc is the proportion of samples that the algorithm predicts correctly, but it is not reliable enough on imbalanced datasets. It is defined as follows:

$\displaystyle\textit{Acc}=\frac{TP+TN}{TP+TN+FP+FN}$ (5)

$F_{m}$ is the harmonic mean of precision and recall and can be determined as follows:

$\displaystyle F_{m}=2*\frac{\frac{TP}{TP+FP}*\frac{TP}{TP+FN}}{\frac{TP}{TP+FP}+\frac{TP}{TP+FN}}$ (6)

$G_{m}$ is the geometric mean of the true positive rate and the true negative rate. It is one of the most significant metrics for evaluating the performance of machine learning models on imbalanced datasets. It is defined as follows:

$\displaystyle G_{m}=\sqrt{\frac{TN}{TN+FP}*\frac{TP}{TP+FN}}$ (7)
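
The three metrics in (5)-(7) can be computed directly from the confusion-matrix counts, as in the sketch below. The function name `imbalance_metrics`, the convention of returning 0 when a denominator vanishes, and the example counts are assumptions for illustration; the class sizes in the examples are those of Glass4.

```python
import math

def imbalance_metrics(tp, fn, fp, tn):
    """Return (Acc, Fm, Gm) from confusion-matrix counts, per Eqs. (5)-(7)."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    # Guard against empty denominators (a convention assumed in this sketch).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    if precision + recall:
        fm = 2 * precision * recall / (precision + recall)
    else:
        fm = 0.0
    gm = math.sqrt(specificity * recall)
    return acc, fm, gm

# A classifier that labels every sample negative (13 positives, 201 negatives).
acc_trivial, fm_trivial, gm_trivial = imbalance_metrics(tp=0, fn=13, fp=0, tn=201)

# A hypothetical classifier that recovers 10 of the 13 positives.
acc_better, fm_better, gm_better = imbalance_metrics(tp=10, fn=3, fp=20, tn=181)
```

The trivial classifier reaches an accuracy of about 0.94 while its Fm and Gm are both 0, which illustrates why Acc alone is not reliable on imbalanced datasets. The second classifier has lower accuracy but clearly higher Fm and Gm.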

5.2 Experimental design and results

45 combinations of three classifiers with 15 comparison resampling methods are shown in Table 3 where OS-GAN is abbreviated as OG. Among these comparison methods, SMOTE [10], RandomOversampling (RO) [49] and borderline-SMOTE (BS) [11] are classical oversampling methods; RandomUndersampling (RU) [29] and NearMiss (NM) [50] are two classical undersampling methods; Tomek-SMOTE (TS) [51] is a top hybrid sampler; self-organizing map oversampling (SOMO) [52], MCT [53], adaptive semi-unsupervised weighted oversampling (ASUWO) [54], CCR [55], SMOTE-D (SD) [56] and kmeans-SMOTE (KS) [57] are recent top performer oversamplers. Although unrolled generative adversarial networks (URG) [40] and generative adversarial minority oversampling (GAMO) [58] are two popular GAN-based oversampling methods for image generation, they are included in the comparison to explore their performance on numerical datasets.

Table 3
45 combinations of three classifiers with 15 comparison resampling methods

Classifier	Resampling	Abbreviation	Classifier	Resampling	Abbreviation	Classifier	Resampling	Abbreviation
SVM	RO	SO	MLP	RO	MO	RF	RO	RFO
	SMOTE	SS		SMOTE	MS		SMOTE	RS
	BS	SB		BS	MB		BS	RB
	KS	SK		KS	MK		KS	RK
	RU	SU		RU	MU		RU	RFU
	NM	SN		NM	MN		NM	RN
	TS	ST		TS	MT		TS	RT
	CCR	SC		CCR	MC		CCR	RC
	ASUWO	SA		ASUWO	MA		ASUWO	RA
	SOMO	SSO		SOMO	MSO		SOMO	RSO
	MCT	SM		MCT	MM		MCT	RM
	SD	SSD		SD	MSD		SD	RSD
	URG	SUR		URG	MUR		URG	RUR
	GAMO	SGA		GAMO	MGA		GAMO	RGA
	OG	SG		OG	MG		OG	RG

Figure 6.

Visualization of oversampling process with OS-GAN on synthetic dataset and ID 6.

In order to gain insight into the effect of OS-GAN, the two rows of Fig. 6 show the visualization of the oversampling process with OS-GAN on a two-dimensional synthetic dataset and the real-world dataset ID6, respectively. The two subfigures in the first column of Fig. 6 are the raw datasets, in which the negative and positive samples are shown with green and blue points, respectively. Each class of the synthetic dataset follows a 2-dimensional Gaussian distribution. The second column shows the data generated by the GAN. The third column shows the data obtained from the generated data by (3). The last column shows the totally rebalanced dataset with SMOTE. According to the results, OS-GAN not only generates positive samples around the region of the classification boundary, but also generates positive samples scattered outside the domain of the original positive class. This can strengthen the generalization ability of classifiers and is not something traditional SMOTE variants can do.

Table 4

Test Gm (%) of 45 approaches on ID1-ID5 datasets

Dataset	ID1			ID2			ID3			ID4			ID5
Classifier	SVM	MLP	RF	SVM	MLP	RF	SVM	MLP	RF	SVM	MLP	RF	SVM	MLP	RF
RO	78.65	57.35	83.46	98.38	96.84	97.81	98.49	97.72	97.95	75.71	81.92	91.75	87.60	88.09	95.09
	4.82	5.23	6.06	0.16	1.96	0.21	0.11	0.73	0.96	1.37	1.40	1.28	1.96	1.80	1.10
SMOTE	95.27	88.36	94.16	98.36	95.35	98.00	98.42	98.58	98.14	79.32	80.09	90.72	90.12	90.10	94.44
	0.45	2.81	0.60	0.16	1.45	0.46	0.17	0.39	0.72	1.05	1.05	1.68	0.90	1.80	1.48
BS	95.06	87.57	94.49	98.42	97.61	97.49	98.49	97.85	97.67	78.57	79.31	92.60	89.77	81.78	95.32
	0.44	2.19	1.28	0.21	1.01	0.76	0.11	0.77	0.77	2.44	2.45	2.10	1.42	16.94	2.37
KS	95.71	88.03	94.39	98.36	97.98	98.41	98.37	98.16	98.82	78.67	80.51	92.83	91.97	90.56	95.41
	0.58	1.25	0.93	0.16	0.75	0.99	0.14	0.64	0.80	1.53	1.26	1.13	1.09	1.53	0.63
RU	93.38	87.16	86.05	95.23	72.00	93.81	95.26	66.43	94.10	75.13	76.12	80.44	85.77	23.23	86.16
	0.89	2.11	0.62	1.43	30.30	2.13	1.59	26.60	1.94	1.74	0.79	3.35	2.75	22.03	3.37
NM	94.88	73.85	64.95	98.36	85.99	97.20	98.17	89.04	97.95	63.09	64.53	67.18	88.48	31.60	86.77
	0.41	8.37	0.44	0.16	11.94	0.85	0.44	9.53	0.98	1.12	1.87	2.54	1.74	31.10	2.61
TS	95.68	87.68	94.53	98.39	98.24	98.35	98.42	98.71	98.06	79.46	80.11	92.14	90.12	87.56	93.37
	0.56	2.14	0.80	0.19	0.78	0.92	0.05	0.00	0.65	0.93	0.74	1.12	1.76	2.11	2.38
CCR	94.66	87.11	94.48	98.43	97.29	96.74	97.96	98.01	97.27	78.39	77.82	87.87	88.04	88.25	94.88
	0.34	1.69	1.07	0.45	1.02	1.33	0.54	0.72	0.81	0.94	1.02	2.59	2.43	1.38	2.10
ASUWO	94.46	87.75	93.61	98.37	98.45	98.14	98.25	98.38	98.12	78.26	80.09	92.64	90.53	91.31	96.26
	0.43	1.35	0.72	0.39	0.52	1.11	0.47	0.98	0.88	1.43	1.43	2.20	1.97	1.05	1.44
SOMO	95.65	88.50	95.35	98.33	11.62	98.55	98.36	40.89	98.34	87.95	86.31	92.66	92.04	22.43	95.29
	0.54	2.02	0.65	0.15	15.52	1.03	0.16	31.83	1.78	1.70	1.58	1.82	0.83	28.35	1.20
MCT	94.73	87.78	94.71	98.46	98.11	97.89	98.34	98.21	98.18	79.01	79.80	91.64	88.48	86.49	95.39
	0.42	2.21	0.83	0.14	0.61	1.08	0.55	1.12	0.96	1.23	0.69	1.37	2.16	1.22	2.30
SD	93.77	89.60	93.99	96.89	96.32	96.25	96.06	97.31	96.83	86.68	85.01	92.50	92.66	91.46	94.35
	0.56	2.01	0.76	1.85	0.97	1.97	1.30	0.88	1.18	1.91	1.94	1.48	1.07	1.01	2.73
URG	95.77	88.23	93.41	91.39	90.87	93.67	89.84	88.76	94.08	86.79	74.65	83.32	86.95	77.36	90.80
	1.12	13.52	1.94	6.45	14.71	5.44	12.82	20.46	8.39	10.07	27.88	11.70	10.63	26.72	11.00
GAMO	40.51	54.74	94.98	3.34	30.36	88.20	2.89	30.85	82.92	0.00	2.73	76.73	0.00	1.63	87.40
	10.48	24.51	1.87	13.74	28.63	9.04	11.60	30.09	7.60	0.00	9.43	13.89	0.00	8.00	14.53
OG	95.20	89.81	95.42	98.70	98.36	98.16	98.50	97.61	98.21	81.06	82.97	92.84	92.83	91.56	96.20
	0.56	0.99	0.39	0.47	0.95	1.20	0.77	1.58	0.88	1.17	1.27	0.83	1.33	2.15	0.69

Table 5

Test Gm (%) of 45 approaches on ID6–ID10 datasets (for each resampling method, the first row is the mean and the second row the standard deviation)

Dataset	ID6			ID7			ID8			ID9			ID10
Classifier	SVM	MLP	RF	SVM	MLP	RF	SVM	MLP	RF	SVM	MLP	RF	SVM	MLP	RF
RO	63.08	71.50	76.03	75.77	78.98	88.63	40.35	43.41	43.98	64.17	75.76	76.55	78.65	57.35	83.46
	1.36	1.08	3.73	1.23	3.37	1.97	1.80	1.67	15.74	5.41	4.03	28.37	4.82	5.23	6.06
SMOTE	69.17	69.62	74.31	81.79	81.08	85.62	49.20	48.16	56.99	69.12	66.30	88.89	89.65	81.40	67.00
	1.27	0.76	2.86	1.25	1.39	1.60	2.87	1.15	6.10	2.90	2.62	3.62	17.89	4.66	19.28
BS	65.75	68.37	74.80	78.43	78.23	87.38	48.23	47.76	49.92	71.68	67.76	83.57	87.74	77.71	62.43
	1.35	1.55	2.42	1.15	2.18	1.67	3.51	1.24	13.89	3.84	5.19	13.45	10.24	10.38	33.25
KS	71.65	70.90	76.38	89.30	88.34	90.81	39.22	44.02	46.45	72.42	78.20	71.33	92.13	92.13	66.70
	1.71	1.34	5.65	1.35	1.24	1.82	3.49	3.22	19.37	3.49	12.29	29.42	10.82	10.82	38.52
RU	58.94	65.46	63.56	75.59	70.53	75.55	36.21	40.73	37.18	59.93	23.75	60.74	95.33	36.66	34.70
	0.91	2.16	2.27	1.58	2.51	1.08	2.60	1.78	3.21	3.10	23.53	3.39	1.15	4.04	2.70
NM	42.50	36.62	31.38	74.93	67.80	65.71	34.64	27.82	28.37	67.81	34.07	59.91	95.33	32.78	25.74
	1.21	1.84	2.30	1.47	1.94	3.04	2.17	1.15	1.42	4.92	24.93	8.19	1.15	2.18	0.81
TS	69.68	70.50	76.92	82.44	80.44	85.89	51.16	48.39	61.96	68.43	78.00	90.22	95.02	80.91	75.87
	1.25	0.83	2.46	0.78	1.56	2.79	3.73	2.95	7.41	2.26	3.75	5.67	1.24	3.49	9.38
CCR	64.88	67.06	78.19	82.09	82.73	87.68	51.75	47.36	64.00	67.33	63.63	86.18	89.84	75.23	88.13
	1.01	1.07	4.55	2.07	1.86	1.37	1.15	1.84	8.89	4.41	4.57	5.67	11.51	3.37	11.40
ASUWO	66.97	70.29	75.09	77.25	71.37	90.98	52.77	49.82	22.05	71.69	67.33	65.00	80.10	70.00	55.53
	1.44	0.88	4.04	2.47	24.08	2.49	6.85	4.95	16.73	4.13	3.70	34.10	4.29	4.07	36.23
SOMO	0.00	7.08	80.36	96.49	78.80	90.17	0.00	5.83	39.29	22.78	1.38	77.74	95.27	88.55	56.37
	0.00	8.80	4.59	0.15	26.45	2.57	0.00	8.91	33.39	16.11	4.14	22.14	1.08	23.01	34.77
MCT	67.83	69.13	72.97	81.29	80.33	89.26	50.43	49.34	57.84	69.40	56.18	85.88	87.44	77.71	64.83
	0.93	1.18	6.90	1.32	2.25	1.31	2.99	1.88	21.16	3.49	19.16	22.19	17.57	17.46	28.59
SD	0.00	75.05	80.44	96.55	74.95	89.47	0.00	15.21	43.78	13.67	18.54	51.61	91.82	74.95	73.49
	0.00	6.24	7.51	0.06	37.92	3.01	0.00	14.02	17.35	15.25	14.94	32.39	6.95	32.67	33.53
URG	71.51	77.55	66.70	76.63	79.95	84.39	10.63	11.90	34.61	45.23	44.99	52.67	71.90	71.42	62.88
	32.23	21.14	10.09	22.81	17.69	7.97	21.83	23.12	23.23	37.46	39.88	29.65	15.49	17.96	26.86
GAMO	0.00	0.75	69.95	0.00	39.01	84.35	0.00	0.00	47.22	0.00	2.31	50.15	0.00	31.46	62.90
	0.00	5.25	12.30	0.00	25.47	5.50	0.00	0.00	21.68	0.00	11.31	31.09	0.00	33.32	15.55
OG	75.50	76.31	80.68	90.10	85.55	88.48	55.07	50.74	54.49	72.74	75.69	90.66	95.33	75.67	83.70
	1.31	1.21	2.04	0.99	1.30	1.00	4.56	2.52	6.39	2.71	2.86	5.42	1.15	3.89	3.63

Table 6

Test Gm (%) of 45 approaches on ID11–ID14 datasets (for each resampling method, the first row is the mean and the second row the standard deviation)

Dataset	ID11			ID12			ID13			ID14
Classifier	SVM	MLP	RF	SVM	MLP	RF	SVM	MLP	RF	SVM	MLP	RF
RO	55.48	67.49	85.70	66.68	66.80	53.34	40.60	44.00	68.25	24.57	22.97	21.44
	0.59	2.06	1.47	3.66	2.77	30.22	0.47	1.86	2.67	0.84	1.91	1.78
SMOTE	56.74	63.65	83.26	71.82	67.57	71.11	48.08	49.28	72.59	30.02	27.30	24.19
	0.35	0.82	1.81	3.50	4.41	5.50	0.51	0.66	5.23	0.91	1.59	7.64
BS	57.09	63.56	84.48	76.30	71.40	51.97	46.75	52.24	77.65	28.72	27.13	3.13
	0.67	0.54	2.39	7.52	18.50	35.24	1.31	1.07	3.16	3.31	6.26	6.53
KS	56.35	66.31	84.32	62.66	75.88	56.19	42.62	53.73	72.85	22.30	22.34	15.70
	0.49	0.90	1.32	7.51	2.38	29.21	4.56	5.70	3.89	1.70	2.89	5.84
RU	48.31	56.03	57.65	28.63	39.84	36.84	37.61	36.13	39.22	14.40	22.10	19.68
	0.67	1.44	1.51	2.32	6.95	4.39	0.82	2.94	1.29	1.69	1.14	0.94
NM	57.35	57.56	64.37	22.76	17.91	15.71	15.67	18.31	15.66	14.63	15.67	13.36
	3.51	1.49	1.65	8.97	2.63	1.12	1.44	2.08	0.58	1.10	0.36	0.62
TS	58.20	65.71	84.56	74.73	69.88	65.70	48.43	49.58	73.11	29.82	27.12	20.98
	0.61	1.17	1.52	4.29	4.94	10.50	0.93	1.19	4.30	0.82	1.46	4.53
CCR	53.64	59.57	88.37	64.86	60.63	64.83	46.62	51.02	76.33	29.96	25.45	6.97
	0.42	0.61	3.29	3.32	4.57	10.74	0.34	1.51	19.10	1.73	5.31	6.67
ASUWO	59.14	66.04	83.86	69.55	73.83	52.62	47.34	52.61	81.23	30.04	24.23	1.95
	1.19	1.22	2.53	8.51	5.99	24.48	1.49	1.09	6.41	1.82	6.04	3.98
SOMO	0.00	76.49	87.65	0.00	45.71	40.05	0.00	4.53	66.67	0.00	0.00	1.40
	0.00	4.48	3.74	0.00	25.19	23.44	0.00	7.18	19.91	0.00	0.00	4.20
MCT	57.88	65.48	87.84	77.68	71.98	56.38	48.43	48.46	74.51	32.70	27.39	3.38
	0.56	0.97	1.93	3.14	3.47	24.86	0.80	1.24	15.00	1.76	1.48	6.89
SD	0.00	78.37	85.48	0.00	3.97	52.36	0.00	11.14	71.48	0.00	0.00	0.00
	0.00	4.07	5.31	0.00	7.95	18.26	0.00	14.56	26.35	0.00	0.00	0.00
URG	39.54	57.56	66.17	61.27	70.73	38.94	47.24	42.76	45.85	42.93	24.57	0.00
	39.61	23.60	13.42	30.69	26.23	34.27	24.95	26.27	16.97	24.18	26.01	0.00
GAMO	0.00	7.53	61.49	0.00	1.15	38.01	0.00	0.00	41.80	0.00	0.00	0.81
	0.00	16.37	15.03	0.00	8.08	36.30	0.00	0.00	20.27	0.00	0.00	5.70
OG	61.30	68.55	85.52	79.36	80.27	75.99	52.65	57.52	78.09	27.67	27.50	27.47
	1.04	1.17	2.24	6.26	6.29	5.21	1.01	1.66	1.97	2.56	0.96	4.60

Because there are too many numbers to show in a single table, the test Gm results of the 45 approaches on the 14 datasets are split across Tables 4–6. The best values are shown in bold. According to these three tables, OS-GAN obtains at least one best result on each dataset. The remaining results of OS-GAN are close to the best, and none of them is the worst. We also find that OS-GAN wins more often when the IR of the dataset exceeds 10 or the dataset is relatively large. This may be because GAN is a statistical learner, so the larger the number of samples, the better the learning effect. Datasets with a high IR can provide more information for OS-GAN than datasets with a low IR because of the majority-sample selection technique in OS-GAN. In addition, the standard deviations of the approaches with OS-GAN are low, which indicates that our method is stable. Overall, OS-GAN outperforms the other resampling methods in terms of Gm, a metric that plays an important role in evaluating imbalanced learning.
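To illustrate why Gm is informative under imbalance, the following minimal sketch computes Acc, Fm and Gm from predicted labels. It assumes the standard definitions (Gm as the geometric mean of the true positive rate and the true negative rate, Fm as the F1-score), which we take to match those given earlier in the paper. The function name `binary_metrics` and the toy labels are illustrative only.

```python
import math


def binary_metrics(y_true, y_pred, positive=1):
    """Return accuracy, F1-score and G-mean for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    tpr = tp / (tp + fn) if tp + fn else 0.0        # recall on positives
    tnr = tn / (tn + fp) if tn + fp else 0.0        # recall on negatives
    precision = tp / (tp + fp) if tp + fp else 0.0

    acc = (tp + tn) / len(pairs)
    f1 = 2 * precision * tpr / (precision + tpr) if precision + tpr else 0.0
    gm = math.sqrt(tpr * tnr)
    return {"Acc": acc, "Fm": f1, "Gm": gm}


# Toy 9:1 dataset and a classifier that always predicts the majority class.
y_true = [0] * 9 + [1]
y_pred = [0] * 10
print(binary_metrics(y_true, y_pred))
```

On this toy example the accuracy is 0.9 while Fm and Gm are both 0. Under these definitions, a Gm of 0.00 in Tables 4–6 indicates that one of the two classes was missed entirely, however high the accuracy may be.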

Figure 7.

Test Acc of 45 approaches on the 14 datasets.

Figure 7 shows the test Acc of the 45 approaches on the 14 datasets. Our proposed method is shown in black. As we can see, SG, MG and RG are at the top on almost all datasets. In particular, RG outperforms the other approaches on 12 datasets, which means that OS-GAN fits well with random forest. The high Acc means that OS-GAN preserves the ability of machine learning models to identify negative samples while improving the stability with which they detect positive samples. We also find that the two classical undersampling methods, RU and NM, not only produce lower values but are also unstable on datasets with high IR. This may be due to the loss of information caused by removing some negative samples. Apart from these two undersamplers, the GAN-based oversampling method URG also does not perform as well as the other approaches.

Figure 8.

Test Fm (%) of 42 approaches on the 14 datasets.

Figure 8 shows the test Fm of 42 approaches on the 14 datasets. We notice that the performance of GAMO is very poor in Gm, where 14 results are zero. This phenomenon draws our attention to its AUC values, because AUC provides an aggregate measure of performance across all possible classification thresholds. Under cross validation, most average AUC values of GAMO on the 14 datasets are close to 0.5. This means that approaches using GAMO as the oversampler have no discriminative capacity on these classification problems, so there is no need to further analyze its performance in Fm. According to Fig. 8, OS-GAN obtains the highest Fm value in many cases. As the IR increases, the number of times OS-GAN surpasses the other approaches increases. We also find that there is little difference in the performance of most approaches on datasets with low IR, such as ID1–ID4, but the difference becomes apparent when the IR is high. This means that resampling methods produce different balanced results and classifiers are sensitive to the balanced data. Selecting an appropriate resampling method is therefore extremely important for classification problems.
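To make the meaning of an AUC near 0.5 concrete, here is a small sketch that computes AUC as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, counting ties as one half. This pairwise formulation is a standard equivalent of the area under the ROC curve. The function name `pairwise_auc` and the scores are hypothetical and are not taken from the experiments.

```python
def pairwise_auc(pos_scores, neg_scores):
    """AUC as P(score_pos > score_neg), with ties counted as 0.5."""
    wins = 0.0
    for sp in pos_scores:
        for sn in neg_scores:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# A scorer that ranks every positive above every negative.
print(pairwise_auc([0.9, 0.8], [0.1, 0.2, 0.3]))  # 1.0

# A scorer that outputs the same value for every sample carries no
# ranking information, so the AUC collapses to 0.5.
print(pairwise_auc([0.5, 0.5], [0.5, 0.5, 0.5]))  # 0.5
```

A model whose scores do not rank positives above negatives therefore sits at 0.5 whatever threshold is chosen, which is the situation described for GAMO above.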

Figure 9.

Boxplots of the results of 15 resampling methods over all 14 datasets.

Figure 9 shows the boxplots of the results of the 15 resampling methods over all 14 datasets. Each value is the average of Acc, Fm or Gm over the 14 datasets in one of the ten independent runs. The first panel illustrates the average Acc of SVM with different resampling methods. The second and third panels show the corresponding information indicated by their subtitles. In terms of the Acc of SVM, the best resampling method is the GAN-based URG. However, OS-GAN outperforms the other 14 methods over all datasets in terms of Fm and Gm, and the standard deviation of OS-GAN is clearly smaller than that of the others. This means that OS-GAN is competitive and practical.
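The aggregation behind each box can be sketched as follows: for one resampling method and one metric, the scores of each run are averaged over the datasets, giving one value per run. The function name `per_run_averages` and the numbers below (three runs, four datasets) are hypothetical and only illustrate the shape of the computation.

```python
from statistics import mean


def per_run_averages(scores):
    """scores[r][d] is a metric value for run r on dataset d.

    Returns one dataset-averaged value per run.
    """
    return [mean(run) for run in scores]


# Hypothetical Gm values: 3 runs (rows) x 4 datasets (columns).
gm = [
    [0.90, 0.80, 0.70, 0.60],
    [0.94, 0.80, 0.72, 0.62],
    [0.86, 0.80, 0.68, 0.58],
]
averages = per_run_averages(gm)
print([round(a, 2) for a in averages])
```

The list `averages` is what a single box would summarize. In the experiments it would hold ten values, one per independent run.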

Figure 10.

The mean values and standard deviations of Fm of RG on ID2, ID5 and ID11 for different values of $\gamma$ .

The above experimental results show that if GAN-based oversampling methods are applied directly to numerical datasets, the performance is barely satisfactory. The proposed method works well not only because the GAN learns the data distribution, but also because of the processing applied after the GAN generates samples. This indicates that there is still a fundamental difference between image data and numerical data when GAN is used for oversampling.

To explore the influence of the parameter $\gamma$ in (3) on the performance of our proposed algorithm, we apply RG to the ID2, ID5 and ID11 datasets with $\gamma$ varying from 1 to 5. Figure 10a shows the mean values of Fm and Fig. 10b shows the standard deviations of Fm. According to the results, setting $\gamma$ to 2 is recommended. Overall, as $\gamma$ increases, the standard deviation increases as well, which shows that OS-GAN becomes less stable because of the added randomness.

The above numerical experiments are carried out on datasets of moderate dimension. To explore the performance of OS-GAN on high-dimensional datasets, we apply our method to a 90-dimensional dataset, $\textit{movement\_ libras}\_1-13\_vs\_14-15$ , from the UCI Repository. It contains 360 samples and its IR is 6.5. As shown in Table 7, our method still obtains strong results on the high-dimensional dataset.

Table 7

The performance of OS-GAN on a high-dimension dataset: $\textit{movement\_ libras}\_1-13\_vs\_14-15$

	Acc	Fm	Gm
SVM	0.9017 $\pm$ 0.01	0.6376 $\pm$ 0.03	0.7797 $\pm$ 0.03
MLP	0.9525 $\pm$ 0.01	0.8246 $\pm$ 0.03	0.8985 $\pm$ 0.02
RF	0.9594 $\pm$ 0.01	0.8446 $\pm$ 0.03	0.9221 $\pm$ 0.02

To evaluate the robustness of our proposed algorithm, we add noisy samples to the positive class of the ID2 dataset by flipping the labels of negative samples. The noise ratio is set to 5%, 10% and 20%, respectively. The relative error $e$ is introduced to measure the rate of change of Acc, Fm and Gm, and is defined as follows.

$\displaystyle e=\frac{A-B}{A}\times 100\%$ (8)

where $A$ and $B$ are the values of Acc, Fm or Gm on the original and the noisy data, respectively.
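Equation (8) and the label-flipping step can be sketched as follows. The helper names are hypothetical, and labels are assumed to be 0 (negative) and 1 (positive). The noise ratio is assumed here to be a fraction of the negative samples. The text does not state which count the ratio refers to, so this is only one possible reading.

```python
import random


def relative_error(original, noisy):
    """Relative error of Eq. (8), in percent: (A - B) / A * 100."""
    return (original - noisy) / original * 100.0


def flip_negative_labels(labels, noise_ratio, seed=0):
    """Relabel a random fraction of negative samples (0) as positive (1).

    Assumption: noise_ratio is measured against the number of negatives.
    """
    rng = random.Random(seed)
    negatives = [i for i, y in enumerate(labels) if y == 0]
    n_flip = round(noise_ratio * len(negatives))
    flipped = set(rng.sample(negatives, n_flip))
    return [1 if i in flipped else y for i, y in enumerate(labels)]


# Hypothetical dataset: 100 negatives followed by 10 positives.
labels = [0] * 100 + [1] * 10
noisy_labels = flip_negative_labels(labels, noise_ratio=0.05)
print(sum(noisy_labels))  # 15 positives: 10 original + 5 flipped

# Hypothetical metric values, not taken from Table 8.
print(round(relative_error(0.80, 0.76), 2))  # 5.0
```

A positive $e$ means the metric dropped after noise was injected. Comparing $e$ with the noise ratio, as done for Table 8, shows whether the degradation stays below the amount of noise added.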

Table 8

The relative errors of metrics on ID2 with noise

Classifier	5%			10%			20%
	Acc	Fm	Gm	Acc	Fm	Gm	Acc	Fm	Gm
SVM	0.94%	2.69%	1.26%	2.54%	7.17%	3.56%	4.37%	11.62%	5.49%
MLP	1.12%	3.47%	1.01%	2.44%	7.05%	2.70%	4.54%	12.13%	5.44%
RF	0.89%	2.63%	1.25%	2.17%	6.23%	2.82%	4.00%	10.61%	4.80%

Figure 11.

Acc of classifiers with MGAN, TGAN and OG on 14 datasets.

Table 8 shows the results of the robustness test. The relative errors of Acc and Gm are low and acceptable. Although the relative errors are larger at the 20% noise ratio, they are still smaller than the noise ratio itself. This shows that OS-GAN is robust and can effectively counteract noise.

5.3 Comparison with similar GAN oversamplers

Section 5.2 shows that OS-GAN beats most of the baselines on the 14 benchmarks. However, most of these baselines are based on SMOTE or on GANs designed for images. Here we choose two state-of-the-art GAN samplers that are similar to OS-GAN for comparison. Ref. [59] proposes an oversampler (MGAN) based on a GAN that learns information from the majority class, which is similar to OS-GAN. Ref. [23] focuses on generating tabular data with GAN (TGAN). We run an experiment like the one in Section 5.2 on the same 14 public datasets for a closer comparison. Figure 11 shows that OS-GAN performs better on datasets with few samples and a high imbalance ratio. In addition, the performance of OS-GAN is satisfactory compared with the other similar configurations.

5.4 Ablation experiment

To evaluate the importance of central regularization and reparative SMOTE, we conduct an ablation study that investigates the performance of OS-GAN without regularization (OG-R) and OS-GAN without SMOTE (OG-S). Figure 12 shows the Gm of RF with OG, OG-R and OG-S on the 14 datasets. In summary, regularization helps more on datasets with high overlap or a complex distribution, while SMOTE helps more on datasets with a high imbalance ratio, especially those with a complex distribution and a fuzzy classification boundary. This experiment shows that both central regularization and reparative SMOTE are necessary in most cases to increase the performance of OS-GAN.

Figure 12.

Gm of ablation study of OS-GAN on 14 datasets.

5.5 Exploration of multi-class classification

Although OS-GAN was originally designed for binary classification, we also explore its effectiveness on multi-class classification. First, two multi-class datasets are built from the ecoli2 and ecoli3 datasets, which share the same characteristics. The first dataset, named $\textit{ecoli}-3-124$ (E3), has three classes: ecoli2 and the majority of ecoli3 are chosen, and the initial imbalance ratio is controlled so that $|P|$ in $\textit{ecoli}2:|P|$ in $\textit{ecoli}3:|N|$ in $\textit{ecoli}2=1:2:4$ . The second dataset, named $\textit{ecoli}-4-1234$ (E4), has four classes: ecoli2 and ecoli3 are chosen, and the initial imbalance ratio is controlled so that $|P|$ in $\textit{ecoli}2:|P|$ in $\textit{ecoli}3:|N|$ in $\textit{ecoli}2:|N|$ in $\textit{ecoli}3=1:2:3:4$ . The samples of the four classes are denoted by e1, e2, e3 and e4. In this section we use RF as the classifier and recall as the metric to evaluate the results on the multi-class task.

$\displaystyle\textit{recall}=\frac{TP}{TP+FN}$ (9)
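In the multi-class setting, (9) is evaluated once per class, treating that class as positive and all remaining classes as negative. A minimal sketch under that assumption is given below. The function name `per_class_recall` and the toy labels are hypothetical.

```python
def per_class_recall(y_true, y_pred):
    """Recall TP / (TP + FN) for every class present in y_true."""
    recalls = {}
    for c in sorted(set(y_true)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        recalls[c] = tp / (tp + fn)
    return recalls


# Toy three-class example.
y_true = ["e1", "e1", "e2", "e2", "e2", "e2", "e3", "e3", "e3", "e3"]
y_pred = ["e1", "e2", "e2", "e2", "e2", "e3", "e3", "e3", "e3", "e3"]
print(per_class_recall(y_true, y_pred))
# {'e1': 0.5, 'e2': 0.75, 'e3': 1.0}
```

Reporting one recall per class keeps the smallest class visible, whereas a single overall accuracy would be dominated by the larger classes.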

The sampling process is as follows. First, e3 is oversampled until the ratio of the number of e3 samples to the number of e4 samples is 1:1. Then, taking e3 and e4 as negative and e2 as positive, e2 is oversampled until the ratio of e2 to e4 is 1:1. Finally, taking e2, e3 and e4 as negative and e1 as positive, e1 is oversampled until the ratio of e1 to e4 is 1:1. Thus, the class ratio is 1:1:1:1 after sampling.
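The class-wise procedure can be sketched as a loop around any binary oversampler. In the sketch below, `resample_with_replacement` is only a placeholder that redraws existing positives; it is not OS-GAN, and all names and data are hypothetical. We also assume that synthetic samples from earlier steps are included in the negative set of later steps, which the description above leaves open.

```python
import random


def resample_with_replacement(positives, negatives, n_new, rng):
    """Placeholder binary oversampler, NOT OS-GAN.

    It ignores the negatives and simply redraws existing positives.
    """
    return [rng.choice(positives) for _ in range(n_new)]


def classwise_oversample(data, order, oversampler, seed=0):
    """Balance every class against the largest one, one class at a time.

    data:  dict mapping class name -> list of samples
    order: class names from the largest class to the smallest
    """
    rng = random.Random(seed)
    balanced = {name: list(samples) for name, samples in data.items()}
    target = len(balanced[order[0]])
    processed = [order[0]]
    for name in order[1:]:
        # Classes handled so far act as the negative class for this step.
        negatives = [x for done in processed for x in balanced[done]]
        n_new = max(0, target - len(balanced[name]))
        balanced[name].extend(
            oversampler(balanced[name], negatives, n_new, rng)
        )
        processed.append(name)
    return balanced


# Hypothetical 1:2:3:4 dataset with one-dimensional samples.
data = {
    "e1": [[0.1 * i] for i in range(10)],
    "e2": [[1.0 + 0.1 * i] for i in range(20)],
    "e3": [[4.0 + 0.1 * i] for i in range(30)],
    "e4": [[8.0 + 0.1 * i] for i in range(40)],
}
balanced = classwise_oversample(
    data, ["e4", "e3", "e2", "e1"], resample_with_replacement
)
print({name: len(samples) for name, samples in balanced.items()})
```

Because the binary oversampler is passed in as a function, replacing the placeholder with a real oversampler requires no change to the loop, which mirrors the point that the binary sampling structure stays unchanged.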

Table 9

The performance of OS-GAN on two multi-class datasets

	recall1	recall2	recall3	recall4
E3	0.8013	0.7031	0.8442	/
	0.02	0.04	0.01	/
E4	0.7621	0.7101	0.8670	0.8143
	0.02	0.03	0.02	0.02

Table 9 shows the performance of OS-GAN on the two multi-class datasets. With the oversampling process described above, the results are satisfactory on both datasets. In conclusion, OS-GAN can be applied conveniently to multi-class tasks through a class-wise sampling process, without changing the binary sampling structure of the algorithm.

6. Conclusion

This paper presents a new oversampling method, OS-GAN, in which GAN is used on numerical datasets to learn the distribution of the positive class by converting some negative samples into positive-like ones. Since fake samples generated by GAN tend to be concentrated or to line up, an early-stopping learning strategy is adopted in GAN. After that, a further operation based on the center of the positive class is applied to modify the data generated by GAN, which allows synthetic samples to carry distinguishing positive characteristics. OS-GAN extracts information from both the negative and positive classes to rebalance the dataset and to improve the generalization of classifiers. According to the experimental results on 14 benchmark datasets and one high-dimensional dataset, our method outperforms 14 commonly used resampling methods in terms of G-mean, accuracy and F1-score. This demonstrates that OS-GAN can effectively generate positive samples carrying information from the negative class as well as the positive class. Future work may include exploring new ways to exploit the negative information to sketch further distribution characteristics of datasets.

Acknowledgments

This work was supported in part by the National Key R&D Program of China under Grant 2018AAA0100300, in part by the National College Student Innovation and Entrepreneurship Training Program Support Project under Grant 20211014110020, in part by the Fundamental Research Funds for the Central Universities under Grant DUT22YG236, and in part by the National Natural Science Foundation of China under Grants 11201051, 62172073 and 62076182.

Declaration of competing interests

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

1. Haixiang, Yijing, Shang, Mingyun, Yuanyue and Bing, Learning from class-imbalanced data: Review of methods and applications, Expert Systems with Applications 73 (2017), 220–239.

2. Spelmen V.S. and Porkodi, A review on handling imbalanced data, in: 2018 International Conference on Current Trends towards Converging Technologies (ICCTCT), IEEE, 2018, pp. 1–11.

3. Wang, Xin and Xu, A novel deep metric learning model for imbalanced fault diagnosis and toward open-set classification, Knowledge-Based Systems 220 (2021), 106925.

4. Y.-S., Chi, Shao X.-Y., M.-L. and Xu B.-G., A novel random forest approach for imbalance problem in crime linkage, Knowledge-Based Systems 195 (2020), 105738.

5. Wang, Deng and Wang, Imbalance-XGBoost: Leveraging weighted and focal losses for binary label-imbalanced classification with XGBoost, Pattern Recognition Letters 136 (2020), 190–197.

6. Guo, Liu and Lu, A Dynamic Ensemble Learning Algorithm based on K-means for ICU mortality prediction, Applied Soft Computing 103 (2021), 107166.

7. Wong M.L., Seng and Wong P.K., Cost-sensitive ensemble of stacked denoising autoencoders for class imbalance problems in business domain, Expert Systems with Applications 141 (2020), 112918.

8. Iranmehr, Masnadi-Shirazi and Vasconcelos, Cost-sensitive support vector machines, Neurocomputing 343 (2019), 50–64.

9. Tsai C.-F., Lin W.-C., Y.-H. and Yao G.-T., Under-sampling class imbalanced datasets by combining clustering analysis and instance selection, Information Sciences 477 (2019), 47–54.

10. Chawla N.V., Bowyer K.W., Hall L.O. and Kegelmeyer W.P., SMOTE: Synthetic minority over-sampling technique, Journal of Artificial Intelligence Research 16 (2002), 321–357.

11. Han, Wang W.-Y. and Mao B.-H., Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning, in: International Conference on Intelligent Computing, Springer, 2005, pp. 878–887.

12. Liang, Jiang, Xue and Wang, LR-SMOTE – An improved unbalanced data set oversampling based on K-means and SVM, Knowledge-Based Systems 196 (2020), 105845.

13. Zhu, Zhang, Gong and Fan, SMOTE-NaN-DE: Addressing the noisy and borderline examples problem in imbalanced classification by natural neighbors and differential evolution, Knowledge-Based Systems, 2021, 107056.

14. Pan, Zhao and Yang, Learning imbalanced datasets based on SMOTE and Gaussian distribution, Information Sciences 512 (2020), 1214–1233.

15. Fernández, Garcia, Herrera and Chawla N.V., SMOTE for learning from imbalanced data: progress and challenges, marking the 15-year anniversary, Journal of Artificial Intelligence Research 61 (2018), 863–905.

16. Goodfellow I.J., Pouget-Abadie, Mirza, Warde-Farley, Ozair, Courville and Bengio, Generative adversarial networks, arXiv preprint arXiv:1406.2661, 2014.

17. Sampath, Maurtua, Martín J.J.A. and Gutierrez, A survey on generative adversarial networks for imbalance problems in computer vision tasks, Journal of Big Data 8(1) (2021), 1–59.

18. Roy S.K., Haut J.M., Paoletti M.E., Dubey S.R. and Plaza, Generative adversarial minority oversampling for spectral-spatial hyperspectral image classification, IEEE Transactions on Geoscience and Remote Sensing, 2021.

19. Hao, Wang, Zhang and Yang, Annealing genetic GAN for minority oversampling, arXiv preprint arXiv:2008.01967, 2020.

20. Niu, Wang and Lin, Defect image sample generation with GAN for improving defect recognition, IEEE Transactions on Automation Science and Engineering 17(3) (2020), 1611–1622.

21. J.-H., Hong J.Y. and Baek J.-G., Oversampling method using outlier detectable generative adversarial network, Expert Systems with Applications 133 (2019), 1–8.

22. Skoularidou, Cuesta-Infante and Veeramachaneni, Modeling tabular data using conditional gan, arXiv preprint arXiv:1907.00503, 2019.

23. and Veeramachaneni, Synthesizing tabular data using generative adversarial networks, arXiv preprint arXiv:1811.11264, 2018.

24. Almutairi and Janicki, On relationships between imbalance and overlapping of datasets, in: CATA, 2020, pp. 141–150.

25. Tuysuzoglu and Birant, Enhanced bagging (eBagging): A novel approach for ensemble learning, Int. Arab. J. Inf. Technol 17(4) (2020), 515–528.

26. Svetnik, Wang, Tong, Liaw, Sheridan R.P. and Song, Boosting: An ensemble learning tool for compound classification and QSAR modeling, Journal of Chemical Information and Modeling 45(3) (2005), 786–799.

27. Zhang and Wang, The OCS-SVM: An objective-cost-sensitive SVM with sample-based misclassification cost invariance, IEEE Access 7 (2019), 118931–118942.

28. Fatourechi, Ward R.K., Mason S.G., Huggins, Schloegl and Birch G.E., Comparison of evaluation metrics in classification applications with imbalanced datasets, in: 2008 Seventh International Conference on Machine Learning and Applications, IEEE, 2008, pp. 777–782.

29. Tahir M.A., Kittler and Yan, Inverse random under sampling for class imbalance problem and its application to multi-label classification, Pattern Recognition 45(10) (2012), 3738–3750.

30. Walia and Babyn, Generative adversarial network in medical imaging: A review, Medical Image Analysis 58 (2019), 101552.

31. Yang, Wang, Liu and Guo, Controllable artistic text style transfer via shape-matching gan, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 4442–4451.

32. Lutz, Amplianitis and Smolic, Alphagan: Generative adversarial networks for natural image matting, arXiv preprint arXiv:1807.10088, 2018.

33. Eskimez S.E., Dimitriadis, Gmyr and Kumanati, GAN-Based Data Generation for Speech Emotion Recognition, in: INTERSPEECH, 2020, pp. 3446–3450.

34. Yang L.-C., Chou S.-Y. and Yang Y.-H., MidiNet: A convolutional generative adversarial network for symbolic-domain music generation, arXiv preprint arXiv:1703.10847, 2017.

35. C.-Y., Xue M.-X., Chang C.-C., Lee C.-R. and Su, Play as you like: Timbre-enhanced multi-modal music style transfer, in: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, 2019, pp. 1061–1068.

36. Nazari and Branco, On Oversampling via Generative Adversarial Networks under Different Data Difficulty Factors, in: Third International Workshop on Learning with Imbalanced Domains: Theory and Applications, PMLR, 2021, pp. 76–89.

37. Ren, Liu and Liu, EWGAN: Entropy-based Wasserstein GAN for imbalanced learning, in: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, 2019, pp. 10011–10012.

38. Zhou, Liang, Song, Wang, Zhang and Zhang, Lipschitz generative adversarial nets, in: International Conference on Machine Learning, PMLR, 2019, pp. 7584–7593.

39. Arjovsky, Chintala and Bottou, Wasserstein generative adversarial networks, in: International Conference on Machine Learning, PMLR, 2017, pp. 214–223.

40. Metz, Poole, Pfau and Sohl-Dickstein, Unrolled generative adversarial networks, arXiv preprint arXiv:1611.02163, 2016.

41. Weng, Zhang, Chen, Song, Hsieh C.-J., Daniel, Boning and Dhillon, Towards fast computation of certified robustness for relu networks, in: International Conference on Machine Learning, PMLR, 2018, pp. 5276–5285.

42. Yin, Goudriaan, Lantinga E.A., Vos and Spiertz H.J., A flexible sigmoid function of determinate growth, Annals of Botany 91(3) (2003), 361–371.

43. Lillicrap T.P., Santoro, Marris, Akerman C.J. and Hinton, Backpropagation and the brain, Nature Reviews Neuroscience 21(6) (2020), 335–346.

44. Alcalá-Fdez, Sánchez, Garcia, del Jesus M.J., Ventura, Garrell J.M., Otero, Romero, Bacardit, Rivas V.M. et al., KEEL: A software tool to assess evolutionary algorithms for data mining problems, Soft Computing 13(3) (2009), 307–318.

45. Asuncion and Newman, UCI machine learning repository, Irvine, CA, USA, 2007.

46. Kouziokas G.N., SVM kernel based on particle swarm optimized vector and Bayesian optimized SVM in atmospheric particulate matter forecasting, Applied Soft Computing 93 (2020), 106410.

47. Shen, Wang and Guo, MLP neural network-based recursive sliding mode dynamic surface control for trajectory tracking of fully actuated surface vessel subject to unknown dynamics and input saturation, Neurocomputing 377 (2020), 103–112.

48. Vijayakumar and Vinothkanna, Capsule network on font style classification, Journal of Artificial Intelligence 2(02) (2020), 64–76.

49. Zhang, Bhatia, Pandya, Sahinidis N.V., Cao and Flores-Cerrillo, Industrial text analytics for reliability with derivative-free optimization, Computers & Chemical Engineering 135 (2020), 106763.

50. Ishwaran and O’Brien, Commentary: The problem of class imbalance in biomedical data, J Thorac Cardiovasc Surg 1 (2020), 2.

51. Jonathan, Putra P.H. and Ruldeviyani, Observation Imbalanced Data Text to Predict Users Selling Products on Female Daily with SMOTE, Tomek, and SMOTE-Tomek, in: 2020 IEEE International Conference on Industry 4.0, Artificial Intelligence, and Communications Technology (IAICT), IEEE, 2020, pp. 81–85.

52. Douzas and Bacao, Self-Organizing Map Oversampling (SOMO) for imbalanced data set learning, Expert Systems with Applications 82 (2017), 40–52.

53. Jiang, Qiu and Li, A novel minority cloning technique for cost-sensitive learning, International Journal of Pattern Recognition and Artificial Intelligence 29(04) (2015), 1551004.

54. Nekooeimehr and Lai-Yuen S.K., Adaptive semi-unsupervised weighted oversampling (A-SUWO) for imbalanced datasets, Expert Systems with Applications 46 (2016), 405–416.

55. Koziarski and Woźniak, CCR: A combined cleaning and resampling algorithm for imbalanced data classification, International Journal of Applied Mathematics and Computer Science 27(4) (2017).

56. Torres F.R., Carrasco-Ochoa J.A. and Martínez-Trinidad J.F., SMOTE-D a deterministic version of SMOTE, in: Mexican Conference on Pattern Recognition, Springer, 2016, pp. 177–188.

57. Douzas, Bacao and Last, Improving imbalanced learning through a heuristic oversampling method based on k-means and SMOTE, Information Sciences 465 (2018), 1–20.

58. Mullick S.S., Datta and Das, Generative adversarial minority oversampling, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 1695–1704.

59. Sharma, Bellinger, Krawczyk, Zaiane and Japkowicz, Synthetic oversampling with the majority class: A new perspective on handling extreme imbalance, in: 2018 IEEE International Conference on Data Mining (ICDM), IEEE, 2018, pp. 447–456.
