Clustering-based improved adaptive synthetic minority oversampling technique for imbalanced data classification

Abstract

Synthetic Minority Oversampling Technique (SMOTE) and some extensions based on it are popularly used to balance imbalanced data. In this study, we concentrate on solving overfitting of the classification model caused by choosing instances to oversample that increase the occurrence of overlaps with the majority class. Our method called Clustering-based Improved Adaptive Synthetic Minority Oversampling Technique (CI-ASMOTE1) decomposes minority instances into sub-clusters according to their connectivity in the feature space and then selects minority sub-clusters which are relatively close to the decision boundary as the candidate regions to oversample. After application of CI-ASMOTE1, new minority instances are only synthesized within each connected region of the selected sub-clusters. Considering the diversity of the synthetic instances in each selected sub-cluster, CI-ASMOTE2 is put forward to extend CI-ASMOTE1 by keeping all features of those instances in the feature space as different as possible. The experimental evaluation shows that CI-ASMOTE1 and CI-ASMOTE2 improve SMOTE and its extensions, especially in the occurrence of overlaps between the minority instances and the majority instances.

Keywords

Imbalanced data classification, clustering, oversampling, overfitting

1. Introduction

Imbalanced data distribution reduces classification accuracy and has become a serious problem in many practical applications [1, 2]. The issue arises because traditional machine learning methods tend to build classification models that favor the majority class [3, 4]. Although such models generally achieve high overall accuracy, they may lead to unsatisfactory decisions because the minority class often carries high misclassification costs in applications such as credit risk assessment [5, 6], medical diagnosis [7, 8], software defect prediction [9, 10], bearing fault diagnosis [11], and Web spam detection [12].

Oversampling techniques have proven effective for imbalanced data classification [13]. The earliest method is random oversampling [14], and SMOTE [15] is a typical improvement of it. In SMOTE, new instances are synthesized along the line segment between two original minority instances. However, SMOTE suffers from overfitting [16] because it aggravates overlaps between classes by blindly using all the original minority instances to generate new instances. SMOTE was followed by a series of further oversampling techniques. Some extensions of SMOTE, such as Borderline-SMOTE [17] and the Adaptive Synthetic Sampling Approach (ADASYN) [18], oversample minority instances using prior knowledge of the available majority instances. These extensions, which improve SMOTE to different extents, mainly aim to identify hard-to-learn instances to oversample because those instances carry information that is beneficial for classification. However, they either do not fully utilize the distribution information of instances or do not update the random selection of the nearest neighbors. Therefore, they may oversample overlapping and noisy instances, which makes learning harder and worsens overfitting. These strategies are especially questionable when the class distribution is extremely skewed. In that case, the minority class is very sparse relative to the majority class, which increases the likelihood of class mixture.

In this work, an improved oversampling method called Clustering-based Improved Adaptive Synthetic Minority Oversampling Technique (CI-ASMOTE1) is presented for the problem of imbalanced data classification. CI-ASMOTE1 decomposes minority instances into different sub-clusters according to their connectivity in the feature space and then selects semi-safe minority sub-clusters as candidate sub-clusters to oversample. To synthesize more instances that are beneficial for classification, it assigns higher oversampling weights to the selected sub-clusters that are sparser and closer to the majority class. CI-ASMOTE1 oversamples each selected sub-cluster by SMOTE to avoid generating overlapping and noisy instances, which reduces overfitting. We also consider the diversity of the synthetic instances within the selected sub-clusters. Therefore, CI-ASMOTE2 extends CI-ASMOTE1 with two-step SMOTE (TSMOTE) to provide a useful solution to the data imbalance issue in the real world. CI-ASMOTE2 generates only instances that carry class-discriminative information while increasing the diversity of the synthetic instances.

The rest of this paper is organized as follows. Section 2 introduces SMOTE and other oversampling methods that are relevant to imbalanced data classification. The steps of the two proposed methods are explained in Section 3. The experimental results are analyzed in Section 4, and the conclusion is drawn in Section 5.

2. Literature survey

In this section, relevant methods for imbalanced data classification are presented. First, the random oversampling technique and SMOTE are introduced. Then, some well-known extensions are discussed.

Oversampling techniques are the most primitive and traditional processing methods for imbalanced data [19]. The simplest technique of them is random oversampling. It randomly selects some original minority instances and replicates them until the sample numbers of the two classes are roughly equal. New instances generated by random oversampling overlap with original minority instances, which may cause serious overfitting. To alleviate this issue, SMOTE [15] synthesizes new instances along the line segment between one randomly selected minority instance and its K-nearest neighbors of the minority class. However, SMOTE may aggravate overlaps between the two classes, as it generates new minority instances without taking the distribution of the majority class into account. In other words, SMOTE may oversample overlapping and noisy instances of the minority class. Besides, SMOTE also oversamples uninformative instances that are far from the decision boundary. Various extensions based on SMOTE have been put forward to better identify minority instances from the imbalanced data. The sampling technique of Borderline-SMOTE [17] is the same as SMOTE, but it only oversamples minority instances that are in the proximity of the decision boundary. If the separability of the imbalanced data is good, Borderline-SMOTE will be effective in improving the classification result of the data. Otherwise, Borderline-SMOTE may increase the complexity of the classification because new instances generated by it are closer to unsafe regions and may overlap with majority instances. The increased complexity may make the decision boundary between the two classes more difficult to identify. Safe-Level-SMOTE [20] defines a safe level for each minority instance according to the distribution of the minority class in its K-nearest neighbors. 
To reduce the generation of overlapping and noisy instances, Safe-Level-SMOTE introduces weights when calculating new instances so that synthetic instances lie closer to the minority instances with the higher safe level. Unfortunately, this method is prone to overfitting because its synthetic instances are uninformative instances that are far away from the decision boundary. Another method that assigns weights to minority instances is ADASYN [18]. ADASYN assigns a bigger oversampling weight to a minority instance that has more majority class neighbors, so more instances are synthesized from the minority instances that are more difficult to learn. However, like Borderline-SMOTE, ADASYN leads to overfitting if the original imbalanced data contain overlaps between classes. Synthetic minority oversampling technique with natural neighbors (NaNSMOTE) [21] determines the neighbor number of each minority instance based on the surrounding distribution of the instance. NaNSMOTE can decrease the generation of noisy instances by assigning a smaller neighbor number to a minority instance that is closer to the majority class. This method still carries a risk of overfitting because it cannot prevent the synthesis of overlapping instances when the imbalanced data have serious overlaps between classes.

Many clustering-based oversampling methods have been presented to address the issue of overfitting. Cluster-SMOTE [22] uses K-means to divide minority instances into sub-clusters and then oversamples each sub-cluster by SMOTE. However, when determining the oversampling level of each minority sub-cluster, Cluster-SMOTE considers neither the distance between the sub-cluster and the decision boundary nor the complexity of the sub-cluster. Thus, Cluster-SMOTE may expand overlapping areas and cause overfitting. Majority Weighted Minority Oversampling Technique (MWMOTE) [23] is a two-step weighted method. It first finds the minority and majority instances that are close to the decision boundary and then filters informative instances out of the minority class according to the distance from each minority instance to the majority class. Finally, it applies a clustering method to generate new instances based on the filtered instances. MWMOTE decreases the occurrence of overlaps between synthetic instances and majority instances, but it may still produce overfitting because it oversamples minority disjuncts [24] that are mixed into the majority class. Minority disjuncts are regions that are made up of a small number of minority instances [24, 25]. Besides, minority disjuncts that are far away from the majority class cannot be detected by MWMOTE, even though they may carry relevant information for classification. K-means based SMOTE [25] utilizes K-means to divide all instances into sub-clusters of different sizes and then selects sub-clusters that are composed of at least 50% minority instances as safe regions to oversample. This method not only avoids the generation of noisy instances but also reduces intra-class and inter-class imbalances. Adaptive semi-unsupervised weighted oversampling (A-SUWO) [26] is a clustering-based method proposed for easing the issue of overfitting. 
It identifies sub-clusters of the minority class that do not overlap with majority sub-clusters and then oversamples only these minority sub-clusters to reduce overlapping synthetic instances. To alleviate the problem of overfitting caused by noisy instances, A-SUWO removes instances surrounded by instances of the opposite class from the training set before clustering [26]. However, A-SUWO loses classification performance on test data if the removed instances were wrongly recognized as noise, especially when the minority instances are sparsely distributed within the majority class. Improving noise-immunity majority weighted minority oversampling technique (NI-MWMOTE) [27] is another method proposed to address the problem of overfitting caused by noisy instances. It reduces the over-recognition of noisy instances by iteratively using the K-nearest neighbor algorithm and misclassification errors, and then oversamples minority instances according to the aggregative hierarchical clustering algorithm (AHC) [28]. However, the computational process of NI-MWMOTE is relatively complicated, so its efficiency in practical applications is not ideal.

In summary, many extensions based on SMOTE and clustering have been proposed to solve the problem of imbalanced data classification. These methods overcome some shortcomings of traditional random oversampling and SMOTE, but they are less successful in avoiding additional overlapping or noisy instances. In particular, some of these methods exacerbate overlaps when the original data already contain overlaps, because they select candidate instances in the proximity of the decision boundary. They make the decision boundary more complex, which leads to overfitting. Furthermore, in the clustering-based methods, the minority sub-clusters often contain sporadic majority instances (i.e. noisy instances), which also leads to overfitting. Some of these methods remove noisy instances in advance to reduce overfitting, but this can lead to under-fitting for small real-world datasets.

3. Proposed methods

3.1 CI-ASMOTE1

CI-ASMOTE1 is composed of four main procedures: (1) Construct minority sub-clusters, (2) Select candidate minority sub-clusters, (3) Adaptively determine the number of new instances of each candidate sub-cluster, and (4) Synthesize new minority instances.

3.1.1 Construct minority sub-clusters

To avoid the generation of overlapping and noisy instances, Euclidean Distance Clustering Algorithm (EDCA) is proposed in this section. EDCA is based on the Constructive Covering Algorithm (CCA) [29] which can cover instances with the same label. EDCA decomposes minority instances into different sub-clusters according to the connectivity of minority instances. In other words, if one majority instance exists between two minority instances, EDCA will not merge them in the same sub-cluster. The specific procedures of EDCA are as follows.

Suppose the input dataset is ${\bf X}=\{(x_{1},y_{1}),(x_{2},y_{2}),\ldots,(x_{m},y_{m})\}$ , in which $x_{i}=(x_{i}^{1},x_{i}^{2},\ldots,x_{i}^{n})$ and $y_{i}$ is the label of the $i$ -th instance. In the dataset ${\bf X}$ , the minority class is ${\bf MIN}=\{(x_{j},y_{j})|y_{j}=-1\},j=1,\ldots,p$ , and the majority class is ${\bf MAJ}=\{(x_{t},y_{t})|y_{t}=1\},t=1,\ldots,q$ .

Step1. Randomly select one minority instance $x_{o}$ from the ${\bf MIN}$ as the center of a minority sub-cluster and calculate the Euclidean distance between $x_{o}$ and each majority instance.

$\displaystyle D\left({x_{o},x_{t}}\right)=\sqrt{\left({x_{o}^{1}-x_{t}^{1}}\right)^{2}+\ldots+\left({x_{o}^{n}-x_{t}^{n}}\right)^{2}},y_{o}=-1,y_{t}=1,t=1,\ldots,q$ (1)

Step2. Calculate the distance $\omega$ between the center $x_{o}$ and the nearest majority instance of it.

$\displaystyle\omega=\min\left\{{D\left({x_{o},x_{t}}\right)}\right\},t\in\left\{{1,\ldots,q}\right\}$ (2)

Step3. Calculate the Euclidean distance between the center $x_{o}$ and each minority instance $x_{j}$ in ${\bf MIN}$ . If $x_{j}$ satisfies the condition $D(x_{o},x_{j})\leqslant\omega$ , then $x_{j}$ and $x_{o}$ belong to the same connected region. In other words, they form the minority sub-cluster ${\bf C}_{o}$ . If no other minority instance satisfies the condition, $x_{o}$ constitutes the sub-cluster ${\bf C}_{o}$ alone. Then remove the minority instances that have been assigned to ${\bf C}_{o}$ from ${\bf MIN}$ and update ${\bf MIN}$ .

Step4. Calculate the radius $r_{o}$ of the minority sub-cluster ${\bf C}_{o}$ .

$\displaystyle\varphi=\max\left\{D\left(x_{o},x_{j}\right)|0\leqslant D\left(x_{o},x_{j}\right)\leqslant\omega\right\},j\in\left\{1,\ldots,p\right\}$ (3)

$\displaystyle r_{o}=\frac{\omega+\varphi}{2}$ (4)

Here, $\varphi$ is the distance between the center $x_{o}$ and the instance in the sub-cluster ${\bf C}_{o}$ that is farthest from $x_{o}$ .

Step5. Repeat steps 1 to 4 until the ${\bf MIN}$ is updated to the empty set. This will construct $l$ minority sub-clusters in the end. We denote the set of these minority sub-clusters as ${\bf C}_{\text{MIN}}=\{{\bf C}_{1},\ldots,{\bf C}_{l}\}$ and the set of corresponding radii as ${\bf R}_{\text{MIN}}=\{r_{1},\ldots,r_{l}\}$ .
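Steps 1 to 5 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `edca`, the plain-tuple data layout, and the `seed` parameter are assumptions made for the example, and the center is treated as a member of its own sub-cluster because its distance to itself is zero.

```python
import math
import random


def edca(minority, majority, seed=0):
    """Cover minority points with sub-clusters (sketch of Steps 1-5).

    Returns a list of (member_indices, radius) pairs; indices refer to `minority`.
    """
    rng = random.Random(seed)
    remaining = list(range(len(minority)))  # indices still in MIN
    clusters = []
    while remaining:
        # Step 1: pick a random remaining minority instance as the center.
        o = rng.choice(remaining)
        center = minority[o]
        # Step 2: omega is the distance from the center to its nearest majority instance.
        omega = min(math.dist(center, t) for t in majority)
        # Step 3: remaining minority instances within omega join the sub-cluster.
        # The center always qualifies because its distance to itself is 0.
        dist = {j: math.dist(center, minority[j]) for j in remaining}
        members = [j for j in remaining if dist[j] <= omega]
        # Step 4: the radius is the mean of omega and the farthest member distance phi.
        phi = max(dist[j] for j in members)
        clusters.append((members, (omega + phi) / 2.0))
        # Step 5: remove the covered instances from MIN and repeat.
        remaining = [j for j in remaining if dist[j] > omega]
    return clusters
```

Because the center is chosen at random, different seeds can produce different sub-clusters on the same data; the sketch only guarantees that every minority instance ends up in exactly one sub-cluster.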

3.1.2 Select candidate minority sub-clusters

The purpose of oversampling methods is to increase the recognition rate of the minority class and improve the overall classification result on imbalanced data [30]. Adding key minority instances is one useful way to improve how well the classification model identifies minority instances. Therefore, it is vital to find a method that can select key minority instances as candidates to oversample. In most oversampling methods, the selection of candidate minority instances is binary [17, 20]: candidate minority instances are labeled either safe or unsafe according to their distances to the decision boundary. Some oversampling methods only consider candidate minority instances in the proximity of the decision boundary and neglect the influence of other minority instances on the performance of the classifier. In fact, minority instances can be divided into three groups when selecting candidate instances. The first group consists of minority instances that are far away from the decision boundary. These instances carry only a little class-discriminative information, so if they are selected as candidates, the synthesized instances contribute little to the classification. The second group consists of minority instances that are very close to majority instances that are obviously clustered. Selecting these instances to oversample can easily increase the overlap between the two classes and cause overfitting. The third group consists of the minority instances outside the above two groups. These instances are relatively close to the decision boundary and contain a lot of class-discriminative information. Thus, this group is more suitable for oversampling than the other two groups.

In this paper, CI-ASMOTE1 divides minority sub-clusters into the three groups mentioned above, according to the sample number of each minority sub-cluster and the proportion of majority instances in K-nearest neighbors of each minority sub-cluster. In some oversampling methods, the $K$ value for K-nearest neighbor algorithm is set artificially. Different values of $K$ can affect whether the minority instance is defined as safe or unsafe. In the CI-ASMOTE1, the value of $K$ is adaptively set to the sample number of the largest minority sub-cluster to determine which group each minority sub-cluster belongs to.

$\displaystyle K=\max\left\{\text{Num}\left({\bf C}_{i}\right)\right\},i\in\left\{1,\ldots,l\right\}$ (5)

In Eq. (5), $\text{Num}({\bf C}_{i})$ represents the number of instances within the minority sub-cluster ${\bf C}_{i}$ .

Then, the proportion $H_{i}$ of the majority class in the K-nearest neighbors of each minority sub-cluster ${\bf C}_{i}$ is calculated by Eq. (6).

$\displaystyle H_{i}=\frac{\sum\limits_{j=1}^{\text{Num}\left({{\bf C}_{i}}\right)}{\Delta_{ij}}}{K*\text{Num}\left({{\bf C}_{i}}\right)},i=1,\ldots,l$ (6)

Here, $\Delta_{ij}$ refers to the number of majority instances in the K-nearest neighbors of the $j$ -th minority instance within the sub-cluster ${\bf C}_{i}$ .

According to $\text{Num}({\bf C}_{i})$ and $H_{i}$ , CI-ASMOTE1 divides minority sub-clusters into the three groups as follows.

(1) The first group is minority sub-clusters which contain two or more instances and are far from majority instances. We define it as the set of safe minority sub-clusters ${\bf C}_{\text{MIN}}^{s}$ .

$\displaystyle{\bf C}_{\text{MIN}}^{s}=\left\{{\bf C}_{i}|\text{Num}\left({{\bf C}_{i}}\right)\geqslant 2\wedge 0\leqslant H_{i}<\frac{1}{4}\right\}$ (7)

(2) The second group is minority sub-clusters which contain two or more instances and are very close to majority instances. Besides, minority sub-clusters that contain only one instance also belong to the second group. We define this group as the set of dangerous minority sub-clusters ${\bf C}_{\text{MIN}}^{d}$ .

$\displaystyle{\bf C}_{\text{MIN}}^{d}=\left\{{\bf C}_{i}|\text{Num}\left({\bf C}_{i}\right)=1\vee\left(\text{Num}\left({\bf C}_{i}\right)\geqslant 2\wedge\frac{3}{4}<H_{i}\leqslant 1\right)\right\}$ (8)

(3) The third group is minority sub-clusters which contain two or more instances and are relatively close to majority instances. We define it as the set of semi-safe minority sub-clusters ${\bf C}_{\text{MIN}}^{ss}$ .

$\displaystyle{\bf C}_{\text{MIN}}^{ss}=\left\{{\bf C}_{i}|\text{Num}\left({\bf C}_{i}\right)\geqslant 2\wedge\frac{1}{4}\leqslant H_{i}\leqslant\frac{3}{4}\right\}$ (9)

To prevent the issue of overfitting and improve the classification result of the imbalanced data as well as possible, CI-ASMOTE1 chooses ${\bf C}_{\text{MIN}}^{ss}=\{{\bf C}_{1}^{ss},\ldots,{\bf C}_{h}^{ss}\}$ as candidate minority sub-clusters. The set of corresponding radii is ${\bf R}_{\text{MIN}}^{ss}=\{r_{1}^{ss},\ldots,r_{h}^{ss}\}$ .
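The grouping rule of Eqs. (5) to (9) can be illustrated with the following sketch. The function name `group_subclusters` and the data layout are hypothetical. The text does not state which instances are searched for the K-nearest neighbors, so the sketch assumes they are taken among all other instances of the dataset.

```python
import math


def group_subclusters(points, labels, clusters, minority_label=-1):
    """Label each minority sub-cluster as safe, dangerous, or semi-safe.

    `clusters` is a list of index lists into `points` (minority rows only).
    """
    K = max(len(c) for c in clusters)  # Eq. (5)
    H, groups = [], []
    for c in clusters:
        delta_sum = 0
        for j in c:
            # Assumption: neighbors are searched among all other instances.
            others = [i for i in range(len(points)) if i != j]
            others.sort(key=lambda i: math.dist(points[j], points[i]))
            # Delta_ij: number of majority instances among the K nearest neighbors.
            delta_sum += sum(labels[i] != minority_label for i in others[:K])
        h = delta_sum / (K * len(c))  # Eq. (6)
        H.append(h)
        if len(c) == 1 or h > 0.75:
            groups.append("dangerous")  # Eq. (8)
        elif h < 0.25:
            groups.append("safe")  # Eq. (7)
        else:
            groups.append("semi-safe")  # Eq. (9)
    return K, H, groups
```

Only the sub-clusters labeled semi-safe would be passed on to the weighting and synthesis steps.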

3.1.3 Adaptively determine the number of new instances of each candidate sub-cluster

CI-ASMOTE1 adaptively determines the number of synthetic instances of each candidate minority sub-cluster according to two factors. The first factor is the sparsity of each candidate sub-cluster. The second factor is the average Euclidean distance between each candidate sub-cluster and majority instances within its K-nearest neighbors. The sparsity of each candidate sub-cluster is the inverse of its density. The density of one sub-cluster is defined as the ratio of the sample number of it to the covering area formed by it. If the density of the sub-cluster is small, it means that the dispersion of its internal instances is great. Thus, new instances synthesized in the candidate minority sub-cluster that has the higher sparsity will have the smaller possibility of overlapping with original minority instances. In addition, for each candidate minority sub-cluster, if the average Euclidean distance between it and majority instances within its K-nearest neighbors is smaller, new instances generated within it will contain more useful information for the classification. To synthesize more instances that are beneficial to the classification, CI-ASMOTE1 assigns the bigger oversampling weight to the candidate minority sub-cluster which has bigger sparsity and is closer to majority neighbors. The steps of this section are as follows.

Step 1. Calculate the sparsity of each sub-cluster in ${\bf C}_{\text{MIN}}^{ss}$ .

$\displaystyle\textit{sparsity}\left({\bf C}_{i}^{ss}\right)=\frac{\pi*r_{i}^{ss}*r_{i}^{ss}}{\text{Num}\left({\bf C}_{i}^{ss}\right)},i=1,\ldots,h$ (10)

Step 2. Calculate the average Euclidean distance $\overline{D}({\bf C}_{i}^{ss})$ between the sub-cluster ${\bf C}_{i}^{ss}$ and majority instances within its K-nearest neighbors.

$\displaystyle\overline{D}\left({\bf C}_{i}^{ss}\right)=\frac{1}{\text{Num}\left({\bf C}_{i}^{ss}\right)}\sum_{j=1}^{\text{Num}\left({\bf C}_{i}^{ss}\right)}\left[\frac{1}{\Delta_{ij}}\sum_{\alpha=1}^{\Delta_{ij}}D\left(x_{ij},x_{ij(\alpha)}\right)\right],y_{ij}=-1,y_{ij\left(\alpha\right)}=1,i=1,\ldots,h$ (11)

In the above formula, $\text{Num}({\bf C}_{i}^{ss})$ is the sample number of the sub-cluster ${\bf C}_{i}^{ss}$ , $x_{ij}$ is the $j$ -th instance in ${\bf C}_{i}^{ss}$ , $\Delta_{ij}$ is the number of majority instances within the K-nearest neighbors of $x_{ij}$ , and $x_{ij(\alpha)}$ is the $\alpha$ -th nearest majority neighbor of $x_{ij}$ .

Step 3. Assign the oversampling weight $W_{i}$ to each candidate minority sub-cluster according to $\textit{sparsity}({\bf C}_{i}^{ss})$ and $\overline{D}({\bf C}_{i}^{ss})$ .

$\displaystyle W_{i}=\textit{sparsity}\left({\bf C}_{i}^{ss}\right)+\frac{1}{\overline{D}\left({\bf C}_{i}^{ss}\right)},i=1,\ldots,h$ (12)

Step 4. Adaptively determine the number $N_{i}$ of new instances of the candidate minority sub-cluster ${\bf C}_{i}^{ss}$ based on its oversampling weight. The specific formula is as follows:

$\displaystyle N_{i}=\left({q-p}\right)*\frac{W_{i}}{\sum\limits_{k=1}^{h}{W_{k}}},i=1,\ldots,h$ (13)

Here, $q$ is the number of original majority instances and $p$ is the number of original minority instances.
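As a small worked example of Eqs. (10), (12) and (13), the sketch below turns sub-cluster sizes, radii and average majority distances into per-cluster synthesis counts. The function name `allocate_new_instances` is hypothetical, the average distances of Eq. (11) are assumed to be computed beforehand, and rounding $N_{i}$ to the nearest integer is an assumption, since the text does not say how fractional values are handled.

```python
import math


def allocate_new_instances(sizes, radii, mean_major_dist, q, p):
    """Return (weights, counts) for the candidate sub-clusters.

    sizes[i]           -- Num(C_i^ss)
    radii[i]           -- r_i^ss
    mean_major_dist[i] -- average distance of Eq. (11), precomputed
    q, p               -- numbers of majority and minority instances
    """
    # Eq. (10): covering area divided by the number of instances.
    sparsity = [math.pi * r * r / n for r, n in zip(radii, sizes)]
    # Eq. (12): sparser sub-clusters closer to the majority class weigh more.
    weights = [s + 1.0 / d for s, d in zip(sparsity, mean_major_dist)]
    total = sum(weights)
    # Eq. (13), rounded to whole instances (assumption).
    counts = [round((q - p) * w / total) for w in weights]
    return weights, counts


# Two sub-clusters with equal radius and equal distance to the majority class:
# the smaller (sparser) one receives the larger share of the 80 new instances.
weights, counts = allocate_new_instances([2, 4], [1.0, 1.0], [1.0, 1.0], q=100, p=20)
print(counts)
```

With independent rounding the counts do not always sum exactly to $q-p$; a real implementation would need some rule for distributing the remainder.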

3.1.4 Synthesize new minority instances

In this section, CI-ASMOTE1 applies SMOTE to each candidate minority sub-cluster. SMOTE [15] randomly selects two different original instances $x_{i1}$ and $x_{i2}$ in the candidate minority sub-cluster ${\bf C}_{i}^{ss}$ and then synthesizes a new instance between them according to Eq. (14).

$\displaystyle x_{\text{new}}=x_{i1}+\text{rand}\left({0,1}\right)*\left|{x_{i1}-x_{i2}}\right|$ (14)

The above step is repeated until the corresponding number $N_{i}$ of new instances are inserted in each candidate minority sub-cluster. Finally, combine all synthetic instances with original instances and then output the balanced dataset.
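The synthesis step inside one candidate sub-cluster can be sketched as follows. The function name `smote_in_subcluster` is hypothetical. Note one interpretive assumption: the sketch uses the signed difference $x_{i2}-x_{i1}$ in place of the absolute-value term of Eq. (14), so that every new instance lies on the line segment between the two selected instances, as the text describes.

```python
import random


def smote_in_subcluster(points, n_new, seed=0):
    """Synthesize n_new instances between random pairs of points in one sub-cluster."""
    rng = random.Random(seed)
    new_points = []
    for _ in range(n_new):
        # Two different original instances from the same sub-cluster.
        x1, x2 = rng.sample(points, 2)
        gap = rng.random()  # rand(0, 1)
        # Linear interpolation: x1 + gap * (x2 - x1), feature by feature.
        new_points.append(tuple(a + gap * (b - a) for a, b in zip(x1, x2)))
    return new_points
```

Since both end points come from the same connected sub-cluster, the interpolated instance stays inside the region that the sub-cluster covers.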

3.2 CI-ASMOTE2

Since new instances generated by SMOTE only exist on the line of two selected minority instances, instances synthesized by CI-ASMOTE1 in the same sub-cluster may be similar. Considering this defect, CI-ASMOTE2 is further proposed in this paper. CI-ASMOTE2 improves the fourth procedure of CI-ASMOTE1. It uses two-step SMOTE (TSMOTE) instead of SMOTE to insert diversified and class-discriminating instances in each candidate minority sub-cluster. TSMOTE selects three original instances from the candidate minority sub-cluster to synthesize new instances. New instances will randomly exist in the region composed of the three original instances. The specific procedures of TSMOTE are as follows.

Step 1. Synthesize the intermediary instance $x_{\text{in}}$ .

$\displaystyle x_{\text{in}}=x_{i1}+\text{rand}\left({0,1}\right)*\left|{x_{i1}-x_{i2}}\right|$ (15)

In this step, $x_{i1}$ and $x_{i2}$ are two original instances randomly selected from the candidate minority sub-cluster ${\bf C}_{i}^{ss}$ . If there are only two original instances in the sub-cluster ${\bf C}_{i}^{ss}$ , the intermediary instance $x_{\text{in}}$ is taken as the new minority instance $x_{\text{new}}$ . In other words, a ${\bf C}_{i}^{ss}$ that contains only two instances is oversampled by SMOTE. Otherwise, continue with Step 2.

Step 2. Synthesize new minority instance $x_{\text{new}}$ .

$\displaystyle x_{\text{new}}=x_{\text{in}}+\text{rand}\left({0,1}\right)*\left|{x_{\text{in}}-x_{i3}}\right|$ (16)

Here, $x_{i3}$ is the third original minority instance selected from the sub-cluster ${\bf C}_{i}^{ss}$ . It is different from the above two instances.

Step 3. Repeat Steps 1 and 2 until $N_{i}$ new instances are generated in the candidate sub-cluster ${\bf C}_{i}^{ss}$ .

Step 4. Output the balanced dataset by combining synthetic instances with original instances.
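The two-step synthesis can be sketched in the same style. The names `lerp` and `tsmote_in_subcluster` are hypothetical, and, as an interpretive assumption, signed differences replace the absolute-value terms of Eqs. (15) and (16) so that each new instance falls inside the triangle spanned by the three selected instances.

```python
import random


def lerp(a, b, t):
    """Point at fraction t of the way from a to b."""
    return tuple(ai + t * (bi - ai) for ai, bi in zip(a, b))


def tsmote_in_subcluster(points, n_new, seed=0):
    """Two-step SMOTE inside one sub-cluster with at least two instances."""
    rng = random.Random(seed)
    new_points = []
    for _ in range(n_new):
        # Pick up to three distinct original instances.
        picks = rng.sample(points, min(3, len(points)))
        # Step 1: intermediary instance between the first two picks.
        x_in = lerp(picks[0], picks[1], rng.random())
        if len(picks) == 2:
            # Only two instances available: fall back to plain SMOTE.
            new_points.append(x_in)
        else:
            # Step 2: move from the intermediary instance toward the third pick.
            new_points.append(lerp(x_in, picks[2], rng.random()))
    return new_points
```

Compared with plain SMOTE, the second interpolation spreads the synthetic instances over a two-dimensional region instead of a single segment, which is the source of the added diversity.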

The proposed CI-ASMOTE2 method is described in Algorithm 1.

Algorithm 1: CI-ASMOTE2
Input: Original imbalanced dataset ${\bf X}=\left\{{\left({x_{1},y_{1}}\right),\left({x_{2},y_{2}}\right),\ldots,\left({x_{m},y_{m}}\right)}\right\}$ .
Output: Balanced dataset ${\bf U}$ after oversampling.
Specific procedures:
1.	Standardize the dataset ${\bf X}$ . Then, split it into the group of minority class ${\bf MIN}=\left\{\left({x_{j},y_{j}}\right)|y_{j}=-1\right\},j=1,\ldots,p$ and the group of majority class ${\bf MAJ}=\left\{\left({x_{t},y_{t}}\right)|y_{t}=1\right\},t=1,\ldots,q$ .
2.	Initialize ${\bf C}_{\text{MIN}}=\left\{\right\}$ , ${\bf R}_{\text{MIN}}=\left\{\right\}$ and ${\bf S}=\left\{\right\}$ . ${\bf S}$ is the set of synthetic minority instances.
3.	While ${\bf MIN}\neq\emptyset$ do
4.	Randomly select an instance $x_{o}$ in ${\bf MIN}$ as the center of a sub-cluster;
5.	Compute the distance $\omega$ between $x_{o}$ and its nearest instance in ${\bf MAJ}$ according to Eqs (1)–(2);
6.	Compute the distance $D\left({x_{o},x_{j}}\right)$ between $x_{o}$ and each instance in ${\bf MIN}$ (contains $x_{o}$ );
7.	If $D\left({x_{o},x_{j}}\right)\leqslant\omega$
8.	${\textbf{C}}_{o}\leftarrow x_{o},x_{j}(x_{j}\neq x_{o})$
9.	Else
10.	$\textbf{C}_{o}=\{x_{o}\}$
11.	End If
12.	Compute the radius $r_{o}$ of the minority sub-cluster ${\bf C}_{o}$ according to Eqs (3)–(4);
13.	${\textbf{R}}_{\text{MIN}}\leftarrow r_{o}$
14.	${\textbf{C}}_{\text{MIN}}\leftarrow{\textbf{C}}_{o}$
15.	${\bf MIN}={\bf MIN}\setminus{\bf C}_{o}$
16.	End While
17.	Get the set of minority sub-clusters ${\bf C}_{\text{MIN}}=\left\{{\bf C}_{1},\ldots,{\bf C}_{l}\right\}$ and the set of corresponding radii ${\bf R}_{\text{MIN}}=\left\{{r_{1},\ldots,r_{l}}\right\}$ .
18.	Adaptively obtain the $K$ value of the K-nearest neighbor algorithm according to Eq. (5).
19.	For each ${\bf C}_{i}$ in ${\bf C}_{\text{MIN}}$
20.	Compute $H_{i}$ of ${\bf C}_{i}$ according to Eq. (6).
21.	End
22.	Choose ${\bf C}_{\text{MIN}}^{ss}=\left\{{{\bf C}_{1}^{ss},\ldots,{\bf C}_{h}^{ss}}\right\}$ as candidate sub-clusters according to Eq. (9). The set of corresponding radii is ${\bf R}_{\text{MIN}}^{ss}=\left\{r_{1}^{ss},\ldots,r_{h}^{ss}\right\}$ .
23.	For each ${\bf C}_{i}^{ss}$ in ${\bf C}_{\text{MIN}}^{ss}$
24.	Compute the $\textit{sparsity}\left({\bf C}_{i}^{ss}\right)$ according to Eq. (10).
25.	Compute the $\overline{D}\left({\bf C}_{i}^{ss}\right)$ according to Eq. (11).
26.	Adaptively obtain the number $N_{i}$ of new instances that need to be inserted into ${\bf C}_{i}^{ss}$ according to Eqs (12)–(13).
27.	End
28.	For each ${\bf C}_{i}^{ss}$ in ${\bf C}_{\text{MIN}}^{ss}$
29.	If $\text{Num}\left({\bf C}_{i}^{ss}\right)=2$
30.	Synthesize $N_{i}$ new instances by SMOTE.
31.	Else If $\text{Num}\left({\bf C}_{i}^{ss}\right)>2$
32.	Synthesize $N_{i}$ new instances by TSMOTE.
33.	End If
34.	Obtain the set of new instances ${\bf S}_{i}=\left\{{x_{\text{new}}^{1},\ldots,x_{\text{new}}^{N_{i}}}\right\}$ in the ${\bf C}_{i}^{ss}$ .
35.	End
36.	Get the set of synthetic minority instances ${\bf S}=\left\{{\bf S}_{1},\ldots,{\bf S}_{h}\right\}$ .
37.	Output ${\bf U}={\bf S}\cup{\bf X}$ .

4. Experiments and analyses

4.1 Datasets and experiments

To validate the effectiveness of CI-ASMOTE1 and CI-ASMOTE2, we evaluated them on a simulated dataset and eight real-world datasets provided by the University of California Irvine (UCI). We compared them with five other oversampling methods: SMOTE [15], Borderline-SMOTE [17], Safe-Level-SMOTE [20], Cluster-SMOTE [22] and A-SUWO [26]. The parameters of each oversampling method were set according to the corresponding reference, and the oversampling rate was 100%. The simulated dataset is two-dimensional. Its minority class is drawn with mean $[-1.5, 2]$ and covariance matrix [0.25, 0; 0, 0.6]. Its majority class is drawn with mean $[-2, 1]$ and covariance matrix [0.3, 0; 0, 0.35]. Real-world datasets with more than two classes were transformed into two-class datasets by the method described in reference [31]. The specific information of the nine imbalanced datasets is shown in Table 1. These imbalanced datasets have no noisy or missing values in any feature. Figure 1 shows the distribution of the 8 real-world datasets in the first three feature spaces. In it, stars represent minority instances and hollow triangles represent majority instances. It can be seen from Fig. 1 that minority instances and majority instances of these datasets have some similar features in the first three feature spaces. This indicates that these real-world datasets have overlaps between classes. To ensure that the proposed methods are not restricted to a particular classifier, two different classifiers were used to build classification models. Naive Bayes Classifier (NBC) [32] is a traditional linear classifier with a stable classification performance and a good operation speed, so it was used in this research. Support Vector Machine with Radial Basis Function (RBFSVM) [33, 34] is a non-linear classifier. 
It is usually used to solve the classification of two-class datasets. RBFSVM performs well both on low-dimensional data and high-dimensional data, so it was applied in this study. The stratified ten-fold cross-validation was used in this study. Experiments were repeated three times for each dataset to reduce the impact of randomness on the results. The reported classification results of these datasets are the average results of 30 experiments. In addition, the classification results of these datasets with No-sampling were also given.

Table 1
The specific information of the nine imbalanced datasets used in the experiment

Dataset	Minority class	Majority class	Features	Instances	Imbalanced ratio
Simulated	“-1”	“1”	2	280	1:4.00
Heart-disease (Heart)	“1,2,3,4”	“0”	10	294	1:1.77
Glass	“2”	All other	9	214	1:1.82
Segmentation (Seg)	“Foliage” “Window”	All other	19	210	1:2.50
Ecoli	“IM”	All other	7	336	1:3.36
Yeast	“MIT”	All other	8	1484	1:5.08
Libra	“2, 3”	All other	90	360	1:6.50
Abalone6 (Aba6)	“6”	All other	7	1253	1:15.06
Abalone5 (Aba5)	“5”	All other	7	1253	1:34.80
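
A dataset with the shape of the simulated one in Table 1 could be generated along the following lines. The sketch assumes that each class is Gaussian (the text gives only the mean and covariance matrix), and the function names and the seed are hypothetical. Because both covariance matrices are diagonal, the two features can be sampled independently.

```python
import math
import random

def sample_class(n, mean, variances, label, rng):
    """Draw n labelled points from a Gaussian with a diagonal covariance matrix."""
    points = []
    for _ in range(n):
        # Zero off-diagonal covariance means the features are independent,
        # so each one is drawn from its own 1-D normal distribution.
        x = [rng.gauss(m, math.sqrt(v)) for m, v in zip(mean, variances)]
        points.append((x, label))
    return points

def make_simulated_dataset(n_total=280, ratio=4, seed=0):
    """Build a two-class dataset with imbalanced ratio 1:ratio."""
    rng = random.Random(seed)        # hypothetical seed
    n_min = n_total // (ratio + 1)   # 56 minority instances for 280 at 1:4
    n_maj = n_total - n_min          # 224 majority instances
    minority = sample_class(n_min, [-1.5, 2.0], [0.25, 0.6], -1, rng)
    majority = sample_class(n_maj, [-2.0, 1.0], [0.3, 0.35], 1, rng)
    return minority + majority

data = make_simulated_dataset()
print(len(data), sum(1 for _, y in data if y == -1))
```

The two class means are only about one standard deviation apart on each feature, which is consistent with the overlap between classes that the simulated dataset is meant to exhibit.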

4.2 Evaluation indexes

The confusion matrix [35], which summarizes the classification result of imbalanced data, is shown in Table 2. In it, minority instances are marked as the positive class and majority instances as the negative class. Different performance indexes can be derived from it. In this study, the evaluation indexes are F1-Measure (F1-M) and G-Mean (G-M).

Table 2
The confusion matrix

	Positive class	Negative class
Positive prediction	True Positive (TP)	False Positive (FP)
Negative prediction	False Negative (FN)	True Negative (TN)

F1-M is the harmonic mean of Precision and Recall and reflects how well the classification model identifies minority instances. Precision is the proportion of TP instances among the instances predicted to be positive. Recall, also called the true positive rate (TPR), is the proportion of positive instances that are correctly identified. A larger F1-M means a more satisfactory classification result for the minority class. F1-M is computed with the following formulae.

$\displaystyle\textit{Precision}=\frac{\textit{TP}}{\textit{TP}+\textit{FP}}$ (17)

$\displaystyle\textit{TPR}=\frac{\textit{TP}}{\textit{TP}+\textit{FN}}$ (18)

$\displaystyle\textit{F1-M}=\frac{2*\textit{Precision}*\textit{TPR}}{\textit{Precision}+\textit{TPR}}$ (19)
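
Equations (17) to (19) translate directly into code. The sketch below is illustrative only: the function names and the example counts are hypothetical, and returning NaN when a ratio is undefined is an assumed convention. It is one plausible reading of the NaN entries in the result tables, which would arise if a classifier predicted no instance as positive.

```python
import math

def precision(tp, fp):
    # Eq. (17); undefined (NaN) when nothing is predicted positive.
    return tp / (tp + fp) if (tp + fp) > 0 else math.nan

def tpr(tp, fn):
    # Eq. (18); undefined (NaN) when there are no positive instances.
    return tp / (tp + fn) if (tp + fn) > 0 else math.nan

def f1_measure(tp, fp, fn):
    # Eq. (19); NaN is propagated when Precision or TPR is undefined,
    # or when both are zero.
    p, r = precision(tp, fp), tpr(tp, fn)
    if math.isnan(p) or math.isnan(r) or (p + r) == 0:
        return math.nan
    return 2 * p * r / (p + r)

# Hypothetical counts, not taken from the paper's experiments.
print(f1_measure(tp=30, fp=10, fn=20))  # Precision 0.75, TPR 0.6
print(f1_measure(tp=0, fp=0, fn=50))    # no positive predictions at all
```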

Figure 1.

The distribution of 8 real-world datasets in the first three feature spaces.

G-M is a comprehensive index that reflects the classification results of both classes in imbalanced data. It is the geometric mean of TPR and the True Negative Rate (TNR), where TNR is the proportion of negative instances that are correctly classified. G-M is high only if the recognition rates of both majority and minority instances are high. G-M is computed as follows.

$\displaystyle\textit{TNR}=\frac{\textit{TN}}{\textit{TN}+\textit{FP}}$ (20)

$\displaystyle\textit{G-M}=\sqrt{\textit{TPR}*\textit{TNR}}$ (21)
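
A short sketch of Eqs. (20) and (21) shows why G-M is informative on imbalanced data. The counts are hypothetical and describe an imagined test fold with 10 minority and 90 majority instances. Plain accuracy is included only for contrast and is not an index used in this study.

```python
import math

def g_mean(tp, fn, tn, fp):
    tpr = tp / (tp + fn)  # Eq. (18): recognition rate of the minority class
    tnr = tn / (tn + fp)  # Eq. (20): recognition rate of the majority class
    return math.sqrt(tpr * tnr)  # Eq. (21)

# A classifier that labels every instance as majority.
always_majority = g_mean(tp=0, fn=10, tn=90, fp=0)
accuracy_always_majority = (0 + 90) / 100

# A classifier that recognizes 80% of each class.
balanced = g_mean(tp=8, fn=2, tn=72, fp=18)

print(always_majority, accuracy_always_majority, balanced)
```

The always-majority classifier reaches 90% accuracy but a G-M of 0, because the geometric mean collapses as soon as one class is never recognized.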

4.3 Comparison between CI-ASMOTE1 and safe minority sub-clusters oversampling

To confirm the rationality of CI-ASMOTE1, we compared it with a method that uses SMOTE to oversample the safe minority sub-clusters (abbreviated SAFE). Table 3 shows the testing results of the two methods using NBC and RBFSVM on the 9 imbalanced datasets. The better results are highlighted in bold. For both F1-M and G-M, CI-ASMOTE1 performs better on 7 out of 9 datasets when using NBC and on all 9 datasets when using RBFSVM. For the datasets Aba6 and Aba5, which are highly imbalanced and contain serious overlaps between classes, the SAFE method with RBFSVM yields an undefined F1-M (NaN) and a G-M of 0, that is, it no longer improves the recognition rates of minority and majority instances. On the whole, CI-ASMOTE1 is the more useful approach for imbalanced datasets that contain overlaps between classes.

Table 3
The testing results of the SAFE method and CI-ASMOTE1 using NBC and RBFSVM in 9 imbalanced datasets

	NBC				RBFSVM
	F1-M		G-M		F1-M		G-M
Dataset	SAFE	CI-ASMOTE1	SAFE	CI-ASMOTE1	SAFE	CI-ASMOTE1	SAFE	CI-ASMOTE1
Simulated	0.5623	0.6195	0.6842	0.7884	0.3909	0.5333	0.5083	0.7049
Heart	0.7409	0.7284	0.7895	0.7800	0.6397	0.6675	0.7061	0.7330
Glass	0.5748	0.6155	0.5919	0.6013	0.6189	0.6909	0.6830	0.7544
Seg	0.7058	0.7948	0.7891	0.8454	0.9131	0.9365	0.9392	0.9636
Ecoli	0.7275	0.7299	0.8397	0.8538	0.7799	0.7984	0.8433	0.8716
Yeast	0.5997	0.5050	0.7333	0.7450	0.5455	0.5527	0.6504	0.6624
Libra	0.4810	0.4908	0.6587	0.7376	0.7219	0.8430	0.7745	0.9074
Aba6	0.2731	0.3089	0.7732	0.7754	NaN	0.3371	0.0000	0.7684
Aba5	0.3254	0.4208	0.9211	0.8343	NaN	0.4550	0.0000	0.8610
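
The win counts quoted in this subsection can be re-derived from Table 3 with a few lines of code. The values below are transcribed from the table. Treating an undefined (NaN) SAFE score as a loss for SAFE is an assumption of this sketch.

```python
import math

NAN = math.nan
# Rows of Table 3 as (SAFE, CI-ASMOTE1) pairs, in the column order
# NBC F1-M, NBC G-M, RBFSVM F1-M, RBFSVM G-M.
TABLE3 = {
    "Simulated": (0.5623, 0.6195, 0.6842, 0.7884, 0.3909, 0.5333, 0.5083, 0.7049),
    "Heart":     (0.7409, 0.7284, 0.7895, 0.7800, 0.6397, 0.6675, 0.7061, 0.7330),
    "Glass":     (0.5748, 0.6155, 0.5919, 0.6013, 0.6189, 0.6909, 0.6830, 0.7544),
    "Seg":       (0.7058, 0.7948, 0.7891, 0.8454, 0.9131, 0.9365, 0.9392, 0.9636),
    "Ecoli":     (0.7275, 0.7299, 0.8397, 0.8538, 0.7799, 0.7984, 0.8433, 0.8716),
    "Yeast":     (0.5997, 0.5050, 0.7333, 0.7450, 0.5455, 0.5527, 0.6504, 0.6624),
    "Libra":     (0.4810, 0.4908, 0.6587, 0.7376, 0.7219, 0.8430, 0.7745, 0.9074),
    "Aba6":      (0.2731, 0.3089, 0.7732, 0.7754, NAN, 0.3371, 0.0000, 0.7684),
    "Aba5":      (0.3254, 0.4208, 0.9211, 0.8343, NAN, 0.4550, 0.0000, 0.8610),
}
COLUMNS = ["NBC F1-M", "NBC G-M", "RBFSVM F1-M", "RBFSVM G-M"]

def ci_wins(safe, ci):
    # Assumption: an undefined (NaN) SAFE score counts as a loss for SAFE.
    return math.isnan(safe) or ci > safe

wins = {
    name: sum(ci_wins(row[2 * k], row[2 * k + 1]) for row in TABLE3.values())
    for k, name in enumerate(COLUMNS)
}
print(wins)  # {'NBC F1-M': 7, 'NBC G-M': 7, 'RBFSVM F1-M': 9, 'RBFSVM G-M': 9}
```

The two NBC losses are not on the same datasets for both indexes: SAFE is ahead on Heart and Yeast in F1-M, and on Heart and Aba5 in G-M.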

4.4 Results and analyses

4.4.1 The results and analyses of the simulated dataset

In this study, we used the simulated dataset to compare the data distributions produced by CI-ASMOTE1, CI-ASMOTE2 and the five contrastive methods. Figure 2 shows the results, in which stars represent minority instances, hollow triangles represent majority instances and hollow circles represent synthetic minority instances. As shown in Fig. 2a, half of the minority instances overlap with the majority class. In Fig. 2b to Fig. 2e, some synthetic instances of SMOTE, Borderline-SMOTE, Safe-Level-SMOTE and Cluster-SMOTE are mixed into the majority class because the four methods oversample overlapping instances. This shows that SMOTE and Borderline-SMOTE inevitably exacerbate the overlaps between the two classes, and that Safe-Level-SMOTE and Cluster-SMOTE do not significantly reduce the generation of overlapping instances. As shown in Fig. 2f, A-SUWO removes original noisy instances before oversampling minority sub-clusters to reduce the occurrence of overfitting; however, this changes the data distribution characteristics of the simulated dataset. In Fig. 2g, all original minority instances are decomposed into different sub-clusters by CI-ASMOTE1, and new instances are only distributed within semi-safe minority sub-clusters that are relatively close to the decision boundary. This shows that CI-ASMOTE1 can completely prevent overlaps between synthetic instances and majority instances. However, some new instances of CI-ASMOTE1 are similar to each other, especially synthetic instances in the same minority sub-cluster. CI-ASMOTE2 mitigates this weakness by using TSMOTE to synthesize instances. As shown in Fig. 2h, the new instances in semi-safe minority sub-clusters containing more than two original instances differ from each other.

To sum up, compared with the five contrastive methods, the two methods proposed in this paper keep all the original instances and avoid generating overlapping and noisy instances while balancing the simulated dataset. Moreover, CI-ASMOTE2 increases the diversity and quality of synthetic instances more than CI-ASMOTE1 does.

Figure 2.

Comparison of No-sampling and seven oversampling methods on the simulated dataset.

Table 4 shows the training and testing results of No-sampling, CI-ASMOTE1, CI-ASMOTE2 and the five other oversampling methods using NBC and RBFSVM on the simulated dataset. The optimal results are reported in bold. First, CI-ASMOTE2 achieves the best training and testing results in terms of F1-M and G-M with both NBC and RBFSVM. Second, when the classifier is NBC, CI-ASMOTE1 gets better training results than No-sampling and all contrastive methods except Safe-Level-SMOTE in terms of F1-M and G-M. It also achieves better testing performance than No-sampling and the five contrastive methods on both evaluation indexes. Safe-Level-SMOTE improves on the training values of CI-ASMOTE1 by only about 0.33% and 0.27% in F1-M and G-M, respectively, so the difference between the two methods is small. When the classifier is RBFSVM, the training and testing results of CI-ASMOTE1 are higher than those of No-sampling and the five contrastive methods in terms of F1-M and G-M. These findings suggest that CI-ASMOTE1 improves the recognition rate of the minority class and the overall classification result on this simulated dataset more effectively than the five contrastive methods, because CI-ASMOTE1 does not generate new minority instances that overlap with majority instances. This supports the rationale of the CI-ASMOTE1 method. Moreover, CI-ASMOTE2 outperforms CI-ASMOTE1 and the five contrastive methods, which illustrates that CI-ASMOTE2 further improves the classification result on this simulated dataset by increasing the diversity of synthetic instances. Although SMOTE, Borderline-SMOTE, Safe-Level-SMOTE and Cluster-SMOTE also perform well on the training set, they do not perform well on the testing set, especially when the classifier is RBFSVM. This shows that these four contrastive methods cause the classification model to overfit because they expand the overlapping regions of the training set. A-SUWO does not obtain better results because the noisy instances it removes may be vital for the classification. In general, compared with the five other oversampling methods, CI-ASMOTE1 and CI-ASMOTE2 reduce the overfitting caused by overlapping synthetic instances on this simulated dataset.

Table 4
The F1-M and G-M values of No-sampling and seven oversampling methods using NBC and RBFSVM in the simulated dataset

	NBC				RBFSVM
	Training		Testing		Training		Testing
Methods	F1-M	G-M	F1-M	G-M	F1-M	G-M	F1-M	G-M
No-sampling	0.5638	0.6649	0.5369	0.6399	0.6420	0.6928	0.3110	0.4447
SMOTE	0.8064	0.8096	0.5714	0.7158	0.9149	0.9108	0.3866	0.5853
Borderline-SMOTE	0.7817	0.7748	0.5792	0.7784	0.8930	0.8822	0.4046	0.6094
Safe-Level-SMOTE	0.8425	0.8374	0.5556	0.7491	0.9128	0.9064	0.4686	0.6513
Cluster-SMOTE	0.7878	0.7867	0.5823	0.7706	0.8858	0.8805	0.4687	0.6689
A-SUWO	0.6723	0.6967	0.5754	0.7596	0.9216	0.9152	0.5266	0.7037
CI-ASMOTE1	0.8392	0.8346	0.6195	0.7884	0.9237	0.9259	0.5333	0.7049
CI-ASMOTE2	0.8449	0.8407	0.6210	0.7991	0.9256	0.9278	0.5714	0.7143
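
One simple way to read the overfitting argument off Table 4 is the gap between training and testing scores. The sketch below computes this gap for RBFSVM F1-M, using values transcribed from the table. Using the gap as an informal overfitting indicator is an assumption of this sketch, not a measure defined in the paper.

```python
# RBFSVM F1-M on the simulated dataset, transcribed from Table 4
# as (training, testing) pairs.
RBFSVM_F1 = {
    "No-sampling":      (0.6420, 0.3110),
    "SMOTE":            (0.9149, 0.3866),
    "Borderline-SMOTE": (0.8930, 0.4046),
    "Safe-Level-SMOTE": (0.9128, 0.4686),
    "Cluster-SMOTE":    (0.8858, 0.4687),
    "A-SUWO":           (0.9216, 0.5266),
    "CI-ASMOTE1":       (0.9237, 0.5333),
    "CI-ASMOTE2":       (0.9256, 0.5714),
}

# Training score minus testing score; a larger gap suggests more overfitting.
gaps = {m: round(train - test, 4) for m, (train, test) in RBFSVM_F1.items()}

for method, gap in sorted(gaps.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{method:<17} {gap:.4f}")
```

Among the seven oversampling methods, SMOTE shows the largest gap (0.5283) and CI-ASMOTE2 the smallest (0.3542). No-sampling has an even smaller gap (0.3310), but at much lower training and testing scores, so the gap should be read together with the testing value itself.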

4.4.2 The results and analyses of eight real-world imbalanced datasets

Table 5 gives the training and testing results of No-sampling and the seven oversampling methods using NBC and RBFSVM on the 8 real-world datasets. The best performances are highlighted in bold. When the classifier is NBC, CI-ASMOTE1 and CI-ASMOTE2 get better training results than the other five oversampling methods on 3 out of 8 datasets in terms of F1-M and on 4 out of 8 datasets in terms of G-M. When the classifier is RBFSVM, CI-ASMOTE2 obtains the best training results on 4 out of 8 datasets in terms of F1-M and G-M, and CI-ASMOTE1 achieves better training results than the five contrastive methods on 3 out of 8 datasets in terms of F1-M and G-M. In the other cases, the training results of CI-ASMOTE1 and CI-ASMOTE2 are close to the optimal results. Moreover, CI-ASMOTE2 has better training results than CI-ASMOTE1 on most of the datasets with both NBC and RBFSVM. These findings show that the two proposed methods generate more useful synthetic instances than the contrastive methods and confirm that the new instances generated by CI-ASMOTE2 are of higher quality than those generated by CI-ASMOTE1.

Table 5
The F1-M and G-M values of No-sampling and seven oversampling methods using NBC and RBFSVM in 8 real-world datasets

		NBC				RBFSVM
		Training		Testing		Training		Testing
Dataset	Methods	F1-M	G-M	F1-M	G-M	F1-M	G-M	F1-M	G-M
Heart	No-sampling	0.7494	0.7996	0.7046	0.7637	0.8676	0.8846	0.6286	0.6940
	SMOTE	0.8102	0.8142	0.7111	0.7722	0.9214	0.9182	0.6685	0.7360
	Borderline-SMOTE	0.7619	0.8085	0.7093	0.7709	0.9190	0.9129	0.6758	0.7424
	Safe-Level-SMOTE	0.8208	0.8228	0.7031	0.7639	0.9196	0.9156	0.6716	0.7387
	Cluster-SMOTE	0.8211	0.8223	0.7017	0.7632	0.9251	0.9235	0.6775	0.7442
	A-SUWO	0.8329	0.8320	0.6934	0.7564	0.9258	0.9204	0.6611	0.7311
	CI-ASMOTE1	0.8220	0.8178	0.7284	0.7800	0.9109	0.9093	0.6675	0.7330
	CI-ASMOTE2	0.8178	0.8159	0.7317	0.7828	0.9095	0.9079	0.6814	0.7448
Glass	No-sampling	0.6191	0.6260	0.6133	0.5991	0.8816	0.8977	0.6270	0.6926
	SMOTE	0.7497	0.6289	0.5988	0.5793	0.9222	0.9191	0.7094	0.7717
	Borderline-SMOTE	0.7360	0.6266	0.5968	0.5622	0.9199	0.9159	0.6998	0.7643
	Safe-Level-SMOTE	0.7497	0.6277	0.6099	0.5852	0.9294	0.9263	0.7052	0.7686
	Cluster-SMOTE	0.7433	0.6260	0.6011	0.5779	0.9207	0.9175	0.7076	0.7727
	A-SUWO	0.7124	0.6262	0.4874	0.4946	0.9364	0.9349	0.7001	0.7612
	CI-ASMOTE1	0.7289	0.6334	0.6155	0.6013	0.9333	0.9317	0.6909	0.7544
	CI-ASMOTE2	0.7350	0.6356	0.6160	0.6147	0.9391	0.9381	0.7324	0.7870
Seg	No-sampling	0.7272	0.8159	0.6951	0.7920	0.9473	0.9648	0.9038	0.9295
	SMOTE	0.8588	0.8582	0.7524	0.8367	0.9836	0.9832	0.9368	0.9623
	Borderline-SMOTE	0.8596	0.8603	0.7654	0.8410	0.9829	0.9824	0.9381	0.9663
	Safe-Level-SMOTE	0.8646	0.8636	0.7575	0.8395	0.9830	0.9826	0.9382	0.9616
	Cluster-SMOTE	0.8815	0.8809	0.7500	0.8262	0.9825	0.9821	0.9271	0.9508
	A-SUWO	0.8796	0.8767	0.7794	0.8658	0.9836	0.9836	0.9372	0.9630
	CI-ASMOTE1	0.8028	0.8169	0.7948	0.8454	0.9825	0.9821	0.9365	0.9636
	CI-ASMOTE2	0.8548	0.8476	0.7967	0.8812	0.9824	0.9820	0.9399	0.9653
Ecoli	No-sampling	0.7288	0.8511	0.7190	0.8417	0.8808	0.9212	0.7947	0.8517
	SMOTE	0.8790	0.8767	0.7288	0.8509	0.9466	0.9441	0.7804	0.8799
	Borderline-SMOTE	0.8407	0.8523	0.7334	0.8578	0.9415	0.9358	0.7547	0.8787
	Safe-Level-SMOTE	0.8887	0.8856	0.7297	0.8515	0.9508	0.9488	0.7829	0.8770
	Cluster-SMOTE	0.8851	0.8823	0.7273	0.8494	0.9503	0.9491	0.7803	0.8732
	A-SUWO	0.9096	0.9060	0.7248	0.8471	0.9463	0.9429	0.7737	0.8800
	CI-ASMOTE1	0.8820	0.8790	0.7299	0.8538	0.9581	0.9574	0.7984	0.8716
	CI-ASMOTE2	0.8837	0.8808	0.7399	0.8612	0.9582	0.9576	0.8131	0.8839
Yeast	No-sampling	0.6480	0.7443	0.4000	0.5313	0.7611	0.7983	0.5009	0.6025
	SMOTE	0.7843	0.7665	0.4684	0.7341	0.9662	0.9654	0.4987	0.6683
	Borderline-SMOTE	0.7236	0.7487	0.3867	0.6479	0.9736	0.9726	0.4570	0.6294
	Safe-Level-SMOTE	0.8189	0.8028	0.4862	0.7421	0.9624	0.9621	0.5179	0.6714
	Cluster-SMOTE	0.8214	0.8053	0.4897	0.7452	0.9388	0.9384	0.5459	0.7187
	A-SUWO	0.8060	0.7793	0.4634	0.7293	0.9785	0.9781	0.5143	0.6638
	CI-ASMOTE1	0.8499	0.8352	0.5050	0.7450	0.9555	0.9562	0.5527	0.6624
	CI-ASMOTE2	0.8524	0.8401	0.5133	0.7468	0.9548	0.9556	0.5542	0.6630
Libra	No-sampling	0.5797	0.8288	0.4647	0.7184	0.8869	0.8974	0.6914	0.7524
	SMOTE	0.8419	0.8393	0.4812	0.7481	0.9907	0.9906	0.8566	0.9120
	Borderline-SMOTE	0.8163	0.8312	0.4304	0.6982	0.9897	0.9896	0.8508	0.9111
	Safe-Level-SMOTE	0.8604	0.8583	0.4602	0.7251	0.9887	0.9886	0.8563	0.9113
	Cluster-SMOTE	0.8749	0.8767	0.4323	0.6396	0.9860	0.9857	0.8153	0.9049
	A-SUWO	0.8484	0.8475	0.4834	0.7372	0.9932	0.9932	0.8493	0.9080
	CI-ASMOTE1	0.8445	0.8425	0.4908	0.7376	0.9872	0.9871	0.8430	0.9074
	CI-ASMOTE2	0.8471	0.8487	0.4983	0.7695	0.9872	0.9872	0.8702	0.9213
Aba6	No-sampling	0.2752	0.7804	0.2744	0.7732	NaN	0.0000	NaN	0.0000
	SMOTE	0.8142	0.7810	0.2645	0.7786	0.9132	0.9032	0.3141	0.7560
	Borderline-SMOTE	0.8057	0.7819	0.2621	0.7762	0.9168	0.9070	0.3053	0.7443
	Safe-Level-SMOTE	0.8280	0.8017	0.2752	0.7795	0.9277	0.9206	0.3209	0.7245
	Cluster-SMOTE	0.8563	0.8361	0.2891	0.7784	0.8994	0.8897	0.3240	0.7686
	A-SUWO	0.8473	0.8135	0.2738	0.7786	0.9228	0.9130	0.3177	0.7461
	CI-ASMOTE1	0.8582	0.8448	0.3089	0.7754	0.9358	0.9325	0.3371	0.7684
	CI-ASMOTE2	0.8647	0.8483	0.3119	0.7893	0.9372	0.9343	0.4113	0.7878
Aba5	No-sampling	0.3605	0.9485	0.2857	0.8148	NaN	0.0000	NaN	0.0000
	SMOTE	0.9327	0.9283	0.3098	0.8974	0.9390	0.9359	0.3364	0.8892
	Borderline-SMOTE	0.9215	0.9154	0.2913	0.9090	0.9339	0.9300	0.3243	0.9064
	Safe-Level-SMOTE	0.9633	0.9614	0.4268	0.9095	0.9655	0.9639	0.4261	0.8938
	Cluster-SMOTE	0.9571	0.9553	0.4209	0.9050	0.9570	0.9553	0.4049	0.8916
	A-SUWO	0.9685	0.9669	0.4289	0.8816	0.9699	0.9685	0.4143	0.8561
	CI-ASMOTE1	0.9700	0.9689	0.4208	0.8343	0.9704	0.9694	0.4550	0.8610
	CI-ASMOTE2	0.9715	0.9705	0.4555	0.8537	0.9708	0.9697	0.4657	0.8733

For some of the real-world datasets used in this study, some contrastive oversampling methods have better training results than CI-ASMOTE1 and CI-ASMOTE2, but their classification models do not classify the unseen testing set well. When the classifier is NBC, SMOTE has lower testing results than No-sampling on the dataset Glass, and Borderline-SMOTE has lower testing results than No-sampling on the datasets Glass, Yeast, Libra and Aba6. Also with NBC, Safe-Level-SMOTE, Cluster-SMOTE and A-SUWO perform worse in testing than No-sampling on the datasets Heart and Glass. When the classifier is RBFSVM, SMOTE and Borderline-SMOTE achieve a lower testing F1-M than No-sampling on the dataset Yeast, and all five contrastive methods get a lower testing F1-M than No-sampling on the dataset Ecoli. These findings suggest that the five contrastive methods tend to cause classification models to overfit when the dataset contains overlaps between classes.

In contrast, CI-ASMOTE1 and CI-ASMOTE2 perform well on the testing sets of these real-world datasets because they do not remove any original instances before oversampling and avoid generating overlapping and noisy instances. According to Table 5, CI-ASMOTE1 and CI-ASMOTE2 have better testing results than No-sampling on the 8 real-world datasets when using NBC and RBFSVM. When the classifier is NBC, CI-ASMOTE1 has higher testing values than the five contrastive methods on 6 out of 8 datasets in terms of F1-M and on 4 out of 8 datasets in terms of G-M. When using RBFSVM, CI-ASMOTE1 gets better testing results than the five comparison methods on 5 out of 8 datasets in terms of F1-M and performs better than some contrastive methods in terms of G-M. In short, on most of the datasets CI-ASMOTE1 increases the recognition rate of the minority class, improves the overall classification result and addresses overfitting better than the contrastive methods. This again supports the rationale of the CI-ASMOTE1 method.

Moreover, CI-ASMOTE2 has better testing results than CI-ASMOTE1 on all datasets, whether the classifier is NBC or RBFSVM. On the testing sets of the 8 real-world datasets, CI-ASMOTE2 gets the best F1-M and G-M values in most cases. Specifically, when the classifier is NBC, CI-ASMOTE2 gets the highest F1-M values on all of these real-world datasets and the highest G-M values on 7 out of 8 datasets. On the dataset Aba5, CI-ASMOTE2 does not perform as well as the contrastive methods in terms of G-M, but the differences are small. When using RBFSVM, CI-ASMOTE2 obtains the best F1-M values on all 8 real-world datasets and the highest G-M values on 5 out of 8 datasets. On the datasets Seg, Yeast and Aba5, its G-M values are lower than those of some comparison methods, but the differences are small. These findings show that the classification models built after oversampling with CI-ASMOTE2 identify minority and majority instances in these real-world datasets better than those built with the other methods. In addressing overfitting, CI-ASMOTE2 is superior to CI-ASMOTE1 because it increases the diversity of synthetic instances in each semi-safe minority sub-cluster.
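
The NBC testing F1-M counts quoted above can be checked against Table 5 with a short script. The values are transcribed from the NBC testing F1-M column of the table. The variable names are illustrative, and the same check could be repeated for the other columns.

```python
METHODS = ["No-sampling", "SMOTE", "Borderline-SMOTE", "Safe-Level-SMOTE",
           "Cluster-SMOTE", "A-SUWO", "CI-ASMOTE1", "CI-ASMOTE2"]

# NBC testing F1-M per dataset, transcribed from Table 5 in METHODS order.
NBC_TEST_F1 = {
    "Heart": [0.7046, 0.7111, 0.7093, 0.7031, 0.7017, 0.6934, 0.7284, 0.7317],
    "Glass": [0.6133, 0.5988, 0.5968, 0.6099, 0.6011, 0.4874, 0.6155, 0.6160],
    "Seg":   [0.6951, 0.7524, 0.7654, 0.7575, 0.7500, 0.7794, 0.7948, 0.7967],
    "Ecoli": [0.7190, 0.7288, 0.7334, 0.7297, 0.7273, 0.7248, 0.7299, 0.7399],
    "Yeast": [0.4000, 0.4684, 0.3867, 0.4862, 0.4897, 0.4634, 0.5050, 0.5133],
    "Libra": [0.4647, 0.4812, 0.4304, 0.4602, 0.4323, 0.4834, 0.4908, 0.4983],
    "Aba6":  [0.2744, 0.2645, 0.2621, 0.2752, 0.2891, 0.2738, 0.3089, 0.3119],
    "Aba5":  [0.2857, 0.3098, 0.2913, 0.4268, 0.4209, 0.4289, 0.4208, 0.4555],
}

# Best-scoring method on each dataset.
best = {
    ds: METHODS[max(range(len(METHODS)), key=vals.__getitem__)]
    for ds, vals in NBC_TEST_F1.items()
}

# Datasets on which CI-ASMOTE1 (index 6) beats all five contrastive
# oversampling methods (indices 1 to 5).
ci1_beats_five = sum(vals[6] > max(vals[1:6]) for vals in NBC_TEST_F1.values())

print(best)
print(ci1_beats_five)  # 6
```

CI-ASMOTE2 is the best method on all 8 datasets in this column, and CI-ASMOTE1 beats the five contrastive methods on 6 of them. The two exceptions for CI-ASMOTE1 are Ecoli and Aba5.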

Generally, in solving the classification of real-world imbalanced datasets containing overlaps between classes, CI-ASMOTE1 and CI-ASMOTE2 can alleviate the issue of overfitting and improve classification performance by avoiding the generation of new instances that overlap with the majority class and generating diverse instances within minority sub-clusters that are relatively close to the decision boundary. The two methods proposed in this paper have general applicability to real-world imbalanced datasets, so they can be alternative oversampling methods for imbalanced data classification.

5. Conclusion

In this paper, we proposed the CI-ASMOTE1 method, which aims to avoid increasing the occurrence of overlaps between majority instances and minority instances. CI-ASMOTE1 includes two steps. The first step is to decompose minority instances into sub-clusters according to their connectivity in the feature space by EDCA. The second step is to select minority sub-clusters that are relatively close to the decision boundary as candidate regions to oversample. Based on CI-ASMOTE1, CI-ASMOTE2 is also proposed to improve the diversity of the synthetic instances by evenly oversampling instances within the region of each selected sub-cluster, so that the synthetic instances can include more class-discriminative information.

The two methods were evaluated on a simulated imbalanced dataset and eight real-world datasets with different imbalanced ratios, and were compared with SMOTE, Borderline-SMOTE, Safe-Level-SMOTE, Cluster-SMOTE and A-SUWO using NBC and RBFSVM. The experimental results show that the proposed methods are better than these contrastive methods at alleviating the overfitting caused by overlaps between synthetic minority instances and majority instances, especially when the original imbalanced datasets have overlaps between classes. The better performance of CI-ASMOTE2 relative to CI-ASMOTE1 in many cases shows that CI-ASMOTE2 increases the class-discriminative information by improving the diversity of the new instances. The work in this paper encourages future studies on reducing the occurrence of overlaps between the minority class and the majority class, increasing the diversity of synthetic instances, and providing good class discrimination for high-performance classification. During this research, we found that many real-world datasets contain noisy and missing values, which affects the whole method. Therefore, in the future we will consider this data problem as a new research direction for imbalanced data classification.

Acknowledgments

We gratefully acknowledge the support of the Open Research Fund of Beijing Key Laboratory of Big Data Technology for Food Safety Project (Grant No. BTBD-2019KF02).

References

1. Gong and Kim, RHSBoost: Improving classification performance in imbalance data, Computational Statistics & Data Analysis 111 (2017), 1–13.
2. He and Garcia E.A., Learning from Imbalanced Data, IEEE Transactions on Knowledge and Data Engineering 21(9) (2009), 1263–1284.
3. Lin, Chen and Qi, Deep reinforcement learning for imbalanced classification, Applied Intelligence 50(8) (2020), 2488–2502.
4. Maldonado, Lopez and Vairetti, An alternative SMOTE oversampling strategy for high-dimensional datasets, Applied Soft Computing 76 (2019), 380–389.
5. Wang, Chen, Jiang and Yao, Imbalanced credit risk evaluation based on multiple sampling, multiple kernel fuzzy self-organizing map and local accuracy ensemble, Applied Soft Computing 91 (2020), 106262.
6. Gicic and Subasi, Credit scoring for a microcredit data set using the synthetic minority oversampling technique and ensemble classifiers, Expert Systems 36(2) (2019), e12363.
7. Cao, Lai, Liu, Wang and Ding, Epileptic Signal Classification Based on Synthetic Minority Oversampling and Blending Algorithm, IEEE Transactions on Cognitive and Developmental Systems 13(2) (2021), 368–382.
8. Shen, Nie, Kou, Yin and Han, A cluster-based oversampling algorithm combining SMOTE and k-means for imbalanced medical data, Information Sciences 572 (2021), 574–589.
9. Feng, Keung, Xiao, Bennin K.E., Kabir M.A. et al., COSTE: Complexity-based OverSampling TEchnique to alleviate the class imbalance problem in software defect prediction, Information and Software Technology 129 (2021), 106432.
10. Rao K.N. and Reddy C.S., An Efficient Software Defect Analysis Using Correlation-Based Oversampling, Arabian Journal for Science and Engineering 43(8) (2018), 4391–4411.
11. Wei, Huang, Yao, Fan and Huang, New imbalanced bearing fault diagnosis method based on Sample-characteristic Oversampling TechniquE (SCOTE) and multi-class LS-SVM, Applied Soft Computing 101 (2021), 107043.
12. Kaur and Gosain, GT2FS-SMOTE: An Intelligent Oversampling Approach Based Upon General Type-2 Fuzzy Sets to Detect Web Spam, Arabian Journal for Science and Engineering 46(4) (2021), 3033–3050.
13. Sun, Wong A.K.C. and Kamel M.S., Classification of imbalanced data: A review, International Journal of Pattern Recognition and Artificial Intelligence 23(4) (2009), 687–719.
14. Susan and Kumar, SSOMaj-SMOTE-SSOMin: Three-step intelligent pruning of majority and minority samples for learning from imbalanced datasets, Applied Soft Computing 78 (2019), 141–149.
15. Chawla N.V., Bowyer K.W., Hall L.O. and Kegelmeyer W.P., SMOTE: Synthetic minority over-sampling technique, Journal of Artificial Intelligence Research 16 (2002), 321–357.
16. Hawkins D.M., The problem of overfitting, Journal of Chemical Information and Computer Sciences 44(1) (2004), 1–12.
17. Han, Wang W.Y. and Mao B.H., Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning, in: Advances in Intelligent Computing, Pt 1, Proceedings, 2005, pp. 878–887.
18. Bai, Garcia E.A. and Li, ADASYN: Adaptive Synthetic Sampling Approach for Imbalanced Learning, in: 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), Hong Kong, China, 2008, pp. 1322–1328.
19. Ren, Yang and Sun, Oversampling technique based on fuzzy representativeness difference for classifying imbalanced data, Applied Intelligence 50(8) (2020), 2465–2487.
20. Bunkhumpornpat, Sinapiromsaran and Lursinsap, Safe-Level-SMOTE: Safe-Level-Synthetic Minority Over-Sampling TEchnique for Handling the Class Imbalanced Problem, in: 13th Pacific-Asia Conference on Knowledge Discovery and Data Mining, Bangkok, Thailand, 2009, pp. 475–482.
21. Zhu and Fan, A novel oversampling technique for class-imbalanced learning based on SMOTE and natural neighbors, Information Sciences 565 (2021), 438–455.
22. Cieslak D.A., Chawla N.V. and Striegel, Combating imbalance in network intrusion datasets, in: IEEE International Conference on Granular Computing, Atlanta, GA, 2006, pp. 732–737.
23. Barua, Islam M.M., Yao and Murase, MWMOTE-Majority Weighted Minority Oversampling Technique for Imbalanced Data Set Learning, IEEE Transactions on Knowledge and Data Engineering 26(2) (2014), 405–425.
24. Carvalho D.R. and Freitas A.A., Evaluating six candidate solutions for the small-disjunct problem and choosing the best solution via meta-learning, Artificial Intelligence Review 24(1) (2005), 61–98.
25. Douzas, Bacao and Last, Improving imbalanced learning through a heuristic oversampling method based on k-means and SMOTE, Information Sciences 465 (2018), 1–20.
26. Nekooeimehr and Lai-Yuen S.K., Adaptive semi-unsupervised weighted oversampling (A-SUWO) for imbalanced datasets, Expert Systems with Applications 46 (2016), 405–416.
27. Wei, Huang, Yao, Fan and Huang, NI-MWMOTE: An improving noisy-immunity majority weighted minority oversampling technique for imbalanced classification problems, Expert Systems with Applications 158 (2020), 113504.
28. Zhou and Liu, Method for Determining the Optimal Number of Clusters Based on Agglomerative Hierarchical Clustering, IEEE Transactions on Neural Networks and Learning Systems 28(12) (2017), 3007–3017.
29. Zhang and Zhang, A geometrical representation of McCulloch-Pitts neural model and its applications, IEEE Transactions on Neural Networks 10(4) (1999), 925–929.
30. Ertekin, Adaptive Oversampling for Imbalanced Data Classification, in: 28th International Symposium on Computer and Information Sciences (ISCIS), Inst Henri Poincare, Paris, France, 2013, pp. 261–269.
31. Zhang and Li, RWO-Sampling: A random walk over-sampling approach to imbalanced data classification, Information Fusion 20 (2014), 99–116.
32. Martinez-Arroyo and Sucar L.E., Learning an optimal naive Bayes classifier, in: 18th International Conference on Pattern Recognition (ICPR 2006), Hong Kong, China, 2006, pp. 1236–1239.
33. Sebald D.J. and Bucklew J.A., Support vector machine techniques for nonlinear equalization, IEEE Transactions on Signal Processing 48(11) (2000), 3217–3226.
34. Xiong, Wang, Deng and Ye, ACO Resampling: Enhancing the performance of oversampling methods for class imbalance classification, Knowledge-Based Systems 196 (2020), 105818.
35. Luque, Carrasco, Martin and de las Heras, The impact of class imbalance in classification performance metrics based on the binary confusion matrix, Pattern Recognition 91 (2019), 216–231.
