Abstract
Synthetic Minority Oversampling Technique (SMOTE) and some extensions based on it are popularly used to balance imbalanced data. In this study, we concentrate on solving overfitting of the classification model caused by choosing instances to oversample that increase the occurrence of overlaps with the majority class. Our method called Clustering-based Improved Adaptive Synthetic Minority Oversampling Technique (CI-ASMOTE1) decomposes minority instances into sub-clusters according to their connectivity in the feature space and then selects minority sub-clusters which are relatively close to the decision boundary as the candidate regions to oversample. After application of CI-ASMOTE1, new minority instances are only synthesized within each connected region of the selected sub-clusters. Considering the diversity of the synthetic instances in each selected sub-cluster, CI-ASMOTE2 is put forward to extend CI-ASMOTE1 by keeping all features of those instances in the feature space as different as possible. The experimental evaluation shows that CI-ASMOTE1 and CI-ASMOTE2 improve SMOTE and its extensions, especially in the occurrence of overlaps between the minority instances and the majority instances.
Introduction
Imbalanced data distribution reduces classification accuracy and has become a serious problem in many practical applications [1, 2]. The issue of imbalanced data classification arises because traditional machine learning methods are biased towards classification models that favor the majority class [3, 4]. While this bias generally yields high overall accuracy, it may lead to unsatisfactory decision-making because the minority class may have high misclassification costs in numerous applications such as credit risk assessment [5, 6], medical diagnosis [7, 8], software defect prediction [9, 10], bearing fault diagnosis [11], and Web spam detection [12].
Oversampling techniques have been proven to be effective ways to deal with imbalanced data classification [13]. The earliest method is random oversampling [14]. SMOTE [15] is a typical improvement of this approach. In SMOTE, new instances are synthesized along the line segment between two original minority instances. However, SMOTE encounters the issue of overfitting [16] because it aggravates overlaps between classes by blindly utilizing all the original minority instances to generate new instances. The emergence of SMOTE was followed by the development of further oversampling techniques. Some extensions based on SMOTE, such as Borderline-SMOTE [17] and Adaptive Synthetic Sampling Approach (ADASYN) [18], oversample minority instances using prior knowledge of the available majority instances. These extensions, which improve SMOTE to different extents, mainly aim to identify hard-to-learn instances to oversample because those instances carry information that is beneficial for classification. However, they either do not fully utilize the distribution information of instances or do not update the random selection of the nearest neighbors. Therefore, they may oversample overlapping and noisy instances, which makes learning more difficult and worsens overfitting. These strategies are especially debatable in the case of extremely skewed class distributions. In that case, the minority class is very sparse relative to the majority class, which leads to a higher probability of class mixture.
In this work, an improved oversampling method called Clustering-based Improved Adaptive Synthetic Minority Oversampling Technique (CI-ASMOTE1) is presented for the problem of imbalanced data classification. CI-ASMOTE1 decomposes minority instances into different sub-clusters according to their connectivity in the feature space and then selects semi-safe minority sub-clusters as candidate sub-clusters to oversample. To synthesize more instances that are beneficial for classification, it assigns higher oversampling weights to those selected sub-clusters that have a higher degree of sparseness and are closer to the majority class. CI-ASMOTE1 oversamples each selected sub-cluster by SMOTE to avoid generating overlapping and noisy instances, which decreases the occurrence of overfitting. We also consider the diversity of the synthetic instances within selected sub-clusters. Therefore, CI-ASMOTE2 extends CI-ASMOTE1 by two-step SMOTE (TSMOTE) to provide a useful solution to the data imbalance issue in the real world. CI-ASMOTE2 only generates instances that are beneficial for class-discriminative information, so it can guarantee the diversity of the synthetic instances.
The rest of this paper is organized as follows. Section 2 introduces SMOTE and other oversampling methods that are relevant to imbalanced data classification. The steps of the two proposed methods are explained in Section 3. The experimental results are analyzed in Section 4, and the conclusion is drawn in Section 5.
Literature survey
In this section, relevant methods for imbalanced data classification are presented. First, the random oversampling technique and SMOTE are presented. Then some well-known extensions are discussed.
Oversampling techniques are the most primitive and traditional processing methods for imbalanced data [19]. The simplest technique of them is random oversampling. It randomly selects some original minority instances and replicates them until the sample numbers of the two classes are roughly equal. New instances generated by random oversampling overlap with original minority instances, which may cause serious overfitting. To alleviate this issue, SMOTE [15] synthesizes new instances along the line segment between one randomly selected minority instance and its K-nearest neighbors of the minority class. However, SMOTE may aggravate overlaps between the two classes, as it generates new minority instances without taking the distribution of the majority class into account. In other words, SMOTE may oversample overlapping and noisy instances of the minority class. Besides, SMOTE also oversamples uninformative instances that are far from the decision boundary. Various extensions based on SMOTE have been put forward to better identify minority instances from the imbalanced data. The sampling technique of Borderline-SMOTE [17] is the same as SMOTE, but it only oversamples minority instances that are in the proximity of the decision boundary. If the separability of the imbalanced data is good, Borderline-SMOTE will be effective in improving the classification result of the data. Otherwise, Borderline-SMOTE may increase the complexity of the classification because new instances generated by it are closer to unsafe regions and may overlap with majority instances. The increased complexity may make the decision boundary between the two classes more difficult to identify. Safe-Level-SMOTE [20] defines a safe level for each minority instance according to the distribution of the minority class in its K-nearest neighbors. 
To reduce the generation of overlapping and noisy instances, Safe-Level-SMOTE makes synthetic instances closer to minority instances which have the higher safe level by introducing weights when calculating new instances. Regrettably, this method is prone to lead to overfitting because synthetic instances of it are uninformative instances that are far away from the decision boundary. Another method that assigns weights to minority instances is ADASYN [18]. ADASYN assigns a bigger oversampling weight to the minority instance that has more majority class neighbors, so more instances are synthesized based on those minority instances that are more difficult to learn. However, ADASYN also will lead to overfitting if original imbalanced data contain overlaps between classes, which is the same as Borderline-SMOTE. Synthetic minority oversampling technique with natural neighbors (NaNSMOTE) [21] determines the neighbor number of each minority instance based on the surrounding distribution of the instance. NaNSMOTE can decrease the generation of noisy instances by assigning a smaller neighbor number to the minority instance that is closer to the majority class. This method still has the hidden danger of overfitting because it cannot prevent synthesizing overlapping instances when imbalanced data have serious overlaps between classes.
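The ADASYN weighting described above can be sketched as follows. This is a minimal illustration, not the reference implementation: the function name `adasyn_counts` and its parameters are hypothetical, the neighbor search is brute force, and rounding means the counts may not sum exactly to `total_new`.

```python
import math


def adasyn_counts(minority, majority, k, total_new):
    """Split total_new synthetic instances across minority points, ADASYN-style.

    Each minority point gets a share proportional to the fraction of
    majority instances among its k nearest neighbours.
    """
    labelled = [(p, 0) for p in minority] + [(p, 1) for p in majority]
    ratios = []
    for x in minority:
        # Distances from x to every other instance, with the class label (1 = majority).
        others = [(math.dist(x, p), lab) for p, lab in labelled if p is not x]
        others.sort(key=lambda t: t[0])
        ratios.append(sum(lab for _, lab in others[:k]) / k)
    total = sum(ratios)
    if total == 0:
        # No minority point has a majority neighbour: nothing is "hard to learn".
        return [0] * len(minority)
    return [round(total_new * r / total) for r in ratios]


minority = [(0.0, 0.0), (5.0, 5.0), (5.0, 6.0)]
majority = [(0.1, 0.0), (0.0, 0.1), (9.0, 9.0)]
counts = adasyn_counts(minority, majority, k=2, total_new=8)
```

In this toy example the first minority point is surrounded by majority instances, so it receives the largest share. This is the behavior that can worsen overlap when the original data already overlap.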
Many clustering-based oversampling methods have been presented to address the issue of overfitting. Cluster-SMOTE [22] uses K-means to divide minority instances into sub-clusters and then oversamples each sub-cluster by SMOTE. However, when determining the oversampling level of each minority sub-cluster, Cluster-SMOTE considers neither the distance between each sub-cluster and the decision boundary nor the complexity of each sub-cluster. Thus, Cluster-SMOTE may expand overlapping areas and cause overfitting. Majority Weighted Minority Oversampling Technique (MWMOTE) [23] is a two-step weighted method. It first finds those minority and majority instances that are close to the decision boundary and then filters out informative instances from the minority class according to the distance from each minority instance to the majority class. Finally, it applies a clustering method to generate new instances based on those filtered instances. MWMOTE decreases the occurrence of overlaps between synthetic instances and majority instances, but it may still produce overfitting as it oversamples minority disjuncts [24] that are mixed into the majority class. Minority disjuncts are regions that are made up of a small number of minority instances [24, 25]. Besides, minority disjuncts that are far away from the majority class cannot be detected by MWMOTE, even if they may carry relevant information for classification. K-means based SMOTE [25] utilizes K-means to divide all instances into sub-clusters of different sizes and then selects sub-clusters that are composed of at least 50% minority instances as safe regions to oversample. This method not only avoids the generation of noisy instances, but also reduces intra-class and inter-class imbalances. Adaptive semi-unsupervised weighted oversampling (A-SUWO) [26] is a clustering-based method proposed for easing the issue of overfitting. 
It identifies sub-clusters from the minority class that do not overlap with majority sub-clusters and then only oversamples these minority sub-clusters to reduce overlapping synthetic instances. To alleviate the problem of overfitting caused by noisy instances, A-SUWO removes instances surrounded by instances of the opposite class from the training set before clustering [26]. However, A-SUWO will degrade classification performance on the testing set if the removed instances were over-recognized as noise, especially when the minority instances are sparsely distributed in the majority class. Improving noise-immunity majority weighted minority oversampling technique (NI-MWMOTE) [27] is another method proposed to address the problem of overfitting caused by noisy instances. It reduces the over-recognition of noisy instances by iteratively using the K-nearest neighbor algorithm and misclassification errors, and then oversamples minority instances according to the aggregative hierarchical clustering algorithm (AHC) [28]. However, because the computational process of NI-MWMOTE is relatively complicated, its efficiency in practical applications is not ideal.
In summary, to solve the problem of imbalanced data classification, many extensions based on SMOTE and clustering have been proposed. These methods overcome some shortcomings of traditional random oversampling and SMOTE, but they are less successful in avoiding additional overlapping or noisy instances. In particular, some of these methods exacerbate overlaps when the original data already contain overlaps, because they adopt the strategy of selecting candidate instances in the proximity of the decision boundary. They make the decision boundary more complex, which leads to overfitting. Furthermore, in these clustering-based methods, the minority sub-clusters often contain sporadic majority instances (i.e., noisy instances), which also leads to overfitting. Some of these methods remove noisy instances in advance to reduce overfitting, but this can lead to under-fitting for small-sample data in the real world.
Proposed methods
CI-ASMOTE1
CI-ASMOTE1 is composed of four main procedures: (1) Construct minority sub-clusters, (2) Select candidate minority sub-clusters, (3) Adaptively determine the number of new instances of each candidate sub-cluster, and (4) Synthesize new minority instances.
Construct minority sub-clusters
To avoid the generation of overlapping and noisy instances, Euclidean Distance Clustering Algorithm (EDCA) is proposed in this section. EDCA is based on the Constructive Covering Algorithm (CCA) [29] which can cover instances with the same label. EDCA decomposes minority instances into different sub-clusters according to the connectivity of minority instances. In other words, if one majority instance exists between two minority instances, EDCA will not merge them in the same sub-cluster. The specific procedures of EDCA are as follows.
Suppose the input dataset is
Step1. Randomly select one minority instance
Step2. Calculate the distance
Step3. Calculate the Euclidean distance between the center
Step4. Calculate the radius
Among it,
Step5. Repeat steps 1 to 4 until the
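The covering idea behind the steps above can be sketched in a few lines. This is a hedged illustration, not the authors' EDCA: the formulas of Steps 2 to 4 are not recoverable from the text, so the sketch assumes the usual constructive-covering rule in which a cover around a randomly chosen minority center extends up to (but not including) the nearest majority instance, and the radius is taken as the midpoint between that distance and the farthest covered minority instance. The name `edca_like_covers` and the returned dictionary layout are hypothetical.

```python
import math
import random


def edca_like_covers(minority, majority, seed=0):
    """Greedy covering of minority points by balls that contain no majority point."""
    rng = random.Random(seed)
    uncovered = list(range(len(minority)))
    covers = []
    while uncovered:
        c = rng.choice(uncovered)                # random uncovered center
        center = minority[c]
        # Distance from the center to the nearest majority instance.
        d1 = min(math.dist(center, m) for m in majority)
        # Uncovered minority instances strictly closer than any majority instance.
        inside = [i for i in uncovered if math.dist(center, minority[i]) < d1]
        if c not in inside:                      # degenerate case: d1 == 0
            inside.append(c)
        # Distance to the farthest covered minority instance.
        d2 = max(math.dist(center, minority[i]) for i in inside)
        radius = (d1 + d2) / 2                   # ASSUMPTION: midpoint radius
        covers.append({"center": center, "radius": radius, "members": inside})
        uncovered = [i for i in uncovered if i not in inside]
    return covers


minority = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
majority = [(5.0, 0.0)]
covers = edca_like_covers(minority, majority)
```

In the toy data, the single majority point at (5, 0) lies between the two minority pairs, so no cover can span both pairs and two sub-clusters are produced regardless of the random center.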
The purpose of oversampling methods is to increase the recognition rate of the minority class and the overall classification result of the imbalanced data [30]. Adding key minority instances is one of the useful ways to improve the performance of the classification model in identifying minority instances. Therefore, it is vital to find a method that can select key minority instances as candidate instances to oversample. In most oversampling methods, the selection of candidate minority instances is binary [17, 20]. That is, candidate minority instances are labeled as either safe or unsafe according to their distances to the decision boundary. Some oversampling methods only consider candidate minority instances in the proximity of the decision boundary. They neglect the influence of other minority instances on the performance of the classifier. In fact, minority instances can be divided into three groups in the process of selecting candidate instances. The first group is minority instances that are far away from the decision boundary. These instances carry only a little class-discriminative information. If they are selected as candidate instances to oversample, the contribution of the synthesized instances to the classification will not be significant. The second group is minority instances that are very close to majority instances that are obviously clustered. Selecting these instances to oversample can easily increase the occurrence of overlaps between the two classes and cause overfitting. The third group is minority instances that fall outside the above two groups. These instances are relatively close to the decision boundary and contain a lot of class-discriminative information. Thus, this group is more suitable to be oversampled than the other two groups.
In this paper, CI-ASMOTE1 divides minority sub-clusters into the three groups mentioned above, according to the sample number of each minority sub-cluster and the proportion of majority instances in K-nearest neighbors of each minority sub-cluster. In some oversampling methods, the
In the Eq. (5),
Then, the proportion
Among it,
According to
The first group is minority sub-clusters which contain two or more instances and are far from majority instances. We define it as the set of safe minority sub-clusters
The second group is minority sub-clusters which contain two or more instances and are very close to majority instances. In addition, minority sub-clusters that contain only one instance also belong to the second group. We define this group as the set of dangerous minority sub-clusters
The third group is minority sub-clusters which contain two or more instances and are relatively close to majority instances. We define it as the set of semi-safe minority sub-clusters
To prevent the issue of overfitting and improve the classification result of the imbalanced data as well as possible, CI-ASMOTE1 chooses
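The three-way grouping can be illustrated with a short sketch. The thresholds that separate "far", "relatively close" and "very close" are not recoverable from the text, so `low` and `high` below are hypothetical placeholders, and `majority_ratio` is one assumed way of measuring the proportion of majority instances among the K-nearest neighbors of a sub-cluster (averaged over its members).

```python
import math


def majority_ratio(members, minority, majority, k):
    """Average fraction of majority instances among the k nearest neighbours
    of each member of a sub-cluster (ASSUMED definition)."""
    labelled = [(p, 0) for p in minority] + [(p, 1) for p in majority]
    hits = 0
    for x in members:
        near = sorted((math.dist(x, p), lab) for p, lab in labelled if p is not x)[:k]
        hits += sum(lab for _, lab in near)
    return hits / (k * len(members))


def classify_subcluster(size, ratio, low=0.3, high=0.7):
    """Label a minority sub-cluster; low and high are HYPOTHETICAL thresholds."""
    if size < 2 or ratio >= high:
        return "dangerous"   # singletons, or sub-clusters very close to the majority class
    if ratio <= low:
        return "safe"        # far from the majority class
    return "semi-safe"       # relatively close: the candidates to oversample


minority = [(0.0, 0.0), (1.0, 0.0)]
majority = [(0.5, 0.5), (9.0, 9.0)]
ratio = majority_ratio(minority, minority, majority, k=2)
label = classify_subcluster(len(minority), ratio)
```

Only sub-clusters labeled "semi-safe" would be passed on to the weighting and synthesis procedures.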
CI-ASMOTE1 adaptively determines the number of synthetic instances of each candidate minority sub-cluster according to two factors. The first factor is the sparsity of each candidate sub-cluster. The second factor is the average Euclidean distance between each candidate sub-cluster and majority instances within its K-nearest neighbors. The sparsity of each candidate sub-cluster is the inverse of its density. The density of one sub-cluster is defined as the ratio of the sample number of it to the covering area formed by it. If the density of the sub-cluster is small, it means that the dispersion of its internal instances is great. Thus, new instances synthesized in the candidate minority sub-cluster that has the higher sparsity will have the smaller possibility of overlapping with original minority instances. In addition, for each candidate minority sub-cluster, if the average Euclidean distance between it and majority instances within its K-nearest neighbors is smaller, new instances generated within it will contain more useful information for the classification. To synthesize more instances that are beneficial to the classification, CI-ASMOTE1 assigns the bigger oversampling weight to the candidate minority sub-cluster which has bigger sparsity and is closer to majority neighbors. The steps of this section are as follows.
Step 1. Calculate the sparsity of each sub-cluster in
Step 2. Calculate the average Euclidean distance
In the above formula,
Step 3. Assign the oversampling weight
Step 4. Adaptively determine the number
Among it,
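The allocation logic of these steps can be sketched as follows. The exact weight formula is not recoverable from the text, so this sketch assumes the simplest form consistent with the description: sparsity is the covering area divided by the sample number (the inverse of density), the raw weight is sparsity divided by the average distance to majority neighbors, and weights are normalized to sum to one. The function name `allocate_counts` and its arguments are hypothetical.

```python
def allocate_counts(sizes, areas, avg_dists, total_new):
    """Distribute total_new synthetic instances over candidate sub-clusters.

    sizes     -- number of instances in each sub-cluster
    areas     -- covering area of each sub-cluster
    avg_dists -- average distance to majority instances among the K neighbours
    """
    sparsity = [a / n for a, n in zip(areas, sizes)]       # inverse of density
    raw = [s / d for s, d in zip(sparsity, avg_dists)]     # ASSUMED combination
    z = sum(raw)
    weights = [r / z for r in raw]
    # Integer allocation: floor first, then give leftovers to the largest remainders.
    counts = [int(total_new * w) for w in weights]
    leftovers = total_new - sum(counts)
    order = sorted(range(len(weights)),
                   key=lambda i: total_new * weights[i] - counts[i],
                   reverse=True)
    for i in order[:leftovers]:
        counts[i] += 1
    return weights, counts


weights, counts = allocate_counts(sizes=[4, 2], areas=[8.0, 8.0],
                                  avg_dists=[2.0, 1.0], total_new=10)
```

In the example, the second sub-cluster is both sparser and closer to the majority class, so it receives most of the new instances.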
In this section, CI-ASMOTE1 adopts SMOTE to each candidate minority sub-cluster. SMOTE [15] randomly selects two different original instances
The above step is repeated until the corresponding number
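A minimal sketch of this synthesis step is given below. It applies the standard SMOTE interpolation between two instances drawn from the same sub-cluster; the function name `smote_within_cluster` and the `seed` parameter are illustrative assumptions, and the sub-cluster is assumed to hold at least two instances.

```python
import random


def smote_within_cluster(cluster, n_new, seed=0):
    """Generate n_new points, each on the segment between two cluster members."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(cluster, 2)        # two different original instances
        gap = rng.random()                   # random position in [0, 1)
        synthetic.append(tuple(ai + gap * (bi - ai) for ai, bi in zip(a, b)))
    return synthetic


cluster = [(0.0, 0.0), (2.0, 2.0)]
new_points = smote_within_cluster(cluster, n_new=5)
```

Because both endpoints come from the same sub-cluster, every synthetic point stays on a segment inside that sub-cluster.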
Since new instances generated by SMOTE only exist on the line of two selected minority instances, instances synthesized by CI-ASMOTE1 in the same sub-cluster may be similar. Considering this defect, CI-ASMOTE2 is further proposed in this paper. CI-ASMOTE2 improves the fourth procedure of CI-ASMOTE1. It uses two-step SMOTE (TSMOTE) instead of SMOTE to insert diversified and class-discriminating instances in each candidate minority sub-cluster. TSMOTE selects three original instances from the candidate minority sub-cluster to synthesize new instances. New instances will randomly exist in the region composed of the three original instances. The specific procedures of TSMOTE are as follows.
Step 1. Synthesize the intermediary instance
In this step,
Step 2. Synthesize new minority instance
Among it,
Step 3. Repeat the step1 to 2 until the number
Step 4. Output the balanced dataset by combining synthetic instances with original instances.
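The two-step idea can be sketched as follows. The exact formulas of Steps 1 and 2 are not recoverable from the text, so this sketch assumes the natural reading: an intermediary instance is interpolated between the first two selected instances, and the new instance is interpolated between the intermediary and the third. The name `tsmote` and the `seed` parameter are hypothetical, and the sub-cluster is assumed to hold at least three instances.

```python
import random


def tsmote(cluster, n_new, seed=0):
    """Two-step interpolation: points fall inside the triangle of three members."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x1, x2, x3 = rng.sample(cluster, 3)
        r1, r2 = rng.random(), rng.random()
        # Step 1 (assumed form): intermediary instance on the segment x1-x2.
        mid = [a + r1 * (b - a) for a, b in zip(x1, x2)]
        # Step 2 (assumed form): new instance on the segment intermediary-x3.
        synthetic.append(tuple(m + r2 * (c - m) for m, c in zip(mid, x3)))
    return synthetic


cluster = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
new_points = tsmote(cluster, n_new=20)
```

Each output is a convex combination of three originals, so the points spread over a region rather than along a single line, which is the source of the added diversity.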
The proposed CI-ASMOTE2 method is described in Algorithm 1.
Datasets and experiments
To validate the effectiveness of CI-ASMOTE1 and CI-ASMOTE2, we evaluated them on a simulated dataset and eight real-world datasets provided by the University of California Irvine (UCI). We compared them with five other oversampling methods. The five contrastive methods are SMOTE [15], Borderline-SMOTE [17], Safe-Level-SMOTE [20], Cluster-SMOTE [22] and A-SUWO [26]. The parameters of each oversampling method were set according to the corresponding reference, and the oversampling rate was 100%. The simulated dataset is a two-dimensional dataset. The minority class of the simulated dataset is composed of data whose mean value is [
The specific information of the nine imbalanced datasets used in the experiment
The confusion matrix [35], which summarizes the classification result of the imbalanced data, is shown in Table 2. In it, minority instances are marked as the positive class and majority instances are marked as the negative class. Different performance indexes can be derived from it. In this study, the evaluation indexes are F1-Measure (F1-M) and G-Mean (G-M).
The confusion matrix
F1-M is the harmonic mean of Precision and Recall. It reflects the performance of the classification model in identifying minority instances. Precision is the proportion of TP instances among the instances that are predicted to be the positive class. Recall, also called the true positive rate (TPR), is the proportion of actual positive instances that are correctly identified. A larger F1-M value means that the classification result of the minority class is more satisfactory. F1-M is computed according to the following formulae: Precision = TP / (TP + FP), Recall = TPR = TP / (TP + FN), and F1-M = 2 × Precision × Recall / (Precision + Recall).
The distribution of 8 real-world datasets in the first three feature spaces.
G-M is a comprehensive index. It reflects the classification results of both classes in the imbalanced data. G-M is the geometric mean of TPR and the True Negative Rate (TNR). TNR is the proportion of actual negative instances that are correctly classified. The value of G-M is high only if the recognition rates of majority instances and minority instances are both high. G-M can be computed as follows: TNR = TN / (TN + FP) and G-M = sqrt(TPR × TNR).
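Both indexes follow directly from the four confusion-matrix counts. The short sketch below uses the standard definitions; the function name `f1_and_gmean` and the zero-division guards are illustrative choices, not part of the original study.

```python
import math


def f1_and_gmean(tp, fn, fp, tn):
    """Return (F1-M, G-M) with the minority class as the positive class."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0          # TPR
    tnr = tn / (tn + fp) if tn + fp else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return f1, math.sqrt(recall * tnr)


balanced = f1_and_gmean(tp=8, fn=2, fp=2, tn=88)
all_negative = f1_and_gmean(tp=0, fn=10, fp=0, tn=90)
```

The second call models a classifier that predicts the majority class for every instance: overall accuracy would be 90%, yet both F1-M and G-M are zero, which is why these indexes are preferred for imbalanced data.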
To confirm the rationality of CI-ASMOTE1, we compared CI-ASMOTE1 with the method that uses SMOTE to oversample the safe minority sub-clusters (abbreviated SAFE). Table 3 shows the testing results of the two methods using NBC and RBFSVM in the 9 imbalanced datasets. The better results are highlighted in bold. In terms of F1-M and G-M, CI-ASMOTE1 has better performances on 7 out of 9 datasets when using NBC and on 9 out of 9 datasets when using RBFSVM. For the datasets Aba6 and Aba5 which are highly imbalanced and contain serious overlaps between classes, the SAFE method cannot increase the recognition rates of minority and majority instances anymore. On the whole, CI-ASMOTE1 is a more useful approach for imbalanced datasets that contain overlaps between classes.
The testing results of the SAFE method and CI-ASMOTE1 using NBC and RBFSVM in 9 imbalanced datasets
The results and analyses of the simulated dataset
In this study, we used the simulated dataset to compare the data distributions produced by CI-ASMOTE1, CI-ASMOTE2 and the five contrastive methods. Figure 2 shows the results, in which stars represent minority instances, hollow triangles represent majority instances and hollow circles represent synthetic minority instances. As shown in Fig. 2a, half of the minority instances overlap with the majority class. From Fig. 2b to Fig. 2e, some synthetic instances of SMOTE, Borderline-SMOTE, Safe-Level-SMOTE and Cluster-SMOTE are mixed into the majority class because the four methods oversampled overlapping instances. This shows that SMOTE and Borderline-SMOTE inevitably exacerbate the overlaps between the two classes, and that Safe-Level-SMOTE and Cluster-SMOTE are not significantly effective in reducing the generation of overlapping instances. As shown in Fig. 2f, A-SUWO removes original noisy instances before oversampling minority sub-clusters to reduce the occurrence of overfitting. However, it changes the data distribution characteristics of the simulated dataset. In Fig. 2g, all original minority instances are decomposed into different sub-clusters by CI-ASMOTE1, and new instances are only distributed within semi-safe minority sub-clusters that are relatively close to the decision boundary. This shows that CI-ASMOTE1 can completely prevent the overlaps between synthetic instances and majority instances. However, some new instances of CI-ASMOTE1 are similar, especially those synthetic instances in the same minority sub-cluster. The further proposed CI-ASMOTE2 reduces this weakness by using TSMOTE to synthesize instances. As shown in Fig. 2h, new instances in semi-safe minority sub-clusters containing more than two original instances are different from each other. 
To sum up, compared with the five contrastive methods, the two methods proposed in this paper keep all the original instances and avoid generating overlapping and noisy instances in the process of balancing the simulated dataset. Moreover, CI-ASMOTE2 increases the diversity and quality of synthetic instances more than CI-ASMOTE1 does.
Comparison of No-sampling and seven oversampling methods on the simulated dataset.
Table 4 shows the training and testing results of No-sampling, CI-ASMOTE1, CI-ASMOTE2 and the other five oversampling methods using NBC and RBFSVM on the simulated dataset. The optimal results are reported in bold. First, CI-ASMOTE2 achieves the best training and testing results in terms of F1-M and G-M with both NBC and RBFSVM. Second, when the classifier is NBC, CI-ASMOTE1 gets better training results than No-sampling and some comparison methods in terms of F1-M and G-M. It also achieves better testing performance than No-sampling and the other five contrastive methods in terms of the two evaluation indexes. Compared with the training values of CI-ASMOTE1, those of Safe-Level-SMOTE are higher by about 0.33% and 0.27% in terms of F1-M and G-M, respectively. This indicates that the differences between the two methods are not large. When the classifier is RBFSVM, the training and testing results of CI-ASMOTE1 are higher than those of No-sampling and the five contrastive methods in terms of F1-M and G-M. These findings suggest that CI-ASMOTE1 improves the recognition rate of the minority class and the overall classification result of this simulated dataset more effectively than the other five contrastive methods, because CI-ASMOTE1 does not generate new minority instances that overlap with majority instances. This verifies the rationality of the principle of the CI-ASMOTE1 method. Moreover, CI-ASMOTE2 outperforms CI-ASMOTE1 and the five contrastive methods, which illustrates that CI-ASMOTE2 can further improve the classification result of this simulated dataset since it increases the diversity of synthetic instances. Although SMOTE, Borderline-SMOTE, Safe-Level-SMOTE and Cluster-SMOTE also perform well on the training set, they do not perform well on the testing set, especially when the classifier is RBFSVM. 
This shows that the above four contrastive methods cause the classification model to overfit because they expand the overlapping regions of the training set. The reason why A-SUWO does not obtain better results is that the noisy instances it removes may be vital for the classification. Generally, compared with the other five oversampling methods, CI-ASMOTE1 and CI-ASMOTE2 can reduce overfitting caused by overlapping synthetic instances in this simulated dataset.
The F1-M and G-M values of No-sampling and seven oversampling methods using NBC and RBFSVM in the simulated dataset
Table 5 gives the training and testing results of No-sampling and seven oversampling methods using NBC and RBFSVM on the 8 real-world datasets. The best performances are highlighted in bold. When the classifier is NBC, CI-ASMOTE1 and CI-ASMOTE2 get better training results than the other five oversampling methods for 3 out of 8 datasets in terms of F1-M and for 4 out of 8 datasets in terms of G-M. When the classifier is RBFSVM, CI-ASMOTE2 obtains the best training results for 4 out of 8 datasets in terms of F1-M and G-M, and CI-ASMOTE1 acquires better training results than the five contrastive methods for 3 out of 8 datasets in terms of F1-M and G-M. In other cases, the training results of CI-ASMOTE1 and CI-ASMOTE2 are also close to the optimal results. Moreover, it is worth noting that CI-ASMOTE2 has better training results than CI-ASMOTE1 for most of the datasets when using NBC and RBFSVM. These findings show that the two proposed methods generate more useful synthetic instances than the contrastive methods and confirm that the quality of new instances generated by CI-ASMOTE2 is better than that of CI-ASMOTE1.
The F1-M and G-M values of No-sampling and seven oversampling methods using NBC and RBFSVM in 8 real-world datasets
For some of the real-world datasets used in this study, although some contrastive oversampling methods have better training results than CI-ASMOTE1 and CI-ASMOTE2, their classification models cannot classify the unknown testing set well. When the classifier is NBC, compared with No-sampling, SMOTE has lower testing results for the dataset Glass and Borderline-SMOTE has lower testing results for the datasets Glass, Yeast, Libra and Aba6. When using NBC, Safe-Level-SMOTE, Cluster-SMOTE and A-SUWO get worse testing performance than No-sampling on the datasets Heart and Glass. When the classifier is RBFSVM, SMOTE and Borderline-SMOTE achieve worse testing results than No-sampling for the dataset Yeast in terms of F1-M. Besides, the five contrastive methods get lower testing values than No-sampling for the dataset Ecoli in terms of F1-M when using RBFSVM. These findings suggest that the five contrastive methods are prone to cause overfitting in classification models when the dataset contains overlaps between classes.
In contrast, CI-ASMOTE1 and CI-ASMOTE2 perform well in the testing set of these real-world datasets because they do not remove any original instances before oversampling and can avoid the generation of overlapping and noisy instances. According to Table 5, CI-ASMOTE1 and CI-ASMOTE2 have better testing results than No-sampling for the 8 real-world datasets when using NBC and RBFSVM. When the classifier is NBC, CI-ASMOTE1 has bigger testing values than five contrastive methods for 6 out of 8 datasets in terms of F1-M, 4 out of 8 datasets in terms of G-M. When using RBFSVM, CI-ASMOTE1 gets better testing results than five comparison methods for 5 out of 8 datasets in terms of F1-M and has better testing performances than some contrastive methods in terms of G-M. In short, CI-ASMOTE1 can better increase the recognition rate of the minority class and further improve the overall classification result and better address the issue of overfitting than other contrastive methods for most of the datasets. This again verifies the rationality of the CI-ASMOTE1 method. Moreover, it is worth noting that CI-ASMOTE2 has better testing results than CI-ASMOTE1 for all datasets, whether the classifier is NBC or RBFSVM. For the testing results of the 8 real-world datasets, CI-ASMOTE2 gets the best F1-M and G-M values in most cases. Specifically, when the classifier is NBC, CI-ASMOTE2 gets the biggest F1-M values for all these real-world datasets and achieves the largest G-M values for 7 out of 8 datasets. For the dataset Aba5, CI-ASMOTE2 does not perform as well as other contrastive methods in terms of G-M, but the differences between them are small. When using RBFSVM, CI-ASMOTE2 obtains the best F1-M values for 8 real-world datasets and gains the biggest G-M values for 5 out of 8 datasets. For the datasets Seg, Yeast and Aba5, its performances on G-M are lower than some comparison methods, but the differences between them are small. 
These findings show that the classification models established after the oversampling process of CI-ASMOTE2 can better identify minority instances and majority instances in these real-world datasets than other methods. In solving overfitting, CI-ASMOTE2 is superior to CI-ASMOTE1 because CI-ASMOTE2 can increase the diversity of synthetic instances in each semi-safe minority sub-cluster.
Generally, in solving the classification of real-world imbalanced datasets containing overlaps between classes, CI-ASMOTE1 and CI-ASMOTE2 can alleviate the issue of overfitting and improve classification performance by avoiding the generation of new instances that overlap with the majority class and generating diverse instances within minority sub-clusters that are relatively close to the decision boundary. The two methods proposed in this paper have general applicability to real-world imbalanced datasets, so they can be alternative oversampling methods for imbalanced data classification.
Conclusion
In this paper, we proposed the CI-ASMOTE1 method, which aims to avoid increasing the occurrence of overlaps between majority instances and minority instances. CI-ASMOTE1 includes two steps. The first step is to decompose minority instances into sub-clusters according to their connectivity in the feature space by EDCA. The second step is to select minority sub-clusters that are relatively close to the decision boundary as candidate regions to oversample. Based on CI-ASMOTE1, CI-ASMOTE2 is also proposed to improve the diversity of the synthetic instances by evenly oversampling instances within the region of each selected sub-cluster, so that the synthetic instances can include more class-discriminative information.
The two methods were evaluated on a simulated imbalanced dataset and eight real-world datasets with different imbalance ratios, and were compared with SMOTE, Borderline-SMOTE, Safe-Level-SMOTE, Cluster-SMOTE and A-SUWO by using NBC and RBFSVM. The experimental results show that our proposed methods are better than these contrastive methods in alleviating overfitting caused by overlaps between synthetic minority instances and majority instances, especially when the original imbalanced datasets have overlaps between classes. The better performance of CI-ASMOTE2 compared with CI-ASMOTE1 in many cases shows that CI-ASMOTE2 increases the class-discriminative information by improving the diversity of the new instances. The work done in this paper encourages future studies on reducing the occurrence of overlaps between the minority class and the majority class, increasing the diversity of the synthetic instances, and providing good class discrimination for high-performance classification. During this research, we found that many real-world data contain noisy and missing values, which affects the whole method. Therefore, in the future we will consider this data problem as a new research direction for imbalanced data classification.
Acknowledgments
We are grateful for the support of the Open Research Fund of Beijing Key Laboratory of Big Data Technology for Food Safety Project under Grant (No. BTBD-2019KF02).
