Abstract
Imbalanced problem is concerned with the performance of classifiers on the data set with severe class imbalance distribution. For two-class, the examples can be categorized into majority class or minority class, and the cost of misclassifying minority class examples is often much higher than the contrary cases. However, traditional classifiers do not work well on the imbalanced problem due to the assumption that the number of each class examples is similar to each other. To handle these problems, this paper proposes a simple but effective ensemble learning method based on feature projection and under-sampling (EFPUS). EFPUS learns an ensemble through the following two steps: (1) under-sampling several subset from majority class and learning a novel projection matrix from each subset, and (2) constructing new training sets by projecting the original training set to different spaces defined by the matrixes and learning a class-imbalance oriented classifier from each new training set. For the first step, feature projection and under-sampling mainly aim to improve the diversity between ensemble members. With respect to the second step, the base models can be learned by any traditional class-imbalance oriented learning method such as USBagging, SMOTEBoost and BalanceCascade. Experimental results show that, compared with other state-of-the-art methods, EFPUS shows significantly better performance on measures of g-mean, f-measure, AUC, recall and accuracy.
Introduction
Class imbalance problems are often encountered in real world, rising from medical diagnosis, risk management, fraud detection and other domain applications [1]. For two-class, the class imbalance data is characterized as having much less examples of one class (minority class) than the other (majority class), and the cost of misclassifying minority class examples is much expensive than the contrary cases [2, 3, 4]. However, most conventional classification methods try to achieve high accuracy with the assumption of balanced class distribution, i.e., the number of examples in any class is similar to each other, which leads to the fact that the minority class examples are often ignored and misclassified to majority class.
Many approaches have been proposed to handle the class imbalance problem, and sampling technique and ensemble learning are two candidates of the most often used methods. Sampling techniques including under-sampling and over-sampling are often used to pre-process data before learning models and the objective of sampling techniques is to balance the class distribution to eliminate the harms of imbalanced problem by sampling the data space. Under-sampling techniques try to eliminate the harms of imbalanced distribution by removing the intrinsic examples in the majority class. Random under-sampling (RUS) is the simplest yet very effective method, which randomly eliminating majority class examples [5, 6]. On the contrary to under-sampling, over-sampling techniques including randomly over-sampling (ROS) and SMOTE aim to create new minority class examples, where ROS randomly duplicates the minority class examples and SMOTE generates new synthetic examples along the line between the minority examples and their selected nearest neighbors [7, 8].
Ensemble learning, which has often used to solute challenging issues such as image recognition [9, 10, 11, 12], is an effective technique for dealing with imbalanced problem [13, 14, 15, 16, 17]. Classifier ensembles for imbalanced data can be categorized into two levels: iterative based ensembles and parallel based ensembles. Boosting is the most well-known iterative based ensemble and often used to solve class-imbalance problem by embedding techniques for data preprocessing into its learning process. In such a manner, these methods alter and bias the weight distribution to train the next classifier toward the minority class, examples including SMOTEBoost [15] and RUSBoost [16] algorithms. Parallel based ensembles refer to ensemble models in which each base classifier can be trained in parallel. As a member of parallel based ensembles, bagging is often combined with other class-imbalance oriented methods to handle imbalanced problem. Examples includes OverBagging, UnderBagging and UnderOverBagging [14], where OverBagging learns each base classifier on a data set obtained by over-sampling (or generating from) minority class. On the contrary to OverBagging, UnderBagging learns each member on under-sampled subset from majority class. UnderOverBagging makes use of both over-sampling and under-sampling techniques to improve ensemble diversity.
In this paper, we propose a novel ensemble method based on feature projection and under-sampling (EFPUS) to improve the performance of conventional class-imbalance oriented methods by pre-processing the training set. EFPUS is a parallel based ensemble, which constructs several new training sets and trains a base class-imbalance oriented classifier that can be a single model or an ensemble using each training set. These training sets are obtained though the following three steps: under-sampling several subsets from majority class, learning a feature projection matrix from each subset, and projecting the original training set to new spaces defined by the projection matrixes to obtain new training sets. Therefore, these new training sets are also imbalanced data sets. EFPUS mainly aims to obtain larger diversity between ensemble members, and thus improves the ensemble performance on imbalanced data. In order to further improve the diversity between ensemble members, following rotation forests [18], the original feature set is randomly split into
Employing both feature projection and under-sampling technique to deal with imbalanced data to generate new training sets with diversities; Proposing a novel ensemble method based on feature projection and under-sampling (EFPUS) to improve ensemble performance on imbalanced data sets; Extensive experiments are conducted and the results show that, compared with other state-of-the-art classification methods, EFPUS shows much better performance on measures of g-mean, f-measure, AUC and recall.
The rest of this paper is organized as follows: after presenting the strategies of handling imbalanced problem in Section 2, Section 3 describes the proposed ensemble method, Section 4 presents the experimental results and, finally, Section 5 concludes this work.
The imbalanced problems rise from the scare representation of the most important examples, which leads to the fact that learned models tend to focus more on normal examples, overlooking the minority class examples. Numerous techniques have been developed to deal with the problem of class-imbalance data sets, which can mainly categorized into the following three groups: data level, algorithm level and ensemble learning.
Data level
These techniques in data level try to rebalance the given imbalanced data through resampling data space such that conventional learning methods pay more attention on minority class. Reported studies of data level can be further subdivided two types: resampling and weighting the data space. Resampling techniques aim to alleviate the effect of class imbalance distribution through sampling data space to rebalance the corresponding imbalanced data set. Commonly used sampling techniques are falling to the following three categories: over-sampling methods, under-sampling methods and hybrid method. Over-sampling techniques try to create new minority class examples to eliminate the harms of imbalanced problem. Randomly duplicating the minority samples and SMOTE [6, 7] are the two most popular examples of over-sampling techniques. Under-sampling techniques such as Random Under-sampling (RUS), the simplest yet most effective method, tries to eliminate the harms of class-imbalance distribution through removing the examples of the majority class. The hybrid method is a combination of over-sampling and under-sampling. Japkowicz [19] discussed the sampling technique and observed that the technique was very effective and furthermore, she noted that using sophisticated sampling techniques does not provide any clear advantage in solving the class imbalance problem. Moreover, the researches carried out by Mani and Zhang [20] showed that the random under-sampling approaches often outperform complicated under-sampling methods. Batista et al. [21] and Xie and Qiu [22] both showed that over-sampling usually perform better than under-sampling. Estabrooks et al. [23] suggested that a combination of over-sampling and under-sampling might be more effective to solve the class imbalance problem. The strategies of weighting the data space aim to modify the training set distribution using information concerning the misclassification costs [24], cost-sensitive methods [25] and an ensemble of SVM with asymmetric misclassification costs [26].
Algorithm based level
Algorithm level approaches (also called internal) aims to improve the learning ability of existing classification algorithms to improve the classification performance for imbalanced data [26, 27, 28]. These methods require special knowledge of both the corresponding classifier and the application domain, comprehending why the classifier fails when the class distribution is uneven. Examples include enhancing the discriminatory power of the classifiers using kernel transformation [29, 30], converting the training goal to a well-deigned objective function that more severely penalize errors on minority class [28, 31, 32] and clustering algorithms [33, 34].
Ensemble learning
Classifier ensembles, also named multiple classifier systems, are known to enhance the performance of a single classifier by combining several base classifiers [35]. According to Galar et al. [36], classifier ensembles for class imbalance problem can be grouped into three categories: 1) bagging-, 2) boosting-, and 3) hybrid-based approaches.
Many bagging-based ensembles such as OverBagging, UnderBagging and UnderOverBagging [14] have been proposed to deal with data imbalance due to their simplicity and good generalization ability, where OverBagging learns each base classifier on a data set obtained by over-sampling (or generating from) minority class. On the contrary to OverBagging, UnderBagging learns each member on under-sampled subset from majority class. UnderOverBagging makes use of both over-sampling and under-sampling techniques to improve ensemble diversity. Boosting-based ensembles embed techniques for data preprocessing into boosting algorithms. In such a manner, these methods alter and bias the weight distribution used to train the next classifier toward the minority class, examples including SMOTEBoost [15, 37] and RUSBoost [16] algorithms. SMOTEBoost introduces synthetic examples using the SMOTE data preprocessing algorithms. The weights of the new examples are proportional to the total number of examples in the new data set. RUSBoost performs similarly to SMOTEBoost, but it removes examples from the majority class by randomly under-sampling the data set. A hybrid ensemble is an ensemble of ensembles, and EasyEnsemble and BalanceCascade are the two specific examples of this category [13]. Both EasyEnsemble and BalanceCascade use bagging as the main ensemble learning method, but in spite of training a classifier for each new bag, they train each bag using Adaboost. The difference between the two methods is the way in which they treat the examples. EasyEnsemble simply under-samples several subsets and learns an Adaboost on each subset. Hence, all the Adaboosts can be trained in parallel. BalanceCascade trains each Adaboost sequentially, where in each step the majority class examples which are correctly classified by the current trained learners are removed from further consideration.
Leaning process
This section proposes a novel ensemble method based on feature projection and under-sampling (EFPUS) for class imbalance problem. EFPUS learns several projection matrixes and using each matrix, obtains new training sets by projecting the training set to the space defined by the matrixes and trains a model on each new training set using conventional class-imbalance oriented learning methods. Therefore, the key of EFPUS is how to construct new training sets for learning base classifiers.
Let
Randomly sample a subset Split For every such subset Organize the obtained vectors with coefficients in a sparse projection matrix
Obtain
The method of constructing new training sets is similar to rotation forest [18], with difference in class manipulation and the sampling techniques, i.e., HE learns projection matrixes focusing more on minority class. The process of EFPUS is as shown in Algorithm 1. The
The running time of EFPUS(the proposed method) is mainly dominated by learning the feature projection matrix (lines 5
Dietterich stated that a necessary and sufficient condition for an ensemble of classifiers to be more accurate than any of its individual members is if the classifiers are accurate and diverse [38]. Here, EFPUS is mainly used to improve ensemble diversity through feature projection and under-sampling.
Using the projection heuristic, the number of different partitions of the feature set with
where
For instance, the chance of all different classifiers in an ensemble of
Under-sampling is a class-imbalance learning method which uses only a subset of majority class and thus is very efficient for imbalanced learning problem [26, 38]. Under-sampling technique contributes to the diversity of ensemble members in the processes of learning projection matrixes (line 2, Algorithm 1) and the larger the ratio between the size of negative set and positive set, the larger the diversity of ensemble members. When the number of majority class examples is equal to that of minority class, the diversity of ensemble members is determined by the process of learning projection heuristic. Under-sampling can also make base learner bias towards minority class through under-sampling majority class such that the learned projection matrixes capture the minority class characterization.
Experimental setup
In order to facilitate the experiment, we design a platform named Loyang Spoon (LySpoon) based on python (Please contact hpguo_cm@163.com for the related source code).
Twenty-three very imbalanced data sets were selected from the KEEL imbalanced data sets [39]. Many data sets have some missing attribute values, such as breast-cancer, hepatitis and sick. C4.5, the base classifier adopted in this paper, handles missing attribute values automatically. More details about data sets see Table 1, where #Attrs, #Exps, and #IR are the attribute number, example number, and imbalance ratio defined as the ratio between the size of minority class set and that of majority class set respectively.
The description of the data sets used in this paper
The description of the data sets used in this paper
To evaluate the performance of the proposed method EFPUS, 10-fold cross-validation strategy was conducted: each data set was divided into ten folds with similar number of examples, and for each fold, the other nine folds are used to train a model and the current fold is used to test the model [41]. We repeated ten times of the 10-fold cross-validation, and therefore, 100 learners were trained by a learning algorithm on each data set.
We selected RUSC45, SMOTEC45, RUSBagging, SMOTEBagging, RUSBoost, SMOTEBoost, EasyEnsemble and BalanceCascade as comparing methods to test the performance of EFPUS:
RUSC45 (RC) [38] pre-processed training sets using random under-sampling technique to obtain relatively balanced data set and then learned a model using C4.5 on the balanced set. EFPUS-RUSC45 (EFPUS-RC). RUSC45 (RC) was used as base learners of EFPUS and SMOTEC45 (SC) [38] pre-processed training set using SMOTE [7] to obtain relatively balanced data set and then learned a model using C4.5 on the balanced set. EFPUS-SMOTEC45 (EFPUS-SC). SMOTEC45 (SC) was used as the base learner and RUSBagging (RB) [14] learned each member on under-sampled subset from majority class. C4.5 was selected as the base learner. The number of iterations was set to be 40. EFPUS-RUSBagging (EFPUS-RB). RUSBagging (RB) was treated as the base learner of EFPUS and C4.5 is selected as the base learner for RUSBagging. We set SMOTEBagging (SB) [14] learned each member on the rebalanced data by over-sampling training set using SMOTE and C4.5 was selected as the base classifier learner. The number of iterations was set to be 40. EFPUS-SMOTEBagging (EFPUS-SB) is similar to EFPUS-RB except that SMOTEBagging (SB) was used as the base learner. RUSBoost (RBT) [16] combined under-sample technique into the boost learning process. The number of iterations was set to be 40, and for each iteration, a subset with size equal to minority class was sampled (without replacement) from majority class. Then, C4.5 was used to train a classifier using the subset and minority class set. EFPUS-RUSBoost (EFPUS-RBT) treated RUSBoost (RBT) to lean base classifier and C4.5 as the base learner for RUSBoost. Similar to EFPUS-SB, we set SMOTEBoost (SBT) [15] used SMOTE to get new minority class examples. C4.5 was used to train base classifiers. The number of iterations was 40. The EFPUS-SMOTEBoost (EFPUS-SBT) selects SMOTEBoost (SBT) as base classifier. C4.5 was used to train base classifiers of SBT. Similar to EFPUS_RBT, we set EasyEnsemble (EE) [13] samples EFPUS-EasyEnsemble (EFPUS-EE) used EasyEnsemble (EE) to learn BalanceCascade (BC) [13] is similar to EasyEnsemble except that it removes correctly classified major class examples of trained Adaboosts from further consideration. C4.5 was selected to train weak classifiers, and we set EFPUS-BalanceCascade (EFPUS-BC) was similar to EFPUS-EE except that it used BC as base classifier. We set
To evaluate the performance of EFPUS (the proposed method), RUSC45 (RC), SMOTEC45 (SC), RUSBagging (RB), SMOTEBagging (SB), RUSBoost (RBT), SMOTEBoost (SBT), EasyEnsemble (EE) and BalanceCascade (BC) were selected as comparing methods (more details refer to 4.1). The corresponding results were reported in nine tables, where Tables 2, 4, 6, 8 and 10 report the summary results of the comparing methods on the measures of g-mean, f-measure, AUC, recall, and accuracy, and Tables 3, 5, 7 and 9 show the ranks of the compared methods on g-mean, g-measure, AUC and recall respectively, where the last column of the nine tables shows the average performance of algorithms on each data set. In these tables with even numbers, EFPUS is compared to the corresponding base classifier and the value in parentheses next to a result is the
Tables 2 and 3 report the summary results and the ranks of the 16 algorithms on the measure of g-mean respectively. We statistically compared the performance of algorithms using pair-wise
The performance of compared algorithms on g-mean
The performance of compared algorithms on g-mean
The ranks of compared algorithms on g-mean
The performance of compared algorithms on f-measure
The ranks of compared algorithms on f-measure
The performance of compared algorithms on AUC
The ranks of compared algorithms on AUC
The performance of compared algorithms on Recall
The ranks of compared algorithms on recall
The performance of compared algorithms on Accuracy
Tables 4 and 5 report the summary results and the ranks of the sixteen algorithms on the measure of f-measure respectively. The pair-wise
Tables 6 and 7 show the experimental results of compared algorithms on measure of AUC, where Table 6 summarizes the results of different algorithms and Table 7 reports the ranks of the algorithms. From Table 6, we can observe that EFPUS statistically outperforms RC, SC, RB, SB, RBT, SBT, EE and BC on 23, 21, 16, 17, 22, 22, 17 and 17 out of the 23 datasets where the pair-wise
Tables 8 and 9 depict the recalls and ranks of the five comparing methods respectively. Similar to the results above, EFPUS significantly wins on most of the 23 sets comparing to other methods. The average recalls (ranks) of EFPUS-RC, RC, EFPUS-BC, BC, EFPUS-RBT, RBT, EFPUS-SBT, SBT, EFPUS-EE, EE, EFPUS-BC and BC are 0.900 (3.57), 0.860 (10.11), 0.830 (10.13), 0.780 (14.48), 0.900 (3.80), 0.890 (5.96), 0.820 (10.93), 0.780 (14.09), 0.910 (2.13), 0.860 (8.91), 0.830 (10.72), 0.770 (14.72), 0.900 (4.96), 0.870 (7.96), 0.900 (4.43) and 0.880 (8.15) respectively.
Though accuracy is inadequate to evaluate the classification performance, poor accuracy means a bad classifier. An efficient classifier may improve g-mean, f-measure, AUC or recall without decreasing accuracy. Table 10 presents the accuracy of the comparing algorithms. As shown in Table 9, EFPUS statistically outperforms RC, SC, RB, SB, RBT, SBT, EE and BC on 23, 23, 21, 23, 20, 21, 16 and 17 out of the 23 datasets with the pairwise
All the experimental results we provided above indicate that EFPUS is an efficient method to improve the generalization performance of conventional class-imbalance oriented methods, no matter the method is a single model, an ensemble or a hybrid ensemble. Moreover, the results show that constructing ensemble through feature projection and under-sampling can improve the performance of imbalanced learning and is worth of further research.
The effects of missing value on the performance of EFPUS.
The effects of noise on the performance of EFPUS.
We investigate in this section the effects of missing values and noise on the performance of the EFPUS (the proposed method). In these experiments, classifiers were built using corrupted versions of the original data. The data set “yeast-0-2-5-6_vs_3-7-8-9” was selected as the candidate and EFPUS-RC was selected as the candidate of EFPUS. For testing the robust of EFPUS on missing value, the attribute values of each instance of the training set are set to be missing value with
For testing the effects of noise for EFPUS, class labels of a fixed percentage
Conclusion
In this paper, we propose a simple but effective ensemble method through feature projection and under-sampling (EFPUS) to tackle the imbalanced problem. The main heuristic in EFPUS consists of sampling subsets from majority class, learning a projection matrix from each subset, obtaining training sets by projecting the original training set to new spaces defined by the matrixes, and constructing a conventional class-imbalance oriented method on each new training set. Therefore, EFPUS mainly obtains the diversity between ensemble members through feature projection and under-sampling. Experimental results show that EFPUS can significantly improve traditional class-imbalanced oriented method for imbalanced data sets on the measure of g-mean, f-measure, AUC, recall and accuracy, no matter the method is a single model, an ensemble or a hybrid ensemble.
The main issue in this paper is to use sampling technique and feature projection to create new training sets, and therefore one of the future work is to carry out more experiments to study the effects of differ sampling technique and feature projection. Besides, it is very important to gain deep insight into behavior of the algorithm clearly, and thus another future work is to carry out simulation for studying the behavior of the proposed algorithm.
Footnotes
Acknowledgments
This work is supported in part by the National Natural Science Foundation of China (No. 61602422, No. 61501393), part by Project of Science and Technology Department of Henan Province (No. 182102210132) and in part by Nanhu Scholars Program for Young Scholars of XYNU.
