Abstract
Research on mining imbalanced datasets has grown rapidly in recent years, owing to both the difficulty of the problem and its wide range of real-world applications. A dataset is said to be imbalanced if the categories of the classification attribute are not evenly represented. A well-balanced dataset is an important basis for a classifier to build a good prediction model. Existing classifiers tend to perform poorly on imbalanced datasets, because they seek to optimize their overall accuracy without considering the relative distribution of each class. Hence, it is essential to balance the dataset before classification. In this paper, the comprehensive Enhanced Minority Oversampling TEchnique (EMOTE) is proposed to improve the performance of the classifier by balancing the dataset. The key idea of the proposed method is to balance the dataset by turning the misclassified instances of the minority class into correctly classified instances through oversampling their nearest neighbours. To investigate the performance of the proposed method, it is compared with different oversampling and undersampling methods, including the well-known SMOTE (Synthetic Minority Oversampling TEchnique). Various imbalanced datasets from the UCI Machine Learning Repository are used in the experiments. The experimental results show that EMOTE outperforms the other methods in balancing the dataset, and that the classifier effectively improves its performance on the dataset generated by EMOTE.
Introduction
A dataset is said to be imbalanced if the values of the class attribute are not evenly represented [7]. The imbalance ratio is reported to be about 100 to 1 in fraud detection and as high as 100,000 to 1 in other applications [19]. Many attempts have been made to deal with imbalanced datasets in domains such as telecommunication management [10], detection of fraudulent calls in telecom, text classification [15] and detection of oil spills in satellite images [12]. Recent research has found that machine learning algorithms perform poorly on imbalanced datasets [18]. Therefore, a good technique for balancing the dataset is needed.
Classifier and imbalanced dataset
Classification is one of the most widely accepted machine learning and data mining techniques; it is used to mine data and make predictions about the future. A well-built classifier predicts the class to which a new instance belongs. Classification techniques generally presume that the instances are evenly distributed among the classes, and a classifier performs well on datasets that are uniformly distributed among the classes. Real-world datasets, however, usually have an imbalanced class distribution [7]. The class imbalance problem occurs when the training dataset contains many more instances of one class (the majority class) than of the other class (the minority class).
Classifiers built on such imbalanced data perform well on the majority class and very poorly on the minority class [21]. However, in many real cases the minority class instances are the most important ones for prediction. Real-world applications such as fraud analysis on bank loans, medical research analysis and telecommunication churn analysis have the imbalanced class distribution problem.
Weka and IKVM
Weka [27] is a collection of machine learning algorithms for data mining tasks. It also provides visualization tools for data analysis, together with a GUI for easy access to these utilities. Weka supports data pre-processing, association, regression, classification and clustering. It is developed in Java, and its functions can be called directly from Java code.
IKVM [28] is a tool for running Java code on .NET. Jeroen Frijters, the Technical Director of Sumatra Software, based in the Netherlands, is the main contributor to IKVM.NET. It allows Java classes to be called from .NET code. Using IKVM, the Weka jar file can be converted to a Weka DLL in a single compilation. Thereafter, all Weka classes can be used from .NET applications.
This paper addresses the inefficiency of classifiers on imbalanced datasets. To solve this problem, an Enhanced Minority Oversampling TEchnique (EMOTE) is defined to balance the dataset, which in turn improves the performance of the classifier. In the rest of the paper, the efficiency of the proposed method is demonstrated through various experiments with different classifiers on different datasets retrieved from the UCI Repository.
The paper is organized as follows: Section 2 reviews the related work on imbalanced datasets. Section 3 describes the proposed method. Section 4 presents the experimental results, which demonstrate the efficiency of the proposed method, together with a comparative study between the proposed method and other widely accepted methods. Section 5 concludes the study.
Related work
As real-world data often have the imbalance problem, researchers have proposed several methods to solve this issue. These methods address the imbalance in one of two ways, i.e., at the algorithm level or at the data level. At the data level, the instances of the imbalanced dataset are divided into a minority class and a majority class. A classifier may misclassify instances when it disregards the minority class. To resolve this problem, undersampling and oversampling techniques have been developed.
Materials and methods
In this section, the comprehensive
The key idea of the proposed method is that, “

The flow starts by considering the imbalanced data set (Ai) as input data set. In addition to this, the method also takes one additional parameter namely Number of Nearest Neighbours (
As a first task, based on the result of the classifier, the performance measures like Accuracy, Sensitivity and Specificity are calculated. From the calculated measures the minority class is identified. To improve the accuracy on part of the minority class, the instances of the actual dataset (
As a next task, the misclassified instance of minority class is considered for tuning. To convert the misclassified instance into correctly classified instance, initially the nearest neighbours of the misclassified instance are retrieved from the dataset
As a final task, the classifier is rebuilt on the improved dataset. The above-mentioned primary and secondary tasks are repeated until there is an improvement in the classifier accuracy.
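The loop described above can be illustrated with a minimal sketch. It is not the authors' C#/WEKA implementation, and several details are assumptions made only for illustration: a leave-one-out k-nearest-neighbour vote stands in for C4.5, the minority class is taken to be the least frequent label, neighbours are drawn from the minority class only, and the loop stops as soon as the minority-class accuracy no longer improves. The names `emote_like_balance`, `n_neighbours` and `max_rounds` are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train, x, k):
    """Majority vote among the k training instances closest to x."""
    nearest = sorted(train, key=lambda inst: math.dist(inst[0], x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def leave_one_out_predictions(data, k):
    """Predict every instance from all the other instances."""
    return [knn_predict(data[:i] + data[i + 1:], x, k)
            for i, (x, _) in enumerate(data)]


def minority_accuracy(data, preds, minority):
    """Fraction of minority instances that are predicted correctly."""
    hits = [p == y for (_, y), p in zip(data, preds) if y == minority]
    return sum(hits) / len(hits)


def emote_like_balance(data, k=3, n_neighbours=1, max_rounds=10):
    """Oversample neighbours of misclassified minority instances (sketch)."""
    data = list(data)
    # Assumption: the minority class is simply the least frequent label.
    counts = Counter(y for _, y in data)
    minority = min(counts, key=counts.get)
    preds = leave_one_out_predictions(data, k)
    best = minority_accuracy(data, preds, minority)
    for _ in range(max_rounds):
        # Minority instances that the stand-in classifier gets wrong.
        missed = [x for (x, y), p in zip(data, preds)
                  if y == minority and p != minority]
        if not missed:
            break
        pool = [x for x, y in data if y == minority]
        candidate = list(data)
        for x in missed:
            # Replicate the nearest minority neighbours of each missed instance.
            others = [m for m in pool if m != x]
            others.sort(key=lambda m: math.dist(m, x))
            candidate.extend((m, minority) for m in others[:n_neighbours])
        cand_preds = leave_one_out_predictions(candidate, k)
        score = minority_accuracy(candidate, cand_preds, minority)
        if score <= best:  # assumed stopping rule: no further improvement
            break
        data, preds, best = candidate, cand_preds, score
    return data, best


# Toy one-dimensional data: five majority (0) and three minority (1) instances.
toy = [((0.0,), 0), ((1.0,), 0), ((2.0,), 0), ((3.0,), 0), ((4.0,), 0),
       ((4.8,), 1), ((5.5,), 1), ((9.0,), 1)]
balanced, score = emote_like_balance(toy, k=3, n_neighbours=1)
print(len(toy), "->", len(balanced), "instances; minority accuracy:", score)
```

Replicating existing neighbours rather than deleting misclassified instances keeps every original instance in the dataset, which is the property the method relies on.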
Experiments and results
To test the performance of the proposed method, the proposed algorithm was implemented in C#. The WEKA tool was used for the data mining process. Using IKVM, the file WEKA.jar was converted to WEKA.dll so that the WEKA classes could be used from C#. To build the classifier, the C4.5 algorithm [22] (J48 in WEKA) is used in the developed code. Eight datasets from the UCI Machine Learning Repository, namely Pima Indian Diabetes Data, Haberman’s Survival Data, SPECT (Heart Data), German Credit Data, Churn Telecom Data, Bupa Data, Adult Data and Mammography Data [2], and a dataset named Oil Dataset provided by Robert Holte [12] are used in the experiments. The oil dataset was personally requested and used in the experiments, since the well-known SMOTE method was evaluated on this dataset. These nine datasets were selected because they exhibit various imbalance ratios between the majority and minority classes. The characteristics of the nine datasets are shown in Table 1.
Characteristics of data sets used in the experiments
Ten-fold cross-validation is used, so that in each fold 90% of the data forms the training set and 10% forms the testing set. To evaluate the efficiency of the proposed method, the following tests and comparisons with existing methods were carried out: (i) performance of the C4.5 classification algorithm on various actual imbalanced and EMOTE-balanced datasets; (ii) performance of various classification algorithms on the actual imbalanced Bupa dataset and the EMOTE-balanced Bupa dataset; (iii) performance comparison of EMOTE with FDUS + SMOTE; (iv) ROC performance evaluation of EMOTE; (v) AUC performance comparison of EMOTE with CMTNN + SMOTE; and (vi) performance comparison of EMOTE with SMOTE.
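The 90%/10% partition produced by ten-fold cross-validation can be sketched as follows. This is a plain, unstratified split written for illustration only; the function name `ten_fold_splits` and the fixed `seed` are assumptions, and the sketch does not reproduce how WEKA builds its folds.

```python
import random


def ten_fold_splits(n_instances, n_folds=10, seed=0):
    """Yield (train_indices, test_indices) pairs, one per fold."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices round-robin into n_folds folds.
    folds = [indices[f::n_folds] for f in range(n_folds)]
    for held_out in range(n_folds):
        test = folds[held_out]
        train = [i for f, fold in enumerate(folds) if f != held_out
                 for i in fold]
        yield train, test


# With 100 instances, every fold trains on 90 and tests on 10.
splits = list(ten_fold_splits(100))
for train, test in splits[:2]:
    print(len(train), "training /", len(test), "testing instances")
```

Every instance is used for testing exactly once, so the reported measures are averaged over all of the data rather than over a single held-out portion.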
To evaluate the performance of the proposed EMOTE, as a first step the C4.5 algorithm was executed on the actual imbalanced dataset using the Weka tool and the results were recorded. As a second step, EMOTE was executed on the actual dataset to balance it. After balancing, C4.5 was executed again on the balanced dataset.
The performance of a classifier on a binary class problem is measured by means of a confusion matrix [6, 7]. To study the performance of the classifier, the churn telecom data is considered. The classifier is executed on this dataset, and the resulting confusion matrix is shown in Fig. 2.

Confusion matrix of churn telecom dataset.
From the confusion matrix, the overall performance of the classifier is
This measure is reasonable in the context of balanced datasets, as it reveals the overall performance of the classifier on both classes. In the presence of imbalanced datasets, however, the performance of the classifier has to be monitored separately on the minority and majority classes, since it may perform poorly on the minority class. In such cases, it is more suitable to use other performance measures such as Sensitivity and Specificity. Based on the above confusion matrix, the sensitivity and specificity for the churn dataset are
From the above results, it is seen that the classifier performs well on the majority class, with a predictive accuracy of 99.62%, but poorly on the required minority class (predictive accuracy of only 30.30%). To improve the performance of the classifier on such a dataset, the dataset needs to be balanced. To this end, the proposed method EMOTE was executed on the dataset, and the classifier was then executed again on the dataset balanced by EMOTE. The results show that the classifier now performs well on the required minority class as well (predictive accuracy of 99.54%). The same experiments were carried out for all nine datasets, and the results are presented in Table 2.
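The three measures used above follow directly from the four cells of a binary confusion matrix, with the minority class treated as the positive class. The sketch below uses the standard definitions; the counts are hypothetical and are not the churn telecom confusion matrix.

```python
def binary_measures(tp, fn, fp, tn):
    """Accuracy, sensitivity and specificity from confusion-matrix counts."""
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,   # overall fraction correct
        "sensitivity": tp / (tp + fn),   # accuracy on the positive class
        "specificity": tn / (tn + fp),   # accuracy on the negative class
    }


# Hypothetical imbalanced result: 100 positive and 900 negative instances.
measures = binary_measures(tp=30, fn=70, fp=10, tn=890)
for name, value in measures.items():
    print(f"{name}: {value:.4f}")
```

With these counts the overall accuracy stays above 90% even though most positive instances are missed, which is why accuracy alone is not a suitable measure on imbalanced data.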
Performance evaluation of C4.5 with various datasets
The experimental results in Table 2 show that the classifier performs better on the EMOTE-balanced dataset than on the actual imbalanced dataset. The table also reveals that the proposed method significantly lifts the accuracy over the actual dataset, by a minimum of 33% and a maximum of 69% across the various datasets.
In addition to the above comparison, various classification algorithms are tested with the proposed method on the Bupa dataset. The classifiers used in these experiments fall into various categories: lazy learner (K-Star), decision tree (NB Tree and Simple Cart), Bayes (Bayes Net) and ANN model (MLP). In brief, the lazy learner relies on nearest neighbours, decision trees follow a divide-and-conquer strategy, Bayes classifiers follow Bayes theorem and MLP uses the backpropagation technique. The results of executing these classifiers on the actual Bupa dataset are recorded. The Bupa dataset is then balanced with EMOTE, the classifiers are executed on the EMOTE-balanced dataset, and the results are presented in Table 3.
Performance evaluation of various classification algorithms with BUPA datasets
The results show that the classifiers perform better on the EMOTE-balanced dataset than on the actual dataset. The reason is that the classifiers try to improve their performance through their own techniques, without considering the relative distribution of each class in the dataset. EMOTE addresses this issue by balancing the dataset, which lifts the accuracy of the classifiers by 31% to 54% over the actual dataset. In particular, K-Star performs better than the other classifiers on the imbalanced Bupa dataset. Even so, the K-Star results confirm that the dataset formed by the proposed method is a well-balanced one that improves the performance of the classifier.
FDUS + SMOTE is a combination of Fuzzy Distance based Under Sampling (FDUS) and SMOTE [16]. To analyze the performance of FDUS + SMOTE, the BUPA, HABERMAN and PIMA datasets are considered and balanced using the FDUS + SMOTE method. The classifier is then executed on the balanced datasets, and the results are compared with other popular methods such as SMOTE + TOMEK [4], SMOTE + ENN [3], FDUS and SMOTE. The comparative analysis shows that FDUS + SMOTE performed better than the other techniques.
To prove the efficiency of the proposed
To perform the comparative analysis, the datasets are balanced using the proposed method EMOTE. The classifier is then executed on the balanced datasets, and the F-Measure and G-Mean values are calculated from its results. The highlighted values in Table 4 are the values calculated for the proposed method; the rest are from reference [16].
F-Measure values of various data sets (%)
The F-Measure increases as precision and recall increase. A high F-Measure signifies that the classifier performs well on the positive class. From Table 4, it is seen that the F-value of the proposed method is high, which in turn shows that the proportion of correctly classified positive instances is high. In addition, the F-value of the proposed method is higher than that of the other methods. This indicates that the classifier performs better on the dataset balanced by the proposed EMOTE than on the datasets produced by the other methods.
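As a small worked example, the F-Measure can be computed from precision and recall as their harmonic mean. The sketch assumes the balanced form (equal weight on precision and recall), which may differ from the weighting used in the experiments, and the counts are hypothetical.

```python
def f_measure(tp, fp, fn):
    """Harmonic mean of precision and recall for the positive class."""
    precision = tp / (tp + fp)  # fraction of predicted positives that are right
    recall = tp / (tp + fn)     # fraction of actual positives that are found
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: precision 0.75 but recall only 0.30.
f_value = f_measure(tp=30, fp=10, fn=70)
print(f"F-Measure: {f_value:.4f}")
```

Because the harmonic mean is pulled toward the smaller of the two values, a classifier cannot reach a high F-Measure by doing well on precision alone.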
The G-Mean is high when both sensitivity and specificity are high. A high G-Mean therefore indicates that the classifier performs well on both the majority and the minority class. The values in Table 5 show that the G-Mean values of the proposed method are high, i.e., the classifier classifies well on both the minority and the majority class. The table also shows that the G-Mean of the proposed method is higher than that of the other methods. This indicates that the proposed method is more efficient than the other methods in balancing the dataset.
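The behaviour of the G-Mean can be seen in a short sketch that uses its usual definition, the geometric mean of sensitivity and specificity. The inputs are the per-class accuracies quoted earlier for the imbalanced churn data (30.30% and 99.62%), read here as sensitivity and specificity; that reading is an assumption, and the output is an illustration rather than a reported result.

```python
import math


def g_mean(sensitivity, specificity):
    """Geometric mean of the accuracies on the two classes."""
    return math.sqrt(sensitivity * specificity)


# One weak class is enough to keep the G-Mean low.
imbalanced = g_mean(0.3030, 0.9962)
print(f"G-Mean: {imbalanced:.4f}")
```

Unlike overall accuracy, the G-Mean drops to zero whenever either class is classified entirely wrongly, so it rewards balanced performance.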
G-Mean values of various data sets (%)
Receiver Operating Characteristic (ROC) curves are the standard technique for summarizing and visualizing the performance of a classifier as a trade-off between the true positive rate and the false positive rate [25]. In the ROC curve, the Y-axis represents the True Positive Rate (Sensitivity) and the X-axis represents the False Positive Rate (1 - Specificity).
The basic characteristics of an ROC curve are as follows. The point (0, 0) represents a classifier that never predicts the positive class. Conversely, the point (1, 1) represents a classifier that predicts all instances as positive, thus generating the greatest number of false positives. The point (0, 1) is the ideal point, i.e., all positive instances are correctly classified as positive and no negative instances are incorrectly classified as positive.
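These characteristic points can be reproduced by sweeping a decision threshold over classifier scores. The sketch below is a generic construction, not the procedure used to draw the figures in this paper; the function name `roc_points` and the scores are hypothetical.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) pairs obtained by lowering the decision threshold."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    points = [(0.0, 0.0)]  # threshold above every score: nothing is positive
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last instance sharing this score.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / n_neg, tp / n_pos))
    return points


# Hypothetical scores for three positive (1) and three negative (0) instances.
mixed = roc_points([0.9, 0.8, 0.7, 0.6, 0.55, 0.4], [1, 1, 0, 1, 0, 0])
# A perfect ranking passes through the ideal point (0, 1).
perfect = roc_points([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])
print(mixed)
print(perfect)
```

Every curve starts at (0, 0) and ends at (1, 1); what distinguishes classifiers is how closely the path in between approaches (0, 1).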
To perform the ROC analysis of EMOTE, the churn telecom dataset and various classifiers, namely Naive Bayes, Random Forest, AdaBoost, KNN and SVM, are considered. ROC curves are prepared by executing the classifiers first on the actual imbalanced dataset and then on the dataset balanced by the proposed EMOTE. The resulting ROC curves of both executions are shown in Fig. 3.

ROC comparison of various classifiers on churn telecom dataset.
The ROC curves indicate that the performance of the classifiers differs greatly between the actual imbalanced dataset and the EMOTE-balanced dataset. On the actual imbalanced dataset, the number of positive instances correctly classified as positive is low and the number of negative instances incorrectly classified as positive is high. After balancing by EMOTE, in contrast, the classifiers improve the result, with very few misclassifications on both positive and negative instances. The steepness of the curves toward the point (0, 1) for the EMOTE-balanced dataset confirms this. This is significant evidence that, when the relative distribution of the classes in the dataset is imbalanced, the proposed EMOTE has a great impact on the performance of the classifier.
CMTNN + SMOTE is a combination of the Synthetic Minority Over Sampling Technique (SMOTE) and the Complementary Neural Network (CMTNN) for handling the imbalance issue in classification [18]. Pima Indians Diabetes data, Haberman’s Survival data, German credit data and SPECT heart data are the datasets considered for the experiments [18]. K-NN is the classifier applied, and the G-Mean and AUC are recorded from its results. The recorded values were compared with other widely used methods, showing that CMTNN + SMOTE performs better than the other methods.
To analyze the performance of the proposed
G-Mean and AUC values of various Data Sets classified by K-NN (k = 5)
The experimental results show that the proposed EMOTE performs better than the other techniques in terms of G-Mean and AUC. It also improves the performance extensively when compared with the actual imbalanced dataset. Overall, the proposed method performs better than the other widely used methods. The improvement in G-Mean over the actual imbalanced dataset ranges from 9.7% to 47.5%. In addition, compared with the other methods, the improvement in G-Mean ranges from 0.14% to 41.4%. The reason for the effective performance of the proposed method is its misclassification analysis, which improves the quality of the dataset using the nearest neighbours and ultimately eliminates the possible misclassifications on the datasets.
The AUC values of the proposed method also differ greatly from those of the original data and the other techniques. This indicates that, on the EMOTE-balanced dataset, the classifier is very likely to rank a randomly chosen positive instance higher than a randomly chosen negative instance. In particular, the proposed EMOTE outperforms the base method CMTNN on Haberman’s Survival Data. This is because CMTNN removes a large number of misclassified instances from the dataset, so that the remaining instances are not sufficient for the classifier to generalize correctly. The proposed method, in contrast, turns a misclassified instance into a correctly classified one by including its nearest neighbours rather than by removing the misclassified instance. This confirms that the proposed EMOTE is more efficient than the other techniques.
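The ranking interpretation of the AUC used above can be checked directly by counting, over all positive-negative pairs, how often the positive instance receives the higher score. This pairwise form is a standard equivalent of the area under the ROC curve; the function name and the scores below are hypothetical.

```python
def auc_by_ranking(scores, labels):
    """Probability that a random positive outranks a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    # A tie between a positive and a negative score counts as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical scores: one negative (0.7) outranks one positive (0.6).
auc = auc_by_ranking([0.9, 0.8, 0.7, 0.6, 0.55, 0.4], [1, 1, 0, 1, 0, 0])
print(f"AUC: {auc:.4f}")
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a ranking in which every positive instance is scored above every negative instance.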
To compare the performance of EMOTE with the well-known and widely accepted method SMOTE, the PIMA and OIL [12] datasets are considered. As in the SMOTE study, C4.5 is used to prepare the ROC curves. Figure 4 presents the resulting ROC curves for the PIMA dataset. Figure 4(a) and (c) show the ROC curves obtained when C4.5 is executed on the actual and the EMOTE-balanced PIMA dataset, respectively. The AUC values are also used to evaluate the actual and the EMOTE-balanced Pima dataset. The AUC values presented in the respective figures show an improvement from 0.678 to 0.807 with the proposed method EMOTE. This also illustrates that EMOTE balances the dataset well and thereby improves the performance of the classifier.

ROC comparison of Pima Dataset.
Figure 4(b) shows the ROC curve taken from reference [7], which is the result of the C4.5 classifier executed on the SMOTE-balanced PIMA dataset. In an ROC curve, the steeper the curve rises toward the upper left corner, the better the classification. Comparing Fig. 4(b) and (c), the curve in Fig. 4(c) is steeper than the curve in Fig. 4(b). This means that, compared with the SMOTE dataset, more positive instances are correctly classified as positive and fewer negative instances are incorrectly classified as positive in the EMOTE dataset. This indicates that EMOTE performs better than SMOTE in balancing the dataset.
To further highlight the performance of SMOTE, cross-validation of SMOTE and undersampling with C4.5, tuning the majority class, was carried out on the OIL dataset. The results from reference [7] are shown in Table 7 and Fig. 5(a). In Table 7, Acc+ is the accuracy on the positive instances (minority class) and Acc- is the accuracy on the negative instances (majority class). In Fig. 5(a), the X-axis represents the percentage of undersampling applied to the dataset per iteration, and the Y-axis represents the accuracy of the classifier under that undersampling.
Comparison of EMOTE against SMOTE and Under Sampling

(a) and (b): Pictorial representation of comparison of EMOTE against SMOTE and under sampling.
To demonstrate the efficiency of EMOTE against SMOTE, the same OIL dataset was considered. For the comparison, the percentage of oversampling, the accuracy on positive instances (Acc+) and the accuracy on negative instances (Acc-) are calculated by executing the C4.5 classifier on the oil dataset. The results are presented in the highlighted portion of Table 7 and in Fig. 5(b). In Fig. 5(b), the X-axis represents the percentage of oversampling applied to the dataset per iteration, and the Y-axis represents the accuracy of the classifier under that oversampling. The percentage of oversampling in the proposed method depends on the misclassified instances and their nearest neighbours; hence, the X-axis values in Fig. 5(b) are at variable intervals.
From the SMOTE results shown in Table 7, it is seen that Acc+ increases while Acc- decreases from 94.20 to 33.80 as the percentage of undersampling increases. This is because, to balance the dataset, SMOTE with undersampling removes majority class instances randomly from the dataset, starting at 10% and going up to 800%. As a result, the accuracy on the majority class (Acc-) decreases along with the number of instances. In contrast, the highlighted portion of Table 7 shows that, with the proposed method EMOTE, Acc+ increases effectively while Acc- remains stable, varying only from 99.44 to 99.33 rather than from 94.20 to 33.80 as with SMOTE. In addition, the maximum accuracy on the minority class, Acc+ (98%), was attained by SMOTE only when the dataset was undersampled by 600%, whereas the same level of accuracy (98.41%) was achieved by EMOTE by oversampling the dataset by just 2.34%, i.e., by replicating only 22 instances out of 937. This empirical result demonstrates the impact of nearest neighbours on balancing the dataset and consequently confirms the efficiency of EMOTE.
The ROC curves shown in Fig. 6 are the results for the OIL dataset. Fig. 6(a) shows the ROC curve of the actual dataset, Fig. 6(b) shows the ROC curve of the SMOTE-balanced dataset from reference [7], and Fig. 6(c) shows the ROC curve of the EMOTE-balanced dataset. Comparing the curves, the curve in Fig. 6(c), obtained on the EMOTE dataset, is steeper toward the upper left corner than the curves in Fig. 6(a) and (b). In addition, the associated AUC value improved from 0.5146 to 0.981 with EMOTE. All the above comparisons with SMOTE confirm that the proposed EMOTE performs best in balancing the dataset to improve the accuracy of the classifier.

ROC comparison of OIL dataset.
In the process of classification, the imbalanced class distribution is often not considered. Many earlier studies [11, 14] focused on improving classification accuracy but did not consider the problem of imbalanced class distribution. Hence, the classifiers constructed in these studies do not predict minority class samples well. Many real-world applications, such as fraud analysis on bank loans, medical research analysis and telecommunication churn analysis, have the imbalanced class distribution problem. In such cases, it is very difficult to make accurate predictions about the customers or patients who most need to be identified.
In this study, the
Acknowledgments
We express our sincere gratitude to Robert Holte for the support given to us by means of providing the oil spill dataset, which is used in his paper. We also thank the reviewers for their valuable comments and suggestions.
