EMOTE: Enhanced Minority Oversampling TEchnique

Abstract

Research on mining imbalanced datasets has grown rapidly in recent years, both because the problem is challenging and because it arises in many real-world applications. A dataset is said to be imbalanced if the categories of the classification attribute are not evenly represented. A well-balanced dataset is important for a classifier to build a good prediction model. Most existing classifiers tend to perform poorly on imbalanced datasets, because they seek to optimize overall accuracy without considering the relative distribution of each class. Hence, it is essential to balance the dataset before classification. In this paper, the Enhanced Minority Oversampling TEchnique (EMOTE) is proposed to improve the performance of the classifier by balancing the dataset. The key idea of the proposed method is to balance the dataset by turning the misclassified instances of the minority class into correctly classified instances through oversampling their nearest neighbors. To investigate the performance of the proposed method, it is compared with several oversampling and undersampling methods, including the well-known SMOTE (Synthetic Minority Oversampling TEchnique). Various imbalanced datasets from the UCI Machine Learning Repository are used in the experiments. The experimental results show that EMOTE outperformed the other methods in balancing the datasets. The results also show that the classifier performs considerably better on the datasets generated by EMOTE.

Keywords

Imbalanced dataset, classification, nearest neighbor, oversampling, undersampling

1 Introduction

A dataset is said to be imbalanced if the classes are not evenly represented [7]. The imbalance ratio is reported to be 100 to 1 in fraud detection and as high as 100,000 to 1 in other applications [19]. Imbalanced datasets have been studied in domains such as telecommunication management [10], detection of fraudulent telephone calls, text classification [15] and detection of oil spills in satellite images [12]. Recent research has found that machine learning algorithms perform poorly on imbalanced datasets [18]. Therefore, a good technique for balancing the dataset is necessary.

1.1 Classifier and imbalanced dataset

Classification is one of the most widely used machine learning and data mining techniques for mining data and making predictions about the future. A well-built classifier predicts the class to which a new instance belongs. Classification techniques generally presume that the instances are evenly distributed among the classes, and a classifier performs well on datasets that are distributed in this way. Real-world datasets, however, are often imbalanced in their class distribution [7]. The class imbalance problem occurs when the training dataset contains many more instances of one class (the majority class) than of the other (the minority class).

Classifiers built on such imbalanced data perform well on the majority class and very poorly on the minority class [21]. However, in many real cases, the minority class instances are the most important ones for prediction. Real-world applications such as fraud analysis on bank loans, medical research analysis and telecommunication churn analysis all have the imbalanced class distribution problem.

1.2 Weka and IKVM

Weka [27] is a collection of machine learning algorithms for data mining tasks. It also provides visualization tools for data analysis, together with a GUI that gives easy access to these utilities. Weka supports functions such as data pre-processing, association, regression, classification and clustering. It is written in Java, and its functions can be called directly from Java code.

IKVM [28] is a tool for running Java code on .NET. Jeroen Frijters, the Technical Director of Sumatra Software, based in the Netherlands, is the main contributor to IKVM.NET. It allows Java classes to be called from .NET code. Using IKVM, the Weka JAR file can be converted to a Weka DLL in a single compilation step. Thereafter, all Weka classes can be used from .NET applications.

This paper addresses the inefficiency of classifiers on imbalanced datasets. To solve this problem, the Enhanced Minority Oversampling TEchnique (EMOTE) is defined to balance the dataset, which in turn improves the performance of the classifier. In the rest of the paper, the efficiency of the proposed method is evaluated through experiments with different classifiers on different datasets retrieved from the UCI Repository.

The paper is organized as follows: Section 2 reviews the related work on imbalanced datasets. Section 3 describes the proposed method. Section 4 presents experimental results that demonstrate the efficiency of the proposed method, together with a comparative study against other widely accepted methods. Section 5 concludes the study.

2 Related work

Because real-world data suffer from the imbalance problem, researchers have proposed several methods to solve it. These methods address the imbalance in one of two ways: at the algorithm level or at the data level. At the data level, the imbalanced dataset is divided into a minority class and a majority class. A classifier that disregards the minority class is prone to misclassification. To resolve this problem, undersampling and oversampling techniques have been developed.

Piyasak Jeatrakul introduced a method to enhance the accuracy on the minority class [18]. The method combines the Synthetic Minority Over Sampling Technique (SMOTE) and the Complementary Neural Network (CMTNN) to deal with the problem of classifying imbalanced datasets: CMTNN is applied as an undersampling technique and SMOTE as an oversampling technique. To evaluate the method, three classification algorithms (ANN, SVM and k-NN) are tested on the German credit, Pima Indians Diabetes, SPECT heart and Haberman’s Survival datasets. G-mean and AUC are calculated from the classifier results and compared with those of other methods. The comparison showed that the combined CMTNN and SMOTE model performed better than the other methods.

Mostafizur Rahman proposed an improved undersampling technique [17]. In this method, the dataset is first divided into two sets, the minority class and the majority class. The majority class is then split into k clusters (k = 3), and each cluster is combined with the minority class to create k subsets. To evaluate the method, all the combined datasets are classified using fuzzy unordered rule induction and a decision tree algorithm, and the dataset with the highest accuracy is selected for further data mining. The experiments use a cardiovascular dataset, and the results are compared with SMOTE and other undersampling techniques. The comparison showed that the method balances the data well and generates a better training set for the classifier.

Maisarah Zorkeflee presented a new technique to handle imbalanced datasets [16]. The proposed model is a combination of Fuzzy Distance based Under Sampling (FDUS) and SMOTE. The process starts by dividing the dataset into two classes, the majority class (Ai) and the minority class (Bi). Resampling with FDUS is repeated to produce a balanced dataset. If, during this process, Ai becomes smaller than Bi, SMOTE is applied to balance the dataset. To analyze the performance of the method, F-measure and G-mean are calculated for the BUPA, HABERMAN and PIMA datasets. The results are compared with other popular methods, namely SMOTE + TOMEK, SMOTE + ENN, FDUS and SMOTE. The comparative analysis showed that FDUS + SMOTE performed better than the other techniques.

Nitesh V. Chawla et al. proposed the oversampling technique called SMOTE (Synthetic Minority Oversampling Technique) [7]. Their work combines oversampling of the minority class with undersampling of the majority class. For oversampling, the minority class samples are oversampled by creating synthetic examples instead of duplicating the real data. Depending on the amount of oversampling required, neighbours are chosen at random from the k nearest neighbours. The implementation uses five nearest neighbours, from which, for example, two are chosen and a new sample is created in the direction of each. To analyze the efficiency of the technique, various datasets and classifiers such as the C4.5 decision tree, Naive Bayes and Ripper are used. The classifier results show that the method improves over other resampling techniques, which is also confirmed by the AUC and ROC analysis. However, SMOTE creates synthetic minority examples without taking the majority class samples into consideration, and this may cause overgeneralization [17].
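The interpolation step of SMOTE described above can be illustrated with a minimal Python sketch. It is not the reference implementation from [7]: the helper names `k_nearest` and `smote_sample`, the toy minority points and the use of squared Euclidean distance are assumptions made only for illustration.

```python
import random

def k_nearest(x, pool, k):
    """Return the k points of pool closest to x (squared Euclidean distance)."""
    others = [p for p in pool if p is not x]
    return sorted(others, key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)))[:k]

def smote_sample(x, neighbours, rng):
    """Create one synthetic point on the segment between x and a random neighbour."""
    nn = rng.choice(neighbours)
    gap = rng.random()  # uniform in [0, 1)
    return [xi + gap * (ni - xi) for xi, ni in zip(x, nn)]

# Hypothetical minority-class points (two features each).
rng = random.Random(0)
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3], [0.9, 0.7], [1.4, 1.0]]
x = minority[0]
synthetic = smote_sample(x, k_nearest(x, minority, k=5), rng)
print(synthetic)  # a new point lying between x and one of its five neighbours
```

Because the synthetic point is a convex combination of two minority points, it always falls inside the region spanned by the minority class, regardless of where the majority samples lie.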

Babu presented a comparative study of the efficiency of various classification algorithms and the MLP (Multi Layer Perceptron) [23]. The work is performed in two phases. In the first phase, classification models such as C4.5, Naive Bayes, Simple CART and logistic regression are executed on a churn dataset. To evaluate the performance, the overall accuracy, the time taken to build the model and error values such as MSE, RMSE, RAE and RRSE are computed from the classifier results. The results show that C4.5 (J48) performs better than the other classifiers in terms of both overall accuracy and error values. In the second phase, the best model, C4.5, is compared against MLP. The results show that MLP is more accurate than C4.5, but that the time and memory consumed by MLP are much higher. A closer look at this analysis suggests that the difference in accuracy is very small and negligible, whereas the time taken by MLP to build the model is far greater than that of C4.5. On this basis it is concluded that C4.5 performs better than the other classifiers.

3 Materials and methods

In this section, the comprehensive Enhanced Minority Oversampling TEchnique (EMOTE) is proposed in order to balance the dataset and to improve the performance of the classifier.

The key idea of the proposed method is: “Populating the actual dataset with each minority class misclassified instance, along with its minority class nearest neighbors from the correctly classified instances”. The proposed technique is illustrated in Fig. 1.

Fig.1

Enhanced Minority Oversampling TEchnique (EMOTE).

3.1 EMOTE: Enhanced Minority Oversampling TEchnique

The flow starts by taking the imbalanced dataset (Ai) as input. The method takes one additional parameter, the number of nearest neighbours (k). After the class attribute is assigned, the classifier is built on the imbalanced dataset.

As the first task, performance measures such as accuracy, sensitivity and specificity are calculated from the classifier results, and the minority class is identified from these measures. To improve the accuracy on the minority class, the instances of the actual dataset (Ai) are categorized into correctly classified instances and misclassified instances. Two datasets are formed accordingly: CCi, which contains a copy of the correctly classified instances, and Mi, which contains a copy of the misclassified instances. This process is shown in Steps 5 to 12.

As the next task, each misclassified instance of the minority class is considered for tuning. To convert a misclassified instance into a correctly classified one, its nearest neighbours are first retrieved from the dataset CCi. From the retrieved neighbours, only the minority class instances are selected, and these are added to the actual dataset Ai together with the associated misclassified instance. This process is shown in Steps 13 to 23 and is repeated for every misclassified minority class instance.

As the final task, the classifier is rebuilt on the improved dataset. The first and second tasks are repeated as long as the classifier accuracy continues to improve.
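The loop described above can be summarized in a short Python sketch. It is a loose illustration under several assumptions, not the authors' C# implementation: a nearest-class-mean classifier stands in for C4.5, the minority class is taken to be the least frequent label, accuracy is measured on the original instances, and all function names and the toy data are hypothetical.

```python
from collections import Counter

def dist2(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def fit(data):
    """Stand-in classifier (NOT C4.5): store the mean vector of each class."""
    sums, counts = {}, Counter()
    for x, y in data:
        counts[y] += 1
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
    return {y: [v / counts[y] for v in s] for y, s in sums.items()}

def predict(model, x):
    """Predict the class whose mean vector is closest to x."""
    return min(model, key=lambda y: dist2(model[y], x))

def minority_accuracy(model, data, minority):
    """Fraction of minority-class rows in data that the model labels correctly."""
    rows = [(x, y) for x, y in data if y == minority]
    return sum(predict(model, x) == y for x, y in rows) / len(rows)

def emote_sketch(original, k=3, max_rounds=10):
    # Assumption: the minority class is simply the least frequent label.
    minority = min(Counter(y for _, y in original).items(), key=lambda kv: kv[1])[0]
    data = list(original)
    model = fit(data)
    best = minority_accuracy(model, original, minority)
    for _ in range(max_rounds):
        correct = [(x, y) for x, y in data if predict(model, x) == y]  # CCi
        missed = [(x, y) for x, y in data if predict(model, x) != y]   # Mi
        additions = []
        for x, y in missed:
            if y != minority:
                continue
            # k nearest neighbours from CCi; keep only the minority-class ones.
            neighbours = sorted(correct, key=lambda r: dist2(r[0], x))[:k]
            additions.append((x, y))
            additions.extend(r for r in neighbours if r[1] == minority)
        if not additions:
            break
        candidate = data + additions
        new_model = fit(candidate)
        score = minority_accuracy(new_model, original, minority)
        if score <= best:  # stop once accuracy no longer improves
            break
        data, model, best = candidate, new_model, score
    return data, model

# Hypothetical one-feature dataset: label 0 is the majority, label 1 the minority.
original = [([0.0], 0), ([1.0], 0), ([2.0], 0), ([2.5], 0),
            ([3.0], 1), ([4.0], 1), ([9.0], 1)]
balanced, model = emote_sketch(original, k=3)
print(len(original), len(balanced))  # 7 9
```

In this toy run the minority instance at 3.0 is initially misclassified; adding a copy of it together with its correctly classified minority neighbour at 4.0 shifts the minority class mean enough for the rebuilt classifier to label it correctly.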

4 Experiments and results

To test the performance of the proposed method, the algorithm was implemented in C#. The WEKA tool was used for the data mining process: using IKVM, the file WEKA.jar was converted to WEKA.dll so that the WEKA classes could be used from C#. The classifier is built with the C4.5 algorithm [22] (J48 in WEKA). Eight datasets from the UCI Machine Learning Repository [2], namely Pima Indian Diabetes, Haberman’s Survival, Spect (Heart), German Credit, Churn Telecom, Bupa, Adult and Mammography, and one dataset named the Oil dataset provided by Robert Holte [12], are used in the experiments. The Oil dataset was personally requested and included because the well-known SMOTE method was evaluated on it. These nine datasets were selected because they exhibit various ratios of imbalance between the majority and minority classes. Their characteristics are shown in Table 1.

Table 1
Characteristics of data sets used in the experiments

Data Sets	No. of Instances	Minority Class %	Majority Class %
Pima Diabetes	768	34.90	65.10
Haberman’s Survival	306	26.47	73.53
Spect (Heart)	267	41.20	58.80
German Credit	1000	30.00	70.00
Churn Telecom	2416	2.73	97.27
Bupa	345	42.03	57.97
Oil Spill	937	4.38	95.62
Adult	32561	24.08	75.92
Mammography	1407	36.74	63.97

Ten-fold cross-validation is used, so that in each fold 90% of the data forms the training set and 10% the test set. To evaluate the efficiency of the proposed method, the following tests and comparisons with existing methods have been carried out:

Performance of C4.5 Classification Algorithm on various actual imbalanced and EMOTE balanced dataset.

Performance of Various Classification Algorithms on actual imbalanced Bupa dataset and EMOTE balanced Bupa dataset.

Performance Comparison of EMOTE with FDUS + SMOTE

ROC Performance Evaluation of EMOTE

AUC Performance Comparison of EMOTE with CMTNN + SMOTE

Performance Comparison of EMOTE with SMOTE
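The ten-fold cross-validation split mentioned at the start of this section can be sketched in Python as follows. This is a plain, unstratified split written only for illustration; the function name `k_fold_indices` and the seed are assumptions, and the experiments themselves rely on WEKA rather than on this code.

```python
import random

def k_fold_indices(n, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k disjoint folds of near-equal size
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

# 768 is the number of instances in the Pima Diabetes dataset (Table 1).
splits = list(k_fold_indices(768))
for train, test in splits[:2]:
    print(len(train), len(test))
```

Each instance appears in exactly one test fold, so every fold trains on roughly 90% of the data and tests on the remaining 10%. A stratified variant would additionally preserve the class ratio within each fold, which matters for strongly imbalanced data.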

4.1 Performance of C4.5 on EMOTE dataset

To evaluate the performance of the proposed EMOTE, the C4.5 algorithm was first executed on the actual imbalanced dataset using the Weka tool, and the results were recorded. EMOTE was then executed on the actual dataset to balance it, and C4.5 was executed again on the balanced dataset.

The performance of a classifier on a binary class problem is measured by means of a confusion matrix [6, 7]. To study the performance of the classifier, the churn telecom data is considered. The classifier is executed on this dataset, and the resulting confusion matrix is shown in Fig. 2.

Fig.2

Confusion matrix of churn telecom dataset.

From the confusion matrix, the overall performance of the classifier is $\begin{matrix} \text{Accuracy} & = & (20 + 2341) / (20 + 46 + 9 + 2341) \\ & = & 97.72\% \end{matrix}$

This measure is reasonable for balanced datasets, as it reflects the overall performance of the classifier on both classes. With imbalanced datasets, however, the performance of the classifier must be monitored separately on the minority and majority classes, since it may perform poorly on the minority class. In such cases, it is more suitable to use other performance measures such as sensitivity and specificity. Based on the above confusion matrix, the sensitivity and specificity for the churn dataset are $\begin{matrix} \text{Sensitivity (SN)} & = & 20 / (20 + 46) = 30.30\% \\ \text{Specificity (SP)} & = & 2341 / (2341 + 9) = 99.62\% \end{matrix}$
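As a quick check of the arithmetic, the three measures can be recomputed from the four cells of the churn confusion matrix with a few lines of Python. The function name and the argument names (`tp`, `fn`, `fp`, `tn`) are illustrative choices; the cell values are the ones used in the formulas above, with the minority class treated as positive.

```python
def confusion_metrics(tp, fn, fp, tn):
    """Return (accuracy, sensitivity, specificity) for a binary confusion matrix."""
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    sensitivity = tp / (tp + fn)  # accuracy on the positive (minority) class
    specificity = tn / (tn + fp)  # accuracy on the negative (majority) class
    return accuracy, sensitivity, specificity

acc, sn, sp = confusion_metrics(tp=20, fn=46, fp=9, tn=2341)
print(f"{acc:.2%} {sn:.2%} {sp:.2%}")  # 97.72% 30.30% 99.62%
```

The gap between the 97.72% accuracy and the 30.30% sensitivity is exactly the effect of imbalance: the 2350 majority instances dominate the accuracy figure.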

These results show that the classifier performs well on the majority class, with a predictive accuracy of 99.62%, but poorly on the minority class of interest, where the predictive accuracy is only 30.30%. To improve the performance of the classifier on such a dataset, the dataset needs to be balanced. The proposed method EMOTE was therefore executed on the dataset, and the classifier was executed again on the dataset balanced by EMOTE. The results show that the classifier now performs well on the minority class too, with a predictive accuracy of 99.54%. The same experiments were carried out for all nine datasets, and the results are presented in Table 2.

Table 2

Performance evaluation of C4.5 with various datasets

Data sets	Accuracy of Classifier
	Actual Dataset %	EMOTE Dataset %
Pima Diabetes	55.97	98.97
Haberman’s Survival	43.37	98.31
Spect (Heart)	59.09	93.00
German Credit	44.67	95.53
Churn Telecom	30.30	99.54
Bupa	53.79	93.46
Oil Spill	31.71	96.03
Adult	63.56	97.92
Mammography	74.08	97.85

The experimental results in Table 2 show that the classifier performs better on the EMOTE balanced dataset than on the actual imbalanced dataset. The table also shows that the proposed method lifts the accuracy over the actual dataset by a minimum of 33% to a maximum of 69% across the datasets.

4.2 Performance of various classification algorithms on EMOTE dataset

In addition to the above comparison, various classification algorithms are tested with the proposed method on the Bupa dataset. The classifiers used in these experiments fall into several categories: lazy learner (K-Star), decision tree (NB Tree and Simple Cart), Bayes (Bayes Net) and ANN model (MLP). In brief, the lazy learner classifies using nearest neighbours, the decision trees follow a divide and conquer strategy, the Bayes model follows Bayes' theorem, and MLP uses the back propagation technique. The results of these classifiers on the actual Bupa dataset are recorded. EMOTE is then applied to balance the Bupa dataset, the classifiers are executed on the EMOTE balanced dataset, and the results are presented in Table 3.

Table 3
Performance evaluation of various classification algorithms with BUPA datasets

Classifier	Accuracy of the Classifier
	Actual Dataset %	EMOTE Dataset %	Lifts By %
K-Star	62.75	99.45	36.70
NB Tree	43.45	90.89	47.44
Simple Cart	51.03	91.82	40.79
Bayes Net	33.33	86.95	53.62
MLP	52.41	83.41	31.00

The results of the various classification algorithms show that the classifiers perform better on the EMOTE dataset than on the actual dataset. The reason is that classifiers try to improve their performance through their own techniques, without considering the relative distribution of each class in the dataset. EMOTE addresses this issue by balancing the dataset, which allows the classifiers to lift their accuracy by 31% to 54% over the actual dataset. K-Star in particular already performs better than the other classifiers on the imbalanced Bupa dataset; even so, its accuracy improves further on the EMOTE dataset, which indicates that the dataset formed by the proposed method is well balanced.

4.3 Performance comparison of EMOTE with FDUS + SMOTE

FDUS + SMOTE is a combination of Fuzzy Distance based Under Sampling (FDUS) and SMOTE [16]. To analyze the performance of FDUS + SMOTE, the BUPA, HABERMAN and PIMA datasets were balanced using the FDUS + SMOTE method. The classifier was then executed on the balanced datasets, and the results were compared with other popular methods, namely SMOTE + TOMEK [4], SMOTE + ENN [3], FDUS and SMOTE. The comparative analysis showed that FDUS + SMOTE performed better than the other techniques.

To assess the efficiency of the proposed EMOTE against FDUS + SMOTE, the same datasets are used for classification, and F-Measure and G-Mean are used for the comparative analysis. F-Measure is the harmonic mean of precision and recall, where precision is the fraction of retrieved instances that are relevant and recall is the fraction of relevant instances that are retrieved. G-Mean is the geometric mean of sensitivity and specificity and therefore reflects the ability of the classifier on both classes. It is high only when both sensitivity and specificity are high.
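Both measures follow directly from the cells of a confusion matrix, as the following Python sketch shows. The function names are illustrative, and the example reuses the churn confusion matrix values from Section 4.1 (20, 46, 9 and 2341) purely as sample input; these are not values from Table 4 or Table 5.

```python
import math

def f_measure(tp, fp, fn):
    """Harmonic mean of precision and recall for the positive class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def g_mean(tp, fn, fp, tn):
    """Geometric mean of sensitivity and specificity."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return math.sqrt(sensitivity * specificity)

f = f_measure(tp=20, fp=9, fn=46)
g = g_mean(tp=20, fn=46, fp=9, tn=2341)
print(round(f, 4), round(g, 4))
```

On this imbalanced example both measures stay far below the 97.72% overall accuracy, because each is pulled down by the weak recall on the minority class.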

To perform the comparative analysis, the datasets are balanced using the proposed method EMOTE. The classifier is then executed on the balanced datasets, and the F-Measure and G-Mean values are calculated from its results. The values for the proposed method (the EMOTE row of Table 4) are computed in this work; the rest are taken from reference [16].

Table 4
F-Measure values of various data sets (%)

Techniques	Datasets
	Bupa	Haberman	Pima
EMOTE Proposed	91.2	86.04	87.73
FDUS + SMOTE	79.69	85.71	81.59
SMOTE + Tomek	61.02	63.27	59.37
SMOTE + ENN	71.43	79.29	66.31
FDUS	88.41	80	60.09
SMOTE	40	66.02	58.33

The F-Measure increases as precision and recall increase. A high F-Measure signifies that the classifier performs well on the positive class. Table 4 shows that the F-Measure of the proposed method is high, which in turn shows that the proportion of correctly classified positive instances is high. In addition, the F-Measure of the proposed method is higher than that of the other methods. This indicates that the classifier performs better on the dataset balanced by the proposed EMOTE than on the datasets produced by the other methods.

The G-Mean is high when both sensitivity and specificity are high, so a high G-Mean indicates that the classifier performs well on both the majority and minority classes. Table 5 shows that the G-Mean values of the proposed method are high, which means that the classifier classifies both the minority and majority classes well. The table also shows that the G-Mean of the proposed method is higher than that of the other methods. This indicates that the proposed method is more efficient than the other methods in balancing the dataset.

Table 5

G-Mean values of various data sets (%)

Techniques	Datasets
	Bupa	Haberman	Pima
EMOTE Proposed	92.01	87.61	88.68
FDUS + SMOTE	80.7	85.11	65.44
SMOTE + Tomek	71.34	72.28	61.58
SMOTE + ENN	72.88	73.36	69.97
FDUS	79.01	69.28	62.32
SMOTE	52.52	74.47	62.71

4.4 ROC analysis of the proposed model

Receiver Operating Characteristic (ROC) curves are the standard technique for summarizing and visualizing the performance of a classifier as a trade-off between true positive and false positive rates [25]. In the ROC curve, the Y-axis represents the True Positive Rate (sensitivity) and the X-axis represents the False Positive Rate (1 − specificity).

The basic characteristics of an ROC curve are as follows. The point (0, 0) represents a classifier that never predicts the positive class. At the opposite extreme, the point (1, 1) represents a classifier that predicts all instances as positive, thus generating the greatest number of false positives. The point (0, 1) is the ideal point: all positive instances are correctly classified as positive and no negative instances are incorrectly classified as positive.
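The way an ROC curve moves from (0, 0) to (1, 1) can be made concrete with a small Python sketch that lowers the decision threshold one instance at a time. The scores and labels below are made-up values, the function names are hypothetical, and the sketch assumes that all scores are distinct.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points obtained by lowering the decision threshold."""
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    points, tp, fp = [(0.0, 0.0)], 0, 0
    for _, y in ranked:
        if y == 1:
            tp += 1  # one more positive correctly predicted as positive
        else:
            fp += 1  # one more negative wrongly predicted as positive
        points.append((fp / neg, tp / pos))
    return points

def area_under(points):
    """Trapezoidal area under a list of (x, y) points sorted by x."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Hypothetical classifier scores; label 1 marks the positive (minority) class.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]
pts = roc_points(scores, labels)
print(pts[0], pts[-1], round(area_under(pts), 4))  # (0.0, 0.0) (1.0, 1.0) 0.8889
```

A classifier whose curve passes closer to the ideal point (0, 1) encloses a larger area, which is why the area under the curve is used as a single summary value.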

To perform the ROC analysis of EMOTE, the churn telecom dataset and various classifiers, namely Naive Bayes, Random Forest, Ada Boost, KNN and SVM, are considered. ROC curves are prepared first by executing the classifiers on the actual imbalanced dataset and then on the dataset balanced by the proposed EMOTE. The resulting ROC curves of both executions are shown in Fig. 3.

Fig.3

ROC comparison of various classifiers on churn telecom dataset.

The ROC curves indicate that the performance of the classifiers differs greatly between the actual imbalanced dataset and the EMOTE balanced dataset. On the actual imbalanced dataset, the number of positive instances correctly classified as positive is low and the number of negative instances incorrectly classified as positive is high. After balancing by EMOTE, the classifiers improve, with very few misclassifications of either positive or negative instances. The steepness of the curves towards the point (0, 1) on the EMOTE dataset confirms this. This is significant evidence that the more imbalanced the class distribution of a dataset, the greater the impact of the proposed EMOTE on the performance of the classifier.

4.5 AUC performance comparison of EMOTE with CMTNN + SMOTE

CMTNN + SMOTE combines the Synthetic Minority Over Sampling Technique (SMOTE) and the Complementary Neural Network (CMTNN) to handle the imbalance issue in classification [18]. The Pima Indians Diabetes, Haberman’s Survival, German credit and SPECT heart datasets are considered in the experiments of [18]. K-NN is the classifier applied, and G-Mean and AUC are recorded from its results. The recorded values were compared with other widely used methods and showed that CMTNN + SMOTE performs better than those methods.

To analyze the performance of the proposed EMOTE in comparison with CMTNN + SMOTE, the same datasets are considered. After balancing the datasets using EMOTE, the k-NN classifier was executed on the balanced datasets, and AUC and G-Mean were calculated. The calculated values are presented in Table 6. The values in the Proposed EMOTE row of Table 6 are computed in this work; the rest are the results from reference [18].

Table 6
G-Mean and AUC values of various Data Sets classified by K-NN (k = 5)

Techniques	Pima Indian Diabetes Data		German Credit Data		Haberman’s Survival Data		SPECT Heart Data	
	GM	AUC	GM	AUC	GM	AUC	GM	AUC
Actual Data	65.27	0.7665	59.35	0.7483	40.11	0.5741	68	0.8121
Proposed EMOTE	94.86	0.8572	92.69	0.9071	87.61	0.8584	77.7	0.8423
ENN	71.15	0.7817	64.4	0.7566	46.47	0.5915	77.56	0.8369
Tomek Links	72.06	0.7865	67.42	0.7625	47.57	0.5918	74.1	0.8148
SMOTE	71.78	0.7742	68.69	0.7518	55.82	0.5836	74.2	0.8005
CMTNN Technique I (Majority) + SMOTE	72.11	0.7938	69.32	0.7572	56.28	0.5927	74.64	0.8264
CMTNN Technique II (Majority) + SMOTE	73.17	0.7956	69.94	0.7686	57.5	0.605	74.53	0.803
SMOTE + CMTNN Technique I	73.95	0.8104	72.35	0.7785	56.39	0.6226	74.13	0.8121
SMOTE + CMTNN Technique II	73.42	0.8058	71.21	0.7719	59.3	0.6302	75.3	0.8179

The experimental results show that the proposed EMOTE performs better than the other techniques in terms of both G-Mean and AUC, and that it improves performance considerably compared with the actual imbalanced dataset. The proposed method improves the G-Mean over the actual imbalanced datasets by 9.7% to 47.5%. In addition, compared with the other methods, the G-Mean of the proposed method is higher by 0.14% to 41.4%. The reason for the effective performance of the proposed method is its misclassification analysis, which improves the quality of the dataset using the nearest neighbours and thereby reduces the possible misclassifications on the datasets.

The AUC values of the proposed method also differ considerably from those of the original data and the other techniques. This in turn indicates that, on the EMOTE dataset, the classifier is more likely to rank a randomly chosen positive instance higher than a randomly chosen negative instance. In particular, the proposed EMOTE outperforms the base method CMTNN on Haberman’s Survival Data. This is because CMTNN removes a large number of misclassified instances from the dataset, so that the remaining instances are not sufficient for the classifier to generalize correctly. The proposed method, in contrast, turns a misclassified instance into a correctly classified one by adding its nearest neighbours rather than by removing it. This supports the claim that the proposed EMOTE is more efficient than the other techniques.
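The ranking interpretation of AUC used above can be computed directly by comparing every positive instance with every negative instance, as in the following Python sketch. The function name, the scores and the labels are illustrative assumptions; ties are counted as half a win.

```python
def auc_by_ranking(scores, labels):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical classifier scores; label 1 marks the positive (minority) class.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]
print(round(auc_by_ranking(scores, labels), 4))  # 0.8889
```

Here eight of the nine positive-negative pairs are ordered correctly, giving an AUC of 8/9. A value of 0.5 would correspond to random ranking and 1.0 to a perfect ranking.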

4.6 Performance comparison of EMOTE with SMOTE

SMOTE combines oversampling of the minority class with synthetic instances and undersampling of the majority class. To demonstrate the efficiency of SMOTE, analysis was carried out on different datasets using three machine learning algorithms: C4.5, Ripper and Naive Bayes. ROC analysis was used to evaluate the performance of SMOTE, with the ROC curve prepared first by oversampling the minority class with synthetic examples and then by undersampling the majority class. This analysis showed that SMOTE datasets are well balanced.

To compare EMOTE with the well-known and widely accepted SMOTE, the PIMA and OIL [12] datasets are considered. As with SMOTE, C4.5 is used to prepare the ROC curves. Figure 4 presents the resulting ROC curves for the PIMA dataset. Figure 4(a) and (c) show the ROC curves obtained when C4.5 is executed on the actual and the EMOTE PIMA datasets respectively. The AUC values are also used to evaluate the actual and EMOTE Pima datasets. The AUC values given in the respective figures show an improvement from 0.678 to 0.807 with the proposed method EMOTE. This also illustrates that EMOTE balances the dataset well and thereby improves the performance of the classifier.

Fig.4

ROC comparison of Pima Dataset.

Figure 4(b) reproduces the ROC curve from reference [7], which was obtained by executing the classifier C4.5 on the SMOTE PIMA dataset. In an ROC curve, the more steeply the curve rises towards the upper left corner, the better the classification. The ROC curves in Fig. 4(b) and (c) reveal that the curve in Fig. 4(c) is steeper than the curve in Fig. 4(b). This means that, compared with the SMOTE dataset, more positive instances are correctly classified as positive and fewer negative instances are incorrectly classified as positive on the EMOTE dataset. This indicates that EMOTE performs better than SMOTE in balancing the dataset.
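To make the reading of an ROC curve more concrete, the following sketch computes the points of a curve by lowering the decision threshold one instance at a time. The labels and scores are hypothetical, and the sketch assumes that all scores are distinct.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) points obtained by
    lowering the decision threshold one instance at a time.
    Assumes labels are 1 (positive) or 0 (negative) and scores are distinct."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    tp = fp = 0
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    for _score, label in ranked:
        if label == 1:
            tp += 1   # one more positive correctly classified as positive
        else:
            fp += 1   # one more negative incorrectly classified as positive
        points.append((fp / n_neg, tp / n_pos))
    return points


# Hypothetical labels and classifier scores, for illustration only
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2]
curve = roc_points(labels, scores)
print(curve)
```

A curve that reaches a high true positive rate while the false positive rate is still low passes close to the upper left corner, which is the behaviour described above for the better classifier.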

To further examine the performance of SMOTE, the cross-validation results of SMOTE combined with under sampling of the majority class, obtained with C4.5 on the OIL dataset, are considered. These results, taken from reference [7], are shown in Table 7 and Fig. 5(a). In Table 7, Acc+ is the accuracy on the positive instances (minority class) and Acc- is the accuracy on the negative instances (majority class). In Fig. 5(a), the X-axis represents the percentage of under sampling applied to the dataset per iteration and the Y-axis represents the resulting accuracy of the classifier.
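The two per-class accuracies can be computed directly from the actual and predicted labels, as in the following sketch. The function name and the label vectors are hypothetical and are not taken from the OIL dataset.

```python
def class_accuracies(actual, predicted, positive=1):
    """Return (Acc+, Acc-): the accuracy on the positive (minority) class
    and the accuracy on the negative (majority) class."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    tn = sum(1 for a, p in pairs if a != positive and p != positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    acc_pos = tp / (tp + fn)   # share of positive instances classified correctly
    acc_neg = tn / (tn + fp)   # share of negative instances classified correctly
    return acc_pos, acc_neg


# Hypothetical labels: 3 positive and 7 negative instances
actual = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]
acc_pos, acc_neg = class_accuracies(actual, predicted)
print(round(100 * acc_pos, 2), round(100 * acc_neg, 2))
```

Reporting the two values separately avoids the situation in which a high overall accuracy hides poor performance on the minority class.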

Table 7

Comparison of EMOTE against SMOTE and Under Sampling

SMOTE and Under Sampling			Proposed EMOTE
Under sampling (%)	Acc+ (%)	Acc- (%)	Oversampling (%)	Acc+ (%)	Acc- (%)
10	64.7	94.2	–	68.29	99.44
15	62.8	91.3
25	64	89.1	2.34	98.41	99.44
50	89.5	78.9
75	83.7	73	0.2	95.38	99.66
100	78.3	68.7
125	84.2	68.1	0.72	83.33	99.55
150	83.3	57.8
175	85	57.8	1.65	98.86	99.1
200	81.7	56.7	0.2	93.33	99.44
300	89	55
400	95.5	44.2	1.42	97.11	99.21
500	98	35.5
600	98	40	1.3	97.43	99.33
700	96	32.8	1.18	95.34	99.44
800	90.7	33.8	2.14	100	99.33

Fig.5

(a) and (b): Pictorial representation of comparison of EMOTE against SMOTE and under sampling.

To assess the efficiency of EMOTE against SMOTE, the same OIL dataset is considered. For the comparison, the percentage of oversampling, the accuracy on positive instances (Acc+) and the accuracy on negative instances (Acc-) are calculated by executing the classifier C4.5 on the OIL dataset. The results are presented in the Proposed EMOTE columns of Table 7 and in Fig. 5(b). In Fig. 5(b), the X-axis represents the percentage of oversampling applied to the dataset per iteration and the Y-axis represents the resulting accuracy of the classifier. The percentage of oversampling in the proposed method depends on the misclassified instances and their nearest neighbours; hence, the X-axis values in Fig. 5(b) are not evenly spaced.
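The dependence of the oversampling percentage on the misclassified instances can be illustrated with the following minimal sketch. It assumes that, for each misclassified minority instance, the nearest other minority instance (by Euclidean distance) is replicated, and that the percentage is the number of added instances relative to the original dataset size. The function names, the distance measure and the data are assumptions made for this illustration and should not be read as the exact EMOTE procedure.

```python
def nearest_minority_neighbour(i, X, y, minority):
    """Index of the minority instance closest to X[i], excluding i itself."""
    best, best_dist = None, float("inf")
    for j, (xj, yj) in enumerate(zip(X, y)):
        if j == i or yj != minority:
            continue
        dist = sum((a - b) ** 2 for a, b in zip(X[i], xj))
        if dist < best_dist:
            best, best_dist = j, dist
    return best


def emote_like_oversample(X, y, predicted, minority=1):
    """Replicate the nearest minority neighbour of every misclassified
    minority instance (an assumed, simplified reading of the idea)."""
    new_X, new_y = list(X), list(y)
    added = 0
    for i, (yi, pi) in enumerate(zip(y, predicted)):
        if yi == minority and pi != minority:
            j = nearest_minority_neighbour(i, X, y, minority)
            if j is not None:
                new_X.append(X[j])
                new_y.append(minority)
                added += 1
    oversampling_pct = 100.0 * added / len(X)
    return new_X, new_y, oversampling_pct


# Hypothetical data: 3 minority (label 1) and 5 majority (label 0) instances
X = [(0.0, 0.0), (0.2, 0.1), (1.0, 1.0), (5.0, 5.0),
     (5.1, 5.0), (4.9, 5.2), (5.0, 4.8), (0.9, 1.2)]
y = [1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 0, 0, 0, 0, 0]   # the instance at index 2 is misclassified
new_X, new_y, pct = emote_like_oversample(X, y, predicted)
print(len(new_X), pct)
```

Since the number of added instances follows from the classifier's errors rather than from a preset rate, the resulting percentages are irregular, which is consistent with the uneven X-axis spacing noted above.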

From the SMOTE results shown in Table 7, it can be seen that, as the percentage of under sampling increases, Acc+ tends to increase while Acc- decreases from 94.20 to 33.80. This is because, to balance the dataset, SMOTE randomly removes majority class instances from the dataset, with the under sampling percentage ranging from 10% to 800%. As a result, the accuracy on the majority class (Acc-) falls as instances are removed. In contrast, the Proposed EMOTE columns of Table 7 show that Acc+ increases while Acc- remains stable: it moves only from 99.44 to 99.33, whereas with SMOTE it drops from 94.20 to 33.80. In addition, the maximum accuracy on the minority class, Acc+ (98%), was attained by SMOTE only when the dataset was under sampled by 500% to 600%, whereas a comparable level of accuracy (98.41%) was achieved by EMOTE just by oversampling the dataset by 2.34%, i.e., only by replicating 22 instances out of 937 instances. This empirical result demonstrates the impact of nearest neighbours on balancing the dataset and consequently supports the efficiency of EMOTE.
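The stability of Acc- can be checked directly from the values in Table 7. The short sketch below copies the Acc- columns from the table and compares their ranges; the variable names are chosen for this illustration only.

```python
# Acc- (%) values copied from Table 7
smote_acc_neg = [94.2, 91.3, 89.1, 78.9, 73.0, 68.7, 68.1, 57.8,
                 57.8, 56.7, 55.0, 44.2, 35.5, 40.0, 32.8, 33.8]
emote_acc_neg = [99.44, 99.44, 99.66, 99.55, 99.1,
                 99.44, 99.21, 99.33, 99.44, 99.33]


def spread(values):
    """Difference between the largest and the smallest value."""
    return max(values) - min(values)


smote_spread = spread(smote_acc_neg)
emote_spread = spread(emote_acc_neg)
print(round(smote_spread, 2), round(emote_spread, 2))
```

Over all rows of the table, Acc- spans roughly 61 percentage points for SMOTE with under sampling and well under one percentage point for EMOTE.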

The ROC curves shown in Fig. 6 are the results for the OIL dataset: Fig. 6(a) shows the ROC curve of the actual dataset, Fig. 6(b) the ROC curve of the SMOTE-balanced dataset from reference [7], and Fig. 6(c) the ROC curve of the EMOTE-balanced dataset. Comparing the curves, the curve in Fig. 6(c), obtained on the EMOTE dataset, rises more steeply towards the upper left corner than the curves in Fig. 6(a) and (b). In addition, the associated AUC value improved from 0.5146 to 0.981 with EMOTE. Taken together, these comparisons with SMOTE indicate that the proposed EMOTE performs best in balancing the dataset to improve the accuracy of the classifier.

Fig.6

ROC comparison of OIL dataset.

5 Conclusion

In the process of classification, the cause of imbalanced class distribution is often not considered. Many earlier studies [11, 14] focused on improving classification accuracy but did not consider the problem of imbalanced class distribution. Hence, the classifiers constructed in these studies do not perform well in predicting minority class samples. Many real-world applications, such as fraud analysis on bank loans, medical research analysis and telecommunication churn analysis, have the imbalanced class distribution problem. In such cases, it is very difficult to make accurate predictions about the customers or patients who need to be identified.

In this study, the Enhanced Minority Oversampling TEchnique (EMOTE) was proposed to solve the problem of imbalanced class distributions in a dataset. Experiments were performed on nine different datasets using the machine learning algorithm C4.5 and others. The results of the classifiers were calculated in terms of measures such as F-Measure, G-Mean and AUC and compared against various widely accepted methods. The analysis shows that the proposed EMOTE generates a relatively balanced dataset without any loss of information and without adding a large number of instances. Thus, a classifier trained on the dataset generated by EMOTE achieves better classification accuracy than on the original imbalanced dataset. EMOTE not only increases the accuracy of prediction on minority class samples but is also more stable on the majority class than the other methods. In summary, the experiments show that the proposed method EMOTE outperforms the other broadly accepted methods. Hence, it is concluded that the nearest neighbours of the misclassified instances play a vital role in tuning them into correctly classified instances, and that the proposed method is particularly valuable for datasets in which the class attributes are not evenly distributed.
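For reference, the two threshold-based measures named above can be computed from a confusion matrix as sketched below. The sketch uses the commonly used definitions (F-Measure as the harmonic mean of precision and recall, G-Mean as the geometric mean of the true positive and true negative rates), which are assumed here to match the ones used in the experiments; the counts are hypothetical.

```python
import math


def f_measure(tp, fp, fn):
    """Harmonic mean of precision and recall on the positive class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def g_mean(tp, fn, tn, fp):
    """Geometric mean of the true positive rate and the true negative rate."""
    tpr = tp / (tp + fn)
    tnr = tn / (tn + fp)
    return math.sqrt(tpr * tnr)


# Hypothetical confusion matrix: 10 positive and 90 negative instances
tp, fn, fp, tn = 8, 2, 4, 86
f_value = f_measure(tp, fp, fn)
g_value = g_mean(tp, fn, tn, fp)
print(round(f_value, 4), round(g_value, 4))
```

Both measures become small when the minority class is poorly predicted, even if the overall accuracy is high, which is why they are preferred for imbalanced datasets.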

Acknowledgments

We express our sincere gratitude to Robert Holte for supporting us by providing the oil spill dataset used in his paper. We also thank the reviewers for their valuable comments and suggestions.

References

1. Hassanat A.B., Abbadi M.A. and Altarawneh G.A., Solving the problem of the K parameter in the KNN classifier using an ensemble learning approach, International Journal of Computer Science and Information Security (IJCSIS) 12(8) (2014), 33–39.

2. Asuncion and Newman D.J., UCI Machine Learning Repository, University of California, Irvine, School of Information and Computer Sciences, 2007.

3. Batista G.E., Prati R.C. and Monard M.C., A study of the behavior of several methods for balancing machine learning training data, ACM SIGKDD Explorations Newsletter 6(1) (2004), 20–29.

4. Batista G.E., Bazzan A.L. and Monard M.C., Balancing training data for automated annotation of keywords: A case study, WOB (2003), 10–18.

5. Bradley A.P., The use of the area under the ROC curve in the evaluation of machine learning algorithms, Pattern Recognition 30(7) (1997), 1145–1159.

6. Chawla N.V., Data mining for imbalanced data sets: An overview, Data Mining and Knowledge Discovery Handbook, Springer US, 2010, pp. 875–886.

7. Chawla N.V., Bowyer K.W. and Hall L.O., SMOTE: Synthetic Minority Over sampling Technique, 16 (2002), 321–357.

8. Drummond and Holte R.C., C4.5, class imbalance, and cost sensitivity: Why under-sampling beats over-sampling, in Proceedings of the ICML’03 Workshop on Learning From Imbalanced Datasets, 2003.

9. Duda, Hart and Stork, Pattern Classification (2nd ed.), Wiley-Interscience, 2001.

10. Ezawa K.J., Singh and Norton S.W., Learning Goal Oriented Bayesian Networks for Telecommunications Risk Management, in Proceedings of the International Conference on Machine Learning, ICML-96, 1996, pp. 139–147.

11. and Japkowicz, Class imbalances versus small disjuncts, SIGKDD Explorations 6(1) (2004), 40–49.

12. Kubat, Holte and Matwin, Machine learning for the detection of oil spills in satellite radar images, Machine Learning 30 (1998), 195–215.

13. Latourrette, Toward an explanatory similarity measure for nearest-neighbor classification, in Proceedings of the 11th European Conference on Machine Learning, London, 2000, pp. 238–245.

14. Lee T.S. and Chen I.F., A two-stage hybrid credit scoring model using artificial neural networks and multivariate adaptive regression splines, Expert Systems with Applications 28(4) (2005), 743–752.

15. Lewis and Catlett, Heterogeneous Uncertainty Sampling for Supervised Learning, in Proceedings of the Eleventh International Conference of Machine Learning, 1994, pp. 148–156.

16. Zorkeflee, Din A.M. and Ku-Mahamud K.R., Fuzzy And Smote Resampling Technique For Imbalanced Data Sets, Proceedings of the 5th International Conference on Computing and Informatics, ICOCI2015, Istanbul, Turkey, Universiti Utara Malaysia, 2015.

17. Mostafizur Rahman and Davis D.N., Addressing the class imbalance problem in medical datasets, International Journal of Machine Learning and Computing 3(2) (2013), 224–229.

18. Jeatrakul, Wong K.W. and Fung C.C., Classification of Imbalanced Data by Combining the Complementary Neural Network and SMOTE Algorithm, Springer-Verlag, 2010.

19. Provost and Fawcett, Robust classification for imprecise environments, Machine Learning 42(3) (2001), 203–231.

20. Mulak and Talhar, Analysis of distance measures using K nearest neighbour algorithm on KDD dataset, International Journal of Science and Research 4(7) (2015), 2101–2104.

21. Wang, A hybrid sampling SVM approach to imbalanced data classification, Hindawi Publishing Corporation Abstract and Applied Analysis 2014 (2014), 1–7.

22. Quinlan J.R., C4.5: Programs for machine learning, San Mateo: Morgan Kaufmann, 1993.

23. Ramesh, A study on efficiency of decision tree and multi layer perceptron to predict the customer churn in telecommunication using WEKA, International Journal of Computer Applications 140(4) (2016), 26–30.

24. Song, Huang, Zhou, Zha and Giles C.L., Iknn: Informative k-nearest neighbor pattern classification, in Proceedings of the 11th European Conference on Principles and Practice of Knowledge Discovery in Databases, Berlin, 2007, pp. 248–264.

25. Fawcett, ROC Graphs: Notes and Practical Considerations for Data Mining Researchers, Intelligent Enterprise Technologies Laboratory, HP Laboratories Palo Alto, HPL-2003, 4, 2003, pp. 1–27.

26. Yen S.J. and Lee Y.S., Cluster-based under-sampling approaches for imbalanced data distributions, Expert Systems with Applications 36 (2009), 5718–5727.

27. https://en.wikipedia.org/wiki/Weka_(machine_learning)

28. https://sourceforge.net/projects/ikvm/files

Classifier	Accuracy on Actual Dataset (%)	Accuracy on EMOTE’s Dataset (%)	Lift (%)
K-Star	62.75	99.45	36.70
NB Tree	43.45	90.89	47.44
Simple Cart	51.03	91.82	40.79
Bayes Net	33.33	86.95	53.62
MLP	52.41	83.41	31.00