Entropy difference and kernel-based oversampling technique for imbalanced data learning

Abstract

Class imbalance is a common problem in real-world datasets, where one class contains few instances and the other contains many. It is difficult to develop an effective model with traditional data mining and machine learning algorithms unless data preprocessing techniques are used to balance the dataset. Oversampling is often used as a preprocessing method for imbalanced datasets. Specifically, synthetic oversampling techniques balance the number of training instances between the majority class and the minority class by generating extra artificial minority class instances. However, current oversampling techniques consider only the imbalance in quantity and pay no attention to whether the distribution is balanced. Therefore, this paper proposes an entropy difference and kernel-based SMOTE (EDKS), which measures the imbalance degree of a dataset in terms of distribution by entropy difference and overcomes the limitation of SMOTE on nonlinear problems by oversampling in the feature space of the support vector machine classifier. First, the EDKS method maps the input data into a feature space to increase the separability of the data. Then EDKS calculates the entropy difference in kernel space, determines the majority class and minority class, and finds the sparse regions in the minority class. Moreover, the proposed method balances the data distribution by synthesizing new instances and evaluating their retention capability. Our algorithm can effectively distinguish datasets with the same imbalance ratio but different distributions. The experimental study evaluates and compares the performance of our method against state-of-the-art algorithms, and demonstrates that the proposed approach is competitive with them on multiple benchmark imbalanced datasets.

Keywords

Imbalanced dataset, oversampling, entropy difference, kernel space, SVM

1. Introduction

Imbalanced classification, where the majority class dominates the minority class, arises widely in data mining and machine learning applications, such as enterprise credit evaluation [1], bankruptcy prediction [2], fraud detection [3, 4] and others [5].

Many well-known machine learning models have been presented to solve classification problems under the assumption of a balanced class distribution. However, these algorithms do not necessarily work well on imbalanced data, because their goal is to minimize the global error rate [6]. In other words, a classifier can achieve high classification accuracy even though it does not correctly predict any of the instances in the minority class. For example, assuming that 0.1% of credit card transactions are fraudulent, a simple classifier that classifies all transactions as legitimate will achieve a classification accuracy of 99.9%. In this case, however, all fraud cases remain undetected.

Techniques aimed at improving classification performance in the case of class imbalance can be divided into two broad categories: algorithm level methods and data level methods.

•
Algorithm level methods include modifying classical algorithms, cost-sensitive methods and ensembles of classifiers. Strategies that modify the classification algorithm to deal with the imbalance are algorithm level techniques [7]. Such techniques include training separate classifiers for each class and changing decision thresholds [8, 9]. The purpose of cost-sensitive methods is to provide classification algorithms with different misclassification costs for each class. However, they require knowledge of the misclassification costs, which are commonly unknown, difficult to quantify and dataset-dependent. In addition, these algorithms should be able to incorporate the misclassification cost of each class or sample into their optimization. Therefore, these methods are considered to operate at both the data and the algorithm level [10]. Ensembles of classifiers are designed to improve the accuracy of a single classifier by training several different models and combining their decisions to predict a single class label [10, 11]. The main disadvantage of the ensemble method is its high time complexity.
•
Data level methods can be seen as classifier-independent techniques, which rebalance the data distribution so that standard algorithms focus on the goals of the user [12]. In particular, data level methods can be categorized as under-sampling majority class instances [13] and oversampling minority class instances [14, 15]. Unlike algorithm-level methods, which are bound to problem-specific classifiers and need to be implemented within those classifiers, data-level methods, especially oversampling methods, are universal and therefore more versatile [10].

Nonlinear separability can markedly affect the performance of machine learning algorithms on imbalanced datasets. Fortunately, SVMs are well suited to imbalanced problems, because the final decision function inherently depends only on a small subset of the training instances, known as support vectors [16]. Moreover, SVMs work well on nonlinear problems by identifying the separating boundary based on a kernel function that maps the dataset to a feature space. In this paper, entropy difference (ED) is introduced in the kernel space as a measure of distribution imbalance. It is well known that entropy can measure the disorder of a data distribution. ED is a class-position statistic that reflects the inter-class distribution through intra-class concentration. SMOTE is used to synthesize instances where the intra-class concentration is low. In order to reduce the ED of a dataset, this paper evaluates each synthetic instance and retains only qualified instances.

Josey et al. [16] developed a weighted kernel-based SMOTE (WK-SMOTE) method to address nonlinearity and class imbalance at the same time, and obtained some impressive results. In order to improve the classification performance of WK-SMOTE, in this paper we fully consider the balance of the data distribution and propose an entropy difference and kernel-based SMOTE technique for solving the imbalanced problem. First, the EDKS method maps the input data into a feature space to increase the separability of the data. Then EDKS calculates the ED in kernel space, determines the majority and minority class and finds the sparse regions in the minority class. Moreover, the proposed method balances the data distribution by synthesizing new instances and evaluating their retention capability. The main contribution of this paper is to use ED to measure the data distribution in kernel space to ensure that new instances stay within a reasonable area.

The remainder of this paper is organized as follows: Section 2 introduces the class imbalance problem, discusses some oversampling methods for rebalancing imbalanced datasets and briefly reviews some kernel methods related to nonlinear separable classification problems. In Section 3, we describe the proposed entropy difference and kernel-based SMOTE in detail. The experimental results are presented in Section 4. The conclusions of the paper and further work on this topic are discussed in Section 5.

2. Related work

Although many different types of methods have been proposed for class imbalance problems, Section 2.1 only reviews the work of data resampling related to this article. Some kernel methods related to nonlinear separable classification problems are also introduced in Section 2.2.

2.1 Imbalance problem

Undersampling methods try to create a balanced subset of the original imbalanced dataset by reducing the number of majority class instances. Random undersampling (RUS) is a nonheuristic method which aims to rebalance the class distribution through the random elimination of majority class instances. The obvious drawback of RUS is that it discards much of the information carried by the majority instances. In order to keep more useful information about the majority instances, clustering techniques have recently been introduced into undersampling methods [13, 17, 18]. The clustering-based undersampling method [13] clusters the majority instances with the k-means algorithm [19], with the parameter k set equal to the number of minority instances. Then, the k cluster centers produced by the k-means algorithm over the majority instances are used to replace the entire majority class dataset. Consequently, the majority and minority class datasets contain the same number of instances.

Oversampling techniques balance the number of instances by increasing the number of minority class instances. The simplest oversampling method is Random Over-Sampling (ROS) [7], which replicates minority instances at random and adds them to the training dataset. However, training a classification model on a balanced dataset produced by ROS may lead to overfitting. Chawla et al. [20] proposed a traversal algorithm based on linear interpolation named SMOTE, which can effectively expand the range of the minority instances and avoid overfitting. Its main idea is to randomly select one or more of the k nearest neighbors of a minority class instance and to create new positive instances along the line segments joining the selected instances. Since then, many variants of SMOTE have been proposed, such as adaptive synthetic sampling approach [15], Borderline-SMOTE [21], safe-level synthetic minority oversampling technique [22], density-based synthetic minority oversampling technique [23] and majority weighted minority oversampling technique (MWMOTE) [24]. In [25], Fernández et al. provided a summative analysis of the variations of SMOTE and systematically pointed out their differences from the original SMOTE algorithm. Overall, although SMOTE reduces the risk of overfitting, it also leads to the problem of over generalization [26]. In [27], the method proposed by Zhu et al. is designed for combating the multi-class imbalance problem and avoids over generalization by assigning a corresponding selection weight to each neighbor direction. Menardi and Torelli’s research showed the effect of class imbalance on model training and evaluation [28]. Moreover, they proposed a unified and systematic data processing framework for imbalanced datasets based on a smoothed bootstrap resampling technique. Liu et al. [29] made full use of the correlation among different attributes and developed a fuzzy rule-based oversampling technique that addresses imbalance and missing values at the same time. RWO-sampling [30] is a random walk oversampling method, which redistributes different class instances by generating synthetic instances from the real data. This method also allows the extension of the classification border. In [31], Das et al. remarked that existing oversampling methods for addressing the class imbalance problem generally did not consider the probability distribution of the minority class. Thus, they presented two probabilistic oversampling methods, termed RACOG and wRACOG, which use the joint probability distribution of data attributes and Gibbs sampling to create artificial instances for the minority class.

It is well known that class imbalance is not a problem in itself, and that various data distribution characteristics may be the real cause of the performance degradation of classifiers. Many papers have discussed how to deal with these special issues. SMOTE-IPF [14] is devoted to controlling borderline instances and the noise produced by SMOTE. In addition, the Mahalanobis distance-based oversampling technique (MDO) [32] paid attention to maintaining the covariance structure of minority instances and reduced the risk of overlap between different class regions, which is considered a serious challenge in multi-class problems.

It is important to measure the imbalance degree between the minority and the majority class. The traditional measure of class imbalance is the imbalance ratio (IR), which is defined as the ratio of the number of majority instances to the number of minority instances. The simplicity of IR has led to its application in most studies of imbalanced data. IR reflects the imbalance degree of a dataset in size, but does not measure the imbalance degree in distribution. Even if a dataset is quantitatively balanced, an imbalance of class distribution may still exist [33]. Furthermore, the classification accuracy of the minority class is related to the number of informative instances rather than the number of minority instances [34]. Therefore, this paper uses the entropy difference to measure the imbalance degree of the dataset distribution, which is completely different from IR. The advantages of using ED can be clearly seen in Fig. 1. The two datasets have different ED and the same IR. In Fig. 1a, the absence of overlapping regions and the clear classification boundary between the two classes make the dataset easy for any simple classifier to recognize. Figure 1b is quite different. Clearly, IR cannot distinguish these two datasets with different distributions. In short, the representative instances of the minority class are crucial in learning the minority distribution. Previous works [33, 34] have shown that, for a fixed IR, the more representative instances the minority class contains, the better the classifier’s performance. Therefore, it is not appropriate to use IR as the only indicator of the imbalance degree.

Figure 1.

$\circ$ : negative, $\blacksquare$ : positive. Two datasets have different ED and same IR.

Entropy is often used to measure the uncertainty of a data distribution, and it can be considered as the opposite of distribution information. In other words, the more randomly distributed a dataset is, the less information it contains [35]. For imbalanced data, a more dispersed and more imbalanced intra-class distribution implies a higher level of entropy. For this reason, entropy is introduced into the kernel space as a measure of data distribution.

2.2 Nonlinear separable problem

Classical oversampling methods, such as SMOTE [20], AdaSyn [15] and Borderline [21], generate instances for the minority class to balance the dataset. A new minority instance is inserted into the line segment between $x_{p}$ and $x_{q}$ $({x_{p}},{x_{q}}\in{D^{\min}})$ as

$\displaystyle{x_{\textit{new}}}={x_{p}}+({x_{q}}-{x_{p}})\times\delta$ (1)

where $\delta$ is a random number drawn from the uniform distribution $U[{0,1}]$ . The main difference between these algorithms lies in the choice of $x_{p}$ and $x_{q}$ .
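The interpolation of Eq. (1) can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `smote_point` and the sample coordinates are assumptions introduced here, and the choice of $x_{p}$ and $x_{q}$, which is what distinguishes the individual algorithms, is left to the caller.

```python
import random

def smote_point(x_p, x_q, rng=random):
    """Return one synthetic point on the segment from x_p to x_q (Eq. 1).

    x_p, x_q: minority instances given as equal-length lists of floats.
    rng: any object with a random() method (hypothetical parameter).
    """
    delta = rng.random()  # uniform draw from [0, 1)
    return [p + (q - p) * delta for p, q in zip(x_p, x_q)]

# Illustrative values only; a seeded generator keeps the run repeatable.
rng = random.Random(0)
x_new = smote_point([1.0, 2.0], [3.0, 6.0], rng)
```

Every coordinate of `x_new` uses the same `delta`, so the new instance always lies on the line segment joining the two selected minority instances.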

Figure 2.

$\circ$ : negative, $\blacksquare$ : positive, $*$ : synthetic. Comparison of minority instances generated in input and feature space. (a) Input space $x={[{{x_{1}},{x_{2}}}]^{T}}$ , (b) Kernel space $\phi(x)={[{{x_{1}}^{2},{x_{2}}^{2},\sqrt{2}{x_{1}}{x_{2}}}]^{T}}$ .

Although SMOTE has been shown to obtain good classification performance in many application areas, its performance is largely limited on nonlinear separable problems. For example, consider the two-dimensional classification problem shown in Fig. 2a. In order to balance the dataset artificially, SMOTE identifies minority instances and their neighbors in the input space, and generates random minority instances on the line segments connecting them. It is worth noting that some synthetic instances have clearly been generated in the majority class region. This problem has been addressed in the literature by modifying the selection of neighbors in SMOTE. In [21], the danger set is composed of the minority instances with a large number of majority neighbors. These borderline danger instances are then used to generate instances. Such methods, however, depend on specific data distribution characteristics, which limits their application to specific domains. On the other hand, classifiers such as neural networks and SVM solve nonlinear problems by mapping the instances to a feature space and building the decision function in this space. The two-dimensional nonlinear separable classification task in Fig. 2a can be transformed into a three-dimensional feature space using the transformation function $\phi(\cdot)$ that maps the input point $x={[{{x_{1}},{x_{2}}}]^{T}}$ to

$\displaystyle\phi(x)={[{{x_{1}}^{2},{x_{2}}^{2},\sqrt{2}{x_{1}}{x_{2}}}]^{T}}$ (2)

As shown in Fig. 2b, the nonlinear separable problem is transformed into a linear separable problem. Hence, performing oversampling in this feature space generates more representative instances. In addition to modifying the decision function, the selection of the kernel function also affects the generalization ability of the SVM classifier [36, 37, 38]. Some kernel-based methods for imbalanced classification have also been proposed. For example, in [39] and [40], preimages of instances generated in feature space are used to balance the imbalanced dataset.
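The mapping of Eq. (2) never has to be evaluated explicitly, because its inner product equals the homogeneous second-degree polynomial kernel $K(x,y)=(x^{T}y)^{2}$ . This identity is a standard fact rather than a statement taken from the text above, and the short check below, with hypothetical function names and sample points, only illustrates it.

```python
import math

def phi(x):
    """Explicit feature map of Eq. (2) for a two-dimensional input."""
    x1, x2 = x
    return [x1 * x1, x2 * x2, math.sqrt(2.0) * x1 * x2]

def poly2_kernel(x, y):
    """Homogeneous polynomial kernel of degree 2, K(x, y) = (x . y)^2."""
    return (x[0] * y[0] + x[1] * y[1]) ** 2

# Illustrative points; both routes should give the same inner product.
x, y = [1.0, 2.0], [3.0, -1.0]
explicit = sum(a * b for a, b in zip(phi(x), phi(y)))
implicit = poly2_kernel(x, y)
```

Both values agree up to floating-point rounding, which is why kernel-based oversampling can work with kernel evaluations alone.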

A synthetic instance between two minority instances ${x_{p}},{x_{q}}\in{D^{\min}}$ can be generated in feature space as

$\displaystyle\phi({x^{pq}})=\phi({x_{p}})+{\delta^{pq}}({\phi({x_{q}})-\phi({x_{p}})})$ (3)

where ${\delta^{pq}}$ is drawn from $U[{0,1}]$ . Josey et al. [16] proved that adding synthetic instances in feature space can maintain the separability of classes for nonlinear separable problems.

Theorem 1 [16]: Given a binary dataset $D$ and the reproducing kernel Hilbert space (RKHS) associated with kernel function $K({\cdot,\cdot})$ where $D$ is linearly separable, there exists a separating hyperplane for $D\cup{x^{pq}}$ in the RKHS where ${x^{pq}}$ is the image of synthetic instance generated by Eq. (3) in the input space.

The situation is different for instances generated in the input space. Addition of synthetic instances increases the representation of the minority class and forces the classification model to move the decision boundary toward the majority class.

3. The proposed EDKS algorithm

In this section, we propose an ED and kernel-based SMOTE technique (EDKS). EDKS for SVM involves three key steps. In the first step, the entropy difference between the two classes of data is calculated in feature space and the minority class is determined (Section 3.1). In the second step, the pairs of seed and neighbor used to generate synthetic instances are identified from the minority class, and then synthetic instances are generated on the line segments between the pairs (Section 3.2). Finally, the retention capability of each new synthetic instance is evaluated and only qualified instances are retained (Section 3.3). Note that the whole discussion in this section takes place in feature space. Figure 3 shows the framework of the method.

Figure 3.

The structure of EDKS.

3.1 Entropy difference

Given a training dataset $D$ containing instances $X=\{x_{i}|i=1,2,\cdots,N\}$ , its classes are written as $C=\{C_{l}|l=1,2\}$ and the corresponding numbers of instances are denoted as $\{{{N_{1}},{N_{2}}}\}$ . Furthermore, in the kernel space, the distance metric used to identify neighbors needs to be redefined. Consider two instances ${x_{i}}$ and ${x_{j}}$ , which are transformed to feature space as $\phi({x_{i}})$ and $\phi({x_{j}})$ , respectively. The distance between these two instances in feature space is defined as follows:

$\displaystyle{d^{\phi}}({{x_{i}},{x_{j}}})^{2}=\|{\phi({x_{i}})-\phi({x_{j}})}\|^{2}=K({{x_{i}},{x_{i}}})-2K({{x_{i}},{x_{j}}})+K({{x_{j}},{x_{j}}})$ (4)

The $k$ -nearest minority instances of a seed instance are identified in the feature space using Eq. (4).
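As an illustration of Eq. (4), the sketch below finds nearest neighbors from a precomputed kernel matrix only. The RBF kernel, its `gamma` value, the helper names and the sample points are assumptions made for this example; any valid kernel could be substituted.

```python
import math

def rbf_kernel(x, y, gamma=0.5):
    """Gaussian RBF kernel (an assumed choice; gamma is illustrative)."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq)

def feature_distance_sq(K, i, j):
    """Squared feature-space distance of Eq. (4) from kernel matrix K."""
    return K[i][i] - 2.0 * K[i][j] + K[j][j]

def k_nearest(K, i, candidates, k):
    """Indices of the k candidates closest to instance i in feature space."""
    others = [j for j in candidates if j != i]
    return sorted(others, key=lambda j: feature_distance_sq(K, i, j))[:k]

# Illustrative minority instances and their kernel matrix.
X = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.0, 0.2]]
K = [[rbf_kernel(a, b) for b in X] for a in X]
neighbors = k_nearest(K, 0, range(len(X)), 2)
```

The mapping $\phi(\cdot)$ is never evaluated: the neighbor search reads only entries of `K`.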

We define a density-based instance-position statistic for the $i^{\text{th}}$ instance of a given dataset using Eq. (5):

$\displaystyle{\mu_{k}}({\phi({{x_{i}}})})=\frac{1}{k}\sum\limits_{t=1}^{k}\textit{sim}_{t}(\phi({{x_{i}}}),Q_{i}({\phi({{x_{i}}})}))$ (5)

where $k$ is the number of nearest neighbors in the same class, $Q_{i}$ contains the $k$ -nearest neighbors of instance $\phi({x_{i}})$ in the same class, and $\textit{sim}_{t}(\phi({x_{i}}),{Q_{i}})$ represents the similarity between the instance $\phi({x_{i}})$ and its $t^{\text{th}}$ nearest neighbor in $Q_{i}$ , which is generally measured by the Euclidean distance. Therefore, ${\mu_{k}}({\phi({x_{i}})})$ is an average density metric that measures how far $\phi({x_{i}})$ is from its $k$ -nearest neighbors. The entropy-based class-position statistic for the $i^{\text{th}}$ instance of class $C_{l}$ is given by:

$\displaystyle{\omega_{i}}=\frac{{\mu_{k}}({\phi({{x_{i}}})})}{\sum\limits_{{x_{j}}\in{C_{l}}}{\mu_{k}}({\phi({{x_{j}}})})}$ (6)

where ${\omega_{i}}$ is the proportion of $\phi({x_{i}})$ in the total density metric of $C_{l}$ . Therefore, the intra-class concentration of each instance can be measured through an entropy-based class-position statistic based on $k$ -nearest neighbors. A higher concentration around $\phi({x_{i}})$ leads to a smaller ${\mu_{k}}({\phi({x_{i}})})$ and a smaller ${\omega_{i}}$ . In other words, the magnitude of ${\omega_{i}}$ reflects the intra-class concentration of $\phi({x_{i}})$ .

$\displaystyle{E_{l}}=-\frac{1}{{{N_{l}}}}\sum\limits_{{x_{i}}\in{C_{l}}}{\omega_{i}}{{\log}_{2}}{\omega_{i}}$ (7)

The entropy of each class is calculated by Eq. (7). Let $C_{1}$ and $C_{2}$ represent the minority and majority class, respectively. It is easy to see that ${E_{1}}\geqslant{E_{2}}>0$ . Entropy is determined by the size and symmetry of information. The experimental results show that, on an imbalanced dataset, the entropy of the majority and minority classes usually depends on the information quantity, i.e., the intra-class entropy of the minority class is usually larger than that of the majority class. On this basis, information symmetry affects the magnitude of the intra-class entropy. In order to measure the imbalance degree in the distribution of a dataset, a new metric is given by

$\displaystyle\text{ED}=\theta={E_{1}}-{E_{2}}$ (8)

where $\theta\geqslant 0$ . The more balanced the data distribution is, the smaller the ED. When ED $=$ 0, the distribution of the dataset is balanced in terms of entropy. Algorithm 1 describes the specific details of the above process.

Algorithm 1. EDID ( $D, K, k$ )
Input: The dataset $D$ . The kernel matrix $K$ . Number of nearest neighbors $k$ .
Output: The entropy difference ED. The entropy of the majority set $E_{\textit{maj}}$ . The minority set $C_{\min}$ .
1. Divide $D$ into $C_{1}$ and $C_{2}$ by class label.
2. Compute intra-class $k$ -nearest neighbors set $Q({\phi({{x_{i}}})})$ for all instances $\phi(x_{i})$ in $D$ according to Eq. (4).
3. Compute the instance-position statistic $\mu_{k}(\phi(x))$ for all instances in $D$ according to Eq. (5).
4. Calculate the class-position statistic $\omega_{i}$ for all instances in $D$ according to Eq. (6).
5. Calculate the entropy $E_{l}$ for all instances in $C_{l}(l=1,2)$ using Eq. (7).
6. Calculate the entropy difference $\theta={E_{1}}-{E_{2}}$ .
7. if $\theta>0$ then
8.     return ED $=\theta,E_{\textit{maj}}=E_{2},C_{\min}=C_{1}$ .
9. else
10.     return ED $=-\theta,E_{\textit{maj}}=E_{1},C_{\min}=C_{2}$ .
11. end if
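A minimal Python sketch of Algorithm 1 is given below, under stated assumptions: the similarity in Eq. (5) is taken to be the feature-space distance of Eq. (4), every class has more than $k$ instances, and the instances of a class do not all coincide. The function names, the linear kernel and the toy data are illustrative choices, not part of the original algorithm description.

```python
import math

def feature_dist(K, i, j):
    """Feature-space distance from Eq. (4); clamps rounding noise at zero."""
    return math.sqrt(max(K[i][i] - 2.0 * K[i][j] + K[j][j], 0.0))

def class_entropy(K, members, k):
    """Entropy of one class following Eqs. (5)-(7)."""
    mu = []
    for i in members:
        dists = sorted(feature_dist(K, i, j) for j in members if j != i)
        mu.append(sum(dists[:k]) / k)            # Eq. (5), assumes > k members
    total = sum(mu)                              # assumed to be non-zero
    omega = [m / total for m in mu]              # Eq. (6)
    return -sum(w * math.log2(w) for w in omega if w > 0) / len(members)  # Eq. (7)

def edid(K, labels, k):
    """Return (ED, E_maj, indices of the minority set) for a binary dataset."""
    first, second = sorted(set(labels))
    c1 = [i for i, y in enumerate(labels) if y == first]
    c2 = [i for i, y in enumerate(labels) if y == second]
    e1, e2 = class_entropy(K, c1, k), class_entropy(K, c2, k)
    theta = e1 - e2
    if theta > 0:
        return theta, e2, c1
    return -theta, e1, c2

# Toy data: 4 evenly spaced points in one class, 8 in the other, linear kernel.
xs = [0.0, 1.0, 2.0, 3.0] + [float(v) for v in range(10, 18)]
labels = [0] * 4 + [1] * 8
K = [[a * b for b in xs] for a in xs]
ed, e_maj, c_min = edid(K, labels, k=1)
```

With evenly spaced instances every $\omega_{i}$ equals $1/N_{l}$ , so the class entropy reduces to $\log_{2}(N_{l})/N_{l}$ ; the smaller class therefore has the larger entropy and is returned as the minority set.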

Remark

In traditional IR-based measurement, the minority class must contain fewer instances than the majority class. However, the result of measuring the imbalance degree with ED is not always consistent with this. The minority class determined by ED can contain as many instances as the majority class, or even more, which indicates that the majority class distribution plays a dominant role in the classification results.

3.2 Synthesize minority instances

In this section, we describe how minority instances are synthesized. First, the instance-position statistic of each minority class instance is computed by Eq. (5), and the minority instance $\phi({x_{p}})$ with the largest instance-position statistic is used as the seed of the oversampling algorithm. Then, the instance $\phi({x_{q}})$ with the largest instance-position statistic is selected from $Q({\phi({{x_{p}}})})$ . Finally, linear interpolation is performed between $\phi({x_{p}})$ and $\phi({x_{q}})$ .

In order to train the SVM classifier in kernel space, the inner product of every pair of instances, also known as the Gram matrix, needs to be computed. The inner products of the training instances are conveniently represented by ${\text{K}^{1}}\in{R^{N\times N}}$ , where element ${\text{K}_{ij}^{1}}=\text{K}({x_{i}},{x_{j}})=\phi{({x_{i}})^{T}}\phi({x_{j}}),{x_{i}},{x_{j}}\in D$ . The kernel matrix ${\text{K}^{1}}$ is augmented to include the synthetic instances generated by Eq. (3). The seed and neighbor pairs use only instances from the original dataset. The kernel matrix K obtained by the addition of new minority class instances is decomposed as

$\displaystyle\text{K}=\left[{\begin{array}{ll}{{\text{K}^{1}}}&{{\text{K}^{2}}}\\ {({\text{K}^{2}})^{T}}&{{\text{K}^{3}}}\\ \end{array}}\right]$ (9)

where ${\text{K}^{2}}\in{R^{N\times P}}$ with element ${\text{K}_{ij}^{2}}=\text{K}({x_{i}},x_{j}^{pq})$ , ${x_{i}}\in D,x_{j}^{pq}\in{D^{\textit{syn}}}$ ; ${\text{K}^{3}}\in{R^{P\times P}}$ with element ${\text{K}_{ij}^{3}}=\text{K}(x_{i}^{lm},x_{j}^{pq})$ , $x_{i}^{lm},x_{j}^{pq}\in{D^{\textit{syn}}}$ .

The elements of ${\text{K}^{2}}$ are the inner products in feature space between an original instance, say ${x_{i}}$ , and a synthetic instance, say $x_{j}^{pq}$ . They are obtained as

$\displaystyle\text{K}({x_{i}},x_{j}^{pq})=\phi{({x_{i}})^{T}}\phi(x_{j}^{pq})=\phi{({x_{i}})^{T}}[{\phi({x_{p}})+{\delta^{pq}}({\phi({x_{q}})-\phi({x_{p}})})}]=({1-{\delta^{pq}}})\text{K}({{x_{i}},{x_{p}}})+{\delta^{pq}}\text{K}({{x_{i}},{x_{q}}})$ (10)

Similarly, the elements of ${\text{K}^{3}}$ are the inner products in feature space between pairs of synthetic instances. The inner product of two synthetic instances $x_{i}^{lm}$ and $x_{j}^{pq}$ is given by

$\displaystyle\text{K}(x_{i}^{lm},x_{j}^{pq})=\phi{(x_{i}^{lm})^{T}}\phi(x_{j}^{pq})=[{\phi({x_{l}})+{\delta^{lm}}({\phi({x_{m}})-\phi({x_{l}})})}]^{T}\times[{\phi({x_{p}})+{\delta^{pq}}({\phi({x_{q}})-\phi({x_{p}})})}]=({1-{\delta^{pq}}})({1-{\delta^{lm}}})\text{K}({x_{p}},{x_{l}})+({1-{\delta^{pq}}})({{\delta^{lm}}})\text{K}({x_{p}},{x_{m}})+({{\delta^{pq}}})({1-{\delta^{lm}}})\text{K}({x_{q}},{x_{l}})+({{\delta^{pq}}})({{\delta^{lm}}})\text{K}({x_{q}},{x_{m}})$ (11)

From Eqs (9)–(11), it is clear that the augmented kernel matrix K is only composed of the training instances in $D$ and the kernel function $\text{K}(\cdot,\cdot)$ with no explicit knowledge of the mapping $\phi(\cdot)$ . Hence, any valid kernel function can be used to train the SVM classifier with the proposed method to artificially balance the dataset.
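The block structure of Eq. (9) can be assembled directly from ${\text{K}^{1}}$ . The following sketch assumes that each synthetic instance is described by a hypothetical triple `(p, q, delta)` of seed index, neighbor index and interpolation coefficient; the function name and the toy data are likewise illustrative.

```python
def augment_kernel(K1, pairs):
    """Assemble the augmented kernel matrix of Eq. (9).

    K1: N x N kernel matrix of the original instances (list of lists).
    pairs: list of (p, q, delta) triples, one per synthetic instance
           phi(x_p) + delta * (phi(x_q) - phi(x_p)).
    """
    n = len(K1)
    # K^2: original instance against synthetic instance, Eq. (10).
    K2 = [[(1 - d) * K1[i][p] + d * K1[i][q] for (p, q, d) in pairs]
          for i in range(n)]
    # K^3: synthetic instance against synthetic instance, Eq. (11).
    K3 = [[(1 - d_pq) * (1 - d_lm) * K1[p][l_idx]
           + (1 - d_pq) * d_lm * K1[p][m_idx]
           + d_pq * (1 - d_lm) * K1[q][l_idx]
           + d_pq * d_lm * K1[q][m_idx]
           for (p, q, d_pq) in pairs]
          for (l_idx, m_idx, d_lm) in pairs]
    top = [K1[i] + K2[i] for i in range(n)]
    bottom = [[K2[i][j] for i in range(n)] + K3[j] for j in range(len(pairs))]
    return top + bottom

# Toy check with a linear kernel, where the result can be verified explicitly.
X = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
K1 = [[sum(u * v for u, v in zip(a, b)) for b in X] for a in X]
pairs = [(0, 1, 0.25), (1, 2, 0.5)]
K = augment_kernel(K1, pairs)
```

For the linear kernel the synthetic instances are simply the interpolated points in input space, so `K` coincides with the Gram matrix of the five points. For other kernels no such input-space points need to exist, and only `K1` and the triples are required.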

It is worth noting that other oversampling algorithms such as Borderline and AdaSyn can also be easily adapted to operate in the feature space of the SVM classifier using this method. The Euclidean distance used in such algorithms is replaced by the feature space distance ${d^{\phi}}({x_{p}},{x_{q}})$ in Eq. (4), and the kernel matrix is augmented based on the selected seed and neighbor using Eqs (10) and (11).

3.3 Evaluate synthetic instance quality

In order to reduce the ED of the dataset, we evaluate each synthetic instance and retain only the qualified instances, i.e., those that reduce the ED when added to the dataset. Therefore, the qualified instances have the ability to balance the majority and minority classes in terms of data distribution.

The implementation of EDKS is described in Algorithm 2. First, EDKS uses Algorithm 1 to determine the minority class, and finds its seed instances as well as neighbor instances. Then it uses Eqs (9)–(11) to oversample the minority class in feature space. Finally, the retention capability of the new synthetic instances is evaluated.

Algorithm 2. EDKS ( $D, k$ )
Input: The imbalanced dataset $D$ . Number of nearest neighbors $k$ . Algorithm EDID.
Output: Kernel matrix K after oversampling.
1. Initialize $\textit{count}=0$ .
2. Compute kernel matrix ${\text{K}^{1}}$ .
3. $\theta,E_{\textit{maj}},C_{\min}=\text{EDID}(D,\text{K},k)$ .
4. Compute the instance-position statistic $\mu_{k}(\phi(x))$ for all instances in $C_{\min}$ according to Eq. (5).
5. while $\theta>0$ do
6.     Sample a minority instance $\phi({{x_{p}}})$ with the maximum value of $\mu_{k}(\phi(x_{p}))$ for all instances in $C_{\min}$ .
7.     Choose $\phi({{x_{q}}})$ with the maximum value of $\mu_{k}(\phi(x_{q}))$ for all instances in $Q({\phi({{x_{p}}})})$ .
8.     Calculate the augmented kernel matrix K using Eqs (10) and (11).
9.     Recalculate $Q_{\textit{now}}({\phi({{x}})}),\mu_{\textit{now}}(\phi(x)),\omega_{\textit{now}},e_{1}$ .
10.     if $e_{1}-E_{\textit{maj}}\geqslant\theta$ then
11.         while $e_{1}-E_{\textit{maj}}\geqslant\theta$ do
12.             Delete the last row and column of the kernel matrix K.
13.             Calculate the augmented kernel matrix K using Eqs (10) and (11).
14.             Recalculate $Q_{\textit{now}}({\phi({{x}})}),\mu_{\textit{now}}(\phi(x)),\omega_{\textit{now}},e_{1}$ .
15.         end while
16.     end if
17.     $E_{\min}=e_{1},\mu_{k}(\phi(x))=\mu_{\textit{now}}(\phi(x))$
18.     $\theta=E_{\min}-E_{\textit{maj}},\textit{count}=\textit{count}+1$
19. end while
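The main loop of Algorithm 2 can be sketched as follows. This is a simplified illustration rather than a reference implementation: the kernel matrix is assumed to hold only the minority instances, $E_{\textit{maj}}$ is passed in as a number, the similarity in Eq. (5) is taken to be the feature-space distance, and the caps `max_tries` and `max_new` are added here to guarantee termination (they do not appear in Algorithm 2). All function names are hypothetical.

```python
import math
import random

def dist(K, i, j):
    """Feature-space distance from Eq. (4)."""
    return math.sqrt(max(K[i][i] - 2.0 * K[i][j] + K[j][j], 0.0))

def mu_values(K, members, k):
    """Eq. (5): mean distance to the k nearest same-class neighbours."""
    out = {}
    for i in members:
        d = sorted(dist(K, i, j) for j in members if j != i)
        out[i] = sum(d[:k]) / k
    return out

def entropy(mu):
    """Eqs. (6)-(7) applied to the mu values of one class."""
    total = sum(mu.values())
    return -sum((m / total) * math.log2(m / total)
                for m in mu.values() if m > 0) / len(mu)

def append_synthetic(K, p, q, delta):
    """Copy of K with one synthetic row and column added (Eqs. 10-11)."""
    row = [(1 - delta) * K[i][p] + delta * K[i][q] for i in range(len(K))]
    self_k = (1 - delta) * row[p] + delta * row[q]
    return [r + [v] for r, v in zip(K, row)] + [row + [self_k]]

def edks(K, minority, e_maj, k, rng, max_tries=20, max_new=50):
    """Add synthetic minority instances while they reduce the ED."""
    originals = list(minority)       # seeds and neighbours stay original
    members = list(minority)
    mu = mu_values(K, members, k)
    theta = entropy(mu) - e_maj
    count = 0
    while theta > 0 and count < max_new:
        p = max(originals, key=lambda i: mu[i])          # sparsest seed
        nbrs = sorted((j for j in originals if j != p),
                      key=lambda j: dist(K, p, j))[:k]
        q = max(nbrs, key=lambda j: mu[j])
        accepted = False
        for _ in range(max_tries):                       # redraw delta if rejected
            cand = append_synthetic(K, p, q, rng.random())
            new_members = members + [len(K)]
            new_mu = mu_values(cand, new_members, k)
            e1 = entropy(new_mu)
            if e1 - e_maj < theta:                       # candidate reduces the ED
                accepted = True
                break
        if not accepted:
            break
        K, members, mu = cand, new_members, new_mu
        theta = e1 - e_maj
        count += 1
    return K, count, theta

# Toy run: four evenly spaced minority points with a linear kernel.
xs = [0.0, 1.0, 2.0, 3.0]
K_min = [[a * b for b in xs] for a in xs]
K_aug, n_new, theta_final = edks(K_min, range(4), e_maj=0.375, k=1,
                                 rng=random.Random(0))
```

A rejected candidate is simply discarded and a new interpolation coefficient is drawn, which mirrors steps 11 to 15; the returned matrix grows by one row and one column for every retained instance.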

Before concluding this section, we want to briefly discuss the reasonableness of the synthetic data obtained by EDKS. To overcome the limitations of SMOTE, we propose an entropy difference and kernel-based SMOTE. Kernel machines work by mapping the input data into a feature space to reduce or eliminate overlapping areas, and then building linear algorithms in the feature space to implement nonlinear counterparts in the input data space. On this basis, entropy difference is used to measure the data distribution, determine the location of the data to be generated, and assess the quality of the synthesized data. EDKS not only improves the quality of synthetic data, but also avoids overfitting to a great extent.

4. Experimental results and analyses

In order to illustrate the effectiveness of the proposed EDKS algorithm in solving class imbalance problems, the following experiments are carried out.

4.1 Experimental setup

Table 1
Confusion matrix for two-class classification problem

	Predicted positive	Predicted negative
Actual positive	True positive (TP)	False negative (FN)
Actual negative	False positive (FP)	True negative (TN)

The performance measure is a key factor in the comparison of various models. Accuracy is the most commonly used evaluation metric. However, in the framework of imbalanced datasets, accuracy is no longer a proper measure, as it does not distinguish between the numbers of correctly classified examples of different classes. Therefore, it may lead to erroneous conclusions. For example, a classifier that achieves an accuracy of 90% on a dataset with an IR value of 9 is not useful if it classifies all examples as negative [41]. In our experiments, Area Under the Curve (AUC) [42], Recall and G-mean [43] are used together as the performance metrics. AUC, the area under the receiver operating characteristic (ROC) curve, is widely applied to estimate the accuracy of imbalanced models over all possible thresholds. It is an objective measure which is not affected by subjective factors, because it is independent of the selected decision criterion and of the prior probabilities of the class distributions [7, 44]. Recall refers to the classification accuracy on the positive instances, and G-mean is the geometric mean of the recall on the positive class and the recall on the negative class, as defined in Eq. (15). In a two-class problem, the confusion matrix (shown in Table 1) records the results of correctly and incorrectly recognized instances of each class. The definitions of accuracy, recall, precision and G-mean are as follows:

$\displaystyle\textit{Accuracy}=\frac{{TP+TN}}{{TP+FN+FP+TN}}$ (12)

$\displaystyle\textit{Recall}=\frac{{TP}}{{TP+FN}}$ (13)

$\displaystyle\textit{Precision}=\frac{{TP}}{{TP+FP}}$ (14)

$\displaystyle\textit{G-mean}=\sqrt{\frac{{TP}}{{TP+FN}}\times\frac{{TN}}{{TN+FP}}}$ (15)

Apparently, the greater the values of these three selected performance metrics, the better the performance of the algorithm.
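Equations (12)–(15) can be evaluated directly from the four confusion matrix counts. The sketch below uses a hypothetical function `metrics` and, as an assumed convention, returns 0 for any ratio whose denominator is zero. The example counts reproduce the all-negative classifier on a dataset with an IR of 9 mentioned above.

```python
import math

def metrics(tp, fn, fp, tn):
    """Accuracy, recall, precision and G-mean from Eqs. (12)-(15)."""
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    recall = tp / (tp + fn) if tp + fn else 0.0          # Eq. (13)
    precision = tp / (tp + fp) if tp + fp else 0.0       # Eq. (14), 0 if undefined
    specificity = tn / (tn + fp) if tn + fp else 0.0     # recall on the negatives
    g_mean = math.sqrt(recall * specificity)             # Eq. (15)
    return {"accuracy": accuracy, "recall": recall,
            "precision": precision, "g_mean": g_mean}

# All-negative classifier on 100 positives and 900 negatives (IR = 9).
trivial = metrics(tp=0, fn=100, fp=0, tn=900)
```

The accuracy of this trivial classifier is 0.9, while its recall and G-mean are both 0, which illustrates why accuracy alone is not used as a performance metric here.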

The datasets used in the evaluation are briefly described here. Nineteen binary datasets from the UCI machine learning repository [45] and the KEEL data repository [46] are selected to evaluate the effectiveness of the proposed algorithm. Their details are presented in Table 2, which lists the total number of instances, the number of attributes, the number of minority instances, IR and ED for each dataset. The results reported for each dataset are averages over ten independent runs of stratified five-fold cross-validation.
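The evaluation protocol can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the helper names `stratified_folds` and `repeated_stratified_cv` are hypothetical, the round-robin assignment is one simple way to keep class proportions roughly equal across folds, and the class sizes are those listed for glass4 in Table 2.

```python
import random
from collections import defaultdict

def stratified_folds(labels, n_folds=5, seed=0):
    """Split indices into n_folds folds with roughly equal class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    folds = [[] for _ in range(n_folds)]
    for y in sorted(by_class):
        idxs = by_class[y]
        rng.shuffle(idxs)
        # Deal the shuffled indices of each class round-robin over the folds.
        for pos, idx in enumerate(idxs):
            folds[pos % n_folds].append(idx)
    return folds

def repeated_stratified_cv(labels, n_repeats=10, n_folds=5, seed=0):
    """Yield (train_indices, test_indices) for n_repeats x n_folds splits."""
    for r in range(n_repeats):
        folds = stratified_folds(labels, n_folds, seed + r)
        for k in range(n_folds):
            test = sorted(folds[k])
            train = sorted(i for j, f in enumerate(folds) if j != k for i in f)
            yield train, test

# glass4 in Table 2: 214 instances, 13 of them in the minority class.
labels = [1] * 13 + [0] * 201
ir = labels.count(0) / labels.count(1)   # imbalance ratio: majority / minority
splits = list(repeated_stratified_cv(labels))
```

Under this reading of the protocol, a reported score would be the mean of the metric over the resulting 50 test folds.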

The proposed EDKS is evaluated against the following seven baseline algorithms.

Table 2

Description of the 19 datasets used in the experiments

Dataset	Size	Feature	Min	IR	ED
abalone17vs78910	2338	7	58	39.3	0.09
blocks0	5472	10	559	8.8	0.008
ecoli0vs1	220	7	77	1.9	0.030
ecoli01vs235	244	7	24	9.2	0.148
ecoli0146vs5	280	6	20	13	0.178
ecoli0234vs5	202	7	20	9.1	0.168
ecoli067vs35	222	7	22	9.1	0.164
german	1000	24	300	2.3	0.013
glass0123vs456	214	9	51	3.2	0.062
glass1	214	9	76	1.8	0.027
glass4	214	9	13	15.5	0.230
glass6	214	9	29	6.4	0.111
iris0	150	4	50	2	0.044
pima	768	8	268	1.9	0.012
segment0	2308	19	329	6	0.02
sonar	208	60	97	1.2	0.007
vehicle0	846	18	199	3.3	0.023
yeast0359vs78	506	8	50	9.1	0.092
yeast2vs4	514	8	78	5.6	0.091

• SVM [47]: A standard SVM classifier is trained on the original dataset without any oversampling or weighting.

• Sampling methods:

– SMOTE [9]: Synthetic minority instances are generated as shown in Eq. (1), where ${x_{q}}$ is selected from the five nearest neighbors of a random minority instance ${x_{p}}$. The modified dataset is then used to train an SVM classifier.

– Borderline [21]: New synthetic instances are added near the border between the two classes. Minority instances that have more than half of their $m$ neighbors in the majority class are used to generate SMOTE instances. Here, $m$ is determined empirically such that $|S^{\min}|/2$ instances are used as $x_{p}$ in Eq. (1).

– Safe-Level [22]: New synthetic instances are added in the safe regions of the minority class. Minority instances whose $k$ neighbors all belong to the minority class are used to generate the safe instances.

– AdaSyn [15]: In adaptive synthetic sampling, the number of synthetic instances generated for each minority instance is proportional to the number of majority neighbors it has.

– ROS [7]: Instances of the minority class are randomly selected and replicated in the training dataset.

• WK-SMOTE [16]: Oversampling is performed in the feature space of the SVM classifier, and the synthetic instances are then distinguished from the original instances by weighting.
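To make the SMOTE baseline concrete, a minimal input-space sketch is given below. It assumes that Eq. (1) has the standard SMOTE form $x_{new}=x_{p}+\delta\,(x_{q}-x_{p})$ with $\delta\in[0,1]$; the function name `smote` and the toy points are hypothetical. It illustrates the baseline only, not the kernel-space procedure of WK-SMOTE or EDKS.

```python
import math
import random

def smote(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic points by interpolating between minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x_p = rng.choice(minority)
        # k nearest minority neighbors of x_p, excluding x_p itself
        others = [x for x in minority if x is not x_p]
        neighbors = sorted(others, key=lambda x: math.dist(x, x_p))[:k]
        x_q = rng.choice(neighbors)
        delta = rng.random()  # interpolation weight in [0, 1)
        synthetic.append([a + delta * (b - a) for a, b in zip(x_p, x_q)])
    return synthetic

# Toy minority class in two dimensions (illustrative values only).
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5], [0.2, 0.8]]
new_points = smote(minority, n_new=10)
```

Because each synthetic point lies on the segment between two existing minority instances, interpolation in the input space cannot leave the region spanned by the minority class, which is the limitation that the kernel-space methods address for nonlinear problems.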

Proper selection of hyper-parameters is critical to obtaining a robust classifier that generalizes well to unseen instances. The hyper-parameters of an SVM include the choice of kernel, the kernel parameters, and the soft-margin constant $C$. Numerous methods have been developed to optimize the choice of kernel and its parameters, including kernel alignment [36] and multiple kernel learning [37]. To simplify parameter selection, the Tree-structured Parzen Estimator (TPE) algorithm built into the Hyperopt library for Python is used to optimize the AUC, Recall and G-mean on a validation set drawn from the training set. The best-performing parameters are then used to train on the whole training set, and the resulting model is evaluated on the test set. The RBF kernel is tuned in this way, with its width parameter selected from the range ${10^{-5}}$–${10^{1}}$. The soft-margin constant $C$ is optimized with TPE over the range ${2^{0}}$–${2^{10}}$.

4.2 Results analysis and discussions

Tables 3–5 show the AUC, Recall and G-mean values, respectively, of the eight algorithms on the nineteen datasets. The rank of each algorithm is given in brackets, and the best score for each dataset is highlighted in bold. The average rank of each algorithm is given in the last row. Table 6 shows the comprehensive performance of the eight algorithms, defined as the average of an algorithm's AUC, Recall and G-mean rankings over all datasets.

Table 3
AUC score and rank of the eight algorithms on nineteen datasets

Dataset	SVM	SMOTE	Borderline	Safe	AdaSyn	ROS	WK-SMOTE	EDKS
abalone17vs78910	0.842(7)	0.915(5)	0.921(3)	0.822(8)	0.918(4)	0.909(6)	0.963(1)	0.932(2)
blocks0	0.985(4)	0.982(5)	0.979(6)	0.977(7)	0.972(8)	0.987(2)	0.986(3)	0.990(1)
ecoli0vs1	0.996(4.5)	0.996(4.5)	0.994(8)	0.996(4.5)	0.996(4.5)	0.995(7)	0.997(2)	0.998(1)
ecoli01vs235	0.942(6)	0.943(5)	0.948(4)	0.966(3)	0.927(8)	0.936(7)	0.977(1)	0.969(2)
ecoli0146vs5	0.975(2)	0.961(5)	0.966(4)	0.959(6)	0.910(8)	0.969(3)	0.933(7)	0.981(1)
ecoli0234vs5	0.969(3.5)	0.964(6)	0.981(1)	0.968(5)	0.953(8)	0.958(7)	0.979(2)	0.969(3.5)
ecoli067vs35	0.963(3)	0.953(4)	0.939(5)	0.968(2)	0.912(7)	0.908(8)	0.933(6)	0.986(1)
german	0.784(2.5)	0.769(6)	0.766(7)	0.770(5)	0.754(8)	0.776(4)	0.815(1)	0.784(2.5)
glass0123vs456	0.973(4)	0.934(8)	0.976(3)	0.984(2)	0.969(5)	0.985(1)	0.940(7)	0.952(6)
glass1	0.784(7)	0.785(6)	0.815(1)	0.776(8)	0.802(4.5)	0.802(4.5)	0.813(2)	0.806(3)
glass4	0.982(2.5)	0.971(5)	0.983(1)	0.933(8)	0.940(7)	0.977(4)	0.959(6)	0.982(2.5)
glass6	0.977(6)	0.971(7)	0.980(5)	0.981(3.5)	0.981(3.5)	0.985(1)	0.917(8)	0.982(2)
iris0	1.000(4)	1.000(4)	0.990(8)	1.000(4)	1.000(4)	1.000(4)	1.000(4)	1.000(4)
pima	0.835(1)	0.818(4)	0.814(6)	0.831(2)	0.819(3)	0.816(5)	0.798(7)	0.759(8)
segment0	0.999(4)	0.999(4)	0.999(4)	0.999(4)	0.997(8)	0.999(4)	0.999(4)	0.999(4)
sonar	0.907(8)	0.930(6)	0.940(4)	0.946(2)	0.927(7)	0.939(5)	0.942(3)	0.953(1)
vehicle0	0.996(3)	0.995(5.5)	0.996(3)	0.995(5.5)	0.991(7)	0.996(3)	0.999(1)	0.990(8)
yeast0359vs78	0.700(7)	0.719(5)	0.741(4)	0.761(3)	0.686(8)	0.711(6)	0.819(1)	0.802(2)
yeast2vs4	0.980(2)	0.969(3)	0.961(5)	0.983(1)	0.925(8)	0.968(4)	0.947(7)	0.950(6)
Average rank	4.27(3)	5.16(7)	4.31(4)	4.39(5)	6.34(8)	4.50(6)	3.84(2)	3.18(1)

Table 4

Recall score and rank of the eight algorithms on nineteen datasets

Dataset	SVM	SMOTE	Borderline	Safe	AdaSyn	ROS	WK-SMOTE	EDKS
abalone17vs78910	0.015(8)	0.718(7)	0.730(6)	0.875(4)	0.901(2)	0.746(5)	0.894(3)	0.915(1)
blocks0	0.736(8)	0.946(6)	0.960(3)	0.885(7)	0.963(2)	0.947(5)	0.954(4)	0.987(1)
ecoli0vs1	0.961(6.5)	0.976(3)	0.974(5)	0.959(8)	0.976(3)	0.961(6.5)	0.976(3)	0.985(1)
ecoli01vs235	0.627(8)	0.763(6)	0.806(4)	0.805(5)	0.814(3)	0.738(7)	0.902(2)	0.950(1)
ecoli0146vs5	0.731(8)	0.793(4)	0.743(7)	0.790(5)	0.876(3)	0.755(6)	0.883(2)	0.907(1)
ecoli0234vs5	0.727(8)	0.813(5)	0.794(6)	0.820(4)	0.864(3)	0.760(7)	0.911(2)	0.942(1)
ecoli067vs35	0.662(8)	0.687(6)	0.727(4)	0.700(5)	0.789(3)	0.683(7)	0.871(2)	0.897(1)
german	0.389(8)	0.616(5)	0.631(4)	0.505(7)	0.564(6)	0.647(3)	0.758(2)	0.800(1)
glass0123vs456	0.808(8)	0.839(7)	0.907(3)	0.894(4)	0.920(2)	0.871(5)	0.870(6)	0.943(1)
glass1	0.442(8)	0.713(6.5)	0.793(2.5)	0.713(6.5)	0.733(4)	0.729(5)	0.793(2.5)	0.864(1)
glass4	0.653(8)	0.783(4)	0.767(5.5)	0.767(5.5)	0.800(3)	0.700(7)	0.973(2)	0.985(1)
glass6	0.649(8)	0.736(7)	0.769(4)	0.743(6)	0.801(3)	0.754(5)	0.835(2)	0.897(1)
iris0	1.000(4)	1.000(4)	0.990(8)	1.000(4)	1.000(4)	1.000(4)	1.000(4)	1.000(4)
pima	0.561(8)	0.724(5)	0.779(2)	0.673(7)	0.758(3)	0.707(6)	0.810(1)	0.750(4)
segment0	0.977(8)	0.985(6)	0.994(3)	0.983(7)	0.990(4)	0.988(5)	0.996(2)	0.997(1)
sonar	0.753(8)	0.766(7)	0.817(3)	0.801(4)	0.783(5)	0.770(6)	0.828(2)	0.886(1)
vehicle0	0.932(8)	0.978(6)	0.991(1)	0.969(7)	0.984(4.5)	0.984(4.5)	0.988(2)	0.985(3)
yeast0359vs78	0.188(8)	0.573(4)	0.535(5)	0.246(7)	0.660(3)	0.528(6)	0.788(2)	0.824(1)
yeast2vs4	0.631(8)	0.813(4)	0.731(7)	0.720(6)	0.833(3)	0.750(5)	0.868(2)	0.918(1)
Average rank	7.71(8)	5.45(5)	4.37(4)	5.73(7)	3.29(3)	5.53(6)	2.50(2)	1.42(1)

Table 5

G-Mean score and rank of the eight algorithms on nineteen datasets

Dataset	SVM	SMOTE	Borderline	Safe	AdaSyn	ROS	WK-SMOTE	EDKS
abalone17vs78910	0.055(8)	0.795(6)	0.816(4)	0.692(7)	0.853(3)	0.811(5)	0.906(1)	0.893(2)
blocks0	0.854(8)	0.946(4)	0.945(6)	0.927(7)	0.926(4)	0.946(4)	0.959(1)	0.948(2)
ecoli0vs1	0.980(3)	0.984(2)	0.978(4.5)	0.975(7)	0.978(4.5)	0.976(6)	0.986(1)	0.950(8)
ecoli01vs235	0.788(8)	0.858(3)	0.820(7)	0.886(2)	0.831(6)	0.844(5)	0.919(1)	0.846(4)
ecoli0146vs5	0.845(8)	0.879(5)	0.855(7)	0.880(4)	0.892(1)	0.860(6)	0.887(3)	0.890(2)
ecoli0234vs5	0.836(8)	0.884(5)	0.880(6)	0.901(2)	0.888(4)	0.859(7)	0.925(1)	0.897(3)
ecoli067vs35	0.807(7)	0.814(6)	0.842(3)	0.820(5)	0.823(4)	0.794(8)	0.896(2)	0.899(1)
german	0.597(8)	0.692(3)	0.685(4)	0.657(7)	0.670(5.5)	0.703(2)	0.733(1)	0.670(5.5)
glass0123vs456	0.875(7)	0.891(6)	0.928(1)	0.916(4)	0.920(2)	0.903(5)	0.866(8)	0.917(3)
glass1	0.621(8)	0.699(5)	0.704(3)	0.686(7)	0.702(4)	0.696(6)	0.721(2)	0.749(1)
glass4	0.718(8)	0.868(3)	0.862(5)	0.864(4)	0.852(6)	0.734(7)	0.918(1)	0.880(2)
glass6	0.788(8)	0.845(7)	0.875(3)	0.852(6)	0.871(4)	0.861(5)	0.880(2)	0.894(1)
iris0	1.000(4)	1.000(4)	0.990(8)	1.000(4)	1.000(4)	1.000(4)	1.000(4)	1.000(4)
pima	0.706(8)	0.738(4)	0.741(2)	0.734(5)	0.740(3)	0.728(6)	0.797(1)	0.709(7)
segment0	0.988(8)	0.992(3)	0.990(5.5)	0.991(4)	0.972(7)	0.994(2)	0.998(1)	0.990(5.5)
sonar	0.803(8)	0.813(6)	0.831(4)	0.843(3)	0.812(7)	0.816(5)	0.867(2)	0.885(1)
vehicle0	0.954(7)	0.966(6)	0.970(2.5)	0.968(4.5)	0.944(8)	0.968(4.5)	0.984(1)	0.970(2.5)
yeast0359vs78	0.426(8)	0.675(2)	0.670(3)	0.472(7)	0.651(4)	0.645(5)	0.702(1)	0.524(6)
yeast2vs4	0.788(8)	0.876(2)	0.826(7)	0.841(3)	0.827(6)	0.832(5)	0.909(1)	0.839(4)
Average rank	7.37(8)	4.32(3)	4.50(4)	4.87(6)	4.58(5)	5.16(7)	1.79(1)	3.39(2)

Table 6

Comprehensive performance of eight algorithms

Methods	EDKS	WK-SMOTE	Borderline	AdaSyn	SMOTE	Safe	ROS	SVM
Ranking	2.66	2.71	4.39	4.73	4.97	4.99	5.06	6.45
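The ranks in brackets in Tables 3–5 and the comprehensive ranking in Table 6 can be reproduced with a short sketch. It assumes that tied scores share the mean of the positions they occupy (which matches the fractional ranks such as 4.5 in the tables) and that the comprehensive ranking is the mean of the three average ranks; the function name `average_ranks` is hypothetical.

```python
def average_ranks(scores):
    """Rank scores in descending order (1 = best); tied scores share the mean rank."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the group while the next score equals the current one.
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        shared = (pos + 1 + end + 1) / 2  # mean of 1-based positions pos+1 .. end+1
        for i in order[pos:end + 1]:
            ranks[i] = shared
        pos = end + 1
    return ranks

methods = ["SVM", "SMOTE", "Borderline", "Safe", "AdaSyn", "ROS", "WK-SMOTE", "EDKS"]

# AUC row of ecoli0vs1 from Table 3.
ecoli0vs1_auc = [0.996, 0.996, 0.994, 0.996, 0.996, 0.995, 0.997, 0.998]
ranks = dict(zip(methods, average_ranks(ecoli0vs1_auc)))

# Average ranks from the last rows of Tables 3-5 (AUC, Recall, G-mean).
edks_comprehensive = round((3.18 + 1.42 + 3.39) / 3, 2)
wk_smote_comprehensive = round((3.84 + 2.50 + 1.79) / 3, 2)
```

For the ecoli0vs1 row, the four methods tied at 0.996 occupy positions 3 to 6 and therefore each receive rank 4.5, as printed in Table 3.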

It can be seen that the seven sampling techniques generally improve classification performance on imbalanced datasets compared with training the classifier on the original data. ROS, however, tends to cause classifier overfitting, which limits its performance improvement, as the experiments indicate. SMOTE and its variants, on the other hand, can avoid overfitting and improve performance on imbalanced problems by oversampling through interpolation. Nevertheless, these algorithms do not always outperform ROS. For example, ROS performs better on the german and glass6 datasets. The experiments indicate that no single algorithm is best for all imbalanced datasets.

Table 3 shows the AUC results obtained by the algorithms, which objectively reflect the overall predictive ability of the classifiers on imbalanced datasets. The EDKS algorithm achieves the highest AUC score on five datasets, with notable improvements on ecoli0146vs5, ecoli067vs35 and sonar, and its performance is generally good on the other datasets. However, the AUC scores on some datasets, such as pima and vehicle0, are not satisfactory. A possible reason is that the ED of these two datasets is very small, so the proposed technique cannot distinguish the minority class from the majority class well.

For imbalanced problems, the recall of the minority class, i.e., the proportion of minority instances classified correctly, is important because the minority class is usually of greater interest than the majority class. Table 4 shows that the EDKS algorithm significantly improves recall on 17 of the 19 datasets. The proposed algorithm also achieves the best G-mean value on some datasets (Table 5), but its average G-mean rank is not ideal. This is because the EDKS algorithm uses only the original data as seed instances and does not consider the synthetic instances, which increases the risk of overfitting.

Figure 4.

ED and IR on all datasets, where blue and red represent ED and IR, respectively.

Figure 4 compares the imbalance degree measured by the ED and IR metrics for all datasets. If the datasets are sorted by ED and by IR, the two orderings are broadly the same. However, some datasets rank higher by ED than by IR, such as ecoli067vs35, ecoli0vs1 and glass1. Interestingly, EDKS generally performs better than the other algorithms on these datasets. This suggests that the EDKS algorithm is more effective when the imbalance in data distribution is more pronounced than the imbalance in class size.
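The comparison of the two orderings can be retraced from the IR and ED columns of Table 2. The sketch below is an illustration under one assumption about tie handling: each dataset's rank is one plus the number of datasets with a strictly larger value, so tied values share the best position. The names `rank_desc` and `ed_higher` are hypothetical.

```python
# (IR, ED) pairs copied from Table 2.
table2 = {
    "abalone17vs78910": (39.3, 0.09), "blocks0": (8.8, 0.008),
    "ecoli0vs1": (1.9, 0.030), "ecoli01vs235": (9.2, 0.148),
    "ecoli0146vs5": (13, 0.178), "ecoli0234vs5": (9.1, 0.168),
    "ecoli067vs35": (9.1, 0.164), "german": (2.3, 0.013),
    "glass0123vs456": (3.2, 0.062), "glass1": (1.8, 0.027),
    "glass4": (15.5, 0.230), "glass6": (6.4, 0.111),
    "iris0": (2, 0.044), "pima": (1.9, 0.012),
    "segment0": (6, 0.02), "sonar": (1.2, 0.007),
    "vehicle0": (3.3, 0.023), "yeast0359vs78": (9.1, 0.092),
    "yeast2vs4": (5.6, 0.091),
}

def rank_desc(values):
    """Rank 1 = largest value; rank = 1 + number of strictly larger values."""
    return {
        name: 1 + sum(other > v for other in values.values())
        for name, v in values.items()
    }

ir_rank = rank_desc({name: ir for name, (ir, ed) in table2.items()})
ed_rank = rank_desc({name: ed for name, (ir, ed) in table2.items()})

# Datasets that are ranked as more imbalanced by ED than by IR.
ed_higher = sorted(name for name in table2 if ed_rank[name] < ir_rank[name])
```

With the Table 2 values, ecoli067vs35, ecoli0vs1 and glass1 all appear in `ed_higher`, whereas abalone17vs78910, which has by far the largest IR, does not.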

As for the robustness of the EDKS algorithm, noise in the majority class has no significant impact on the algorithm. Noise in the minority class, however, may affect the subsequent model learning, because the noisy instances may be involved in the oversampling process. This problem can be handled as follows: after the dataset is mapped to the feature space, an outlier detection method, such as the KF algorithm [48], is used to detect noise among the minority instances, and the detected noise is excluded from the oversampling process. Missing values can be filled in the input space using imputation techniques such as those in [49, 50].

5. Conclusion

In this paper, EDKS, an entropy difference and kernel-based oversampling algorithm, is proposed to balance the class distribution for an SVM classifier. It generalizes the popular SMOTE algorithm to nonlinearly separable data by finding sparse regions and generating minority instances in the feature space of the classifier instead of the input data space. Compared with several baseline methods on benchmark imbalanced datasets from the UCI machine learning repository and the KEEL data repository, the proposed algorithm can improve AUC, Recall and G-mean scores.

In future research, we will study in depth the influence of different intra-class distributions on entropy, and further improve the diversity of synthetic data by developing techniques to expand the range of seeds and neighbors in kernel space. For the case of ${E_{\min}}<{E_{maj}}$ , a deeper theoretical study is needed to derive how entropy changes with differences in information quantity and information symmetry.

Acknowledgments

The authors would like to thank the anonymous reviewers for their valuable comments and suggestions. This research was supported by National Natural Science Foundation of China (61573266).

References

1. Sun, Lang, Fujita and Li, Imbalanced enterprise credit evaluation with DTESBD: Decision tree ensemble based on SMOTE and bagging with differentiated sampling rates, Inf. Sci. 425 (2018), 76–91.

2. Zhou, Performance of corporate bankruptcy prediction models on imbalanced dataset: The effect of sampling methods, Knowl.-Based Syst. 41 (2013), 16–25.

3. Tsang, Koh, Y.S., Dobbie and Alam, Detecting online auction shilling frauds using supervised learning, Expert Syst. Appl. 41(6) (2014), 3027–3040.

4. Almendra, Finding the needle: A risk-based ranking of product listings at online auction sites for non-delivery fraud prediction, Expert Syst. Appl. 40(12) (2013), 4805–4811.

5. Guo, H.X., Y.J., Jennifer, M.Y., Huang, Y.Y. and Gong, Learning from class imbalanced data: Review of methods and applications, Expert Syst. Appl. 73 (2016), 220–239.

6. Liu and Yu, A comparative study on rough set based class imbalance learning, Knowl.-Based Syst. 21(8) (2008), 753–763.

7. and Garcia, E.A., Learning from imbalanced data, IEEE Trans. Knowl. Data Eng. 21(9) (Sep. 2009), 1263–1284.

8. Chawla, N.V., Japkowicz and Drive, Editorial: Special issue on learning from imbalanced data sets, ACM SIGKDD Explor. Newslett. 6(1) (2004), 1–6.

9. Kotsiantis, Kanellopoulos and Pintelas, Handling imbalanced datasets: A review, Science 30(1) (2006), 25–36.

10. Galar, Fernandez, Barrenechea, Bustince and Herrera, A review on ensembles for the class imbalance problem: Bagging-, boosting-, and hybrid-based approaches, IEEE Trans. Syst. Man Cybern. Part C 42(4) (2012), 463–484.

11. Zhou, L.G. and Fujita, Posterior probability based ensemble strategy using optimizing decision directed acyclic graph for multi-class classification, Inf. Sci. 400-401 (2017), 142–156.

12. Yang, Xie, Z.Q. and Zhang, J.P., A novel virtual sample generation method based on Gaussian distribution, Knowl.-Based Syst. 24(6) (2011), 740–748.

13. Lin, W.C., Tsai, C.F., Y.H. and Jhang, J.S., Clustering-based undersampling in class imbalanced data, Inf. Sci. 409-410 (2017), 17–26.

14. Sáez, J.A., Luengo, Stefanowski and Herrera, SMOTE-IPF: Addressing the noisy and borderline examples problem in imbalanced classification by a re-sampling method with filtering, Inf. Sci. 291 (2015), 184–203.

15. Bai, Garcia, E.A. and Li, Adasyn: adaptive synthetic sampling approach for imbalanced learning, Neural Networks, 2008, IJCNN 2008, (IEEE World Congress on Computational Intelligence), in: IEEE International Joint Conference on Neural Networks, 2008, pp. 1322–1328.

16. Mathew, Pang, C.K., Luo and Leong, W.H., Classification of imbalanced data by oversampling in kernel space of support vector machines, IEEE Transactions on Neural Networks and Learning Systems 29(9), 4065–4076.

17. Yen, S.J. and Lee, Y.S., Cluster-based undersampling approaches for imbalanced data distributions, Expert Syst. Appl. 36 (2009), 5718–5727.

18. Ofek, Rokach, Stern and Shabtai, Fast-CBUS: A fast clustering-based undersampling method for addressing the class imbalance problem, Neurocomputing 243 (2017), 88–102.

19. Hartigan, J.A. and Wong, M.A., A k-means clustering algorithm, Appl. Stat. 28 (1979), 100–108.

20. Chawla, N.V., Bowyer, K.W., Hall, L.O. and Kegelmeyer, W.P., SMOTE: Synthetic minority over-sampling technique, J. Artif. Intell. Res. 16 (2002), 321–357.

21. Han, Wang, W.Y. and Mao, B.H., Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning, in: Proceedings of the 1st International Conference on Intelligent Computing, Hefei, China, 2005, 878–887.

22. Bunkhumpornpat, Sinapiromsaran and Lursinsap, Safe-level-SMOTE: safe-level-synthetic minority oversampling technique for handling the class imbalanced problem, in: Proceedings of the 13th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining, 2009, pp. 475–482.

23. Bunkhumpornpat, DBSMOTE: Density-based synthetic minority over-sampling technique, Appl. Intell. 36 (2012), 664–684.

24. Barua, Islam, M.M., Yao and Murase, MWMOTE-majority weighted minority oversampling technique for imbalanced data set learning, IEEE Trans. Knowl. Data Eng. 26(2) (2014), 405–425.

25. Fernández, Garcia, Herrera and Chawla, N.V., SMOTE for learning from imbalanced data: Progress and challenges, marking the 15-year anniversary, J. Artif. Intell. Res. 61 (2018), 863–905.

26. Wang and Japkowicz, Imbalanced data set learning with synthetic samples, in: Proc. IRIS Machine Learning Workshop, 2004, p. 19.

27. Zhu, T.F., Lin, Y.P. and Liu, Y.H., Synthetic minority oversampling technique for multiclass imbalance problems, Pattern Recognit. 72 (2017), 327–340.

28. Menardi and Torelli, Training and assessing classification rules with imbalanced data, Data Min. Knowl. Discovery 28(1) (2014), 92–122.

29. Liu, Yang and Li, Fuzzy rule-based oversampling technique for imbalanced and incomplete data learning, Knowledge-Based Systems 158 (2018), 154–174.

30. Zhang, H.X. and Li, M.F., RWO-sampling: A random walk over-sampling approach to imbalanced data classification, Inf. Fusion 20 (2014), 99–116.

31. Das, Krishnan, N.C. and Cook, D.J., Racog and wracog: Two probabilistic oversampling techniques, IEEE Trans. Knowl. Data Eng. 27(1) (2015), 222–234.

32. Abdi and Hashemi, To combat multi-class imbalanced problems by means of over-sampling techniques, IEEE Trans. Knowl. Data Eng. 28(1) (2016), 238–251.

33. Japkowicz and Stephen, The class imbalance problem: A systematic study, IOS Press, 2002.

34. Tang and He, GIR-based ensemble sampling approaches for imbalanced learning, Pattern Recognition, 2017.

35. Kullback and York, Information theory and entropy, Model Based Inference in the Life Sciences a Primer on Evidence, 2008, 51–82.

36. Cristianini, Kandola, Elisseeff and Shawe-Taylor, On kernel target alignment, in: Proc. Neural Inf. Process. Syst. (NIPS), Vancouver, BC, Canada, Dec. 2001, pp. 367–373.

37. Rakotomamonjy, Bach, Canu and Grandvalet, SimpleMKL, J. Mach. Learn. Res. 9 (Nov. 2008), 2491–2521.

38. Karatzoglou, Smola, Hornik and Zeileis, kernlab-An S4 package for kernel methods in R, J. Stat. Softw. 11(9) (Nov. 2004), 1–20.

39. Zeng, Z.Q. and Gao, Improving SVM classification with imbalance data set, in: Proc. 16th Int. Conf. Neural Inf. Process., Bangkok, Thailand, Dec. 2009, pp. 389–398.

40. Perez-Ortiz, Gutierrez, P.A. and Hervas-Martinez, Borderline Kernel Based Over-Sampling, in: Proc. Int. Conf. Hybrid Artif. Intell. Syst., Salamanca, Spain, Sep. 2013, pp. 472–481.

41. Alshomrani, Bawakid, Shim, S.O., Fernández and Herrera, A proposal for evolutionary fuzzy systems using feature weighting: Dealing with overlapping in imbalanced datasets, Knowl.-Based Syst. 73 (2015), 1–17.

42. Bradley, A.P., The use of the area under the ROC curve in the evaluation of machine learning algorithms, Pattern Recognit. 30(7) (1997), 1145–1159.

43. Rijsbergen, C.J.V., Information Retrieval, Butterworths, London, UK, 1979.

44. Loyola-González, Medina-Pérez, M.A., Martínez-Trinidad, J.F., Carrasco-Ochoa, J.A., Monroy and García-Borroto, PBC4cip: A new contrast pattern-based classifier for class imbalance problems, Knowl.-Based Syst. 115 (2017), 100–109.

45. Bache and Lichman, UCI Machine Learning Repository, UCI machine learning repository, 2013.

46. Alcalá-Fdez, Fernández, Luengo, Derrac and García, Keel data-mining software tool: Data set repository, integration of algorithms and experimental analysis framework, Journal of Multiple-Valued Logic Soft Computing 17 (2011).

47. Burges, C.J.C., A tutorial on support vector machines for pattern recognition, Data Mining Knowl. Discovery 2(2) (1998), 121–167.

48. Kang, Chen, X.S., S.S., et al., A noise-filtered under-sampling scheme for imbalanced classification, IEEE Transactions on Cybernetics 47(12) (2016), 4263–4274.

49. Pan, Yang, Cao and Zhang, Missing data imputation by K nearest neighbours based on grey relational structure and mutual information, Appl Intell 43 (2015), 614–632.

50. Tran, C.T., Zhang, Peter, Xue and Bui, L.T., Effective and efficient approach to classification with incomplete data, Knowl.-Based Syst. 10 (2018), 1016.
