Logistic discrimination based on G-mean and F-measure for imbalanced problem

Abstract

As a well-known statistical method, logistic discrimination has been successfully used in many practical applications, including medical diagnosis and personal credit assessment. In this paper, we apply this model to the imbalanced problem, also referred to as the skewed or rare class problem, which is characterized by having many more instances of one class (the negative or majority class) than of the other (the positive or minority class). Traditional logistic discrimination pursues high accuracy by assuming that all classes have similar size, so instances of the positive class are often overlooked and misclassified as negative. To fully consider class imbalance, we revisit the two basic measures for the imbalanced problem, g-mean and f-measure, and design two new cost functions, the g-mean based metric (GM) and the f-measure based metric (FM), to supervise logistic discrimination in learning its parameters. GM estimates g-mean, the geometric mean of the recalls of the positive and negative classes, and FM estimates f-measure, the harmonic mean of the recall and precision of the positive class. Experiments on UCI data sets show that the proposed method improves logistic discrimination on recall, f-measure and g-mean while keeping high accuracy, and that it compares favorably with logistic discrimination combined with under-sampling or over-sampling.

Keywords

Imbalanced problem; g-mean; f-measure; logistic discrimination

1 Introduction

Recently, the imbalanced problem, also named the class imbalance, skewed or rare class problem, has drawn a significant amount of interest in academia, industry and government. For the two-class case, this problem is characterized as having many more instances of one class (the negative or majority class) than of the other (the positive or minority class) [1, 2]. In many real-world applications, the correct prediction of instances in the positive class is more meaningful than the contrary case. Take the “mammography data set” as an example. This data set contains 10923 “healthy” patients and 260 “cancerous” patients and, in medical detection, correctly recognizing the “cancerous” patients is overwhelmingly meaningful. However, most standard classification methods, such as C4.5, naive Bayes and neural networks, pursue high accuracy by assuming a balanced class distribution, namely that all classes contain a similar number of instances, so positive class instances are often overlooked and misclassified as negative. Accuracy is therefore not a suitable evaluation metric under class imbalance; recall, f-measure and g-mean are more appropriate evaluation metrics for the class-imbalance problem [3, 4].

Many approaches have been proposed to tackle this problem, which can be roughly categorized into two groups: data level and algorithm level approaches. For the former, algorithms run on a re-balanced class distribution obtained by manipulating the data space. Examples include randomly or informatively under-sampling instances of the negative class [5], randomly over-sampling instances of the positive class, over-sampling based on cluster algorithms [6, 7], the random walk over-sampling approach (RWO-Sampling) [8], and over-sampling the positive class by creating new synthetic instances [9–11]. Sampling the data space is often used to deal with the imbalanced learning problem; however, the real class distribution is always unknown and differs from data set to data set. At the algorithm level, approaches adapt existing classifier learning algorithms (the corresponding cost functions) so that the learned models are biased towards correctly classifying instances in the positive class. Examples include two-phase rule induction [12], cost-sensitive learning [13], one-class learning, and imbalanced learning that combines inverse random under-sampling and random tree [14]. Besides, the moving average [15, 16] can also be applied to the class imbalance problem.

In this paper, we reconsider the imbalanced problem and, based on logistic discrimination, propose a novel method to tackle it. Traditional logistic discrimination learns its parameters by maximizing the log-likelihood function. In contrast, the proposed method achieves high performance on imbalanced data by maximizing two cost functions proposed in this paper. These two cost functions are the g-mean based metric (GM) and the f-measure based metric (FM), where GM is an estimate of the geometric mean of the recalls of the positive and negative classes and FM is an estimate of the harmonic mean of the recall and precision of the positive class. The idea was inspired by the following observations: (1) g-mean is the exact geometric mean of the recalls of the positive and negative classes, and (2) f-measure is the exact harmonic mean of the recall and precision of the positive class. In this way, the proposed method pays more attention to the positive class and achieves better performance on the imbalanced problem. The main contributions of this paper are as follows:

applying logistic discrimination to imbalanced problem and, therefore increasing its application field;

to fully consider class imbalance, two cost functions are designed, i.e., g-mean based metric (GM) and f-measure based metric (FM), to supervise the corresponding parameters learning process;

based on the two designed cost functions, proposing a novel method to boost the performance of logistic discrimination on the imbalanced problem.

The experimental results on 15 UCI data sets show that the proposed method can greatly boost the performance of logistic discrimination on the measures of recall, f-measure and g-mean while keeping its high performance on the measure of accuracy. Compared with other state-of-the-art classification methods, this method shows a great advantage. The experimental results also indicate that reasonably designed loss functions can effectively boost the generalization performance of classification on the imbalanced problem.

The rest of this paper is organized as follows. After presenting the related work in Section 2, Section 3 introduces the two proposed cost functions, i.e., the g-mean based metric (GM) and the f-measure based metric (FM); based on GM and FM, Section 4 describes the proposed imbalanced learning method; Section 5 describes the evaluation metrics used for the imbalanced class recognition problem; Section 6 presents the experimental results and, finally, Section 7 concludes this work.

2 Related work

2.1 Imbalanced learning

Technically speaking, any data set that exhibits an unequal distribution between its classes can be considered imbalanced (skewed). There are two forms of imbalance, i.e., within-class imbalance and between-class imbalance. In the former, some sub-concepts are represented by only a limited number of instances, which increases the difficulty of correctly classifying instances. With between-class imbalance, one class extremely out-represents another. Usually, the second form of unequal distribution is the one discussed in the community [1].

There are many factors that influence the performance of a capable classifier on data sets with class imbalance, where the skewed data distribution, small sample size, separability and the existence of within-class sub-concepts rank the top 4 [17].

The skewed data distribution is often denoted by imbalance degree which is the ratio of the sample size of the positive class to that of the negative class. Reported studies indicate that a relatively balanced distribution usually attains a better result. However, it cannot be stated explicitly to what imbalance degree the class distribution deteriorates the classification performance, since other factors such as sample size and separability also affect performance.

Small sample size means that the number of available instances is limited, so uncovering the regularities inherent in the small class is unreliable. In [18], the authors suggest that an imbalanced class distribution may not be a hindrance to classification if a large enough data set is provided.

The difficulty of separating the rare class from the prevalent class is the key issue of the imbalanced problem. If highly discriminative patterns exist for each class, then only simple rules are required to distinguish class objects. However, if the patterns of the classes overlap, discriminative rules are hard to induce.

Within-class concepts mean that a single class is composed of various sub-clusters or sub-concepts. Instances of a class are collected from different sub-concepts. These sub-concepts do not always contain the same number of instances. The presence of within-class concepts worsens the imbalance distribution problem. In general, we only consider imbalanced data distribution in imbalanced learning and fix other factors.

2.2 Logistic discrimination

Logistic discrimination is a typical probabilistic classification model. For the two-class case, it is defined as $p (y = + | x) = σ (w^{T} x)$ (1) $p (y = - | x) = 1 - p (y = + | x)$ (2) where σ (a) is the logistic sigmoid function defined as $σ (a) = \frac{1}{1 + exp (- a)}$ (3)

In the terminology of statistics, this model is known as logistic regression.

For a given data set D = { x_i, y_i}, where y_i ∈ {+ , -} is the label of instance x_i, with i = 1, 2, . . . , N, the likelihood function of this model can be written as $p (y | w) = \prod_{i = 1}^{N} p (y_{i} = + | x_{i})^{δ_{i}} (1 - p (y_{i} = + | x_{i}))^{1 - δ_{i}}$ (4) where δ_i = 1 if y_i = + and 0 otherwise. Defining an error function by taking the negative logarithm of the likelihood, we have the cross-entropy error function in the form

$\begin{matrix} L (w) & = & - \ln p (y | w) \\ & = & - \sum_{i = 1}^{N} \{ δ_{i} \ln p (y_{i} = + | x_{i}) + (1 - δ_{i}) \ln (1 - p (y_{i} = + | x_{i})) \} \end{matrix}$ (5)
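As an illustration of Equations (1) to (5), the following minimal Python sketch evaluates the logistic model and its cross-entropy error on a toy data set. The function names and the data are assumptions made for this example only; labels are encoded as the strings '+' and '-'.

```python
import math

def sigmoid(a):
    """Logistic sigmoid of Equation (3)."""
    return 1.0 / (1.0 + math.exp(-a))

def prob_positive(w, x):
    """p(y = + | x) of Equation (1) for weight vector w and instance x."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))

def cross_entropy(w, X, y):
    """Negative log-likelihood (cross-entropy error) of the logistic model."""
    loss = 0.0
    for x, label in zip(X, y):
        p = prob_positive(w, x)
        # A '+' instance contributes -ln p, a '-' instance contributes -ln(1 - p).
        loss -= math.log(p) if label == '+' else math.log(1.0 - p)
    return loss

# Hypothetical toy data: the first feature is a constant bias term.
X = [[1.0, 2.0], [1.0, -1.0], [1.0, -2.0]]
y = ['+', '-', '-']
print(cross_entropy([0.0, 0.0], X, y))  # equals 3 * ln 2, since every p is 0.5 at w = 0
```

Every instance contributes equally to this error regardless of its class, which is why a small positive class has little influence on the learned parameters.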

The logistic discrimination uses Equation (5) as the cost function; however, it is not suitable for the class imbalanced problem because the cross-entropy error function defined in Equation (5) does not consider the importance of each class. To handle this problem, we re-learn the two widely used measures, g-mean and f-measure, which are employed to evaluate the performance of models on imbalanced class recognition problem, and design two new cost functions called g-mean based metric (GM) and f-measure based metric (FM) to supervise logistic discrimination learning the parameters. More details are shown in Section 3.

2.3 Sampling technique

Sampling techniques such as under-sampling and over-sampling, which have been demonstrated to be effective for imbalanced learning, are an important way to deal with the classification of imbalanced data sets.

Under-sampling tries to balance an imbalanced data set by randomly sampling a subset of the majority instances. Reported research has demonstrated that random under-sampling is more effective than random over-sampling for imbalanced learning. Several under-sampling methods have been proposed to further improve the performance of imbalanced learning, including Tomek’s links [19], the Condensed Nearest Neighbor Rule [20], one-sided selection [21] and the Neighborhood Cleaning rule [22].

Unlike under-sampling, over-sampling aims to balance an imbalanced data set by adding a set sampled from the minority class: for a set of randomly selected minority examples, the original set S is augmented by replicating the selected examples and adding them to S. Many over-sampling methods have been proposed, such as the Synthetic Minority Over-sampling Technique (SMOTE) and Adaptive Synthetic Sampling [23].

There are other studies on sampling techniques for imbalanced learning. For example, Chawla et al. [24] integrate SMOTE into a standard boosting procedure to improve the prediction on the positive class. Gustavo et al. [25] combine over-sampling and under-sampling methods to resolve the imbalanced problem. Estabrooks et al. [26] propose a multiple resampling method which selects the most appropriate resampling rate adaptively. T. Jo et al. [6] put forward a cluster-based over-sampling method which deals with between-class imbalance and within-class imbalance simultaneously.

In this paper, we also apply sampling techniques to logistic discrimination to enhance its performance on imbalanced data sets. For clarity, only two widely used sampling techniques are selected: random under-sampling and random over-sampling. The corresponding experimental results are presented in Section 6.
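The two sampling techniques can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments: the function names are hypothetical, and re-balancing to a 1:1 class ratio is an assumption, since the target ratio is not stated here. Both functions assume that the positive class is the smaller one.

```python
import random

def random_under_sample(X, y, rng):
    """Keep all '+' instances and an equally sized random subset of '-' instances."""
    pos = [i for i, label in enumerate(y) if label == '+']
    neg = [i for i, label in enumerate(y) if label == '-']
    keep = pos + rng.sample(neg, len(pos))          # sampling without replacement
    return [X[i] for i in keep], [y[i] for i in keep]

def random_over_sample(X, y, rng):
    """Replicate randomly chosen '+' instances until both classes have equal size."""
    pos = [i for i, label in enumerate(y) if label == '+']
    neg = [i for i, label in enumerate(y) if label == '-']
    extra = [rng.choice(pos) for _ in range(len(neg) - len(pos))]
    keep = pos + neg + extra
    return [X[i] for i in keep], [y[i] for i in keep]

rng = random.Random(0)                              # fixed seed only for reproducibility
X = [[float(i)] for i in range(10)]                 # hypothetical toy instances
y = ['+'] * 2 + ['-'] * 8
Xu, yu = random_under_sample(X, y, rng)
Xo, yo = random_over_sample(X, y, rng)
print(yu.count('+'), yu.count('-'))                 # 2 2
print(yo.count('+'), yo.count('-'))                 # 8 8
```

Under-sampling discards six of the eight negative instances in this toy case, which illustrates how much majority class information can be lost.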

3 G-mean and F-measure based metrics

As mentioned above, traditional logistic discrimination tries to maximize the model’s classification accuracy by using maximum likelihood as the cost function to estimate the model parameters. However, this approach ignores the different value of the data in each class, so the learned model has low performance on the positive class.

To tackle this problem, we redefine the cost function of logistic regression to guarantee that the learned model considers the performance on both positive class and negative class. The relevant symbols are defined as follows.

Let C_j = { x _i|y_i = j} be the set of instances with class equal to j ∈ {+ , -}, and N_j be the size of C_j, namely N₊ = |C₊| and N_- = |C_-|. Similarly, let n_j be the number of instances correctly classified as class j by learned models, then n₊/N₊ and n_-/N_- are the recalls(or accuracies) of learned models on positive and negative class respectively. Furthermore, we define p_ij, P_j and ${\bar{P}}_{j}$ as $p_{ij} = p (y_{i} = j | x_{i})$ (6) $P_{j} = \sum_{x_{i} \in C_{j}} p_{ij}$ (7) ${\bar{P}}_{j} = N_{j} - P_{j}$ (8)

From Equations (7) and (8), P_j is the estimated number of instances correctly classified as class j (corresponding to n_j) and ${\bar{P}}_{j}$ is the estimated number of instances of class j that are incorrectly classified. For the two-class problem, we have ${\bar{P}}_{+} = \sum_{x_{i} \in C_{+}} p_{i -}$ (9) ${\bar{P}}_{-} = \sum_{x_{i} \in C_{-}} p_{i +}$ (10)

Based on Equations (7) and (8), two cost functions are constructed to supervise logistic discrimination in learning its parameters, namely the g-mean based metric (GM) and the f-measure based metric (FM). More details about the metrics are discussed in the next two subsections.
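The quantities of Equations (6) to (10) can be computed as in the following minimal sketch. The function names and the toy data are assumptions made for illustration.

```python
import math

def p_plus(w, x):
    """p(y = + | x) under the logistic model of Equation (1)."""
    return 1.0 / (1.0 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))

def soft_counts(w, X, y):
    """Return (P, P_bar) as dicts keyed by class, following Equations (7) and (8)."""
    P = {'+': 0.0, '-': 0.0}
    N = {'+': 0, '-': 0}
    for x, label in zip(X, y):
        p = p_plus(w, x)
        P[label] += p if label == '+' else 1.0 - p   # p_ij for the true class j
        N[label] += 1
    P_bar = {j: N[j] - P[j] for j in P}              # Equation (8)
    return P, P_bar

# Hypothetical toy data: one positive and three negative instances.
X = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
y = ['+', '-', '-', '-']
P, P_bar = soft_counts([0.0, 0.0], X, y)
print(P, P_bar)  # at w = 0 every probability is 0.5, so P_j = P_bar_j = N_j / 2
```

Unlike the hard counts n_j, these soft counts vary smoothly with w, which is what makes gradient-based optimization possible.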

3.1 G-mean based Metric

GM (G-mean based Metric) is the cost function constructed based on g-mean, a frequently used metric to evaluate model performance. Specifically, g-mean is defined as $g - mean = \sqrt{\frac{n_{+}}{N_{+}} \frac{n_{-}}{N_{-}}}$ (11) where N_j and n_j are the number of instances with class j and the number of instances correctly classified as class j respectively, as defined before. Equation (11) shows that g-mean favors learned models with high performance on both the positive and the negative class. Analogously to Equation (11), a cost function can be defined as $G (w) = \sqrt{\frac{P_{+}}{N_{+}} \frac{P_{-}}{N_{-}}}$ (12) where P_j is the estimated number of instances correctly classified as class j, defined by Equation (7). Comparing Equations (11) and (12), Equation (12) is an estimate of g-mean, since P_j is an estimate of n_j. For a given training set, N_j in Equation (12) is a constant, and maximizing Equation (12) is therefore equivalent to maximizing the following equation

$J (w) = P_{+} P_{-}$ (13)

Because the logarithm is monotonically increasing, we can define a cost function by taking the logarithm of J ( w ), which gives

$GM = ln J (w) = ln P_{+} + ln P_{-}$ (14)

This paper uses Equation (14) as a cost function to supervise logistic discrimination learning its parameters. In this way, logistic discrimination considers the performance of both positive and negative class.

Theorem 1. The gradient of GM defined by Equation (14) is $\nabla GM = \sum_{j \in \{+, -\}} (- 1)^{γ_{j}} \frac{\sum_{x_{i} \in C_{j}} p_{i +} p_{i -} x_{i}}{P_{j}}$ (15) where γ_j = 0 if j = + and 1 otherwise.

Proof. Taking the gradient of GM (Equation 14) with respect to w , we obtain $\nabla GM = \frac{1}{P_{+}} \frac{\partial P_{+}}{\partial w} + \frac{1}{P_{-}} \frac{\partial P_{-}}{\partial w}$ (16) where

$\begin{matrix} \frac{\partial P_{-}}{\partial w} & = & \sum_{x_{i} \in C_{-}} \frac{\partial}{\partial w} p (y_{i} = - | x_{i}) \\ & = & - \sum_{x_{i} \in C_{-}} \frac{exp (w^{T} x_{i})}{(1 + exp (w^{T} x_{i}))^{2}} x_{i} \\ & = & - \sum_{x_{i} \in C_{-}} p_{i +} p_{i -} x_{i} \end{matrix}$ (17) and similarly, $\frac{\partial P_{+}}{\partial w} = \sum_{x_{i} \in C_{+}} \frac{\partial p (y_{i} = + | x_{i})}{\partial w} = \sum_{x_{i} \in C_{+}} p_{i +} p_{i -} x_{i}$ (18)

Substituting Equations (17) and (18) into Equation (16) ends up with

$\begin{matrix} \nabla GM & = & \frac{\sum_{x_{i} \in C_{+}} p_{i +} p_{i -} x_{i}}{P_{+}} - \frac{\sum_{x_{i} \in C_{-}} p_{i +} p_{i -} x_{i}}{P_{-}} \\ & = & \sum_{j \in \{+, -\}} (- 1)^{γ_{j}} \frac{\sum_{x_{i} \in C_{j}} p_{i +} p_{i -} x_{i}}{P_{j}} \end{matrix}$ (19) where γ_j = 0 if j = + and 1 otherwise. □
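Theorem 1 can be checked numerically. The following sketch, with hypothetical function names and toy data, computes GM from Equation (14) and its gradient from Equation (15), and compares the latter against central finite differences.

```python
import math

def p_plus(w, x):
    """p(y = + | x) from Equation (1)."""
    return 1.0 / (1.0 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))

def gm(w, X, y):
    """GM = ln P_+ + ln P_- (Equation 14)."""
    P = {'+': 0.0, '-': 0.0}
    for x, label in zip(X, y):
        p = p_plus(w, x)
        P[label] += p if label == '+' else 1.0 - p
    return math.log(P['+']) + math.log(P['-'])

def gm_gradient(w, X, y):
    """Gradient of GM as stated in Equation (15)."""
    P = {'+': 0.0, '-': 0.0}
    S = {'+': [0.0] * len(w), '-': [0.0] * len(w)}  # per-class sums of p_i+ p_i- x_i
    for x, label in zip(X, y):
        p = p_plus(w, x)
        P[label] += p if label == '+' else 1.0 - p
        for d, xd in enumerate(x):
            S[label][d] += p * (1.0 - p) * xd
    return [S['+'][d] / P['+'] - S['-'][d] / P['-'] for d in range(len(w))]

def numeric_gradient(f, w, h=1e-6):
    """Central finite differences, used only as an independent check."""
    grad = []
    for d in range(len(w)):
        up, down = list(w), list(w)
        up[d] += h
        down[d] -= h
        grad.append((f(up) - f(down)) / (2 * h))
    return grad

# Hypothetical toy data; the first feature is a constant bias term.
X = [[1.0, 2.0], [1.0, 0.5], [1.0, -1.0], [1.0, -2.0], [1.0, 0.3]]
y = ['+', '+', '-', '-', '-']
w = [0.1, -0.4]
analytic = gm_gradient(w, X, y)
numeric = numeric_gradient(lambda v: gm(v, X, y), w)
print(max(abs(a - n) for a, n in zip(analytic, numeric)))  # should be close to zero
```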

3.2 F-measure based Metric

FM (F-measure based Metric) is the cost function constructed based on f-measure. Like g-mean, f-measure is a commonly used metric to evaluate the performance of imbalanced classification algorithms. For the two-class problem, assume class + is the minority class; then the f-measure is defined as $f - measure = \frac{(1 + δ^{2}) \times {recall}_{+} \times {precision}_{+}}{δ^{2} \times {recall}_{+} + {precision}_{+}}$ (20) where δ is often set to 1, which corresponds to the F₁ measure, namely, $F_{1} = \frac{2 \times {recall}_{+} \times {precision}_{+}}{{recall}_{+} + {precision}_{+}}$ (21) where recall₊ = n₊/N₊ and precision₊ = n₊/(n₊ + N_- - n_-) are the recall and precision of the positive class (class +) respectively (see Section 5 for more discussion of recall, precision and f-measure). Here N_- - n_- is the number of negative instances misclassified as positive. Therefore, $F_{1} = \frac{2 n_{+}}{n_{+} + N_{-} - n_{-} + N_{+}}$ (22)

According to Equations (6) and (7), F₁ can be estimated by

$\begin{matrix} FM & = & \frac{2 P_{+}}{P_{+} + N_{-} - P_{-} + N_{+}} \\ = & \frac{2 \sum_{x_{i} \in C_{+}} p_{i +}}{\sum_{x_{i} \in D} p_{i +} + N_{+}} \end{matrix}$ (23)

In this paper, we use Equation (23) as an objective function to supervise logistic discrimination in learning its parameters. In this way, logistic discrimination considers both the recall and the precision of the positive class.

Theorem 2. The gradient of FM defined by Equation (23) is

$\begin{matrix} \nabla FM & = & \frac{2 \sum_{x_{i} \in C_{+}} p_{i +} p_{i -} x_{i}}{\sum_{x_{i} \in D} p_{i +} + N_{+}} \\ - \frac{2 \sum_{x_{i} \in C_{+}} p_{i +} \sum_{x_{i} \in D} p_{i +} p_{i -} x_{i}}{{(\sum_{x_{i} \in D} p_{i +} + N_{+})}^{2}} \end{matrix}$ (24)

Proof. Taking the gradient of FM (Equation 23) with respect to w , we obtain

$\begin{matrix} \nabla FM & = & \frac{2 (\sum_{x_{i} \in D} p_{i +} + N_{+})}{(\sum_{x_{i} \in D} p_{i +} + N_{+})^{2}} \sum_{x_{i} \in C_{+}} \frac{\partial p_{i +}}{\partial w} \\ - \frac{2 (\sum_{x_{i} \in C_{+}} p_{i +})}{(\sum_{x_{i} \in D} p_{i +} + N_{+})^{2}} \sum_{x_{i} \in D} \frac{\partial p_{i +}}{\partial w} \\ = & \frac{2}{\sum_{x_{i} \in D} p_{i +} + N_{+}} \sum_{x_{i} \in C_{+}} \frac{\partial p_{i +}}{\partial w} \\ - \frac{2 (\sum_{x_{i} \in C_{+}} p_{i +})}{(\sum_{x_{i} \in D} p_{i +} + N_{+})^{2}} \sum_{x_{i} \in D} \frac{\partial p_{i +}}{\partial w} \end{matrix}$ (25)

Substituting $\frac{\partial p_{i +}}{\partial w} = p_{i +} p_{i -} x_{i}$ (see Equation (18)) into Equation (25), we have

$\begin{matrix} \nabla FM & = & \frac{2 \sum_{x_{i} \in C_{+}} p_{i +} p_{i -} x_{i}}{\sum_{x_{i} \in D} p_{i +} + N_{+}} \\ - \frac{2 \sum_{x_{i} \in C_{+}} p_{i +} \sum_{x_{i} \in D} p_{i +} p_{i -} x_{i}}{(\sum_{x_{i} \in D} p_{i +} + N_{+})^{2}} \end{matrix}$ (26) □
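As with GM, Theorem 2 can be checked numerically. The sketch below uses hypothetical function names and toy data; it evaluates FM from Equation (23) and its gradient from Equation (24), and compares the result with central finite differences.

```python
import math

def p_plus(w, x):
    """p(y = + | x) from Equation (1)."""
    return 1.0 / (1.0 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))

def fm(w, X, y):
    """FM = 2 * sum_{C+} p_i+ / (sum_D p_i+ + N_+) (Equation 23)."""
    num = sum(p_plus(w, x) for x, label in zip(X, y) if label == '+')
    total = sum(p_plus(w, x) for x in X)
    n_pos = sum(1 for label in y if label == '+')
    return 2.0 * num / (total + n_pos)

def fm_gradient(w, X, y):
    """Gradient of FM as stated in Equation (24)."""
    n = len(w)
    A, B = 0.0, 0.0                 # A: sum over C+ of p_i+, B: sum over D of p_i+
    dA, dB = [0.0] * n, [0.0] * n   # matching sums of p_i+ p_i- x_i
    n_pos = 0
    for x, label in zip(X, y):
        p = p_plus(w, x)
        B += p
        for d, xd in enumerate(x):
            dB[d] += p * (1.0 - p) * xd
        if label == '+':
            n_pos += 1
            A += p
            for d, xd in enumerate(x):
                dA[d] += p * (1.0 - p) * xd
    denom = B + n_pos
    return [2.0 * dA[d] / denom - 2.0 * A * dB[d] / denom ** 2 for d in range(n)]

def numeric_gradient(f, w, h=1e-6):
    """Central finite differences, used only as an independent check."""
    grad = []
    for d in range(len(w)):
        up, down = list(w), list(w)
        up[d] += h
        down[d] -= h
        grad.append((f(up) - f(down)) / (2 * h))
    return grad

# Hypothetical toy data; the first feature is a constant bias term.
X = [[1.0, 2.0], [1.0, 0.5], [1.0, -1.0], [1.0, -2.0], [1.0, 0.3]]
y = ['+', '+', '-', '-', '-']
w = [0.1, -0.4]
analytic = fm_gradient(w, X, y)
numeric = numeric_gradient(lambda v: fm(v, X, y), w)
print(max(abs(a - n) for a, n in zip(analytic, numeric)))  # should be close to zero
```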

4 Logistic discrimination for imbalanced learning

Based on the cost functions proposed in Section 3, we design a novel approach for logistic discrimination to tackle the imbalanced problem. This approach is an iterative process. In the learning stage, the approach uses the quasi-Newton method BFGS to iteratively optimize the cost functions and learn the parameters. Formally, the iterative process is as follows: $w^{(k + 1)} = w^{(k)} - λ H^{(k)} \nabla f (w^{(k)})$ (27) where λ is the step length along the Newton direction of the k-th iteration and $H^{(k)}$ is the approximation of the inverse Hessian matrix, calculated by

$\begin{matrix} H^{(k + 1)} & = & H^{(k)} + (1 + \frac{r^{(k) T} H^{(k)} r^{(k)}}{q^{(k) T} r^{(k)}}) \frac{q^{(k)} q^{(k) T}}{q^{(k) T} r^{(k)}} \\ - \frac{q^{(k)} r^{(k) T} H^{(k)} + H^{(k)} r^{(k)} q^{(k) T}}{q^{(k) T} r^{(k)}} \end{matrix}$ (28) where $q^{(k)} = w^{(k + 1)} - w^{(k)},$ (29) $r^{(k)} = \nabla f (w^{(k + 1)}) - \nabla f (w^{(k)})$ (30)

Here, the cost function f ( w ) corresponds to GM (Equation 14) or FM (Equation 23).
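The update of Equation (28) can be sketched in plain Python as follows. The helper names are assumptions made for illustration, and H is assumed to be symmetric. A useful sanity check is the secant condition: the updated matrix maps r^(k) onto q^(k).

```python
def dot(u, v):
    """Inner product of two vectors given as lists."""
    return sum(a * b for a, b in zip(u, v))

def mat_vec(H, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [dot(row, v) for row in H]

def bfgs_update(H, q, r):
    """One application of Equation (28); H is assumed symmetric and q^T r nonzero."""
    n = len(q)
    qr = dot(q, r)                          # q^T r
    Hr = mat_vec(H, r)                      # H r, which equals (r^T H)^T for symmetric H
    scale = (1.0 + dot(r, Hr) / qr) / qr    # coefficient of q q^T
    return [[H[i][j] + scale * q[i] * q[j] - (q[i] * Hr[j] + Hr[i] * q[j]) / qr
             for j in range(n)] for i in range(n)]

# Hypothetical values for q = w^(k+1) - w^(k) and r = grad f(w^(k+1)) - grad f(w^(k)).
H0 = [[1.0, 0.0], [0.0, 1.0]]
q = [1.0, 2.0]
r = [0.5, 1.5]
H1 = bfgs_update(H0, q, r)
print(mat_vec(H1, r))  # approximately equal to q (secant condition)
```

The update also keeps the matrix symmetric, so the same function can be applied repeatedly during the iterations.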

The details of the learning process are shown in Algorithm 1. In the training stage, Algorithm 1 first initializes w randomly and sets the matrix H ⁽⁰⁾ to the identity matrix, whose diagonal elements are 1 and whose other elements are 0 (lines 1∼2), and calculates w ⁽¹⁾ using Equation (27) based on w ⁽⁰⁾ and H ⁽⁰⁾ (lines 3∼4). Then Algorithm 1 uses BFGS to iteratively optimize the objective function f ( w ) to find the best parameter vector w (lines 5∼12). Specifically, for the k-th iteration, the algorithm calculates the gradient of the cost function as g ^(k) using Equation (15) or (24) and, based on g ^(k) and g ^(k-1), updates q ^(k-1) and r ^(k-1) using Equations (29) and (30) (lines 7∼8). Then it updates H ^(k) using Equation (28) (line 9) and, finally, updates w ^(k+1) using Equation (27) (lines 10∼11). The stopping condition is that the absolute difference between the cost function values of two consecutive iterations is not larger than ɛ, which is set to 0.001 (line 6). In the prediction stage, for an unlabeled instance x , the algorithm calculates the probability of x belonging to each class using Equations (1) and (2), and then selects the class with the highest probability as the label of x , as traditional logistic regression does.

Algorithm 1. Logistic Discrimination for Imbalanced Learning

The training stage:

Input:

D: the training data set

ɛ: the stopping threshold, greater than zero

Output: the learned parameter vector w

Process:

1. randomly initialize w ⁽⁰⁾;

2. H ⁽⁰⁾ = I (identity matrix);

3. g ⁽⁰⁾ = ∇ f ( w ⁽⁰⁾); // compute the gradient of the cost function by Eq. (15) or Eq. (24)

4. w ⁽¹⁾ = w ⁽⁰⁾ - λ H ⁽⁰⁾ g ⁽⁰⁾; // update w using Eq. (27)

5. k = 1;

6. Repeat until |f ( w ^(k)) - f ( w ^(k-1))| ≤ ɛ:

7. g ^(k) = ∇ f ( w ^(k)); // update the gradient of the cost function using Eq. (15) or Eq. (24)

8. update q ^(k-1) and r ^(k-1) using Eq. (29) and Eq. (30);

9. update H ^(k) using Eq. (28);

10. d ^(k) = - H ^(k) g ^(k);

11. w ^(k+1) = w ^(k) + λ d ^(k); k = k + 1;

12. return w ^(k);

The testing stage:

Input:

w : the parameter vector of the learned logistic discrimination

x : the unlabeled instance

Output: the label of x

Process:

1. calculate p (y = + | x ) and p (y = - | x ) using Eq. (1) and Eq. (2) respectively;

2. $y = \arg\max_{j \in \{+, -\}} p (y = j | x)$;

3. return y
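The training and testing stages can be sketched in Python as below. This is a minimal illustration under several assumptions that are not taken from the text: f is taken as the negative of GM, because Equation (27) is a descent step while GM is to be maximized; the step length λ is chosen by simple backtracking; w starts at zero instead of a random vector so that the run is reproducible; and the update of H is skipped when q^T r is not positive. All function names are hypothetical.

```python
import math

def p_plus(w, x):
    """Numerically stable p(y = + | x) from Equation (1)."""
    a = sum(wi * xi for wi, xi in zip(w, x))
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)

def neg_gm_and_grad(w, X, y):
    """Return f(w) = -GM (Equation 14) and its gradient (negated Equation 15)."""
    P = {'+': 0.0, '-': 0.0}
    S = {'+': [0.0] * len(w), '-': [0.0] * len(w)}
    for x, label in zip(X, y):
        p = p_plus(w, x)
        P[label] += p if label == '+' else 1.0 - p
        for d, xd in enumerate(x):
            S[label][d] += p * (1.0 - p) * xd
    P = {j: max(v, 1e-300) for j, v in P.items()}   # guard against log(0)
    f = -(math.log(P['+']) + math.log(P['-']))
    g = [-(S['+'][d] / P['+'] - S['-'][d] / P['-']) for d in range(len(w))]
    return f, g

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def mat_vec(H, v):
    return [dot(row, v) for row in H]

def bfgs_update(H, q, r):
    """Equation (28) for a symmetric H."""
    qr, Hr, n = dot(q, r), mat_vec(H, r), len(q)
    scale = (1.0 + dot(r, Hr) / qr) / qr
    return [[H[i][j] + scale * q[i] * q[j] - (q[i] * Hr[j] + Hr[i] * q[j]) / qr
             for j in range(n)] for i in range(n)]

def train(X, y, eps=1e-3, max_iter=100):
    n = len(X[0])
    w = [0.0] * n                                   # assumption: zero start
    H = [[float(i == j) for j in range(n)] for i in range(n)]
    f, g = neg_gm_and_grad(w, X, y)
    for _ in range(max_iter):
        d = [-v for v in mat_vec(H, g)]             # search direction -H g
        lam = 1.0
        while True:                                 # backtracking on the step length
            w_new = [wi + lam * di for wi, di in zip(w, d)]
            f_new, g_new = neg_gm_and_grad(w_new, X, y)
            if f_new < f or lam < 1e-10:
                break
            lam *= 0.5
        if f_new >= f:                              # no improving step was found
            break
        q = [a - b for a, b in zip(w_new, w)]       # Equation (29)
        r = [a - b for a, b in zip(g_new, g)]       # Equation (30)
        if dot(q, r) > 1e-12:                       # keep H positive definite
            H = bfgs_update(H, q, r)
        converged = abs(f_new - f) <= eps           # stopping condition
        w, f, g = w_new, f_new, g_new
        if converged:
            break
    return w

def predict(w, x):
    """Testing stage: return the class with the larger probability."""
    return '+' if p_plus(w, x) >= 0.5 else '-'

# Hypothetical overlapping toy data; the first feature is a constant bias term.
X = [[1.0, v] for v in [2.0, 1.5, 0.2, -2.0, -1.5, -1.0, -0.5, 0.5, 0.0, -1.2, 1.0]]
y = ['+'] * 3 + ['-'] * 8
w = train(X, y)
print(w, predict(w, [1.0, 3.0]))
```

Replacing `neg_gm_and_grad` with the negative of FM and its gradient would give the FM variant under the same assumptions.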

5 Evaluation metrics

Evaluation metrics are essential to assess the effectiveness of an algorithm and, traditionally, accuracy and error rate are the most frequently used ones. Consider the two-class classification problem and let + and - be the positive and negative class respectively, as defined before. Then the instances classified by a learned model can be grouped into four categories as shown in Table 1, and thus accuracy and error rate are defined as: $accuracy = \frac{TP + TN}{TP + TN + FP + FN}$ (31) $errorRate = 1 - accuracy$ (32)

However, the evaluation metrics used for balanced classification are very different from those used for the imbalanced problem, and accuracy is inadequate for imbalanced learning. For example, assuming that the positive class makes up only 5% of a training data set, a naive approach that classifies every instance as negative would provide an accuracy of 95%. Although 95 percent accuracy across the entire data set appears good, this figure fails to reflect the fact that the naive method incorrectly classifies all the instances of the positive class. That is to say, the accuracy metric in this case does not provide adequate information on a classifier’s functionality with respect to the type of classification required. In lieu of accuracy, other assessment metrics including recall, precision, f-measure and g-mean are frequently adopted in the research community to evaluate the performance of models on imbalanced learning problems. These metrics are designed based on the recalls (or accuracies) of both the positive class and the negative class as well as the precision of the positive class, specifically ${precision}_{+} = \frac{TP}{TP + FP},$ (33) ${recall}_{+} = \frac{TP}{TP + FN},$ (34) ${recall}_{-} = \frac{TN}{TN + FP},$ (35)

Then, precision, recall and f-measure are respectively defined as $precision = {precision}_{+},$ (36) $recall = {recall}_{+},$ (37)

$\begin{matrix} f - measure \\ = \frac{(1 + δ^{2}) \times recall \times precision}{δ^{2} \times recall + precision} \\ = \frac{(1 + δ^{2}) \times {recall}_{+} \times {precision}_{+}}{δ^{2} \times {recall}_{+} + {precision}_{+}}, \end{matrix}$ (38) where δ, often set to be 1, is a coefficient to adjust the relative importance of precision versus recall.

From Equations (34) and (37), recall shows how many instances of the positive class are correctly classified. Similarly, from Equations (33) and (36), precision is a measure of exactness, i.e., of the instances labeled as positive, how many are actually labeled correctly [1]. Like accuracy and error rate, these two metrics share an inverse relationship. Recall cannot tell how many instances are incorrectly labeled as positive, and precision cannot tell how many positive instances are classified incorrectly. In practice, we expect recall and precision to be high at the same time; however, high recall does not imply high precision. A classifier with higher recall and lower precision, or with lower recall and higher precision, may not be a good classifier [4].

In order to reward a classifier that has both high recall and high precision, f-measure is proposed as shown in Equation (38). F-measure combines recall and precision into a single measure of the effectiveness of classification, in which the relative importance of recall and precision is determined by a coefficient set by the user. Thus, f-measure represents a harmonic mean between recall and precision.

Like f-measure, g-mean is another metric; it evaluates the degree of inductive bias in terms of the recalls (or accuracies) of the positive and negative classes. Specifically, g-mean measures the balanced performance of a classifier using the geometric mean of the recall of the positive class and that of the negative class. In this way, g-mean treats each class, rather than each instance, as equally important. Formally, g-mean is as follows

$g - mean = \sqrt{{recall}_{+} \times {recall}_{-}}$ (39)
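The metrics of Equations (31) to (39) can be computed from the four confusion counts as in the following sketch. The function name and the counts are hypothetical, and returning 0 when a ratio has a zero denominator is an assumption made for this example. The first call reproduces the naive classifier discussed above, which labels every instance of a data set with 5% positives as negative.

```python
import math

def imbalance_metrics(tp, fn, fp, tn, delta=1.0):
    """Accuracy, recall, precision, f-measure and g-mean from confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)             # Equation (31)
    precision_pos = tp / (tp + fp) if tp + fp else 0.0     # Equation (33); 0/0 -> 0 assumed
    recall_pos = tp / (tp + fn) if tp + fn else 0.0        # Equation (34)
    recall_neg = tn / (tn + fp) if tn + fp else 0.0        # Equation (35)
    denom = delta ** 2 * recall_pos + precision_pos        # as written in Equation (38)
    f_measure = (1 + delta ** 2) * recall_pos * precision_pos / denom if denom else 0.0
    g_mean = math.sqrt(recall_pos * recall_neg)            # Equation (39)
    return {'accuracy': accuracy, 'recall': recall_pos, 'precision': precision_pos,
            'f_measure': f_measure, 'g_mean': g_mean}

# Naive classifier: 5 positives and 95 negatives, everything predicted negative.
naive = imbalance_metrics(tp=0, fn=5, fp=0, tn=95)
# A hypothetical classifier that finds most positives at the cost of false alarms.
other = imbalance_metrics(tp=4, fn=1, fp=10, tn=85)
print(naive)
print(other)
```

The naive classifier reaches an accuracy of 0.95 while its recall, f-measure and g-mean are all 0; the second classifier has a lower accuracy of 0.89 but a recall of 0.8.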

In addition to the metrics mentioned above, several curves are commonly used for imbalanced learning, such as receiver operating characteristic (ROC) curves, precision-recall curves and cost curves. In this paper, we employ accuracy, recall, f-measure and g-mean to evaluate the classification performance on imbalanced data sets. Though accuracy is inadequate to evaluate the classification performance, poor accuracy still indicates a bad classifier. An effective classifier should improve recall, f-measure or g-mean without significantly decreasing accuracy.

6 Experiments

6.1 Data sets and experimental setup

Fifteen data sets are randomly selected from the UCI repository [27]. Imbalanced two-class data sets are derived from these 15 data sets according to the following strategies: (1) for a two-class data set, if the corresponding imbalance degree (the ratio of the sample size of the positive class to that of the negative class) is greater than 0.25, instances of the positive class are randomly removed so that the degree becomes smaller than 0.25; (2) for a multi-class data set, some labels are deleted until the number of labels is equal to 2, and then strategy (1) is applied to the data set. For each data set, 10-fold cross validation is performed: one fold is used to test the performance of the learned models and the others are used for training. We repeat the cross validation five times, and thus fifty trials are executed on each data set for a model. More details about the data sets are given in Table 2, where #Insts, #Attrs and #Degree are the size, number of attributes and imbalance degree of the corresponding data sets.
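Strategy (1) can be sketched as follows. The function name is hypothetical, and removing the smallest number of positive instances that brings the degree below the threshold is an assumption, since the number of removed instances is not specified above.

```python
import random

def reduce_positive_class(X, y, max_degree=0.25, rng=None):
    """Randomly remove '+' instances until #positive / #negative < max_degree."""
    rng = rng or random.Random()
    pos = [i for i, label in enumerate(y) if label == '+']
    neg = [i for i, label in enumerate(y) if label == '-']
    if len(pos) / len(neg) > max_degree:
        k = len(pos)
        while k > 0 and k / len(neg) >= max_degree:  # largest k strictly below the threshold
            k -= 1
        pos = rng.sample(pos, k)                     # keep k randomly chosen positives
    keep = sorted(pos + neg)                         # preserve the original instance order
    return [X[i] for i in keep], [y[i] for i in keep]

# Hypothetical data set with 6 positive and 8 negative instances (degree 0.75).
X = [[float(i)] for i in range(14)]
y = ['+'] * 6 + ['-'] * 8
X2, y2 = reduce_positive_class(X, y, rng=random.Random(0))
print(y2.count('+'), y2.count('-'), y2.count('+') / y2.count('-'))  # 1 8 0.125
```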

In order to facilitate the experiments, we design a platform named Loyang Spoon (LySpoon) based on Python, and we implement the algorithms used in this paper on this platform.

For simplicity, in this section, LD denotes traditional logistic discrimination, which uses the log-likelihood as its cost function. LD-GM and LD-FM are the methods proposed in this paper, which use GM defined by Equation (14) and FM defined by Equation (23) as cost functions respectively. Similarly, LD-US and LD-OS are logistic discriminations that run on the data sets obtained from the original training set by under-sampling and over-sampling respectively. Here, we compare both LD-GM and LD-FM with LD, LD-US and LD-OS. The results are discussed in the next subsection.

6.2 Experimental results

To evaluate the performance of the proposed method, we compare both LD-GM and LD-FM with LD, LD-US and LD-OS respectively (see Section 6.1 for details about these methods). The corresponding results are shown in Tables 3, 4, 5 and 6, which report the performance of the models on the measures of accuracy, recall, g-mean and f-measure respectively. In these tables, for the columns of LD, LD-US and LD-OS, the first ⊕/⊙/⊖ next to a result indicates that LD-GM statistically wins/ties/loses compared with the corresponding method (column) on the corresponding data set (row). Similarly, the second ⊕/⊙/⊖ indicates that LD-FM statistically wins/ties/loses compared with the corresponding method on the corresponding data set. For example, from the first row (corresponding to data set auto-mpg) and the third column (corresponding to algorithm LD), we observe that both LD-GM and LD-FM statistically outperform LD on auto-mpg. For the column of LD-FM, the ⊕/⊙/⊖ indicates that LD-GM wins/ties/loses compared with LD-FM. The last rows in these tables are average results. Note that a pairwise t-test at the 95% confidence level is used for these comparisons.

Table 3 shows the performance of the models on the measure of accuracy. As shown in Table 3, LD-GM, LD-FM, LD-US and LD-OS are statistically outperformed by LD on most of the data sets. The reason is that GM, FM, under-sampling and over-sampling pay more attention to positive class instances. In particular, the average accuracy of LD is 10.32 percentage points higher than that of LD-US (refer to Table 3), because the small set of instances obtained by under-sampling the majority class loses the real distribution of the majority class. We also observe from Table 3 that LD-GM and LD-FM statistically outperform LD-US and LD-OS on most of the 15 data sets. From Equations (36) and (37), precision is more sensitive than recall to negative instances that are incorrectly classified as positive, which makes f-measure more sensitive than g-mean to such errors (refer to Equation 14 and Equation 23). Therefore, LD-GM is outperformed by LD-FM on accuracy. Specifically, LD-GM is outperformed by LD-FM on 8 out of the 15 data sets, and outperforms it on only 2 data sets.

Table 4 shows the performance of the models on the measure of recall. Table 4 shows that both LD-GM and LD-FM outperform LD on most of the 15 data sets, and that LD-GM and LD-FM improve the average recall of LD by 33.44 and 24.78 percent respectively. Besides, LD-OS is outperformed by LD-GM and performs similarly to LD-FM. Specifically, LD-OS statistically wins/ties/loses on 1/9/5 and 5/5/5 out of the 15 data sets compared with LD-GM and LD-FM respectively.

Combining the results of Tables 3 and 4, we find that (1) LD-GM and LD-FM improve the performance of logistic discrimination on the positive class while trying to keep its high performance on the whole data set, and (2) LD-US obtains high recall by sacrificing the accuracy of logistic discrimination.

Table 5 shows the performance of the models on g-mean. Since GM fully considers the performance of a model on both the positive and the negative class, LD-GM performs better on g-mean than the other methods, as shown in Table 5. In addition, LD-FM performs similarly to LD-OS and better than the remaining methods. It is notable that LD-FM outperforms LD-US on g-mean but is outperformed by it on recall, as shown in Tables 4 and 5. This result indicates that under-sampling reduces the overall performance of logistic discrimination on the imbalanced problem.

FM fully considers both recall and precision of the positive class; therefore, as with LD-GM on g-mean in Table 5, Table 6 shows that LD-FM performs best on f-measure. The results in Tables 3, 4, 5 and 6 show that LD-GM and LD-FM each have their own advantages, because FM and GM focus on different factors.

According to [28], when comparing two or more algorithms, a reasonable approach is to compare their ranks on each data set or their average ranks across all data sets. On a data set, the best performing algorithm gets rank 1, the second best gets rank 2, and so on. In case of ties, average ranks are assigned. Figure 1 displays the ranks of LD-GM, LD-FM, LD, LD-US and LD-OS on g-mean (top) and f-measure (bottom). From Fig. 1, LD-GM or LD-FM ranks first on 12 out of the 15 data sets on both g-mean and f-measure, while the other methods take first place on only two data sets.
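The ranking rule just described, including the average rank for ties, can be sketched as follows. The function names and the scores are hypothetical and serve only to illustrate the procedure; they are not values from Fig. 1.

```python
def average_ranks(scores, higher_is_better=True):
    """Rank algorithms on one data set; tied algorithms share the mean rank."""
    names = sorted(scores, key=lambda k: scores[k], reverse=higher_is_better)
    ranks = {}
    i = 0
    while i < len(names):
        # Extend j over the run of algorithms tied with names[i].
        j = i
        while j + 1 < len(names) and scores[names[j + 1]] == scores[names[i]]:
            j += 1
        shared = ((i + 1) + (j + 1)) / 2  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[names[k]] = shared
        i = j + 1
    return ranks


def mean_rank_over_datasets(per_dataset_scores):
    """Average each algorithm's rank across several data sets."""
    totals = {}
    for scores in per_dataset_scores:
        for name, rank in average_ranks(scores).items():
            totals[name] = totals.get(name, 0.0) + rank
    return {name: total / len(per_dataset_scores) for name, total in totals.items()}


# Hypothetical g-mean scores on a single data set (illustrative values only).
example = {"LD-GM": 0.90, "LD-FM": 0.88, "LD": 0.70, "LD-US": 0.88, "LD-OS": 0.85}
ranks = average_ranks(example)
print(ranks)
```

Here LD-FM and LD-US tie for second and third place, so both receive rank 2.5.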

Combining the results above, we conclude that logistic discrimination based on GM and FM can significantly improve the performance of logistic discrimination on the positive class while keeping its high performance on the negative class. Besides, the proposed method performs better on the imbalanced problem than logistic discrimination based on under-sampling or over-sampling.

7 Conclusion

In this paper, based on the measures of g-mean and f-measure, we first construct two novel cost functions, the f-measure based metric (FM) and the g-mean based metric (GM), and then propose a novel method to improve the performance of logistic discrimination on the imbalanced problem. We also apply sampling techniques to enhance its performance. Experimental results show that these methods can significantly improve the performance of logistic discrimination on the positive class. Besides, the proposed method shows a significant advantage over logistic discrimination and over logistic discrimination based on under-sampling or over-sampling on recall, g-mean and f-measure, while keeping the high accuracy of logistic discrimination.

The results in this paper show that logistic discrimination based on GM and logistic discrimination based on FM each have their own advantages on different measures. Therefore, designing a novel cost function that combines the merits of both GM and FM to further improve the performance of logistic discrimination on the imbalanced problem is worthy future work.

Acknowledgments

This work is supported in part by the National Natural Science Foundation of China (Grant No. 61501393, No. 61402393, No. 61472370, No. 61202207, No. 61572417, No. 61202194), in part by the Project of Science and Technology Department of Henan Province (No. 162102210310, No. 152102210129), and in part by the Science and Technology Research Key Project of the Education Department of Henan Province (Grant No. 15A520026).

Footnotes

1. Please contact hpguo.gm@gmail.com for the related source code.

References

1. He H.B. and Garcia E.A., Learning from imbalanced data, IEEE Transactions on Knowledge and Data Engineering 21 (2009), 1263–1284.

2. Yao, Wang, Jiang and Liu, Fault diagnosis method based on cs-boosting for unbalanced training data, Journal of Vibration, Measurement & Diagnosis 33(1) (2013), 111–115.

3. Liu X.Y., Li Q.Q. and Zhou Z.H., Learning imbalanced multiclass data with optimal dichotomy weights, Proceeding of IEEE 13th International Conference on Data Mining, Dallas, TX, USA, 2013, pp. 478–487.

4. Martin P.D., Evaluation: From precision, recall and f-measure to ROC, informedness, markedness and correlation, Journal of Machine Learning Technologies 2(1) (2011), 37–63.

5. Liu X.Y., Wu J.X. and Zhou Z.H., Exploratory under sampling for class imbalance learning, Proceeding of 6th IEEE International Conference on Data Mining, Hong Kong, China, 2006, pp. 965–969.

6. Jo and Japkowicz, Class imbalances versus small disjuncts, ACM SIGKDD Explorations Newsletter 6(1) (2004), 40–49.

7. Varassin C.G., Plastino, Leitão and Zadrozny, Undersampling strategy based on clustering to improve the performance of splice site classification in human genes, in Database and Expert Systems Applications, Proceeding of 24th IEEE International Workshop on DEXA, Prague, Czech Republic, 2013, pp. 85–89.

8. Zhang and Li, RWO-Sampling: A random walk oversampling approach to imbalanced data classification, Information Fusion 20 (2014), 99–116.

9. Chawla N.V., Bowyer K.W., Hall L.O. and Kegelmeyer W.P., SMOTE: Synthetic minority over-sampling technique, Journal of Artificial Intelligence Research 16 (2002), 321–357.

10. Sáez J.A., Luengo, Stefanowski and Herrera, SMOTE-IPF: Addressing the noisy and borderline examples problem in imbalanced classification by a re-sampling method with filtering, Information Sciences 291 (2015), 184–203.

11. Zieba, Tomczak J.M. and Gonczarek, RBM-SMOTE: Restricted Boltzmann machines for synthetic minority oversampling technique, Proceeding of the 7th Asian Conference Intelligent Information and Database Systems, Part I, Bali, Indonesia, 2015, pp. 377–386.

12. Joshi M.V., Agarwal R.C. and Kumar, Mining needle in a haystack: Classifying rare classes via two-phase rule induction, Proceeding of ACM SIGMOD International Conference on Management of Data, Santa Barbara, CA, USA, 2001, pp. 91–102.

13. Zhang and Zhou Z.H., Cost-sensitive face recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence 32(10) (2010), 1758–1769.

14. Zhang C.X., Wang G.W., Zhang J.S., Guo and Ying Q.Y., IRUSRT: A novel imbalanced learning technique by combining inverse random under sampling and random tree, Communications in Statistics - Simulation and Computation 43(10) (2014), 2714–2731.

15. Merigó J.M. and Gil-Lafuente A.M., The induced generalized OWA operator, Information Sciences 179 (2009), 729–741.

16. Merigó J.M. and Yager R.R., Generalized moving averages, distance measures and OWA operators, International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 21 (2013), 533–559.

17. Sun Y.M., Kamel M.S., Wong A.K.C. and Yang, Cost-sensitive boosting for classification of imbalanced data, Pattern Recognition 40(12) (2007), 3358–3378.

18. Japkowicz and Stephen, The class imbalance problem: A systematic study, Intelligent Data Analysis 6(5) (2002), 429–449.

19. Tomek, Two modifications of CNN, IEEE Transactions on Systems, Man, and Cybernetics 6(11) (1976), 769–772.

20. Angiulli, Fast condensed nearest neighbor rule, Proceeding of the 22nd International Conference of Machine Learning, Bonn, Germany, 2005, pp. 25–32.

21. Kubat and Matwin, Addressing the curse of imbalanced training sets: One-sided selection, Proceeding of the 14th International Conference of Machine Learning, Nashville, TN, USA, 1997, pp. 179–186.

22. Laurikkala, Improving identification of difficult small classes by balancing class distribution, Proceeding of 8th Conference on AI in Medicine in Europe, Cascais, Portugal, 2001, pp. 63–66.

23. He, Bai, Garcia E.A. and Li, ADASYN: Adaptive synthetic sampling approach for imbalanced learning, Proceeding of International Joint Conference on Neural Networks, Hong Kong, China, 2008, pp. 1322–1328.

24. Chawla N.V., Lazarevic, Hall L.O. and Bowyer K.W., SMOTEBoost: Improving prediction of the minority class in boosting, Proceeding of the 7th European Conference on Principles and Practice of Knowledge Discovery in Databases, Cavtat-Dubrovnik, Croatia, 2003, pp. 107–119.

25. Batista G.E., Prati R.C. and Monard M.C., A study of the behavior of several methods for balancing machine learning training data, SIGKDD Explorations 6(1) (2004), 20–29.

26. Estabrooks, Jo and Japkowicz, A multiple resampling method for learning from imbalanced data sets, Computational Intelligence 20(1) (2004), 18–36.

27. Blake and Merz, UCI repository of machine learning databases. (Available from: http://www.ics.uci.edu/mlearn/MLRepository.html).

28. Demsar, Statistical comparisons of classifiers over multiple data sets, Journal of Machine Learning Research 6 (2006), 1–30.