Abstract
As a well-known statistical method, logistic discrimination has been used successfully in many practical applications, including medical diagnosis and personal credit assessment. In this paper, we apply this model to the imbalanced problem, also referred to as the skewed or rare class problem, which is characterized by having many more instances of one class (the negative or majority class) than of the other (the positive or minority class). Traditional logistic discrimination pursues high accuracy under the assumption that all classes have similar size, so instances of the positive class are often overlooked and misclassified as negative. To take class imbalance fully into account, we revisit two basic measures for the imbalanced problem, g-mean and f-measure, and design two new cost functions, the g-mean based metric (GM) and the f-measure based metric (FM), to supervise logistic discrimination in learning its parameters. GM estimates the geometric mean of the recalls of the positive and negative classes, as g-mean does, and FM estimates the harmonic mean of the recall and precision of the positive class, as f-measure does. Experiments on UCI data sets show that the proposed method has a significant advantage over state-of-the-art classification methods on all metrics used in this paper, including accuracy, recall, f-measure and g-mean.
Introduction
Recently, the imbalanced problem, also called the class imbalance, skewed or rare class problem, has drawn significant interest in academia, industry and government. For two classes, this problem is characterized by having many more instances of one class (the negative or majority class) than of the other (the positive or minority class) [1, 2]. In many real-world applications, correctly predicting instances of the positive class is more meaningful than the contrary case. Take the “mammography data set” as an example. This data set contains 10923 “healthy” patients and 260 “cancerous” patients and, in medical detection, correctly recognizing the “cancerous” patients is overwhelmingly important. However, most standard classification methods, such as C4.5, naive Bayes and neural networks, pursue high accuracy by assuming a balanced class distribution, namely that all classes contain a similar number of instances, so positive class instances are often overlooked and misclassified as negative. Moreover, accuracy is not a suitable evaluation metric under class imbalance; recall, f-measure and g-mean are more appropriate for the class-imbalance problem [3, 4].
Many approaches have been proposed to tackle this problem, and they can be roughly categorized into two groups: data level and algorithm level approaches. Data level approaches run algorithms on a re-balanced class distribution obtained by manipulating the data space. Examples include randomly or informatively under-sampling instances of the negative class [5], randomly over-sampling instances of the positive class, over-sampling based on clustering [6, 7], the random walk over-sampling approach (RWO-Sampling) [8], and over-sampling the positive class by creating new synthetic instances [9–11]. Sampling the data space is often used for imbalanced learning; however, the real class distribution is always unknown and differs from data set to data set. Algorithm level approaches adapt existing classifier learning algorithms (the corresponding cost functions) so that the learned models are biased towards correctly classifying instances of the positive class. Examples include two-phase rule induction [12], cost-sensitive learning [13], one-class learning, and imbalanced learning that combines inverse random under-sampling and random tree [14]. Besides, the moving average [15, 16] can also be applied to the class imbalance problem.
In this paper, we reconsider the imbalanced problem and, based on logistic discrimination, propose a novel method to tackle it. Traditional logistic discrimination learns its parameters by maximizing the log-likelihood function; the proposed method instead achieves high performance on imbalanced data by maximizing two cost functions proposed in this paper. These two cost functions are the g-mean based metric (GM) and the f-measure based metric (FM), where GM estimates the geometric mean of the recalls of the positive and negative classes and FM estimates the harmonic mean of the recall and precision of the positive class. The idea was inspired by the following observations: (1) g-mean is the exact geometric mean of the recalls of the positive and negative classes, and (2) f-measure is the exact harmonic mean of the recall and precision of the positive class. In this way, the proposed method pays more attention to the positive class and achieves better performance on the imbalanced problem. The main contributions of this paper are as follows: (1) we apply logistic discrimination to the imbalanced problem, thereby extending its field of application; (2) to take class imbalance fully into account, we design two cost functions, the g-mean based metric (GM) and the f-measure based metric (FM), to supervise the parameter learning process; (3) based on these two cost functions, we propose a novel method to boost the performance of logistic discrimination on the imbalanced problem.
The experimental results on 15 UCI data sets show that the proposed method greatly boosts the performance of logistic discrimination on recall, f-measure and g-mean while keeping its high performance on accuracy. Compared with other state-of-the-art classification methods, this method shows a great advantage. The experimental results also indicate that well-designed loss functions can effectively boost the generalization performance of classification on the imbalanced problem.
The rest of this paper is organized as follows. After presenting the related work in Section 2, Section 3 introduces the two proposed cost functions, i.e., the g-mean based metric (GM) and the f-measure based metric (FM); based on GM and FM, Section 4 describes the proposed imbalanced learning method; Section 5 describes the evaluation metrics used for the imbalanced class recognition problem; Section 6 presents the experimental results and, finally, Section 7 concludes this work.
Related work
Imbalanced learning
Technically speaking, any data set that exhibits an unequal distribution between its classes can be considered imbalanced (skewed). There are two forms of imbalance: within-class imbalance and between-class imbalance. In the former, some sub-concepts are represented by only a limited number of instances, which increases the difficulty of classifying instances correctly. In between-class imbalance, one class severely out-represents another. Usually, the second form is the one discussed in the community [1].
Many factors influence the performance of a capable classifier on data sets with class imbalance; the skewed data distribution, small sample size, separability and the existence of within-class sub-concepts rank as the top four [17].
The skewed data distribution is often denoted by imbalance degree which is the ratio of the sample size of the positive class to that of the negative class. Reported studies indicate that a relatively balanced distribution usually attains a better result. However, it cannot be stated explicitly to what imbalance degree the class distribution deteriorates the classification performance, since other factors such as sample size and separability also affect performance.
Small sample size means that the sample is limited, so uncovering the regularities inherent in the small class is unreliable. In [18], the authors suggest that an imbalanced class distribution may not be a hindrance to classification if a large enough data set is provided.
The difficulty of separating the rare class from the prevalent class is the key issue of the imbalanced problem. If highly discriminative patterns exist for each class, then simple rules suffice to distinguish the classes. However, if the patterns of the classes overlap, discriminative rules are hard to induce.
Within-class concepts mean that a single class is composed of various sub-clusters or sub-concepts. Instances of a class are collected from different sub-concepts. These sub-concepts do not always contain the same number of instances. The presence of within-class concepts worsens the imbalance distribution problem. In general, we only consider imbalanced data distribution in imbalanced learning and fix other factors.
Logistic discrimination
Logistic discrimination is a typical probabilistic statistical classification model. For the two-class case, that is,
In the terminology of statistics, this model is known as logistic regression.
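A minimal sketch of the two-class model, assuming the standard logistic (sigmoid) form with a weight vector `w` and a bias `b`. These names are chosen here for illustration, and the paper's own notation in Equations (1) and (2) may differ.

```python
import math

def posterior_positive(x, w, b):
    """p(y = + | x) under the standard logistic model (assumed form)."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    # Numerically stable sigmoid: never exponentiate a large positive number.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def posterior_negative(x, w, b):
    """p(y = - | x) = 1 - p(y = + | x)."""
    return 1.0 - posterior_positive(x, w, b)
```

An instance is assigned to the positive class when `posterior_positive` exceeds `posterior_negative`, i.e., when the linear score `z` is positive.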
For a given data set D = {
Logistic discrimination uses Equation (5) as the cost function; however, this is not suitable for the class imbalance problem because the cross-entropy error function defined in Equation (5) does not consider the importance of each class. To handle this problem, we revisit two widely used measures, g-mean and f-measure, which are employed to evaluate the performance of models on the imbalanced class recognition problem, and design two new cost functions, called the g-mean based metric (GM) and the f-measure based metric (FM), to supervise logistic discrimination in learning its parameters. More details are given in Section 3.
Sampling techniques such as under-sampling and over-sampling, which have been demonstrated to be effective for imbalanced learning, are an important way to deal with the classification of imbalanced data sets.
Under-sampling tries to balance an imbalanced data set by randomly sampling a subset of the majority instances. Reported research has demonstrated that random under-sampling is more effective than random over-sampling for imbalanced learning. Several under-sampling methods have been proposed to further improve the performance of imbalanced learning, including Tomek’s links [19], the Condensed Nearest Neighbor Rule [20], one-sided selection [21] and the Neighborhood Cleaning Rule [22].
Unlike under-sampling, over-sampling aims to balance an imbalanced data set by adding a set sampled from the minority class: a set of minority examples is selected at random, and the original set S is augmented by replicating the selected examples and adding them to S. Many over-sampling methods have been proposed, such as the Synthetic Minority Over-sampling Technique (SMOTE) and Adaptive Synthetic Sampling [23].
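The two random sampling schemes can be sketched as follows. Balancing the classes to exactly equal size is an assumption made for this illustration, since the target ratio is not stated here, and the function names are hypothetical.

```python
import random

def random_under_sample(majority, minority, rng):
    """Keep every minority instance plus an equally sized random subset of the majority."""
    kept = rng.sample(majority, k=min(len(minority), len(majority)))
    return kept + list(minority)

def random_over_sample(majority, minority, rng):
    """Replicate randomly chosen minority instances until both classes have equal size."""
    n_extra = max(0, len(majority) - len(minority))
    extra = [rng.choice(minority) for _ in range(n_extra)]
    return list(majority) + list(minority) + extra

# Toy data: 20 negative (majority) and 4 positive (minority) instances.
rng = random.Random(0)
neg = [("neg", i) for i in range(20)]
pos = [("pos", i) for i in range(4)]
under = random_under_sample(neg, pos, rng)   # 4 negatives + 4 positives
over = random_over_sample(neg, pos, rng)     # 20 negatives + 20 positives
```

Under-sampling discards majority instances, while over-sampling only duplicates existing minority instances and therefore adds no new information about the minority class.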
There are other studies on sampling techniques for imbalanced learning. For example, Chawla et al. [24] integrate SMOTE into a standard boosting procedure to improve the prediction on the positive class. Gustavo et al. [25] combine over-sampling and under-sampling methods to resolve the imbalanced problem. Estabrooks et al. [26] propose a multiple resampling method which selects the most appropriate resampling rate adaptively. T. Jo et al. [6] put forward a cluster-based over-sampling method which deals with between-class imbalance and within-class imbalance simultaneously.
In this paper, we also apply sampling techniques to logistic discrimination to enhance its performance on imbalanced data sets. For clarity, only two widely used sampling techniques are selected: random under-sampling and random over-sampling. The corresponding experimental results are presented in Section 6.
G-mean and F-measure based metrics
As mentioned above, traditional logistic discrimination tries to maximize the model’s classification accuracy, using maximum likelihood as the cost function to estimate the model parameters. However, this approach ignores the different value of the data in each class, so the learned model performs poorly on the positive class.
To tackle this problem, we redefine the cost function of logistic regression to guarantee that the learned model considers its performance on both the positive and the negative class. The relevant symbols are defined as follows.
Let $C_j$ = {
From Equations (7) and (8), $P_j$ is the estimated number of instances correctly classified as class $j$ (corresponding to $n_j$), and the other quantity is the estimated number of instances of class $j$ that are incorrectly classified. For the two-class problem, we have
Based on Equations (7) and (8), two cost functions are constructed to supervise logistic discrimination in learning its parameters, namely the g-mean based metric (GM) and the f-measure based metric (FM). The two metrics are discussed in the next two subsections.
GM (
Due to the monotonic consistency, we can define a cost function by taking the logarithm of J (
This paper uses Equation (14) as a cost function to supervise logistic discrimination in learning its parameters. In this way, logistic discrimination considers its performance on both the positive and the negative class.
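A hedged sketch of a g-mean based cost in this spirit. It assumes that the estimated number of correctly classified instances of a class is the sum of the predicted probabilities of that class over its own instances, and that the cost is the logarithm of the geometric mean of the two estimated recalls. The function names and the factor 0.5 are illustrative and may not match Equation (14) exactly.

```python
import math

def soft_recalls(probs_pos, labels):
    """Estimate the recall of each class from predicted p(y = + | x).

    probs_pos: predicted probability of the positive class for each instance.
    labels: +1 for positive instances, -1 for negative instances.
    """
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    # Assumed estimate: probability mass assigned to the true class of each instance.
    p_pos = sum(p for p, y in zip(probs_pos, labels) if y == 1)
    p_neg = sum(1.0 - p for p, y in zip(probs_pos, labels) if y == -1)
    return p_pos / n_pos, p_neg / n_neg

def log_gm(probs_pos, labels):
    """Logarithm of the geometric mean of the two estimated recalls (to be maximized)."""
    r_pos, r_neg = soft_recalls(probs_pos, labels)
    return 0.5 * (math.log(r_pos) + math.log(r_neg))
```

Because each class contributes through its own recall, a handful of positive instances weighs as much as the whole negative class, which is the property that plain log-likelihood lacks.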
Substituting Equations (17) and (18) into Equation (16) ends up with
FM (
According to Equations (6) and (7), F1 can be estimated by
In this paper, we use Equation (23) as an objective function to supervise logistic regression in learning its parameters. In this way, logistic discrimination considers its performance on both the positive and the negative class.
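A hedged sketch of an f-measure based cost in the same style. It assumes soft counts: the estimated true positives are the summed positive-class probabilities over positive instances, and the estimated false positives are the summed positive-class probabilities over negative instances. The function name is hypothetical and the form may differ from Equation (23).

```python
def soft_f1(probs_pos, labels):
    """Estimated F1 of the positive class from predicted p(y = + | x).

    With recall = TP / n_pos and precision = TP / (TP + FP), the harmonic
    mean simplifies to 2 * TP / (n_pos + TP + FP).
    """
    n_pos = sum(1 for y in labels if y == 1)
    tp = sum(p for p, y in zip(probs_pos, labels) if y == 1)    # soft true positives
    fp = sum(p for p, y in zip(probs_pos, labels) if y == -1)   # soft false positives
    return 2.0 * tp / (n_pos + tp + fp)
```

Probability mass placed on negative instances enters the denominator directly, so this cost penalizes false positives through precision rather than through the recall of the negative class.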
Substituting Equation (17) into Equation (25), we have
Based on the cost functions proposed in Section 3, we design a novel approach for logistic discrimination to tackle the imbalanced problem. The approach is an iterative process. In the learning stage, it uses the quasi-Newton method BFGS to iteratively optimize the cost functions and learn the parameters. Formally, the iterative process is as follows
Here, the cost function f (
The details of the learning process are shown in Algorithm 1. In the training stage, Algorithm 1 first initializes
D: the training data set
ɛ: the parameter greater than zero
1. randomly initialize
2.
3.
4.
5. k = 1;
6.
7.
8. update
9. update
10.
11.
12.
1. Calculate p(y = + | x) and p(y = - | x) using Eq. (1) and Eq. (2), respectively;
2.
3.
updates
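The iterative learning idea can be illustrated with a dependency-free sketch. Note the substitutions: plain gradient ascent with a finite-difference gradient and step halving stands in for BFGS, the parameters start at zero rather than at random values so that the run is reproducible, and `eps` is assumed to act as a gradient-norm stopping threshold. The cost is a log g-mean estimate built from summed class probabilities, which is an assumed form rather than the paper's exact Equation (14).

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def log_gm_cost(theta, X, y):
    """Log of the geometric mean of the estimated recalls; theta = weights + [bias]."""
    mass = {1: 0.0, -1: 0.0}   # probability mass assigned to the true class
    count = {1: 0, -1: 0}
    for xi, yi in zip(X, y):
        z = sum(w * v for w, v in zip(theta[:-1], xi)) + theta[-1]
        p = sigmoid(z)
        mass[yi] += p if yi == 1 else 1.0 - p
        count[yi] += 1
    return 0.5 * (math.log(mass[1] / count[1]) + math.log(mass[-1] / count[-1]))

def numeric_grad(f, theta, h=1e-6):
    """Central finite-difference gradient of f at theta."""
    grad = []
    for i in range(len(theta)):
        up = theta[:i] + [theta[i] + h] + theta[i + 1:]
        down = theta[:i] + [theta[i] - h] + theta[i + 1:]
        grad.append((f(up) - f(down)) / (2.0 * h))
    return grad

def fit(X, y, step=0.5, eps=1e-6, max_iter=200):
    """Maximize the cost by gradient ascent (a stand-in for BFGS)."""
    cost = lambda t: log_gm_cost(t, X, y)
    theta = [0.0] * (len(X[0]) + 1)
    for _ in range(max_iter):
        g = numeric_grad(cost, theta)
        if math.sqrt(sum(v * v for v in g)) < eps:
            break                      # gradient is small enough: stop
        t = step
        while t > 1e-12:               # halve the step until the cost improves
            candidate = [a + t * v for a, v in zip(theta, g)]
            if cost(candidate) > cost(theta):
                theta = candidate
                break
            t *= 0.5
        else:
            break                      # no improving step was found
    return theta

# Toy one-dimensional data: four negatives, two positives.
X = [[0.0], [0.5], [1.0], [1.5], [2.0], [3.0]]
y = [-1, -1, -1, -1, 1, 1]
theta = fit(X, y)
```

A quasi-Newton method such as BFGS would replace the fixed-direction step with one scaled by an approximation of the inverse Hessian, which typically converges in far fewer iterations.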
Evaluation metrics are essential for assessing the effectiveness of an algorithm and, traditionally, accuracy and error rate are the most frequently used ones. Consider a two-class classification problem and let + and - be the positive and negative class respectively, as defined before. Then the instances classified by a learned model can be grouped into four categories, as shown in Table 1, and accuracy and error rate are defined as:
However, the evaluation metrics used for balanced classification are very different from those used for the imbalanced problem, and accuracy is inadequate for imbalanced learning. For example, if the positive class makes up only 5% of a training data set, a naive approach that classifies every instance as negative achieves an accuracy of 95%. Although 95 percent accuracy across the entire data set appears good, it fails to reflect the fact that the naive method misclassifies every instance of the positive class. That is, the accuracy metric in this case does not provide adequate information on a classifier’s functionality with respect to the type of classification required. In lieu of accuracy, other assessment metrics, including recall, precision, f-measure and g-mean, are frequently adopted in the research community to evaluate the performance of models on imbalanced learning problems. These metrics are built on the recalls (or accuracies) of both the positive and the negative class as well as the precision of the positive class, specifically
Then, precision, recall and f-measure are respectively defined as
From Equations (34) and (37), recall shows how many instances of the positive class are correctly classified. Similarly, from Equations (33) and (36), precision is a measure of exactness, i.e., of the instances labeled as positive, how many are actually labeled correctly [1]. Like accuracy and error rate, these two metrics share an inverse relationship. Recall does not tell how many instances are incorrectly labeled as positive, and precision does not tell how many positive instances are classified incorrectly. In practice, we want recall and precision to be high at the same time; however, high recall does not imply high precision. A classifier with high recall and low precision, or with low recall and high precision, may not be a good classifier [4].
To evaluate a classifier on both recall and precision, f-measure is proposed, as shown in Equation (38). F-measure combines recall and precision into a single measure of classification effectiveness, with the relative importance of recall and precision determined by a coefficient set by the user. Thus, f-measure represents a harmonic mean between recall and precision.
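As an illustration, the sketch below computes precision, recall and f-measure from confusion counts, assuming the common weighted form F = (1 + β²)·P·R / (β²·P + R), where `beta` plays the role of the user-set coefficient. The counts in the usage lines are made-up values, not results from the paper.

```python
def precision(tp, fp):
    """Fraction of instances labeled positive that are truly positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    """Fraction of truly positive instances that are labeled positive."""
    return tp / (tp + fn) if (tp + fn) else 0.0

def f_measure(tp, fp, fn, beta=1.0):
    """Weighted harmonic mean of precision and recall (beta = 1 gives F1)."""
    p, r = precision(tp, fp), recall(tp, fn)
    if p == 0.0 and r == 0.0:
        return 0.0
    b2 = beta * beta
    return (1.0 + b2) * p * r / (b2 * p + r)

# Made-up counts: 30 true positives, 10 false positives, 20 false negatives.
p = precision(30, 10)       # 0.75
r = recall(30, 20)          # 0.6
f1 = f_measure(30, 10, 20)  # 2 * 30 / (2 * 30 + 10 + 20)
```

With `beta` greater than 1 the measure leans towards recall, and with `beta` smaller than 1 it leans towards precision.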
Like f-measure, g-mean is another metric that evaluates the degree of inductive bias, in terms of a ratio of the recalls (or accuracies) of the positive and negative classes. Specifically, g-mean measures the balanced performance of a classifier using the geometric mean of the recall of the positive class and that of the negative class. In this way, g-mean treats each class, rather than each instance, as equally important. Formally, g-mean is as follows
In addition to the metrics mentioned above, several curves are commonly used in imbalanced learning, such as receiver operating characteristic (ROC) curves, precision-recall curves, cost curves and so on. In this paper, we employ accuracy, recall, f-measure and g-mean to evaluate classification performance on imbalanced data sets. Although accuracy alone is inadequate for evaluating classification performance, poor accuracy still indicates a bad classifier. An effective classifier should improve recall, f-measure or g-mean without significantly decreasing accuracy.
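To make the contrast between accuracy and g-mean concrete, the sketch below evaluates two hypothetical classifiers on 1000 instances with 5% positives: the naive one that labels everything negative, and one that trades some accuracy for recall. The confusion counts are invented for illustration.

```python
import math

def accuracy(tp, fn, fp, tn):
    """Fraction of all instances that are classified correctly."""
    return (tp + tn) / (tp + fn + fp + tn)

def g_mean(tp, fn, fp, tn):
    """Geometric mean of the recall of the positive and of the negative class."""
    recall_pos = tp / (tp + fn) if (tp + fn) else 0.0
    recall_neg = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(recall_pos * recall_neg)

# Naive classifier: every instance is labeled negative.
naive = dict(tp=0, fn=50, fp=0, tn=950)
# Hypothetical classifier that recovers most positives at some cost in accuracy.
balanced = dict(tp=40, fn=10, fp=95, tn=855)

naive_scores = (accuracy(**naive), g_mean(**naive))           # accuracy 0.95, g-mean 0.0
balanced_scores = (accuracy(**balanced), g_mean(**balanced))  # accuracy 0.895, g-mean sqrt(0.72)
```

The naive classifier wins on accuracy but scores zero on g-mean, because a single class with zero recall drives the geometric mean to zero.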
Data sets and experimental setup
Fifteen data sets are randomly selected from the UCI repository [27]. Imbalanced data sets with a positive class are derived from these 15 data sets according to the following strategies: (1) for a two-class data set, if the imbalance degree (the ratio of the sample size of the positive class to that of the negative class) is greater than 0.25, instances of the positive class are randomly removed until the degree is smaller than 0.25; (2) for a multi-class data set, labels are deleted until exactly two labels remain, and then strategy (1) is applied. For each data set, 10-fold cross validation is performed: one fold is used to test the performance of the learned models and the others are used for training. We repeat the cross validation five times, so fifty trials are executed on each data set for each model. More details about the data sets are given in Table 2, where #Insts, #Attrs and #Degree are the size, number of attributes and imbalance degree of the corresponding data sets.
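The derivation strategy and the evaluation protocol can be sketched as follows. The function names are hypothetical, and the folds are drawn without stratification, which is an assumption since the text does not say whether the folds preserve the class ratio.

```python
import random

def imbalance_degree(n_pos, n_neg):
    """Ratio of the positive class size to the negative class size."""
    return n_pos / n_neg

def thin_positive_class(positives, negatives, rng, max_degree=0.25):
    """If the degree exceeds max_degree, randomly drop positives until it is below it."""
    kept = list(positives)
    if len(kept) / len(negatives) > max_degree:
        rng.shuffle(kept)
        while len(kept) / len(negatives) >= max_degree:
            kept.pop()
    return kept

def repeated_k_fold(n, k=10, repeats=5, seed=0):
    """Yield (train, test) index lists for `repeats` rounds of k-fold cross validation."""
    rng = random.Random(seed)
    for _ in range(repeats):
        idx = list(range(n))
        rng.shuffle(idx)
        for f in range(k):
            test = idx[f::k]                 # every k-th shuffled index forms one fold
            held_out = set(test)
            train = [i for i in idx if i not in held_out]
            yield train, test
```

With the default arguments, `repeated_k_fold` produces 5 × 10 = 50 train/test splits per data set, matching the fifty trials described above.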
To facilitate the experiments, we designed a Python-based platform named Loyang Spoon (LySpoon), on which we implemented the algorithms used in this paper.
For simplicity, in this section LD denotes the traditional logistic discrimination, which uses the log-likelihood as its cost function. LD-GM and LD-FM are the methods proposed in this paper, which use GM defined by Equation (14) and FM defined by Equation (23) as cost functions respectively. Similarly, LD-US and LD-OS are logistic discriminations trained on data sets obtained from the original training set by under-sampling and over-sampling respectively. We compare both LD-GM and LD-FM with LD, LD-US and LD-OS. The comparison is discussed in the next subsection.
Experimental results
To evaluate the performance of the proposed method, we compare both LD-GM and LD-FM with LD, LD-US and LD-OS (see Section 6.1 for details about these methods). The results are shown in Tables 3, 4, 5 and 6, which report the performance of the models on accuracy, recall, g-mean and f-measure respectively. In these tables, for the columns of LD, LD-US and LD-OS, the first ⊕/⊙/⊖ next to a result indicates that LD-GM statistically wins/ties/loses against the corresponding method (column) on the corresponding data set (row). Similarly, the second ⊕/⊙/⊖ indicates that LD-FM statistically wins/ties/loses against the corresponding method on the corresponding data set. For example, from the first row (corresponding to data set auto-mpg) and the third column (corresponding to algorithm LD), we observe that both LD-GM and LD-FM statistically outperform LD on auto-mpg. For the column of LD-FM, the ⊕/⊙/⊖ indicates that LD-GM wins/ties/loses against LD-FM. The last rows of these tables give the average results.
Table 3 reports the performance of the models on accuracy. As shown in Table 3, LD-GM, LD-FM, LD-US and LD-OS are statistically outperformed by LD on most of the data sets. The reason is that GM, FM, under-sampling and over-sampling pay more attention to positive class instances. In particular, the average accuracy of LD is 10.32 percentage points higher than that of LD-US (see Table 3), because the small sample obtained by under-sampling the majority class loses the real distribution of that class. We also observe from Table 3 that LD-GM and LD-FM statistically outperform LD-US and LD-OS on most of the 15 data sets. From Equations (36) and (37), precision is more sensitive than recall to negative instances being misclassified as positive, which makes f-measure more sensitive than g-mean to such errors (see Equations (14) and (23)). Therefore, LD-GM is outperformed by LD-FM on accuracy. Specifically, LD-GM is outperformed by LD-FM on 8 out of the 15 data sets, and outperforms it on only 2 data sets.
Table 4 reports the performance of the models on recall. Table 4 shows that both LD-GM and LD-FM outperform LD on most of the 15 data sets, and that LD-GM and LD-FM improve the average recall of LD by 33.44 and 24.78 percent, respectively. Besides, LD-OS is outperformed by LD-GM and performs similarly to LD-FM. Specifically, LD-OS statistically wins/ties/loses on 1/9/5 and 5/5/5 of the 15 data sets compared with LD-GM and LD-FM respectively.
Combining the results of Tables 3 and 4, we find that (1) LD-GM and LD-FM improve the performance of logistic discrimination on the positive class while trying to keep its high performance on the whole data set, and (2) LD-US obtains high recall by sacrificing the high accuracy of logistic discrimination.
Table 5 shows the performance of the models on g-mean. Since GM fully considers the performance of models on both the positive and the negative class, LD-GM performs better on g-mean than the other methods, as shown in Table 5. In addition, LD-FM performs similarly to LD-OS and better than the remaining methods. Notably, LD-FM outperforms LD-US on g-mean but is outperformed by it on recall, as shown in Tables 4 and 5. This result indicates that under-sampling reduces the overall performance of logistic discrimination on the imbalanced problem.
FM fully considers both the recall and the precision of the positive class and therefore, analogously to Table 5, Table 6 shows that LD-FM performs best on f-measure. From the results of Tables 3, 4, 5 and 6, LD-GM and LD-FM each have their own advantages, because FM and GM focus on different factors.
According to [28], when comparing two or more algorithms, a reasonable approach is to compare their ranks on each data set or their average ranks across all data sets. On a data set, the best performing algorithm gets the rank of 1, the second best one gets the rank of 2, and so on. In case of ties, average ranks are assigned. Figure 1 displays the ranks of LD-GM, LD-FM, LD, LD-US and LD-OS on g-mean (top) and f-measure (bottom). From Fig. 1, LD-GM or LD-FM ranks first on 12 out of the 15 data sets on both g-mean and f-measure, and the other methods take first place on only two data sets.
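The ranking rule, including average ranks for ties, can be sketched as below. The algorithm names `A`, `B`, `C` and their scores are placeholders chosen for illustration, not values from the tables.

```python
def rank_scores(scores):
    """Rank algorithms on one data set: the best score gets rank 1, ties share the average rank."""
    order = sorted(scores, key=scores.get, reverse=True)
    ranks = {}
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of algorithms tied with the one at position i.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        shared = (i + 1 + j + 1) / 2.0       # average of the tied positions
        for name in order[i:j + 1]:
            ranks[name] = shared
        i = j + 1
    return ranks

def average_ranks(per_dataset_scores):
    """Average the rank of each algorithm across all data sets."""
    totals = {}
    for scores in per_dataset_scores:
        for name, r in rank_scores(scores).items():
            totals[name] = totals.get(name, 0.0) + r
    return {name: total / len(per_dataset_scores) for name, total in totals.items()}

# Placeholder scores on two data sets; A and B tie on the first one.
datasets = [{"A": 0.9, "B": 0.9, "C": 0.5}, {"A": 0.7, "B": 0.8, "C": 0.6}]
first = rank_scores(datasets[0])     # A and B share rank 1.5, C gets rank 3
overall = average_ranks(datasets)    # A: 1.75, B: 1.25, C: 3.0
```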
Combining the results above, we conclude that logistic discrimination based on GM and FM significantly improves the performance of logistic discrimination on the positive class while keeping its high performance on the negative class. Besides, the proposed method performs better on the imbalanced problem than logistic discrimination based on under-sampling or over-sampling.
Conclusion
In this paper, based on the measures g-mean and f-measure, we first construct two novel cost functions, called the f-measure based metric (FM) and the g-mean based metric (GM), and then propose a novel method to improve the performance of logistic discrimination on the imbalanced problem. We also apply sampling techniques to enhance its performance. Experimental results show that these methods significantly improve the performance of logistic discrimination on the positive class. Besides, the proposed method has a significant advantage over logistic discrimination and over logistic discrimination based on under-sampling or over-sampling on recall, g-mean and f-measure, while keeping the high accuracy of logistic discrimination.
The results in this paper show that logistic discrimination based on GM and logistic discrimination based on FM each have their own advantages on different measures. Therefore, designing a novel cost function that combines the merits of GM and FM to further improve the performance of logistic discrimination on the imbalanced problem is worthy of future research.
Acknowledgments
This work is in part supported by the National Natural Science Foundation of China (Grant No. 61501393, No. 61402393, No. 61472370, No. 61202207, No. 61572417, No. 61202194), in part by Project of Science and Technology Department of Henan Province (No. 162102210310, No. 152102210129), and in part by Science and Technology Research key Project of the Education Department of Henan Province (Grant No. 15A520026).
Footnotes
1. Please contact hpguo.gm@gmail.com for the related source code.
