IMLBoost for intelligent diagnosis with imbalanced medical records

Abstract

Class imbalance in medical records is a critical challenge for disease classification in intelligent diagnosis. Existing machine learning algorithms usually assign equal weights to all classes, which may reduce classification accuracy on imbalanced records. In this paper, a new Imbalance Lessened Boosting (IMLBoost) algorithm is proposed to better classify imbalanced medical records by highlighting the contribution of samples in minor classes as well as hard and boundary samples. A tailored Cost-Fitting Loss (CFL) function is proposed to assign befitting costs to these critical samples. The first and second derivatives of the CFL are then derived and embedded into the classical XGBoost framework. In addition, some feature analysis techniques are used to further improve the performance of IMLBoost and to speed up model training. Experimental results on five imbalanced medical datasets from UCI demonstrate the effectiveness of the proposed algorithm. Compared with other existing classification methods, IMLBoost improves classification performance in terms of F1-score, G-mean and AUC.

Keywords

Imbalance learning class-imbalanced boosting medical data intelligent diagnosis

1. Introduction

The automatic classification of diseases based on heterogeneous medical records is an important research topic in intelligent diagnosis. Existing studies often apply Machine Learning (ML) methods to construct disease classification models with collected medical records as training data [1, 2, 3], and the commonly used ML models for classification include k-nearest neighbor, decision tree, support vector machine, naive Bayes classifier, neural network, ensemble learning, and so on [4, 5, 6, 7, 8, 9]. Among these models, ensemble learning models have been demonstrated to perform better on classifying medical records than models with a single classifier [10, 11]. As one of the most representative ensemble learning methods, boosting has achieved remarkable success on many classification and prediction problems. It has many implementations, such as AdaBoost [12], XGBoost [13], LightGBM [14], CatBoost [15], and so on. Considering the performance and the affordable time and memory complexity of these models, XGBoost has become the first choice in related studies and competitions.

However, there is a common challenge in the classification of real-life medical records: class imbalance. It is a phenomenon where some classes (minor classes) contain significantly fewer samples than other classes (major classes). For example, in prevalent pneumonia, the numbers of negative and positive samples are very different, and among positive samples, the various pneumonia subtypes are also imbalanced. The imbalance of disease samples often leads to poor classification results, because most ML models assume that data samples in all classes are balanced [4, 5, 6, 7]. Hence, when dealing with imbalanced records, ML models often focus on the accuracy of major classes and perform poorly on minor classes. However, the minor classes are usually of more interest, and the errors coming from them are more important in medical diagnosis. For instance, if cancer patients (belonging to a minor class) are diagnosed as healthy, they will miss the best time for treatment, which may be fatal. For this reason, traditional machine learning models cannot perform well on the classification of imbalanced medical data. Even when XGBoost is applied, the performance is brittle on class-imbalanced datasets [16, 17, 18, 19], due to the equal weight assigned to each class. Therefore, it is meaningful to develop an effective method to deal with the problem of class imbalance in learning-based intelligent diagnosis.

Many efforts have been made to solve the imbalance learning problem. Current strategies for the classification of imbalanced data can be divided into two types: data-level methods and algorithm-level methods. Data-level approaches usually use various sampling (over-sampling or under-sampling) techniques to eliminate class imbalance [20, 21, 22, 23, 24, 25]. But sampling methods have obvious disadvantages. On the one hand, over-sampling may introduce duplicated samples, which slows down the training process and makes the model susceptible to over-fitting. On the other hand, under-sampling may discard valuable samples that are important for training models. Due to these disadvantages of sampling, most present works focus on algorithm-level approaches [26]. The main purpose of algorithm-level approaches is to let models assign different costs to different classes. For this purpose, many studies [27, 28, 29, 30] have been devoted to designing class-imbalanced loss functions. They advocate that the cost of misclassified samples in minor classes is higher than that of major classes, so the loss caused by samples in minor classes should be emphasized.

There are also two other important issues that should be considered in data classification: one is the classification difficulty of samples, and the other is the effect of boundary samples. Generally, not all samples in a dataset are equally useful for classification, and a good class-imbalanced loss function gives higher weights to the loss caused by important samples rather than treating each sample equally. For example, studies [27, 28, 29] considered the quantity differences between classes when dealing with imbalanced data and assigned high weights to minor classes. However, the difference in classification difficulty between samples was largely ignored until the well-known focal loss [31] was proposed. Samples that are difficult to classify (e.g. with high classification errors) are called hard samples; the others are called easy samples. The focal loss function assigns a relatively high cost to samples in minor classes and to hard samples. Recently, some studies have attempted to apply focal loss to classification models, but the effect is not satisfactory [26, 32, 33, 34, 35]. These studies also share a general defect: the importance of boundary samples is not considered. In fact, boundary samples contribute substantially to classification performance because of their strong ability to distinguish classes. If boundary samples in one class are misclassified into other classes, the classification boundary may become blurred. Hence, boundary samples should be assigned larger weights than samples within classes. Besides, some models [17, 21, 23, 36] are designed only for binary classification problems. Multi-class classification can be regarded as multiple binary classification tasks, and using this decomposition strategy, some methods [22, 24, 25, 27] have been applied to multiclass data.

Based on the above discussion, a novel IMbalance Lessened Boosting (IMLBoost) algorithm is proposed in this work to better classify imbalanced medical records for disease diagnosis. It integrates a tailored Cost-Fitting Loss (CFL) function with an innovative boosting framework. The major contributions of this paper can be summarized as follows:

• The boosting framework adopts additive training to improve the performance of individual classification trees, and the first and second derivatives of the CFL function are derived to enable the optimization of the training procedure.
• The tailored CFL function assigns befitting costs to different samples, taking the influence of minor samples, hard samples and boundary samples into account. It is composed of two terms: an extended focal loss and a smoothed hinge loss. Specifically, the extended focal loss is designed for handling minor classes and hard samples; it highlights the contribution of minor classes while automatically enhancing the contribution of hard samples through a focusing factor. The smoothed hinge loss is proposed to adjust the effect of boundary samples and is insensitive to interior samples beyond a boundary threshold.
• Some data preprocessing techniques, feature engineering and parameter optimization are used to further improve boosting performance.

The remainder of this paper is organized as below. Section 2 gives a brief summary of related works. Section 3 shows the theory of the proposed model. In Section 4, experiments and results are shown to verify the effectiveness of the model. Finally, conclusions are given in Section 5.
2. Related works

In the past few years, machine learning methods have been widely used for the automatic classification of diseases [4, 5, 6, 7], but the imbalance of medical data brings huge challenges to the practical implementation of these methods. Since boosting is simple and effective, many scholars choose it to deal with the classification of imbalanced datasets, and there has been significant interest in embedding novel ideas into boosting algorithms. Typical works can be divided into two types: data-level approaches and algorithm-level approaches.

(1) Data-level approaches. The data-level approaches eliminate data imbalance by resampling. Resampling techniques either remove samples from major classes or add samples to minor classes. ‘Sampling $+$ Boosting’ methods have been proposed to address the problem of data imbalance in recent years. Chawla et al. [22] proposed SMOTEBoost, which combines the Synthetic Minority Oversampling Technique (SMOTE) with the boosting procedure and uses SMOTE to produce new samples for minor classes in each boosting iteration. Seiffert et al. [24] presented a hybrid sampling/boosting algorithm, RUSBoost, which randomly removes samples of major classes to keep the class sizes balanced. Guo et al. [23] proposed DataBoost-IM, which combines data generation and boosting to improve the performance of all classes. Farshid et al. [25] presented CUSBoost, which combines a clustering-based under-sampling approach with boosting. In an application to medical datasets, Kabir et al. [37] applied several sampling methods before using XGBoost for class-imbalanced breast cancer classification. However, over-sampling tends to over-fit the replicated or synthesized minor-class samples and cannot learn robust features, while under-sampling may discard important information from the major classes and lead to under-fitting. It is generally accepted that medical data should be true and precise, so medical data synthesized by machine learning are not convincing. Therefore, we should avoid artificially generating data to solve the class imbalance problem in the medical field.
(2) Algorithm-level approaches. The algorithm-level approaches mainly modify existing boosting algorithms to handle imbalanced data. Most of these algorithms adjust loss functions to assign high costs to samples from minor classes. The objective of these loss functions should not only maximize classification accuracy but also pay more attention to minor classes, because high accuracy can be achieved as long as samples in major classes are correctly classified. Inspired by this, Fan et al. [27] proposed AdaCost, where the weight increase for misclassified samples in minor classes is larger than that for major classes. Subsequently, three other versions (AdaC1, AdaC2 and AdaC3) [28] were produced, all of which adjust the weights and the AdaBoost confidence parameters to improve the accuracy of minor classes. Mahesh et al. [29] proposed RareBoost, which updates the weights of different classes in different ways: the prediction weights of positive and negative samples are updated according to TP/FP (True Positive/False Positive) and TN/FN (True Negative/False Negative), respectively. Wang et al. [30] proposed NIBoost, which also updates weights according to the predicted labels and the error rate.

Besides solutions to the data quantity imbalance problem, other works have studied the problem of sample classification difficulty, which is also common in medical data analysis. These studies [31, 38, 39] assign higher costs to hard samples than to easy samples, but this may cause classifiers to focus on harmful samples (e.g. noisy or mislabeled data) [34]. It is feasible to combine several loss functions that emphasize different samples to reduce the influence of harmful samples. In [17], Wang et al. applied weighted cross-entropy and focal loss to XGBoost for imbalanced classification, which turned out to be better than using focal loss alone. Besides cost allocation, studies at the algorithm level can also deal with imbalance through ensemble learning itself, because it can take advantage of all weak learners while avoiding the limitations of single classifiers. For example, Farshid et al. [21] proposed MEBoost, which mixes decision trees and extra trees with boosting; results show that it is a promising technique for handling the imbalance problem, but it is implemented only for two-class data.

In addition to the aforementioned methods, many other techniques for imbalance learning have been developed, such as transfer learning, metric learning, meta learning/domain adaptation, decoupling of representation and classifier, and so on. Transfer learning [40, 41] transfers the knowledge learned from major classes to minor classes so as to improve the accuracy of minor classes. Metric learning measures the similarity of samples to reduce the distance between similar samples and increase the distance between different classes, so that the classification boundary becomes clearer [42, 43], which can improve the accuracy of all classes. Meta learning/domain adaptation processes samples of major classes and minor classes in different ways and learns how to reweight them adaptively [44, 45]. Decoupling of representation and classifier divides the learning process into two steps, normal sampling in the feature learning stage and balanced sampling in the classifier learning stage, which can lead to better learning results [46, 47]. However, these techniques are usually tied to deep neural networks, and medical datasets are usually too small to train neural network models. For reasons of feasibility, this work focuses on the algorithm-level improvement of boosting models.
3. IMLBoost algorithm

Generally, the number of medical records for different disease types is rather imbalanced, and traditional machine learning methods usually have low accuracy on minor classes. To solve this problem, we propose the IMLBoost model, which combines a boosting classification framework with a novel class-imbalanced loss function. The boosting framework can perform better than single classifiers on imbalanced data classification. The loss function improves the classification accuracy of minor classes, while hard samples and boundary samples are also considered to improve the performance.

3.1 The boosting framework for classification

For the classification of imbalanced medical data, traditional machine learning algorithms usually cannot achieve satisfactory performance, because samples in minor classes are easily misclassified. Ensemble learning methods use a set of trained classifiers to deal with this problem, which improves the adaptability and robustness of base classifiers and performs better than individual classifiers. Therefore, an effective ensemble learning algorithm, XGBoost, is chosen as the base classification framework. It is a gradient boosting ensemble model based on CART regression trees [13]. It is trained in an additive manner, adding a tree in each iteration to fit the residual of the previous prediction. The objective function of XGBoost at the $t$-th iteration for a given dataset with $n$ samples is defined as follows, and a suitable $f_{t}$ is sought to minimize it.

$\displaystyle\textit{Obj}^{(t)}=\sum_{i=1}^{n}l(y_{i},\hat{y}_{i}^{(t)})+\Omega(f_{t})=\sum_{i=1}^{n}l(y_{i},\hat{y}_{i}^{(t-1)}+f_{t}(x_{i}))+\Omega(f_{t}),$ (1)

$\displaystyle\text{where}\ \Omega(f_{t})=\gamma T(f_{t})+\frac{1}{2}\lambda\left\|w(f_{t})\right\|^{2}.$

Here $x_{i}$ is the $i$-th input sample, which represents a medical record. If a record consists of $m$ features, then $x_{i}$ is an $m$-dimensional vector. $y_{i}$ is the target of the $i$-th sample and represents the true disease type; it may be $-1$ or $1$ in binary classification, or a label-encoding vector in multiclass classification. $\hat{y}_{i}^{(t)}$ is the prediction for the $i$-th sample at the $t$-th iteration and represents the model’s predicted disease type. $l(y_{i},\hat{y}_{i}^{(t)})$ is a loss function that measures the difference between predictions and targets. $f_{t}$ is the $t$-th classification tree, which fits the residual between adjacent predictions, that is, $\hat{y}_{i}^{(t)}=\hat{y}_{i}^{(t-1)}+f_{t}(x_{i})$. For a given tree $f$, $T$ is the number of leaves, $w$ is the vector of leaf weights, and $\Omega(f)$ is the regularization term that controls the complexity of the tree and helps avoid over-fitting. Model complexity can also be tuned through several parameters. The first is the minimum weight sum of samples in a child node (‘min_child_weight’): if the weight sum of a leaf node is less than this value, the node is no longer split. The second is the maximum depth (‘max_depth’): once a tree reaches this depth, it no longer grows. The step size for updating weights (‘learning_rate’) also needs to be tuned. By adding trees, the XGBoost model reduces the influence of any single tree on the final prediction, so that it performs better than any single tree and leaves more room for optimization. However, the loss function $l(\cdot)$ in the objective is a general function rather than a specific expression, so the objective cannot be optimized directly by traditional optimization methods [13].
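As a rough illustration of how these quantities map onto a concrete configuration, the dictionary below uses parameter names from the XGBoost Python package. All values are placeholders chosen for illustration, not the settings used in this paper; `gamma` and `reg_lambda` play the roles of $\gamma$ and $\lambda$ in $\Omega(f_{t})$.

```python
# Hypothetical XGBoost parameter set; every value is a placeholder to be tuned.
params = {
    "max_depth": 4,           # a tree stops growing once it reaches this depth
    "min_child_weight": 1.0,  # minimum weight sum required in a child node
    "learning_rate": 0.1,     # step size applied to each newly added tree
    "gamma": 0.0,             # penalty per leaf, the gamma in Omega(f_t)
    "reg_lambda": 1.0,        # L2 penalty on leaf weights, the lambda in Omega(f_t)
}
```

In practice such a dictionary would be tuned by cross validation together with the class-imbalance parameters introduced later.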

Thanks to Taylor’s theorem, a second-order approximation can be applied to optimize the objective function. The first and second order gradient statistics of the loss function are defined as:

$\displaystyle g_{i}=\partial_{\hat{y}^{(t-1)}}l(y_{i},\hat{y}^{(t-1)})$ $\displaystyle h_{i}=\partial^{2}_{\hat{y}^{(t-1)}}l(y_{i},\hat{y}^{(t-1)})$ (2)

Then the objective is approximated as:

$\displaystyle\textit{Obj}^{(t)}\simeq\sum_{i=1}^{n}\left[l(y_{i},\hat{y}_{i}^{(t-1)})+g_{i}f_{t}(x_{i})+\frac{1}{2}h_{i}f_{t}^{2}(x_{i})\right]+\Omega(f_{t})$ (3)

In this equation, the constant term $l(y_{i},\hat{y}_{i}^{(t-1)})$ can be removed to obtain the simplified objective Eq. (4) at the $t$-th iteration. It can be seen that the model depends only on the first and second order derivatives of the loss function. In fact, owing to the second-order approximation, the XGBoost model allows users to customize loss functions on demand.

$\displaystyle\textit{Obj}^{(t)}\propto\sum_{i=1}^{n}\left[g_{i}f_{t}(x_{i})+\frac{1}{2}h_{i}f_{t}^{2}(x_{i})\right]+\Omega(f_{t})$ (4)
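To make the role of $g_{i}$ and $h_{i}$ concrete: for a fixed tree structure, Eq. (4) separates over leaves, and each leaf weight has the closed form $w^{*}=-G/(H+\lambda)$, where $G$ and $H$ are the sums of $g_{i}$ and $h_{i}$ over the samples in that leaf, as derived in the XGBoost paper [13]. The following is a minimal sketch of that step; the function and argument names are illustrative and are not taken from any library.

```python
def optimal_leaf_weight(grads, hessians, reg_lambda=1.0):
    """Weight w* minimizing sum(g*w + 0.5*h*w**2) + 0.5*reg_lambda*w**2 on one leaf."""
    G = sum(grads)     # sum of first-order statistics in the leaf
    H = sum(hessians)  # sum of second-order statistics in the leaf
    return -G / (H + reg_lambda)


def leaf_objective(grads, hessians, reg_lambda=1.0, gamma=0.0):
    """Value of the simplified objective at w*, including the per-leaf penalty gamma."""
    G = sum(grads)
    H = sum(hessians)
    return -0.5 * G * G / (H + reg_lambda) + gamma


# Two samples falling into the same leaf (made-up gradient statistics).
w = optimal_leaf_weight([0.5, -1.5], [1.0, 1.0], reg_lambda=1.0)
```

Because only the sums of $g_{i}$ and $h_{i}$ enter these expressions, any twice-differentiable loss can be plugged in by supplying its two derivatives.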

Based on the above analysis, the classification performance of XGBoost is indeed better than that of individual decision trees, but it uses overall accuracy as the optimization goal and assigns the same weight to different classes, so its effectiveness on minor classes is limited.
3.2 The proposed cost-fitting loss (CFL) function

Although ensemble learning is superior to single models in the classification of imbalanced data, it still assigns the same importance to different classes. If a reasonable class-imbalanced loss function is defined and the function is twice differentiable, it can be introduced into the framework described above. In other words, the performance of the XGBoost model on imbalanced records can be improved by embedding a class-imbalanced loss function.

Hence it is necessary to design a loss function that assigns different weights to samples in different classes. First, there is usually a huge difference between the sample numbers of major and minor classes, so the loss caused by major classes accounts for a large proportion and the weight adjustment is dominated by major classes. Secondly, a dataset usually includes easy samples and hard samples. Similarly, easy samples usually dominate the gradient update direction of the loss function, which results in ineffective learning. In addition, samples on the boundary are more important than those within classes.

In order to solve the aforementioned problems, a novel cost-fitting loss (CFL) function is designed to pay more attention to samples in minor classes, hard samples and boundary samples in the training stage. The CFL is defined as:

$\displaystyle L_{\text{CFL}}=c\left[-\alpha_{s,i}(1-p_{s,i})^{\gamma}\log(p_{s,i})\right]+(1-c)\left[\frac{1}{2}\max(0,1-y_{i}z_{i})^{2}\right]=cL_{F}+(1-c)L_{S}$ (5)

This function is composed of two terms. The first one (denoted ‘ $L_{F}$ ’) is an extended focal loss, which mainly measures the quantity and difficulty differences of samples. It adjusts the contributions of samples in minor classes by $\alpha$ and of hard samples by $\gamma$. $p_{s,i}$ is the predicted probability of the $i$-th sample, obtained by transforming the prediction $\hat{y}_{i}$ of Eq. (1) into a probability; it is explained in Section 3.2.1. The second term (denoted ‘ $L_{S}$ ’) is a smoothed hinge loss, which pays attention to boundary samples and is insensitive to outliers inside classes. $z_{i}$ is the raw prediction of the $i$-th sample, and it can be obtained directly from the classifier without being converted to a probability. Moreover, $c$ is the weight that balances the effect of these two loss terms. In the next two subsections, we explain these two loss terms separately.
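Eq. (5) can be evaluated for a single sample as in the sketch below. It assumes the binary setting with labels in $\{-1,+1\}$ and a sigmoid link between $z_{i}$ and the probability, as introduced in Section 3.2.1; the default values of `alpha`, `gamma` and `c` are arbitrary placeholders, and the function names are illustrative.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def cfl_loss(z, y, alpha=0.5, gamma=2.0, c=0.5):
    """Eq. (5) for one sample with raw score z and label y in {-1, +1}."""
    p = sigmoid(z)
    p_s = p if y == 1 else 1.0 - p                  # probability of the true class
    alpha_s = alpha if y == 1 else 1.0 - alpha      # class weight of the true class
    focal = -alpha_s * (1.0 - p_s) ** gamma * math.log(p_s)   # L_F
    hinge = 0.5 * max(0.0, 1.0 - y * z) ** 2                  # L_S
    return c * focal + (1.0 - c) * hinge


# A confidently correct sample costs far less than a misclassified one.
easy, hard = cfl_loss(2.0, 1), cfl_loss(-2.0, 1)
```

With these placeholder settings, a sample on the decision boundary ($z=0$) still pays the full hinge penalty of $0.5$, which is how the second term keeps boundary samples influential.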

3.2.1 Extended focal loss

The loss term $L_{F}$ addresses quantity and difficulty imbalance: it is designed to increase the costs of samples in minor classes and decrease the costs of easy samples. It is a modification of cross entropy (CE), so the derivation of $L_{F}$ starts from cross entropy [31]. The following equation is the cross entropy for binary classification.

$\displaystyle\textit{CE}(p,y)=\left\{\begin{array}{ll}-\log(p),&\text{if}\ y=1\\ -\log(1-p),&\text{otherwise},\end{array}\right.$ (6)

where $p\in[0,1]$ is the predicted probability for the class with label $y=1$. For convenience, $p_{s}$ is generally defined as:

$\displaystyle p_{s}=\left\{\begin{array}{ll}p,&\text{if}\ y=1\\ 1-p,&\text{otherwise}.\end{array}\right.$ (7)

Then Eq. (6) can be rewritten as $\textit{CE}(p,y)=\textit{CE}(p_{s})=-\log(p_{s})$. CE loss gives the same weight to all classes, so it is not suitable for imbalanced datasets. $L_{F}$ makes two main improvements on cross entropy to deal with the imbalance. First, a weighting factor $\alpha_{s}$ is introduced to balance the major and minor classes. Secondly, a modulating factor $(1-p_{s})^{\gamma}$ is added to differentiate between easy and hard samples. $L_{F}$ for binary classification is presented as Eq. (8).

$\displaystyle L_{F}(p_{s})=-\alpha_{s}(1-p_{s})^{\gamma}\log(p_{s})=\left\{\begin{array}{ll}-\alpha(1-p)^{\gamma}\log(p),&\text{if}\ y=1\\ -(1-\alpha)p^{\gamma}\log(1-p),&\text{otherwise}.\end{array}\right.$ (8)

$\alpha$ is a weighting factor that balances the importance of samples from different classes ( $\alpha\in[0,1]$ ). The definition of $\alpha_{s}$ is analogous to that of $p_{s}$: $\alpha$ is used for class $1$ and $1-\alpha$ for class $-1$. It depends on the class of a sample and can be set by inverse class frequency. Intuitively, the more samples a class contains, the smaller the weight assigned to it. For example, if the weight for the major class of a two-class dataset is greater than 0.5, more importance is assigned to samples in the major class; if it is less than 0.5, the loss function puts more weight on the minor class. In experiments, $\alpha$ can be treated as a hyperparameter and set by cross validation.
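One simple way to realize the inverse-class-frequency heuristic in the binary case is sketched below. This is an assumed normalization (the weights of the two classes sum to 1 and are proportional to the inverse class counts), not necessarily the exact rule used in the experiments; the function name and the toy label counts are hypothetical.

```python
from collections import Counter


def alpha_from_labels(labels, positive=1):
    """Weight alpha for the positive class; the other class receives 1 - alpha.

    alpha equals the share of samples that are NOT positive, so the ratio
    alpha / (1 - alpha) equals n_negative / n_positive (inverse frequency).
    """
    counts = Counter(labels)
    n = len(labels)
    return (n - counts[positive]) / n


# Toy data: 20 positive (minor) and 80 negative (major) samples.
alpha = alpha_from_labels([1] * 20 + [-1] * 80)
```

Here the minor positive class receives the larger weight and the major class the smaller one; the value can then serve as a starting point for cross validation.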

$\gamma$ is a focusing parameter ( $\gamma>0$ ), which is related to the difficulty of samples. When a sample is misclassified and $p_{s}$ is small (a hard sample), the modulating factor $(1-p_{s})^{\gamma}$ is near 1 and the loss is almost unaffected. As $p_{s}$ approaches 1 (an easy sample), the factor goes to 0 and the cost of the easy sample is down-weighted. Moreover, $\gamma$ smoothly adjusts the rate at which easy samples are down-weighted. Taking $\gamma=2$ for example, a sample with probability $p_{s}=0.9$ (so that $1-p_{s}=0.1$) has a loss contribution 100 times lower than with cross entropy, and as $\gamma$ increases, the effect of this factor also increases.
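The arithmetic in this example can be reproduced with a few lines. The snippet only evaluates the modulating factor $(1-p_{s})^{\gamma}$ for a hard, a borderline and an easy sample; the chosen probabilities and values of $\gamma$ are illustrative.

```python
def modulating_factor(p_s, gamma):
    """Factor (1 - p_s)**gamma that scales the cross-entropy term in L_F."""
    return (1.0 - p_s) ** gamma


# Hard (p_s = 0.1), borderline (p_s = 0.5) and easy (p_s = 0.9) samples.
for gamma in (0.5, 1.0, 2.0, 5.0):
    factors = [modulating_factor(p_s, gamma) for p_s in (0.1, 0.5, 0.9)]
    print(gamma, factors)
```

For $\gamma=2$ the easy sample is scaled by about $0.01$, while the hard sample keeps about $0.81$ of its cross-entropy cost, and the gap widens as $\gamma$ grows.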

Therefore, $L_{F}$ can increase the weights of minor classes and decrease the weights of easy samples through these two hyperparameters. The next critical step is to embed $L_{F}$ into the XGBoost classification framework, which requires its first and second order derivatives. With respect to $p$, they are:

$\displaystyle\frac{\partial L_{F}(p_{s})}{\partial p}=\left\{\begin{array}{ll}\alpha\left(\gamma(1-p)^{\gamma-1}\log p-\frac{1}{p}(1-p)^{\gamma}\right),&\text{if}\ y=1\\ -(1-\alpha)\left(\gamma p^{\gamma-1}\log(1-p)-\frac{1}{1-p}p^{\gamma}\right),&\text{otherwise}.\end{array}\right.$ (9)

$\displaystyle\frac{\partial^{2}L_{F}(p_{s})}{\partial p^{2}}=\left\{\begin{array}{ll}-\alpha\left((\gamma^{2}-\gamma)(1-p)^{\gamma-2}\log p-2\gamma\frac{(1-p)^{\gamma-1}}{p}-\frac{(1-p)^{\gamma}}{p^{2}}\right),&\text{if}\ y=1\\ -(1-\alpha)\left((\gamma^{2}-\gamma)p^{\gamma-2}\log(1-p)-2\gamma\frac{p^{\gamma-1}}{1-p}-\frac{p^{\gamma}}{(1-p)^{2}}\right),&\text{otherwise}.\end{array}\right.$ (10)

The sigmoid function is then selected as the activation for two-class datasets, that is, $p_{i}=\sigma(z_{i})$, and the following basic property of the sigmoid is used throughout the derivation:

$\displaystyle\frac{\partial p}{\partial z}=\frac{\partial\sigma(z)}{\partial z}=\sigma(z)(1-\sigma(z))=p(1-p)$ (11)

Taking the sigmoid into consideration, the first and second order derivatives of $L_{F}$ are presented as:

$\displaystyle\frac{\partial L_{F}}{\partial z}=\left\{\begin{array}{ll}\alpha\left(\gamma p(1-p)^{\gamma}\log p-(1-p)^{\gamma+1}\right),&\text{if}\ y=1\\ -(1-\alpha)\left(\gamma p^{\gamma}(1-p)\log(1-p)-p^{\gamma+1}\right),&\text{otherwise}.\end{array}\right.$ (12)

$\displaystyle\frac{\partial^{2}L_{F}}{\partial z^{2}}=\left\{\begin{array}{ll}\alpha p(1-p)^{\gamma}\left(\gamma(1-p-\gamma p)\log p+(2\gamma+1)(1-p)\right),&\text{if}\ y=1\\ (1-\alpha)p^{\gamma}(1-p)\left(\gamma(p+\gamma p-\gamma)\log(1-p)+(2\gamma+1)p\right),&\text{otherwise}.\end{array}\right.$ (13)

Finally, with the first order derivative in Eq. (12) and the second order derivative in Eq. (13), $L_{F}$ can be embedded into XGBoost framework. As a result of this loss term, the base classification framework can focus more on minor classes and hard samples.
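Hand-derived gradients of this kind are easy to get wrong, so a quick numerical check is worthwhile. The sketch below implements Eq. (8) and Eq. (12) for labels in $\{-1,+1\}$ and compares the analytic derivative with a central finite difference; the helper names and the test values of $\alpha$ and $\gamma$ are illustrative.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def focal_loss(z, y, alpha, gamma):
    """Eq. (8) evaluated at p = sigmoid(z), with y in {-1, +1}."""
    p = sigmoid(z)
    if y == 1:
        return -alpha * (1 - p) ** gamma * math.log(p)
    return -(1 - alpha) * p ** gamma * math.log(1 - p)


def focal_grad(z, y, alpha, gamma):
    """Eq. (12): derivative of L_F with respect to the raw score z."""
    p = sigmoid(z)
    if y == 1:
        return alpha * (gamma * p * (1 - p) ** gamma * math.log(p)
                        - (1 - p) ** (gamma + 1))
    return -(1 - alpha) * (gamma * p ** gamma * (1 - p) * math.log(1 - p)
                           - p ** (gamma + 1))


def numeric_grad(z, y, alpha, gamma, eps=1e-6):
    """Central finite difference of the loss, used only as a sanity check."""
    return (focal_loss(z + eps, y, alpha, gamma)
            - focal_loss(z - eps, y, alpha, gamma)) / (2 * eps)


# Largest disagreement over a small grid of scores and both labels.
max_gap = max(abs(focal_grad(z, y, 0.7, 2.0) - numeric_grad(z, y, 0.7, 2.0))
              for y in (1, -1) for z in (-2.0, -0.5, 0.3, 1.7))
```

The same finite-difference idea can be applied to the gradient itself to validate a second derivative before it is handed to the boosting framework.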

3.2.2 Smoothed hinge loss

Since $L_{F}$ does not account for the importance of boundary samples for classification performance, a smoothed hinge loss is proposed in this paper to address this problem. In machine learning, hinge loss is especially good at dealing with the classification boundary [48]. It operates directly on the classification score $z$ instead of the probability normalized by the sigmoid activation function. The most common expression of the loss function for binary classification is as follows:

$\displaystyle L=\max(0,1-y_{i}z_{i})=\left\{\begin{array}{ll}0,&\text{if}\ y_{i}z_{i}\geq 1\\ 1-y_{i}z_{i},&\text{otherwise},\end{array}\right.$ (14)

where $y_{i}$ is the true class label of the $i$-th sample $x_{i}$ and $z_{i}$ is the prediction score of $x_{i}$. The constant 1 in the formula represents the classification threshold. Intuitively, if a sample satisfies $y_{i}z_{i}\geq 1$, it is far enough from the boundary and easy to classify, so it has little influence on the boundary and contributes nothing to the loss. In other words, once the distance between a correctly classified sample and the boundary exceeds the threshold 1, the sample incurs no cost, which makes the loss function insensitive to samples lying deep inside a class. Finally, hinge loss is not limited to binary classification; it can be extended to multiclass data [49].

However, hinge loss is piecewise linear: it is not differentiable at $y_{i}z_{i}=1$ and its second derivative is zero everywhere else, so it cannot provide the second-order information required by the framework. Hence, a smoothed loss function $L_{S}$ is introduced in this paper, and its first and second order derivatives are derived. The proposed $L_{S}$ still attaches importance to boundary samples and guarantees good performance. It is defined as:

$\displaystyle L_{S}=\frac{1}{2}{\text{max}(0,1-y_{i}z_{i})}^{2}$ (15)

Its first and second order derivatives are then computed. They are denoted as:

$\displaystyle\frac{\partial L_{S}}{\partial z_{i}}=-y_{i}\max(0,1-y_{i}z_{i})$ (16)

$\displaystyle\frac{\partial^{2}L_{S}}{\partial z_{i}^{2}}=\left\{\begin{array}{ll}y_{i}^{2},&\text{if}\ y_{i}z_{i}<1\\ 0,&\text{otherwise}.\end{array}\right.$ (17)
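Differentiating Eq. (15) twice gives a loss whose gradient and curvature are non-zero only for samples with margin $y_{i}z_{i}<1$. A minimal sketch for labels in $\{-1,+1\}$ follows; the function names are illustrative.

```python
def smoothed_hinge(z, y):
    """Eq. (15): half the squared hinge margin for label y in {-1, +1}."""
    margin = max(0.0, 1.0 - y * z)
    return 0.5 * margin * margin


def smoothed_hinge_grad(z, y):
    """First derivative with respect to z; zero once y*z >= 1."""
    return -y * max(0.0, 1.0 - y * z)


def smoothed_hinge_hess(z, y):
    """Second derivative with respect to z; y**2 inside the margin, else 0."""
    return float(y * y) if y * z < 1 else 0.0


# A boundary sample (z = 0) versus a sample well inside its class (z = 2).
boundary = (smoothed_hinge(0.0, 1), smoothed_hinge_grad(0.0, 1), smoothed_hinge_hess(0.0, 1))
interior = (smoothed_hinge(2.0, 1), smoothed_hinge_grad(2.0, 1), smoothed_hinge_hess(2.0, 1))
```

The boundary sample yields a loss of 0.5 with unit curvature, whereas the interior sample contributes nothing, which is the insensitivity to interior samples described above.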

3.2.3 Proposed IMLBoost algorithm

By combining the first and second order derivatives of $L_{F}$ and $L_{S}$ with the weight $c$, the derivatives of the CFL function are obtained as Eqs (18) and (19). The CFL is then integrated into the classification framework, which makes the model pay more attention to minor classes, hard samples and boundary samples and perform better at identifying heterogeneous records.

$\displaystyle\frac{\partial L_{\textit{CFL}}}{\partial z}=c\frac{\partial L_{F}}{\partial z}+(1-c)\frac{\partial L_{S}}{\partial z}$ (18)

$\displaystyle\frac{\partial^{2}L_{\textit{CFL}}}{\partial z^{2}}=c\frac{\partial^{2}L_{F}}{\partial z^{2}}+(1-c)\frac{\partial^{2}L_{S}}{\partial z^{2}}$ (19)
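Eqs (18) and (19) can be turned into a routine that returns per-sample gradient and second-derivative lists, which is the pair of quantities a custom objective has to supply to a gradient boosting library. The sketch below is a plain-Python illustration under several assumptions: labels are in $\{-1,+1\}$, the second derivative of $L_{F}$ is written by differentiating Eq. (12) once more using Eq. (11), and the `min_hess` floor is an extra numerical safeguard introduced here, not part of the method described in this paper. All names and default values are hypothetical.

```python
import math


def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def cfl_loss(z, y, alpha, gamma, c):
    """Eq. (5) for one sample; used here only to check the derivatives."""
    p = sigmoid(z)
    p_s = p if y == 1 else 1.0 - p
    a_s = alpha if y == 1 else 1.0 - alpha
    focal = -a_s * (1.0 - p_s) ** gamma * math.log(p_s)
    return c * focal + (1.0 - c) * 0.5 * max(0.0, 1.0 - y * z) ** 2


def focal_grad_hess(z, y, alpha, gamma):
    """Derivatives of L_F w.r.t. z, expressed through p_s and alpha_s."""
    p = sigmoid(z)
    p_s = p if y == 1 else 1.0 - p
    a_s = alpha if y == 1 else 1.0 - alpha
    log_ps = math.log(p_s)
    grad_ps = a_s * (gamma * p_s * (1 - p_s) ** gamma * log_ps
                     - (1 - p_s) ** (gamma + 1))
    hess = a_s * p_s * (1 - p_s) ** gamma * (
        gamma * (1 - p_s - gamma * p_s) * log_ps + (2 * gamma + 1) * (1 - p_s))
    return y * grad_ps, hess  # dp_s/dz = y * p_s * (1 - p_s) fixes the sign


def hinge_grad_hess(z, y):
    """Derivatives of L_S w.r.t. z; y**2 = 1 for labels in {-1, +1}."""
    margin = 1.0 - y * z
    return (-y * margin, 1.0) if margin > 0 else (0.0, 0.0)


def cfl_grad_hess(scores, labels, alpha=0.5, gamma=2.0, c=0.5, min_hess=1e-6):
    """Eqs (18) and (19) per sample; the second derivative is floored at min_hess."""
    grads, hessians = [], []
    for z, y in zip(scores, labels):
        g_f, h_f = focal_grad_hess(z, y, alpha, gamma)
        g_s, h_s = hinge_grad_hess(z, y)
        grads.append(c * g_f + (1 - c) * g_s)
        hessians.append(max(c * h_f + (1 - c) * h_s, min_hess))
    return grads, hessians


grads, hessians = cfl_grad_hess([-1.0, 0.2, 2.5], [1, -1, 1])
```

With the XGBoost Python package, a function of this kind would typically be wrapped so that it reads the labels from the training matrix and returns arrays, then passed as the custom objective; that wrapper is omitted here to keep the sketch dependency-free.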

In summary, a novel IMLBoost algorithm is proposed in this work to deal with the data imbalance problem in intelligent diagnosis by embedding the designed CFL function into the boosting framework. The proposed CFL function includes two terms: $L_{F}$ and $L_{S}$. $L_{F}$ takes the quantity and difficulty imbalance of samples into account and involves two imbalance parameters, $\alpha$ and $\gamma$; $L_{S}$ pays attention to the boundary samples and is combined with $L_{F}$ through the parameter $c$. Through the above mathematical derivation and analysis, the first and second order derivatives of the CFL function are provided to the XGBoost framework to deal with the imbalance problem. Therefore, IMLBoost contains the parameters of the original XGBoost as well as parameters designed specifically for class imbalance.

The diagram of the proposed model is shown in Fig. 1. The IMLBoost algorithm includes two main stages: training and prediction. The objective (‘obj’) of the training stage involves the embedding of the CFL function. The second stage produces the predictions for samples.

Figure 1. The framework of the IMLBoost.

4. Experiments and results

In this section, five imbalanced medical datasets are used to evaluate the performance based on three evaluation metrics. Some techniques including feature selection for high-dimensional datasets and parameter tuning are used to improve the performance of models. The proposed IMLBoost model is compared with other machine learning methods and class-imbalanced methods to verify its superiority.

4.1 Datasets and metrics

4.1.1 Datasets

In order to verify that the IMLBoost model is highly applicable to the problem of imbalance, five datasets from the UCI database [50, 51, 52] are chosen: Diabetes, Column, Heart failure, Arrhythmia, and Hypothyroidism. They all have imbalanced target labels, and their associated task is classification. In addition, some of them contain missing values. Table 1 shows the characteristics of these datasets, where ‘samples’ is the number of samples, ‘classes’ is the number of target classes, ‘features’ is the number of attributes and ‘imbalance ratio’ is the ratio between the maximum and minimum number of samples among all classes. The larger the imbalance ratio, the more imbalanced the dataset.

Table 1
Datasets for experiment

Dataset	Samples	Classes	Features	Imbalance ratio
Diabetes	768	2	8	1.87
Column	310	3	6	2.50
Heart failure	299	2	12	2.11
Arrhythmia	452	16	279	112.50
Hypothyroid	3163	2	25	20.70
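As a small illustration of the ‘imbalance ratio’ column, the sketch below computes the ratio between the largest and smallest class sizes from a list of labels. The helper name `imbalance_ratio` is hypothetical, and the 500/268 class split is an assumed example chosen because it reproduces the value 1.87 listed for a two-class dataset of 768 samples.

```python
from collections import Counter

def imbalance_ratio(labels):
    """Ratio between the largest and the smallest class size."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())

# Assumed split for illustration: 500 samples of class 0 and 268 of class 1.
labels = [0] * 500 + [1] * 268
ratio = imbalance_ratio(labels)
print(round(ratio, 2))  # prints 1.87
```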

A detailed description of two datasets is presented: the low-dimensional Heart failure dataset and the high-dimensional Arrhythmia dataset. The former contains the records of 299 patients collected during the follow-up period. Each patient record has 12 clinical features, including 5 categorical attributes (sex, anaemia, smoking, high_blood_pressure and diabetes) and 7 continuous attributes (e.g. age, ejection_fraction and serum_sodium). Its target is death or survival. The latter is divided into 16 classes. It has a total of 452 samples, and each sample contains 279 attributes, 206 of which are continuous and the rest categorical. The attributes are denoted by the ordinal labels X1, X2, $\ldots$, X279. The first 4 attributes are general information (age, sex, height and weight), and the remaining 275 attributes are derived from standard 12-lead ECG records. This dataset contains many missing values.

4.1.2 Evaluation metrics

The choice of evaluation metric is very important for measuring the performance of models on imbalanced data [53]. Three measures are used to assess the classification performance in this work: F1-score, Geometric-mean (G-mean) [54], and the Area Under the ROC Curve (AUC) [55], which have been widely used in imbalanced classification. According to the confusion matrix, TP (TN) is the number of positive (negative) samples classified correctly, and FN (FP) is the number of positive (negative) samples classified wrongly. These three metrics are computed as follows:

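$\displaystyle\textit{F1-score}=\frac{2\times\textit{sensitivity}\times\textit{precision}}{\textit{sensitivity}+\textit{precision}},$

$\displaystyle\textit{G-mean}=\sqrt{\textit{sensitivity}\times\textit{specificity}},$

$\displaystyle\textit{AUC}=\frac{1+\textit{TPR}-\textit{FPR}}{2},$

where precision $=$ TP/(TP $+$ FP), sensitivity $=$ TP/(TP $+$ FN) and specificity $=$ TN/(TN $+$ FP); TPR $=$ sensitivity is the true positive rate and FPR $=$ FP/(FP $+$ TN) $=1-$ specificity is the false positive rate. F1-score is the harmonic mean of precision and recall (sensitivity). G-mean is the geometric mean of sensitivity and specificity. AUC is the area under the ROC curve, which plots the true positive rate against the false positive rate; the expression above is its value for a single operating point. Unlike plain accuracy, these metrics are not dominated by the majority class, so they are commonly used to evaluate classification on imbalanced data.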
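The three metrics can be sketched directly from the confusion-matrix counts. The snippet below is a minimal illustration for the binary case; it assumes that the TP and FP terms in the AUC expression are read as the true and false positive rates, and the function name `imbalance_metrics` and the example counts are hypothetical.

```python
import math

def imbalance_metrics(tp, fp, fn, tn):
    """F1-score, G-mean and single-operating-point AUC from binary counts."""
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)      # recall, i.e. true positive rate
    specificity = tn / (tn + fp)
    fpr = fp / (fp + tn)              # false positive rate = 1 - specificity
    f1 = 2 * sensitivity * precision / (sensitivity + precision)
    g_mean = math.sqrt(sensitivity * specificity)
    auc = (1 + sensitivity - fpr) / 2
    return {"f1": f1, "g_mean": g_mean, "auc": auc}

# Made-up counts for illustration only.
scores = imbalance_metrics(tp=40, fp=20, fn=10, tn=30)
print(scores)
```

For multiclass datasets such as Arrhythmia, per-class values would additionally have to be averaged; the averaging scheme is not specified here, so it is left out of the sketch.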

4.2 Experimental design

These medical datasets may have problems such as missing values and outliers. Therefore, some preprocessing is performed before training, including filling of missing values and standardization of continuous attributes. The models are then trained and tested with proper parameter settings. In all experiments, 80% of the data are randomly selected for training and the remaining 20% are used for testing. To better illustrate the validity of the proposed model, two groups of comparative experiments are conducted. Firstly, the proposed method is compared with three machine learning classification models. Secondly, it is compared with five other anti-imbalance boosting methods.
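The random 80%/20% split can be sketched as follows. This is only an illustration: the helper name `train_test_split_indices` and the fixed seed are assumptions, and the split is purely random (not stratified), as the text describes.

```python
import random

def train_test_split_indices(n_samples, test_fraction=0.2, seed=0):
    """Shuffle sample indices and hold out a fraction of them for testing."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]

# Example with the 299 samples of the Heart failure dataset.
train_idx, test_idx = train_test_split_indices(299)
print(len(train_idx), len(test_idx))  # prints 239 60
```

With a highly imbalanced dataset, a purely random split may leave very few minority samples in the test set, which is one reason results can vary between runs.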

4.2.1 Data preprocessing and feature analysis

Figure 2.

Exploratory data analysis. (a), (c) and (e) show the class distribution, categorical attribute analysis and continuous attribute analysis of the Heart failure dataset; (b), (d) and (f) show the corresponding analyses of the Arrhythmia dataset.

It is important to preprocess these datasets because the quality of the data may greatly affect the classification ability of models. Noise and missing values are common in real-world data, and many models are sensitive to them. On the one hand, model performance degrades as the number of missing values increases, and linear models are generally sensitive to them. The proposed model is tree-based, and missing values are not considered when nodes are split [13], so it is not sensitive to missing values. On the other hand, noisy data interferes with model learning. The boosting framework underlying the proposed model includes parameters that adjust pruning to tolerate noise. Therefore, the proposed model can also mitigate the impact of noise to a certain extent.

The preprocessing includes filling of missing values, dummy-variable encoding and standardization of continuous attributes. More specifically, the targets are label encoded. The missing values of categorical variables are filled with the mode, and the missing values of continuous variables are filled with the mean. Continuous attributes are rescaled with the min-max method. Descriptive statistics and visualizations of the Heart failure and Arrhythmia datasets are then presented in Fig. 2.
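The imputation and scaling steps described above can be sketched as follows. This is a minimal illustration on single columns in which missing entries are represented by `None`; the helper names `fill_missing` and `min_max_scale` and the example values are assumptions.

```python
from collections import Counter

def fill_missing(column, categorical):
    """Fill None entries with the mode (categorical) or the mean (continuous)."""
    observed = [v for v in column if v is not None]
    if categorical:
        fill = Counter(observed).most_common(1)[0][0]
    else:
        fill = sum(observed) / len(observed)
    return [fill if v is None else v for v in column]

def min_max_scale(column):
    """Rescale a numeric column linearly to the interval [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:                      # constant column: avoid division by zero
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]

# Made-up example columns.
sex = fill_missing([1, 0, None, 1], categorical=True)
age = fill_missing([40.0, None, 60.0, 80.0], categorical=False)
age_scaled = min_max_scale(age)
print(sex, age, age_scaled)
```

In practice the fill values and the minimum and maximum would be estimated on the training split only and then reused on the test split, so that no test information leaks into training.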

Figure 3.

Heat map of correlation analysis for the Arrhythmia dataset.

In the exploratory data analysis, the number of samples in each class is shown in Fig. 2(a) and (b). It is clear from the pie charts that the class labels are imbalanced. Next, the histograms in Fig. 2(c) and (d) show the relationship between categorical attributes and labels. Taking the attributes ‘existence of derivation of R’ and ‘sex’ in Fig. 2(d) as examples, when ‘existence of derivation of R’ $=$ 0 there are no samples in class 10, which indicates that this attribute is useful for recognizing this class. In contrast, all classes appear among both male and female patients, which indicates that ‘sex’ is of little significance for classification. Finally, the relationship between continuous attributes and targets is analyzed. Some attributes are shown in Fig. 2(e) and (f), and their distributions are approximately Gaussian. If an attribute has a skewed distribution, it can be transformed to better meet the normality assumption required by statistical tests (e.g. the ANOVA test). According to the results of such tests, the features that are statistically significant and useful for classification can then be identified.
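As an illustration of the statistical test mentioned above, the following sketch computes the one-way ANOVA F-statistic of a continuous attribute grouped by class label. The function name `anova_f` and the toy groups are assumptions; turning the statistic into a p-value additionally requires the F distribution, which is omitted here.

```python
def anova_f(groups):
    """One-way ANOVA F-statistic for a list of groups of numeric values."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of the samples around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Toy attribute values for three classes (made up for illustration).
f_stat = anova_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(f_stat)
```

A large F-statistic means that the attribute differs more between classes than within them, which is the sense in which a feature is judged useful for classification.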

In addition to the univariate analysis above, correlation analysis is conducted to examine the relationships between features. It measures the strength of the linear relationship between two features. If two variables change in the same direction, they are positively correlated; conversely, if they change in opposite directions, they are negatively correlated. The correlation matrix of the Arrhythmia dataset is shown in Fig. 3. A light color can be observed between X28 and X171, which shows that the correlation between them is strong. Strongly correlated features indicate redundancy in the dataset, so the feature selection described in the next section is necessary.
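The correlation analysis can be sketched as follows: the Pearson coefficient is computed for every pair of features, and pairs whose absolute correlation exceeds a threshold are flagged as potentially redundant. The threshold of 0.9, the helper names and the toy feature columns are assumptions for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def redundant_pairs(features, threshold=0.9):
    """Return (name_a, name_b, r) for pairs with |r| at or above the threshold."""
    names = sorted(features)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(features[a], features[b])
            if abs(r) >= threshold:
                pairs.append((a, b, r))
    return pairs

# Toy columns (made up): X_a and X_b move together, X_c does not.
features = {
    "X_a": [1, 2, 3, 4, 5],
    "X_b": [2, 4, 6, 8, 10.5],
    "X_c": [5, 1, 4, 2, 3],
}
pairs = redundant_pairs(features)
print(pairs)
```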

4.2.2 Feature selection

It is vital to select meaningful features to train models. Irrelevant and redundant features are not useful for classification and may even harm performance. For example, patients’ names do not contribute to disease diagnosis at all, but they increase the complexity of the learning process and affect learning performance. This is especially pronounced for high-dimensional datasets, which makes feature selection essential. Generally speaking, feature selection methods can be divided into three types: wrappers, filters, and embedded methods [56]. Wrapper methods select feature subsets according to their predictive power, so feature selection and model training affect each other; filter methods select subsets as a preprocessing step, independently of model training; embedded methods perform feature selection during training and are usually specific to given learners: a machine learning model is trained first, the importance of each feature is then obtained, and a subset is selected according to this importance.

The IMLBoost model adopts the embedded method to select features, and it is applied to the high-dimensional Arrhythmia dataset. At first, all the data are input into IMLBoost for training, and the importance of some features is shown in Fig. 4. It can be seen that f14 is the most dominant factor. It represents heart rate (number of heartbeats per minute), which is consistent with clinical experience. Experiments are then conducted on this dataset using the top-$k$ features, i.e. the $k$ features with the highest importance. The results of the IMLBoost model with the top-15, top-20 and top-25 features and with all features are shown in Fig. 5. From these curves, it can be seen that performance first improves as more features are used, with the top-25 features giving a better result. Over a number of experiments, both the time consumption and the classification performance with the top-25 features were found to be satisfactory. In contrast, the performance with all features decreases, possibly because the lower-ranked features are noisy or not useful for classification. Finally, these 25 features are used for the classification task.
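The top-$k$ selection step can be sketched as follows, assuming that the trained model exposes a mapping from feature names to importance scores. The helper names and the importance values are hypothetical; only the fact that f14 ranks first is taken from the text.

```python
def top_k_features(importance, k):
    """Names of the k features with the highest importance scores."""
    ranked = sorted(importance, key=importance.get, reverse=True)
    return ranked[:k]

def select_columns(rows, feature_names, keep):
    """Keep only the columns listed in `keep`, in that order."""
    idx = [feature_names.index(name) for name in keep]
    return [[row[i] for i in idx] for row in rows]

# Hypothetical importance scores; the numbers are made up for illustration.
importance = {"f14": 0.31, "f3": 0.05, "f90": 0.12, "f7": 0.20}
feature_names = ["f3", "f7", "f14", "f90"]
rows = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]]

keep = top_k_features(importance, k=2)
reduced = select_columns(rows, feature_names, keep)
print(keep, reduced)
```

Repeating the training and evaluation for several values of $k$ and comparing the resulting curves gives the kind of comparison described above.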

Figure 4.

The importance of features in IMLBoost.

Figure 5.

ROC curve with different numbers of features.

Figure 6.

Results of IMLBoost on the Arrhythmia dataset with respect to (a) learning_rate (b) max_depth (c) $\alpha$ and $\gamma$ (d) weight (e) both learning_rate and max_depth.

4.2.3 Parameter tuning

In this section, some parameters of the proposed model are fine-tuned. The IMLBoost model is constructed by optimizing five parameters: learning_rate, max_depth, alpha, gamma and the weight c. The other parameters are left at their default values, and four-fold cross-validation is used in the experiments.
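The four-fold cross-validation used during tuning can be sketched as an index generator. This is an illustration only: the function name `k_fold_indices` is hypothetical, and samples are assigned to folds in an interleaved fashion without shuffling, which the text does not specify.

```python
def k_fold_indices(n_samples, k=4):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, validation

# Example with 10 samples: every sample is used for validation exactly once.
splits = list(k_fold_indices(10, k=4))
for train, validation in splits:
    print(len(train), validation)
```

Each candidate parameter setting would be scored by averaging the validation metric over the four folds.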

Table 2
Experiment results on datasets(%)

Dataset	Decision tree	SVM	SMOTEBoost [22]	RUSBoost [24]	AdaCost [27]	MEBoost [21]	Imbalance-XGBoost [17]	XGBoost (None)	XGBoost + LS	XGBoost + LF	XGBoost + CFL
F1-score
Diabetes	74.16	79.16	78.88	87.18	74.15	77.23	75.31	77.33	79.82	77.54	80.90
Column	82.40	77.84	85.66	89.00	70.63		85.73	85.67	90.57	84.15	90.18
Heart failure	83.54	76.04	85.26	90.28	72.61	93.27	84.88	83.54	83.33	86.83	88.41
Arrhythmia	64.79	54.74	72.54	66.90	45.65		62.66	69.53	67.58	67.01	68.94
Hypothyroid	98.88	97.28	98.46	52.80	88.61	99.20	98.31	99.22	99.37	99.26	99.37
Average	75.34	55.64	63.18	83.04	76.48		81.38	79.56	81.95	78.34	83.83
G-mean
Diabetes	54.90	58.82	74.51	86.82	82.35	56.86	70.65	66.67	68.63	60.48	66.67
Column	83.96	79.66	86.74	91.50	77.85		87.82	86.75	91.75	85.44	90.69
Heart failure	78.94	42.10	84.21	92.39	89.47	84.21	73.68	78.94	73.68	73.68	84.21
Arrhythmia	71.38	44.44	75.41	90.06	32.77		51.95	71.67	31.43	70.69	95.32
Hypothyroid	87.50	53.16	90.63	61.46	100	90.63	71.87	93.75	93.75	93.75	93.75
Average	80.75	77.01	82.30	85.45	70.33		71.19	83.05	84.13	82.96	86.13
AUC
Diabetes	74.19	82.58	81.57	93.05	82.68	80.20	81.80	83.21	83.53	83.34	84.72
Column	91.02	95.27	96.68	97.52	84.91		96.81	97.14	96.77	96.96	97.93
Heart failure	86.07	87.03	94.74	95.64	87.16	93.71	94.99	92.17	89.47	95.51	96.02
Arrhythmia	84.40	96.40	94.16	92.45	91.67		93.82	94.22	96.29	91.53	96.53
Hypothyroid	96.63	97.41	99.19	97.64	99.42	98.32	99.72	99.30	99.45	98.00	99.69
Average	86.46	91.73	74.51	76.33	71.91		93.43	93.21	93.10	93.07	94.85

Figure 7.

ROC curve of experiments. (a) (c) represent ML models and class-imbalanced methods on the Arrhythmia dataset; (b) (d) are results of the Heart failure dataset.

Taking the Arrhythmia dataset as an example, the important parameters used in the experiments are described. The number of iterations is set to 50 and min_child_weight is set directly to 2. The learning_rate and max_depth of IMLBoost are optimized as shown in Fig. 6(a), (b) and (e). Each is searched over a fixed range: learning_rate over [0, 0.6] with step 0.02 and max_depth over [2, 10] with step 1. According to the figure, the best performance is obtained when learning_rate $=$ 0.06 and max_depth $=$ 4, so these values are used in the experiments. For the parameters alpha, gamma and c of the proposed loss function, the range of alpha is [0.5, 1] with step 0.1 and the range of gamma is [0, 5] with step 1. The results are shown as the surface plot in Fig. 6(c). The performance of the model is optimal when alpha $=$ 1 and gamma $=$ 3. After alpha and gamma are determined, the weight parameter $c$ is tuned. Its range is initially [0.1, 0.9] with step 0.2, and $c=$ 0.9 works best in this coarse search. A finer search between 0.9 and 1 is then carried out, and the best result is obtained when $c=$ 0.96. The tuning results for $c$ are shown in Fig. 6(d). Note that the optimal values of the above parameters differ from dataset to dataset.
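The grid search over alpha and gamma can be sketched as follows. The scoring function `toy_score` is a placeholder with no relation to the real model; its peak is placed at alpha $=$ 1 and gamma $=$ 3 purely so that the example has a known answer. In an actual experiment it would be replaced by the cross-validated metric of IMLBoost. The helper names `frange` and `grid_search` are hypothetical.

```python
from itertools import product

def frange(start, stop, step):
    """Inclusive, evenly spaced values; rounding avoids floating-point drift."""
    n = round((stop - start) / step)
    return [round(start + i * step, 10) for i in range(n + 1)]

def grid_search(score_fn, grid):
    """Evaluate every parameter combination and return the best one."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

def toy_score(params):
    # Placeholder objective (an assumption), NOT a measured model score.
    return -((params["alpha"] - 1.0) ** 2) - 0.01 * (params["gamma"] - 3) ** 2

# Ranges and steps as stated in the text for alpha and gamma.
grid = {"alpha": frange(0.5, 1.0, 0.1), "gamma": frange(0, 5, 1)}
best_params, best_score = grid_search(toy_score, grid)
print(best_params)
```

The coarse-to-fine search for $c$ follows the same pattern: a first call with a wide, coarse grid, then a second call with a narrow grid around the best coarse value.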

4.2.4 Validation of IMLBoost

To validate the IMLBoost, comparative experiments are organized in two parts. Firstly, the method is compared with other machine learning classification methods, namely decision tree, SVM and XGBoost. Secondly, the IMLBoost is compared with existing anti-imbalance methods, namely SMOTEBoost, RUSBoost, AdaCost, MEBoost and Imbalance-XGBoost. Additionally, LFBoost ($L_{F}+$ XGBoost) and LSBoost ($L_{S}+$ XGBoost) are also tested for a more comprehensive comparison.

The experimental results of these methods on the Heart failure and Arrhythmia datasets are presented in Fig. 7. The IMLBoost model has a higher AUC than the other methods, which supports its effectiveness. Further, the classification results of these methods on all datasets are shown in Table 2, which lists the F1-score, G-mean and AUC on each dataset and the averages over the five datasets. Because MEBoost does not apply to multiclass datasets, it is excluded from the overall comparison. From the table, it can be seen that the IMLBoost yields the best average rank in terms of all metrics. For F1-score, our method is 0.5% higher than the second-best method; for G-mean, it is 0.7% higher than the second-ranked method; for AUC, it is 1.6% higher than the second-ranked method. In summary, the experimental results on these datasets confirm the effectiveness of the IMLBoost model.

5. Conclusions

In this paper, a new classification model for imbalanced medical records, IMLBoost, is proposed, which embeds the cost-fitting loss (CFL) function into a boosting framework. The CFL function assigns appropriate costs to samples in minor classes as well as to hard samples and boundary samples, which largely improves the classification accuracy of those samples. With the first and second derivatives of the CFL function, the loss can be optimized directly within the boosting framework. Moreover, some machine learning techniques (feature selection and parameter tuning) further improve the performance of the IMLBoost. The experimental results on five imbalanced medical datasets indicate that the proposed model performs better than the other methods on F1-score, G-mean and AUC. The proposed IMLBoost can be extended to the classification of other imbalanced datasets.

Footnotes

Acknowledgments

This work was supported by the Key Research Development program of Shandong Province [Grant 2019GGX101021] and the Taishan Scholars Program of Shandong Province [Grant NO.ts20190985].

References

1. Jain, Chotani and Anuradha, Disease diagnosis using machine learning: A comparative study, in: Data Analytics in Biomedical Engineering and Healthcare, Elsevier, 2021, pp. 145–161.
2. Arslan A.K., Colak and Sarihan, Different medical data mining approaches based prediction of ischemic stroke, Computer Methods Programs in Biomedicine 130 (2016), 87–92.
3. Wosiak, Feature selection and classification pairwise combinations for high-dimensional tumour biomedical datasets, Schedae Informaticae 24 (2015), 53–62.
4. Wang, Yan D.Q. and Liang H.X., Fuzzy Support Vector Machine with Imbalanced Regulator and its Application in Stroke Classification, in: 2019 IEEE Fifth International Conference on Big Data Computing Service and Applications (BigDataService), IEEE, 2019, pp. 290–295.
5. Liu, Chen and Qin, Privacy-Preserving Patient-Centric Clinical Decision Support System on Naïve Bayesian Classification, IEEE Journal of Biomedical Health Informatics 20(2) (2017), 655–668.
6. Al-Hadeethi, Abdulla, Diykh, Deo R.C. and Green J.H., Adaptive boost LS-SVM classification approach for time-series signal classification in epileptic seizure diagnosis applications, Expert Systems with Applications 161 (2020), 113676.
7. Alhakbani H.A. and Al-Rifaie M.M., Exploring feature-level duplications on imbalanced data using stochastic diffusion search, in: Multi-agent systems and agreement technologies, Springer, 2016, pp. 305–313.
8. Ventura-Molina, Alarcón-Paredes, Aldape-Pérez, Yáñez-Márquez and Adolfo Alonso, Gene selection for enhanced classification on microarray data using a weighted k-NN based algorithm, Intelligent Data Analysis 23(1) (2019), 241–253.
9. Alazrai, Momani, Khudair H.A. and Daoud M.I., EEG-based tonic cold pain recognition system using wavelet transform, Neural Computing and Applications 31(7) (2019), 3187–3200.
10. Wang C.W., Lee Y.C., Calista, Zhou, Zhu, Suzuki, Komura, Ishikawa and Cheng S.P., A benchmark for comparing precision medicine methods in thyroid cancer diagnosis using tissue microarrays, Bioinformatics 34(10) (2018), 1767–1773.
11. Wang, Shi and Wang, Robust propensity score computation method based on machine learning with label-corrupted data, arXiv preprint arXiv:1801.03132 (2018).
12. Ratsch, Soft Margins for AdaBoost, Machine Learning 42(3) (2001), 287–320.
13. Chen and Guestrin, XGBoost: A Scalable Tree Boosting System, in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, pp. 785–794.
14. Meng, Finley, Wang, Chen and Liu T.-Y., LightGBM: A Highly Efficient Gradient Boosting Decision Tree, Advances in Neural Information Processing Systems 30 (2017), 3146–3154.
15. Prokhorenkova, Gusev, Vorobev, Dorogush A.V. and Gulin, CatBoost: unbiased boosting with categorical features, arXiv preprint arXiv:1706.09516 (2017).
16. Chen, Zuo, Zheng and Ren, Radar emitter classification for large data set based on weighted-xgboost, IET Radar Sonar & Navigation 11(8) (2017), 1203–1207.
17. Wang, Deng and Wang, Imbalance-XGBoost: leveraging weighted and focal losses for binary label-imbalanced classification with XGBoost, Pattern Recognition Letters 136 (2020), 190–197.
18. Hou and Liu, An anti-noise ensemble algorithm for imbalance classification, Intelligent Data Analysis 23(6) (2019), 1205–1217.
19. Zhang, Huang, Yang and Jiang, Using Multi-features and Ensemble Learning Method for Imbalanced Malware Classification, in: 2016 IEEE Trustcom/BigDataSE/ISPA, IEEE, 2016, pp. 965–973.
20. Kang and Oh, Balanced training/test set sampling for proper evaluation of classification models, Intelligent Data Analysis 24(1) (2020), 5–18.
21. Rayhan, Ahmed, Mahbub, Jani M.R., Shatabda, Farid D.M. and Rahman C.M., MEBoost: Mixing estimators with boosting for imbalanced data classification, in: 2017 11th International Conference on Software, Knowledge, Information Management and Applications (SKIMA), IEEE, 2017, pp. 1–6.
22. Chawla N.V., Lazarevic, Hall L.O. and Bowyer K.W., SMOTEBoost: Improving Prediction of the Minority Class in Boosting, in: European Conference on Knowledge Discovery in Databases: PKDD, 2003, pp. 107–119.
23. Guo and Viktor H.L., Learning from imbalanced data sets with boosting and data generation: The DataBoost-IM approach, ACM SIGKDD Explorations Newsletter 6(1) (2004), 30–39.
24. Seiffert, Khoshgoftaar T.M., Van Hulse and Napolitano, RUSBoost: A Hybrid Approach to Alleviating Class Imbalance, IEEE Transactions on Systems Man Cybernetics Part A Systems Humans 40(1) (2010), 185–197.
25. Rayhan, Ahmed, Mahbub, Jani, Shatabda and Farid D.M., CUSBoost: Cluster-based under-sampling with boosting for imbalanced classification, in: 2017 2nd International Conference on Computational Systems and Information Technology for Sustainable Solution (CSITSS), IEEE, 2017, pp. 1–5.
26. Cui, Jia, Lin T.-Y., Song and Belongie, Class-balanced loss based on effective number of samples, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 9268–9277.
27. Fan, Stolfo S.J., Zhang and Chan P.K., AdaCost: misclassification cost-sensitive boosting, in: ICML, Vol. 99, Citeseer, 1999, pp. 97–105.
28. Freund and Schapire, Lecture Notes in Computer Science, Comput Learn Theory 55 (1970), 23–37.
29. Siers M.J. and Islam M.Z., Software defect prediction using a cost sensitive decision forest and voting, and a potential solution to the class imbalance problem, Information Systems 51 (2015), 62–71.
30. Wang, Chen and Wang, NIBoost: new imbalanced dataset classification method based on cost sensitive ensemble learning, Journal of Computer Applications 39(3) (2019), 629–633.
31. Lin T.Y., Goyal, Girshick and Dollár, Focal Loss for Dense Object Detection, in: 2017 IEEE International Conference on Computer Vision (ICCV), 2017, pp. 2999–3007.
32. Hossain M.S., Paplinski and Betts, Adaptive Class Weight based Dual Focal Loss for Improved Semantic Segmentation, arXiv preprint arXiv:1909.11932 (2019).
33. Sun, Dong, Sutcliffe, Chen and Feng, Drug-drug interaction extraction via recurrent hybrid convolutional neural networks with an improved focal loss, Entropy 21(1) (2019).
34. Ren, Zeng, Yang and Urtasun, Learning to reweight examples for robust deep learning, in: International Conference on Machine Learning, PMLR, 2018, pp. 4334–4343.
35. Koh P.W. and Liang, Understanding black-box predictions via influence functions, in: International Conference on Machine Learning, PMLR, 2017, pp. 1885–1894.
36. Wang, Tian and Liu, Adaptive FH-SVM for imbalanced classification, IEEE Access 7 (2019), 130410–130422.
37. Kabir M.F. and Ludwig, Classification of breast cancer risk factors using several resampling approaches, in: 2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA), IEEE, 2018, pp. 1243–1248.
38. Malisiewicz, Mulam and Efros, Ensemble of exemplar-SVMs for object detection and beyond, 2011, pp. 89–96.
39. Dong, Gong and Zhu, Class rectification hard mining for imbalanced deep learning, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 1851–1860.
40. Wang, Ramanan and Hebert, Learning to Model the Tail, in: Proceedings of the 31st International Conference on Neural Information Processing Systems, 2017, pp. 7032–7042.
41. Bengio, Sharing Representations for Long Tail Computer Vision Problems, in: Proceedings of the 2015 ACM on International Conference on Multimodal Interaction, 2015, pp. 1–1.
42. Huang, Loy C.C. and Tang, Learning Deep Representation for Imbalanced Classification, in: Computer Vision Pattern Recognition, 2016, pp. 5375–5384.
43. Zhang, Fang, Wen and Qiao, Range Loss for Deep Face Recognition with Long-tail, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 5409–5418.
44. Shu, Xie, Zhao, Zhou and Meng, Meta-Weight-Net: Learning an Explicit Mapping For Sample Weighting, arXiv preprint arXiv:1902.07379 (2019).
45. Brown, Yang M.-H., Wang and Gong, Rethinking Class-Balanced Methods for Long-Tailed Visual Recognition From a Domain Adaptation Perspective, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 7607–7616.
46. Zhou, Cui, Wei X.S. and Chen Z.M., BBN: Bilateral-Branch Network With Cumulative Learning for Long-Tailed Visual Recognition, in: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
47. Kang, Xie, Rohrbach, Yan, Gordo, Feng and Kalantidis, Decoupling Representation and Classifier for Long-Tailed Recognition, arXiv preprint arXiv:1910.09217 (2019).
48. Suykens and Vandewalle, Least Squares Support Vector Machine Classifiers, Neural Processing Letters 9(3) (1999), 293–300.
49. Hsu C.-W. and Lin C.-J., A Comparison of Methods for Multiclass Support Vector Machines, IEEE transactions on neural networks/a publication of the IEEE Neural Networks Council 13 (2002), 415–25.
50. Dua, UCI Machine Learning Repository, http://archive.ics.uci.edu/ml.
51. Guvenir H.A., Acar, Demiroz and Cekin, A supervised machine learning algorithm for arrhythmia analysis, in: Computers in Cardiology 1997, IEEE, 1997, pp. 433–436.
52. Chicco and Jurman, Machine learning can predict survival of patients with heart failure from serum creatinine and ejection fraction alone, BMC Medical Informatics and Decision Making 20(1) (2020).
53. Liu, Wang, Ren, Zhou and Diao, A Classification Method Based on Feature Selection for Imbalanced Data, IEEE Access 7 (2019), 1–1.
54. Akbani, Kwek and Japkowicz, Applying support vector machines to imbalanced datasets, in: European conference on machine learning, Springer, 2004, pp. 39–50.
55. Matthews B.W., Comparison of the predicted and observed secondary structure of T4 phage lysozyme, Biochim Biophys Acta 405(2) (1975), 442–451.
56. Guyon and Elisseeff, An Introduction to Variable and Feature Selection, Journal of Machine Learning Research 3(6) (2003), 1157–1182.
