RBSP-Boosting: A Shapley value-based resampling approach for imbalanced data classification

Abstract

Class distributions in real applications are often imbalanced, and traditional classifiers tend to secure the accuracy of the majority class while neglecting the minority class. To address this problem, this paper proposes a method called RBSP-Boosting for imbalanced data classification. First, RBSP-Boosting introduces the Shapley value and calculates the Shapley value of each sample in the dataset through the truncated Monte Carlo method. Next, the proposed method removes noise data according to the Shapley value and undersamples the majority-class samples whose Shapley values are less than zero. Then, it takes the Shapley value as the weight of each sample and oversamples the minority class according to this weight. Finally, an AdaBoost classifier is trained on the resampled dataset. Experiments are conducted on nine UCI and KEEL datasets, and RBSP-Boosting is compared with four sampling algorithms: Random-OverSampler, SMOTE, Borderline-SMOTE and SVM-SMOTE. Compared with the best-performing of the four comparison algorithms, RBSP-Boosting improves the average AUC, F-score and G-mean by 4.69%, 10.3% and 7.86%, respectively. These results indicate that the proposed method can substantially improve imbalanced data classification.

Keywords

Shapley value; resampling; imbalanced data; Monte Carlo

1. Introduction

Classification is a fundamental and important problem in data analysis and processing. It has many applications, such as text classification [1], open set classification [2] and multilabel classification [3]. Imbalanced data classification is one of the most actively studied issues. In an imbalanced dataset, the number of samples in one class is much lower than the number of samples in the other classes. In recent years, the problem of imbalanced data classification has become increasingly prominent and has received extensive attention from many researchers [4]; it is common in fields such as disease prediction [5], anomaly detection [6] and medical diagnosis [7]. For example, in malignancy diagnosis the diseased cases usually form the minority class. To maximize overall prediction accuracy, traditional classification algorithms may classify a malignant case as healthy, delaying patient treatment. Therefore, improving the classification of imbalanced data has become an important topic in machine learning and medical diagnosis.

Researchers have proposed many methods to solve the problem of imbalanced data classification. Behzad et al. [8] proposed a hybrid method based on the concept of density and clustering that uses K-means to select minority samples for oversampling and deletes the majority samples with low density. Intouch et al. [9] proposed an oversampling technique based on probability distribution, which eliminates noise data through the Z score method and generates new samples according to the probability distribution of each minority sample.

However, the existing studies are all based on the data distribution to generate a balanced dataset but do not consider the real impact of each sample on the classifier. The contribution of each sample to the classifier is different. Both negative and positive effects are possible. Accurately measuring the contribution of each sample to the classifier and sampling based on it can improve the effect of imbalanced data classification. The Shapley value was originally used to solve the problem of the distribution of interests of all parties in the cooperative game. In the field of machine learning, it is used for machine learning interpretability.

In response to the above problems, the paper uses the Shapley value to measure the real contribution of each sample to the classifier and proposes an imbalanced data classification method RBSP-Boosting (resampling based on Shapley-Boosting). First, RBSP-Boosting improves the calculation method of the Shapley value to reduce the complexity. Second, it calculates the Shapley value of each sample to measure the real contribution. Then, the method oversamples the minority samples with large Shapley values and deletes the majority samples with small Shapley values to generate a balanced dataset. It has better performance on AUC, F-score and G-mean than existing classic methods.

The main contributions of this paper are as follows.

(1)
Because computing the Shapley value exactly by enumerating all permutations of the samples has a time complexity of O(N!), we use the Monte Carlo method to recast the computation as an expectation problem, which reduces the time complexity to O(TN).
(2)
To consider the contribution of each data point to the classifier, we apply the Shapley value to the imbalanced data classification for the first time. The Shapley value is used to measure the contribution of each sample and generate a balanced dataset based on it. The performance of the proposed method on AUC, F-score and G-mean is improved.

The rest of the paper is organized as follows: Section 2 introduces the related work of imbalanced data classification, data estimation and Shapley value. Section 3 introduces the RBSP-Boosting method in detail. Section 4 verifies the effect of RBSP-Boosting on real datasets and compares it with the other four methods. Section 5 concludes the paper.

2. Background and related work

2.1 Imbalanced data classification

Currently, the research on the classification of imbalanced data focuses on three areas: data, features and algorithms. Figure 1 shows the generic framework of the imbalanced data classification problem and the problem to be solved in this paper. In the three stages of resampling, feature selection and classifier training, domain experts need to select appropriate algorithms according to domain knowledge and data characteristics.

Figure 1.

General framework for imbalanced data classification.

(1) Data level work

The resampling technique is mainly used to generate balanced datasets and thus transform the imbalanced data classification problem into a balanced one. Kubat et al. [10] proposed the one-sided selection (OSS) undersampling method based on K-nearest neighbors, which combines Tomek links with the condensed nearest neighbor method: Tomek links remove noisy samples and majority boundary samples, and the condensed nearest neighbor method removes majority samples far from the boundary, making the retained samples more valuable for learning. Guzmán-Ponce et al. [11] proposed the DBIG-US algorithm, a two-stage approach that filters noisy data with the DBSCAN clustering algorithm and then applies a graph-based process to address class imbalance. In addition, researchers have also proposed oversampling methods. SMOTE [12] is the most classic oversampling method; it generates new samples by linear interpolation between each minority sample and its K-nearest neighbors, giving the minority class a larger generalization space. Mathew et al. [13] addressed the shortcomings of the SMOTE algorithm and proposed a weighted kernel-based oversampling method, WK-SMOTE, which oversamples the minority samples directly in the feature space of the SVM classifier and overcomes the limitation of SMOTE on nonlinear problems.

(2) Feature level work

Feature selection methods are mainly used to select a subset of features from the full feature space. For example, Shahee et al. [14] proposed a distance-based feature selection method, ED-Relief, which employs a new distance metric that uses the geometric mean of JF scatter as well as the separation between classes to handle both within-class and between-class imbalance, overcoming the limitation of traditional feature selection methods that consider only between-class imbalance. Yang et al. [15] used sampling methods to draw multiple balanced datasets from an imbalanced dataset and then used an ensemble of classifiers trained on the balanced datasets to select feature subsets.

(3) Algorithmic level work

Commonly used methods are cost-sensitive learning [16] and ensemble learning. For example, Lucas et al. [17] proposed a cost-sensitive adaptive random forest algorithm (CSARF) to alleviate the problem of imbalanced data in a streaming processing environment. Freund et al. [18] proposed one of the most representative boosting integration methods, AdaBoost, which adaptively modifies the sample weights to reduce classification errors. The algorithm uses the entire dataset to train the classifier serially several times, and after each round of training, the sample weights are updated such that the weights of incorrectly classified samples are increased and the weights of correctly classified samples are decreased. Currently, researchers have combined ensemble learning methods with other imbalanced data classification methods to form a series of boosting algorithms based on resampling techniques to improve the classification of imbalanced data, such as CUSBoost [19].

The above work has limitations, such as relying too much on the spatial distance distribution, deleting valid samples to a certain extent, and easily generating noise samples. In response to these shortcomings, this paper introduces the Shapley value and makes improvements. First, the Shapley value is used to represent the weight of the sample during resampling. Second, the Shapley value of each sample is iteratively calculated by the Monte Carlo method, which reduces the time complexity. Moreover, the paper resamples the minority samples and the majority samples by oversampling and undersampling, respectively, and deletes noise data according to the Shapley value of each sample. It not only effectively retains the samples that have a positive impact on the prediction results but also generates effective new samples.

2.2 Shapley value

Lloyd Shapley [20] introduced the concept of the Shapley value, which is mainly used to solve the problem of benefit distribution among the parties in a cooperative game. The Shapley value reflects the degree to which each party contributes to the overall goal of the cooperation, which avoids egalitarian distribution and is more reasonable and fair. It also reflects the bargaining process among the cooperating parties.

With the development of machine learning technology, the problem of machine learning model interpretability has attracted the attention of scholars. The Shapley value method, as a means of data valuation, is widely used in machine learning model interpretability problems. Researchers use the Shapley value method to calculate the impact of each feature on the model effect as a way to explain the model results. For example, Wang et al. [21] proposed a Shapley flow method to interpret machine learning models based on Shapley values, which considers causal graphs to help understand the flow of importance and the factors that potentially intervene.

In recent years, several studies have used Shapley values to assess the contribution of different data points in a dataset when training a machine learning model. For example, Ghorbani et al. [22] proposed two heuristics to efficiently compute the Shapley value of each sample and demonstrated for the first time that Shapley values can be used to assess the impact of each sample on the effect of a machine learning model. Jia et al. [23] proposed an efficient method to compute Shapley values for samples under a K-nearest neighbor-based machine learning approach. Song et al. [24] defined the contribution index (CI) of different data providers based on the concept of the Shapley value for the application scenario of federated learning and proposed two effective methods for calculating CI.

It is very important to reasonably evaluate the impact of data. The Shapley value can measure the impact of each data point on the model effect. Therefore, this paper uses the Shapley value to calculate the contribution of each data and proposes an improvement approach for the high complexity of the Shapley value calculation method. At the same time, different sampling strategies are adopted for the dataset according to the Shapley value of each sample, which are undersampling the sample with a low Shapley value in the majority samples and oversampling the sample with a high Shapley value in the minority samples. While effectively deleting noisy data, it not only retains samples that are beneficial to the classifier but also reasonably generates minority samples, which improves the effect of the classifier.

3. RBSP-Boosting method

Resampling technology is an effective method to deal with imbalanced data, including undersampling and oversampling. However, it is easy to generate noise data or lose some useful data. To compensate for these two shortcomings, this paper proposes the RBSP-Boosting method. By introducing the Shapley value to measure the impact of each sample on the model, it aims to screen out the samples that have a positive and negative impact on the model effect and improve the sampling strategy based on the Shapley value. At the same time, in view of the high complexity of the Shapley value calculation method, this paper proposes an improvement approach. Then, the classification results are obtained and evaluated on the AdaBoost classifier. The flow chart of the RBSP-Boosting method is shown in Fig. 2. Domain experts need to process the initial dataset based on domain knowledge, such as missing values. In addition, they can also determine a more appropriate threshold based on the Shapley value of the data.

Figure 2.

Flow chart of RBSP-Boosting.

3.1 Design ideas

3.1.1 Calculating the Shapley value

The initial definition formula of Shapley value is:

$\displaystyle\phi_{i}(\nu)=\sum\limits_{s\in S_{i}}\omega(|s|)\left[V(s)-V(s-\{i\})\right]$ (1)

where $S_{i}$ represents the set composed of all the subsets containing sample $i$ in one dataset. $|s|$ is the number of elements in the set $s$ . $V(s)$ is the effect score of classifier ${\nu}$ on set $s$ . $\omega(|s|)$ represents the weighting factor and has the following formula:

$\displaystyle\omega({|s|})=\frac{({|s|-1})!({N-|s|})!}{N!}$ (2)

where $N$ is the number of elements in one dataset.
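As a small illustration of Eqs. (1) and (2), the following Python sketch computes exact Shapley values by enumerating every subset that contains a given sample. The three-sample "dataset" and the score function `toy_value` are hypothetical stand-ins for $V(s)$, not data from the experiments; a real score would require retraining the classifier on each subset.

```python
from itertools import combinations
from math import factorial


def exact_shapley(players, value):
    """Exact Shapley values via Eqs. (1)-(2): sum over all subsets s containing i."""
    n = len(players)
    phi = {}
    for i in players:
        others = [p for p in players if p != i]
        total = 0.0
        for k in range(n):  # k = number of other members in s, so |s| = k + 1
            for rest in combinations(others, k):
                s = frozenset(rest) | {i}
                # Weighting factor omega(|s|) from Eq. (2)
                weight = factorial(len(s) - 1) * factorial(n - len(s)) / factorial(n)
                total += weight * (value(s) - value(s - {i}))
        phi[i] = total
    return phi


def toy_value(s):
    """Hypothetical score: additive effects plus a bonus when samples 0 and 1 co-occur."""
    base = {0: 0.3, 1: 0.2, 2: -0.1}
    bonus = 0.1 if {0, 1} <= s else 0.0
    return sum(base[p] for p in s) + bonus


phi = exact_shapley([0, 1, 2], toy_value)
print(phi)
```

In this toy game the bonus is shared equally by samples 0 and 1, and sample 2 receives a negative value, which is the kind of sample the method later treats as noise. The values also sum to the score of the full set, a standard property of the Shapley value.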

According to Eq. (1), calculating the Shapley value requires traversing all subsets of the dataset that contain sample $i$. For a dataset with $N$ samples, there are $2^{N-1}$ such subsets. Therefore, the time complexity of calculating the exact Shapley value of each data point is $O(2^{N})$, which is unacceptable in practical applications when the dataset is large. To calculate the Shapley value of each sample more accurately in a shorter time, this paper improves the TMC-Shapley method proposed in [22] and uses the Monte Carlo method to simulate the process of random sampling. TMC-Shapley is an effective, practically tested method for calculating the Shapley value, but it still has shortcomings in the scenario of this paper. We therefore make two changes: the optimal tolerance $\lambda$ and the number of sampling rounds $T$ are determined by grid search, and the F-score replaces accuracy as the measure of the marginal contribution of each data point to the model. With these changes, a more accurate Shapley value can be obtained with fewer sampling rounds, which further reduces the time cost and suits imbalanced data classification better. Equation (3) gives the mathematical description of this optimization.

$\displaystyle (T,\lambda)=\mathop{\arg\max}\limits_{T,\lambda}\,\textit{F1\_score}(D,A)$ (3)

where $D$ is an imbalanced dataset, and $A$ is a learning model.

The Shapley value of the sample is estimated by the Monte Carlo method. First, the proposed method randomly samples the dataset. Second, it scans each data point in the random sample array from the beginning and calculates the marginal contribution of each newly added sample. Then, the method updates the Shapley value of the sample. Repeating the above steps on multiple Monte Carlo permutations until they tend to converge, we can obtain an unbiased Shapley estimate for each data point as follows:

$\displaystyle\phi_{i}=\mathbb{E}_{\pi\sim\Pi}\left[V(S_{\pi}^{i}\cup\{i\})-V(S_{\pi}^{i})\right]$ (4)

where $\Pi$ is the uniform distribution over all $N!$ permutations of data points; $\pi$ is one of the permutations, and $S_{\pi}^{i}$ is the set of samples before sample $i$ in $\pi$ . If $i$ is the first sample, $S_{\pi}^{i}=\emptyset$ .

The calculation of the Shapley value can thus be formalized as an expectation problem, as detailed in Algorithm 1.
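The permutation view of the Shapley value can be sketched in a few lines of Python. The sketch below averages marginal contributions over random permutations without truncation; the function name `monte_carlo_shapley` and the score `toy_value` are illustrative assumptions, and in practice $V$ would be the F-score of a model trained on the prefix set.

```python
import random


def monte_carlo_shapley(n, value, num_permutations, seed=0):
    """Estimate Shapley values by averaging marginal contributions over random permutations."""
    rng = random.Random(seed)
    phi = [0.0] * n
    for _ in range(num_permutations):
        perm = list(range(n))
        rng.shuffle(perm)
        prefix = set()                      # samples that precede i in the permutation
        prev = value(frozenset(prefix))     # score of the empty set
        for i in perm:
            prefix.add(i)
            cur = value(frozenset(prefix))
            phi[i] += (cur - prev) / num_permutations  # marginal contribution of i
            prev = cur
    return phi


def toy_value(s):
    """Hypothetical score: additive effects plus a bonus when samples 0 and 1 co-occur."""
    base = {0: 0.3, 1: 0.2, 2: -0.1}
    bonus = 0.1 if {0, 1} <= s else 0.0
    return sum(base[p] for p in s) + bonus


estimate = monte_carlo_shapley(3, toy_value, num_permutations=2000)
print(estimate)
```

For this toy score the exact Shapley values are 0.35, 0.25 and -0.10, so the estimate can be compared against them directly. Each permutation costs $N$ score evaluations, which is where the $O(TN)$ cost for $T$ permutations comes from.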

3.1.2 Resample with different Shapley value

For samples with different Shapley values, this paper adopts different resampling strategies to make better use of the samples with strong discriminative properties and eliminate noise data.

Step 1 Samples whose Shapley value is small or even negative have a negative impact on the model and are treated as noise data. Therefore, these samples are removed from the dataset to reduce the amount of noise data.

Step 2 For the minority class samples, their oversampling weights are assigned based on the Shapley value of the sample. The higher the Shapley value is, the greater the number of new samples generated based on the sample. After determining the sampling weight of each minority class sample, the selected samples are oversampled using the SMOTE algorithm.

The method takes into account the contribution of each sample to the model. It can not only delete the noise samples in the dataset but also select the highly discriminative samples for oversampling, which effectively compensates for the shortcomings of traditional imbalanced data processing methods. More importantly, it addresses the shortcomings of oversampling and undersampling at the same time. Fig. 3 shows the principle of sampling in this method.

3.2 Algorithm description

The main steps of RBSP-Boosting are as follows.

(1) For a given dataset, Algorithm 1 describes how to calculate the Shapley value for each sample.

Algorithm 1: Shapley Value Calculation
Input: Training set: D $=$ {1, …, N}, Learning model: A, Model performance score: V
Output: Shapley value of each data point in D: $\phi_{1},\ldots,\phi_{N}$
BEGIN
1. Initialize $\phi_{i}=0,\ i\in\{1,\ldots,N\}$ ;
2. Find the best sampling times $T$ and tolerance $\lambda$ through Grid Search;
3. for t in range (1, $T$ ):
4. Sample subset $S=\emptyset$ ;
5. Randomly shuffle D to get a random permutation $\pi^{t}$ ;
6. $v_{0}^{t}=V(\emptyset,A)$ ;
7. for j in range ( $1,\textit{len}(\pi^{t})$ ):
8. $S=S\cup\{\pi^{t}[j]\}$ ;
9. $V(S,A)=F1\_\textit{SCORE}(S,A)$ ;
10. if $|V(S,A)-V(D,A)|<\lambda$ :
11. $v_{j}^{t}=v_{j-1}^{t}$ ;
12. end if
13. else
14. $v_{j}^{t}=V(S,A)$ ;
15. end else
16. $\phi_{\pi^{t}[j]}=\phi_{\pi^{t}[j]}+(v_{j}^{t}-v_{j-1}^{t})/T$ ;
17. end for
18. end for
END.

Figure 3.

The sampling principle of this method.

Algorithm 1 improves the calculation of the Shapley value and reduces the time complexity from $O(2^{N})$ to $O(TN)$. For the application scenario of this paper, the sampling times $T$ and tolerance $\lambda$ are tuned through grid search, so that an accurate Shapley value can still be obtained with few iterations.
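The effect of truncation can be illustrated with a minimal Python sketch. It is not the implementation used in the experiments: `saturating_score` is a hypothetical stand-in for the F-score of a model that stops improving after a few samples, and the sketch applies the tolerance test to the previous prefix score before re-evaluating, which is an assumption made here because that ordering is what saves evaluations.

```python
import random


def tmc_shapley(n, value, num_permutations, tolerance, seed=0):
    """Truncated Monte Carlo Shapley estimate; also counts score evaluations."""
    rng = random.Random(seed)
    full_score = value(frozenset(range(n)))   # V(D, A), computed once
    empty_score = value(frozenset())          # V of the empty set
    phi = [0.0] * n
    evaluations = 0
    for _ in range(num_permutations):
        perm = list(range(n))
        rng.shuffle(perm)
        prefix = set()
        prev = empty_score
        for i in perm:
            prefix.add(i)
            if abs(full_score - prev) < tolerance:
                cur = prev                    # truncated: marginal contribution is zero
            else:
                cur = value(frozenset(prefix))
                evaluations += 1
            phi[i] += (cur - prev) / num_permutations
            prev = cur
    return phi, evaluations


def saturating_score(s):
    """Hypothetical score that grows by 0.2 per sample and saturates at 1.0."""
    return min(1.0, 0.2 * len(s))


phi, used = tmc_shapley(20, saturating_score, num_permutations=50, tolerance=0.01)
print(used, 50 * 20)
```

With this score, each permutation needs only five evaluations before the prefix score is within the tolerance of the full-data score, so 250 evaluations are used instead of the 1000 an untruncated run would need.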

(2) According to the Shapley value of each sample obtained by Algorithm 1, samples are oversampled or undersampled. Samples whose Shapley value is small or even negative have a negative impact on the model and are regarded as noise data. Therefore, Algorithm 2 first reduces the number of noise samples by deleting the samples with a small Shapley value from the dataset. Second, the samples in the minority class are sorted in descending order of Shapley value, and the Shapley value is used as the sampling weight. Then, the number of new samples generated for each data point is determined according to its sampling weight, and the SMOTE algorithm is used for oversampling.

Algorithm 2 describes how to obtain the balanced dataset after processing.

Algorithm 2: Resampling Data
Input: Minority dataset: $D^{+}=\{1,\ldots,n\}$ , Majority dataset: $D^{-}=\{1,\ldots,m\}$ , Shapley value of each data point: $\phi_{1}^{+},\ldots,\phi_{n}^{+};\phi_{1}^{-},\ldots,\phi_{m}^{-}$ .
Output: Sampled training set S.
BEGIN
1. Initialize $S=\emptyset$ ;
2. for i in $\{1,\ldots,m\}$ :
3. if $\phi_{i}^{-}<0$ or $\phi_{i}^{-}<\lambda$ :
4. $D^{-}=D^{-}-\{D^{-}[i]\}$ ;
5. end if
6. end for
7. for i in $\{1,\ldots,n\}$ :
8. if $\phi_{i}^{+}<\lambda$ :
9. $D^{+}=D^{+}-\{D^{+}[i]\}$ ;
10. end if
11. end for
12. $\textit{weight}=\textit{SUM}(\phi_{1}^{+},\ldots,\phi_{\textit{length}(D^{+})}^{+})$ ;
13. for i in range ( $\textit{length}(D^{+})$ ):
14. $\textit{weight}_{i}=\phi_{i}^{+}/\textit{weight}$ ; // Sampling weight of data i
15. $\textit{num}_{i}=\textit{weight}_{i}*(\textit{length}(D^{-})-\textit{length}(D^{+}))$ ;
16. $S=S\cup\textit{SMOTE}(D^{+}[i],\textit{num}_{i})$ ; // $\textit{num}_{i}$ is the number of new samples
17. end for
18. $S=S\cup D^{+}\cup D^{-}$ ;
END.

Algorithm 2 introduces the Shapley value to resample the imbalanced dataset, which takes into account the contribution of each sample to the classifier. To a certain extent, it alleviates the loss of useful data and the generation of noise data during the sampling process.
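The resampling logic of Algorithm 2 can be sketched as follows. This is a simplified illustration under stated assumptions: the function `shapley_resample`, the toy points and their Shapley values are hypothetical, rounding is used to turn weights into integer counts, and the SMOTE step is reduced to interpolation toward the single nearest retained minority sample rather than a random one of the K nearest neighbors.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def shapley_resample(minority, majority, phi_min, phi_maj, threshold=0.0, seed=0):
    """Drop low-Shapley samples, then oversample the minority class by Shapley weight."""
    rng = random.Random(seed)
    cutoff = max(0.0, threshold)  # samples below zero or below the threshold are removed
    kept_maj = [x for x, p in zip(majority, phi_maj) if p >= cutoff]
    kept = [(x, p) for x, p in zip(minority, phi_min) if p >= cutoff]
    kept_min = [x for x, _ in kept]
    total = sum(p for _, p in kept)
    deficit = max(0, len(kept_maj) - len(kept_min))
    synthetic = []
    if total > 0 and len(kept_min) >= 2:
        for x, p in kept:
            count = round(p / total * deficit)  # new samples allotted to x
            others = [y for y in kept_min if y is not x]
            neighbour = min(others, key=lambda y: squared_distance(x, y))
            for _ in range(count):
                gap = rng.random()              # SMOTE-style linear interpolation
                synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, neighbour)))
    return kept_min + synthetic, kept_maj


minority = [(1.0, 1.0), (1.2, 0.9), (3.0, 3.0)]
phi_min = [0.3, 0.1, -0.2]
majority = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (0.3, 0.3), (0.2, 0.4), (1.1, 1.0)]
phi_maj = [0.1, 0.1, 0.1, 0.1, 0.1, -0.05]

new_min, new_maj = shapley_resample(minority, majority, phi_min, phi_maj)
print(len(new_min), len(new_maj))
```

In this toy case one negative-valued sample is removed from each class, and the three missing minority samples are generated around the two retained ones in proportion to their Shapley values. Because of rounding, the counts do not always sum exactly to the deficit for other inputs.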

(3) Through Algorithm 2, a balanced dataset is obtained. Then, the proposed method imports the AdaBoost classifier in Python for training and prediction and calculates the AUC value, F-score value and G-mean value of the model.

The RBSP-Boosting method improves the complexity of the Shapley value calculation method and uses the Shapley value to measure the contribution of each data point to the classifier. According to the Shapley value of each data point, it adopts different sampling strategies for the majority class and the minority class. While effectively deleting noisy data, it not only retains samples that are beneficial to the model but also reasonably generates minority class samples, which can theoretically improve the effect of imbalanced data classification. Next, the effectiveness and stability of the method are verified through experiments.

4. Experiments and analysis

4.1 Experimental setup

4.1.1 Datasets description

To verify the effectiveness of the proposed method, this paper selects nine imbalanced datasets from the UCI and KEEL machine learning repositories for experiments and analysis. The imbalance ratio (IR) of the datasets ranges from 1.87 to 853, a wide span, so the datasets test the RBSP-Boosting method under different degrees of imbalance. They are widely used in experiments in the related literature [5, 6, 8, 10]. Most of them are medical data, which matches the medical application scenario of this paper. Detailed information on the datasets is shown in Table 1, where IR is the ratio of the number of majority-class samples to the number of minority-class samples.

Table 1
Datasets information

Dataset	Features	Minority	Majority	Samples	IR
pima	8	268	500	768	1.87
glass0	9	70	144	214	2.06
haberman	3	81	225	306	2.78
blood	4	178	570	748	3.20
ecoli1	7	77	259	336	3.36
ecoli2	7	52	284	336	5.46
yeast3	8	163	1321	1484	8.10
yeast0359vs78	8	50	456	506	9.12
shuttle	9	2	1706	1708	853

Two remarks on the datasets are in order. First, the proposed algorithm resamples according to the Shapley value of each sample, and samples with low Shapley values are deleted as noise data; noise in the datasets therefore should not degrade the algorithm. Second, the algorithm cannot process data with missing values, so domain experts are required to preprocess the dataset.

4.1.2 Experimental environment and design

To verify the performance of the RBSP-Boosting method, it is compared with four sampling algorithms: Random-OverSampler, SMOTE, Borderline-SMOTE and SVM-SMOTE. These algorithms are classic, widely adopted techniques for learning from imbalanced data and are commonly used in imbalanced data processing problems, including medical diagnosis. The same AdaBoost classifier, with a support vector machine as the base classifier, is used to evaluate all of these algorithms. Ten independent runs of tenfold cross-validation are performed, for three reasons. First, repetition increases the number of experiments and improves the reliability of the results. Second, it reduces the influence of accidental factors on the overall experimental results. Third, the same protocol is applied to all five methods, which preserves the conditions of a controlled experiment.

The code in this paper is written in Python. The sampling algorithms Random-OverSampler, SMOTE, Borderline-SMOTE and SVM-SMOTE use the implementations in the third-party Python library imbalanced-learn. The classifier is the AdaBoost classifier from the third-party Python library scikit-learn. The experiments are run on Windows 10 with Python 3.6 and TensorFlow 2.4.

4.1.3 Evaluation metric

In the problem of imbalanced data classification, if the model classifies all samples into the majority class, it can obtain a higher accuracy rate. However, the minority class is misclassified, and the effect of the model is not good. Therefore, the accuracy and error rates are not enough to accurately measure the effect of the model. To measure the effect of the model more comprehensively, this paper uses AUC (area under the curve of ROC), F-score and G-mean, commonly used in imbalanced data classification, as the assessment metrics of the model effect. Among them, AUC is the area under the ROC curve, which is widely applied to estimate the accuracy of imbalanced models with all possible scopes of thresholds. The values of the F-score and G-mean are obtained from the confusion matrix. The larger the value, the better the effect of the model. The confusion matrix is shown in Table 2.

Table 2
Confusion matrix

	Classified as positive	Classified as negative
Actual positive	TP	FN
Actual negative	FP	TN

The larger the value of the F-score is, the better the effect of the model. It is defined as follows:

$\displaystyle\textit{F-score}=\frac{(1+\beta^{2})\times\textit{Recall}\times\textit{Precision}}{\beta^{2}\times\textit{Recall}+\textit{Precision}}$ (5)

Recall is the recall rate, which is defined as follows:

$\displaystyle\textit{Recall}=\frac{TP}{TP+FN}$ (6)

The precision is the precision rate, which is defined as follows:

$\displaystyle\textit{Precision}=\frac{TP}{TP+FP}$ (7)

When the prediction accuracy of the minority class and the majority class are both high, the value of G-mean is high. It is defined as follows:

$\displaystyle\textit{G-mean}=\sqrt{\frac{TP}{TP+FN}\times\frac{TN}{TN+FP}}$ (8)

AUC, F-score and G-mean consider the classifier’s ability to classify positive and negative samples at the same time. Compared with indicators such as accuracy, they can better reflect the effect of imbalanced data classifiers and are widely used to evaluate the effects of different algorithms on imbalanced data classification.
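The confusion-matrix metrics of Eqs. (5) to (8) can be checked with a short Python sketch. The counts below are invented for illustration, and returning zero when a denominator is zero is a convention assumed here, not something the definitions above prescribe.

```python
from math import sqrt


def safe_div(num, den):
    """Return num / den, or 0.0 when the denominator is zero (assumed convention)."""
    return num / den if den else 0.0


def imbalance_metrics(tp, fn, fp, tn, beta=1.0):
    recall = safe_div(tp, tp + fn)            # Eq. (6)
    precision = safe_div(tp, tp + fp)         # Eq. (7)
    specificity = safe_div(tn, tn + fp)       # true negative rate
    f_score = safe_div((1 + beta ** 2) * recall * precision,
                       beta ** 2 * recall + precision)   # Eq. (5)
    g_mean = sqrt(recall * specificity)       # Eq. (8)
    accuracy = safe_div(tp + tn, tp + fn + fp + tn)
    return {"accuracy": accuracy, "f_score": f_score, "g_mean": g_mean}


# A classifier that finds most minority (positive) samples
balanced = imbalance_metrics(tp=40, fn=10, fp=20, tn=130)
# A classifier that assigns every sample to the majority (negative) class
trivial = imbalance_metrics(tp=0, fn=50, fp=0, tn=150)
print(balanced)
print(trivial)
```

The second classifier still reaches an accuracy of 0.75 on these counts, while its F-score and G-mean are both zero, which is the reason accuracy alone is not used as an evaluation metric.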

4.2 Experimental results and analysis

To eliminate the randomness of the experimental results, this paper uses the tenfold cross-validation method to divide the dataset and conducts ten times tenfold cross-validation. Then, it performs sampling algorithm processing on the divided dataset and finally takes the average value as the result of the experiment. Each time the dataset is divided into a training set and a test set, 90% is used as the training set and 10% is used as the test set.
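A minimal Python sketch of how ten repetitions of tenfold cross-validation can be generated is given below. Stratifying the folds by class is an assumption of this sketch that the protocol above does not specify, and the label vector only mimics the class sizes of the haberman dataset from Table 1. In each split, the resampling algorithm would be applied to the training indices, with the test indices left untouched.

```python
import random
from collections import defaultdict


def repeated_stratified_kfold(labels, n_splits=10, n_repeats=10, seed=0):
    """Yield (train_indices, test_indices) for repeated stratified k-fold splitting."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    for _ in range(n_repeats):
        folds = [[] for _ in range(n_splits)]
        for indices in by_class.values():
            shuffled = indices[:]
            rng.shuffle(shuffled)
            # Deal the samples of each class round-robin so every fold keeps the class ratio
            for pos, idx in enumerate(shuffled):
                folds[pos % n_splits].append(idx)
        for k in range(n_splits):
            test = sorted(folds[k])
            train = sorted(i for j, fold in enumerate(folds) if j != k for i in fold)
            yield train, test


# 81 minority (1) and 225 majority (0) labels, the class sizes of haberman in Table 1
labels = [1] * 81 + [0] * 225
splits = list(repeated_stratified_kfold(labels))
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```

Each of the resulting splits holds out roughly 10% of the samples for testing, and the reported score is the average over all splits.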

Tables 3 to 5 show the comparison results of AUC, F-score and G-mean classified by different sampling methods, and the best result is shown in bold black. In the last row of each table, the average scores of the five sampling methods on all datasets are given, and the best scores are indicated in bold black.

Table 3
AUC value

DataSet	Random over sampler	SMOTE	Borderline SMOTE	SVM SMOTE	RBSP Boosting
pima	0.579806	0.577075	0.571209	0.582537	0.753075
glass0	0.798115	0.776885	0.780952	0.787698	0.812401
haberman	0.578272	0.564938	0.551605	0.568642	0.631358
blood	0.652149	0.659344	0.660753	0.652119	0.684595
ecoli1	0.860126	0.867322	0.858371	0.875219	0.872411
ecoli2	0.859155	0.863489	0.876625	0.900190	0.874052
yeast3	0.911359	0.920143	0.923250	0.916319	0.890903
yeast0359vs78	0.688596	0.690789	0.684079	0.693947	0.710526
shuttle	0.713700	0.785421	0.713992	0.785714	0.850000
Average	0.737920	0.745045	0.735648	0.751376	0.786591
The Number of Wins	0	0	1	2	6

Table 4

F-score value

DataSet	Random over sampler	SMOTE	Borderline SMOTE	SVM SMOTE	RBSP Boosting
pima	0.336088	0.327778	0.319559	0.344262	0.679577
glass0	0.729927	0.700730	0.703448	0.714286	0.748201
haberman	0.393258	0.380405	0.368421	0.374269	0.462428
blood	0.471338	0.479657	0.481172	0.470588	0.513317
ecoli1	0.770186	0.771084	0.742857	0.765714	0.792453
ecoli2	0.780000	0.769231	0.796117	0.807339	0.816327
yeast3	0.769648	0.791209	0.780749	0.783562	0.795181
yeast0359vs78	0.381679	0.387597	0.384000	0.413793	0.450450
shuttle	0.998243	0.998828	0.998536	0.999122	0.999122
Average	0.625596	0.622947	0.619429	0.630326	0.695228
The Number of Wins	0	0	0	1	9

Table 5

G-mean value

DataSet	Random over sampler	SMOTE	Borderline SMOTE	SVM SMOTE	RBSP Boosting
pima	0.460580	0.453453	0.447664	0.467568	0.809419
glass0	0.793700	0.771517	0.779194	0.784270	0.809419
haberman	0.559492	0.549098	0.538504	0.541501	0.616197
blood	0.651523	0.658655	0.660590	0.642290	0.678773
ecoli1	0.858370	0.866568	0.858254	0.875205	0.870724
ecoli2	0.852193	0.858329	0.872180	0.898566	0.867744
yeast3	0.910472	0.919411	0.922839	0.915488	0.887206
yeast0359vs78	0.662266	0.663920	0.652929	0.660144	0.678621
shuttle	0.654270	0.755707	0.654462	0.755929	0.836660
Average	0.711430	0.721851	0.709624	0.726773	0.783863
The Number of Wins	0	0	1	2	6

Figure 4.

Average rank of five methods on nine datasets.

Figure 5.

ROC curves of five methods on nine datasets.

Figure 6.

Average results of five methods on nine datasets.

Figure 7.

AUC under different iteration times on the nine datasets.

Figure 8.

F-score under different iteration times on the nine datasets.

Figure 9.

G-mean under different iteration times on the nine datasets.

It can be seen from Table 3 that when the RBSP-Boosting method is used to classify these datasets, it achieves the best AUC value on six of the nine datasets. Compared with the SVM-SMOTE algorithm, which has the best average effect among the four comparison algorithms, the average AUC value on these datasets increases from 0.751376 to 0.786591, a relative increase of 4.69%.

Table 4 shows that the proposed method achieves the best F-score value on all datasets, tying with SVM-SMOTE on the shuttle dataset. Compared with the SVM-SMOTE algorithm, which has the best average effect among the four comparison algorithms, the average F-score on all datasets increases from 0.630326 to 0.695228, a relative increase of 10.30%.

It can be seen from Table 5 that when the proposed method is used to classify these datasets, it achieves the best G-mean value on six of the nine datasets. Compared with SVM-SMOTE, the best of the four comparison methods on average, the average G-mean on all datasets increases from 0.726773 to 0.783863, a relative increase of 7.86%.
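The relative improvements quoted for the three metrics follow directly from the Average rows of Tables 3 to 5. The short Python check below recomputes them as (proposed - baseline) / baseline, using the SVM-SMOTE average as the baseline; the variable names are chosen for illustration.

```python
# (SVM-SMOTE average, RBSP-Boosting average) taken from the Average rows of Tables 3-5
averages = {
    "AUC": (0.751376, 0.786591),
    "F-score": (0.630326, 0.695228),
    "G-mean": (0.726773, 0.783863),
}

gains = {}
for name, (baseline, proposed) in averages.items():
    gains[name] = 100 * (proposed - baseline) / baseline  # relative gain in percent
    print(f"{name}: {gains[name]:.2f}%")
```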

The paper draws the average ranking of five methods on the assessment metric scores on nine datasets. Figure 4 shows that the average ranking of the proposed method in AUC, F-score and G-mean is the best.

In addition, Figure 5 plots the ROC curves of the five methods on the nine datasets, which demonstrates the effectiveness of the proposed method more intuitively.

In most of the ROC plots, RBSP-Boosting attains a higher true-positive rate than the other methods when the false-positive rate is small, indicating that the method achieves better classification results.
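To make the reading of the ROC plots concrete, the sketch below builds ROC points from labels and classifier scores, reads off the true-positive rate attainable at a small false-positive rate, and computes the area under the curve with the trapezoidal rule. The labels and scores are toy values invented for illustration, and the function names are hypothetical; they are not taken from the experiments.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs obtained by lowering the decision threshold step by step."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    pairs = sorted(zip(scores, y_true), key=lambda p: -p[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(pairs):
        threshold = pairs[i][0]
        # Consume all samples that share this score before emitting a point.
        while i < len(pairs) and pairs[i][0] == threshold:
            if pairs[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points


def tpr_at_fpr(points, max_fpr):
    """Highest TPR reachable while keeping FPR at or below `max_fpr`."""
    return max(tpr for fpr, tpr in points if fpr <= max_fpr)


def auc_trapezoid(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Toy data: 1 marks the minority (positive) class.
y_true = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]

points = roc_points(y_true, scores)
low_fpr_tpr = tpr_at_fpr(points, 0.2)
area = auc_trapezoid(points)
print(low_fpr_tpr, round(area, 4))
```

A curve that rises faster near the left edge of the plot yields a larger `tpr_at_fpr` value for the same false-positive budget, which is the behavior described above.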

To further demonstrate the effectiveness of the method, Fig. 6 shows the results averaged over all datasets. RBSP-Boosting achieves the best average AUC, F-score and G-mean on the nine datasets, and its classification performance is better than that of the other four methods.

To verify the stability of the proposed method, we repeat the experiments under different iteration times, keeping all other experimental settings the same as above. The resulting AUC, F-score and G-mean values are shown in Figs 7–9.

Figs 7 to 9 show that the proposed method performs stably under different iteration times on all datasets except yeast0359vs78, which illustrates the stability and effectiveness of the method.

The greater the contribution of a sample point to the model's performance, the stronger its impact on that performance. The Shapley value measures the contribution of each data point to the model's performance. Based on the Shapley values of the data, we make full use of the samples with larger Shapley values for oversampling and delete the samples with smaller Shapley values. In this way, the method retains the effective samples while constructing a balanced dataset, which is the key to improving the model's performance. This explains why the proposed method outperforms the comparison methods on multiple datasets.
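The resampling idea described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the Shapley values are taken as given (toy numbers here), majority samples with negative Shapley values are dropped, and minority samples are duplicated with replacement with probability proportional to their Shapley values until the classes are balanced. Clipping negative minority weights to zero, duplicating rather than synthesizing samples, and the function name `shapley_resample` are all assumptions made for this example; the separate noise-removal step is omitted.

```python
import random


def shapley_resample(labels, shapley, minority=1, seed=0):
    """Return the indices of a rebalanced dataset.

    Assumptions (not taken from the paper): oversampling duplicates existing
    minority samples, and negative minority weights are clipped to zero.
    """
    rng = random.Random(seed)

    # Undersampling: keep only majority samples whose Shapley value is not negative.
    majority_idx = [i for i, y in enumerate(labels)
                    if y != minority and shapley[i] >= 0]
    minority_idx = [i for i, y in enumerate(labels) if y == minority]

    # Oversampling: draw minority samples with weight proportional to Shapley value.
    weights = [max(shapley[i], 0.0) for i in minority_idx]
    deficit = len(majority_idx) - len(minority_idx)
    extra = []
    if deficit > 0 and sum(weights) > 0:
        extra = rng.choices(minority_idx, weights=weights, k=deficit)

    return majority_idx + minority_idx + extra


# Toy example: six majority samples (label 0) and two minority samples (label 1).
labels = [0, 0, 0, 0, 0, 0, 1, 1]
shapley = [0.02, -0.01, 0.03, 0.01, -0.02, 0.04, 0.05, 0.0]

selected = shapley_resample(labels, shapley)
n_minority = sum(labels[i] for i in selected)
print(len(selected), n_minority)
```

In the toy example, the two majority samples with negative Shapley values are removed, and the minority class grows to the size of the remaining majority class by repeating its higher-valued sample.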

Table 6 compares the RBSP-Boosting method against several classic and representative processing methods for solving the problem of imbalanced data classification.

Table 6

Comparison of handling methods for imbalanced data classification

Method	Features	Shortcomings
One-Sided Selection	Classifies sample points and handles each type of sample in a targeted manner	May delete some valid samples
DBIG-US	Divides the sampling process into two targeted stages	Depends heavily on the distance distribution in space
SMOTE	Makes effective use of the neighbors of each sample point for sampling	Generated samples are somewhat random and can easily introduce noise
WK-SMOTE	Oversamples in the SVM feature space and can handle nonlinear problems	Only valid for SVM classifiers; not general
RBSP-Boosting	Measures the contribution of each data point to the model and samples accordingly	Cannot yet address multiclass problems

Table 7

Running time comparison table

Method	Average running time
Random-OverSampler	5.06s
SMOTE	5.33s
Borderline-SMOTE	5.37s
SVM-SMOTE	5.87s
RBSP-Boosting	85.62s

Taken together, Tables 3 to 6 and Figs 4 to 9 show that RBSP-Boosting achieves the best results on most datasets. Its overall performance is the best on all three evaluation metrics (AUC, F-score and G-mean), and it remains stable under different iteration times. The proposed method is therefore effective and can be applied to practical imbalanced data classification problems.

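In addition, Table 7 compares the average running time of the methods for reference. RBSP-Boosting takes 85.62 s on average, compared with 5.06 s to 5.87 s for the four comparison methods.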

5. Conclusion

This paper proposes an imbalanced data processing method based on the Shapley value to address two problems: oversampling algorithms easily generate redundant and noisy data, and undersampling algorithms easily remove valuable, discriminative data. First, the RBSP-Boosting method calculates the Shapley value of each data point in the dataset. Second, it constructs a balanced dataset by processing samples differently according to their Shapley values. Finally, it uses the AdaBoost classifier for classification. On the one hand, the method effectively reduces the redundant data generated during oversampling; on the other hand, it makes effective use of discriminative data. The experimental results on nine KEEL and UCI datasets show that RBSP-Boosting improves AUC, F-score and G-mean over the four comparison methods. Two problems worth studying in future work are the multiclass classification and the multilabel classification of imbalanced data. We will also improve the performance measure used in calculating the Shapley value, and further study how to reasonably divide the dataset into different categories and apply a different sampling method to each.

Acknowledgments

This work is supported in part by the Natural Science Foundation of China under Grants 61762008 and 62162003, and by the National Key Research and Development Project of China under Grant 2018YFB1404404.
