Abstract
Research on biomedical data is crucial for disease diagnosis, health management, and drug development. However, biomedical data are usually characterized by high dimensionality and class imbalance, which increase computational cost and degrade classification performance on the minority class, making accurate classification difficult. In this paper, we propose a biomedical data classification method based on feature selection and data resampling. First, the minimal-redundancy maximal-relevance (mRMR) method is used to select features, which reduces the feature dimension and the computational cost and improves generalization ability. Then, a new SMOTE oversampling method (Spectral-SMOTE) is proposed, which addresses the noise sensitivity of SMOTE by means of an improved spectral clustering method. Finally, the marine predators algorithm is improved using piecewise linear chaotic maps and a random opposition-based learning strategy to enhance its search ability and convergence speed, and the improved algorithm (PRMPA) is used to optimize the key parameters of Spectral-SMOTE, which further improves the performance of the oversampling approach. Five real biomedical datasets are used to evaluate the proposed method with four classifiers and three evaluation metrics, in comparison with seven data resampling methods. The experimental results show that the method effectively improves the classification performance on biomedical data. Statistical test results also show that the proposed PRMPA-Spectral-SMOTE method outperforms the other data resampling methods.
Introduction
In recent years, with the continuous progress and development of science and technology, the quantity and quality of biomedical data have been significantly improved. Effective analysis of biomedical data can help researchers better understand the mechanism of disease occurrence and help doctors more accurately diagnose and treat diseases, so how to analyze biomedical data scientifically and accurately has become an urgent problem. The complexity of biomedical data makes it impossible to analyze accurately by traditional methods, and machine learning and data mining have become important means of biomedical data analysis [1].
In biomedical data, the dimensionality of features is usually much higher than the number of samples [2], and the dimensionality of features may reach thousands or even tens of thousands, while the number of samples is usually small, which will lead to the “curse of dimensionality” problem [3, 4]. The “curse of dimensionality” problem leads to a decrease in the predictive performance of machine learning algorithms and an increase in the risk of algorithm overfitting [5]. Feature selection can remove redundant and noisy information from high-dimensional data, select the most relevant features, improve accuracy, reduce computational cost, and avoid overfitting, so feature selection has become an important method for biomedical data research [6, 7].
In addition to high dimensionality, biomedical data usually suffer from class imbalance. For example, when biomedical data are used for disease diagnosis, the numbers of diseased and healthy samples may differ greatly, and the class distribution can be extremely imbalanced [4]. Class imbalance biases classification algorithms toward the majority class and reduces classification performance on the minority class, whereas in practice accurate identification of the minority class is often the actual goal [8]. Data resampling is an effective way to address class imbalance: it converts imbalanced data into balanced data through oversampling or undersampling [9]. Oversampling balances the data by increasing the number of minority class samples; the most popular oversampling method is the Synthetic Minority Over-sampling Technique (SMOTE), which generates new samples from minority class samples and their nearest neighbors [10]. Undersampling balances the data by reducing the number of majority class samples; Edited Nearest Neighbours [11] and Tomek Links [12] are classical undersampling methods. Most previous studies of biomedical and bioinformatics data consider only the high-dimensional problem and often ignore class imbalance [13]. Undersampling discards information, which makes it unsuitable for small-sample data and generally inapplicable to biomedical data, so this paper focuses on the application of SMOTE oversampling to biomedical data. SMOTE has already been widely used for biomedical data. Nakamura et al. [14] used learning vector quantization to improve the code set obtained by the SMOTE method. Li et al. [15] proposed an oversampling method called adaptive swarm cluster dynamic multi-objective SMOTE to address class imbalance in biomedical data. Xu et al. [16] proposed a hybrid sampling method for biomedical data that combines misclassification-oriented SMOTE with edited nearest neighbor based on random forest.
To improve the performance of biomedical data classification, this paper takes into account both the high dimensionality and the class imbalance of biomedical data, and combines feature selection with data resampling to construct a hybrid classification model. The main innovations are as follows: (1) the minimal-redundancy maximal-relevance method is used for feature selection to address the high dimensionality of biomedical data and reduce the computational cost; (2) multi-kernel learning is introduced into spectral clustering to improve clustering performance, and SMOTE is improved with multi-kernel spectral clustering to address noise sensitivity, improve the quality of the generated samples, and handle class imbalance in biomedical data; (3) a marine predators algorithm combining piecewise linear chaotic maps and a random opposition-based learning strategy (PRMPA) is proposed to keep the algorithm from falling into local optima and to improve its search ability, and PRMPA is used to adaptively select the key parameters of Spectral-SMOTE, which further improves the classification performance on biomedical data; (4) experiments are carried out on five real biomedical datasets to validate the performance of the proposed model.
The rest of the paper is organized as follows. Section 2 presents the related work. Section 3 describes the proposed model in detail. Section 4 evaluates the proposed model against the comparative models on five real biomedical datasets and gives the results and analysis. Section 5 concludes the paper.
Related work
In this paper, we mainly apply feature selection and SMOTE methods to the problem of high dimensionality and class imbalance in biomedical data, so this section will briefly summarize the above two aspects.
Feature selection
Feature selection eliminates redundant and unnecessary features, thus improving the learning ability and generalization of the model. Feature selection methods fall into three main categories: filter, wrapper, and embedded [17]. Filter methods evaluate the relevance of features based on their intrinsic properties [18], using criteria such as variance, correlation coefficients, statistical tests, and information theory, of which information-theoretic criteria are the most common. Battiti used the mutual information criterion to evaluate candidate features [19]. However, univariate filter methods consider only the correlation between a single feature variable and the target variable and do not fully account for the correlation between features, so the selected feature subset may contain considerable redundancy. Peng et al. [20] proposed minimal-redundancy maximal-relevance (mRMR), which takes the redundancy among features into account during feature selection and effectively improves its performance. Van et al. [21] applied a threshold-based feature selection method to bioinformatics data, which effectively improved the performance of feature selection. Lyu et al. [22] proposed a filter feature selection method based on an improved maximal information coefficient (MIC) and applied it to biomedical data. El-Manzalawy et al. [23] extended mRMR by proposing a multi-view mRMR to predict the survival time of ovarian cancer patients. Xiong et al. [24] used uncertainty indices to assess the correlation between features and class labels and removed redundant features. Wang et al. [25] used mutual information to screen gastric cancer genes and improve classification accuracy.
Wrapper methods evaluate a selected subset of features by training a learning model [26]. Correa et al. [27] applied a discrete particle swarm optimization algorithm to bioinformatics datasets to maximize classification accuracy while minimizing the number of features. Li et al. [28] proposed a binary Wolf Search Algorithm with an elite mechanism, which effectively reduces the computation time of feature selection on bioinformatics data. Mafarja et al. [29] introduced multiple strategies into grey wolf optimization to overcome premature convergence and enhance local search ability, and applied the resulting multi-strategy grey wolf optimizer to feature selection on bioinformatics data. Embedded methods use the learning algorithm itself to perform feature selection [30]. Nie et al. [31] introduced L1, L2 norms into machine learning algorithms to realize an efficient and robust feature selection method, which effectively eliminates noisy features in bioinformatics data. Guo et al. [32] used an embedded feature selection method to process multilabel bioinformatics data, and the results showed that the proposed algorithm has significant advantages over other algorithms. Wrapper and embedded methods both depend on a learning algorithm and can usually achieve higher accuracy than filter methods, but at a higher computational cost. Filter methods are independent of the model, which helps avoid overfitting and keeps the computational cost low, so they are more widely used in biomedical fields [22]. This paper therefore uses the minimal-redundancy maximal-relevance method to select features for biomedical data.
The mRMR method not only reduces the redundancy between features during feature selection but also selects the features most relevant to the target variable. It can therefore effectively reduce the dimensionality of high-dimensional biomedical data, retain important features, mitigate the “curse of dimensionality” in disease classification, and reduce computational complexity, thus facilitating subsequent research.
SMOTE
SMOTE [10] is the most classical oversampling method and is widely used for the class imbalance problem, but it still has shortcomings: it easily introduces noise and cannot handle overlapping samples. Many enhancement strategies have been proposed to address these issues. Han et al. [33] proposed Borderline-SMOTE, which synthesizes new samples using only the minority class samples near the class boundary, thus improving the distribution of samples. Bunkhumpornpat et al. [34] proposed the Safe-Level-SMOTE method, which synthesizes new samples at a larger safe level based on the safe level of the minority class samples. Nguyen et al. [35] proposed an SVM-based SMOTE method, which uses support vectors to generate new samples and effectively solves the problem of identifying boundary samples. He et al. [36] proposed the ADASYN method, which assigns different weights to different minority class samples and reduces the influence of noisy data. Maciejewski and Stefanowski [37] took the local information of the minority class samples into account and proposed LN-SMOTE, which effectively improves the quality of the generated samples.
Because the high dimensionality and class imbalance of biomedical data make accurate classification difficult for traditional methods, this paper proposes a hybrid method based on feature selection and oversampling. First, the mRMR method is used to select features for biomedical data, which reduces redundant features, improves generalization ability, and reduces computational cost; then, because K-means SMOTE cannot handle high-dimensional sparse data efficiently, multi-kernel spectral clustering is used to improve SMOTE, which improves oversampling accuracy and effectively addresses noise sensitivity; finally, the improved marine predators algorithm is used to optimize the key parameters of Spectral-SMOTE, which effectively improves the classification performance on biomedical data.
Methods
In this section, we will briefly describe the main methods. The mRMR feature selection method is introduced in Section 3.1. Multi-kernel spectral clustering and SMOTE methods are introduced in Sections 3.2 and 3.3 respectively. Section 3.4 describes the improved marine predators algorithm. Finally, the overall framework of the method proposed in this paper is described in detail in Section 3.5.
mRMR
The mRMR [20] is a filter feature selection method based on mutual information, whose main objective is to find a subset of features that has the highest correlation with the target variable while minimizing the redundancy among the features. Correlation is measured by mutual information, which quantifies the information shared by two variables. Given two random variables y and z with probability density functions p(y) and p(z) and joint probability density function p(y, z), the mutual information of y and z can be expressed as Equation (1):
Minimum redundancy and maximum relevance are calculated by Equations (2) and (3), respectively:
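As a rough illustration of the relevance and redundancy criteria, the following minimal sketch performs greedy mRMR selection on discrete features. It is not the authors' implementation: the function names are hypothetical, mutual information is estimated from value counts (continuous gene expression values would first need to be discretized, which is an assumption here), and relevance and redundancy are combined by their difference, which is one common form of the criterion.

```python
import math
from collections import Counter


def mutual_information(a, b):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(a)
    count_a, count_b, count_ab = Counter(a), Counter(b), Counter(zip(a, b))
    mi = 0.0
    for (va, vb), c in count_ab.items():
        # p(a, b) * log(p(a, b) / (p(a) * p(b))), written with raw counts
        mi += (c / n) * math.log(c * n / (count_a[va] * count_b[vb]))
    return mi


def mrmr_select(features, target, k):
    """Greedy mRMR over a list of discrete feature columns.

    Returns the indices of k selected columns. The difference form
    (relevance minus mean redundancy) is an assumption of this sketch.
    """
    relevance = [mutual_information(col, target) for col in features]
    # Start from the single most relevant feature.
    selected = [max(range(len(features)), key=relevance.__getitem__)]
    while len(selected) < min(k, len(features)):
        def score(j):
            redundancy = sum(mutual_information(features[j], features[s])
                             for s in selected) / len(selected)
            return relevance[j] - redundancy
        candidates = [j for j in range(len(features)) if j not in selected]
        selected.append(max(candidates, key=score))
    return selected
```

With a duplicated feature in the candidate set, the redundancy term steers the second pick away from the duplicate even though its relevance is high, which is the behavior the univariate filters discussed earlier lack.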
Spectral clustering [44] is a clustering method based on graph theory that clusters the eigenvectors of the Laplacian matrix of the original data. It adapts well to different data distributions and can converge to the global optimal solution, so it is widely used. Spectral clustering first computes the pairwise similarities of the original data and uses them to construct the similarity matrix; it then constructs the corresponding Laplacian matrix from the similarity matrix and finds the eigenvectors corresponding to its first k eigenvalues to form a new solution space; finally, clustering is performed in the new solution space.
Spectral clustering only needs to calculate the similarity matrix between data, which is more conducive to dealing with sparse data than traditional clustering methods such as K-means, but its clustering effect is also highly dependent on the similarity matrix. The most common way to generate similarity matrix is to use RBF kernel function, but RBF kernel is a classic local kernel function with strong learning ability but weak generalization ability. Therefore, multi-kernel learning is used in the construction of similarity matrix in this paper. The Sigmoid kernel is a commonly used global kernel function known for its strong generalization ability but limited learning ability. By combining the Sigmoid kernel and the RBF kernel, a multi-kernel function can be constructed to enhance both learning ability and generalization ability [45]. The proposed multi-kernel function is represented by Equation (5):
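The idea of mixing a local and a global kernel can be sketched as follows. This is only an illustration under stated assumptions: the convex combination with a single mixing weight, and the parameter names `weight`, `gamma`, `alpha`, and `coef0`, are hypothetical choices and may differ from the form and parameters used in Equation (5).

```python
import math


def rbf_kernel(x, y, gamma=0.5):
    """Local kernel: exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def sigmoid_kernel(x, y, alpha=0.01, coef0=0.0):
    """Global kernel: tanh(alpha * <x, y> + coef0)."""
    return math.tanh(alpha * sum(a * b for a, b in zip(x, y)) + coef0)


def multi_kernel(x, y, weight=0.5, gamma=0.5, alpha=0.01, coef0=0.0):
    """Assumed convex combination of the RBF and Sigmoid kernels."""
    return (weight * rbf_kernel(x, y, gamma)
            + (1.0 - weight) * sigmoid_kernel(x, y, alpha, coef0))


def similarity_matrix(samples, **kernel_params):
    """Pairwise multi-kernel similarities for a list of samples."""
    return [[multi_kernel(xi, xj, **kernel_params) for xj in samples]
            for xi in samples]
```

Note that the Sigmoid kernel can take negative values, so a similarity matrix built this way may need its parameters chosen (or its entries shifted) so that the affinities stay non-negative before the Laplacian is formed.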
In imbalanced data classification, the majority class is the class with the larger number of samples in the dataset, while the minority class is the class with the smaller number of samples. SMOTE [10] is a classical oversampling method for the class imbalance problem and an improvement on random oversampling. Random oversampling simply replicates minority class samples, which easily leads to overfitting, whereas SMOTE reduces overfitting by generating new minority class samples and is therefore widely used for imbalanced data. The core idea of SMOTE is to create synthetic samples by linear interpolation between existing minority class samples and their neighboring samples of the same class, in two main steps: (1) given a minority sample x, compute its distance to the other minority samples and find the k nearest neighbors of x; (2) randomly select a sample x_k from these nearest neighbors and generate a new minority class sample by linear interpolation between x and x_k, as shown in Equation (8):
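The two steps can be sketched in a few lines. This is a minimal illustration of plain SMOTE interpolation, x_new = x + r (x_k - x) with r drawn uniformly from [0, 1), and not the Spectral-SMOTE variant proposed in this paper; the function name and arguments are hypothetical.

```python
import random


def smote_samples(minority, k=5, n_new=1, rng=None):
    """Generate n_new synthetic points from a list of minority samples.

    Requires at least two minority samples. Euclidean distance is used
    to find the k nearest minority neighbours.
    """
    rng = rng or random.Random()
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # Step 1: k nearest minority neighbours of x, excluding x itself.
        others = [m for m in minority if m is not x]
        neighbours = sorted(
            others,
            key=lambda m: sum((a - b) ** 2 for a, b in zip(x, m)),
        )[:k]
        # Step 2: interpolate between x and a randomly chosen neighbour.
        x_k = rng.choice(neighbours)
        gap = rng.random()
        synthetic.append([a + gap * (b - a) for a, b in zip(x, x_k)])
    return synthetic
```

Because every synthetic point lies on a segment between two minority samples, a noisy minority sample inside the majority region pulls new samples toward it, which is the noise sensitivity that clustering-based variants try to reduce.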
Marine predators algorithm (MPA)
Marine predators algorithm (MPA) [46] is a new swarm intelligence optimization algorithm proposed by Faramarzi et al. in 2020, which mainly simulates the process of survival of the fittest organisms in the ocean. The population is first initialized as shown below:
The Elite matrix is constructed by replicating the top predator vector X^I n times, where n and d are the number of search agents and the number of dimensions, respectively. The MPA optimization process is divided into three main phases, which correspond to different velocity ratios between predator and prey over their whole life, as shown below:
At the beginning of the iteration, MPA mainly realizes the global exploration process of the search space, when the optimal selection strategy of the predator is to remain motionless at the original position:
At the end of the iteration, MPA mainly carries out exploitation of the search space, when the best strategy for the predator is to perform a Lévy movement:
Eddy formation or Fish Aggregating Devices (FADs) usually cause changes in the behavior of marine organisms, which help MPA to jump out of local optima during the optimization search process and prevent the algorithm from converging prematurely.
The traditional MPA initializes the population randomly in the search space, which can result in poor population diversity, whereas a well-initialized population can effectively speed up the search for the optimum [47]. Therefore, this paper adopts the piecewise linear chaotic map (PWLCM) to initialize the population. PWLCM has ergodicity and randomness and can effectively improve the diversity of the initial solutions. The definition of PWLCM is shown in Equation (17):
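A chaotic initialization of this kind might look like the sketch below. It uses the commonly cited form of the PWLCM with a control parameter p in (0, 0.5); the value p = 0.4, the seed 0.37, and the function names are hypothetical, and the exact form of Equation (17) may differ.

```python
def pwlcm(x, p=0.4):
    """One step of the piecewise linear chaotic map, with 0 < p < 0.5.

    The map is symmetric about 0.5, so values in [0.5, 1] are first
    reflected. Note that x = 0.5 maps to 1.0 and then to the fixed
    point 0, so seeds should avoid such degenerate values.
    """
    if x >= 0.5:
        x = 1.0 - x
    if x < p:
        return x / p
    return (x - p) / (0.5 - p)


def chaotic_population(n_agents, dim, lower, upper, seed=0.37, p=0.4):
    """Initial population scaled from a PWLCM sequence into [lower, upper]."""
    population, x = [], seed
    for _ in range(n_agents):
        agent = []
        for j in range(dim):
            x = pwlcm(x, p)
            agent.append(lower[j] + x * (upper[j] - lower[j]))
        population.append(agent)
    return population
```

Each coordinate of each agent consumes the next value of a single chaotic sequence, so the population is fully determined by the seed and p, which also makes runs reproducible.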
The opposition-based learning strategy [49], proposed by Tizhoosh in 2005, is an improvement strategy for swarm intelligence optimization. Its main idea is to generate an opposite solution to the current solution, add it to the current population, compare the current solution with the opposite solution, and keep the better one for the next iteration. However, because the opposite solutions generated in this way lack randomness, they cannot produce a diverse population. This paper therefore adopts the random opposition-based learning (ROBL) proposed by Long et al. [50]. ROBL introduces random numbers into the opposition-based learning strategy, which helps generate a population with randomness, enhances population diversity, and keeps the algorithm from falling into local optima. The formula of ROBL is shown in Equation (18):
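The following sketch shows one way the random opposite solution and the greedy comparison could be written. It assumes the form usually given for ROBL, in which each coordinate of the opposite solution is the lower bound plus the upper bound minus a random fraction of the current coordinate; the clipping to the search bounds and the function names are additions made for this illustration and are not taken from the paper.

```python
import random


def random_opposite(solution, lower, upper, rng=None):
    """Random opposite of a solution: lower + upper - r * x per coordinate."""
    rng = rng or random.Random()
    opposite = []
    for x, lo, hi in zip(solution, lower, upper):
        candidate = lo + hi - rng.random() * x
        # Clipping to the bounds is an assumption of this sketch.
        opposite.append(min(max(candidate, lo), hi))
    return opposite


def keep_better(solution, lower, upper, fitness, rng=None):
    """Keep whichever of the solution and its random opposite scores higher.

    Assumes the fitness is maximised, as with an F-measure objective.
    """
    candidate = random_opposite(solution, lower, upper, rng)
    return candidate if fitness(candidate) > fitness(solution) else solution
```

With r fixed at 1 the expression reduces to the deterministic opposite lower + upper - x, which is why the random factor is what adds diversity.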
To address the high dimensionality and class imbalance of biomedical data, this paper combines feature selection with data oversampling: the mRMR feature selection method handles the high-dimensional problem, and the improved SMOTE oversampling method handles the class imbalance problem. The proposed method mainly includes the following four steps:
Because the high dimensionality of biomedical data easily leads to the “curse of dimensionality”, this paper first uses the mRMR method to select the subset of features that is most relevant to the target variable and has the least redundancy among its features, so as to reduce the dimensionality of the data, improve the generalization ability of the model, and reduce the cost of subsequent computation.
To address the class imbalance problem in biomedical data, this paper proposes a new Spectral-SMOTE method for oversampling. Spectral-SMOTE effectively addresses the noise sensitivity of SMOTE and, at the same time, overcomes the inability of K-means SMOTE to deal effectively with high-dimensional sparse data. To construct the similarity matrix in spectral clustering, this paper uses the multi-kernel function, which offers both strong learning ability and strong generalization ability. The specific process and pseudo-code of Spectral-SMOTE are shown in Algorithm 1.
The parameters of Spectral-SMOTE are crucial to sampling performance, and optimizing them is a complex mathematical optimization problem, so this paper uses PRMPA for this purpose. PRMPA combines PWLCM with the random opposition-based learning strategy. First, PWLCM is used to generate a diverse initial population, which lays a good foundation for the subsequent search for the optimum. Then, ROBL is used in each iteration to generate the opposite solution of the current solution, which prevents the algorithm from falling into a local optimum. Since this paper focuses on the classification of imbalanced biomedical data, the traditional accuracy rate cannot effectively measure the goodness of the model, so the F-measure is chosen as the fitness function. The parameters to be optimized in Spectral-SMOTE are shown in Table 1, the pseudo-code of PRMPA is shown in Algorithm 2, and the pseudo-code of PRMPA-Spectral-SMOTE is shown in Algorithm 3.
Description of optimization parameters
The dataset after feature selection is divided into a training set and a test set. The training set is oversampled with PRMPA-Spectral-SMOTE to obtain a balanced training set, on which classification models are built using KNN, SVM, RF, and LR. The models are then evaluated on the test set, and the final evaluation metrics are calculated from the classification results.
Datasets
To evaluate the performance of mRMR and PRMPA-Spectral-SMOTE, this paper selects five real high-dimensional imbalanced biomedical datasets, which are mainly used for the diagnosis of various diseases and contain gene expression samples with thousands of features. They include two binary classification datasets and three multi-class datasets. The Colon, GLI_85, TOX_171, and GLIOMA datasets were obtained from https://jundongl.github.io/scikit-feature/datasets.html. The MLL dataset was obtained from https://csse.szu.edu.cn/staff/zhuzx/Datasets.html. In practice, identifying one specific class may be the main purpose of biomedical data classification [51], so the multi-class datasets are first converted into binary datasets: in the MLL dataset, MLL is treated as one class and the remaining two classes as the other; in the TOX_171 dataset, class 3, which has the fewest samples, is treated as one class and the remaining three classes as the other; in the GLIOMA dataset, cancer oligodendrogliomas is treated as one class and the remaining three classes as the other. Table 2 lists the basic information of the selected datasets, where IR is the imbalance ratio. As Table 2 shows, the number of features ranges from 2000 to 22283, IR ranges from 1.82 to 6.14, and the total number of samples ranges from 50 to 171, indicating that biomedical data are typical high-dimensional, imbalanced, small-sample data and that their accurate classification is a very difficult learning task.
Description of the datasets
In this paper, the accuracy rate (Acc), F-measure and area under the curve (AUC) [51, 52] are used to evaluate the performance of the biomedical data classification model. First, the confusion matrix of the classification results is established (as shown in Table 3), and then the following evaluation indicators can be defined according to the confusion matrix:
Confusion matrix
AUC represents the area under the ROC curve. The ROC curve takes TP/(TP + FN) as the ordinate and FP/(TN + FP) as the abscissa, and draws the entire curve by traversing all thresholds. When the data is imbalanced, F-measure and AUC have great reference value. The larger the Acc, F-measure and AUC, the better the classification performance of the model.
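For concreteness, the three metrics can be computed as in the sketch below. It assumes that the F-measure is the balanced F1 score (the harmonic mean of precision and recall) and that the minority class is labelled 1; AUC is computed through its rank interpretation, the probability that a randomly chosen positive sample is scored higher than a randomly chosen negative one, which equals the area under the ROC curve. The function names are hypothetical.

```python
def classification_metrics(tp, fn, fp, tn):
    """Acc, precision, recall and F-measure (assumed F1) from confusion counts."""
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return {"acc": accuracy, "precision": precision,
            "recall": recall, "f_measure": f_measure}


def auc_from_scores(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

In a hypothetical case with 8 true positives, 2 false negatives, 10 false positives, and 80 true negatives, Acc is 0.88 while the F-measure is only about 0.57, which shows why accuracy alone can be misleading on imbalanced data.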
All experiments in this paper adopt five-fold cross-validation, and each data set is repeated 5 times. The final result is the average result of the above process, which can effectively reduce the deviation caused by dividing the data set. In this paper, four classifiers, KNN, SVM, RF and LR, are used to classify and evaluate the data set after feature selection and data resampling. All experiments were carried out on a computer with Intel(R) Core(TM) i7-10700, CPU 2.90 GHz, 16.0 GB memory and Windows 10 operating system, using Python 3.7 for programming.
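The repeated cross-validation protocol can be sketched as an index generator. Stratifying the folds by class is an assumption of this sketch (the text only specifies five folds repeated five times), and the function names and the seed are hypothetical. In such a protocol, oversampling would be applied to the training indices of each split only, so that no synthetic sample influences the test fold.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_splits=5, rng=None):
    """Split sample indices into n_splits folds, class by class."""
    rng = rng or random.Random()
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of each class round-robin into folds.
        for pos, idx in enumerate(indices):
            folds[pos % n_splits].append(idx)
    return folds


def repeated_splits(labels, n_splits=5, n_repeats=5, seed=0):
    """Yield (train_indices, test_indices) for repeated stratified k-fold."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        for test_idx in stratified_folds(labels, n_splits, rng):
            held_out = set(test_idx)
            train_idx = [i for i in range(len(labels)) if i not in held_out]
            yield train_idx, sorted(test_idx)
```

Averaging a metric over the 25 resulting splits gives the kind of final result described above.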
This paper mainly carried out three experiments, which are briefly introduced as follows:
Comparison of results and discussions
Experiment 1: Comparison of feature selection results
In this paper, four classifiers, KNN, SVM, RF and LR, are first used to classify the original dataset and the dataset after mRMR feature selection, in which k in KNN is set to 3, and the parameters of the rest of the classifiers are set to default parameters. The experimental results are shown in Table 4 and Fig. 1.
Comparison of feature selection results
Table 4 shows that, on all four classifiers, classification performance after mRMR feature selection is significantly better than without feature selection, with large improvements in F-measure, Acc, and AUC. On the datasets without feature selection, KNN, RF, and LR perform better and SVM performs worst, whereas on the mRMR-selected datasets KNN and SVM perform better. This indicates that the performance of a classifier varies across datasets and that no single classifier can classify all datasets accurately, so it is meaningful to use multiple classifiers for model evaluation. The poor performance of SVM on the original datasets suggests that SVM is sensitive to dimensionality and may not handle high-dimensional data effectively, but it performs well after mRMR feature selection, with the F-measure improving by as much as 0.74 on the GLIOMA dataset. This further illustrates that mRMR feature selection can effectively improve the performance of biomedical data classification.
Figure 1 visualizes the feature selection results. It shows more intuitively that, on all five datasets, the classification performance of KNN, SVM, RF, and LR is greatly improved on the datasets processed by mRMR feature selection.

Comparison of feature selection result.
In Experiment 2, the proposed method is compared with seven SMOTE-based data resampling methods: SMOTE [10], ADASYN [36], Borderline-SMOTE (B-SMOTE) [33], SVMSMOTE (S-SMOTE) [35], K-means SMOTE (K-SMOTE) [40], SMOTE-ENN (SMOTE-E) [53], and SMOTE-Tomek (SMOTE-T) [54]. These methods have been widely used for imbalanced data problems in previous research and are considered reliable baselines [51, 56]. In the experiments, all datasets undergo mRMR feature selection, all of the above SMOTE-based methods use their default parameters, and the data resampling methods are again evaluated with four classifiers (KNN, SVM, RF, and LR) in order to eliminate the influence of the classifier on the evaluation of resampling performance. Experiment 2 compares the models from two perspectives. First, different data resampling methods are evaluated with the same classifier (Tables 5–8) to verify the effectiveness of the proposed resampling method. Then, the performance of different classifiers after data resampling on the same dataset is visualized to compare the differences between classifiers (Figs. 2–6).
Comparison results of different data resampling methods on KNN
Comparison results of different data resampling methods on SVM
Comparison results of different data resampling methods on RF
Comparison results of different data resampling methods on LR

Comparison of results from different classification models on the Colon dataset.

Comparison of results from different classification models on the GLI_85 dataset.

Comparison of results from different classification models on the MLL dataset.

Comparison of results from different classification models on the TOX_171 dataset.

Comparison of results from different classification models on the GLIOMA dataset.
The results in Tables 5–8 lead to the following conclusions: (1) Regardless of which of KNN, SVM, RF, and LR is used, the proposed PRMPA-Spectral-SMOTE method outperforms the other data resampling methods on all evaluation metrics. The GLIOMA dataset is classified perfectly with KNN, SVM, and LR, and the MLL dataset is classified perfectly with KNN and SVM, with all evaluation metrics equal to 1, indicating that PRMPA-Spectral-SMOTE is more suitable for imbalanced data classification and helps improve classification performance. (2) Compared with the models without data resampling in Table 4, most classification models built after data resampling improve to varying degrees, indicating that data resampling can effectively alleviate the imbalance of biomedical data and improve classification performance. (3) With the same classifier, the same data resampling method can perform very differently on different datasets; for example, with KNN, SMOTE-E performs well on the GLIOMA dataset but worst on the TOX_171 dataset. (4) The same data resampling method also performs differently with different classifiers; for example, S-SMOTE performs better with LR than with the other classifiers, whereas PRMPA-Spectral-SMOTE achieves the best performance with all classifiers, which further illustrates its superiority.
The results in Figs. 2–6 lead to the following conclusions: (1) On all datasets, PRMPA-Spectral-SMOTE performs significantly better than the comparison models on the evaluation metrics, but the degree of improvement differs between classifiers; for example, on the GLI_85 dataset, the improvement for RF is clearly larger than for the other classifiers. (2) Different classifiers also perform very differently on the same dataset; for example, on the TOX_171 and GLIOMA datasets, RF performs significantly worse than KNN, SVM, and LR, which suggests that RF may not be a suitable classifier for these datasets. (3) The GLIOMA dataset shows that the larger the imbalance ratio of the dataset, the more pronounced the improvement achieved by PRMPA-Spectral-SMOTE, indicating that the proposed method can handle the classification of highly imbalanced data well.
To evaluate the performance differences among the data resampling methods, this paper uses the Friedman test [57] and Holm’s test [58] for further statistical analysis. Both tests compare the average performance of multiple methods across multiple datasets. The main idea is to rank the results obtained by the methods on each dataset and to determine from these ranks whether the average performance of the methods differs significantly. These tests are also widely used in work on imbalanced data classification [55, 59]. The null hypothesis is that “there is no significant difference between the methods”. The statistical tests in this paper are implemented with the Python library provided by [60]. The results are shown in Fig. 7 and Tables 9 and 10:

Fig. 7 Mean rank of different data resampling methods using different classifiers.
Table 9 Result of Friedman test
Table 10 Result of Holm’s test with the proposed model as the control method
Figure 7 shows the average ranks of the different oversampling methods with each classifier; the smaller the average rank, the better the method performs. PRMPA-Spectral-SMOTE ranks first on all evaluation indexes, and with every classifier it outperforms the other compared methods in terms of F-measure, Acc and AUC. These results again illustrate the superiority of PRMPA-Spectral-SMOTE.
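As a minimal sketch of how average ranks of this kind can be obtained, the following Python snippet ranks the methods on each dataset (rank 1 for the highest score, tied scores sharing the mean of their positions) and then averages the ranks over the datasets. The score values, the method names and the helper functions `rank_descending` and `mean_ranks` are illustrative assumptions and are not results or code from this paper.

```python
from typing import Dict, List


def rank_descending(scores: List[float]) -> List[float]:
    """Rank scores so the highest gets rank 1; tied scores share the mean rank."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the group while the next score ties with the current one.
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # mean of the 1-based positions pos+1 .. end+1
        for j in range(pos, end + 1):
            ranks[order[j]] = shared
        pos = end + 1
    return ranks


def mean_ranks(results: Dict[str, List[float]]) -> Dict[str, float]:
    """Average each method's per-dataset rank; results[method][d] is a score."""
    methods = list(results)
    n_datasets = len(next(iter(results.values())))
    totals = {m: 0.0 for m in methods}
    for d in range(n_datasets):
        ranks = rank_descending([results[m][d] for m in methods])
        for m, r in zip(methods, ranks):
            totals[m] += r
    return {m: totals[m] / n_datasets for m in methods}


# Hypothetical F-measure values for three methods on four datasets.
toy_scores = {
    "method_A": [0.95, 0.90, 1.00, 0.88],
    "method_B": [0.90, 0.90, 0.92, 0.80],
    "method_C": [0.85, 0.80, 0.95, 0.84],
}
toy_mean_ranks = mean_ranks(toy_scores)
```

In this toy example `method_A` obtains the smallest mean rank, which is the situation described for PRMPA-Spectral-SMOTE in Fig. 7.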
Table 9 shows the Friedman test results of the data resampling methods with the different classifiers. The significance level is set to α = 0.05, and bold entries in the table indicate that the null hypothesis can be rejected at this level, that is, that there is a significant difference between the methods.
As can be seen from Table 9, with all classifiers the p-values for F-measure, Acc and AUC are below the significance level, so the null hypothesis can be rejected and the data resampling methods differ significantly. However, the Friedman test cannot tell which pairs of methods differ, so Holm’s test is used to check whether there is a significant difference between the proposed PRMPA-Spectral-SMOTE and each of the other data resampling methods. The results are shown in Table 10.
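To make the decision rule concrete, the sketch below computes the Friedman chi-square statistic from average ranks and compares its p-value with α = 0.05. It is a simplified illustration that uses only the Python standard library and applies no tie correction; it is not the library of [60], and the average ranks, the number of datasets and the function names `chi2_sf` and `friedman_test` are assumptions made for the example.

```python
import math
from typing import List, Tuple


def chi2_sf(x: float, df: int) -> float:
    """Upper-tail probability P(X > x) of a chi-square variable with integer df."""
    if x <= 0:
        return 1.0
    y = x / 2.0
    if df % 2 == 0:
        # Even df: closed-form finite sum exp(-y) * sum_{i < df/2} y^i / i!.
        term, total = 1.0, 1.0
        for i in range(1, df // 2):
            term *= y / i
            total += term
        return math.exp(-y) * total
    # Odd df: start from df = 1 and apply the incomplete-gamma recurrence.
    total = math.erfc(math.sqrt(y))
    for i in range(1, (df - 1) // 2 + 1):
        total += math.exp(-y) * y ** (i - 0.5) / math.gamma(i + 0.5)
    return total


def friedman_test(avg_ranks: List[float], n_datasets: int) -> Tuple[float, float]:
    """Friedman chi-square statistic and p-value from average ranks (no tie correction)."""
    k = len(avg_ranks)
    stat = 12.0 * n_datasets / (k * (k + 1)) * (
        sum(r * r for r in avg_ranks) - k * (k + 1) ** 2 / 4.0
    )
    return stat, chi2_sf(stat, k - 1)


# Hypothetical average ranks of four methods over five datasets.
stat, p_value = friedman_test([1.0, 2.4, 2.8, 3.8], n_datasets=5)
reject_null = p_value < 0.05  # True means the methods differ significantly
```

With these toy ranks the p-value falls below 0.05, so the hypothesis of no difference between the methods would be rejected, which is the case that calls for a post hoc comparison against the control method.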
Holm’s test results in Table 10 show that PRMPA-Spectral-SMOTE differs significantly from the other compared methods in most cases, indicating that the method proposed in this paper outperforms them in most cases. The most significant differences are with SMOTE-E and SMOTE-T, followed by SMOTE, ADASYN and K-SMOTE. The data resampling methods also behave differently with different classifiers; for example, PRMPA-Spectral-SMOTE performs similarly to ADASYN and B-SMOTE with the SVM classifier, whereas their performances with the RF classifier differ widely, because the choice of classifier also has a large impact on the final classification results.
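The following sketch shows one common way to carry out a Holm step-down comparison against a control method from average ranks: each method's rank difference to the control is converted to a z-value and a two-sided normal p-value, and the sorted p-values are compared with α/(m − i). This formulation, the function name `holm_vs_control` and the toy ranks are assumptions for illustration; the exact procedure implemented by the library of [60] may differ.

```python
import math
from typing import Dict


def holm_vs_control(avg_ranks: Dict[str, float], control: str,
                    n_datasets: int, alpha: float = 0.05) -> Dict[str, dict]:
    """Holm step-down comparison of every method against a control, from average ranks."""
    k = len(avg_ranks)
    se = math.sqrt(k * (k + 1) / (6.0 * n_datasets))  # std. error of a rank difference
    p_values = {}
    for name, rank in avg_ranks.items():
        if name == control:
            continue
        z = (rank - avg_ranks[control]) / se
        p_values[name] = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal p-value
    ordered = sorted(p_values, key=p_values.get)  # smallest p-value first
    m = len(ordered)
    results, still_rejecting = {}, True
    for i, name in enumerate(ordered):
        threshold = alpha / (m - i)
        # Once one hypothesis is retained, all later (larger) p-values are retained too.
        still_rejecting = still_rejecting and p_values[name] <= threshold
        results[name] = {"p": p_values[name], "threshold": threshold,
                         "reject": still_rejecting}
    return results


# Hypothetical average ranks; "proposed" plays the role of the control method.
toy = holm_vs_control(
    {"proposed": 1.0, "method_B": 2.4, "method_C": 2.8, "method_D": 3.8},
    control="proposed", n_datasets=5,
)
```

In this toy run only `method_D` is declared significantly different from the control. The step-down rule stops at `method_C`, so `method_B` is retained as well, which mirrors how a method can rank first overall and still be statistically indistinguishable from some competitors with a given classifier.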
In summary, the PRMPA-Spectral-SMOTE method proposed in this paper performs better than the other data resampling methods, addresses the data imbalance problem more effectively, and improves the performance of biomedical data classification.
To address the high dimensionality and class imbalance of biomedical data, this paper proposes a new method that combines feature selection and data resampling. First, feature selection is performed on the biomedical data using mRMR to reduce the feature dimension and lower the computational cost; then, the proposed Spectral-SMOTE is used to oversample the feature-selected data to obtain a balanced dataset; finally, the key parameters of Spectral-SMOTE are selected using the marine predators algorithm improved with PWLCM and ROBL, in order to enhance the performance of oversampling. Extensive experiments are conducted on five biomedical datasets, and the results show that the proposed method provides significant improvements in F-measure, Acc and AUC. The statistical test results show that the proposed PRMPA-Spectral-SMOTE differs significantly from the other data resampling methods and is an effective method for dealing with class imbalance in biomedical data. The experimental results also illustrate that both feature selection and data resampling can significantly improve the classification performance of biomedical data. In the future, we will therefore further explore a deeper combination of these two aspects to solve the high-dimensionality and class imbalance problems of biomedical data more effectively. In addition, extending the proposed method to multi-class classification problems is an important direction for future research.
Statements & declarations
Authors contributions
Funding
This work was supported by Department of Science and Technology of Jilin Province project (20210101149JC, 20200403182SF); Education Department of Jilin Province project (JJKH20220662KJ); National Natural Science Foundation of China (12026430).
Competing interests
The authors have no relevant financial or non-financial interests to disclose.
