Abstract
Self-training semi-supervised classification has grown in popularity as a research topic. However, when faced with real-world challenges such as outliers, imbalanced classes, and incomplete data, traditional self-training semi-supervised methods can suffer a loss of classification accuracy. In this research, we develop a two-step robust semi-supervised self-training classification algorithm that works with imbalanced and incomplete data. The proposed method differs from traditional self-training semi-supervised methods in three major ways: (1) It does not require the balanced and complete data assumptions of traditional semi-supervised self-training methods, since it can complete and rebalance the dataset simultaneously. (2) It is compatible with many classifiers, so it can handle multi-class and non-linear classification cases. (3) The classifier is resistant to outliers during semi-supervised classification. Furthermore, several numerical simulations on synthesized data illustrate the quality of our method, and multiple experiments on various real datasets demonstrate its superior classification performance.
Introduction
Although supervised classification is a reasonably advanced machine learning technique, labels are often difficult to come by, particularly in fields such as risk management, medicine, and image classification [1–3], while unlabeled data is abundant in these practical applications. Since a lack of labeled data may cause supervised classification learning to fail, semi-supervised classification (SSC) has been proposed, which uses abundant unlabeled data together with a small amount of labeled data to learn appropriate classifiers [4]. SSC has rapidly become a research hotspot, and various approaches have been proposed. According to how they obtain the data structure, SSC methods can be divided into three categories: self-training, co-training and active learning. Compared with self-training methods, co-training methods need to divide the original data set into complementary feature sets, and an inappropriate division can affect the final result of the collaborative training [6]. Active learning methods need to select specific samples and put them back into the training set after manual annotation, which produces a considerable manual annotation burden in practical applications [7]. For these reasons, this paper mainly considers SSC based on self-training methods.
In recent years, many semi-supervised self-training systems have been developed and investigated by researchers [2, 10, 17]. Nosayba AL-Azzam et al. suggested a self-training classification system based on logistic regression [2]. Although this method's performance is comparable to that of traditional approaches, its classification outcomes depend entirely on the accuracy of the initial classifier, which requires the labeled dataset to match the overall distribution of the data. Shuichi Kawano introduced a logistic regression self-training classification approach [24], which used the distribution of unlabeled data through the EM algorithm to produce the final classification results. Compared to the method presented in [2], this model made better use of the information in the unlabeled data distribution. However, this approach is based on two assumptions, the balanced assumption and the unlabeled data distribution assumption, which means that it fails when the distribution of unlabeled data is not available. Bennett and Demiriz proposed the semi-supervised Support Vector Machine (SSSVM) method [17], which can handle the situation in which the unlabeled data distribution differs from the distribution of labeled data. All of these techniques rely on the balanced assumption, hence they cannot be used to classify imbalanced datasets.
However, real-life data often have extremely imbalanced classes, such as bank credit card default data and medical diagnosis data for rare but important diseases [8]. Stanescu and Caragea conducted an empirical assessment of a semi-supervised learning algorithm for dealing with imbalanced classes [14], specifically self-training based on Naive Bayes Multinomial (NBM), and addressed the problem of imbalanced class distributions at both the data level (by resampling) and the algorithmic level (using cost-sensitive learning and ensembles). Notably, the three semi-supervised self-training techniques described previously do not take incomplete data into consideration, which includes missing values, unmeasurable values and unobservable values, and the resulting lack of information decreases the accuracy of the model predictions [9].
For imbalanced and incomplete data, adopting a two-step methodology for the SSC technique is a natural idea. This two-step approach first repairs the dataset using oversampling or undersampling together with imputation methods, and then performs semi-supervised classification on the completed dataset. Since undersampling methods are often criticized for losing potential information, oversampling methods are generally adopted when dealing with imbalance problems, such as the synthetic minority oversampling technique (SMOTE, [12]), the adaptive synthetic oversampling approach (ADASYN, [13]) and cluster-based synthetic oversampling (CBSO, [14]). Liu et al. [15] proposed a fuzzy-based information decomposition (FID) algorithm that uses fuzzy theory and information decomposition techniques to tackle imbalanced classes and incomplete data at the same time. To improve the diversity of the samples generated by FID, Dou et al. [16] proposed the improved fuzzy-based information decomposition (IFID) algorithm, which improves the membership function of the FID algorithm by imitating the normal distribution to give a precise weight to every observed value.
In addition to incomplete and imbalanced data, another challenge to classification is the presence of outliers, which are frequently contained in incomplete and imbalanced datasets [31, 32]. According to Ma et al. [33], ignoring the existence of outliers leads to large deviations in the classification algorithm, particularly for self-training classification approaches, which are naturally sensitive to outliers. The existence of outliers can also influence the results of imputation and rebalancing methods [18, 19]. Therefore, outliers degrade the dataset recovery and can have a disastrous impact on the final classification of the SSC method. A two-step SSC method will inevitably magnify the impact of outliers on the final results.
From the explanation above, it is apparent that existing data recovery methods cannot resist outliers in the data, while existing semi-supervised classification methods rarely consider imbalanced and incomplete data simultaneously. An important motivation for this study is how to robustly repair labeled data and use unlabeled data to improve the performance of the final semi-supervised classification. This paper proposes a robust two-step semi-supervised self-training classification method that can handle incomplete and imbalanced datasets with outliers. The main contributions of this paper are as follows: (1) A robust improved fuzzy information decomposition (RIFID) is developed to reduce the impact of outliers and enhance the effect of data repair. The synthetic samples produced by RIFID are closer to the distribution of the original data. (2) A novel two-step semi-supervised classification method is proposed for incomplete and imbalanced data. This method embeds the RIFID method into a semi-supervised classification method for the first time. (3) Because it is compatible with many classifiers, the proposed two-step semi-supervised classification method can handle multi-class and non-linear classification, making it more generalizable. (4) Several numerical simulations and a large number of experimental results on seven public datasets demonstrate the effectiveness and robustness of the proposed method.
The rest of this paper is organized as follows. Section 2 reviews the related work. Section 3 presents the details of the proposed method. The experiment results and analysis are presented in Section 4. Finally, the conclusions are drawn in Section 5.
Related work
In this section, we present a few SSC methods, as well as oversampling and missing value imputation techniques.
Semi-supervised classification
Researchers have conducted several studies on a variety of semi-supervised self-training methods in recent years; see, for example, [2, 10, 17].
From an algorithmic perspective, SSC methods can be broadly grouped into two types. The first type uses latent variables or specific boundary functions to represent the pseudo-labels of unlabeled data in the optimization function, so that the solution of the optimization problem contains both the pseudo-labels and the parameters of the final classifier. The LSSLR method proposed by Amini, Gallinar et al. [34] and the SSSVM method proposed by Bennett, Demiriz et al. [17] fall under this type, but both methods are prone to falling into local optima. Moreover, the SSSVM method is computationally complex and cannot address semi-supervised classification problems with large amounts of data. The second type of SSC method does not construct an overall optimization function to obtain the pseudo-labels and the final classifier. Instead, it first trains an initial classifier using labeled data, and then enhances the classification performance of the initial classifier by using unlabeled data.
The second type of SSC method has three primary steps. First, the initial classification model is trained on the labeled data. Then the initial classifier is used to predict labels for the unlabeled data, and the pseudo-labeled data are merged with the labeled data into a larger data set. Finally, the model is trained again on the merged data with hyperparameter optimization. A variety of classification techniques are available as initial classifiers, including logistic regression (LR), SVM, DNN, C4.5, and others [17, 26–28].
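The three steps can be sketched as follows. This is a minimal illustration, not the implementation used in the paper: the nearest-centroid classifier is a hypothetical stand-in for LR, SVM, DNN or C4.5, the function names are assumptions, and the hyperparameter optimization of the final step is omitted.

```python
from statistics import fmean

def fit_centroids(X, y):
    """Stand-in classifier: one mean vector (centroid) per class."""
    return {
        c: [fmean(col) for col in zip(*[x for x, lab in zip(X, y) if lab == c])]
        for c in sorted(set(y))
    }

def predict(centroids, X):
    """Assign each sample to the class with the nearest centroid."""
    def dist2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return [min(centroids, key=lambda c: dist2(x, centroids[c])) for x in X]

def self_train(X_lab, y_lab, X_unlab):
    # Step 1: train the initial classifier on the labeled data only.
    initial = fit_centroids(X_lab, y_lab)
    # Step 2: predict pseudo-labels for the unlabeled data.
    pseudo = predict(initial, X_unlab)
    # Step 3: retrain on the labeled data merged with the pseudo-labeled data.
    final = fit_centroids(X_lab + X_unlab, y_lab + pseudo)
    return final, pseudo

X_lab = [[0.0, 0.0], [0.2, 0.1], [3.0, 3.0], [3.1, 2.9]]
y_lab = [0, 0, 1, 1]
X_unlab = [[0.1, 0.3], [2.8, 3.2], [0.4, 0.0]]
final_model, pseudo_labels = self_train(X_lab, y_lab, X_unlab)
print(pseudo_labels)  # [0, 1, 0]
```

Because the pseudo-labels come entirely from the initial classifier, any error it makes in step 2 is carried into step 3, which is why the quality of the labeled data matters so much for this family of methods.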
For the initial classifier, the logistic regression model has a straightforward expression and excellent algorithmic efficiency, while it is unable to tackle nonlinear classification issues. A set of irregular and unorganized training samples are used in the C4.5 classification approach to infer classification rules in the form of a decision tree representation. C4.5 classification is more accurate and can handle nonlinear classification issues, despite the high complexity and low algorithmic efficiency of the model [28]. The DNN model is easier, quicker, and takes less time to learn than C4.5 and the hidden layer’s activation function can be adjusted to make the model tackle nonlinear classification issues [29]. To improve the DNN classification accuracy, however, more hidden layers must be added when dealing with challenging nonlinear classification issues. This complicates the DNN model and raises computing complexity.
When these initial classifiers are incorporated into semi-supervised classification, the classification results depend entirely on the quality of the initial classifier, and performance is nearly equivalent to that of conventional approaches. These classification models also suffer greatly from outliers. For this reason, it is essential that the labeled dataset fit the general distribution of the data. When the number of outliers in the dataset increases, a self-training classification algorithm will reinforce incorrect classification judgments, substantially decreasing the classifier's accuracy and generalization capacity. Building a semi-supervised model that can successfully combat the detrimental effects of outliers in incomplete data with imbalanced classes is one of our main objectives.
Oversampling and imputation
A two-step method is typically used to address semi-supervised classification problems with imbalanced and incomplete data with outliers. Specifically, the first stage employs data-driven methods for oversampling and imputation to solve the imbalanced and incomplete data problem, and the second stage employs the semi-supervised classification method described in Section 2.1. In this part, we look at various methods for correcting imbalanced and incomplete data. Oversampling is a data-based approach to tackling issues with imbalanced classes. This section introduces three conventional oversampling techniques: the synthetic minority oversampling technique (SMOTE, [12]), the adaptive synthetic sampling approach (ADASYN, [13]), and cluster-based synthetic oversampling (CBSO, [14]).
SMOTE is a key oversampling approach that uses interpolation to create new samples for the minority class. The k-nearest-neighbor technique, which SMOTE employs to create synthetic samples, takes the useful data information into account and, to some extent, mitigates the overfitting issue. SMOTE has several key limitations: the number of nearest neighbors must be set in advance, and the data distribution is not taken into account. Because SMOTE synthesizes samples based only on distances between minority class samples, the synthetic data are quite likely to contain outliers that do not fit the distribution of the original data.
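The interpolation step of SMOTE can be illustrated with a short sketch. The function name and the toy data are assumptions made for illustration; a production implementation would normally come from a dedicated library.

```python
import random

def smote_sample(minority, k, rng):
    """Create one synthetic point between a minority sample and a near neighbor."""
    base = rng.choice(minority)
    # The k nearest minority neighbors of `base` (excluding itself).
    others = [p for p in minority if p is not base]
    others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(p, base)))
    neighbor = rng.choice(others[:k])
    gap = rng.random()  # interpolation factor in [0, 1)
    return [a + gap * (b - a) for a, b in zip(base, neighbor)]

rng = random.Random(0)
minority = [[1.0, 1.0], [1.2, 0.9], [0.9, 1.3], [1.1, 1.1]]
synthetic = [smote_sample(minority, k=2, rng=rng) for _ in range(5)]
```

Every synthetic point lies on a segment between two existing minority samples. If one of those samples is an outlier, the segment stretches toward it, which is how outliers leak into the synthetic data.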
To solve the aforementioned issues, ADASYN, which adaptively generates synthetic samples for the target minority class samples, was proposed in [13]. The basic goal of ADASYN is to use different weights to reflect the degree of learning difficulty of each minority class sample. Minority samples with greater learning difficulty, such as samples near the boundaries between classes, receive more synthetic data. However, ADASYN is prone to disruption from outliers or noise points. In other words, if the target sample is a noise sample, the safety of the synthetic samples generated from it cannot be assured.
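A rough sketch of the ADASYN weighting follows: each minority sample is scored by the fraction of majority samples among its k nearest neighbors, and the synthetic budget is shared in proportion to that score. The function name, the toy data, and the equal-share fallback when no minority sample has majority neighbors are assumptions for illustration.

```python
def adasyn_allocation(minority, majority, k, total):
    """Share `total` synthetic samples among minority points by local difficulty."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    ratios = []
    for p in minority:
        # Neighbors are searched in the whole dataset, excluding the point itself.
        pool = [(dist2(p, q), 0) for q in minority if q is not p]
        pool += [(dist2(p, q), 1) for q in majority]
        pool.sort()
        # Difficulty = fraction of majority samples among the k nearest neighbors.
        ratios.append(sum(flag for _, flag in pool[:k]) / k)
    norm = sum(ratios)
    if norm == 0:  # assumed fallback: equal shares when no point is "difficult"
        return [total / len(minority)] * len(minority)
    return [total * r / norm for r in ratios]

minority = [[0.0, 0.0], [0.1, 0.0], [1.0, 0.0]]
majority = [[1.1, 0.0], [1.2, 0.0], [1.3, 0.0], [5.0, 0.0]]
allocation = adasyn_allocation(minority, majority, k=2, total=10)
print(allocation)  # [0.0, 0.0, 10.0]
```

In this toy example the whole budget goes to the single minority point sitting among majority samples. If that point is noise rather than a genuine boundary sample, all synthetic data are generated around noise.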
Cluster-based synthetic oversampling (CBSO) uses clustering in order to take the distribution of the data into consideration. Since minority class samples from the same cluster are used to synthesize new samples, CBSO can reduce the issue of overlapping synthetic samples and, to some extent, increase the accuracy of synthetic samples. However, because noise samples tend to form their own cluster, which leads to additional noisy synthetic samples, CBSO struggles to address the interference of noise samples.
These three conventional oversampling methods fail to deal with outliers in the data, and they perform worse with incomplete data. To extend oversampling to imbalanced datasets with missing values, Liu et al. [15] recently proposed the FID method, which supports data analysis in the presence of both missing values and imbalanced classes. This approach has certain restrictions: when the membership degree is 0, all missing data are replaced with the same mean of the observed values. As a result, the FID approach lacks diversity in its synthetic samples and reduces the accuracy of missing value recovery.
Dou et al. [16] presented the improved fuzzy-based information decomposition (IFID) technique to address this issue. The IFID approach improves the membership function of the FID algorithm and emulates the normal distribution to provide accurate weights for all observed data. However, the IFID approach does not address outliers contained in the data, which is the primary purpose of our work.
Proposed method
In this section, we propose the SSC-RIFID method. Section 3.1 first introduces the RIFID method, which provides a sensible and robust way of completing missing values and synthesizing data. Building on this foundation, Section 3.2 proposes the robust SSC-RIFID method.
RIFID method
Assume that the column vector
The terms S1 and S2 stand for the lower and upper boundaries of the feature's domain, respectively. FID and IFID use the minimum and maximum values to establish these boundaries [15, 16]. It is well known that the maximum and minimum values are sensitive to outliers, so the model may take extreme samples into account, jeopardizing its effectiveness. This difficulty can be overcome by replacing the maximum and minimum values with quantiles. Based on (2), the 5th and 95th percentiles of the observed data are used in this article to define S1 and S2: S1 is the 5th percentile and S2 is the 95th percentile of the observed values of the feature.
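The effect of replacing the extremes with percentiles can be seen in a small sketch. The function name is hypothetical, and the linear-interpolation percentile rule used here (the "inclusive" method of Python's statistics module) is an assumption, since the percentile convention is not specified.

```python
from statistics import quantiles

def robust_bounds(observed):
    """Return (S1, S2) as the 5th and 95th percentiles of the observed values."""
    cuts = quantiles(observed, n=20, method="inclusive")  # 5%, 10%, ..., 95%
    return cuts[0], cuts[-1]

observed = [float(v) for v in range(1, 21)] + [500.0]  # one extreme outlier
s1, s2 = robust_bounds(observed)
print((min(observed), max(observed)))  # (1.0, 500.0): min/max bounds
print((s1, s2))                        # (2.0, 20.0): percentile bounds
```

With min/max bounds a single outlier stretches the interval to 500, so most sub-intervals would cover a region containing no real data. The percentile bounds stay within the bulk of the observations.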
Similar to FID [15], to estimate the missing feature values, this paper establishes an interval
Then, we can obtain the following t partitions of the interval
The mapping is provided by measuring the contribution of the observed data to the incomplete data recovery:
To retrieve the kth missing value, we can utilize the following information decomposition based on (9):
The RIFID given here addresses the diversity of the synthetic data and employs the percentile values in (3) and (4) to mitigate the influence of outliers on the synthetic data. The suggested mapping (8) and the membership degree calculation (9) assign a weight, however small, to every observed value, so that each observed value contributes to the missing value recovery to a varying degree. Compared with FID and IFID, these changes make the synthetic data more diverse and closer to the distribution of the actual data. These advantages are visualized with 3D plots in the numerical simulations in Section 4.1.
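The general idea of information decomposition between robust bounds can be sketched as below. This is not the RIFID formula: the Gaussian-shaped weight, the use of sub-interval midpoints, and the fallbacks are assumptions chosen only to illustrate weighted recovery in which every observed value contributes. The function name and the example bounds are likewise hypothetical.

```python
import math

def recover_missing(observed, t, s1, s2):
    """Estimate t missing values of one feature from its observed values."""
    h = (s2 - s1) / t                      # step length of the t sub-intervals
    if h == 0:
        return [s1] * t                    # degenerate interval (assumed fallback)
    recovered = []
    for k in range(t):
        center = s1 + (k + 0.5) * h        # midpoint of the k-th sub-interval
        # Assumed weight: Gaussian-shaped in the distance to the midpoint, so
        # near values dominate and far values (outliers) contribute almost nothing.
        weights = [math.exp(-0.5 * ((x - center) / h) ** 2) for x in observed]
        total = sum(weights)
        if total == 0.0:
            recovered.append(center)       # assumed fallback: use the midpoint
        else:
            recovered.append(sum(w * x for w, x in zip(weights, observed)) / total)
    return recovered

observed = [float(v) for v in range(1, 21)] + [500.0]  # one extreme outlier
recovered = recover_missing(observed, t=3, s1=2.0, s2=20.0)
```

Each recovered value is a different weighted average, so the three estimates differ from one another, and the outlier at 500 receives a negligible weight in all of them.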
The proposed RIFID has the capacity to both recover incomplete data and create new samples, which directly contributes to its ability to handle issues with both incomplete data and imbalanced data. When processing incomplete data with imbalanced class, RIFID will first determine the number of minority samples that must be generated from the data. Next, the samples that must be generated will be defined as the missing samples that must be filled, and finally the method of section 3.1.1 will be used to recover all of the missing samples.
In Algorithm 1, the number of data to be generated, G, contains two parts, one being the number of samples to be filled by the incomplete data, and the other being the number of samples to be generated for minority classes of samples to rebalance.
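A small sketch of how G might be counted is given below. It assumes that missing entries are stored as None and that rebalancing brings every class up to the size of the largest class; both the rebalancing target and the function name are assumptions.

```python
from collections import Counter

def samples_to_generate(X, y):
    """Count G: incomplete samples to fill plus samples needed to rebalance."""
    incomplete = sum(1 for row in X if any(v is None for v in row))
    counts = Counter(y)
    largest = max(counts.values())
    rebalance = sum(largest - n for n in counts.values())
    return incomplete + rebalance

X = [[1.0, 2.0], [1.5, None], [0.9, 2.2], [4.0, 5.0], [4.2, 5.1]]
y = [0, 0, 0, 1, 1]
G = samples_to_generate(X, y)
print(G)  # 2: one incomplete row plus one sample to rebalance class 1
```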
Considering the difficulty to use unlabeled data, the semi-supervised classification is viewed as a supervised classification and unsupervised learning hybrid strategy. In the self-training based SSC, the labeled data are used to train the initial classification model, while the unlabeled examples bring additional significant information for the initial model, improving the accuracy of the model.
1: synthesize G samples and set all features by None;
2:
3: identify the number t of missing values in
4: set S1 and S2;
5: calculate step length,
6: calculate intervals,
7:
8: calculate w_i by (7);
9: calculate u_k by (10),(11);
10: recover missing values
11:
12:
Drawing on [17, 24, 26, 27] for the design and conceptualization of semi-supervised self-training classification systems, we propose a semi-supervised classification approach. After the incomplete data with imbalanced classes are completed and rebalanced using the RIFID method, we use this semi-supervised classification method to classify the recovered data. The method does not restrict the choice of initial classifier, so different initial classifiers can be used depending on the real data, such as classifiers suited to binary or multi-class classification, or to linear or non-linear classification. Additionally, because the RIFID technique can handle imbalanced and incomplete cases and is resistant to outliers, the resulting method is more reliable and robust than the semi-supervised classification methods mentioned in Section 2.1.
1: train the multi-classification,
2: predict pseudo-labels for unlabeled data by
3: combine pseudo-labels data with labeled data to train the final classification,
4:
The primary steps of the SSC-RIFID method are as follows: (1) preprocess the data using the RIFID method; (2) train multi-class classification models on the reconstructed labeled data; (3) predict pseudo-labels for the unlabeled data using the classification models trained in step (2); (4) combine the pseudo-labeled data with the labeled data to train the final classifier.
The SSC-RIFID algorithm flow chart is in Fig. 1.

SSC-RIFID algorithm flow chart.
Imputation and rebalancing of the data are crucial to the performance of the final classifier. We first demonstrate the advantages of RIFID in completing data and synthesizing samples as well as the effects of these oversampling and imputation techniques on self-training semi-supervised classification using toy examples. Additionally, we validate the efficiency of the self-training semi-supervised approach SSC-RIFID described in this paper using seven UCI datasets [21].
Numerical simulations
To compare the performance of the FID, IFID and RIFID methods in terms of synthetic samples, we carry out some simulations. The advantages of the proposed data reconstruction method RIFID are demonstrated by the examples in Figs. 2 and 3.

Numerical simulations about FID, IFID and RIFID on the Synthetic samples.

Numerical simulations about SSC-FID, SSC-IFID and SSC-RIFID.
The outcomes of the FID, IFID, and RIFID methods are contrasted in Fig. 2. The top left 3D plot depicts the original data: green dots represent the minority class, blue dots represent the majority class, and purple dots represent outliers in the minority class. The remaining plots compare the samples created by FID, IFID and RIFID, with synthetic samples represented by yellow dots. The samples produced by the FID method, as shown in Fig. 2, are distributed almost in a straight line and lack diversity. They are also still affected by outliers, with some synthetic samples generated close to the outliers, which increases the difficulty of the subsequent SSC step. By using quantiles, the RIFID method lessens the impact of outliers on the synthetic samples, resulting in a more accurate representation of the data distribution.
The effectiveness of classification is directly influenced by the quality of the synthesized samples. Figure 3 shows the impact of the data completion and rebalancing methods on the semi-supervised classification results. In Fig. 3, the data are first processed using the FID, IFID, and RIFID approaches, then 50% of the labels are randomly removed, and finally the semi-supervised logistic regression classification method is applied. Note that SSC-FID and SSC-IFID in this study mean that the semi-supervised classification approach is applied after the data have been filled in using the FID or IFID method, respectively.
In Fig. 3, the raw data are shown in the 3D plot in the upper left corner, and the pseudo-labels produced by SSC-FID, SSC-IFID, and SSC-RIFID are shown in the remaining plots. Brown markers denote unlabeled data: round brown dots are unlabeled samples originally from the majority class, and triangular brown dots are unlabeled samples originally from the minority class. The other settings are the same as those in Fig. 2, except for the red dots, which represent misclassified data. As can be observed, the proposed SSC-RIFID method produces the fewest misclassified samples of the three methods.
Data analysis
To show the generality of the suggested strategy across data sets, comparison tests with several cutting-edge methods are conducted on seven public data sets from the UCI machine learning repository [21]. Table 1 summarizes these seven data sets. The majority class is the class with the largest number of samples, and a minority class is any class with fewer samples than the majority class. For binary classification, the imbalance ratio (IR), shown in the last column, is the ratio of the number of samples in the minority class to the number of samples in the majority class. In the multi-class case, IR is the ratio of the size of the smallest minority class to the size of the largest majority class. When the imbalance ratio of the initial data is greater than 0.9, we consider the dataset to be balanced.
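The imbalance ratio defined above can be computed directly from the class labels. The function name and the toy labels are illustrative only.

```python
from collections import Counter

def imbalance_ratio(labels):
    """Size of the smallest class divided by the size of the largest class."""
    counts = Counter(labels).values()
    return min(counts) / max(counts)

labels = ["a"] * 50 + ["b"] * 20 + ["c"] * 5
ir = imbalance_ratio(labels)
is_balanced = ir > 0.9
print(ir, is_balanced)  # 0.1 False
```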
Basic information of datasets
The seven datasets employed in the experiments, as seen in Table 1, are all multi-class datasets. In the multi-class case, we remove 50% of the labels of each category at random, which avoids deleting all labels of a single category and thereby causing that category to disappear. We then apply completely random deletion to remove 20% or 40% of the feature values from the raw datasets.
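The two masking steps can be sketched as follows. The sketch assumes that hidden labels and deleted values are both marked with None and that each feature value is deleted independently with the given probability, which is one reading of completely random deletion. The function names are hypothetical.

```python
import random

def mask_labels(y, frac, rng):
    """Hide `frac` of the labels within each class; hidden labels become None."""
    y_masked = list(y)
    for c in set(y):
        idx = [i for i, lab in enumerate(y) if lab == c]
        # Sampling inside each class keeps every class represented.
        for i in rng.sample(idx, int(len(idx) * frac)):
            y_masked[i] = None
    return y_masked

def delete_values(X, rate, rng):
    """Delete each feature value independently with probability `rate`."""
    return [[None if rng.random() < rate else v for v in row] for row in X]

rng = random.Random(0)
y = [0] * 10 + [1] * 4
y_masked = mask_labels(y, 0.5, rng)
X = [[1.0, 2.0, 3.0] for _ in y]
X_missing = delete_values(X, 0.2, rng)
```

Masking per class rather than over the whole dataset guarantees that the small class in this example keeps two of its four labels.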
In Table 1, TS stands for total sample, representing the total number of samples in the dataset; Maj-samples stands for the largest majority class, and Min-samples stands for the smallest minority class.
To obtain unbiased findings, an 80%-20% train-test split and 10-fold cross-validation are used. As additional initial classifiers, we employ a Deep Neural Network (DNN) with two hidden layers, logistic regression (LR), and the C4.5 decision tree. Training is repeated 1000 times in total. Additionally, the data sets are polluted to test the robustness of the methodology. In this article, pollution refers to adding 5% abnormal data, located about 5 standard deviations (5σ) away from the mean of the original dataset.
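One way to generate such pollution is sketched below. The details are assumptions: each feature of an abnormal point is shifted by exactly 5σ from the feature mean in a random direction, the population standard deviation is used, and the function name is hypothetical.

```python
import random
from statistics import fmean, pstdev

def pollute(X, frac, n_sigma, rng):
    """Append frac * len(X) abnormal points lying n_sigma deviations from the mean."""
    cols = list(zip(*X))
    mu = [fmean(c) for c in cols]
    sd = [pstdev(c) for c in cols]
    n_out = round(frac * len(X))
    outliers = [
        # Assumed construction: shift every feature by n_sigma standard deviations.
        [m + rng.choice((-1, 1)) * n_sigma * s for m, s in zip(mu, sd)]
        for _ in range(n_out)
    ]
    return X + outliers

rng = random.Random(0)
X = [[float(i), float(i % 4)] for i in range(40)]
polluted = pollute(X, frac=0.05, n_sigma=5, rng=rng)
print(len(polluted))  # 42
```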
In this paper, semi-supervised classification is performed only after all incomplete data with imbalanced classes have been repaired. Therefore, several cutting-edge combinations of missing value recovery techniques and oversampling techniques are used as baselines to demonstrate the efficacy of our proposed technique. This study considers the following four strategies: EM [18] + ADASYN (abbreviated as EMAD), KNN [18] + MWMOTE (KNMW), NMF + Kmean-SMOTE [19] (NMKS), and FID. After the data have been processed by one of these methods, classification is carried out using the semi-supervised classification approach. We use SSC-EMAD to denote the process of first repairing the data using EMAD and then classifying it using the semi-supervised classification algorithm; SSC-KNMW, SSC-NMKS and SSC-FID are defined similarly. Note that the SSC-Naive method simply removes every observation with missing data and then applies the semi-supervised method, without imputing missing values or adding data to the dataset.
F-score and Gmean are generally selected as evaluation indices for imbalanced data classification outcomes [5]. The following indices, adapted from [20], are used for evaluation:
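As a point of reference, the sketch below computes the two indices under commonly used multi-class definitions: the F-score is the macro average of the per-class harmonic mean of precision and recall, and Gmean is the geometric mean of the per-class recalls. These definitions are assumptions and may differ in detail from the formulas adapted from [20].

```python
import math

def per_class_stats(y_true, y_pred):
    """Return {class: (precision, recall)} for every class in y_true."""
    stats = {}
    for c in sorted(set(y_true)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        stats[c] = (precision, recall)
    return stats

def macro_f_score(y_true, y_pred):
    """Unweighted mean over classes of 2PR / (P + R)."""
    f = [2 * p * r / (p + r) if p + r else 0.0
         for p, r in per_class_stats(y_true, y_pred).values()]
    return sum(f) / len(f)

def g_mean(y_true, y_pred):
    """Geometric mean of the per-class recalls."""
    recalls = [r for _, r in per_class_stats(y_true, y_pred).values()]
    return math.prod(recalls) ** (1 / len(recalls))

y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]
f_value = macro_f_score(y_true, y_pred)  # (0.75 + 0.5) / 2
g_value = g_mean(y_true, y_pred)         # sqrt(0.75 * 0.5)
```

Gmean drops to zero as soon as one class is never recognized, which makes it a strict index for imbalanced data, where plain accuracy can stay high even if the minority class is ignored.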
The main text of this paper shows only the experimental results of the C4.5 model to save space. The experimental findings for the other classification methods are included in the Appendix.
Through comparative experiments, the benefits of the suggested strategy are further shown in Tables 2–7. The comparison results for the C4.5 classifier with different missing rates, with and without pollution, are shown in Tables 2–5, where the best outcome for each indicator is highlighted in bold type.
Comparative experiment of 20% missing values using C4.5 without pollution
Comparative experiment of 40% missing values using C4.5 without pollution
Comparative experiment of 20% missing values using C4.5 with 5% pollution
Comparative experiment of 40% missing values using C4.5 with 5% pollution
Comparative experiment of 20% missing values using SSSVM with 5% pollution
Comparative experiment of 40% missing values using SSSVM with 5% pollution
These tables demonstrate that SSC-RIFID outperforms the competitors. Tables 2 and 3 both show that, in terms of F-score and Gmean, our proposed method offers the best performance for 11 out of 14 indices at the respective missing rates without pollution. To compare the robustness of these methods, we summarize the results at different missing rates for the polluted datasets in Tables 4 and 5. Table 4 shows that, for 20% missing data with 5% pollution, the SSC-RIFID technique outperformed the others in 9 of 14 measures. Table 5 reveals that, for 40% missing data with 5% pollution, the proposed method outperformed the other methods in 12 of 14 criteria. This implies that the SSC-RIFID technique is less sensitive to the influence of outliers.
The above experiments demonstrate that SSC-RIFID performs best under the proposed SSC framework. To further demonstrate the positive effect of RIFID on the final classification results, we used another type of semi-supervised self-training classification method, SSSVM, for comparative experiments. Note that SSSVM is only applicable to binary data. For multi-class classification, this paper employs the one-versus-rest strategy, in which the samples of one category are treated as one class and the remaining samples as another during training, resulting in k SSSVM classifiers for k categories of samples. An unknown sample is assigned to the category whose classifier gives the highest classification function value [30]. Tables 6 and 7 display the experimental results of the SSSVM classifier with 5% contamination and with 20% and 40% missing data, respectively. According to Tables 6 and 7, the SSSVM-RIFID method outperforms the other methods in 13 of 14 measurements in these cases.
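The one-versus-rest strategy can be illustrated with a short sketch. The binary scorer below is a hypothetical centroid-based stand-in, not an SSSVM; only the wrapping logic (k binary problems, highest decision value wins) reflects the strategy described above.

```python
from statistics import fmean

def fit_binary(X, y01):
    """Stand-in binary scorer: a positive score means closer to the positive centroid."""
    pos = [fmean(c) for c in zip(*[x for x, t in zip(X, y01) if t == 1])]
    neg = [fmean(c) for c in zip(*[x for x, t in zip(X, y01) if t == 0])]
    def score(x):
        d_pos = sum((a - b) ** 2 for a, b in zip(x, pos))
        d_neg = sum((a - b) ** 2 for a, b in zip(x, neg))
        return d_neg - d_pos
    return score

def fit_one_vs_rest(X, y):
    # One binary problem per class: that class against all remaining samples.
    return {c: fit_binary(X, [1 if t == c else 0 for t in y]) for c in sorted(set(y))}

def predict_one_vs_rest(scorers, x):
    # The class whose decision function value is highest wins.
    return max(scorers, key=lambda c: scorers[c](x))

X = [[0.0, 0.0], [0.2, 0.1], [5.0, 0.0], [5.1, 0.2], [0.0, 5.0], [0.1, 5.2]]
y = ["a", "a", "b", "b", "c", "c"]
scorers = fit_one_vs_rest(X, y)
print(predict_one_vs_rest(scorers, [4.8, 0.1]))  # b
```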
Furthermore, a comparison of Tables 4, 5 and Tables 6, 7 shows that classification performance on almost all datasets is worse under the SSSVM framework. This shows that the SSC-RIFID approach is more effective for classification when C4.5 serves as the initial classifier. Table 8 uses hypothesis testing to show whether there is a significant difference between the SSC and SSSVM methods. Without loss of generality, we take as the null hypothesis that there is no significant difference between the two algorithms. At a significance level of 0.05, all differences are statistically significant, because every p-value is substantially lower than 0.05. According to Table 8, the SSC method proposed in this paper surpasses the SSSVM method in terms of F-score and Gmean, regardless of the missing rate, the contamination rate, and the oversampling and imputation methods, demonstrating that the SSC-RIFID method is superior.
Mann–Whitney Wilcoxon test results of F-score and Gmean
I: SSC-EMAD vs SSSVM-EMAD; II: SSC-KNMW vs SSSVM-KNMW; III: SSC-NMKS vs SSSVM-NMKS; IV: SSC-FID vs SSSVM-FID; V: SSC-RIFID vs SSSVM-RIFID; VI: SSC-Naive vs SSSVM-Naive
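For readers unfamiliar with the test, the sketch below computes a two-sided Mann-Whitney U test with the normal approximation, without tie or continuity corrections. The score lists are made-up illustrative numbers, not results from this paper, and the exact variant of the test used for Table 8 is not specified, so this is only one plausible form.

```python
import math

def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test using the normal approximation."""
    # U counts the pairs (xi, yj) with xi > yj; ties count one half.
    u = sum((xi > yj) + 0.5 * (xi == yj) for xi in x for yj in y)
    n1, n2 = len(x), len(y)
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)  # no tie correction
    z = (u - mean_u) / sd_u
    p_value = math.erfc(abs(z) / math.sqrt(2))      # two-sided tail probability
    return u, p_value

# Hypothetical scores of two methods on seven datasets (illustration only).
scores_a = [0.91, 0.88, 0.93, 0.90, 0.87, 0.92, 0.89]
scores_b = [0.80, 0.78, 0.84, 0.79, 0.82, 0.81, 0.77]
u_stat, p_value = mann_whitney_u(scores_a, scores_b)
reject_null = p_value < 0.05
```

For samples this small an exact test is usually preferred over the normal approximation; statistical packages typically offer both.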
This paper calculates the value of Gmean for three datasets in order to analyse and investigate the effectiveness of SSC-RIFID under various pollution conditions. The results are summarised in Figs. 4–6.

Dermatology dataset for five pollution conditions.

Epileptic dataset for five pollution conditions.

Cortex-nuclear dataset for five pollution conditions.
In this study, the original data was contaminated with 5% outliers, and the contamination scenario was split into five categories based on how far the outliers were from the original data class centre, specifically, their distances of 1σ, 2σ, 3σ, 4σ, and 5σ.
The SSC-RIFID approach is represented by the purple line. Figures 4, 5, and 6 demonstrate that, compared to the other approaches, the SSC-RIFID method not only performs well at all pollution intensities but also has the slowest decline in Gmean as the contamination intensity increases. In other words, the performance of SSC-RIFID degrades more slowly under contamination than that of the other approaches.
In general, semi-supervised classification for complex data, such as data with missing values, imbalanced classes and outliers, is a challenging problem. We believe that an ideal method should have the following characteristics: competent data repair, capacity for multi-class and non-linear classification, and robustness to outliers. The comparison experiments outlined above show that various existing semi-supervised classification methods are either unsuitable for nonlinear classification or unable to handle incomplete data with imbalanced classes, and that most existing methods are sensitive to outliers. Table 9 lists the benefits and drawbacks of the 12 techniques used in the comparison experiments for this paper.
Summary merits of the proposed method relative to related competitors
I Robust to Noise II Multi-Classification III Nonlinear Classification IV Rebalance and Imputation Simultaneously
In this paper, we propose a two-step robust semi-supervised classification method. Four features characterize the proposed SSC-RIFID method. First, the enhanced membership function of the SSC-RIFID procedure creates more diverse synthetic samples, which benefits data recovery. Second, the SSC-RIFID procedure uses quantiles rather than maximum and minimum values, which reduces the impact of extreme values (or outliers). Third, by combining the advantages of self-training methods and the RIFID procedure, SSC-RIFID solves the semi-supervised multi-class classification problem for incomplete data with imbalanced classes. Finally, the SSC-RIFID method can be applied to both linear and nonlinear classification, which gives it a wider range of applications and makes it better able to handle complex real-world classification problems.
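The effect of replacing the minimum and maximum by quantiles can be seen in a small sketch. The function name `robust_bounds` and the 5% and 95% levels are assumptions made for illustration; the quantile levels actually used in the RIFID procedure are not restated here.

```python
from statistics import quantiles


def robust_bounds(values, lower=0.05, upper=0.95):
    """Return the (lower, upper) quantiles of `values` as range bounds.

    Unlike min/max, these bounds are barely affected by a few extreme
    values. The levels must be whole percentages between 0.01 and 0.99.
    """
    # 99 cut points: cuts[i] is the (i + 1)-th percentile.
    cuts = quantiles(values, n=100, method="inclusive")
    return cuts[round(lower * 100) - 1], cuts[round(upper * 100) - 1]


# Hypothetical feature values with a single extreme outlier.
values = list(range(1, 100)) + [1000]

print("min/max bounds: ", (min(values), max(values)))
print("quantile bounds:", robust_bounds(values))
```

With min and max, the single outlier stretches the range to 1000, whereas the quantile bounds stay within the bulk of the data.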
However, several issues related to the proposed method merit further research. First, the selection of the initial classifier in this paper requires a preliminary analysis of the data characteristics, so a data-driven adaptive selection of the initial classifier will be studied in future research. Second, this paper does not consider the case where the dimensionality of the features exceeds the number of observations, which will also be a focus of future research. Finally, the data in this work are vector data rather than matrix data; how to extend the proposed method to more complex data structures, such as tensor data, is another possible direction for future research.
Acknowledgments
This work was supported by the National Natural Science Foundation of China (Grant Nos. 11401383 and 62073223). The authors thank the editor and two anonymous referees for their insightful comments and helpful suggestions.
Appendix
Comparative experiment of 40% missing values using LR with 5% pollution
| Algorithm | Metric | Wine | MEU-Mobile | Libras | Dermatology | Cortex-Nuclear | Epileptic | Sat |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SSC-EMAD | F-score | 0.5199 | 0.4924 | 0.5675 | 0.5314 | 0.5599 | 0.5353 | 0.5531 |
| SSC-EMAD | Gmean | 0.6045 | 0.5639 | 0.5935 | 0.5819 | 0.6352 | 0.6023 | 0.6164 |
| SSC-KNMW | F-score | 0.5573 | 0.5332 | 0.5536 | | 0.6062 | 0.5630 | 0.6145 |
| SSC-KNMW | Gmean | 0.6259 | 0.5958 | 0.6145 | 0.6454 | 0.6746 | 0.6294 | 0.6632 |
| SSC-NMKS | F-score | 0.5745 | 0.5211 | | 0.5318 | 0.5554 | 0.5362 | 0.5504 |
| SSC-NMKS | Gmean | 0.6334 | 0.6044 | 0.6339 | 0.5709 | 0.6249 | 0.6195 | 0.6208 |
| SSC-FID | F-score | 0.5313 | 0.5106 | 0.5139 | 0.5285 | 0.5759 | 0.5311 | 0.6021 |
| SSC-FID | Gmean | 0.6378 | 0.5850 | 0.6225 | 0.6029 | 0.6729 | | 0.6634 |
| SSC-RIFID | F-score | | | 0.5446 | 0.5601 | | | |
| SSC-RIFID | Gmean | | | | | | 0.6356 | |
| SSC-Naive | F-score | 0.4987 | 0.4724 | 0.4634 | 0.5079 | 0.5163 | 0.4872 | 0.4699 |
| SSC-Naive | Gmean | 0.4934 | 0.5165 | 0.4860 | 0.5211 | 0.5336 | 0.5188 | 0.4849 |
