Abstract
Data gathered from the real world often contain label noise, which degrades data quality. Moreover, any data mining process deteriorates when it is applied to noisy data. In this paper, a new approach is proposed to improve data quality by correcting mislabeled data. The proposed method employs a procedure to estimate the level of noise in the data and combines this noise estimate with a correction process. A clustering method and a k nearest neighbors approach are applied in the correction process. Extensive experimental results on real-world data sets from the UCI machine learning repository are provided. The empirical study shows that our approach improves data quality in many cases and outperforms several existing correction methods.
Introduction
The traditional setting in data mining assumes that data are perfectly labeled. However, this assumption does not hold in many real-world applications. Labels may be corrupted, for example, in situations that rely on human supervision, such as bio-medical applications (DNA micro-array, MRI images, etc.). Mislabeled data undermine data quality [3] and may further degrade the performance of data mining processes [4, 15, 24]. Studying this situation is therefore an important task in data mining [11]. Existing studies on label noise follow two main approaches: the algorithm level approach and the data level approach [26]. The algorithm level approach aims to design robust supervised learning algorithms that are little affected by label noise. The data level approach focuses on identifying mislabeled data and then either removing or correcting them. This paper focuses on the data level approach, so that data quality can be improved and the corrected data can be reused in other applications.
In the literature [9], label noise is categorized into three groups based on its statistical properties. In the noise completely at random model, a label is flipped with a constant probability, independently of both the features and the true label. In the noise at random model, the probability of label noise depends on the true label but is independent of the features. In the noise not at random model, the probability of mislabeling depends on both the features and the true label. In this paper, we assume that the noise is completely at random.
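To make the noise completely at random assumption concrete, the following Python sketch injects such noise into a label vector. The function name inject_ncar_noise, the choice of drawing the replacement label uniformly from the other classes, and the use of NumPy are assumptions made for this illustration.

```python
import numpy as np


def inject_ncar_noise(y, noise_rate, rng=None):
    """Flip each label with probability `noise_rate`, independently of the
    features and of the true class (noise completely at random).

    Assumption for this sketch: a flipped label is replaced by a class drawn
    uniformly from the remaining classes.
    """
    rng = np.random.default_rng(rng)
    y = np.asarray(y).ravel()
    classes = np.unique(y)
    if len(classes) < 2:
        raise ValueError("at least two classes are needed to flip labels")

    y_noisy = y.copy()
    # One independent Bernoulli(noise_rate) draw per instance.
    flip = rng.random(y.shape[0]) < noise_rate
    for i in np.flatnonzero(flip):
        others = classes[classes != y[i]]
        y_noisy[i] = rng.choice(others)
    return y_noisy, flip
```

The returned mask marks exactly the instances whose labels were changed, which is convenient when the true labels are later needed to measure how well a correction method recovers them.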
To improve data quality, we propose a novel label noise correction approach composed of two sequential parts. The first part estimates the label noise rate in a dataset. The estimation process learns from confident instances, i.e. instances whose predicted probability for the true class is near 1 or whose predicted probability for the mislabeled class is near 0. If non-confident instances are treated as noisy data, the noise rate can be easily estimated. The second part combines this estimated noise rate with the correction of mislabeled data.
The rest of this paper is organized as follows. Section 2 presents a review of previous work on handling label noise. The proposed method is introduced in Section 3. In Section 4, we provide the experimental setup and results on UCI datasets. Section 5 concludes the paper.
Related work
Label noise has been widely investigated. There are two main approaches [26]. One is the algorithm level approach, which uses classification methods that tolerate the presence of noise. All these algorithms share a common characteristic: they are inherently robust to noisy data. Existing studies have shown that some algorithms are less affected than others when the training data contain mislabeled instances [9]. A typical method is C4.5 [21], which reduces the effects of noise by using pruning strategies. There are also some sensitive classification approaches, like
Among the data level approaches, two typical strategies are noise filtering and noise correction. Noise filters identify and remove mislabeled instances from the training data; examples include [5, 10, 12]. Noise correction, in contrast, does not remove the incorrect examples but corrects them. In the literature, removing noisy labels is more common than correcting them [9]. Filtering is easy to implement, since mislabeled instances are simply removed, but in some cases filters are likely to remove too much data, which may decrease classification performance [9]. Moreover, some data are expensive and difficult to collect, so only a small amount of labeled data is available [3]. In this setting, correcting label noise is more suitable.
However, the literature on correcting label noise is sparse. Attribute noise and label noise are both corrected in [23]. A correction algorithm for textual ‘spectral’ labels is developed in [22]. Two correction methods, self-training and cluster-based, are proposed in [18], where the cluster-based method is shown to perform better. However, these correction strategies do not consider the noise level in the data. Moreover, although a recent correction strategy has been investigated in the crowdsourcing setting [28], it is only applicable to binary classification problems. Inspired by the cluster-based method [18], we intend to build a general algorithm that can deal with both binary and multi-class classification.
There are several works on estimating the noise level [16, 17, 19]. However, these methods only work with binary class problems. We want to take noise levels into account and deal with multi-class problems.
The new label noise correction method
In this work, we propose a novel method for handling label noise. Compared with previous work on correcting noisy data, our approach has two main advantages. First, we design a new algorithm to estimate the noise rate, which works with multi-class problems. Second, the correction process makes full use of the advantages of both supervised and unsupervised learning algorithms.
Estimating noise rate of data
In the proposed strategy, estimating noise rate of data is an important part. We employ
There are three steps in estimating noise rate process. Suppose that there is an instance
More formally, given the training sample
where
The threshold values are defined as
where
where the label of
The number of potential mislabeled instances is
And noise rate
where
The pseudo-code of the method is detailed in Algorithm 1.
Algorithm 1. The noise level estimation algorithm
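As a rough illustration of the idea behind the noise level estimation, the following Python sketch flags an instance as potentially mislabeled when the cross-validated predicted probability of its given label falls below a threshold, and reports the flagged fraction as the noise rate. The base classifier (k nearest neighbors), the fixed threshold of 0.5, the 5-fold cross-validation and the function name estimate_noise_rate are assumptions made for this example; they are not taken from Algorithm 1.

```python
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier


def estimate_noise_rate(X, y, threshold=0.5, n_neighbors=5, cv=5):
    """Estimate the label noise rate as the fraction of non-confident instances.

    Assumptions for this sketch: a kNN base classifier and a single fixed
    confidence threshold shared by all classes.
    """
    y = np.asarray(y).ravel()
    # Map labels to column indices; np.unique sorts classes in the same
    # order scikit-learn uses for the predict_proba columns.
    classes, y_idx = np.unique(y, return_inverse=True)

    clf = KNeighborsClassifier(n_neighbors=n_neighbors)
    # Out-of-fold probabilities, so no instance votes for its own label.
    proba = cross_val_predict(clf, X, y, cv=cv, method="predict_proba")

    # Predicted probability of the label each instance currently carries.
    p_given = proba[np.arange(len(y)), y_idx]
    suspect = p_given < threshold
    return float(suspect.mean()), suspect
```

The estimate is only as good as the base classifier: clean instances in regions with many noisy neighbors may be flagged, so the returned rate should be read as an approximation.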
Correction algorithm
Based on K-means clustering algorithm [13] and
Our correction algorithm chooses a different correction process according to the noise level. Based on the degree of corruption of the data, we divide the noise rate into three levels:
Algorithm 2. Modified cluster correction
Lines 1–4 in Algorithm 2 compute the distribution of the corrected labels in the data set. This will calculate the label weights for each instance, shown in Algorithm 3. Lines 5–13 in Algorithm 2 perform all required correction process, which employs K-means and
In Algorithm 3, Lines 1–3 calculate the distribution of no useful information in a cluster. Line 4 works to give more magnitude to the larger clusters. Line 6 means the difference between the majority label
Algorithm 3. CalcWeights
Algorithm 4. Modified_by_kNN
Line 1 in Algorithm 4 calculates the number of the nearest neighbors. Lines 2–7 calculate the weights for each cluster.
Different from previous work, we consider the influence of the amount of label noise in the data set. In addition, we make full use of the clustering results: using the neighbors of a cluster increases the probability of correcting labels.
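The general idea of combining clustering with a nearest neighbor vote can be sketched in Python as follows. This is a deliberately simplified stand-in and not the MCC procedure of Algorithms 2–4: it omits the label weights and the dependence on the estimated noise level. The function name cluster_knn_correct, the purity threshold and the default parameter values are all assumptions made for this illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier


def cluster_knn_correct(X, y_noisy, n_clusters=4, purity=0.7,
                        n_neighbors=5, random_state=0):
    """Simplified cluster-plus-kNN label correction (illustrative only)."""
    X = np.asarray(X, dtype=float)
    y_noisy = np.asarray(y_noisy).ravel()
    classes, y_idx = np.unique(y_noisy, return_inverse=True)

    cluster_ids = KMeans(n_clusters=n_clusters, n_init=10,
                         random_state=random_state).fit_predict(X)
    # kNN is fitted on the noisy labels; each point counts itself as a neighbor.
    knn = KNeighborsClassifier(n_neighbors=n_neighbors).fit(X, y_noisy)

    y_corrected = y_noisy.copy()
    for c in np.unique(cluster_ids):
        members = np.flatnonzero(cluster_ids == c)
        counts = np.bincount(y_idx[members], minlength=len(classes))
        majority = counts.argmax()
        if counts[majority] / len(members) >= purity:
            # Cluster is dominated by one label: relabel every member with it.
            y_corrected[members] = classes[majority]
        else:
            # Mixed cluster: fall back to a local nearest neighbor vote.
            y_corrected[members] = knn.predict(X[members])
    return y_corrected
```

With the true labels available, label quality after correction is simply the fraction of entries of the returned vector that agree with them.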
Experiments and results
Our experimental goal is to demonstrate that the proposed method improves data quality and outperforms three existing label noise correction methods. One is the polishing labels method [23], which considers both feature noise and label noise. In this paper, we simplify the polishing labels method by ignoring the feature noise. The other two are self-training correction and cluster-based correction [18]. We evaluate performance in three aspects: label quality, model quality and AUC. Section 4.1 presents the experimental setup using several artificial levels of noise and the datasets employed in these experiments, which are extracted from the UCI repository (
Experiment setup
In the following studies, we generate the noisy data based on uniform distribution, that is, given a data set
Our implementation of the proposed method (MCC) is mostly detailed in the pseudo-code shown in Algorithms 1–4. In the estimating noise rate process, we employ
To investigate the performance of the proposed approach, we choose 24 datasets from UCI repository. Table 1 summarizes the detailed information of these datasets, including number of examples (#EX), number of categorical variables (#CAT), number of numerical variables (#NUM), and number of classes (#Class). These datasets contain binary datasets and multi-class datasets. Note that we use them as they are, that is, we do not preprocess the features of instances in our experiments.
Table 1. Description of the 24 UCI datasets used in the experiments
In this section, we evaluate the performance of MCC and three other correction methods: simplified polishing labels (PL), self-training correction (STC) and the cluster-based method (CC). Performance is measured in three aspects: label quality, model quality and AUC. Label quality is the proportion of instances whose corrected labels are the same as the true labels. Model quality is the cross-validated accuracy of a classifier trained on the corrected data set. In our experiments, we use sklearn.tree.DecisionTreeClassifier (
Label quality
This section reports the performance of each label correction algorithm. Figure 1 presents the label quality results on the labor dataset. The horizontal axis is the noise rate, and the vertical axis is the accuracy, that is, the label quality metric. The dashed line is the baseline against which the effectiveness of the four approaches is measured. From Fig. 1, we can see that both MCC and CC improve the label quality and that MCC performs better. At each noise rate, the line marked with circles lies above the other four lines. Moreover, the accuracy of MCC reaches 90% for noise rates in the interval [0.05, 0.30], and is above 80% for noise rates in [0.40, 0.45]. The detailed data are presented in Table 2, where the best accuracy at each noise rate is in bold. The results on the labor data set show that MCC gives the best accuracy (98.2%) at a noise rate of 15%, which is much better than the second best algorithm, CC (91.2%). The same pattern, where MCC performs better than the other three methods at every or most noise rates, holds for 14 of the 24 data sets. These include 7 binary class data sets and 7 multi-class data sets.
Table 2. Accuracy results for labor dataset
Table 3. Average binary accuracy
Fig. 1. Label quality on labor data set.
Fig. 2. Label quality on diabetes data set.
Fig. 3. Label quality on credit-a data set.
Fig. 4. Average label quality results on binary data sets.
Fig. 5. Average label quality results on multi-class data sets.
In some cases, the four approaches are competitive with each other in terms of label quality improvement. Figure 2 shows the results on the diabetes data set as an example: all four methods improve the label quality when the noise rate is in the interval [0.25, 0.45], but the improvements are slight. Further detailed data can be found in Appendix Tables 5 and 6. This pattern occurs in 2 binary class data sets and 4 multi-class data sets.
However, on some data sets MCC does not perform well, such as credit-a in Fig. 3. This is the case for 3 binary class data sets, including credit-a, and for 1 multi-class data set, the auto data set. All the label quality results on the multi-class data sets are shown in the Appendix (Fig. 12).
Fig. 6. Average model quality results on binary data sets.
Fig. 7. Average model quality results on multi-class data sets.
Fig. 8. Average AUC results on binary data sets.
Fig. 9. Average AUC results on multi-class data sets.
Fig. 10. Effects of the type of features.
To summarize the performance of the four correction algorithms, the average results on the 12 binary data sets and the 12 multi-class data sets are shown in Figs 4 and 5, respectively. From Fig. 4, we observe that the results of all four approaches are above the baseline when the noise rate in the binary datasets is greater than 20%, which means that these correction algorithms improve the label quality. MCC also outperforms the other three approaches over the experimental noise range, except at 20%, where CC and MCC are tied and both dominate STC and PL. In particular, the advantage of MCC becomes more pronounced as the noise rate increases. Detailed average results for the binary class datasets are presented in Table 3. Figure 5 displays the average results on the multi-class datasets. Both MCC and CC improve label quality on average, with MCC showing larger improvements than CC when the noise rate is greater than 15%. To conclude, MCC is the best method in terms of average label quality improvement. We also performed experiments that replace K-means with a self-organizing map (SOM). Because the SOM-based method is much slower, we evaluated label quality on only 14 datasets. The results indicate that using K-means is better. The average result of these experiments is shown in the Appendix (Fig. 13).
The average model quality results on the binary class and multi-class datasets are shown in Figs 6 and 7, respectively. Note that there are 6 lines in each figure. The black dashed line is the baseline obtained from the noisy data, and the cyan dashed line is obtained from the clean data, which contain no label noise.
From Fig. 6, we find that all four methods improve model quality on average on the binary class datasets at all noise rates, and that MCC outperforms the other three algorithms at high noise rates. In Fig. 7, it is clear that MCC performs best at all noise rates, followed by CC, then PL, with STC last. The model quality results on each multi-class dataset can be found in the Appendix (Fig. 14). To summarize, MCC performs best in terms of model quality.
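A model quality measurement of this kind can be sketched in Python with sklearn.tree.DecisionTreeClassifier. Two details are assumptions made for this example rather than facts from the experimental setup: the number of folds (10, stratified and shuffled) and the choice to score held-out predictions against the true labels while training on the corrected (or noisy) labels. The function name model_quality is likewise hypothetical.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier


def model_quality(X, y_train_labels, y_true, n_splits=10, random_state=0):
    """Cross-validated accuracy of a decision tree trained on the given labels.

    Assumption for this sketch: each fold is trained on `y_train_labels`
    (noisy, corrected or clean) and scored against `y_true`.
    """
    X = np.asarray(X, dtype=float)
    y_train_labels = np.asarray(y_train_labels).ravel()
    y_true = np.asarray(y_true).ravel()

    skf = StratifiedKFold(n_splits=n_splits, shuffle=True,
                          random_state=random_state)
    scores = []
    for train_idx, test_idx in skf.split(X, y_true):
        clf = DecisionTreeClassifier(random_state=random_state)
        clf.fit(X[train_idx], y_train_labels[train_idx])
        # Accuracy on the held-out fold, measured against the true labels.
        scores.append(np.mean(clf.predict(X[test_idx]) == y_true[test_idx]))
    return float(np.mean(scores))
```

Passing the true labels as the training labels gives a clean-data reference, and passing the noisy labels gives a noisy-data baseline, analogous to the two dashed lines described for Figs 6 and 7.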
AUC
For most binary class datasets, there are only slight differences between the algorithms at low noise rates. As the noise rate increases, the differences become distinct (e.g. breast-w, heart-stat-log and mushroom). For most multi-class data sets, MCC and CC perform better than STC and PL at most noise rates in terms of AUC improvement, and MCC is superior (e.g. balance-scale, vowel and lymph).
Figures 8 and 9 present the average AUC results on the binary class and multi-class datasets, respectively. In Fig. 8, for the binary class data sets, MCC, CC and PL differ only slightly when the noise rate is below 20%; PL deteriorates as the rate increases; and MCC is the best once the noise rate reaches 30%. In Fig. 9, MCC and CC are almost the same, and STC and PL are very competitive with each other, when the noise rate is in the interval [0.05, 0.20]; when the noise rate is greater than 20%, MCC clearly performs best, CC is second, followed by PL, with STC last. More details on the multi-class datasets are given in the Appendix (Fig. 15). These results show that MCC outperforms the other three methods in terms of AUC on average.
Effects of the type of features
The type of features may affect the performance of the proposed method. We therefore evaluate our framework separately by feature type, distinguishing numerical features from categorical features. Among the experimental datasets, 3 contain only categorical features and 10 contain only numerical features; the remaining datasets contain both. We report the average label quality results to indicate the effect of the feature type on the performance of the algorithm. The results are shown in Fig. 10. MCC outperforms the other methods in most settings. On the datasets containing only categorical features, when the noise rate is less than 0.25, STC performs best, MCC comes a close second, followed by CC, and PL is the worst. When the noise rate is greater than 0.25, MCC is the best and STC is second, while the ranking of the remaining two methods is unchanged. On the datasets containing only numerical features, MCC is the best. On the mixed datasets, CC performs better than MCC when the noise rate is less than 0.25, but MCC performs best when the noise rate is greater than 0.25. Thus, we conclude that MCC improves performance most clearly when the noise rate is high, which is the setting MCC is best suited to.
Conclusion
In this paper, we introduced a label noise correction method, called MCC, for improving data quality in the presence of label noise. It employs a procedure to estimate the level of noise in the data and combines this noise estimate with a correction process. The experimental studies show that MCC can effectively improve label quality. In particular, MCC appears to be better than existing correction methods. In addition, the improvement in label quality translates into better classification accuracy and AUC.
Our proposed method relies on an estimate of the noise rate. In future work, we will consider how to estimate the noise rate of noisy data more accurately.
Acknowledgments
The authors wish to thank the Editor, an Associate Editor, and two anonymous referees for their valuable comments which have helped improve the presentation and quality of this paper a great deal. This research was supported by the Natural Science Basic Research Plan in Shaanxi Province (Program NO. 2017JQ1013), the Fundamental Research Funds for the Central Universities (Grant JB180705), and the National Natural Science Foundation of China (Grant 61573266).
Appendix
Tables 4 and 5 present the accuracy results for two datasets: diabetes and credit-a. The average accuracy results on the twelve multi-class datasets are summarized in Table 6. Tables 7–10 illustrate the average model quality results and the average AUC results for binary class datasets and multi-class datasets. Figure 11 shows the effect of the parameter
Table 4. Accuracy results for diabetes dataset
Method
Noise_rate
5%
10%
15%
20%
25%
30%
35%
40%
45%
MCC
0.790
0.780
0.720
0.660
0.600
CC
0.831
0.802
0.802
0.767
0.738
0.677
0.616
0.552
STC
0.836
0.819
0.770
0.736
0.721
0.676
0.615
0.605
PL
0.803
0.797
0.781
0.763
0.745
0.701
0.688
Table 5. Accuracy results for credit-a dataset
Method
Noise_rate
5%
10%
15%
20%
25%
30%
35%
40%
45%
MCC
0.851
0.851
0.820
0.770
0.759
0.759
0.72
0.620
0.551
CC
0.883
0.878
0.854
0.758
0.696
STC
0.899
0.881
0.868
0.846
0.822
0.787
0.597
PL
0.820
0.829
0.723
0.717
0.649
0.535
Table 6. Average multi-class accuracy
Method
Noise_rate
5%
10%
15%
20%
25%
30%
35%
40%
45%
MCC
0.864
0.838
CC
0.855
0.847
0.833
0.810
0.796
0.780
0.753
0.697
STC
0.838
0.802
0.758
0.705
0.653
0.623
0.590
0.524
PL
0.750
0.758
0.736
0.724
0.703
0.701
0.690
0.653
0.611
Table 7. Average binary model quality
Method
Noise_rate
5%
10%
15%
20%
25%
30%
35%
40%
45%
True
0.812
0.801
0.808
0.810
0.807
0.798
0.799
0.802
0.795
MCC
0.810
0.814
CC
0.818
0.798
0.806
0.781
0.745
0.675
0.635
0.557
STC
0.811
0.787
0.793
0.740
0.697
0.687
0.649
0.564
PL
0.786
0.789
0.783
0.772
0.731
0.704
0.657
0.647
0.539
Noise
0.758
0.717
0.689
0.649
0.612
0.581
0.550
0.534
0.481
Table 8. Average multi-class model quality
Method \ Noise_rate   5%     10%    15%    20%    25%    30%    35%    40%    45%
True                  0.762  0.765  0.760  0.757  0.757  0.761  0.761  0.765  0.767
MCC
CC                    0.730  0.723  0.711  0.696  0.682  0.669  0.640  0.591  0.535
STC                   0.676  0.632  0.594  0.548  0.504  0.449  0.420  0.368  0.336
PL                    0.676  0.682  0.670  0.641  0.629  0.617  0.608  0.557  0.518
Noise                 0.672  0.613  0.561  0.501  0.451  0.408  0.390  0.316  0.292
Table 9. Average binary AUC
Method
Noise_rate
5%
10%
15%
20%
25%
30%
35%
40%
45%
True
0.835
0.822
0.828
0.832
0.827
0.818
0.821
0.824
0.819
MCC
0.829
0.834
0.817
0.821
CC
0.837
0.803
0.771
0.707
0.664
0.592
STC
0.829
0.808
0.814
0.765
0.725
0.717
0.681
0.599
PL
0.796
0.799
0.798
0.786
0.754
0.732
0.679
0.670
0.580
Noise
0.784
0.753
0.722
0.693
0.656
0.634
0.599
0.586
0.537
Table 10. Average multi-class AUC
Method \ Noise_rate   5%     10%    15%    20%    25%    30%    35%    40%    45%
True                  0.885  0.887  0.883  0.884  0.885  0.886  0.886  0.885  0.888
MCC
CC                    0.862  0.860  0.853  0.846  0.840  0.833  0.822  0.800  0.766
STC                   0.842  0.821  0.806  0.786  0.758  0.729  0.718  0.690  0.664
PL                    0.812  0.819  0.811  0.792  0.788  0.783  0.783  0.754  0.735
Noise                 0.844  0.818  0.794  0.765  0.738  0.718  0.706  0.666  0.646
Fig. 11. Effect of the parameter k for autos, segment, vowel and zoo.
Fig. 12. Accuracy results on multi-class data sets.
Fig. 13. Average accuracy results on 14 datasets using SOM.
Fig. 14. Model quality results on multi-class data sets.
Fig. 15. AUC results on multi-class data sets.
