Improving data quality with label noise correction

Abstract

Data gathered from the real world often contains label noise, which degrades data quality. Moreover, any data mining process deteriorates when it is applied to noisy data. In this paper, a new approach is proposed to improve data quality by correcting mislabeled data. The proposed method employs a procedure to estimate the level of noise in the data and combines this noise estimate with a correction process. A clustering method and a $k$ nearest neighbors approach are applied in the correction process. Extensive experimental results on real-world data sets from the UCI machine learning repository are provided. The empirical study shows that our approach successfully improves data quality in many cases and outperforms several correction methods.

Keywords

Label noise, noise correction, noise rate estimation, classification

1. Introduction

The traditional setting in data mining assumes that data is perfectly labeled. However, this assumption does not hold in many real-world applications. Labels may be corrupted, for example, in settings that rely on human supervision, such as bio-medical applications (DNA micro-array, MRI images, etc.). When data is mislabeled, data quality cannot be guaranteed [3], and the performance of the data mining process may suffer [4, 15, 24]. Studying this situation is an important task for the data mining area [11]. Existing studies on label noise follow two main approaches: the algorithm level approach and the data level approach [26]. The algorithm level approach aims to design robust supervised learning algorithms that are little affected by label noise. The data level approach focuses on identifying mislabeled data and then removing or correcting it. This paper focuses on the data level approach, so that data quality can be improved and the corrected data can be used in other applications.

In the literature [9], label noise is categorized into three groups based on its statistical properties. In the noise completely at random model, a label is flipped with a constant probability, independently of both the features and the true label. In the noise at random model, the probability of label noise depends on the true label but is independent of the features. In the noise not at random model, the probability of mislabeling depends on both the features and the label. In this paper, we assume that the noise is completely at random.

To improve data quality, we propose a novel label noise correction approach composed of two sequential parts. The first part estimates the label noise rate in a dataset. The estimation process learns from confident instances, i.e. instances whose predicted probability is near 1 for the true class or near 0 for the mislabeled class. If we consider non-confident instances as noisy data, then the noise rate can be easily estimated. The second part combines this estimated noise rate with the correction of mislabeled data.

The rest of this paper is organized as follows. Section 2 presents a review of previous work on handling label noise. The proposed method is introduced in Section 3. In Section 4, we provide the experimental setup and results on UCI datasets. Section 5 concludes the paper.

2. Related work

Dealing with label noise has been widely investigated. There are two main approaches [26]. One is the algorithm level method, which uses classification methods that tolerate the presence of noise. All these algorithms share a common characteristic: they are inherently robust to noisy data. Existing studies have shown that some algorithms are less affected than others when the training data contains mislabeled data [9]. A typical method is C4.5 [21], which reduces the effects of noise by using pruning strategies. There are also noise-sensitive classification approaches, like the $k$ nearest neighbors approach ( $k$ NN) [8], especially when $k$ is small [6]. Unfortunately, the robust algorithms are not sufficient, because they perform poorly when the noise rate is high. By comparison, a data level approach is a better choice, and moreover, the cleaned data can be reused in many cases.

Among the data level approaches, two typical strategies are noise filtering and noise correction. Noise filters identify and remove mislabeled instances from the training data; examples include [5, 10, 12]. In contrast, noise correction does not remove the incorrect examples but corrects them. In the literature, removing noisy labels is more common than correcting them [9]. Removing mislabeled instances is easy to implement, but in some cases these filter methods are likely to remove too much data, which may decrease classification performance [9]. Moreover, some data is expensive and difficult to collect, so only a small amount of labeled data is available [3]. In this setting, it is therefore more suitable to correct label noise.

However, the literature on correcting label noise is sparse. Attribute noise and label noise are both corrected in [23]. A correction algorithm for textual ‘spectral’ labels is developed in [22]. Two correction methods, self-training and cluster-based, are proposed in [18], where the cluster-based method is shown to be better. But these correction strategies do not consider the noise level in the data. Moreover, although a recent correction strategy has been investigated in the crowdsourcing setting [28], it is only applicable to binary classification problems. Inspired by the cluster-based method [18], we intend to build a general algorithm which can deal with both binary and multi-class classification.

There are several works on estimating the noise level [16, 17, 19]. However, these methods only work for binary class problems. We want to take noise levels into account and deal with multi-class problems.

3. The new label noise correction method

In this work, we propose a novel method for handling label noise. Unlike previous work on correcting noisy data, our proposed approach has two main advantages. First, we design a new algorithm to estimate the noise rate, which can work with multi-class problems. Second, the correction process makes full use of the advantages of both supervised and unsupervised learning algorithms.

3.1 Estimating noise rate of data

In the proposed strategy, estimating noise rate of data is an important part. We employ $k$ NN [8], which is a typical instance-based learning algorithm [2]. Existing studies show that this algorithm is sensitive to noisy data [14, 25] and has been used to identify mislabeled instances [1, 20].

There are three steps in the noise rate estimation process. Suppose that there is an instance $x_{i}$ with label $y_{i}$ , and the number of classes in the dataset is $m$ . Firstly, we use the $k$ NN classifier to derive probability estimates for each instance in the data set. Specifically, we take this single instance as the test set, build a model on all the remaining instances, and obtain the probability estimates for all classes ( $p_{x_{i}}^{1}$ , $p_{x_{i}}^{2}$ , $\cdots$ , $p_{x_{i}}^{m}$ ) on $x_{i}$ . Secondly, we find thresholds to detect anomalous instances. The threshold of a class is the mean probability estimate for that class over all examples labeled with it. For example, the threshold of class $J$ is $T_{J}=\frac{\sum_{x\in J}p_{x}^{J}}{|J|}$ , where $|J|$ is the number of instances in class $J$ . Thirdly, we count the number of potentially mislabeled instances and compute their percentage among all instances. Suppose that class $M$ is the given label of the instance $x_{i}$ ; if there is a class $J\neq M$ with $p_{x_{i}}^{J}>T_{J}$ , then $x_{i}$ is a potentially mislabeled instance. We apply this test to every instance in the data set, count all the potentially mislabeled instances, and calculate their proportion of all instances.

More formally, we are given the training sample $X=\{X_{1},X_{2},\cdots,X_{n}\}$ and the corresponding labels $Y\in\{0,1,2,\cdots,m\}$ . Using the $k$ NN classifier, we obtain

$\displaystyle g(x)=P(Y=l|x),$

where $l\in\{0,1,2,\cdots,m\}$ .

The threshold values are defined as

$\displaystyle T_{l}=E_{X}[g_{l}(x)],$

where $l\in\{0,1,2,\cdots,m\}$ , $g_{l}(x)$ is the $l$ th element of $g(x)$ .

$x_{j}\in X$ is a potential mislabeled instance, if there exists an $i\in\{0,1,2,\cdots,m\}$ and $i\neq l$ , such that

$\displaystyle g_{i}(x_{j})=P(Y=i|x_{j})>T_{i},$

where the label of $x_{j}$ is $l$ , $g_{i}(x)$ is the $i$ th element of $g(x)$ .

The number of potential mislabeled instances is

$\displaystyle\textit{mis\_nums}=\sum_{i\in\{1,\cdots,n\}}\mathbf{I}\,(x_{i}\textit{ is a potential mislabeled instance})$

and the noise rate is

$\displaystyle\gamma=\frac{\textit{mis\_nums}}{n},$

where $n$ is the number of all instances.

The pseudo-code of the method is detailed in Algorithm 1.

Algorithm 1 The noise level estimation algorithm

Input: Data set $X$ , label set $Y$ , n_neighbors $k$
Output: The noise level estimation $\gamma$
1: for $i\leftarrow 1$ to $n$ do
2:     $\textit{TrainSet}\leftarrow X/X_{i}$
3:     $\textit{TestSet}\leftarrow X_{i}$
4:     $g(x_{i})=\textit{kNN}(\textit{TrainSet},\textit{TestSet},k)$
5: end for
6: for $l\leftarrow 0$ to $m$ do
7:     $T_{l}=E_{X}[g_{l}(x)]$
8: end for
9: $\textit{mis\_nums}=0$
10: for $i\leftarrow 1$ to $n$ do
11:     for $j\leftarrow 0$ to $m$ do
12:         if $g_{j}(x_{i})=P(Y=j|x_{i})>T_{j}$ and $j\neq l$ then
13:             $\textit{mis\_nums}=\textit{mis\_nums}+1$
14:             break
15:         end if
16:     end for
17: end for
18: $\gamma=\frac{\textit{mis\_nums}}{n}$
19: return $\gamma$
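The three estimation steps can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names `knn_proba` and `estimate_noise_rate` are hypothetical, Euclidean distance and unweighted neighbor votes are assumed, and the thresholds use the per-class mean $T_{J}$ from the step-by-step description (the mean over the instances labeled $J$).

```python
import math
from collections import Counter


def knn_proba(train_X, train_y, x, k, classes):
    """Fraction of the k nearest training neighbors of x that carry each class label."""
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))
    votes = Counter(train_y[i] for i in order[:k])
    return {c: votes[c] / k for c in classes}


def estimate_noise_rate(X, y, k=3):
    """Return (gamma, flagged): estimated noise rate and indices of suspect instances."""
    classes = sorted(set(y))
    n = len(X)

    # Step 1: leave-one-out kNN probability estimates g(x_i) for every instance.
    g = [knn_proba(X[:i] + X[i + 1:], y[:i] + y[i + 1:], X[i], k, classes)
         for i in range(n)]

    # Step 2: threshold T_J = mean of p^J over the instances labeled J (assumption).
    T = {}
    for c in classes:
        members = [g[i][c] for i in range(n) if y[i] == c]
        T[c] = sum(members) / len(members)

    # Step 3: flag x_i if some class other than its own label exceeds that class's threshold.
    flagged = [i for i in range(n)
               if any(g[i][c] > T[c] for c in classes if c != y[i])]
    return len(flagged) / n, flagged
```

On two well-separated groups of five points each, with one point of the first group carrying the label of the second, this sketch flags only that point, so the estimated rate is 1/10.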

3.2 Correction algorithm

Based on the K-means clustering algorithm [13] and the $k$ NN algorithm [8], we propose a novel approach for label noise correction. The advantage of clustering is that clusters are formed using instances’ features instead of their labels, which makes it more robust to noise than supervised approaches [18]. The $k$ NN algorithm is used to make full use of the information in the data, so that noisy labels can be corrected with high probability.

Our correction algorithm chooses a different correction process according to the noise level. Based on the degree of corruption of the data, we divide the noise rate into three levels: $\gamma\leqslant$ 15%, 15% $<\gamma\leqslant$ 30%, and $\gamma>$ 30%, which are the low, middle and high noise levels, respectively. The idea is that the confidence in a label has three possible cases: most confident, less confident and least confident. At the low noise level, we employ only the clustering method, because in this situation most labels in a cluster are correct. At the middle noise level, the number of true labels in a cluster decreases, which requires us to also refer to the local neighborhood of the cluster when the number of clusters is large. At the high noise level, even fewer labels are correct, and we proceed as in the middle noise level. Moreover, we define a weight, which is characterized by the dominant label in a cluster. That is, we use this weight to trade off information within a cluster against information between clusters. The pseudo-code is shown in Algorithm 2. This framework is derived from [18].

Algorithm 2 Modified cluster correction

Input: Data set $X$ , corresponding label vector $Y$ , number of clusterings $a$ , noise_rate $\gamma$
Output: Corrected_label $Y$
1: for $i\leftarrow 1$ to $n$ do
2:     $\textit{index}\leftarrow y_{i}$
3:     $\textit{LabelTotals}_{\textit{index}}\leftarrow\textit{LabelTotals}_{\textit{index}}+1$
4: end for
5: for $i\leftarrow 1$ to $a$ do
6:     $k\leftarrow\frac{i}{a}\times\frac{n}{2}+2$
7:     $C\leftarrow\textit{K-meansCluster}(X,k)$
8:     for $j\leftarrow 1$ to number of clusters $|C|$ do
9:         for all instances $x$ in $C_{j}$ do
10:             $\textit{InsWeights}_{x}\leftarrow\textit{InsWeights}_{x}+\textit{CalcWeights}(C_{j},\textit{LabelTotals},\textit{noise\_rate})$
11:         end for
12:     end for
13: end for
14: for all instances $x$ in $X$ do
15:     $\textit{LabelGuess}\leftarrow\textit{argmax}(\textit{InsWeights}_{x})$
16:     $y_{i}\leftarrow\textit{LabelGuess}$
17: end for

Lines 1–4 in Algorithm 2 compute the distribution of the labels in the data set, which is used to calculate the label weights for each instance as shown in Algorithm 3. Lines 5–13 in Algorithm 2 perform the correction process, which employs K-means and $k$ NN. Line 6 establishes the range of the parameter $k$ in K-means. It varies from about 2 to $\frac{n}{2}+2$ , which adds to the diversity of the clusterings. Line 7 performs the K-means clustering. Lines 14–17 correct the labels using the calculated weights.
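To make the schedule of Line 6 and the final relabeling of Lines 14–17 concrete, here is a small sketch. The function names are hypothetical, and rounding $k$ down to an integer is an assumption, since the pseudo-code does not say how a fractional cluster count is handled.

```python
def cluster_count_schedule(n, a):
    """Number of K-means clusters for each of the a rounds: k = (i / a) * (n / 2) + 2.

    Integer division is used so that k is a whole number (rounding down is an assumption).
    """
    return [(i * n) // (2 * a) + 2 for i in range(1, a + 1)]


def pick_labels(ins_weights):
    """Final label of each instance: the class with the largest accumulated weight.

    ins_weights is a list with one dict per instance, mapping label -> summed weight.
    """
    return [max(w, key=w.get) for w in ins_weights]
```

For example, with $n=57$ instances and $a=200$ rounds, the schedule starts at 2 clusters and ends at 30, i.e. roughly $\frac{n}{2}+2$.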

In Algorithm 3, Lines 1–3 calculate the distribution that carries no useful information, i.e. the uniform distribution over the classes. Line 4 gives more magnitude to larger clusters. Line 6 computes the difference between the proportion of the majority label $d_{(1)}$ and that of the second majority label $d_{(2)}$ , which quantifies the influence of the dominant label. Lines 7–20 calculate the label weights of instances. Here, we use conf_in_cluster, a series of functions of diff, to indicate the confidence in a cluster at different noise levels. We give more confidence to the dominant label at a low noise rate and less confidence at a high noise rate. At a low noise rate, most of the labels in a cluster are correct, so they can be trusted with high probability. At a high noise rate, however, few of the labels are correct, so they can be trusted only with low probability. $\frac{d_{i}-u_{i}}{v_{i}}$ calculates the difference between the expected proportions. When the estimated noise rate is at most 15%, Lines 8–9 are performed. Lines 10–13 are performed when the estimated noise rate is at the middle level. In this case, we consider the confidence not only within a cluster, but also between clusters. $\textit{weight}_{j}$ is a trade-off between $\textit{cluster}_{j}$ and the other clusters. Lines 15–17 are performed when the noise rate is high. The calculation of instances’ weights after considering confidence between clusters is given in Algorithm 4.

Algorithm 3 CalcWeights

Input: Cluster $C$ , data set distribution vector v, noise_rate $\gamma$
Output: Weight vector w
1: for $i\leftarrow 0$ to $m$ do
2:     $u_{i}\leftarrow\frac{1}{m+1}$
3: end for
4: $\textit{multiplier}\leftarrow\min(\textit{log}_{10}(\textit{sizeof}(C)),2)$
5: d $\leftarrow$ Distribution(C)
6: $\textit{diff}\leftarrow d_{(1)}-d_{(2)}$
7: for $i\leftarrow 0$ to $m$ do
8:     if $\gamma\leqslant 0.15$ then
9:         $w_{i}\leftarrow\textit{multiplier}\times\frac{d_{i}-u_{i}}{v_{i}}\times\textit{conf\_in\_cluster}1(\textit{diff})$
10:     else if $\gamma\leqslant 0.30$ then
11:         $\textit{weight}_{j}\leftarrow\frac{\textit{max}(d)-\frac{1}{m+1}}{1-\frac{1}{m+1}}$
12:         $w_{i}\leftarrow\textit{multiplier}\times\frac{d_{i}-u_{i}}{v_{i}}\times\textit{conf\_in\_cluster}2(\textit{diff})\times\textit{weight}_{j}+$
13:             $\textit{Modified\_by\_kNN}(C_{j})\times(1-\textit{weight}_{j})$
14:     else
15:         $\textit{weight}_{j}\leftarrow\frac{\textit{max}(d)-\frac{1}{m+1}}{1-\frac{1}{m+1}}$
16:         $w_{i}\leftarrow\textit{multiplier}\times\frac{d_{i}-u_{i}}{v_{i}}\times\textit{conf\_in\_cluster}3(\textit{diff})\times\textit{weight}_{j}+$
17:             $\textit{Modified\_by\_kNN}(C_{j})\times(1-\textit{weight}_{j})$
18:     end if
19: end for
20: return w
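The low-noise branch of CalcWeights (the case $\gamma\leqslant 0.15$) can be sketched as below. This is an illustration only: the function name and argument layout are hypothetical, labels are assumed to be the integers $0,\ldots,m$, v is assumed to be LabelTotals normalized to proportions, and conf_in_cluster1 uses the form $1/(1.05-\textit{diff})^{2}$ reported in the experimental setup. The middle and high noise branches, which also call Modified_by_kNN, are not covered.

```python
import math
from collections import Counter


def calc_weights_low_noise(cluster_labels, label_totals, n_classes):
    """Weight vector w for one cluster when the estimated noise rate is at most 0.15.

    cluster_labels: labels (0..n_classes-1) of the instances in the cluster.
    label_totals:   label counts over the whole data set (all assumed positive).
    """
    size = len(cluster_labels)
    u = 1.0 / n_classes                        # uniform, "no information" proportion
    multiplier = min(math.log10(size), 2)      # favors larger clusters, capped at 2

    counts = Counter(cluster_labels)
    d = [counts[c] / size for c in range(n_classes)]   # label distribution in the cluster
    total = sum(label_totals)
    v = [t / total for t in label_totals]              # label distribution in the data set

    top = sorted(d, reverse=True)
    diff = top[0] - top[1]                     # dominance of the majority label
    conf = 1.0 / (1.05 - diff) ** 2            # conf_in_cluster1

    return [multiplier * (d[i] - u) / v[i] * conf for i in range(n_classes)]
```

A cluster of ten instances with nine labeled 0 and one labeled 1, in a balanced binary data set, yields a large positive weight for class 0 and an equally large negative weight for class 1.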

Algorithm 4 Modified_by_kNN

Input: Cluster $C$
Output: Modified_prob
1: $\textit{knn\_k}\leftarrow\min(\frac{k-1}{m+1},\textit{num\_neighbors})$
2: for $j\leftarrow 1$ to the number of clusters $|C|$ do
3:     $\textit{TrainSet}\leftarrow C/C_{j}$
4:     $\textit{TestSet}\leftarrow C_{j}$
5:     $\textit{modified\_prob}\leftarrow\textit{kNN}(\textit{TrainSet},\textit{TestSet},\textit{knn\_k})$
6: end for
7: return modified_prob

Line 1 in Algorithm 4 calculates the number of the nearest neighbors. Lines 2–7 calculate the weights for each cluster.

Unlike previous work, we consider the influence of the amount of label noise in the data set. In addition, we make full use of the clustering results: using the neighbors of a cluster increases the probability of correcting labels.

4. Experiments and results

Our experimental goal is to demonstrate that the proposed method improves data quality and outperforms three existing label noise correction methods. One is the polishing labels method [23], which considers both feature noise and label noise. In this paper, we simplify the polishing labels method by ignoring feature noise. The other two are self-training correction and cluster-based correction [18]. We evaluate performance in three respects: label quality, model quality and AUC. Section 4.1 presents the experimental setup, including the artificial noise levels and the datasets employed in these experiments, which are taken from the UCI repository (http://archive.ics.uci.edu/ml/). Section 4.2 describes the experimental results.

4.1 Experiment setup

In the following studies, we generate the noisy data based on uniform distribution, that is, given a data set $X$ , if $p$ is noise rate, then $p\times n$ instances will be assigned a different label at random. We test 9 different noise levels: $p\in\{0.05\times i,i=1,\cdots,9\}$ , which is the same as [18].
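The noise generation described above can be sketched as follows. The function name, the fixed seed and the rounding of $p\times n$ to the nearest integer are assumptions made for illustration; the text only specifies that $p\times n$ instances receive a different label at random.

```python
import random


def inject_label_noise(y, p, classes, seed=0):
    """Return a copy of y in which round(p * n) labels are replaced by a different class.

    The instances to corrupt are drawn uniformly without replacement, and each new
    label is drawn uniformly from the classes other than the original one.
    """
    rng = random.Random(seed)
    n = len(y)
    noisy = list(y)
    for i in rng.sample(range(n), round(p * n)):
        noisy[i] = rng.choice([c for c in classes if c != y[i]])
    return noisy
```

Because every selected instance is forced to change class, the fraction of corrupted labels equals the requested rate up to rounding.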

Our implementation of the proposed method (MCC) is mostly detailed in the pseudo-code shown in Algorithms 1–4. In the noise rate estimation process, we employ the $k$ NN classifier because it is sensitive to noisy data when $k$ is small. But some correct data may be considered noisy when $k$ is 1. Thus, the number of neighbors in $k$ NN is set to 3. In fact, we evaluate the effect of the parameter $k$ of $k$ NN and find that the performance of the model is not sensitive to the value of $k$ . The results are shown in the Appendix (Fig. 11). The parameter $a$ is 200 in Algorithm 2, the same as [23]. In Algorithm 3, we set $\textit{conf\_in\_cluster}1=1/(1.05-\textit{diff})^{2}$ , $\textit{conf\_in\_cluster}2=1/(1.05-\textit{diff})$ and $\textit{conf\_in\_cluster}3=1$ so as to give various magnitudes to the clusters as the noise rate varies. Note that these functions were derived from experiments. In Algorithm 4, the $k$ NN classifier is used to modify the weights of clusters, so we set a larger number of neighbors, 7. Other larger values could be used as well.
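The three confidence functions can be written compactly as shown below. Packaging them into a single function that dispatches on the estimated noise rate is an illustrative choice, and the name `conf_in_cluster` with a `gamma` argument is hypothetical; the thresholds 0.15 and 0.30 are those of Section 3.2.

```python
def conf_in_cluster(diff, gamma):
    """Confidence given to a cluster's dominant label.

    diff:  difference between the two largest label proportions in the cluster (0..1).
    gamma: estimated noise rate of the data set.
    """
    if gamma <= 0.15:                      # low noise: conf_in_cluster1
        return 1.0 / (1.05 - diff) ** 2
    if gamma <= 0.30:                      # middle noise: conf_in_cluster2
        return 1.0 / (1.05 - diff)
    return 1.0                             # high noise: conf_in_cluster3
```

For a perfectly pure cluster (diff equal to 1) the three levels give roughly 400, 20 and 1, which shows how strongly the low-noise setting rewards a dominant label compared with the high-noise setting.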

To investigate the performance of the proposed approach, we choose 24 datasets from UCI repository. Table 1 summarizes the detailed information of these datasets, including number of examples (#EX), number of categorical variables (#CAT), number of numerical variables (#NUM), and number of classes (#Class). These datasets contain binary datasets and multi-class datasets. Note that we use them as they are, that is, we do not preprocess the features of instances in our experiments.

Table 1
Description of the 24 UCI datasets used in the experiments

NO.	Data set	#EX	#CAT	#NUM	#Class
1	Breast-w	699	0	9	2
2	Colic	369	20	7	2
3	Credit-a	690	9	6	2
4	Credit-g	1000	13	7	2
5	Diabetes	768	0	8	2
6	Heart-stat-log	270	0	13	2
7	Hepatitis	155	13	6	2
8	Ionosphere	351	0	34	2
9	Kr-vs-kp	3196	36	0	2
10	Labor	57	8	8	2
11	Mushroom	8124	22	0	2
12	Sonar	208	0	60	2
13	Autos	205	10	15	7
14	Balance	625	0	4	3
15	Cars	1728	6	0	4
16	Heart-c	303	7	6	5
17	Heart-h	294	7	6	5
18	Iris	150	0	4	3
19	Lymph	148	15	3	4
20	Segment	2310	0	19	7
21	Vehicle	846	0	18	4
22	Vowel	990	3	10	11
23	Waveform	5000	0	40	3
24	Zoo	101	16	1	7

4.2 Experimental results

In this section, we evaluate the performance of MCC and three other correction methods: simplified polishing labels (PL), self-training correction (STC) and the cluster-based method (CC). Performance is measured in three respects: label quality, model quality and AUC. Label quality is the proportion of instances whose corrected labels are the same as their true labels. Model quality is the cross-validated accuracy of a classifier trained on the corrected data set. In our experiments, we use sklearn.tree.DecisionTreeClassifier (http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html), with 10 folds for cross-validation. AUC, a measure of predictive performance, is the area under the Receiver Operating Characteristic (ROC) curve. Moreover, we examine how the type of features (categorical vs. numerical) affects the performance of MCC.
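As a small illustration of two of these measures, the sketch below computes label quality and a binary AUC in plain Python. The function names are hypothetical, and the AUC routine uses the pairwise-ranking formulation for the two-class case only; how AUC is averaged over classes for the multi-class datasets is not specified here and is not covered.

```python
def label_quality(corrected, true):
    """Proportion of instances whose corrected label equals the true label."""
    return sum(c == t for c, t in zip(corrected, true)) / len(true)


def binary_auc(scores, labels):
    """Area under the ROC curve for labels in {0, 1}.

    Computed as the fraction of (positive, negative) pairs in which the positive
    instance receives the higher score, counting ties as one half.
    """
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))
```

When exactly $p\times n$ labels have been flipped, the uncorrected noisy labels have a label quality of $1-p$, which gives a natural reference level for judging a correction method.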

4.2.1 Label quality

This section reports the performance of each label correction algorithm. Figure 1 presents the label quality results on the labor dataset. In the figure, the horizontal axis is the noise rate, and the vertical axis represents accuracy, that is, the label quality metric. The dashed line is a baseline against which the effectiveness of the four approaches is judged. From Fig. 1, we can see that both MCC and CC improve the label quality and that MCC performs better. At each noise rate, the line marked with circles lies above the other three lines. Besides, the accuracy values of MCC are all above 90% in the interval [0.05, 0.30]. When the noise rate is in [0.40, 0.45], the value is above 80%. The detailed data is presented in Table 2. Top method accuracy values are in bold. The experimental results on the labor data set show that MCC gives its best accuracy (98.2%) at the noise rate of 15%, which is much better than the second best algorithm CC (91.2%). In 14 of the 24 data sets, as in this case, MCC performs better than the other three methods at each noise rate or at most noise rates. These data sets include 7 binary class data sets and 7 multi-class data sets.

Table 2
Accuracy results for labor dataset

Method	Noise_rate
	5%	10%	15%	20%	25%	30%	35%	40%	45%
MCC	0.965	0.947	0.982	0.965	0.965	0.912	0.667	0.807	0.807
CC	0.930	0.930	0.912	0.947	0.860	0.824	0.649	0.684	0.719
STC	0.912	0.772	0.824	0.877	0.684	0.579	0.491	0.632	0.596
PL	0.649	0.702	0.772	0.772	0.772	0.614	0.404	0.702	0.508

Table 3

Average binary accuracy

Method	Noise_rate
	5%	10%	15%	20%	25%	30%	35%	40%	45%
MCC	0.912	0.897	0.877	0.864	0.854	0.820	0.771	0.736	0.633
CC	0.896	0.888	0.870	0.863	0.845	0.812	0.736	0.695	0.606
STC	0.903	0.877	0.862	0.847	0.802	0.765	0.746	0.688	0.599
PL	0.838	0.828	0.831	0.809	0.794	0.767	0.694	0.666	0.583

Figure 1. Label quality on labor data set.

Figure 2. Label quality on diabetes data set.

Figure 3. Label quality on credit-a data set.

Figure 4. Average label quality results on binary data sets.

Figure 5. Average label quality results on multi-class data sets.

In some cases, the four approaches are competitive with each other in terms of label quality improvement. Figure 2 shows the results on the diabetes data set, which is such a case: all four methods can improve the label quality when the noise rate is in the interval [0.25, 0.45], and all these improvements are slight. Further detailed data can be found in Appendix Tables 4 and 5. There are 2 such data sets among the binary class data sets and 4 among the multi-class data sets.

However, on some data sets MCC does not perform well, such as credit-a in Fig. 3. Among the binary class data sets, there are 3 such data sets, including credit-a. Among the multi-class data sets, there is 1 such data set, autos. All the label quality results on multi-class data sets are shown in the Appendix (Fig. 12).

Figure 6. Average model quality results on binary data sets.

Figure 7. Average model quality results on multi-class data sets.

Figure 8. Average AUC results on binary data sets.

Figure 9. Average AUC results on multi-class data sets.

Figure 10. Effects of the type of features.

To summarize the performance of the four correction algorithms, the average results on the 12 binary data sets and 12 multi-class data sets are shown in Figs 4 and 5, respectively. From Fig. 4, we can observe that the results of all four approaches are above the baseline when the noise rate in the binary datasets is greater than 20%, which means these correction algorithms can improve the label quality. It is also seen that MCC outperforms the other three approaches over the experimental label noise range except at 20%, where CC and MCC are the same and both dominate STC and PL. In particular, as the noise rate increases, the advantage of MCC gradually becomes more obvious. Detailed data of the average results on binary class datasets is presented in Table 3. Figure 5 displays the average results on multi-class datasets. Both MCC and CC improve label quality on average, with MCC showing larger improvements than CC when the noise rate is greater than 15%. To conclude, MCC is the best method in terms of label quality improvement on average. We also perform experiments that replace K-means by a self-organizing map (SOM), restricted to 14 datasets in terms of label quality because the SOM-based method takes too much time. The results indicate that using K-means is better. The average result of these experiments is shown in the Appendix (Fig. 13).

4.2.2 Model quality

The average model quality results on binary class and multi-class datasets are shown in Figs 6 and 7, respectively. Note that there are 6 lines in each figure. The black dashed line represents the baseline, which is obtained from the noisy data. The cyan dashed line is obtained from the clean data, which has no noisy labels.

From Fig. 6, we can see that all four methods improve model quality on average on binary class datasets at all noise rates. We can also see that MCC outperforms the other three algorithms at high noise rates. In Fig. 7, it is quite clear that at all noise rates MCC performs the best, followed by CC, while PL is third and STC last. The model quality results on each multi-class dataset can be found in the Appendix (Fig. 15). To summarize, MCC performs the best in terms of model quality.

4.2.3 AUC

For most binary class datasets, there are only slight differences between these algorithms at low noise rates. As the noise rate increases, the differences become distinct (e.g. breast-w, heart-stat-log and mushroom). For most multi-class data sets, MCC and CC perform better than STC and PL at most noise rates in terms of AUC improvement, and MCC is the better of the two (e.g. balance-scale, vowel and lymph).

Figures 8 and 9 present the average AUC results on binary class and multi-class datasets, respectively. In Fig. 8, for the binary class data sets, MCC, CC and PL differ only slightly when the noise rate is below 20%; PL deteriorates as the rate increases; MCC is the best once the noise rate reaches 30%. In Fig. 9, MCC and CC are almost the same, and STC and PL are very competitive with each other, when the noise rate is in the interval [0.05, 0.20]; when the noise rate is greater than 20%, it is obvious that MCC performs the best and CC is second best, followed by PL, with STC last. More details on multi-class datasets are in the Appendix (Fig. 14). These results show that MCC outperforms the other three methods in terms of AUC on average.

4.2.4 Effects of the type of features

The type of features may affect the performance of the proposed method. We evaluate our framework with respect to the two types of features, numerical and categorical. Among the experimental datasets, 3 contain only categorical features and 10 contain only numerical features. The remaining datasets contain both. We report the average label quality results to show the effect of the type of features on the performance of the algorithm. The results are shown in Fig. 10. Notice that MCC outperforms the others on almost all datasets. On the datasets containing only categorical features, when the noise rate is less than 0.25, STC performs the best, MCC comes in a close second, followed by CC, and PL is the worst. If the noise rate is greater than 0.25, MCC is the best, STC is the second, and the ranking of the other two methods remains the same. On the datasets containing only numerical features, MCC is the best. On the mixed datasets, CC performs better than MCC when the noise rate is less than 0.25, but when the noise rate is greater than 0.25, MCC performs best. Thus, we can conclude that MCC significantly improves the performance, especially when the noise rate is high, which is the setting for which MCC is most appropriate.

5. Conclusion

In this paper, we introduced a label noise correction method, called MCC, for improving data quality in the presence of label noise. It employs a procedure to estimate the level of the noise in the data and combines this noise estimation with a correction process. The experimental studies show that MCC can effectively improve label quality. In particular, MCC appears to be better than existing correction methods. In addition, the improvement of label quality has contributed to the classification in terms of accuracy and AUC.

Our proposed method depends on the estimation of the noise rate. In future work, we will consider how to estimate the noise rate of noisy data more accurately.

Acknowledgments

The authors wish to thank the Editor, an Associate Editor, and two anonymous referees for their valuable comments which have helped improve the presentation and quality of this paper a great deal. This research was supported by the Natural Science Basic Research Plan in Shaanxi Province (Program NO. 2017JQ1013), the Fundamental Research Funds for the Central Universities (Grant JB180705), and the National Natural Science Foundation of China (Grant 61573266).

Appendix

Tables 4 and 5 present the accuracy results for two datasets: diabetes and credit-a. The average accuracy results on the twelve multi-class datasets are summarized in Table 6. Tables 7–10 illustrate the average model quality results and the average AUC results for binary class datasets and multi-class datasets. Figure 11 shows the effect of the parameter $k$ in $k$ NN for autos, segment, vowel and zoo. Figure 13 illustrates average accuracy results on 14 datasets using SOM. Figures 12, 14 and 15 display the results for multi-class on accuracy, AUC and model quality, respectively.

Table 4

Accuracy results for diabetes dataset

Method	Noise_rate
	5%	10%	15%	20%	25%	30%	35%	40%	45%
MCC	0.850	0.840	0.790	0.780	0.770	0.750	0.720	0.660	0.600
CC	0.831	0.802	0.802	0.788	0.767	0.738	0.677	0.616	0.552
STC	0.836	0.819	0.810	0.770	0.736	0.721	0.676	0.615	0.605
PL	0.803	0.797	0.781	0.763	0.770	0.745	0.701	0.688	0.660

Table 5

Accuracy results for credit-a dataset

Method	Noise_rate
	5%	10%	15%	20%	25%	30%	35%	40%	45%
MCC	0.851	0.851	0.820	0.770	0.759	0.759	0.72	0.620	0.551
CC	0.883	0.896	0.878	0.854	0.836	0.803	0.758	0.696	0.617
STC	0.899	0.881	0.868	0.846	0.822	0.787	0.799	0.772	0.597
PL	0.913	0.820	0.912	0.857	0.829	0.723	0.717	0.649	0.535

Table 6

Average multi-class accuracy

Method	Noise_rate
	5%	10%	15%	20%	25%	30%	35%	40%	45%
MCC	0.864	0.849	0.838	0.838	0.819	0.821	0.800	0.784	0.747
CC	0.855	0.847	0.842	0.833	0.810	0.796	0.780	0.753	0.697
STC	0.885	0.838	0.802	0.758	0.705	0.653	0.623	0.590	0.524
PL	0.750	0.758	0.736	0.724	0.703	0.701	0.690	0.653	0.611

Table 7

Average binary model quality

Method	Noise_rate
	5%	10%	15%	20%	25%	30%	35%	40%	45%
True	0.812	0.801	0.808	0.810	0.807	0.798	0.799	0.802	0.795
MCC	0.810	0.814	0.801	0.807	0.800	0.772	0.716	0.685	0.595
CC	0.818	0.830	0.798	0.806	0.781	0.745	0.675	0.635	0.557
STC	0.839	0.811	0.787	0.793	0.740	0.697	0.687	0.649	0.564
PL	0.786	0.789	0.783	0.772	0.731	0.704	0.657	0.647	0.539
Noise	0.758	0.717	0.689	0.649	0.612	0.581	0.550	0.534	0.481

Table 8

Average multi-class model quality

Method	Noise_rate
	5%	10%	15%	20%	25%	30%	35%	40%	45%
True	0.762	0.765	0.760	0.757	0.757	0.761	0.761	0.765	0.767
MCC	0.746	0.746	0.736	0.731	0.720	0.718	0.704	0.674	0.620
CC	0.730	0.723	0.711	0.696	0.682	0.669	0.640	0.591	0.535
STC	0.676	0.632	0.594	0.548	0.504	0.449	0.420	0.368	0.336
PL	0.676	0.682	0.670	0.641	0.629	0.617	0.608	0.557	0.518
Noise	0.672	0.613	0.561	0.501	0.451	0.408	0.390	0.316	0.292

Table 9

Average binary AUC

Method	Noise_rate
	5%	10%	15%	20%	25%	30%	35%	40%	45%
True	0.835	0.822	0.828	0.832	0.827	0.818	0.821	0.824	0.819
MCC	0.829	0.834	0.817	0.821	0.819	0.791	0.738	0.710	0.631
CC	0.837	0.845	0.818	0.826	0.803	0.771	0.707	0.664	0.592
STC	0.854	0.829	0.808	0.814	0.765	0.725	0.717	0.681	0.599
PL	0.796	0.799	0.798	0.786	0.754	0.732	0.679	0.670	0.580
Noise	0.784	0.753	0.722	0.693	0.656	0.634	0.599	0.586	0.537

Table 10

Average multi-class AUC

Method	Noise_rate
	5%	10%	15%	20%	25%	30%	35%	40%	45%
True	0.885	0.887	0.883	0.884	0.885	0.886	0.886	0.885	0.888
MCC	0.865	0.865	0.860	0.855	0.850	0.853	0.845	0.827	0.803
CC	0.862	0.860	0.853	0.846	0.840	0.833	0.822	0.800	0.766
STC	0.842	0.821	0.806	0.786	0.758	0.729	0.718	0.690	0.664
PL	0.812	0.819	0.811	0.792	0.788	0.783	0.783	0.754	0.735
Noise	0.844	0.818	0.794	0.765	0.738	0.718	0.706	0.666	0.646

Figure 11.

Effect of the parameter k for autos, segment, vowel and zoo.

Figure 12.

Accuracy results on multi-class data sets.

Figure 13.

Average accuracy results on 14 datasets using SOM.

Figure 14.

AUC results on multi-class data sets.

Figure 15.

Model quality results on multi-class data sets.

