Abstract
Real-world machine learning applications seldom provide perfectly labeled data, posing a challenge in developing models robust to noisy labels. Recent methods prioritize noise filtering based on the discrepancies between model predictions and the provided noisy labels, assuming samples with minimal classification losses to be clean. In this work, we capitalize on the consistency between the learned model and the complete noisy dataset, employing the data’s rich representational and topological information. We introduce LaplaceConfidence, a method that obtains label confidence (i.e., clean probabilities) using the Laplacian energy. Specifically, it first constructs graphs based on the feature representations of all noisy samples and minimizes the Laplacian energy to produce a low-energy graph. Clean labels should fit well into the low-energy graph while noisy ones should not, allowing our method to determine the data’s clean probabilities. Furthermore, LaplaceConfidence is embedded into a holistic method for robust training, where a co-training technique generates unbiased label confidence and a label refurbishment technique makes better use of it. We also explore dimensionality reduction to scale our method to large noisy datasets. Our experiments demonstrate that LaplaceConfidence outperforms state-of-the-art methods on benchmark datasets under both synthetic and real-world noise. Code available at https://github.com/chenmc1996/LaplaceConfidence.
Introduction
Deep learning’s success hinges on the availability of high-quality labeled datasets. Nonetheless, the labeling process often encounters numerous challenges, including prohibitive labor costs, sizable data volumes, and domain-specific knowledge prerequisites [1, 2, 3]. As a result, fully intact labels are typically unavailable in real-world applications. What’s more, recent studies have found that label noise can seriously damage the performance of deep models [4, 5]. Therefore, it is necessary to improve model robustness against noisy labels.
The field of Learning with Noisy Labels (LNL) encompasses a broad spectrum of algorithms. Recently, a series of methods [6, 7, 8] has significantly improved robustness by leveraging the memorization effect, the behavior that deep models fit generalizable patterns before memorizing the noisy patterns [5]. Essentially, when training a model on a dataset containing both clean and mislabeled samples, the model tends to prioritize fitting the clean samples first. This can be attributed to the fact that the shared patterns among the clean samples are relatively easier to learn, while the unique mapping relationships within mislabeled samples pose a greater challenge. Consequently, the model exhibits smaller losses on the clean samples. Therefore, the clean probability of given labels, i.e., label confidence, can be estimated according to per-sample loss, as shown in the left of Figure 1.
For instance, DivideMix [8] employs a Gaussian Mixture Model (GMM) to dynamically establish the loss threshold

Conventional individual data point estimation vs. our graph-based estimation. Our graph-based method takes all samples and their topological relationship into consideration.
Our work is based on two primary observations. First, prior studies indicate that overfitting to noisy labels is less likely to occur in hidden representations [13, 14]. Second, the relationships among the learned representations can provide a more accurate estimation of label confidence than individual samples considered in isolation. To validate these observations, we analyze the learned feature space to confirm that samples frequently have same-class neighbors, even when trained on noisy datasets (refer to Section 4.2.1 for our analysis). Based on this perspective, it is feasible to estimate the label confidence of a sample by considering the number of different-class neighbors it has [15, 13, 16]. However, mislabeled data points can bias the predictions of their neighbors. To address this issue, our objective is to update labels for globally optimal estimation, ensuring that label consistency for all data points reaches a stable state. Accordingly, we propose leveraging the graph structure and the concept of Laplacian energy from graph theory [17, 18, 19] to achieve two goals: utilizing the representation space and obtaining unbiased label confidence. We construct a graph using the extracted representations of all samples, capturing the interaction between model predictions and the geometric structure. To mitigate the bias introduced by mislabeled samples, we globally optimize the original labels to minimize the Laplacian energy on the graph, resulting in a low-energy connection structure. Within this graph, clean labels integrate seamlessly, while noisy labels do not. Therefore, our method can determine an unbiased “clean probability” for each data point. We then use the derived label confidence in a learning process divided into two components: unbiased label confidence generation and label refurbishment. The former employs a co-training technique to generate unbiased label confidence, while the latter refurbishes the labels based on the generated confidence to make better use of them.
In conclusion, we propose a novel label confidence estimation method, named LaplaceConfidence, fully utilizing the learned representations and their topological relationship. We embed it into a holistic method by combining it with other techniques, including co-training, label refurbishment [20, 21], and data augmentation. Given that the real-world applications of LNL often involve large-size models or datasets [22], we also investigate the role of the dimensionality reduction technique in the scalability of our method. The main contributions of this work are as follows:
Novel confidence estimation method: We introduce LaplaceConfidence, a feature-based confidence estimation method. It optimizes confidence estimation by leveraging the topological information of samples. Additionally, it integrates with other techniques, providing a comprehensive solution for learning with noisy labels.
Scalability and robustness: Addressing real-world needs for large-scale networks and noisy datasets, we explore the impact of dimensionality reduction on our method. Our findings show that this approach not only significantly accelerates the process but also maintains performance. In certain scenarios, it even enhances robustness.
Performance benchmarking: We empirically demonstrate the superiority of LaplaceConfidence over existing classification-loss-based estimation methods. Our method sets new standards in LNL benchmarks, including CIFAR-10 and CIFAR-100 with synthetic label noise, and the real-world noisy dataset Mini-WebVision. Furthermore, we conduct a systematic study of LaplaceConfidence’s components to validate their effectiveness.
This section commences with an introduction of label noise taxonomy. Subsequently, we classify recent Learning with Noisy Labels (LNL) algorithms into two primary categories: classification-loss-based and feature-based. This classification serves as a foundation for the introduction of our proposed LaplaceConfidence. Additionally, we briefly discuss relevant works in the field of semi-supervised learning.
Taxonomy of label noise
We formally define
However, the instance-independent noise model can often be unrealistic. For instance, in certain real-world datasets, images that are difficult to recognize are more susceptible to mislabeling. Consequently, the probability of corruption should be influenced by the data features themselves, suggesting the presence of instance-dependent label noise [27, 28]. The explicit modeling of such noise presents considerable challenges. A selection of recent methods [8, 29, 11] have made headway in this area by leveraging confidence-based sample selection and semi-supervised learning, thereby achieving notable improvements.
In the approach adopted by MentorNet [6], a pre-trained network is utilized to select samples with minimal losses for the training of another model. In contrast, Co-teaching [7] employs two equivalent models, with each selecting low-loss samples for the other. Co-teaching+ [30] builds upon the Co-teaching strategy by further filtering agreed predictions to prevent the two base models from converging prematurely towards a consensus. ITLM [31] seeks to optimize a trimmed loss by strategically choosing a fraction of samples and updating the model on them. INCV [32] selects clean samples from the noisy ones at every round via a cross-validation process and then deploys the Co-teaching training schema. SELFIE [15] refurbishes small-loss samples with the most frequently predicted labels in previous training epochs. [21] fits the distribution of per-sample loss with a Beta Mixture Model (BMM); the resulting probability is then used as the coefficient of the bootstrapping loss [20]. [33] designs a surrogate loss of the robust 0–1 loss and uses it for clean sample selection. [34] constrains the network output to permutations over a fixed vector and utilizes a sparse regularization strategy, including network output sharpening and norm regularization, to approximate the one-hot constraint and improve performance in the presence of noisy labels and class imbalance. DivideMix [8] segregates the data into a clean set and a noisy set based on loss values, leveraging a Gaussian Mixture Model (GMM) to dynamically ascertain the loss threshold. However, due to the overlapping nature of loss distributions, even some clean samples may exhibit high losses. Additionally, the threshold can be highly sensitive to alterations in the loss distribution, which can result in instability during the learning process. Different from these, our method investigates the LNL problem from a latent representational and topological perspective.
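To make the loss-based criterion discussed above concrete, the following is a minimal sketch of fitting a two-component mixture to per-sample losses and reading off the posterior of the low-loss component as the clean probability. It is a hand-written one-dimensional EM in NumPy, not the implementation of DivideMix [8] or of any other cited method; the function name `gmm_clean_probability`, the iteration count, and the synthetic losses are assumptions made for illustration.

```python
import numpy as np

def gmm_clean_probability(losses, n_iter=50):
    """Fit a 2-component 1-D Gaussian mixture with EM and return, per sample,
    the posterior of the low-mean component (assumed to hold clean samples)."""
    x = np.asarray(losses, dtype=float)
    # Initialise the two components at the extremes of the loss range.
    mu = np.array([x.min(), x.max()])
    var = np.full(2, x.var() + 1e-6)
    pi = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: responsibility of each component for each sample.
        dens = pi * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        resp = dens / (dens.sum(axis=1, keepdims=True) + 1e-12)
        # M-step: re-estimate mixing weights, means and variances.
        nk = resp.sum(axis=0) + 1e-12
        pi = nk / x.size
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-6
    return resp[:, np.argmin(mu)]

# Synthetic example (assumed data): 800 low-loss and 200 high-loss samples.
rng = np.random.default_rng(0)
losses = np.concatenate([rng.normal(0.2, 0.05, 800), rng.normal(2.0, 0.3, 200)])
p_clean = gmm_clean_probability(losses)
```

In this toy case the two loss modes are well separated, so the posterior is sharp. The concern raised above is the opposite regime, where the two loss distributions overlap and the implied threshold moves as the loss distribution changes.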
Feature-based LNL
There are several recent LNL methods [15, 13, 16] based on latent feature representation. Dimensionality-Driven Learning (D2L) [35] adopts a label refurbishment framework and backpropagates the loss for a linear combination of predictions and noisy labels. The method chooses an optimal weight, i.e., label confidence, for the combination such that the increase of local intrinsic dimensionality [36] is prevented. TopoFilter [10] filters noise according to the topological relationship between samples. It constructs a
Semi-supervised learning
Semi-supervised learning aims to leverage both labeled data and unlabeled data. Recent LNL methods attempt to convert the LNL problem into a semi-supervised learning problem by removing some possible noisy labels and utilizing powerful semi-supervised learning techniques. For instance, DivideMix’s success can be largely attributed to the deployment of MixMatch. Our idea of constructing a graph using data points’ representations is inspired by the semi-supervised learning method LaplaceNet [19], which assigns pseudo-labels to unlabeled data using the label propagation algorithm. One difference between LaplaceNet and our method is that the former employs propagated labels for training, while we utilize the resulting graph for label confidence estimation. One may wonder why LaplaceConfidence does not directly use the refined labels as training targets. We find that, unlike in semi-supervised learning, this would yield suboptimal results in LNL. We note that obtaining pseudo-labels in each iteration (instead of per epoch) using the latest model leads to less noisy training targets, which is essential to an LNL method. Therefore, LaplaceConfidence estimates label confidence in every epoch and generates new training targets in every iteration.
Method
Problem formulation
Different from the standard supervised learning, only a noisy training dataset
LaplaceConfidence
We introduce graph structures to leverage the geometric information present in the data, allowing us to capture the interaction between model predictions and the data’s geometric structure. Specifically, an undirected
A graph formed from a clean dataset should have low graph Laplacian energy because samples are likely to agree with their neighbors’ labels. The graph we obtain does not have this property because the learned features and the noisy labels are inconsistent. Therefore, we can identify the noisy nodes that cause such inconsistency by finding which nodes must change to reach a low-energy graph structure. The graph Laplacian energy over the label distribution is minimized to obtain the clean graph structure.
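The idea of minimizing a graph Laplacian energy over label distributions can be sketched as follows. This is an illustration under stated assumptions, not the paper's exact formulation: it assumes a cosine-similarity k-nearest-neighbor graph, the symmetrically normalized adjacency S = D^{-1/2} W D^{-1/2}, and the classic closed-form solution Z = (I - alpha S)^{-1} Y of the regularized energy used in label propagation. The function name and the parameters `k` and `alpha` are hypothetical.

```python
import numpy as np

def laplace_confidence_sketch(features, noisy_labels, num_classes, k=10, alpha=0.9):
    """Return, per sample, the refined probability of its given label after
    solving a label-propagation system on a kNN graph (illustrative only)."""
    n = features.shape[0]
    # Cosine similarity between L2-normalised features.
    f = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-12)
    sim = f @ f.T
    np.fill_diagonal(sim, -np.inf)  # exclude self-loops from the neighbour search
    # Indices of the k most similar neighbours of each node.
    nbrs = np.argpartition(-sim, k, axis=1)[:, :k]
    rows = np.repeat(np.arange(n), k)
    cols = nbrs.ravel()
    W = np.zeros((n, n))
    W[rows, cols] = np.clip(sim[rows, cols], 0.0, None)
    W = np.maximum(W, W.T)  # symmetrise to obtain an undirected graph
    # Symmetrically normalised adjacency S = D^{-1/2} W D^{-1/2}.
    d = W.sum(axis=1) + 1e-12
    S = W / np.sqrt(np.outer(d, d))
    # One-hot matrix of the given (possibly noisy) labels.
    Y = np.eye(num_classes)[noisy_labels]
    # Minimiser of the regularised Laplacian energy, up to a positive scale.
    Z = np.linalg.solve(np.eye(n) - alpha * S, Y)
    Z = np.clip(Z, 0.0, None)
    Z = Z / (Z.sum(axis=1, keepdims=True) + 1e-12)
    # Confidence: how much of the refined distribution stays on the given label.
    return Z[np.arange(n), noisy_labels]

# Toy data (assumed): two clusters in 16 dimensions, 20% of labels flipped.
rng = np.random.default_rng(0)
centers = np.zeros((2, 16))
centers[0, 0] = 5.0
centers[1, 1] = 5.0
true_labels = np.repeat([0, 1], 100)
features = centers[true_labels] + rng.normal(0.0, 0.5, size=(200, 16))
flipped = np.zeros(200, dtype=bool)
flipped[rng.choice(200, size=40, replace=False)] = True
noisy_labels = np.where(flipped, 1 - true_labels, true_labels)
confidence = laplace_confidence_sketch(features, noisy_labels, num_classes=2)
```

The dense linear solve keeps the sketch short but scales cubically with the number of samples; a sparse graph with an iterative solver would be the natural choice at dataset scale.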
Having the refined label distribution
Though we mainly focus on label confidence estimation in this work, some other techniques are also unified to form a holistic pipeline for LNL. The whole training schema is described in Algorithm 1.
Specifically, the label refurbishment framework trains the model with the refurbished label
We use mini-batch stochastic gradient descent algorithm for optimization. The loss of a sample
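As an illustration of the label refurbishment framework mentioned above, the sketch below forms a training target as a confidence-weighted convex combination of the given label and the model prediction, in the style of the bootstrapping loss [20], and evaluates a soft cross-entropy against it. The exact target and loss used in this paper are not reproduced here; the function names and the toy inputs are assumptions.

```python
import numpy as np

def refurbish_labels(noisy_labels, probs, confidence):
    """Target = w * one_hot(given label) + (1 - w) * model prediction,
    where w is the per-sample label confidence in [0, 1]."""
    num_classes = probs.shape[1]
    one_hot = np.eye(num_classes)[noisy_labels]
    w = confidence[:, None]
    return w * one_hot + (1.0 - w) * probs

def soft_cross_entropy(targets, probs, eps=1e-12):
    """Mean cross-entropy between soft targets and predicted probabilities."""
    return float(-(targets * np.log(probs + eps)).sum(axis=1).mean())

# Toy inputs (assumed): the first label is fully trusted, the second not at all.
noisy_labels = np.array([0, 1])
probs = np.array([[0.7, 0.3], [0.9, 0.1]])
confidence = np.array([1.0, 0.0])
targets = refurbish_labels(noisy_labels, probs, confidence)
loss = soft_cross_entropy(targets, probs)
```

With full confidence the target is the given label itself, and with zero confidence it falls back to the model prediction, so intermediate confidences interpolate between trusting the annotation and trusting the model.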
Experimental details
Benchmark datasets
We benchmark the proposed method on experimental settings using CIFAR-10 and CIFAR-100 [44] with different levels of synthetic noise, as well as the real-world dataset Mini-WebVision [22]. On CIFAR-10 and CIFAR-100, there are two commonly used types of synthetic noise [45, 8]: symmetric noise and asymmetric noise. Symmetric noise corrupts samples to random classes with the same probability, while asymmetric noise corrupts samples to specific classes according to a pre-defined label transition matrix (shown in Figure 2(a)). The noise rate ranges from 20% to 90% (note that samples are randomly corrupted to

Confusion matrices on CIFAR-10 under 20%–90% symmetric noise and 40% asymmetric noise. (a) The corrupted training set. (b) The prediction on the training set. (c) The prediction on the test set. The airp. and auto. are airplane and automobile for short.
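The two synthetic noise types used in the benchmark setup can be sketched as below. The sketch assumes that symmetric corruption draws the new label uniformly from all classes, including the original one, and it uses a small hypothetical four-class transition matrix rather than the CIFAR-10 matrix of Figure 2(a). The function names are illustrative.

```python
import numpy as np

def symmetric_noise(labels, num_classes, rate, rng):
    """With probability `rate`, replace a label by a class drawn uniformly
    from all classes (assumption: the original class may be drawn again)."""
    labels = np.asarray(labels)
    corrupt = rng.random(labels.size) < rate
    random_labels = rng.integers(0, num_classes, size=labels.size)
    return np.where(corrupt, random_labels, labels)

def asymmetric_noise(labels, transition, rng):
    """Sample each noisy label from row `y` of a label transition matrix."""
    labels = np.asarray(labels)
    num_classes = transition.shape[0]
    return np.array([rng.choice(num_classes, p=transition[y]) for y in labels])

rng = np.random.default_rng(0)
labels = rng.integers(0, 4, size=2000)
sym = symmetric_noise(labels, num_classes=4, rate=0.4, rng=rng)

# Hypothetical transition matrix: class 0 is relabelled as class 1 with
# probability 0.4; every other class is left untouched.
T = np.eye(4)
T[0, 0], T[0, 1] = 0.6, 0.4
asym = asymmetric_noise(labels, T, rng)
```

Under this symmetric scheme the fraction of labels that actually change is lower than the nominal rate, because a corrupted sample can be reassigned its original class.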
Comparison with state-of-the-art methods on CIFAR-10 and CIFAR-100 with synthetic noise. Sym. and Asym. are symmetric and asymmetric for short, respectively. The best results are indicated in bold. We include the results of AugDesc* with two different augmentation policies.
The backbone for CIFAR is the 18-layer PreAct ResNet [46]. The backbone for Mini-WebVision is the Inception-ResNet v2 [47].
Training schema
For all experiments, we mainly tune two hyper-parameters, namely the temperature
We add a regularization term for encouraging the network output uniform distribution following many LNL methods [48, 21, 8]:
List of operations for strong transformations of the modified RandAugment. Three transformations are randomly chosen and performed with stochastic magnitude.
For the data augmentation
Comparison to SOTA
CIFAR-10, CIFAR-100
For comparison on CIFAR-10 and CIFAR-100, results of Bootstrap [20], F-correction [26], P-correction [51], M-correction [21], Mixup [52], Co-teaching+ [30], Meta-learning [53], DivideMix [8], AugDesc [29], and Robust LR [54] are reported. Following [8, 29], we report both the
The proposed method outperforms the previous best method by up to 2.8% on CIFAR-10 under symmetric noise and by up to 14% on CIFAR-100 under symmetric noise, as shown in Table 1. Only AugDesc-AutoAugment achieves competitive results on CIFAR-100 under 50% noise, and it relies on AutoAugment, which has a higher computation cost, as shown in [43]. The performance gain is larger under heavy noise. We remark that suboptimal confidence estimation misguides the training; the model then overfits the wrong labels, which in turn degrades the subsequent confidence estimation. Thus, a good label confidence estimation method can bring large improvements under heavy noise.
In terms of asymmetric noise, the proposed method also achieves better results, surpassing the previous best method by over 1.8%. After training, the model is less biased towards the given class-dependent noise on the training and test set, as shown in Figure 2(b) and (c).
Comparison with state-of-the-art methods on Mini-WebVision with real-world noise. The best results are indicated in bold.

Feature representations visualized by t-SNE. Color represents the ground-truth class. (a). The clean feature representation (trained on clean CIFAR-10). (b). The noisy feature representation (trained on 50% uniformly corrupted CIFAR-10 without robust training techniques). (c). The noisy feature representation our method learned. The standard softmax-based linear classifier could easily fail on the twisted feature in (b), yielding bad label confidence estimation. Best viewed in color.
In Figure 3(a) and (b), we visualize the learned features of clean and noisy datasets. It can be seen that clean features have a clear cluster structure, where samples in the same class are close. However, noisy features are more twisted, as in Figure 3(b), where some samples are located far away from their cluster center. The softmax-based linear classifier on top of the penultimate layer features would be unable to separate such twisted features and, subsequently, the cross-entropy loss is unable to identify label noise [14, 56]. We suggest that this is one of the reasons why the small-loss criterion fails.
From Figure 3(b), we can also notice that most samples, even though they do not form a cluster structure, still have same-class neighbors. Therefore, we attempt to utilize the topological relationships between samples for label confidence estimation. The probability of a sample/node being mislabeled can be determined by the distribution of all data points in the feature space. The graph structure is commonly used to model such relationships, on which all nodes can propagate their labels to their neighbors. Considering that the propagation could be an iterative process, where corrected nodes can affect their neighbors again until reaching convergence, we solve the label confidence estimation as the classic graph Laplacian minimization problem [17, 18, 19], i.e., making the intrinsic connection structure sufficiently smooth. After this process, the more a label changes, the more likely it is that the original label is wrong.
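The observation that most samples still have same-class neighbors can be checked with a simple local score: for each sample, the fraction of its k nearest neighbors in feature space that carry the same given label. The sketch below is only this purely local estimate, which the graph Laplacian minimization described above replaces with a global one. Cosine similarity, the function name, and the parameter `k` are assumptions.

```python
import numpy as np

def same_class_neighbor_ratio(features, labels, k=10):
    """Fraction of each sample's k nearest neighbours (cosine similarity)
    that share its given label."""
    f = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-12)
    sim = f @ f.T
    np.fill_diagonal(sim, -np.inf)  # a sample is not its own neighbour
    nbrs = np.argpartition(-sim, k, axis=1)[:, :k]
    return (labels[nbrs] == labels[:, None]).mean(axis=1)

# Toy data (assumed): two clusters in 16 dimensions, 20% of labels flipped.
rng = np.random.default_rng(1)
centers = np.zeros((2, 16))
centers[0, 0] = 5.0
centers[1, 1] = 5.0
true_labels = np.repeat([0, 1], 100)
features = centers[true_labels] + rng.normal(0.0, 0.5, size=(200, 16))
flipped = np.zeros(200, dtype=bool)
flipped[rng.choice(200, size=40, replace=False)] = True
given_labels = np.where(flipped, 1 - true_labels, true_labels)
ratio = same_class_neighbor_ratio(features, given_labels)
```

Because each score depends on neighbors that may themselves be mislabeled, this local estimate inherits their errors, which is the bias that motivates solving for all labels jointly.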
Learning with real-world label noise is more challenging. We evaluate LaplaceConfidence on Mini-WebVision to verify that it performs well on a larger dataset with more complex noise. For comparison, we choose classification-loss-based methods, namely MentorNet [6], Co-teaching [7], Iterative-CV [32], DivideMix [8], and feature-based methods, namely D2L [35] and NGC [11]. It is worth noting that NGC is also a graph-based LNL method which utilizes the largest connected components for noise identification. Comparison with it can verify that our method based on Laplacian energy minimization is better than other graph-based methods.
LaplaceConfidence outperforms all other classification-loss-based and feature-based LNL methods by achieving a top-1 and top-5 accuracy of 80.52% and 94.56%, respectively. It is 1.35% and 2.71% better than the previous best NGC, respectively.
Accelerating LaplaceConfidence
The computation time difference between LaplaceConfidence and conventional loss-based LNL methods is due to the label confidence estimation process. Given that the field of LNL often deals with large datasets or models, reducing the computation cost further would be beneficial for the potential real-world usage of our method. Since our method takes a bilevel optimization form, and we do not need to back-propagate the gradient in the label confidence process, the features can be freely manipulated. Therefore, we consider a simple strategy that is known to be capable of removing irrelevant or redundant features, i.e., reducing the dimension of the extracted feature embeddings using Principal Component Analysis (PCA) for the calculation in Eq. (3.2).
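The dimensionality reduction step can be sketched with a plain SVD-based PCA applied to the extracted embeddings before any similarity computation. This is a minimal illustration, not the configuration used in the experiments: the function name, the target dimension of 8, and the random embeddings are assumptions.

```python
import numpy as np

def pca_reduce(features, n_components):
    """Project features onto their top principal components via SVD."""
    centered = features - features.mean(axis=0, keepdims=True)
    # Rows of vt are principal directions, sorted by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

# Stand-in embeddings (assumed): 200 samples with 64-dimensional features.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 64))
reduced = pca_reduce(embeddings, n_components=8)
```

Since no gradient flows through the confidence estimation, the projection can be recomputed from scratch each time the features are extracted, and the pairwise similarity cost then depends on the reduced dimension rather than the original one.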
Time cost in seconds and accuracy. “per conv.” stands for the running time of per convergence of the estimation method.
To further accelerate the process of
We run our experiments and measure the computational cost on a 24-core Intel(R) Xeon(R) Platinum 8255C CPU and a single NVIDIA RTX V100 GPU to obtain the real time cost. As shown in Table 4, by simply adding PCA, LaplaceConfidence runs faster with almost no accuracy loss on CIFAR-10. Notably, PCA even improves the accuracy on Mini-WebVision by 0.8%. We attribute this to dimension reduction keeping the important features in the learned representation, which could reduce the damage of ambiguous or wrong features learned from the noisy training signals.
Ablation study. Results on CIFAR-10 with different levels of symmetric noise are reported.

(a) The quality of confidence estimation of LaplaceConfidence and GMM on CIFAR-10 under 90% noise. (b) LaplaceConfidence with other augmentations on CIFAR-10 under 90% noise. (c) Varying the
To study the importance of the main components in LaplaceConfidence, we test each of them separately and report the performance:
To study the effect of LaplaceConfidence, we replace it with GMM. To study the effect of the co-training, we only use one model. To study the effect of the augmentation, we replace RandAugment with other augmentation methods.
We also study the influence of two important hyper-parameters in Figure 5, namely the number of nearest neighbors
LaplaceConfidence plays a key role in our method. For comparison, we choose the Gaussian Mixture Model (GMM), the confidence estimation method used in the previous SOTA. A mixture of two Gaussian distributions is fitted on the loss values using the Expectation-Maximization algorithm. Regarding its hyper-parameter settings and implementation, we follow the official implementation of DivideMix (https://github.com/LiJunnan1992/DivideMix). As shown in Table 5, the accuracy of LaplaceConfidence is better than that of GMM. We further compare their confidence estimation quality during training in Figure 4(a), which confirms LaplaceConfidence’s superiority over GMM (we use the F1 score because the clean-noisy binary classification is imbalanced). Moreover, we find that the estimation of GMM is unstable under heavy noise. This is because the losses of clean and noisy samples seriously overlap, and the GMM fails to converge under heavy noise. Note that DivideMix models the averaged loss over the last 5 epochs to improve convergence stability. We also apply this averaging in the GMM experiment; otherwise, the model would collapse. The proposed LaplaceConfidence, on the other hand, produces better estimation and does not need extra tricks to stabilize training.
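The F1 score used above to measure confidence estimation quality can be computed as in the following sketch, which treats "clean" as the positive class and thresholds the confidence at 0.5. Both choices, the function name, and the toy values are assumptions for illustration; the evaluation protocol behind Figure 4(a) may differ.

```python
import numpy as np

def clean_detection_f1(confidence, is_clean, threshold=0.5):
    """F1 score of predicting 'clean' when confidence exceeds the threshold."""
    pred_clean = np.asarray(confidence) > threshold
    is_clean = np.asarray(is_clean, dtype=bool)
    tp = np.sum(pred_clean & is_clean)    # clean samples predicted clean
    fp = np.sum(pred_clean & ~is_clean)   # noisy samples predicted clean
    fn = np.sum(~pred_clean & is_clean)   # clean samples predicted noisy
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return float(2 * precision * recall / (precision + recall))

# Toy values (assumed): three clean and two noisy samples.
confidence = np.array([0.9, 0.8, 0.2, 0.6, 0.1])
is_clean = np.array([True, True, True, False, False])
f1 = clean_detection_f1(confidence, is_clean)
```

Here two of the three clean samples are recovered and one noisy sample is accepted, so precision and recall are both 2/3. Accuracy would be less informative under 90% noise, where labelling everything as noisy already scores highly.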
Co-training
Co-training is employed in our approach to address the error accumulation problem associated with the self-training process and to aggregate predictions from two models. Our experiments consistently demonstrate that co-training outperforms a single model, highlighting its efficacy in enhancing performance. However, it is important to note that co-training introduces additional computational costs, more than doubling the computational requirements. This increase in cost must be considered along with the potential benefits. In contrast, our label confidence estimation method incurs minimal extra cost compared to the overall training process.
Data augmentations
Augmentation has been found useful in many tasks, such as semi-supervised learning [50] and unsupervised learning [57, 58]. We report LaplaceConfidence with other augmentation methods in Figure 4(b). One may argue that data augmentation has such a big influence that it could cause unfair comparisons. We show that LaplaceConfidence outperforms other methods that use the same or even stronger augmentation, as in Table 1. It also does not diminish the contributions of other components because they make further improvements on top of data augmentation, as in Table 5. We find that LaplaceConfidence with Cutmix
Conclusion
This paper studies the key problem in LNL: label confidence estimation. We propose LaplaceConfidence, a new graph-based method that utilizes the rich topological information in the feature space. It improves on previous feature-based methods by correcting the bias in the label confidence estimation caused by mislabeled data points. We demonstrate that our approach beats other methods through systematic experiments, significantly advancing the state-of-the-art. We also find that reducing the dimension of learned features before calculating the feature similarities lowers the computation cost without damaging generalization. Furthermore, we conduct ablation experiments to study the effects of our components.
There are many possible avenues for future research into label confidence estimation, including the exploration of other forms of noise present in datasets, such as distribution shifted data [59] or out-of-distribution data [60]. With LaplaceConfidence as a starting point, we believe that techniques from graph-based approaches may be adapted to solve these more challenging problems. For instance, one could adjust the contribution of different examples by introducing different weights for nodes in the graph. Due to its flexibility and scalability, we anticipate that our method can be applied to a range of real-world applications.
Acknowledgments
This paper is supported by the National Natural Science Foundation of China (Grant No. 62192783, 62376117) and the Collaborative Innovation Center of Novel Software Technology and Industrialization at Nanjing University.
