LaplaceConfidence: A graph-based approach for learning with noisy labels

Abstract

Real-world machine learning applications seldom provide perfectly labeled data, posing a challenge in developing models robust to noisy labels. Recent methods prioritize noise filtering based on the discrepancies between model predictions and the provided noisy labels, assuming samples with minimal classification losses to be clean. In this work, we capitalize on the consistency between the learned model and the complete noisy dataset, employing the data’s rich representational and topological information. We introduce LaplaceConfidence, a method that obtains label confidence (i.e., clean probabilities) using the Laplacian energy. Specifically, it first constructs graphs based on the feature representations of all noisy samples and minimizes the Laplacian energy to produce a low-energy graph. Clean labels should fit well into the low-energy graph while noisy ones should not, allowing our method to determine the data’s clean probabilities. Furthermore, LaplaceConfidence is embedded into a holistic method for robust training, where a co-training technique generates unbiased label confidence and a label refurbishment technique makes better use of it. We also explore dimensionality reduction to scale our method to large noisy datasets. Our experiments demonstrate that LaplaceConfidence outperforms state-of-the-art methods on benchmark datasets under both synthetic and real-world noise. Code available at https://github.com/chenmc1996/LaplaceConfidence.

Keywords

Learning with noisy labels; graph energy; label refurbishment

1. Introduction

Deep learning’s success hinges on the availability of high-quality labeled datasets. Nonetheless, the labeling process often encounters numerous challenges, including prohibitive labor costs, sizable data volumes, and domain-specific knowledge prerequisites [1, 2, 3]. As a result, fully intact labels are typically unavailable in real-world applications. What’s more, recent studies have found that label noise can seriously damage the performance of deep models [4, 5]. Therefore, it is necessary to improve model robustness against noisy labels.

The field of Learning with Noisy Labels (LNL) encompasses a broad spectrum of algorithms. Recently, a series of methods [6, 7, 8] has significantly improved robustness by leveraging the memorization effect, the behavior that deep models fit generalizable patterns before memorizing the noisy patterns [5]. Essentially, when training a model on a dataset containing both clean and mislabeled samples, the model tends to prioritize fitting the clean samples first. This can be attributed to the fact that the shared patterns among the clean samples are relatively easier to learn, while the unique mapping relationships within mislabeled samples pose a greater challenge. Consequently, the model exhibits smaller losses on the clean samples. Therefore, the clean probability of given labels, i.e., label confidence, can be estimated according to per-sample loss, as shown in the left of Figure 1.

For instance, DivideMix [8] employs a Gaussian Mixture Model (GMM) to dynamically establish the loss threshold $τ$ for sample selection. Subsequently, it trains the model using supervised signals from clean labels and self-supervised signals from noisy examples. Despite these advancements, recent studies have criticized loss-based criteria [9, 10, 11]. Notably, the loss distributions of correctly labeled and mislabeled samples invariably overlap, a problem exacerbated when the noise rate is high or samples are challenging to learn [9, 12].

Figure 1.

Conventional individual data point estimation vs. our graph-based estimation. Our graph-based method takes all samples and their topological relationship into consideration.

Our work is based on two primary observations. First, prior studies indicate that overfitting to noisy labels is less likely to occur in hidden representations [13, 14]. Second, the relationship of the learned representations can provide a more accurate estimation of label confidence compared to individual samples. To validate these observations, we analyze the learned feature space to confirm that samples frequently have same-class neighbors, even when trained on noisy datasets (refer to Section 4.2.1 for our analysis). Based on this perspective, it is feasible to estimate the label confidence of a sample by considering the number of different-class neighbors it has [15, 13, 16]. However, mislabeled data points can bias the predictions of their neighbors. To address this issue, our objective is to update labels for globally optimal estimation, ensuring that label consistency for all data points reaches a stable state. Accordingly, we propose leveraging the graph structure and the concept of Laplacian energy from graph theory [17, 18, 19] to achieve two goals: utilizing the representation space and obtaining unbiased label confidence. We construct a graph using the extracted representations of all samples, capturing the interaction between model predictions and the geometric structure. To mitigate the bias introduced by mislabeled samples, we globally optimize the original labels to minimize the Laplacian energy on the graph, resulting in a low-energy connection structure. Within this graph, clean labels integrate seamlessly, while noisy labels do not. Therefore, our method can determine an unbiased “clean probability” for each data point. Building on the derived label confidence, we divide the learning process into two components: unbiased label confidence generation and label refurbishment. The former employs a co-training technique to generate unbiased label confidence, while the latter refurbishes the labels based on the generated confidence to make better use of them.

In conclusion, we propose a novel label confidence estimation method, named LaplaceConfidence, fully utilizing the learned representations and their topological relationship. We embed it into a holistic method by combining it with other techniques, including co-training, label refurbishment [20, 21], and data augmentation. Given that the real-world applications of LNL often involve large-size models or datasets [22], we also investigate the role of the dimensionality reduction technique in the scalability of our method. The main contributions of this work are as follows:

− Novel confidence estimation method: We introduce LaplaceConfidence, a feature-based confidence estimation method. It optimizes confidence estimation by leveraging the topological information of samples. Additionally, it integrates with other techniques, providing a comprehensive solution for learning with noisy labels.

− Scalability and robustness: Addressing real-world needs for large-scale networks and noisy datasets, we explore the impact of dimensionality reduction on our method. Our findings show that this approach not only significantly accelerates the process but also maintains performance. In certain scenarios, it even enhances robustness.

− Performance benchmarking: We empirically demonstrate the superiority of LaplaceConfidence over existing classification-loss-based estimation methods. Our method achieves state-of-the-art results on LNL benchmarks, including CIFAR-10 and CIFAR-100 with synthetic label noise, and the real-world noisy dataset Mini-WebVision. Furthermore, we conduct a systematic study of LaplaceConfidence’s components to validate their effectiveness.

2. Related work

This section commences with an introduction of label noise taxonomy. Subsequently, we classify recent Learning with Noisy Labels (LNL) algorithms into two primary categories: classification-loss-based and feature-based. This classification serves as a foundation for the introduction of our proposed LaplaceConfidence. Additionally, we briefly discuss relevant works in the field of semi-supervised learning.

2.1 Taxonomy of label noise

We formally define $\tilde{y}$ and $y$ as the noisy label and the corresponding true label, respectively. In general, the distribution of noisy labels depends on both the data features and the true class label: $p (\tilde{y} = j ∣ x, y)$ . Based on this, some works [23, 24, 25, 26] assume an instance-independent label noise model:

\begin{aligned} p (\tilde{y} = j ∣ x) & = \sum_{i = 1}^{C} p (\tilde{y} = j, y = i ∣ x) = \sum_{i = 1}^{C} p (\tilde{y} = j ∣ y = i) p (y = i ∣ x) \end{aligned}

(1)

where $C$ is the number of classes and $p (\tilde{y} = j ∣ y = i)$ is the noise model, i.e., the probability of a sample in class $i$ being corrupted to class $j$ . The second equality holds because the label noise is assumed to be independent of the input features given the true label.
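The instance-independent model of Eq. (1) can be illustrated with a small NumPy sketch. The function names, the symmetric transition matrix used as an example, and the seed are illustrative assumptions rather than part of the original method.

```python
import numpy as np

def symmetric_transition_matrix(num_classes: int, noise_rate: float) -> np.ndarray:
    """T[i, j] = p(noisy = j | true = i).

    Assumption: with probability `noise_rate` a label is resampled uniformly
    over all classes, so the true label may be kept by chance.
    """
    T = np.full((num_classes, num_classes), noise_rate / num_classes)
    T[np.diag_indices(num_classes)] += 1.0 - noise_rate
    return T

def corrupt_labels(y_true: np.ndarray, T: np.ndarray, seed: int = 0) -> np.ndarray:
    """Draw one noisy label per sample from the row of T given by its true class."""
    rng = np.random.default_rng(seed)
    return np.array([rng.choice(T.shape[1], p=T[y]) for y in y_true])

def noisy_posterior(p_y_given_x: np.ndarray, T: np.ndarray) -> np.ndarray:
    """Eq. (1): p(noisy = j | x) = sum_i p(noisy = j | y = i) * p(y = i | x)."""
    return p_y_given_x @ T
```

Because the rows of `T` do not depend on `x`, every sample of a given class is corrupted with the same probabilities, which is exactly the assumption that instance-dependent noise relaxes.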

However, the instance-independent noise model can often be unrealistic. For instance, in certain real-world datasets, images that are difficult to recognize are more susceptible to mislabeling. Consequently, the probability of corruption should be influenced by the data features themselves, suggesting the presence of instance-dependent label noise [27, 28]. The explicit modeling of such noise presents considerable challenges. A selection of recent methods [8, 29, 11] have made headway in this area by leveraging confidence-based sample selection and semi-supervised learning, thereby achieving notable improvements.

2.2 Classification-loss-based LNL

In the approach adopted by Mentornet [6], a pre-trained network is utilized to select samples with minimal losses for the training of another model. In contrast, Co-teaching [7] employs two equivalent models, with each selecting low-loss samples for the other. Co-teaching+ [30] builds upon the Co-teaching strategy by further filtering agreed predictions to prevent the two base models from converging prematurely towards a consensus. ITLM [31] seeks to optimize a trimmed loss by strategically choosing a fraction of samples and updating the model on it. INCV [32] selects clean samples from the noisy ones at every round via a cross-validation process. Then it deploys the co-teaching training schema. SELFIE [15] refurbishes small-loss samples with the most frequently predicted labels in previous training epochs. [21] fits the distribution of per-sample loss on a Beta Mixture Model (BMM). Then the produced probability is used as the coefficient of the bootstrapping loss [20]. [33] designs a surrogate loss of the robust 0–1 loss and uses it for clean sample selection. [34] constrains the network output to permutations over a fixed vector and utilizes a sparse regularization strategy, including network output sharpening and norm regularization, to approximate the one-hot constraint and improve performance in the presence of noisy labels and class imbalance. DivideMix [8] segregates the data into a clean set and a noisy set based on loss values, leveraging a Gaussian Mixture Model (GMM) to dynamically ascertain the loss threshold. However, due to the overlapping nature of loss distributions, even some clean samples may exhibit high losses. Additionally, the threshold can be highly sensitive to alterations in the loss distribution, which can result in instability during the learning process. Different from these, our method investigates the LNL problem from a latent representational and topological perspective.

2.3 Feature-based LNL

There are several recent LNL methods [15, 13, 16] based on latent feature representation. Dimensionality-Driven Learning (D2L) [35] adopts a label refurbishment framework and backpropagates the loss for a linear combination of predictions and noisy labels. The method chooses an optimal weight, i.e., label confidence, for the combination such that the increase of local intrinsic dimensionality [36] is prevented. TopoFilter [10] filters noise according to the topological relationship between samples. It constructs a $k$ -NN graph and treats nodes in the largest connected component for each class as clean samples. Multi-Objective Interpolation Training [13] identifies noise by comparing the predictions of samples and those of their neighbors before correcting the wrong labels for training. Noisy Graph Cleaning (NGC) [11] considers a new problem setup: learning with open-world noisy data. The method constructs a graph and performs label propagation to obtain pseudo-labels. Then it selects clean samples using the largest connected component within each class. Besides, it adopts contrastive learning at a sub-graph level. Neighbor Consistency Regularization (NCR) [14] deploys a simple regularization loss term for robust training, encouraging examples with similar feature representations to have similar predictions. RapNets [37] introduces the concept of robust few-shot learning. They propose robust attentive profile networks, which can perform a feature-level similarity assessment, as a means to suppress outliers. [38] emphasize the importance of robustness to label noise in few-shot learning methods. The authors propose feature aggregation techniques and introduce a Transformer [39] model that leverages attention mechanisms to weigh mislabeled samples against correct ones. 
To address the challenge of open-world few-shot learning, where both in-domain and out-of-domain noise exist in few-shot datasets, IDEAL [40] proposes a framework that incorporates instance-wise and metric-wise calibration. This framework consists of a contrastive network and a meta network, which aim to extract intra-class information and inter-class variations. Additionally, they introduce a prototype modification technique to mitigate the impact of noise. However, we suggest that the topological information can be further utilized by optimizing the whole dataset’s topological structure. The major distinction of LaplaceConfidence is that it reaches an optimal estimation by minimizing the graph Laplacian energy.

2.4 Semi-supervised learning

Semi-supervised learning aims to leverage both labeled data and unlabeled data. Recent LNL methods attempt to convert the LNL problem into a semi-supervised learning problem by removing some possible noisy labels and utilizing powerful semi-supervised learning techniques. For instance, DivideMix’s success can be largely attributed to the deployment of MixMatch. Our idea of constructing a graph using data points’ representations is inspired by the semi-supervised learning method LaplaceNet [19], which assigns pseudo-labels to unlabeled data using the label propagation algorithm. One difference between LaplaceNet and our method is that the former employs propagated labels for training, while we utilize the resulting graph for label confidence estimation. One may wonder why LaplaceConfidence does not directly use the refined labels as training targets. We find that, unlike in semi-supervised learning, this would yield suboptimal results in LNL. We note that obtaining pseudo-labels in each iteration (instead of per epoch) using the latest model leads to less noisy training targets, which is essential to an LNL method. Therefore, LaplaceConfidence estimates label confidence in every epoch and generates new training targets in every iteration.

3. Method

3.1 Problem formulation

Different from standard supervised learning, only a noisy training dataset $\tilde{D} = \{(x_{i}, {\tilde{y}}_{i})\}_{i = 1}^{N}$ is available in LNL, where $x$ is the input feature, $\tilde{y} \in \{0, 1\}^{C}$ is the one-hot noisy label vector over $C$ classes, and $N$ is the number of training samples. The goal of LNL is to train a robust model, which can be viewed as a composition of a feature extractor $g (\cdot)$ and a linear classifier $f (\cdot)$ . The performance is evaluated on a clean test dataset.

3.2 LaplaceConfidence

We introduce graph structures to leverage the geometric information present in the data, allowing us to capture the interaction between model predictions and the data’s geometric structure. Specifically, an undirected $k$ -NN graph is first constructed using the penultimate-layer features of all training samples $\{v_{1} = g (x_{1}), \dots, v_{N} = g (x_{N})\}$ . The weighted adjacency matrix $A$ is defined as:

\begin{aligned} A_{i j} = \begin{cases} ⟨ v_{i}, v_{j} ⟩, & \text{if } i \in N_{k} (j) \\ 0, & \text{otherwise} \end{cases} \end{aligned}

(2)

where $⟨ \cdot, \cdot ⟩$ is the inner product and $N_{k} (j)$ denotes the $k$ nearest neighbors of sample $j$ . To balance the influence of samples with different numbers of neighbors, the diagonal degree matrix $D$ is used to normalize $A$ :

\begin{aligned} \bar{A} & = D^{- 1 / 2} A D^{- 1 / 2} \\ D & = \operatorname{diag} (A 1_{N}) \end{aligned}

(3)

where $\operatorname{diag} (\cdot)$ is a diagonal matrix whose diagonal consists of the vector in the bracket.
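The graph construction in Eqs. (2) and (3) can be sketched with dense NumPy arrays for a small feature matrix. This is an illustration under stated assumptions, not the authors' implementation: the function name is hypothetical, the adjacency matrix is symmetrized with an element-wise maximum to obtain an undirected graph, and negative similarities are dropped so that all degrees stay positive.

```python
import numpy as np

def build_normalized_knn_graph(V: np.ndarray, k: int) -> np.ndarray:
    """Dense sketch of Eqs. (2)-(3) for features V of shape (N, d), with k < N."""
    N = V.shape[0]
    S = V @ V.T                            # pairwise inner products <v_i, v_j>
    np.fill_diagonal(S, -np.inf)           # a node is never its own neighbor
    nn = np.argsort(-S, axis=1)[:, :k]     # k most similar samples per row
    rows = np.repeat(np.arange(N), k)
    A = np.zeros((N, N))
    A[rows, nn.ravel()] = S[rows, nn.ravel()]
    A = np.maximum(A, A.T)                 # assumption: symmetrize (undirected graph)
    A = np.clip(A, 0.0, None)              # assumption: drop negative similarities
    d = A.sum(axis=1)                      # diagonal of D = diag(A 1_N)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    return d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]   # D^{-1/2} A D^{-1/2}
```

For realistic dataset sizes a sparse matrix and an approximate nearest-neighbor search would replace the dense `N x N` arrays used here.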

A graph formed from a clean dataset should have low graph Laplacian energy because samples are likely to agree with their neighbors’ labels. This is not the case for our graph, because the learned features are inconsistent with the noisy labels. We can therefore identify the noisy nodes that cause this inconsistency by determining which nodes must change to reach a low-energy graph structure. To obtain the clean graph structure, we minimize the graph Laplacian energy over the label distribution:

\begin{aligned} Q (\bar{Y}) = \frac{1}{2} \sum_{i, j = 1}^{N} {\bar{A}}_{i j} {‖ \frac{{\bar{y}}_{i}}{\sqrt{D_{i i}}} - \frac{{\bar{y}}_{j}}{\sqrt{D_{j j}}} ‖}^{2} + \frac{μ}{2} \sum_{i = 1}^{N} ‖ {\bar{y}}_{i} - {\tilde{y}}_{i} ‖^{2} \end{aligned}

(4)

where $\bar{Y} = [{\bar{y}}_{1}, \dots, {\bar{y}}_{N}] \in R^{N \times C}$ is the refined label distribution. The second term is a fidelity term that avoids the refined labels changing too much from the original labels. $μ$ is the coefficient that balances between the node’s neighborhoods and itself. Minimizing $Q$ is a typical convex optimization problem. To side-step the calculation of matrix inverse, we use the conjugate gradient method to solve the linear system $(I - (1 + μ)^{- 1} \bar{A}) \bar{Y} = \tilde{Y}$ . The calculation follows the prevalent practice, so we don’t elaborate on the details here. Finally, a global optimal label distribution is obtained.

Given the refined label distribution $\bar{Y}$ , there are many ways to map it to a clean probability, e.g., calculating the cross-entropy $H (\bar{y}, \tilde{y})$ between the refined and original labels and then feeding it into the GMM framework used by previous methods. We find that simply using the refined probability of the original class as the label confidence, $w = \bar{y} [\tilde{y}]$ , yields good results. Here, the label confidence $w$ is an estimate of the probability that a given label is correct, serving as an indicator of the reliability of each individual label, and $[\cdot]$ selects a specific entry of a vector.
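A minimal sketch of this estimation step, under stated assumptions, is given here. It solves the linear system from the text with a hand-written textbook conjugate gradient on a dense matrix and then reads off $w = \bar{y} [\tilde{y}]$ . The function names, the default value of `mu`, and the clipping and row normalization that turn each refined row into a probability vector are assumptions made for illustration; the input `A_bar` is assumed to be symmetric and non-negative with spectral radius at most 1.

```python
import numpy as np

def conjugate_gradient(M, b, tol=1e-8, max_iter=200):
    """Solve M x = b for a symmetric positive-definite M (textbook CG)."""
    x = np.zeros_like(b)
    r = b - M @ x                  # residual
    p = r.copy()                   # search direction
    rs = r @ r
    for _ in range(max_iter):
        if np.sqrt(rs) < tol:
            break
        Mp = M @ p
        alpha = rs / (p @ Mp)
        x += alpha * p
        r -= alpha * Mp
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

def label_confidence(A_bar, noisy_labels, num_classes, mu=1.0):
    """Refine labels on the graph and return (Y_bar, w)."""
    N = len(noisy_labels)
    Y_tilde = np.eye(num_classes)[noisy_labels]       # one-hot noisy labels
    M = np.eye(N) - A_bar / (1.0 + mu)                # I - (1 + mu)^{-1} A_bar
    Y_bar = np.column_stack(
        [conjugate_gradient(M, Y_tilde[:, c]) for c in range(num_classes)]
    )
    Y_bar = np.clip(Y_bar, 0.0, None)                 # remove tiny negative round-off
    Y_bar /= Y_bar.sum(axis=1, keepdims=True)         # assumption: row-normalize
    w = Y_bar[np.arange(N), noisy_labels]             # w = y_bar[y_tilde]
    return Y_bar, w
```

On a toy graph with two fully connected clusters of four nodes, where one node carries the label of the other cluster, the mislabeled node receives a lower confidence than its correctly labeled neighbors.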

3.3 The overall training process

Though we mainly focus on label confidence estimation in this work, some other techniques are also unified to form a holistic pipeline for LNL. The whole training schema is described in Algorithm 1.

Specifically, the label refurbishment framework trains the model with the refurbished label $y^{*}$ , which is obtained from a convex combination of the noisy label $\tilde{y}$ and the pseudo-label $\hat{y}$ from the model’s prediction.

\begin{aligned} y^{*} = w \tilde{y} + (1 - w) \hat{y} \end{aligned}

(5)

where $w$ is the label confidence from Section 3.2. The bigger the label confidence, the more the model fits the given label. The smaller the label confidence, the more the optimization objective leans toward self-training. Using one model’s own predictions to guide its subsequent training leads to the error accumulation problem [41, 42]. Co-training alleviates the problem by training two models simultaneously. We adopt the co-training schema in DivideMix [8]. Specifically, two models with the same structure but different parameter initialization are maintained. The confidence one model uses comes from its peer. For pseudo-labeling, two models’ predictions are ensembled. Let $p_{model 1} (y ∣ x)$ and $p_{model 2} (y ∣ x)$ be the two networks’ predictions, respectively. The pseudo-labels are generated by:

\begin{aligned} \hat{y} = Sharpen (\frac{p_{model 1} (y ∣ α (x)) + p_{model 2} (y ∣ α (x))}{2}) \end{aligned}

(6)

where $α (\cdot)$ is a basic image augmentation function, which randomly flips and crops the input images. $Sharpen (p_{i}) = p_{i}^{\frac{1}{T}} / \sum_{j = 1}^{C} p_{j}^{\frac{1}{T}}$ reduces the entropy of the label distribution $p = (p_{1}, \dots, p_{C})$ with a temperature $T$ .
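Eqs. (5) and (6) can be sketched in a few lines of NumPy. The function names and the temperature value used in the usage note are illustrative assumptions; the two probability arrays stand in for the predictions of the two co-trained networks on weakly augmented inputs.

```python
import numpy as np

def sharpen(p: np.ndarray, T: float) -> np.ndarray:
    """Sharpen(p)_i = p_i^(1/T) / sum_j p_j^(1/T), applied row-wise."""
    p_t = p ** (1.0 / T)
    return p_t / p_t.sum(axis=-1, keepdims=True)

def refurbish(noisy_onehot, p_model1, p_model2, w, T):
    """Eq. (6) followed by Eq. (5) for a batch of samples."""
    y_hat = sharpen(0.5 * (p_model1 + p_model2), T)    # ensembled pseudo-label
    w = np.asarray(w)[:, None]                         # one confidence per sample
    return w * noisy_onehot + (1.0 - w) * y_hat        # y* = w y_tilde + (1 - w) y_hat
```

With the formula as written, `T = 1` leaves the averaged prediction unchanged and `T < 1` lowers its entropy. For example, `sharpen(np.array([[0.7, 0.3]]), T=0.5)` moves more probability mass onto the first class. A sample with `w = 1` keeps its given label, while `w = 0` replaces it entirely by the pseudo-label.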

We use mini-batch stochastic gradient descent for optimization. The loss of a sample $x$ in a mini-batch is the cross-entropy $H$ between the refurbished label $y^{*}$ and the prediction of the model:

\begin{aligned} L = H (y^{*}, f (g (A (x)))) \end{aligned}

(7)

where $A$ is the RandAugment augmentation method [43]. It first randomly selects a given number of operations from a set of image transformations, including geometric and photometric transformations. These operations are then applied sequentially with random magnitudes. Its details are in the supplementary material. $f$ and $g$ are the linear classifier and feature extractor, respectively.
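As a small illustration of Eq. (7), the cross-entropy against a soft target can be computed from the classifier logits as follows. The function names are hypothetical, the logits stand in for $f (g (A (x)))$ before the softmax, and averaging over the mini-batch is an assumption.

```python
import numpy as np

def log_softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable log-softmax over the class axis."""
    z = logits - logits.max(axis=1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))

def soft_cross_entropy(y_star: np.ndarray, logits: np.ndarray) -> float:
    """Batch mean of H(y*, p) = -sum_c y*_c log p_c with p = softmax(logits)."""
    return float(-(y_star * log_softmax(logits)).sum(axis=1).mean())
```

Unlike the usual hard-label cross-entropy, every class with non-zero mass in $y^{*}$ contributes to the loss, so the gradient interpolates between fitting the given label and fitting the pseudo-label.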

4. Results and discussion

4.1 Experimental details

4.1.1 Benchmark datasets

We benchmark the proposed method on CIFAR-10 and CIFAR-100 [44] with different levels of synthetic noise, as well as on the real-world dataset Mini-WebVision [22]. On CIFAR-10 and CIFAR-100, there are two commonly used types of synthetic noise [45, 8]: symmetric noise and asymmetric noise. Symmetric noise corrupts samples to random classes with the same probability, while asymmetric noise corrupts samples to specific classes according to a pre-defined label transition matrix (shown in Figure 2(a)). The noise rate ranges from 20% to 90% (note that samples are randomly corrupted to $C$ classes for symmetric noise, and the true labels may be maintained afterward). For Mini-WebVision, we use the first 50 classes of the Google image subset. The ImageNet ILSVRC12 is used as the validation set following [32, 8].

Figure 2.

Confusion matrices on CIFAR-10 under 20%–90% symmetric noise and 40% asymmetric noise. (a) The corrupted training set. (b) The prediction on the training set. (c) The prediction on the test set. The airp. and auto. are airplane and automobile for short.

Table 1

Comparison with state-of-the-art methods on CIFAR-10 and CIFAR-100 with synthetic noise. Sym. and Asym. are symmetric and asymmetric for short, respectively. The best results are indicated in bold. We include the results of AugDesc* with two different augmentation policies.

Dataset		CIFAR-10					CIFAR-100			
Noise type		Sym.				Asym.	Sym.			
Method / Noise ratio		20%	50%	80%	90%	40%	20%	50%	80%	90%
Bootstrap	Best	86.8	79.8	63.3	42.9	–	62.1	46.6	19.9	10.2
	Last	82.9	58.4	26.8	17.0	–	62.0	37.9	8.9	3.8
F-correction	Best	86.8	79.8	63.3	42.9	87.2	61.5	46.6	19.9	10.2
	Last	83.1	59.4	26.2	18.8	83.1	61.4	37.3	9.0	3.4
Co-teaching+	Best	89.5	85.7	67.4	47.9	–	65.6	51.8	27.9	13.7
	Last	88.2	84.1	45.5	30.1	–	64.1	45.3	15.5	8.8
Mixup	Best	95.6	87.1	71.6	52.2	–	67.8	57.3	30.8	14.6
	Last	92.3	77.6	46.7	43.9	–	66.0	46.6	17.6	8.1
P-correction	Best	92.4	89.1	77.5	58.9	88.5	69.4	57.5	31.1	15.3
	Last	92.0	88.7	76.5	58.2	88.1	68.1	56.4	20.7	8.8
Meta-Learning	Best	92.9	89.3	77.4	58.7	89.2	68.5	59.2	42.4	19.5
	Last	92.0	88.8	76.1	58.3	88.6	67.7	58.0	40.1	14.3
M-correction	Best	94.0	92.0	86.8	69.1	87.4	73.9	66.1	48.2	24.3
	Last	93.8	91.9	86.6	68.7	86.3	73.4	65.4	47.6	20.5
DivideMix	Best	96.1	94.6	93.2	76.0	93.4	77.3	74.6	60.2	31.5
	Last	95.7	94.4	92.9	75.4	92.1	76.9	74.2	59.6	31.0
AugDesc-AutoAugment^*	Best	96.3	95.4	93.8	91.9	94.6	79.5	77.2	66.4	41.2
	Last	96.2	95.1	93.6	91.8	94.3	79.2	77.0	66.1	40.9
AugDesc-RandAugment^*	Best	96.1	–	–	89.6	–	78.1	–	–	36.8
	Last	96.0	–	–	89.4	–	77.8	–	–	36.7
Robust LR	Best	96.5	95.8	94.3	92.8	94.4	79.1	75.3	66.7	37.5
	Last	96.4	95.7	94.2	92.8	93.7	78.6	74.6	66.2	37.3
LaplaceConfidence	Best	96.4	96.0	95.0	94.7	95.2	79.6	76.5	70.4	55.2
	Last	96.3	95.8	94.8	94.6	94.7	79.3	75.5	69.4	44.6

4.1.2 Backbone models

The backbone for CIFAR is the 18-layer PreAct Resnet [46]. The backbone for Mini-WebVision is the Inception-ResNet v2 [47].

4.1.3 Training schema

For all experiments, we mainly tune two hyper-parameters, namely the temperature $T_{pl}$ for pseudo-labeling and the $k$ in the $k$ -NN graph. Specifically, using a validation set of 5000 samples, we choose $T_{pl}$ from $\{1, 2, 3, 4\}$ and $k$ from $\{2, 10, 50, 100\}$ . For the two light-noise settings, namely CIFAR-10 under 20% symmetric noise and CIFAR-100 under 20% symmetric noise, and for Mini-WebVision, $T_{pl}$ is set to $1$ . Otherwise, $T_{pl}$ is set to 2. For CIFAR-10 under symmetric noise, $k$ is set to 50. In other cases, $k$ is set to 2. For CIFAR-10 and CIFAR-100, the network is trained using SGD with a learning rate of 0.01, a momentum of 0.9, a weight decay of 0.0005, and a batch size of 128 for 400 rounds. The model is warmed up for 15 epochs (simple supervised training using the original noisy dataset). We reduce the learning rate to 0.001 in the last 100 training rounds. For Mini-WebVision, the network is trained using SGD with a learning rate of 0.01, a momentum of 0.9, a weight decay of 0.0005, and a batch size of 160 for 300 rounds. The warm-up period is 1 epoch. We reduce the learning rate to 0.001 in the last 100 training rounds.

Following many LNL methods [48, 21, 8], we add a regularization term that encourages the network’s average output to follow a uniform distribution: $L_{reg} = \sum_{c} π_{c} \log (\frac{π_{c}}{{\bar{p}}_{c}})$ , where ${\bar{p}}_{c} = \frac{1}{B} \sum_{i = 1}^{B} p (y = c ∣ x_{i}; θ)$ is the mean prediction for class $c$ over a mini-batch of size $B$ and $π$ is the uniform prior distribution, i.e., $π_{c} = \frac{1}{C}$ . For asymmetric noise, we add a negative entropy loss term during warm-up following [49, 8]: $L_{asym} = \sum_{c} p (y = c ∣ x; θ) \log (p (y = c ∣ x; θ))$ .
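Both auxiliary terms are simple functions of the softmax outputs of a mini-batch, as the following sketch shows. The function names, the small constant `eps` added inside the logarithms, and the averaging of the negative entropy over the batch are illustrative assumptions.

```python
import numpy as np

def uniform_prior_penalty(probs: np.ndarray, eps: float = 1e-12) -> float:
    """L_reg = sum_c pi_c * log(pi_c / p_bar_c) with pi_c = 1 / C."""
    C = probs.shape[1]
    p_bar = probs.mean(axis=0)             # mean prediction per class over the batch
    pi = np.full(C, 1.0 / C)               # uniform prior
    return float(np.sum(pi * np.log(pi / (p_bar + eps))))

def negative_entropy(probs: np.ndarray, eps: float = 1e-12) -> float:
    """Batch mean of sum_c p_c * log(p_c)."""
    return float(np.mean(np.sum(probs * np.log(probs + eps), axis=1)))
```

The first term is the KL divergence from the uniform prior to the batch-averaged prediction, so it is close to zero when the batch is predicted evenly across classes and grows when predictions collapse onto a few classes. Minimizing the second term pushes each individual prediction toward high entropy, which discourages over-confident fitting during warm-up.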

Table 2
List of operations for strong transformations of the modified RandAugment. Three transformations are randomly chosen and performed with stochastic magnitude.

Operation	Range	Operation	Range
AutoContrast	[0, 1]	Rotate	[−30, 30]
Brightness	[0.05, 0.95]	Sharpness	[0.05, 0.95]
Color	[0.05, 0.95]	ShearX	[−0.3, 0.3]
Contrast	[0.05, 0.95]	ShearY	[−0.3, 0.3]
Equalize	[0, 1]	Solarize	[0, 256]
Identity	[0, 1]	TranslateX	[−0.3, 0.3]
Posterize	[4, 8]	TranslateY	[−0.3, 0.3]

4.1.4 Data augmentation

For the data augmentation $A$ , we use a modified version of RandAugment [43] that follows the setting of FixMatch [50]. The operations of RandAugment are shown in Table 2. The meaning of each range is the same as in the original version, so we don’t elaborate here.

4.2 Comparison to SOTA

4.2.1 CIFAR-10, CIFAR-100

For comparison on CIFAR-10 and CIFAR-100, results of Bootstrap [20], F-correction [26], P-correction [51], M-correction [21], Mixup [52], Co-teaching+ [30], Meta-learning [53], DivideMix [8], AugDesc [29], and Robust LR [54] are reported. Following [8, 29], we report both the Best test accuracy across all epochs and the average test accuracy over the Last 10 epochs. The performance of LaplaceConfidence is averaged over 3 trials with different random seeds for noise generation and parameter initialization. It is also worth noting that AugDesc is extended from DivideMix by adding different augmentations on top of it. We include two versions of AugDesc: one is with RandAugment [43], which is the same as the augmentation we use, and another is with AutoAugment [55], which uses reinforcement learning to determine the selection and ordering of a set of augmentation functions.

The proposed method outperforms the previous best method by up to 2.8% on CIFAR-10 under symmetric noise and by up to 14% on CIFAR-100 under symmetric noise, as shown in Table 1. Only AugDesc-AutoAugment achieves competitive results on CIFAR-100 under 50% noise, but it relies on AutoAugment, which has a higher computation cost, as shown in [43]. The performance gain is larger under heavy noise. We remark that suboptimal confidence estimation misguides the training; the model, in turn, overfits the wrong labels, which adversely affects the subsequent confidence estimation. Thus, a good label confidence estimation method can bring large improvements under heavy noise.

In terms of asymmetric noise, the proposed method also achieves better results, surpassing the previous best method by over 1.8%. After training, the model is less biased towards the given class-dependent noise on the training and test set, as shown in Figure 2(b) and (c).

Table 3
Comparison with state-of-the-art methods on Mini-WebVision with real-world noise. The best results are indicated in bold.

	Mini-WebVision		ILSVRC12	
Method	Top-1	Top-5	Top-1	Top-5
D2L	62.68	84.00	57.80	81.36
MentorNet	63.00	81.40	57.80	79.92
Co-teaching	63.58	85.20	61.48	84.70
Iterative-CV	65.24	85.34	61.60	84.98
DivideMix	77.32	91.64	75.20	90.84
NGC	79.16	91.84	74.44	91.04
LaplaceConfidence	80.52	94.56	77.36	94.12

Figure 3.

Feature representations visualized by t-SNE. Color represents the ground-truth class. (a). The clean feature representation (trained on clean CIFAR-10). (b). The noisy feature representation (trained on 50% uniformly corrupted CIFAR-10 without robust training techniques). (c). The noisy feature representation our method learned. The standard softmax-based linear classifier could easily fail on the twisted feature in (b), yielding bad label confidence estimation. Best viewed in color.

In Figure 3(a) and (b), we visualize the learned features of clean and noisy datasets. It can be seen that clean features have a clear cluster structure, where samples in the same class are close. However, noisy features are more twisted, as in Figure 3(b), where some samples are located far away from their cluster center. The softmax-based linear classifier on top of the penultimate layer features would be unable to separate such twisted features and, subsequently, the cross-entropy loss is unable to identify label noise [14, 56]. We suggest that this is one of the reasons why the small-loss criterion fails.

From Figure 3(b), we can also notice that most samples, even though they do not form a cluster structure, still have same-class neighbors. Therefore, we attempt to utilize the topological relationships between samples for label confidence estimation. The probability of a sample/node being mislabeled can be determined by the distribution of all data points in the feature space. The graph structure is commonly used to model such relationships, on which all nodes can propagate their labels to their neighbors. Considering that the propagation could be an iterative process, where corrected nodes can affect their neighbors again until reaching convergence, we solve the label confidence estimation as the classic graph Laplacian minimization problem [17, 18, 19], i.e., making the intrinsic connection structure sufficiently smooth. After this process, the more a label changes, the more likely it is that the original label is wrong.

4.2.2 Mini-WebVision

Learning with real-world label noise is more challenging. We evaluate LaplaceConfidence on Mini-WebVision to verify that it performs well on a larger dataset with more complex noise. For comparison, we choose classification-loss-based methods, namely MentorNet [6], Co-teaching [7], Iterative-CV [32], DivideMix [8], and feature-based methods, namely D2L [35] and NGC [11]. It is worth noting that NGC is also a graph-based LNL method which utilizes the largest connected components for noise identification. Comparison with it can verify that our method based on Laplacian energy minimization is better than other graph-based methods.

LaplaceConfidence outperforms all other classification-loss-based and feature-based LNL methods, achieving top-1 and top-5 accuracies of 80.52% and 94.56%, respectively. These are 1.36 and 2.72 percentage points better than the previous best method, NGC.

4.3 Accelerating LaplaceConfidence

The difference in computation time between LaplaceConfidence and conventional loss-based LNL methods comes from the label confidence estimation process. Given that the field of LNL often deals with large datasets or models, further reducing the computation cost would benefit the real-world use of our method. Since our method takes a bilevel optimization form and we do not need to back-propagate gradients through the label confidence process, the features can be freely manipulated. We therefore consider a simple strategy that is known to remove irrelevant or redundant features: reducing the dimension of the extracted feature embeddings using Principal Component Analysis (PCA) for the calculation in Eq. (3.2).
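The dimension-reduction step described above can be sketched as follows. This is a plain SVD-based PCA in NumPy; the function name and the choice of `n_components` are illustrative assumptions, since the target dimension is not stated in this section.

```python
import numpy as np

def pca_reduce(features, n_components):
    """Project features onto their top principal components via SVD."""
    centered = features - features.mean(axis=0, keepdims=True)
    # Rows of vt are principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T
```

Because no gradient flows through the confidence estimation, the reduced features can simply replace the raw embeddings when the similarity graph is built.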

Table 4
Time cost in seconds and accuracy. “per conv.” stands for the running time per convergence of the estimation method.

		LC	PCA $+$ LC
CIFAR-10	Time cost per conv.	8.66	5.66
	Average accuracy	95.45	95.41
Mini-WebVision	Time cost per conv.	22.89	14.53
	Average accuracy	80.52	81.32

To further accelerate $k$-NN graph construction, we implement it using the Faiss library (https://faiss.ai/). Even though the worst-case time complexity of the underlying $k$-selection algorithm is still $O(n^{2})$, the average case is reduced to $O(n)$. Moreover, Faiss provides fast GPU implementations and optimizations that take advantage of parallelism. On the same infrastructure, we measure the running time of LaplaceConfidence and LaplaceConfidence $+$ PCA on two datasets.
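To make the role of $k$-selection concrete, here is a small brute-force sketch in NumPy. It does not use the Faiss API; it swaps in `np.argpartition`, which selects the $k$ largest similarities in each row without fully sorting the row. The function name and the use of cosine similarity are assumptions made for illustration.

```python
import numpy as np

def knn_indices(features, k):
    """Return the indices of the k nearest neighbours (cosine) of each row."""
    norms = np.maximum(np.linalg.norm(features, axis=1, keepdims=True), 1e-12)
    x = features / norms
    sim = x @ x.T
    np.fill_diagonal(sim, -np.inf)  # a point is not its own neighbour
    # Partial selection: the first k columns hold the k most similar points.
    top = np.argpartition(-sim, k - 1, axis=1)[:, :k]
    # Sort only the k selected entries by decreasing similarity.
    order = np.argsort(-np.take_along_axis(sim, top, axis=1), axis=1)
    return np.take_along_axis(top, order, axis=1)
```

The dense similarity matrix here costs quadratic memory, which is exactly what a dedicated library avoids on large datasets; the sketch only shows the selection step.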

We run our experiments and measure the computational cost on a 24-core Intel(R) Xeon(R) Platinum 8255C CPU and a single NVIDIA RTX V100 GPU to obtain the real time cost. As shown in Table 4, simply adding PCA makes LaplaceConfidence faster with almost no accuracy loss on CIFAR-10. Notably, PCA even improves the accuracy on Mini-WebVision by 0.8%. We attribute this to dimension reduction, which keeps the important features of the learned representation and may reduce the damage from ambiguous or wrong features learned from noisy training signals.
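As a quick check derived only from the timings reported in Table 4, the relative reduction in per-convergence time from adding PCA is:

$$
\frac{8.66 - 5.66}{8.66} \approx 34.6\% \ \text{(CIFAR-10)}, \qquad \frac{22.89 - 14.53}{22.89} \approx 36.5\% \ \text{(Mini-WebVision)}.
$$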

Table 5

Ablation study. Results on CIFAR-10 with different levels of symmetric noise are reported.

Method/Noise ratio		20%	50%	80%	90%
LaplaceConfidence	Best	96.4	96.0	95.0	94.7
	Last	96.3	95.8	94.8	94.6
Replacing LaplaceConfidence with GMM	Best	96.4	95.6	94.8	93.4
	Last	96.2	95.5	94.3	93.1
Without co-training	Best	96.0	95.3	93.9	94.0
	Last	95.4	95.2	93.3	93.6

Figure 4.

(a) The quality of confidence estimation of LaplaceConfidence and GMM on CIFAR-10 under 90% noise. (b) LaplaceConfidence with other augmentations on CIFAR-10 under 90% noise. (c) Varying $k$ for the $k$-NN graph. (d) Varying the temperature $T_{pl}$ for pseudo-labeling. In (a), a threshold of 0.5 is used to separate clean from noisy samples. The F1-score is used because of the imbalance between clean and noisy samples. LC is short for LaplaceConfidence.

4.4 Ablation study

To study the importance of the main components in LaplaceConfidence, we test each of them separately and report the performance:

− To study the effect of LaplaceConfidence, we replace it with GMM.
− To study the effect of co-training, we use only one model.
− To study the effect of the augmentation, we replace RandAugment with other augmentation methods.

We also study the influence of two important hyper-parameters in Figure 4(c) and (d), namely the number of nearest neighbors $k$ and the temperature $T$.

4.4.1 LaplaceConfidence

LaplaceConfidence plays a key role in our method. For comparison, we choose the Gaussian Mixture Model (GMM), the confidence estimation method used in the previous SOTA. A mixture of two Gaussian distributions is fitted to the loss values using the Expectation-Maximization algorithm. For its hyper-parameter settings and implementation, we follow the official implementation of DivideMix (https://github.com/LiJunnan1992/DivideMix). As shown in Table 5, LaplaceConfidence matches or exceeds the accuracy of GMM at every noise ratio, with the largest gap under 90% noise (94.7 vs. 93.4 best accuracy). We further compare their confidence estimation quality during training in Figure 4(a), where LaplaceConfidence again outperforms GMM (we use the F1 score because the clean-noisy binary classification is imbalanced). Moreover, we find that the estimation of GMM is unstable under heavy noise. This is because the losses of clean and noisy samples overlap heavily, so the GMM fails to converge. Note that DivideMix models the loss averaged over the last 5 epochs to improve convergence stability. We also apply this in the GMM experiment; otherwise, the model would collapse. The proposed LaplaceConfidence, on the other hand, produces better estimates and does not need extra tricks to stabilize training.
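For readers unfamiliar with the GMM baseline, the following is a minimal sketch of fitting a two-component one-dimensional mixture to per-sample losses with EM and reading the posterior of the low-mean component as the clean probability. It is a from-scratch illustration, not the DivideMix implementation; the initialization, iteration count, variance floor, and the convention that "clean" is the positive class for the F1 score are all assumptions.

```python
import numpy as np

def gmm_clean_probability(losses, n_iter=50):
    """Fit a 2-component 1-D Gaussian mixture with EM; return P(clean | loss)."""
    losses = np.asarray(losses, dtype=float)
    # Initialize the two components at the smallest and largest loss.
    mu = np.array([losses.min(), losses.max()])
    var = np.full(2, losses.var() + 1e-6)
    pi = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: responsibility of each component for each sample.
        diff = losses[:, None] - mu[None, :]
        log_p = -0.5 * (diff ** 2 / var + np.log(2 * np.pi * var)) + np.log(pi)
        log_p -= log_p.max(axis=1, keepdims=True)
        resp = np.exp(log_p)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: update mixing weights, means, and variances.
        nk = resp.sum(axis=0) + 1e-12
        pi = nk / nk.sum()
        mu = (resp * losses[:, None]).sum(axis=0) / nk
        var = (resp * (losses[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-6
    # The component with the smaller mean is treated as the clean one.
    return resp[:, np.argmin(mu)]

def f1_at_threshold(clean_prob, is_clean, threshold=0.5):
    """F1 score of clean-sample detection when thresholding the clean probability."""
    pred = np.asarray(clean_prob) > threshold
    truth = np.asarray(is_clean, dtype=bool)
    tp = np.sum(pred & truth)
    fp = np.sum(pred & ~truth)
    fn = np.sum(~pred & truth)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0
```

When the two loss distributions overlap heavily, as under 90% noise, the two fitted components are no longer well separated, which is consistent with the instability described above.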

4.4.2 Co-training

Co-training is employed in our approach to address the error accumulation problem associated with the self-training process and to aggregate predictions from two models. Our experiments consistently demonstrate that co-training outperforms a single model, highlighting its efficacy in enhancing performance. However, it is important to note that co-training introduces additional computational costs, more than doubling the computational requirements. This increase in cost must be considered along with the potential benefits. In contrast, our label confidence estimation method incurs minimal extra cost compared to the overall training process.

4.4.3 Data augmentations

Augmentation has been found useful in many tasks, such as semi-supervised learning [50] and unsupervised learning [57, 58]. We report LaplaceConfidence with other augmentation methods in Figure 4(b). One may argue that data augmentation has such a large influence that it could cause unfair comparisons. However, Table 1 shows that LaplaceConfidence outperforms other methods that use the same or even stronger augmentation. Augmentation also does not diminish the contributions of the other components, because they yield further improvements on top of it, as shown in Table 5. We find that LaplaceConfidence with Cutmix $+$ Augmix achieves best and last accuracies of 95.0% and 94.8% on CIFAR-10 under 90% noise, respectively. Considering that no previous LNL method uses Augmix, we only report the RandAugment version of LaplaceConfidence for a fair comparison.

5. Conclusion

This paper studies a key problem in LNL: label confidence estimation. We propose LaplaceConfidence, a new graph-based method that utilizes the rich topological information in the feature space. It improves on previous feature-based methods by correcting the bias in label confidence estimation caused by mislabeled data points. Through systematic experiments, we demonstrate that our approach outperforms other methods and advances the state of the art. We also find that reducing the dimension of the learned features before calculating feature similarities lowers the computation cost without damaging generalization. Furthermore, we conduct ablation experiments to study the effects of our components.

There are many possible avenues for future research into label confidence estimation, including other forms of noise present in datasets, such as distribution-shifted data [59] or out-of-distribution data [60]. With LaplaceConfidence as a starting point, we believe that techniques from graph-based approaches may be adapted to solve these more challenging problems. For instance, one could adjust the contribution of different examples by introducing different weights for nodes in the graph. Given its flexibility and scalability, we anticipate that our method can be applied to a range of real-world applications.

Acknowledgments

This paper is supported by the National Natural Science Foundation of China (Grant No. 62192783, 62376117) and the Collaborative Innovation Center of Novel Software Technology and Industrialization at Nanjing University.

References

1. Roh, Heo, S.E. Whang, A Survey on Data Collection for Machine Learning: A Big Data – AI Integration Perspective, IEEE Transactions on Knowledge and Data Engineering 33(4) (2021), 1328–1347. doi: 10.1109/TKDE.2019.2946162.

2. Han, Yao, Liu, Niu, I.W. Tsang, J.T. Kwok, Sugiyama, A Survey of Label-noise Representation Learning: Past, Present and Future, CoRR abs/2011.04406 (2020). https://arxiv.org/abs/2011.04406.

3. Z.-H. Zhou, A brief introduction to weakly supervised learning, National Science Review 5(1) (2018), 44–53.

4. Zhang, Bengio, Hardt, Recht, Vinyals, Understanding deep learning requires rethinking generalization, arXiv preprint arXiv:1611.03530 (2016).

5. Arpit, Jastrzębski, Ballas, Krueger, Bengio, M.S. Kanwal, Maharaj, Fischer, Courville, Bengio et al., A closer look at memorization in deep networks, in: International Conference on Machine Learning, PMLR, 2017, pp. 233–242.

6. Jiang, Zhou, Leung, L.-J. Li, Fei-Fei, Mentornet: Learning data-driven curriculum for very deep neural networks on corrupted labels, in: International Conference on Machine Learning, PMLR, 2018, pp. 2304–2313.

7. Han, Yao, Niu, Tsang, Sugiyama, Co-teaching: Robust training of deep neural networks with extremely noisy labels, arXiv preprint arXiv:1804.06872 (2018).

8. Socher, S.C. Hoi, Dividemix: Learning with noisy labels as semi-supervised learning, arXiv preprint arXiv:2002.07394 (2020).

9. Song, Kim, Park, Shin, J.-G. Lee, Learning from noisy labels with deep neural networks: A survey, arXiv preprint arXiv:2007.08199 (2020).

10. Zheng, Goswami, Metaxas, Chen, A topological filter for learning with label noise, arXiv preprint arXiv:2012.04835 (2020).

11. Z.-F. Wu, Wei, Jiang, Mao, Tang, Y.-F. Li, Ngc: A unified framework for learning with open-world noisy data, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 62–71.

12. Karthik, Revaud, Chidlovskii, Learning From Long-Tailed Data With Noisy Labels, CoRR abs/2108.11096 (2021). https://arxiv.org/abs/2108.11096.

13. Ortego, Arazo, Albert, N.E. O’Connor, McGuinness, Multi-Objective Interpolation Training for Robustness To Label Noise, in: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021, virtual, June 19–25, 2021, Computer Vision Foundation/IEEE, 2021, pp. 6606–6615. doi: 10.1109/CVPR46437.2021.00654. https://openaccess.thecvf.com/content/CVPR2021/html/Ortego_Multi-Objective_Interpolation_Training_for_Robustness_ToLabel_Noise_CVPR_2021_paper.html.

14. Iscen, Valmadre, Arnab, Schmid, Learning with Neighbor Consistency for Noisy Labels, in: IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18–24, 2022, IEEE, 2022, pp. 4662–4671. doi: 10.1109/CVPR52688.2022.00463.

15. Song, Kim, Lee, SELFIE: Refurbishing Unclean Samples for Robust Deep Learning, in: Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9–15 June 2019, Long Beach, California, USA, K. Chaudhuri and R. Salakhutdinov, eds, Proceedings of Machine Learning Research, Vol. 97, PMLR, 2019, pp. 5907–5915. http://proceedings.mlr.press/v97/song19b.html.

16. Bahri, Jiang, M.R. Gupta, Deep k-NN for Noisy Labels, in: Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13–18 July 2020, Virtual Event, Proceedings of Machine Learning Research, Vol. 119, PMLR, 2020, pp. 540–550. http://proceedings.mlr.press/v119/bahri20a.html.

17. M.E. Newman, Detecting community structure in networks, The European Physical Journal B 38(2) (2004), 321–330.

18. Zhu, Ghahramani, J.D. Lafferty, Semi-Supervised Learning Using Gaussian Fields and Harmonic Functions, in: Machine Learning, Proceedings of the Twentieth International Conference (ICML 2003), August 21–24, 2003, Washington, DC, USA, T. Fawcett and N. Mishra, eds, AAAI Press, 2003, pp. 912–919. http://www.aaai.org/Library/ICML/2003/icml03-118.php.

19. Sellars, A.I. Avilés-Rivero, Schönlieb, LaplaceNet: A Hybrid Energy-Neural Model for Deep Semi-Supervised Classification, CoRR abs/2106.04527 (2021). https://arxiv.org/abs/2106.04527.

20. Reed, Lee, Anguelov, Szegedy, Erhan, Rabinovich, Training deep neural networks on noisy labels with bootstrapping, arXiv preprint arXiv:1412.6596 (2014).

21. Arazo, Ortego, Albert, O’Connor, McGuinness, Unsupervised label noise modeling and loss correction, in: International Conference on Machine Learning, PMLR, 2019, pp. 312–321.

22. Wang, Agustsson, Van Gool, Webvision database: Visual learning and understanding from web data, arXiv preprint arXiv:1708.02862 (2017).

23. Liu, Tao, Classification with noisy labels by importance reweighting, IEEE Transactions on pattern analysis and machine intelligence 38(3) (2015), 447–461.

24. Sukhbaatar, Bruna, Paluri, Bourdev, Fergus, Training convolutional networks with noisy labels, arXiv preprint arXiv:1406.2080 (2014).

25. Chen, Gupta, Webly supervised learning of convolutional networks, in: Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1431–1439.

26. Patrini, Rozza, Krishna Menon, Nock, Making deep neural networks robust to label noise: A loss correction approach, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1944–1952.

27. A.K. Menon, Van Rooyen, Natarajan, Learning from binary labels with instance-dependent noise, Machine Learning 107(8) (2018), 1561–1595.

28. Cheng, Liu, Ramamohanarao, Tao, Learning with bounded instance and label-dependent label noise, in: International Conference on Machine Learning, PMLR, 2020, pp. 1789–1799.

29. Nishi, Ding, Rich, Hollerer, Augmentation strategies for learning with noisy labels, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 8022–8031.

30. Han, Yao, Niu, Tsang, Sugiyama, How does disagreement help generalization against label corruption, in: International Conference on Machine Learning, PMLR, 2019, pp. 7164–7173.

31. Shen, Sanghavi, Learning with bad training data via iterative trimmed loss minimization, in: International Conference on Machine Learning, PMLR, 2019, pp. 5739–5748.

32. Chen, B.B. Liao, Chen, Zhang, Understanding and utilizing deep neural networks trained with noisy labels, in: International Conference on Machine Learning, PMLR, 2019, pp. 1062–1070.

33. Lyu, I.W. Tsang, Curriculum loss: Robust learning and generalization against label corruption, arXiv preprint arXiv:1905.10045 (2019).

34. Zhou, Liu, Wang, Zhai, Jiang, Learning with Noisy Labels via Sparse Regularization, in: 2021 IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 72–81. doi: 10.1109/ICCV48922.2021.00014.

35. Wang, M.E. Houle, Zhou, Erfani, Xia, Wijewickrema, Bailey, Dimensionality-driven learning with noisy labels, in: International Conference on Machine Learning, PMLR, 2018, pp. 3355–3364.

36. M.E. Houle, Local intrinsic dimensionality I: an extreme-value-theoretic foundation for similarity applications, in: International Conference on Similarity Search and Applications, Springer, 2017, pp. 64–79.

37. Jin, Liang, Zhang, Robust Few-Shot Learning for User-Provided Data, IEEE Transactions on Neural Networks and Learning Systems 32(4) (2021), 1433–1447. doi: 10.1109/TNNLS.2020.2984710.

38. K.J. Liang, S.B. Rangrej, Petrovic, Hassner, Few-shot learning with noisy labels, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 9089–9098.

39. Vaswani, Shazeer, Parmar, Uszkoreit, Jones, A.N. Gomez, Kaiser, Polosukhin, Attention is all you need, Advances in Neural Information Processing Systems 30 (2017).

40. Xue, Zhao, Wang, From Instance to Metric Calibration: A Unified Framework for Open-World Few-Shot Learning, IEEE Transactions on Pattern Analysis and Machine Intelligence 45(8) (2023), 9757–9773. doi: 10.1109/TPAMI.2023.3244023.

41. Tarvainen, Valpola, Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results, arXiv preprint arXiv:1703.01780 (2017).

42. Arazo, Ortego, Albert, N.E. O’Connor, McGuinness, Pseudo-labeling and confirmation bias in deep semi-supervised learning, in: 2020 International Joint Conference on Neural Networks (IJCNN), IEEE, 2020, pp. 1–8.

43. E.D. Cubuk, Zoph, Shlens, Q.V. Le, Randaugment: Practical automated data augmentation with a reduced search space, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2020, pp. 702–703.

44. Krizhevsky, Hinton et al., Learning multiple layers of features from tiny images (2009).

45. Kim, Yim, Yun, Kim, Nlnl: Negative learning for noisy labels, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 101–110.

46. Zhang, Ren, Sun, Identity mappings in deep residual networks, in: European conference on computer vision, Springer, 2016, pp. 630–645.

47. Szegedy, Ioffe, Vanhoucke, Alemi, Inception-v4, inception-resnet and the impact of residual connections on learning, in: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 31, 2017.

48. Tanaka, Ikami, Yamasaki, Aizawa, Joint optimization framework for learning with noisy labels, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 5552–5560.

49. Pereyra, Tucker, Chorowski, Kaiser, Hinton, Regularizing neural networks by penalizing confident output distributions, arXiv preprint arXiv:1701.06548 (2017).

50. Sohn, Berthelot, C.-L. Li, Zhang, Carlini, E.D. Cubuk, Kurakin, Zhang, Raffel, Fixmatch: Simplifying semi-supervised learning with consistency and confidence, arXiv preprint arXiv:2001.07685 (2020).

51. Probabilistic end-to-end noise correction for learning with noisy labels, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 7017–7025.

52. Zhang, Cisse, Y.N. Dauphin, Lopez-Paz, mixup: Beyond empirical risk minimization, arXiv preprint arXiv:1710.09412 (2017).

53. Wong, Zhao, M.S. Kankanhalli, Learning to learn from noisy labeled data, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 5051–5059.

54. Chen, Cheng, Jiang, Wang, Two wrongs don’t make a right: Combating confirmation bias in learning with label noise, in: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 37, 2023, pp. 14765–14773.

55. E.D. Cubuk, Zoph, Mane, Vasudevan, Q.V. Le, Autoaugment: Learning augmentation strategies from data, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 113–123.

56. Kim, Cho, Choi, Yun, FINE Samples for Learning with Noisy Labels, in: Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6–14, 2021, virtual, Ranzato, Beygelzimer, Y.N. Dauphin, Liang, J.W. Vaughan, eds, 2021, pp. 24137–24149. https://proceedings.neurips.cc/paper/2021/hash/ca91c5464e73d3066825362c3093a45f-Abstract.html.

57. Van Gansbeke, Vandenhende, Georgoulis, Proesmans, Van Gool, Scan: Learning to classify images without labels, in: European Conference on Computer Vision, Springer, 2020, pp. 268–285.

58. Khosla, Teterwak, Wang, Sarna, Tian, Isola, Maschinot, Liu, Krishnan, Supervised contrastive learning, Advances in Neural Information Processing Systems 33 (2020), 18661–18673.

59. Lee, Zhang, Yang, CleanNet: Transfer Learning for Scalable Image Classifier Training With Label Noise, in: 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18–22, 2018, Computer Vision Foundation/IEEE Computer Society, 2018, pp. 5447–5456. doi: 10.1109/CVPR.2018.00571. http://openaccess.thecvf.com/content_cvpr_2018/html/Lee_CleanNet_Transfer_Learning_CVPR_2018_paper.html.

60. Wang, Liu, Bailey, Zha, Song, Xia, Iterative Learning With Open-Set Noisy Labels, in: 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18–22, 2018, Computer Vision Foundation/IEEE Computer Society, 2018, pp. 8688–8696. doi: 10.1109/CVPR.2018.00906. http://openaccess.thecvf.com/content_cvpr_2018/html/Wang_Iterative_Learning_With_CVPR_2018_paper.html.

	Mini-WebVision		ILSVRC12
Method	Top-1	Top-5	Top-1	Top-5
D2L	62.68	84.00	57.80	81.36
MentorNet	63.00	81.40	57.80	79.92
Co-teaching	63.58	85.20	61.48	84.70
Iterative-CV	65.24	85.34	61.60	84.98
DivideMix	77.32	91.64	75.20	90.84
NGC	79.16	91.84	74.44	91.04
LaplaceConfidence	80.52	94.56	77.36	94.12
