Abstract
Background
Precise pneumoconiosis staging suffers from progressive pair label noise (PPLN) in chest X-ray datasets, because adjacent stages are confused due to unidentifiable and diffuse opacities in the lung fields. When deep neural networks are employed to aid disease staging, their performance degrades under such label noise.
Objective
This study improves the effectiveness of pneumoconiosis staging by mitigating the impact of PPLN through network architecture refinement and sample selection mechanism adjustment.
Methods
We propose a novel multi-branch architecture that incorporates dual-threshold sample selection. Several auxiliary branches are integrated in a two-phase module to learn and predict the progressive feature tendency. A novel difference-based metric is introduced to iteratively obtain the instance-specific thresholds as a complementary criterion for dynamic sample selection. All the samples are finally partitioned into clean and hard sets according to the dual-threshold criteria and treated differently by loss functions with penalty terms.
Results
Compared with the state-of-the-art, the proposed method obtains the best metrics (accuracy: 90.92%, precision: 84.25%, sensitivity: 81.11%, F1-score: 82.06%, and AUC: 94.64%) under real-world PPLN, and is less sensitive to the rise of synthetic PPLN rate. An ablation study validates the respective contributions of critical modules and demonstrates how variations of essential hyperparameters affect model performance.
Conclusions
The proposed method achieves substantial effectiveness and robustness against PPLN in a pneumoconiosis dataset, and can further assist physicians in diagnosing the disease with higher accuracy and confidence.
Introduction
Pneumoconiosis is a group of heterogeneous occupational interstitial lung diseases caused by the inhalation of mineral dust, and can lead to chronic pulmonary inflammation and fibrosis. 1 These potentially severe symptoms have been threatening the health of millions of mine workers exposed to dust worldwide. 2 Discovering its early symptoms and determining the stage precisely are essential for disease prevention and diagnosis. Beyond the necessary epidemiological investigation of occupational dust exposure, radiologists follow a specification that grades the disease from Stage 0 (considered to be normal) through Stage 3, and determine the stage by comparing chest X-ray radiographs (CXRs) of patients to the standard films (Figure 1). The stages are primarily determined by assessing small opacity agglomerations in the lung fields. The opacities are so tiny and diffuse that even experienced radiologists may unintentionally confuse them with normal lung tissue, resulting in incorrect diagnostic conclusions. One possible solution is to introduce deep neural networks (DNNs) for pneumoconiosis stage determination. However, early methods employed only simple techniques3–7 and did not consider the noisy labels introduced during the annotation procedure.

Standard pneumoconiosis CXR films of Stage 0, Stage 1, Stage 2 and Stage 3.
In medical images, label noise arises jointly from subjective and objective factors, and pneumoconiosis CXRs are a typical example. One of the key subjective factors is inter/intra-observer variability, i.e., the label uncertainty caused by disagreement among multiple annotators8–10 or by inconsistent determination criteria applied by a single individual. 11
Objectively, the intrinsic similarity and deviation among medical image samples of different classes affect the acquisition of true labels. From a domain-specific perspective, it appears in pneumoconiosis staging that one class may only be confused with its neighbouring classes. This is because, apart from varied opinion of different observers, the radiological feature is nuanced at the decision boundary between neighbouring classes, as illustrated in Figure 2. It remains ambiguous how the underlying semantic feature of CXRs correlates with the stages. In this paper, we denote this special type of label noise as

The radiological feature of pneumoconiosis in CXRs in principle varies continuously and chronologically with the progression of the patient’s condition, as indicated by the grey thick dashed bi-directional arrow. Some typical samples from the real-world dataset are spread along the arrow. The bi-directional blue arrow sketches the concept of progressive feature tendency that we propose in this paper.
For one thing, PPLN may not simply be ascribed to either class-conditional or instance-dependent factors. In fact, common label noise cases presume that different classes have definite criteria and clear boundaries, whereas pneumoconiosis staging depends on empirical deduction by experienced experts with fuzzy, undetermined standards. For another, some researchers have taken this special label noise type as a basic assumption. Sun et al. 12 proposed a log-normal label distribution learning method to alleviate the ambiguity at the decision boundary between pneumoconiosis stages. However, unlike most label distribution scenarios with densely sequenced labels and corresponding continuous metrics,11,13 pneumoconiosis staging labels are relatively sparse, and there has not been a sufficiently evident metric that linearly quantifies the density of small opacities. This can introduce profound unreliability into the formulation of the specific label ambiguity.
Therefore, this study is motivated to explore a novel and reasonable paradigm to mitigate the impact of confusion between adjacent classes under the setting of PPLN. An intuitive way is to train several smart virtual experts at the decision boundaries to uncover the potential criteria. Working as a special observer, each of the experts is responsible for assigning the confusing samples to one of two adjacent stages. As in Figure 2, the confusing samples are usually located in the dashed circles, within which the samples are prone to be mistakenly assigned to either of the adjacent stages. Yet, this plausible approach is difficult to realize because the genuine criteria for classification may not be easily found. If the standpoint of an expert is shifted from the boundary to a certain stage (e.g., Stage 1), we believe that there should exist an implicit progressive feature tendency (details in Section “Potential neighbouring label mining”) in both the mild and severe directions. That is, standing at Stage 1, observers may find in the lung fields that Stage-0 samples have a milder opacity agglomeration, while Stage-2 and Stage-3 samples appear in a more severe condition. Such a tendency inspires us to examine whether it can assist in making decisions at each boundary in a discretized binary manner.
In this paper, we design a novel multi-branch architecture named

The overall architecture of FTLTD-Net consists of 3 parts. (a) The main branch serves as a common classifier. (b) The Potential Neighboring Label Disambiguation is embedded in the auxiliary branches. The module finally determines the progressive feature tendency of samples and generates the dynamic latent difference threshold for sample selection. (c) Both probability- and difference-based dynamic thresholds are involved in sample selection to partition the samples into clean and hard sets. Here is a demonstration how the image instance
Our main contributions in this paper are summarized as follows.
To adapt to PPLN, a two-phase assistant module is integrated into FTLTD-Net as auxiliary branches to learn and predict the progressive feature tendency.
A branch-consistent focal loss function is proposed and incorporated to learn the consistent pattern between the main and auxiliary branches, and to counter occasional sample imbalance.
A special metric, LT-Diff, is derived from the progressive feature tendency prediction, and the double layback-ratio trick is employed in the iteration of the LT-Diff-based thresholds for stricter sample selection criteria.
Penalty terms based on LT-Diff are appended to the sample selection loss functions to force the LT-Diff-based thresholds to decrease.
Related work
Noisy labels are referred to as inaccurate labels corrupted from the real ones. Due to their powerful representational capability, DNNs are sensitive to, or can even memorize, noisy labels. Learning from noisy labels aims at establishing robust training paradigms for DNNs against noisy samples. Intensive efforts have been made to address this problem, and the existing approaches are categorized into several orthogonal directions. 15 In this section, we reorganize the taxonomy from the perspective of our motivation.
Noise-type-oriented mechanism
The noise type can be regarded as the hypothesis about how the labels are corrupted from the ground truth. Generally, the corruption probability is essentially affected by the dependency between data features and class labels. Label noise can be categorized into instance-independent noise (IIN) and instance-dependent noise (IDN).
For IIN, the prototypical approach was to design dedicated architectures by simulating the label transition process, with the primary technique being to add a noise adaptation layer.16,17 Patrini et al. 18 designed a two-procedure approach that is agnostic to both the application domain and network architecture. Yet IIN is an impractical assumption for real-world datasets due to substantial variance among training samples. Recent studies have focused on IDN, which gives more consideration to instance-specific confusing factors. Xia et al. 19 proposed part-dependent label noise to describe IDN by approximating a combined transition matrix from parts of instances, but it relies on loss correction by estimating noise rates. Later, Cheng et al. 20 designed a confidence-regularized sample sieve to avoid estimating the noise rate. Observing that these methods focused on first-order statistics, Zhu et al. 21 exploited second-order statistics to transform an IDN problem into a new class-dependent label noise. Garg et al. 22 presented a graphical modelling approach named InstanceGM, combining discriminative and generative models. Yang et al. 23 investigated directly modelling the transition from Bayes optimal labels to noisy labels (Bayes-Label Transition Matrix). Observing low search efficiency in instance-dependent transition matrices, Zhang et al. 24 introduced a human-cognition-assisted structured transition matrix network (STMN) for better matrix estimation performance.
However, none of these have incorporated into network design a special noise type where there are no definite criteria discriminating adjacent classes and the confusion concentrates at the decision boundary. PPLN in pneumoconiosis CXR dataset is a typical example, which is considered in our proposed FTLTD-Net.
Sample selection
Sample selection attempts to select true-labelled examples and exclude false-labelled ones from a noisy training dataset. Empirical studies have shown that deep networks memorize simple, generalized patterns in the early phases of training.25,26
Initially, Malach et al. 27 proposed a meta-algorithm that maintains two networks and updates them using the training data selected based on their disagreement. Several pioneering researchers assumed that the clean (i.e., potentially true-labelled) samples tend to have small loss values.28–31 Han et al. 28 proposed a co-teaching method that simultaneously trains two networks to filter possible noisy samples according to the training loss of each minibatch. Li et al. 31 proposed a framework, DivideMix, inspired by semi-supervised learning techniques for learning with noisy labels. DivideMix dynamically divides training samples into a clean set and a corrupted set, and then trains models on both sets with the MixMatch 32 strategy. Later, some researchers found that the small-loss trick does not always work. Lyu et al. 33 proposed adaptively selecting trusted samples for model training by a designed curriculum loss. Lu et al. 34 proposed a method called self-ensemble label correction (SELC) to correct noisy labels using model predictions of historical epochs. Inspired by SELC, Wang et al. 35 designed a novel stochastic neural ensemble learning (SNEL) with the idea of collecting parameters simultaneously both across model instances and along optimization trajectories.
Although some approaches do not explicitly state the targeted noise type, sample selection can intuitively handle a training dataset with IDN based on confidence scores calculated from samples. Rather than using a fixed confidence value, 36 some recent studies employed an instance-specific threshold mechanism dynamically or adaptively. Li et al. 14 proposed dynamic instance-specific selection and correction (DISC) with a dynamic threshold strategy for all the instances, each of which is grouped into subsets marked clean, hard, and purified. By employing semi-supervised learning, Li et al. 37 devised a novel adaptive threshold for unlabelled instances, providing a bounded probabilistic guarantee for the correctness of the pseudo-labels it assigns. Inspired by curriculum learning, 33 Wu et al. 38 proposed a time-consistency curriculum to select samples with consistent predictions over training epochs. However, most of these methods focused on probability-based confidence scores transformed into adaptive thresholds for sample selection, and did not consider the intrinsic feature tendency and difference that arise in PPLN. The proposed FTLTD-Net utilizes a latent difference metric to investigate a novel form of dynamic-threshold sample selection.
Method
As shown in Figure 3, the fundamental architecture of FTLTD-Net consists of three modules: (a) the main branch serving as a common classifier; (b) the PNLD (Potential Neighboring Label Disambiguation) module containing several auxiliary branches; and (c) the dual-threshold sample selection leveraging a difference-based measurement. Such an architecture considers both class-conditional and instance-dependent factors of PPLN. The auxiliary branches in PNLD are designed to match the corresponding classes to simulate the potential class-conditional factor of PPLN, and discern the causal pattern that results in adjacent-class ambiguity. This auxiliary module aims at analytically learning from the pattern and making an assertion about which of the two adjacent classes has a more definite possibility of being confused. The assertion is based on the newly-proposed
In this section, we first formulate the problem to be solved, especially the special label noise type PPLN, and demonstrate in detail the modules introduced into this novel architecture.
Problem formulation
Objective
Let the grey-scale image space be denoted as
Progressive pair label noise
Here we define this special label noise type with intrinsically continuous feature and sparsely discrete classes. For a given label
Potential neighbouring label disambiguation
This module is introduced into the architecture to explore the class-conditional factor of PPLN by learning and predicting the PFT. Its main structure is several auxiliary branches, the number of which is equal to that of the classes. Working in parallel to the main branch, the auxiliary branches diverge from the tail of a backbone network (e.g., a ResNet 39 ), while the head remains.
Module input
The backbone head
In general, PNLD consists of two phases, each of which starts with a feature split to adopt matching samples in a mini-batch
Feature split
The two-phase module performs different processes to learn and predict the PFT. That is, PNLM adopts adjacent-class samples while PFTD adopts the corresponding ones. Formally, for a given sample
The corresponding probability vectors of the features are expressed as follows:
Potential neighbouring label mining
PNLM utilizes the auxiliary branches to mitigate the latent ambiguity under PPLN, by inputting each sample to two branches corresponding to its GPNLs based on its given label.
Branch input
The underlying goal of PNLM is to train C high-accuracy binary classifiers (BCFs). Each
PFT mining
In this phase, each branch learns PFT from its adjacent-class samples; for example, Branch 1 mines the critical discrimination between classes 0 and 2. This is based on the fact that the features of pneumoconiosis CXRs develop along with the progression of the disease. Figure 4 depicts that Branch 1 observes a relatively clear tendency showing whether a sample it adopts conveys mild or severe features from the standpoint of Stage 1 of pneumoconiosis. By standpoint, we mean that each branch acts as if it were a dedicated and strict expert that assigns each sample a PFT in one of the two directions. After several iterations, all the auxiliary branches may learn specialized class-conditional knowledge.
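The routing described above can be sketched in a few lines. This is only an illustrative sketch under stated assumptions, not the implementation used in this study: the function names are hypothetical, the binary target encoding (0 for the milder side, 1 for the more severe side of a branch's standpoint) is assumed, and margin classes are handled here simply by dropping the non-existent neighbour, which may differ from the treatment of Case 2.

```python
from collections import defaultdict


def neighbouring_branches(label, num_classes):
    """Indices of the auxiliary branches that adopt a sample with this given label.

    Assumption: a sample labelled y is fed to branches y-1 and y+1, so that
    branch j sees samples labelled j-1 and j+1 (e.g. Branch 1 sees classes 0 and 2).
    Neighbours that fall outside the label range are dropped.
    """
    if not 0 <= label < num_classes:
        raise ValueError("label out of range")
    return [j for j in (label - 1, label + 1) if 0 <= j < num_classes]


def route_minibatch(labels, num_classes):
    """Map each branch index to a list of (sample index, binary tendency target)."""
    routed = defaultdict(list)
    for idx, y in enumerate(labels):
        for j in neighbouring_branches(y, num_classes):
            # Hypothetical encoding: 0 = milder than stage j, 1 = more severe.
            routed[j].append((idx, 0 if y < j else 1))
    return dict(routed)


batch_labels = [0, 2, 1, 3, 2]
for branch, items in sorted(route_minibatch(batch_labels, num_classes=4).items()):
    print(branch, items)
```

In this toy mini-batch, Branch 1 receives one milder sample and two more severe ones, which also illustrates how a branch can see an unbalanced pair of classes within a single mini-batch.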

Demonstration of
Ideally, to acquire the specialized knowledge as precisely as possible, all C classifiers should be well-trained to distinguish the samples of two adjacent classes by regarding the output as a confidence score measuring the PFT. And the observed PFT should strictly point to the two adjacent classes. However, several special cases remain to be discussed and addressed under the setting of PPLN, as illustrated in Figure 5.

Ideal and actual cases that all possible auxiliary branches may meet.
Case 1: PPLN in given labels
The existence of PPLN implies that the given labels in the dataset are probably noisy. Two possible sub-cases are considered as illustrated in Figure 5. Branch
Case 2: Margin classes
Apparently, if the label

Demonstration of special
It turns out that all the auxiliary branches can learn PFT at their respective standpoints in spite of the two special cases with the goal of robustness against PPLN. For the model to learn consistent PFT in this phase, we propose Branch-Consistent Focal Loss (BCFL) as
In addition, BCFL plays an essential role in managing occasional sample imbalance in a mini-batch, which may cause a certain auxiliary branch to learn biased tendency. That is, as in Branch

Demonstration of occasional sample imbalance in certain mini-batches.
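To illustrate why a focal-type loss helps against imbalance, the following minimal sketch implements the standard binary focal loss, not BCFL itself, whose branch-consistency formulation is not reproduced here. The function name and the focusing parameter value gamma = 2.0 are assumptions made for illustration only.

```python
import math


def binary_focal_loss(p_positive, target, gamma=2.0):
    """Standard binary focal loss for one sample.

    p_positive: predicted probability of the positive class.
    target: 0 or 1.
    gamma: focusing parameter (assumed value); gamma = 0 gives cross-entropy.
    """
    p_t = p_positive if target == 1 else 1.0 - p_positive
    p_t = min(max(p_t, 1e-12), 1.0)  # guard against log(0)
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


# A confidently correct sample is down-weighted far more than an uncertain one.
for p in (0.9, 0.6):
    ce = binary_focal_loss(p, 1, gamma=0.0)
    fl = binary_focal_loss(p, 1, gamma=2.0)
    print(p, round(fl / ce, 4))
```

The ratio of focal loss to cross-entropy equals (1 - p_t)^gamma, so the many easy samples of an over-represented class contribute little, while the few uncertain samples keep most of their weight.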
Progressive feature tendency determination
The main task of all auxiliary branches in PNLM is to learn the PFT in either direction at their respective standpoints. In this phase, building upon the knowledge in the previous phase, each branch continues to serve as a BCF to evaluate the latent PFT of the samples.
Branch I/O
The auxiliary branches in PFTD retain the identical structure and parameters to those in PNLM, although they may adopt samples differently. Unlike PNLM, each Branch j in PFTD takes the samples in the form of

Demonstration of progressive feature tendency determination in a certain auxiliary branch.
Based on the PFT discussed above, each component represents the confidence score evaluating the tendency of a sample in the corresponding direction. In fact, the direction of tendency is equivalent to the possible direction of PPLN, depicting the probability distribution of the potential neighbouring true labels. The two components therefore serve as a direct metric of the degree of PPLN.
Tendency regularization
For its flexibility, PFTD appends one additional operation called Sharpen as regularization following some previous work.31,41 Sharpen is formulated as
Sample selection based on dual dynamic thresholds
Sample selection picks potentially true-labelled samples to mitigate the memorization effect of DNNs. To investigate additional evidence for measuring the memorization strength of model, we propose the
Dynamic instance-specific latent tendency difference
Latent tendency difference
Formally, the LT-Diff for sample
Dual dynamic thresholds
The probability-based and difference-based thresholds are obtained by epoch-wise iteration as
Double layback-ratio trick
Here we introduce a small trick on the layback ratio based on a very common mathematical principle. That is, given two numbers with a fixed sum, their difference changes by 2 units when either increases or decreases by 1 unit. In the calculation of LT-Diff, such a property does naturally exist, which we consider incorporating in the threshold iteration. This can be simply adapted by doubling its variation factor
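The arithmetic principle behind the trick can be checked directly. The sketch below only verifies the fixed-sum property; the helper name is hypothetical, and how the doubled factor enters the threshold iteration is not reproduced here.

```python
def difference_change(total, first_before, first_after):
    """Change in (first - second) when first moves and first + second stays fixed."""
    second_before = total - first_before
    second_after = total - first_after
    diff_before = first_before - second_before
    diff_after = first_after - second_after
    return diff_after - diff_before


# With a fixed sum of 10, raising one number from 6 to 7 moves the
# difference from 6 - 4 = 2 to 7 - 3 = 4, i.e. by twice the step.
print(difference_change(total=10, first_before=6, first_after=7))  # 2
```

Algebraically, if a + b = s then a - b = 2a - s, so any change in a alters the difference by twice that amount.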
Sample selection
Additional difference-based criterion
In sample selection, all training samples are assessed and selected into two sets: clean and hard. For sample
Penalty terms for LT-Diff
Subsequently, the samples in
Overall loss function
The previous sections about our proposed method are based on the perspective that PPLN is caused by both class-conditional and instance-dependent factors. Accordingly, the overall objective is adapted as
The benefits of the above setup are twofold. (1) Warm-up is a procedure in which all samples in the dataset are trained with the standard Cross-Entropy loss for a few epochs, and it is prone to be influenced by class-conditional label noise. We believe that, working as virtual specialized experts, the auxiliary branches in FTLTD-Net tend to learn the specific class-conditional knowledge in PPLN, which is correlated with the underlying IIN factors. Moreover, unlike the confidence penalty 31 and the uncertainty assessment, 8 focusing on the knowledge in the auxiliary branches is a reasonable way of fitting the latent distribution pattern of PPLN. Hence, in warm-up procedure,
Experiment
Dataset
The CXRs of pneumoconiosis used in this study were collected from a hospital in Shanxi Province, China. In practice, evaluating a model trained with label noise requires that the test set be true-labelled. Therefore, three experienced radiologists, who have been working on pneumoconiosis for more than 10 years, participated in this study to select the clean (true-labelled) CXR samples. To select true-labelled CXRs, the radiologists reviewed in detail the diagnostic conclusions of all related patients through the corresponding radiological reports. The annual physical examination records of every individual patient were taken into consideration because the critical time points of stage determination are essential for label correctness. Based on the historical diagnostic information, the radiologists re-assessed the corresponding CXR films to exclude imprecise staging conclusions. CXRs were selected into the test set only when agreed upon by all three radiologists. In summary, a total of 2014 CXRs are involved in this study, as listed in Table 1.
Structure of the pneumoconiosis CXR dataset and its split into train and test sets.
Implementation details
Preprocessing
Originally, all the pneumoconiosis CXR data are 16-bit grey-scale images in the shape of irregular rectangles and contain text-based personal information such as names and dates of examination, which is irrelevant to the radiological features of the disease and may distract the model’s attention. We first anonymize the images by erasing the printed text-based information, which prevents the model from memorizing non-radiological features. Then all images are centre-cropped to regular squares and resized to
Synthetic label noise
We assume that intrinsic PPLN, denoted as Original in the context, exists in the real-world pneumoconiosis CXR dataset, with its exact latent noise rate remaining unclear. Besides, to demonstrate the effectiveness of the proposed model, we also synthesize the enhanced PPLN considering both the class-conditional and instance-dependent factors by mixing IDN and asymmetric IIN with a given noise rate
Experiment setup
All experiments are conducted with Python 3.8 and PyTorch 1.11.0 on a PC equipped with Ubuntu 20.04 as the operating system, an Intel Xeon Silver 4210R CPU, and two NVIDIA RTX A5000 GPUs, each with 24 GB memory. Here, we list the general hyperparameters. The whole training process is iterated for 200 epochs, following most methods for learning from noisy labels. We show that sufficient iteration is needed to bring out the dynamic properties of the thresholds (see Section “Ablation study”). The Adam optimizer with a weight decay of 0.0001 is employed with an initial learning rate of 0.001, which is divided by 10 at epochs 45 and 80. The batch size is set to 16. Other hyperparameters involved in Section “Method” are as follows. For the Focal Loss in auxiliary branches,
Comparative study
We compare our proposed method with several state-of-the-art (SOTA) methods mentioned in Section “Related work”. For a fair comparison, all the comparative experiments in this study use ResNet18 39 as the backbone. These SOTA methods are implemented using their released code and their key features are briefly described as follows.
Table 2 lists the results of all compared methods under different synthetic PPLN rates. In general, our FTLTD-Net outperforms the SOTA methods in almost all metrics, though DISC remains competitive in terms of F1 score and AUC for the original PPLN. All compared methods degrade to varying extents as the PPLN rate increases, and the methods not adapted to PPLN are relatively more sensitive to this rise. Specifically, the classical methods dealing with label noise demonstrate the effectiveness of their basic training strategies. Co-teaching, Co-teaching+ and DivideMix all employ two networks and the small-loss trick, and all outperform the Base. Under PPLN, the disagreement update in Co-teaching+ and the co-divide in DivideMix work better than Co-teaching, even though their effectiveness declines as the PPLN rate grows. Co-divide also handles PPLN better than both Co-teaching and Co-teaching+. Self ensemble in SELC can handle low-rate PPLN, but its performance degrades as the PPLN rate reaches 30%. One possible interpretation is the accumulation of increasingly confusing probabilities as training progresses. DMUE has a similar architecture involving multiple auxiliary branches, but performs worse than our FTLTD-Net at all PPLN rates. We believe that the directionality of PFT helps eliminate the ambiguity between neighbouring classes, whereas DMUE adopts all other possible latent classes in each auxiliary branch and thus lacks directional relevance. Unlike facial expression recognition, CXRs have stronger inter-class similarity, such as similar positions of the lung outline and similar ranges of grey-scale intensity, because all the images convey radiological features in the lung fields owing to the penetrating ability of X-rays. The binary property of bidirectional PFT can relieve the influence of inter-class similarity to some extent.
We also believe that the dual dynamic thresholds assist in the selection of trusted samples for a model that is robust against PPLN. Because it conducts purely probability-based dynamic sample selection, DISC falls slightly behind our FTLTD-Net. A plausible explanation is that the LT-Diff-based dynamic thresholds are more effective in discriminating CXR samples with adjacent-class confusion under PPLN.
Results (%) of comparative study between the state-of-the-art methods and our proposed FTLTD-Net.
The experiments are conducted under different label noise (PPLN) rates. Original means the intrinsic real-world PPLN in pneumoconiosis dataset while other percentages indicate the synthetic PPLN.
Compared with the above methods, our proposed FTLTD-Net degrades more slowly as the PPLN rate increases. We employ confusion matrices to demonstrate the robustness of FTLTD-Net across the stages of pneumoconiosis, as depicted in Figure 9. Stages 0 and 3 are less susceptible to lower rates of PPLN, while the robustness for Stages 1 and 2 decreases more markedly as the PPLN rate increases. The model tends not to assign a sample to a non-adjacent class, which avoids unreasonable classification. Overall, the model maintains stable robustness across the stages of pneumoconiosis despite the increasing noise rate.
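The observation about non-adjacent classes can be quantified from any confusion matrix as the share of misclassifications that land on a neighbouring stage. The sketch below is a hypothetical helper, and the matrix in it consists of made-up toy counts for illustration only; it is not taken from Figure 9 or from any experiment in this study.

```python
def adjacent_error_share(confusion):
    """Fraction of misclassified samples predicted as a neighbouring stage.

    confusion[i][j] is the number of samples of true stage i predicted as stage j.
    Returns None when there are no misclassifications at all.
    """
    adjacent = 0
    total_errors = 0
    for true_stage, row in enumerate(confusion):
        for predicted_stage, count in enumerate(row):
            if predicted_stage == true_stage:
                continue  # correct predictions are not errors
            total_errors += count
            if abs(predicted_stage - true_stage) == 1:
                adjacent += count
    return adjacent / total_errors if total_errors else None


# Toy counts (illustrative only): one Stage-0 sample is predicted as Stage 2.
toy_confusion = [
    [50, 3, 1, 0],
    [4, 40, 5, 0],
    [0, 6, 30, 2],
    [0, 0, 1, 20],
]
print(adjacent_error_share(toy_confusion))
```

A value close to 1 indicates that nearly all errors stay between neighbouring stages, which is the behaviour expected under PPLN.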

Confusion matrices of classification results of our proposed FTLTD-Net under different PPLN rates.
Ablation study
To verify the effectiveness of each key component and critical parameters, we conduct a series of ablation studies, evaluating all possible combinations of key components and analyzing how the model performance changes with varying critical hyperparameters.
Component analysis
We conduct experiments for all possible combinations of the essential components proposed in this paper, including BCFL, DLRT, and LT-Diff penalty terms (PTs). The first two cannot appropriately be removed from the model directly, but can be replaced with their degraded forms. A degraded form of BCFL can be obtained by making auxiliary branch predictions subject to the given labels, weakening the consistency with the main branch. And the mode of iteration of LT-Diff thresholds can be degraded to be the same as probability-based thresholds, with
Statistics for component analysis.
Degraded BCFL is obtained by making auxiliary branch predictions subject to the given labels, and degraded DLRT is obtained with
The results indicate that each of the three modules contributes to improving model performance, and that their combination yields the best results. Besides tackling data imbalance at the mini-batch level, BCFL helps minimize the deviation between the auxiliary and main branches and ensures that the features extracted from them remain consistent. Inspired by how the values of LT-Diff change, DLRT directly affects the strictness of the LT-Diff-based sample selection criterion, which makes the difference-based thresholds iterate at a relatively faster rate. In contrast, their degraded forms lack such capabilities. It turns out that branch-wise consistency of parameters and a stricter sample selection criterion account for better statistics, even if the effect is less pronounced when either or both of the two modules are integrated in the model. In addition, the decrease of LT-Diff thresholds should be attributed to the PTs, without which the probability-based thresholds are the only ones serving to partition the samples. For one thing, combining PTs with BCFL positively influences probability-based sample selection through stronger branch-wise consistency. For another, DLRT might enhance the effect of PTs on the model through a stricter criterion for difference-based sample selection.
To understand why LT-Diff thresholds are significant to model performance, we depict the variation of LT-Diff threshold distributions across epochs under different PPLN rates using histograms, as shown in Figure 10. Two typical phenomena appear in the evolving histograms: the distribution descends to low values under all PPLN rates and exhibits two centres after several epochs, and the model trained with a higher PPLN rate tends to obtain a denser LT-Diff distribution at lower thresholds. From the perspective of LT-Diff, whether a sample is clean depends locally on how its value changes between neighbouring epochs. If the LT-Diff of a sample is temporarily less than its threshold, the sample is regarded as clean at this epoch. Globally, the number of times the sample is regarded as clean reflects the trend for its threshold to decrease. Hence, over all training epochs, the samples near the low-threshold centre are more likely to be clean, while those near the higher-threshold centre are more likely to be hard. The remaining uncertainty arises from the combined selection strategy and the probability-based thresholds.

Histograms of the LT-Diff threshold distribution across epochs. The three rows correspond to PPLN rates of 10%, 20% and 30%, respectively. The ratios indicate the proportion of samples within the corresponding instance-specific threshold intervals.
Hyperparameter analysis
We select several critical hyperparameters to evaluate their impact on model performance. The candidate values are chosen at intervals based on prior research and empirical testing to avoid excessively increasing the computational burden. Since FTLTD-Net is a sample selection method, the chosen hyperparameters concern the dynamic thresholds and the loss functions. The ratios of clean and hard samples to the whole dataset reflect the model’s determination of sample reliability, and the stability of their epoch-wise variation demonstrates the impact of each hyperparameter on model performance and consistency.
Effect of temperature T
The results for several candidate values of T are listed in Table 4. Figure 11 shows how the overall LT-Diff threshold level influences the ratios of clean and hard samples in the training set. The Sharpen operation in equation 10 smooths the probabilities of PFT. A higher T reduces the gap between the two PFT directions, resulting in lower LT-Diff values. When the thresholds are relatively fixed, the model then tends to regard more samples as clean. In contrast, a lower T makes samples more likely to be identified as hard. An appropriate T keeps the ratios of clean and hard samples stable in the later epochs, enabling the model to partition samples consistently and with high confidence.
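The effect of T on the gap between the two PFT directions can be checked numerically. The sketch below assumes a common temperature-sharpening form, in which each probability is raised to the power 1/T and the result is renormalized; this form is an assumption and may differ from equation 10. The helper direction_gap and the two-element probability vector are likewise hypothetical stand-ins for the quantities that enter LT-Diff.

```python
from typing import List


def sharpen(probs: List[float], temperature: float) -> List[float]:
    """Temperature sharpening (assumed form): p_i^(1/T) / sum_j p_j^(1/T).
    T < 1 sharpens the distribution; T > 1 flattens it."""
    powered = [p ** (1.0 / temperature) for p in probs]
    total = sum(powered)
    return [v / total for v in powered]


def direction_gap(probs: List[float], temperature: float) -> float:
    """Absolute gap between the two sharpened direction probabilities
    (a hypothetical proxy for the scale of LT-Diff)."""
    s = sharpen(probs, temperature)
    return abs(s[0] - s[1])


# Hypothetical PFT probabilities for the two directions.
pft = [0.7, 0.3]
gap_low_t = direction_gap(pft, 0.5)   # lower T: larger gap
gap_high_t = direction_gap(pft, 2.0)  # higher T: smaller gap
```

Under these assumptions, gap_high_t is smaller than gap_low_t, which is consistent with the observation that a higher T lowers LT-Diff values and so leads the model to regard more samples as clean when the thresholds are relatively fixed.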

Line charts show how
Accuracy, F1 score, and AUC with different sharpen temperatures.
Effect of layback ratios
The model performance after introducing DLRT is shown in Section “Component analysis”. And Table 5 lists how the evaluation metrics change as

Line charts show how
Accuracy, F1 score, and AUC with different layback ratios.
Adversarial effect of temperature T and layback ratios
The two groups of hyperparameters affect the model’s judgement of sample reliability in different respects. Temperature T directly influences the scale of LT-Diff, which is the evaluation result for the class-conditional factor of PPLN. On the other hand, layback ratios
Effect of PT coefficients
Table 6 presents the statistics and Figure 13 depicts the loss curves under different PT coefficients. The results indicate that the model performs better when PTs take relatively larger weights (

Loss curves with different
Accuracy, F1 score, and AUC with different PT coefficients.
Limitations and future work
Despite the modestly superior performance of the proposed FTLTD-Net in dealing with PPLN on the pneumoconiosis CXR dataset, its generalizability has some limitations. One main concern is data imbalance: Stage-0 and Stage-1 samples are dominant while the other stages are relatively rare, at both the mini-batch and dataset levels. FTLTD-Net is more susceptible to imbalance at the mini-batch level because of the PFT learning and predicting paradigm in the auxiliary branches, whereas it largely ignores bias at the dataset level, which can cause the model to underperform on underrepresented stages. The other concern is differing data distributions. The dataset used in this paper may be distributed quite differently from real-world ones. We acknowledge the potential sampling bias, as the data predominantly come from a specific hospital, which could limit the model’s applicability to other populations with different demographics or clinical settings. In addition, data imbalance may limit the generalizability of the model to real-world scenarios where the data distribution varies. Therefore, techniques such as resampling and domain adaptation should be used in future work to ensure data balance and sampling fairness when evaluating the model on different datasets. Besides, the special label noise formalized in this paper can in theory occur in other scenarios where the underlying feature changes continuously but the classification criteria are not clear enough. A further limitation of this study is the difficulty of identifying suitable scenarios to which the proposed method can be transferred. Future work should therefore explore other scenarios to which PPLN applies.
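As one illustration of the resampling mentioned above, the following sketch computes inverse-frequency sampling weights so that each stage contributes equally in expectation. It is a generic technique, not part of FTLTD-Net; the function name and the toy stage labels are hypothetical.

```python
from collections import Counter
from typing import List


def inverse_frequency_weights(labels: List[int]) -> List[float]:
    """Per-sample weights proportional to 1 / class frequency, scaled so
    that every class receives the same total weight and the weights sum
    to the number of samples."""
    counts = Counter(labels)
    n_classes = len(counts)
    total = len(labels)
    return [total / (n_classes * counts[y]) for y in labels]


# Hypothetical stage labels dominated by Stage 0 and Stage 1.
labels = [0, 0, 0, 0, 1, 1, 1, 2]
weights = inverse_frequency_weights(labels)
```

Such weights could be passed to a weighted sampler when mini-batches are drawn. Whether this interacts well with the PFT learning in the auxiliary branches would need to be verified experimentally.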
Conclusion
In this paper, we propose a novel multi-branch architecture named FTLTD-Net for pneumoconiosis diagnosis, which addresses a special type of label noise referred to as PPLN. FTLTD-Net incorporates a two-phase auxiliary module to learn and predict the underlying PFT and thereby mitigate the impact of PPLN. It also employs a special metric, LT-Diff, to obtain a difference-based threshold for better dynamic selection of trusted samples. Extensive experiments demonstrate that, as the PPLN rate increases, the proposed method outperforms the SOTA methods owing to its lower sensitivity to PPLN, and that the proposed modules consistently improve performance across all evaluation metrics. However, we acknowledge several limitations: data imbalance and sampling bias arising from the dataset’s hospital-specific origin may affect the model’s generalizability. Future work will focus on applying resampling and domain adaptation techniques to address data imbalance, as well as exploring the transferability of our approach to other scenarios involving PPLN.
Footnotes
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work is supported by National Natural Science Foundation of China [grant numbers 62376183, U21A20469]; Funding Project of Local Science and Technology Development Guided by Central Government of China [grant number YDZJSX2022C004]; Open Project of National Health Commission Key Laboratory of Pneumoconiosis, China [grant number YKFKT004]; Special Fund for Science and Technology Innovation Teams of Shanxi Province, China [grant number 202304051001009]; and the Non-profit Central Research Institute Fund of Chinese Academy of Medical Science [No. 2020-PT320-005].
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
