Abstract
Background
Pneumoconiosis staging is challenging due to the low clarity of X-ray images and the small, diffuse nature of the lesions. Additionally, the scarcity of annotated data makes it difficult to develop accurate staging models. Although clinical text reports provide valuable contextual information, existing works primarily focus on designing multimodal image-text contrastive learning tasks, neglecting the high similarity of pneumoconiosis imaging representations. This results in inadequate extraction of fine-grained multimodal information and underutilization of domain knowledge, limiting their application in medical tasks.
Objective
The study aims to address the limitations of current multimodal methods by proposing a new approach that improves the precision of pneumoconiosis diagnosis and staging through enhanced fine-grained learning and better utilization of domain knowledge.
Methods
The proposed
Results
We collected and created the pneumoconiosis chest X-ray (PneumoCXR) dataset to evaluate our proposed MSK-PT method. The experimental results show that our method achieved a classification accuracy of 81.73%, outperforming the state-of-the-art algorithms by 2.53%.
Conclusions
MSK-PT showed diagnostic performance that matches or exceeds the average radiologist's level, even with limited labeled data, highlighting the method's effectiveness and robustness.
Introduction
Pneumoconiosis is currently one of the most serious and common irreversible occupational diseases in China. 1 It is caused by the long-term inhalation of dust, leading primarily to lung fibrosis. As the disease progresses, the lung tissue hardens and calcifies, ultimately resulting in extreme difficulty in breathing, restricted mobility, loss of labor capacity, and eventual death due to respiratory failure. 2 According to the “Diagnostic Criteria for Occupational Pneumoconiosis” (GBZ 70-2015), 3 the staging of pneumoconiosis follows the guidelines of the International Labour Organization (ILO), 4 where chest X-ray (CXR) images of patients are compared with standard reference images 3 to classify the disease into four stages: Stage-0 (normal), Stage-1, Stage-2, and Stage-3, based on the extent and severity of the lesions. This diagnostic process is complex and highly subjective, requiring radiologists to have extensive knowledge and experience, which makes the annotation of pneumoconiosis images costly and time-consuming. Although the use of automated pneumoconiosis assessment methods, such as computer-aided diagnosis (CAD), has significantly increased in recent years, the inherent low clarity of X-ray images and the high inter-class similarity of pneumoconiosis imaging features continue to pose challenges for accurate diagnosis. Therefore, developing an objective and reliable intelligent diagnostic technique is of great clinical significance for the accurate diagnosis of pneumoconiosis.
In recent years, the rapid development of deep learning technology has driven revolutionary progress in the field of computer vision and has been widely applied in medical image analysis.5–8 However, most existing methods9,10 primarily focus on using image data for the diagnosis of pneumoconiosis, overlooking the expert textual annotations that accompany the images. In medical image diagnosis, textual reports can provide rich semantic information and serve as important auxiliary signals, especially in accurately identifying complex disease categories. However, the fine-grained relationships between pneumoconiosis images and their corresponding textual reports remain unclear, as illustrated in Figure 1. This may be due to two main reasons: (1) the degree and progression of lesions in pneumoconiosis patients vary at different stages, resulting in minimal differences in lesion characteristics between similar images; 3 (2) most descriptions in the textual reports pertain only to specific sub-regions of the corresponding medical images. 11 This implies that much of the information may be irrelevant to our analysis, necessitating more granular recognition. Traditional contrastive learning methods12,13 struggle to accurately capture fine-grained representations between modalities, thus limiting the effective utilization of medical image and text data. Consequently, achieving joint training of images and texts remains a highly challenging task.

Illustration of the similarity between images and clinical text reports. Examples are from samples in the PneumoCXR dataset.
Considering the integration of image-text semantic dependency relationships, multimodal pre-training models provide a promising solution. These models not only focus on the features of similar data themselves but also explore dependencies between different instances, aiming for improved performance in classification tasks. In general image classification, Li et al. 14 proposed an unsupervised domain hint distillation framework designed to transfer knowledge from a larger teacher model to a lightweight target model by leveraging hints from unlabeled domain images. Bica et al. 15 introduced a method for pre-training fine-grained multimodal representations from image-text pairs, enhancing performance in image-level and region-level tasks. Similarly, in medical image diagnosis, Monajatipoor et al. 16 introduced BERTHop, utilizing a pre-trained language encoder (BlueBERT) to address domain gap issues and capture correlations between two modalities more effectively. Moon et al. 17 employed a BERT-based architecture combined with a novel multimodal attention masking scheme to maximize generalization performance in visual-language understanding tasks such as diagnostic classification, medical image report retrieval, medical visual question answering, and radiology report generation. Liu et al. 18 proposed a method for better contrastive learning in medical visual-language pre-training by categorizing medical image-text pairs into positive, negative, and neutral groups, thereby facilitating the construction of more suitable contrastive losses for continual improvement in cross-modal retrieval and image classification tasks. You et al. 19 investigated a modal-shared contrastive language-image pre-training framework to optimize parameter sharing in the transformer model, promoting the transfer of common semantic structures between language and vision.
The aforementioned studies demonstrate that researchers have explored various robust tasks using multimodal models and proposed a series of methods to mitigate classification errors caused by relying solely on individual instances. However, in scenarios with low paired distinctiveness, existing multimodal methods primarily focus on capturing global information from each modality, neglecting the importance of cross-modal perception of local information. Consequently, these models exhibit limited capability in fine-grained classification of similar structures and lack the ability to effectively comprehend fine-grained details of input data, which restricts their performance in tasks that require a more nuanced understanding. To address these limitations, researchers have started exploring techniques that enable finer alignment and comprehension. Peng et al. 20 achieved fine-grained alignment and comprehension by generating contextually consistent precise location markers, associating text descriptions with specific visual elements in images. You et al. 21 adopted a hybrid region representation approach that combines discrete coordinates with continuous features, capturing spatial relationships in images more accurately and facilitating precise alignment between language and visual content. Additionally, Chen et al. 22 introduced an extra location-aware module to enhance the understanding of local information. By integrating information at the local level, these models demonstrated superior performance in region- or object-level tasks that require precise multimodal understanding.
While the aforementioned approaches provide insights into fine-grained understanding within general image modalities, they are constrained by the benchmark tasks of those modalities. In pneumoconiosis imaging, lesion areas exhibit diffuse interstitial fibrosis, which makes it difficult for existing models to perform well, so fine-grained understanding of pneumoconiosis images needs further exploration. Given the low discriminability and high similarity of these images, it is necessary to construct effective multimodal, multi-granular models for understanding images and text, which in turn requires deeper analysis and quantification. Moreover, modeling the correlation between visual representations and domain knowledge is essential for capturing underlying discriminative features, and this complicates model design. Resolving these technical issues is the focus and main challenge of this study.
In this study, we propose a novel
In brief, this work contributes to the field by:
We proposed a multi-modal similarity-aware learning and knowledge-driven pre-training method called MSK-PT, designed for robust pneumoconiosis staging diagnosis. We employed a multi-level text-image alignment strategy to understand both coarse and fine-grained multi-modal representations, while developing a similarity-aware modality alignment module to capture the consistency and differences between similar imaging features, thereby exploring local representations more effectively.
We incorporated data-associated features and domain knowledge as priors and constraints in the model to enhance the training and inference of downstream visual tasks. Additionally, by introducing an uncertainty threshold, we effectively mitigated the impact of erroneous labels on model performance and provided reliable visual cues for clinical decision-making through Gradient-weighted Class Activation Mapping (Grad-CAM).
We collected and constructed a pneumoconiosis chest X-ray (PneumoCXR) dataset for method development and extensive evaluation. Experimental results demonstrate that even with limited labeled data, the MSK-PT method outperforms existing learning methods.
The remainder of this paper is organized as follows: Section 2 summarizes and discusses the existing research on related technologies. Section 3 provides detailed information about the MSK-PT model. Section 4 describes the experimental setup. Section 5 presents the quantitative and qualitative evaluation of MSK-PT's performance compared to existing methods on internal test datasets and conducts extensive ablation studies. Section 6 discusses our limitations and future research work. Section 7 concludes the paper.
Related work
Image-text contrastive learning
Contrastive learning (CL), a self-supervised learning technique, has achieved significant advancements in the field of deep learning.23–25 It focuses on the similarities and differences between data pairs, enhancing network representation learning by maximizing the mutual information between the input and its representation. In recent years, extensive research26–28 has provided a diverse foundation for the development of contrastive learning, covering various models, loss functions, and pretext tasks. In computer vision, He et al. 23 treated contrastive learning as a dictionary look-up, using a dynamic dictionary with a queue and a momentum encoder. Chen et al. 24 employed random data augmentation, viewing multiple augmented versions of the same image as positive samples and other images as negative samples to optimize the contrastive loss. Grill et al. 29 introduced a slowly moving average of the target network output from the online network, effectively eliminating the need for negative samples. Chen et al. 30 dispensed with negative samples and the momentum encoder, relying on the same encoder and a stop-gradient mechanism to prevent output collapse. Caron et al. 31 used vision transformers as the foundation for self-supervised learning, implementing self-distillation without any labels.
Additionally, the principles of contrastive learning are applied to multimodal representation learning, defining robust loss functions by contrasting positive and negative multimodal sample pairs.32–34 Liu et al. 35 used contrastive loss for multimodal data representation learning, capturing complementary and synergistic interactions between modalities. Huang et al. 36 introduced a similarity aggregation strategy that utilizes signals from global and local representations for retrieval, jointly learning multimodal global and local representations of medical images by contrasting attention-weighted image regions with words in paired reports. Han et al. 37 preserved task-relevant information by maximizing mutual information between unimodal inputs. Song et al. 38 combined a distance metric function between label categories with curriculum learning, using a supervised prototype contrastive learning (SPCL) loss to address classification imbalance. Yang et al. 39 proposed a supervised cluster-level contrastive learning (SCCL) method, leveraging variance adaptive density (VAD) to overcome the inefficiency of high-dimensional supervised contrastive learning (SCL). Tu et al. 40 introduced a contrastive learning scenario between context and knowledge (CKCL), more efficiently utilizing contextual information and external knowledge. These studies use contrastive loss to interpret relationships between modalities, based on the principle of pulling an anchor and a positive sample together in the multimodal embedding space while pushing them away from many negative samples. However, the performance of downstream tasks is often limited by the constraints of self-supervised learning tasks due to the absence of data labels. To leverage label information, Khosla et al. 41 proposed a supervised contrastive loss, extending self-supervised contrastive learning methods to fully supervised learning. 
This approach achieves success in visual representation learning 42 and few-shot learning. 43
Therefore, researchers have utilized contrastive learning algorithms to explore a variety of tasks, effectively leveraging global and local multimodal information during training. However, in scenarios with low pairwise discriminability, existing contrastive learning methods exhibit limited capabilities in fine-grained classification of similar structures. This limitation is particularly evident in pneumoconiosis imaging, where diffuse interstitial fibrosis in lesion areas presents a significant challenge. In this study, we specifically designed a local similarity-aware modal alignment module to enhance the model's ability to comprehend fine-grained details in both images and text. By adopting the principles of contrastive learning, we establish effective connections between images and text, identify discriminative features of similar categories, and thereby enhance the accuracy and reliability of classification.
Knowledge-based medical pre-training
In recent years, advancements in artificial intelligence have yielded promising results across various applications. Through representation learning on image-text datasets,44,45 a series of visual-language pretraining models14,15 have been developed, with significant focus in the field of medical visual-language pretraining (Med-VLP).16–19 Recent research has emphasized enhancing representation learning by integrating domain-specific medical knowledge. Zhang et al. 46 introduced knowledge-augmented diagnosis (KAD), leveraging existing medical domain knowledge to guide the use of paired CXR images and radiology reports. Chen et al. 47 proposed a knowledge-based learning framework for identifying unlabeled biomedical microscopy images through self-supervised pretraining, emphasizing optimizations between encoders and decoders critical for initializing high-quality segmentation decoders. Chen et al. 48 introduced the knowledge-enhanced contrastive visual-language pretraining (KoBo) framework, integrating clinical knowledge to enhance semantic consistency learning in visual-language tasks, addressing challenges of semantic overlap and transfer. Chen et al. 49 proposed a systematic approach to strengthen medical visual-language pretraining by integrating structured medical knowledge, aligned representations, knowledge-injected fusion models, and designing knowledge-induced pseudo tasks. Pan et al. 50 fused knowledge graph-based contrastive pretraining methods to enhance alignment and reasoning capabilities of visual and language representations. These studies underscore the potential of integrating domain-specific knowledge into medical contrastive visual-language pretraining to improve performance on downstream tasks, addressing challenges such as limited medical data and the need for more effective representation learning.
The learning of these models revolves around encoding images and text into crucial representations necessary for downstream tasks, enabling them to perform various zero-shot prediction tasks not specifically trained on by the models. These studies collectively highlight the potential of visual-language pretraining in enhancing performance across various tasks. They have demonstrated outstanding performance in diverse downstream tasks such as medical object detection, 51 image classification, 52 and semantic segmentation. 53 However, they have yet to effectively explore different granularities of visual representations and often rely on partial semantic information. To address these limitations, Li et al. 54 employed a knowledge-guided contrastive framework to capture multi-granular semantic information. It accurately aligns the pathology of each image with corresponding medical terminologies, thereby enhancing the model's performance in downstream tasks.
In addition to the aforementioned efforts, maintaining consistency between the knowledge features of training classes and the image features is crucial. During training, these knowledge features should progressively align with the image features. However, if the representations of a few images from specific classes diverge from those of the entire dataset, it may lead to overfitting on the semantic labels of specific samples. For example, in the context of pneumoconiosis image embeddings, there is a risk of overfitting to comorbidity information with similar clinical manifestations, even if such information is irrelevant to pneumoconiosis staging. To mitigate the overfitting of knowledge features for each class, our work focuses on translating data-associated features and domain knowledge into priors and constraints for the model. We conduct a deep analysis of similar pneumoconiosis images, employing a similarity-aware modal alignment strategy to capture both consistency and diversity among similar images. This approach enhances the model's ability to understand fine-grained details across different modalities, establishes semantic correlations between images and text, promotes feature diversity, and effectively trains the model to discern between similar samples. These efforts aim to serve downstream tasks in fully visual modalities effectively.
Method
Overall architecture
Due to the varying severity of pneumoconiosis lesions across different lung lobes and the diverse and complex nature of lesion characteristics, radiology reports can convey more comprehensive semantic information than discrete label values. This capability enhances diagnostic performance in downstream tasks, particularly where labeled data is limited. Therefore, this paper proposes a two-stage multimodal pre-training framework, MSK-PT, as shown in Figure 2. In the first stage, we employ a multi-layer alignment strategy to construct an integrated model. Using a similarity-aware modal alignment module, we learn joint representations of domain knowledge and image features. This helps establish semantic correlations between textual and visual features, thereby enhancing the model's predictive capabilities. In the second stage, we utilize data-associated features and pre-trained knowledge to guide visual representation learning, and we introduce an uncertainty threshold strategy to improve the model's generalization ability and robustness. Next, we first describe the problem scenario considered and then detail the training of each stage.
Problem scenario
Recent literature 54 indicates that the performance of multimodal learning methods largely depends on the availability of large annotated datasets. However, obtaining accurately labeled medical images is both expensive and time-consuming in clinical practice. Therefore, an important research direction is how to effectively learn labeled text-image representations to achieve competitive performance. In this study, we first define a training set
Global representation alignment (GRA)
To learn global text-image information, MSK-PT employs a global contrastive loss 24 operating at the level of global text
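A global image-text contrastive loss of this kind is commonly written as a symmetric InfoNCE objective over a batch of paired embeddings. The sketch below is a minimal illustration under that assumption and is not the exact loss used by MSK-PT: the function name `global_contrastive_loss`, the temperature value, and the use of cosine similarity are all hypothetical choices.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def global_contrastive_loss(image_embs, text_embs, temperature=0.1):
    """Symmetric InfoNCE over paired global embeddings.

    Assumption: row i of image_embs is the positive pair of row i of
    text_embs, and every other row in the batch acts as a negative.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # Cosine-similarity logits scaled by the (hypothetical) temperature.
    logits = [
        [sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature
         for j in range(n)]
        for i in range(n)
    ]

    def cross_entropy_on_diagonal(rows):
        # Mean of -log softmax(row)[i]; the matching pair sits on the diagonal.
        total = 0.0
        for i, row in enumerate(rows):
            peak = max(row)
            log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
            total += log_z - row[i]
        return total / len(rows)

    image_to_text = cross_entropy_on_diagonal(logits)
    text_to_image = cross_entropy_on_diagonal([list(c) for c in zip(*logits)])
    return 0.5 * (image_to_text + text_to_image)
```

With correctly paired embeddings the loss is close to zero, and it grows when the pairing is shuffled, which is the behavior a global alignment term is meant to encourage.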
Similarity-aware modality alignment (SMA)
In pneumoconiosis images, variations among individuals make it necessary to consider both intra-class differences within similar lesions and inter-class differences between dissimilar lesions. These variations are disease-specific but constitute a small proportion of the overall image. Furthermore, clinical text reports accurately describe only a few lesions related to the disease. Given these challenges in existing pneumoconiosis data, and unlike traditional similarity alignment in contrastive learning, we introduce a Similarity-aware Modality Alignment (SMA) module. This module focuses on category-level rather than individual-level contrastive learning to accurately differentiate lesion types and identify subtle features of pneumoconiosis lesions. The module comprises two parts: Token-wise image-text alignment (TA) and Prototype-wise disease alignment (PA), as illustrated in Figure 3.

Similarity-aware modality alignment module. In this module, (a) denotes the cross-correlation matrix between text and image in TA; (b) and (c) denote the representation spaces within text and image in TA; (d) denotes the prototype (disease)-level representation on text-image positive pairs in PA; (e) shows the final aggregated similarity representation space for text and image produced by the dual alignment learning.
Specifically, for each text-image embedding pair
Training process
Our proposed MSK-PT method consists of two main stages: multimodal pre-training and downstream task learning, as shown in Figure 2. In this section, we will explain the training process for each stage in detail.

MSK-PT approach overview. The approach is divided into two stages: (a) Pre-training stage, which utilizes weakly augmented labeled domain data to train the base task model through multi-layer alignment strategies. (b) Downstream task learning stage, where data-related features and pre-trained knowledge features are propagated to fully unlabeled visual samples, and diagnoses are made together with an estimate of their reliability. When weakly augmented unlabeled pneumoconiosis images are input, the model may assign an uncertainty score higher than the threshold θ, indicating an unreliable prediction and prompting radiologists to reevaluate it. Conversely, for strongly augmented unlabeled pneumoconiosis images, an uncertainty score below the threshold θ suggests a reliable diagnosis.
Furthermore, during the pre-training stage, we update the model parameters by minimizing the cross-entropy loss between the predicted probability distribution
To quantify uncertainty, we employ a method based on the maximum predicted probability to calculate uncertainty scores
Where
By combining the uncertainty score
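The uncertainty rule described in this section can be sketched in a few lines. The exact formula is not recoverable from the text, so the example assumes the simplest reading of "based on the maximum predicted probability", namely uncertainty = 1 - max probability. The function name `predict_with_uncertainty` and the default threshold `theta=0.5` are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of class logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_with_uncertainty(logits, theta=0.5):
    """Return the predicted stage, its uncertainty, and a reliability flag.

    Assumption: uncertainty is 1 minus the maximum predicted probability,
    and a prediction counts as reliable when that score is below theta.
    """
    probs = softmax(logits)
    confidence = max(probs)
    stage = probs.index(confidence)
    uncertainty = 1.0 - confidence
    return {
        "stage": stage,
        "uncertainty": uncertainty,
        "reliable": uncertainty < theta,
    }
```

For example, `predict_with_uncertainty([4.0, 0.0, 0.0, 0.0])` concentrates most of the probability on stage 0 and is flagged as reliable, whereas nearly uniform logits produce a high uncertainty score and would be referred to a radiologist.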
Overall training objective
The overall objective of MSK-PT encompasses three main loss functions: weighted cross-entropy loss to minimize the difference between predicted probabilities and true labels; global text-image contrastive loss to enhance learning of global features; and fine-grained similarity-aware modal alignment loss to accurately capture local similarities between text and images. Specifically, these are detailed as follows:
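One common way to combine three such terms is a weighted sum. The form below is only an illustrative assumption: the symbols for the three losses and the balancing weights $\lambda_{1}$ and $\lambda_{2}$ are introduced here for exposition and are not taken from the source.

```latex
\mathcal{L}_{\mathrm{total}}
  = \mathcal{L}_{\mathrm{CE}}
  + \lambda_{1}\,\mathcal{L}_{\mathrm{GRA}}
  + \lambda_{2}\,\mathcal{L}_{\mathrm{SMA}}
```

Here $\mathcal{L}_{\mathrm{CE}}$ stands for the weighted cross-entropy loss, $\mathcal{L}_{\mathrm{GRA}}$ for the global text-image contrastive loss, and $\mathcal{L}_{\mathrm{SMA}}$ for the similarity-aware modal alignment loss.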
Experiment setup
In this section, we introduce the pneumoconiosis dataset, provide detailed implementation specifics of the model training, discuss the evaluation metrics used, and outline the baseline methods for comparison.
Datasets
We collected the PneumoCXR dataset, comprising data from independent patients between 2018 and 2021, at an occupational disease prevention and control institute in Shanxi Province, China. In addition, clinical data and diagnostic reports of each patient were collected, yielding image-text pairs for a total of 2014 patients. To minimize assessment errors, each case was independently diagnosed and reported by at least four radiologists, each with over 10 years of clinical experience. Final results were determined based on consensus among the radiologists, categorizing each image into one of four labels: ‘stage-0’, ‘stage-1’, ‘stage-2’, and ‘stage-3’, as detailed in Table 1. All clinical data collection was conducted with informed consent from the patients and approval from the hospital's ethics committee.
PneumoCXR dataset split detailed information.
Training details
The training process for the MSK-PT approach involves two stages: multimodal pre-training and downstream task fine-tuning.
Prior to inputting into the network, all data dimensions are uniformly resized to 224 × 224 pixels for computational convenience. We utilize BERT 59 as the text encoder module and ResNet-50 60 as the image encoder module. The training process iterates for 200 epochs, using the Adam optimizer, with parameters set according to recommendations from 63: exponential decay rates
Evaluation index
To comprehensively evaluate our model's performance in pneumoconiosis diagnosis, we utilized multiple evaluation metrics including Accuracy, Sensitivity, Specificity, F1 Score (F1), and Area Under the Curve (AUC).
Additionally, AUC refers to the area under the Receiver Operating Characteristic (ROC) curve. The ROC curve plots the False Positive Rate (FPR) on the x-axis against the True Positive Rate (TPR) on the y-axis. A higher AUC value indicates better classification performance of the model. The formulas for calculating FPR and TPR are as follows:
$$\mathrm{FPR} = \frac{FP}{FP + TN}, \qquad \mathrm{TPR} = \frac{TP}{TP + FN}$$
where TP, FP, TN, and FN denote the numbers of true positives, false positives, true negatives, and false negatives, respectively.
The value of AUC never exceeds 1. Since the ROC curve typically lies above the diagonal line (
To facilitate comparison and analysis of different models, we computed the average values of these metrics as a comprehensive evaluation of overall performance. This approach not only helps identify models that excel in specific metrics but also ensures that the chosen model performs well across multiple performance dimensions, thereby enhancing the reliability and accuracy of pneumoconiosis diagnosis.
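As a worked illustration of these metrics, the sketch below computes accuracy and one-vs-rest sensitivity, specificity, and F1 from a multi-class confusion matrix, plus a rank-based AUC for a single class. The macro averaging over stages and the helper names (`per_class_metrics`, `macro_report`, `binary_auc`) are assumptions made for this example; the text does not specify how the per-class values are aggregated.

```python
def per_class_metrics(confusion, cls):
    """One-vs-rest metrics for class `cls`.

    confusion[i][j] is the number of samples whose true class is i
    and whose predicted class is j.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    tp = confusion[cls][cls]
    fn = sum(confusion[cls]) - tp
    fp = sum(confusion[i][cls] for i in range(n)) - tp
    tn = total - tp - fn - fp

    def ratio(num, den):
        # Guard against empty classes instead of dividing by zero.
        return num / den if den else 0.0

    sensitivity = ratio(tp, tp + fn)   # also the true positive rate
    specificity = ratio(tn, tn + fp)   # equals 1 - false positive rate
    precision = ratio(tp, tp + fp)
    f1 = ratio(2 * precision * sensitivity, precision + sensitivity)
    return {"sensitivity": sensitivity, "specificity": specificity, "f1": f1}


def macro_report(confusion):
    """Overall accuracy plus macro-averaged per-class metrics (assumed)."""
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    per_class = [per_class_metrics(confusion, c) for c in range(n)]
    report = {"accuracy": sum(confusion[c][c] for c in range(n)) / total}
    for key in ("sensitivity", "specificity", "f1"):
        report[key] = sum(m[key] for m in per_class) / n
    return report


def binary_auc(scores, labels):
    """AUC as the probability that a positive outranks a negative.

    Ties count as half a win; labels are 1 (positive) or 0 (negative).
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))
```

For a four-stage task, `binary_auc` would be applied once per stage, using that stage's predicted probability as the score and treating the stage as the positive class.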
Baseline methods
To validate the effectiveness of the proposed MSK-PT model, we compared its performance from multiple perspectives against current state-of-the-art baseline models. Considering significant domain differences between natural image-text pairs and medical image-text pairs, directly transferring pre-trained natural image-text models to medical tasks may result in performance degradation. Therefore, we chose state-of-the-art pre-trained models specifically designed for medical domains as comparison baselines. To ensure fairness in experiments and minimize the impact of network architectures on final performance, all experiments utilized the same BERT 59 and ResNet-50 60 as foundational text and image encoder modules, consistent with MSK-PT. Below are the selected comparison baselines and their highlights:
Evaluation and analysis
In this section, we compare MSK-PT with other state-of-the-art models and conduct visual evaluations. We performed ablation experiments to validate the effectiveness of each module and component of MSK-PT. All baseline models underwent diagnostic performance evaluation on the PneumoCXR test dataset to verify the effectiveness of MSK-PT.
Comparative results with other methods
Table 2 presents the experimental results comparing MSK-PT with several representative visual-linguistic models; the results demonstrate the effectiveness of its training strategy, with MSK-PT outperforming the baseline models. Unlike typical medical image classification tasks, pneumoconiosis CXR images exhibit strong inter-class similarities, such as similar lung contour positions and grayscale intensity ranges among lesions, because all images reflect the radiological characteristics of the lungs after X-ray penetration. Our similarity-aware modality alignment module mitigates these inter-class similarities to some extent. On the PneumoCXR test dataset, MSK-PT performs best on most metrics and on their average, surpassing all reference baseline models and achieving state-of-the-art performance. This result indicates that MSK-PT significantly enhances the reliability of detection in unlabeled cases by leveraging data-associated features and domain knowledge acquired during pre-training, even in the absence of annotated medical information or clinical interpretation. Additionally, we believe that uncertainty thresholds contribute to enhancing the robustness of MSK-PT, ensuring reliable predictive outcomes.
Comparison with the SOTA baseline methods on the test set. The best predictive performance is highlighted in boldface.
ROC curve and confusion matrix analysis
Figure 4 illustrates the ROC curves and confusion matrix of the MSK-PT method in the pneumoconiosis image classification task. The ROC curves on the left side of Figure 4 show that the model achieves excellent AUC values across different lesion categories. Specifically, the ROC curves of MSK-PT approach the upper-left corner for each category, indicating high sensitivity and specificity at various thresholds. The confusion matrix on the right side of Figure 4 displays the model's classification results across different categories. Our method produces a substantial number of correct predictions in each category. However, the model confuses some pneumoconiosis images between “Normal” and “Stage 1”. This could be attributed to early-stage pneumoconiosis lesions being less pronounced on chest radiographs, sharing similar radiographic features, and being susceptible to interference from other respiratory diseases. Overall, the quantitative analysis supports the strong performance of the MSK-PT model in identifying similar lesion features in pneumoconiosis images.

ROC curve and confusion matrix of MSK-PT for all four pneumoconiosis types.
Reliability analysis
Figure 5 illustrates four pneumoconiosis image samples detected using our MSK-PT model. Previous studies typically relied on selecting the class with the highest probability as the final diagnostic basis. In contrast, the MSK-PT model not only provides the final prediction but also includes uncertainty scores to explicitly indicate the reliability of the diagnosis. Lower uncertainty scores indicate higher confidence in the model's final decision (as shown in the left column of Figure 5). In cases where the final diagnosis is incorrect (as depicted in the right column of Figure 5), assigning high uncertainty scores helps identify the unreliability of these predictions. High uncertainty scores prompt radiologists to reassess the images to avoid misdiagnosis or missed diagnosis. By introducing uncertainty thresholds, the MSK-PT model enhances both diagnostic accuracy and the assessment of diagnostic result reliability.

Four pneumoconiosis image samples diagnosed using MSK-PT. The left column shows samples that the model predicts correctly, and the right column shows samples that it predicts incorrectly. MSK-PT provides not only probability scores but also corresponding uncertainty scores that reflect the reliability of the predictions. If the uncertainty score is less than the threshold θ, the model's prediction is considered reliable. If the uncertainty score exceeds the threshold θ, the result is considered unreliable and requires re-evaluation by a radiologist.
Visualization analysis
Grad-CAM is commonly used to localize discriminative regions in object detection and classification tasks. Here, we use Grad-CAM heatmaps to examine how MSK-PT extracts fine-grained features and to highlight its improvements over baseline methods. Figure 6 shows that MSK-PT focuses more on regions of interest related to data-associated features and pre-trained knowledge, which improves the model's robustness against irrelevant background information and better highlights lesion-related features. In contrast, baseline methods struggle to capture the subtle distinguishing features of specific diseases, such as basilar reticular opacities in the posterior bases, centrilobular nodules, and branching shadows with blurred boundaries and low contrast against normal tissue, which are low-resolution lesions distributed diffusely throughout the lungs. Our approach therefore distinguishes disease-specific low-resolution features from other general attributes in CXR images and differentiates the characteristic micro-differences among the stages of pneumoconiosis, which is crucial for the interpretability and credibility of the model.

Heatmap visualization of four representative pneumoconiosis samples. Top: pathology regions annotated by certified radiologists (circular markers). Bottom: visual features from MSK-PT (brighter colors mean higher feature values). The baseline method has difficulty locating the disease regions accurately, whereas MSK-PT successfully highlights the abnormal regions identified by radiologists.
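For readers unfamiliar with how such heatmaps are obtained, the standard Grad-CAM computation can be sketched in a few lines. The sketch assumes the activations of the last convolutional layer and the gradients of the target class score with respect to them have already been extracted (in practice via framework hooks); the tiny 2 × 2 feature maps are invented, and nothing here is specific to the MSK-PT implementation.

```python
from typing import List

Map2D = List[List[float]]


def grad_cam(activations: List[Map2D], gradients: List[Map2D]) -> Map2D:
    """Standard Grad-CAM on K feature maps of size H x W.

    1. Channel weight = spatial mean of that channel's gradient.
    2. Heatmap = ReLU of the weighted sum of the feature maps.
    3. Rescale to [0, 1] so that brighter means a stronger response.
    """
    n_ch = len(activations)
    h, w = len(activations[0]), len(activations[0][0])
    weights = [sum(sum(row) for row in g) / (h * w) for g in gradients]
    cam = [[max(0.0, sum(weights[k] * activations[k][i][j]
                         for k in range(n_ch)))
            for j in range(w)]
           for i in range(h)]
    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[v / peak for v in row] for row in cam]
    return cam


# Toy example: channel 0 supports the target class (positive gradient),
# channel 1 argues against it (negative gradient).
acts = [
    [[1.0, 0.0], [0.0, 0.0]],   # channel 0 fires at the top-left
    [[0.0, 0.0], [0.0, 2.0]],   # channel 1 fires at the bottom-right
]
grads = [
    [[1.0, 1.0], [1.0, 1.0]],
    [[-1.0, -1.0], [-1.0, -1.0]],
]
heatmap = grad_cam(acts, grads)
print(heatmap)
```

The ReLU step is what removes the bottom-right location: it has a strong activation, but in a channel whose gradient indicates evidence against the target class.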
Failure analysis
Additionally, we conducted a visual analysis of the four worst-performing cases in the PneumoCXR test set to further assess the limitations of the model. As shown in Figure 7, since the PneumoCXR dataset provides the exact locations of pneumoconiosis lesions, we compared the lesion areas annotated by radiologists with the high-response areas generated by Grad-CAM. The results indicate that although our method can generate large attention regions covering multiple feature locations in the samples, when dealing with extremely difficult cases, these attention regions still fail to accurately cover the actual lesion areas, ultimately leading to incorrect predictions.

Visualization of representative cases where our method fails to focus on disease regions. The circular markers indicate the ground truth locations.
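The comparison between annotated lesion areas and Grad-CAM high-response areas is presented visually in this study. One hypothetical way to quantify it, sketched below, is to check whether the heatmap peak falls inside an annotated circle and what fraction of the high-response pixels is covered by the annotations. The circle format (row, column, radius), the 0.5 response level, and the toy heatmaps are all assumptions made for illustration.

```python
from typing import Dict, List, Sequence, Tuple

Circle = Tuple[float, float, float]  # (center row, center col, radius)


def inside_any_circle(row: int, col: int, circles: Sequence[Circle]) -> bool:
    """True if pixel (row, col) lies in at least one annotated circle."""
    return any((row - cr) ** 2 + (col - cc) ** 2 <= r * r
               for cr, cc, r in circles)


def localization_report(cam: List[List[float]], circles: Sequence[Circle],
                        level: float = 0.5) -> Dict[str, object]:
    """Compare a non-negative heatmap with circular lesion annotations."""
    pixels = [(i, j) for i in range(len(cam)) for j in range(len(cam[0]))]
    peak = max(pixels, key=lambda p: cam[p[0]][p[1]])
    peak_value = cam[peak[0]][peak[1]]
    # High-response area: pixels at or above `level` times the peak value.
    hot = [p for p in pixels if cam[p[0]][p[1]] >= level * peak_value]
    covered = sum(inside_any_circle(i, j, circles) for i, j in hot)
    return {
        "peak_hit": inside_any_circle(peak[0], peak[1], circles),
        "hot_fraction_inside": covered / len(hot),
    }


lesion = [(1, 2, 1)]  # one annotated lesion (invented)

cam_success = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.9, 1.0, 0.0],
    [0.0, 0.6, 0.7, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
cam_failure = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.2, 0.0],
    [0.6, 0.0, 0.0, 0.0],
    [1.0, 0.8, 0.0, 0.0],
]
good = localization_report(cam_success, lesion)
bad = localization_report(cam_failure, lesion)
print(good)
print(bad)
```

A failure case of the kind shown in Figure 7 corresponds to the second heatmap: the attention region is sizable, yet neither its peak nor its high-response pixels overlap the annotated lesion.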
Ablation studies
We conducted a series of ablation experiments to examine the effects of different structures on the performance of the proposed MSK-PT model.
Effectiveness of Different Components: To analyze the importance of each component within MSK-PT, we conducted comprehensive ablation experiments, altering one variable at a time while following the same training setup as MSK-PT. The ablated variants, including variants that remove components within the SMA module, are reported in Table 3. The quantitative results in Table 3 demonstrate that the design of MSK-PT plays a crucial role in its final performance, and in particular highlight the critical role of the SMA module in handling fine-grained similarity and multimodal alignment. These experiments further validate the importance of each component to the overall performance of the model and support the rationality and effectiveness of the design.
Effectiveness of SMA Module: During the pre-training phase, the SMA module is designed to enhance the model's ability to understand fine-grained representations between images and text, enabling it to more accurately identify similarities and differences in features. To further validate the contribution of the SMA module within the MSK-PT method, we applied t-SNE visualization 65 to analyze the features learned by the model on the PneumoCXR dataset. This shows how similar features affect the deep learning model's recognition of pneumoconiosis and how the SMA module improves this recognition capability. As shown in Figure 8, with the SMA module (Figure 8(a)), the features learned during training exhibit more distinct clustering patterns, with significantly reduced overlap between samples. This indicates that the SMA module effectively guides the model in learning more discriminative features, greatly improving its ability to distinguish between different categories and thereby benefiting downstream tasks. In contrast, without the SMA module (Figure 8(b)), samples from different categories show substantial overlap, and because lesion features vary little across images, overfitting occurs, leading to blurred class boundaries. These results demonstrate that the SMA module helps the model capture and distinguish the features of different categories and create clearer inter-class boundaries in the feature space, which ultimately improves its accuracy and robustness in diagnostic tasks.
Effectiveness of Knowledge Distillation Strategy: Furthermore, we compared the performance of different distillation methods in detail, with all experiments following the same training settings as MSK-PT. The results indicate that the baseline method showed significant overfitting after just a few epochs of training. The accuracy and loss convergence curves in Figure 9 show that, compared to the “w/o Shared Data Feature” and “w/o Shared Text Feature” variants, the design of MSK-PT, which combines data-associated features from the pre-training phase with domain knowledge-based distillation strategies, played a crucial role in enhancing model performance. This approach not only effectively mitigated the overfitting issue but also significantly improved the model's generalization ability.
Effect of Different Image Backbone: In Table 4, we used BERT 59 as the base text encoder and performed pre-training with various visual encoders and image resolutions, including ViT-B/16 66 and ResNet-50. 60 We implemented these using the officially released code. The results indicate that MSK-PT improves downstream classification performance across different image backbone networks. Although ResNet-50 performed best at a resolution of 1024 × 1024, its high computational and time costs are significant considerations in practical applications. Therefore, we ultimately chose ResNet-50 at a resolution of 224 × 224 as the base visual encoder, as it balances performance against computational overhead and thereby speeds up training and inference.
Effect of Training Sample Numbers: To explore the classification performance of MSK-PT with varying amounts of training data, we fine-tuned the model on different proportions of the training samples and analyzed its robustness to the number of training samples. Figure 10 shows the accuracy, specificity, F1-score, and sensitivity under different amounts of training samples. The horizontal axis represents the percentage of training samples used (25%, 50%, 75%, and 100%), and the vertical axis gives the values of the metrics. The results indicate that the classification performance of MSK-PT improves as the percentage of training samples increases. Although reducing the number of training samples lowers classification performance, the impact is relatively minor. This is primarily due to MSK-PT's detailed extraction and effective alignment of the underlying multimodal features of pneumoconiosis, as well as the robust text-image associations it establishes, which enable the model to maintain high classification performance even with fewer training samples. This result demonstrates the robustness of MSK-PT to different amounts of training samples and highlights its reduced dependency on training data in practical applications.
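A minimal sketch of how training subsets at 25%, 50%, 75%, and 100% could be drawn is given below. It assumes stratified sampling, so that each subset keeps the class proportions of the full training set; the study does not state how its subsets were selected, and the class counts, seed, and function name here are hypothetical.

```python
import random
from collections import Counter, defaultdict
from typing import Dict, List, Sequence


def stratified_subset(labels: Sequence[int], fraction: float,
                      seed: int = 0) -> List[int]:
    """Return sorted sample indices covering roughly `fraction` of each
    class, keeping at least one sample per class."""
    rng = random.Random(seed)
    by_class: Dict[int, List[int]] = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    chosen: List[int] = []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        keep = max(1, round(fraction * len(indices)))
        chosen.extend(indices[:keep])
    return sorted(chosen)


# Invented class counts for Stage 0 to Stage 3 (not the PneumoCXR counts).
labels = [0] * 40 + [1] * 24 + [2] * 16 + [3] * 8

subsets = {f: stratified_subset(labels, f) for f in (0.25, 0.50, 0.75, 1.00)}
for fraction, indices in subsets.items():
    per_class = Counter(labels[i] for i in indices)
    print(f"{fraction:.0%}: n={len(indices)} per class={dict(sorted(per_class.items()))}")
```

Keeping the class balance fixed across subsets means that differences between the points on the horizontal axis reflect the amount of data rather than a shift in stage distribution.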
Effect of Different Loss Functions: To assess the significance of the global and local loss functions within the model, we conducted ablation experiments using various loss combinations. In MSK-PT's loss function, global contrastive learning serves as the performance baseline and is supplemented by token-wise and prototype (disease)-wise terms. We focused on the token-wise alignment (TA) and prototype-wise alignment (PA) within the SMA module. As shown in Table 5, both TA and PA improve classification performance, indicating that alignment at both the token and prototype levels helps the text and image encoders learn representations better suited to downstream tasks. Notably, combining TA and PA yields further performance improvements on the PneumoCXR dataset, suggesting that the two are complementary. Omitting the PA loss results in a larger performance drop than omitting the TA loss, highlighting the role of prototype-wise contrastive learning in reducing false negatives and better adapting the model to the similarities among pneumoconiosis images. By combining both global and local losses, MSK-PT achieves optimal performance.
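The structure of this ablation can be illustrated with a small sketch. It uses a standard symmetric InfoNCE loss as a stand-in for the global contrastive term and treats the TA and PA terms as precomputed scalars that can be switched on or off; the temperature, the unit weights, and the function names are assumptions, and the sketch does not reproduce the exact loss formulation of MSK-PT.

```python
import math
from typing import List, Sequence


def info_nce(sim: Sequence[Sequence[float]], temperature: float = 0.07) -> float:
    """Symmetric InfoNCE over an N x N image-text similarity matrix whose
    diagonal holds the matched pairs (stand-in for the global term)."""
    n = len(sim)

    def mean_cross_entropy(matrix: Sequence[Sequence[float]]) -> float:
        total = 0.0
        for i in range(n):
            logits = [v / temperature for v in matrix[i]]
            top = max(logits)  # subtract the max for numerical stability
            log_sum_exp = top + math.log(sum(math.exp(l - top) for l in logits))
            total += log_sum_exp - logits[i]
        return total / n

    transposed: List[List[float]] = [list(col) for col in zip(*sim)]
    # Average the image-to-text and text-to-image directions.
    return 0.5 * (mean_cross_entropy(sim) + mean_cross_entropy(transposed))


def total_loss(global_loss: float, ta_loss: float, pa_loss: float,
               use_ta: bool = True, use_pa: bool = True,
               w_ta: float = 1.0, w_pa: float = 1.0) -> float:
    """Global term plus optional token-wise (TA) and prototype-wise (PA)
    terms; the flags mimic the rows of a loss ablation table."""
    loss = global_loss
    if use_ta:
        loss += w_ta * ta_loss
    if use_pa:
        loss += w_pa * pa_loss
    return loss


aligned = [[1.0, 0.0], [0.0, 1.0]]    # matched pairs are most similar
shuffled = [[0.0, 1.0], [1.0, 0.0]]   # matched pairs are least similar
print(info_nce(aligned), info_nce(shuffled))
print(total_loss(1.0, 0.5, 0.25), total_loss(1.0, 0.5, 0.25, use_pa=False))
```

The global term alone only rewards matching each image to its own report; in the ablation, the TA and PA terms are the parts added on top of that baseline.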

t-SNE visualizations of features with and without SMA, where different colors represent the true labels.

The accuracy and loss convergence curves for different knowledge distillation strategies on the test set.

Average metric values for different percentages of training samples. MSK-PT remains effective under different dataset sizes.
Ablation study of MSK-PT by removing individual modules.
Ablation study of MSK-PT with different image encoders and different image resolutions.
Ablation study of MSK-PT under different loss settings.
Limitations and future works
While MSK-PT has shown remarkable performance in multi-modal medical image classification tasks, acquiring and developing high-quality multi-modal medical data presents greater challenges compared to single-modal data. This not only increases the complexity and cost of data acquisition but also restricts the scale and diversity of available datasets. In addition, text-image pretraining models inevitably encounter challenges when predicting information not explicitly mentioned in the text or when handling data representations absent from the training set, thereby limiting the model's ability to generalize effectively across specific domains. In future research, we aim to explore more effective domain adaptation techniques or leverage regularization methods to enhance the model's adaptability to novel data scenarios.
Furthermore, Grad-CAM has been instrumental in visualizing and interpreting the pneumoconiosis staging model in this study. By highlighting the critical regions where the model focuses its attention, this approach not only enhances the model's robustness against irrelevant background information but also strengthens its interpretability and clinical applicability. Moving forward, we aim to refine the Grad-CAM method by integrating additional domain knowledge and multi-modal data to further enhance the model's diagnostic accuracy and interpretability in complex scenarios.
Conclusion
This paper proposes a novel
In summary, in the field of medical research where it is difficult to obtain a large amount of labeled data, our method not only enhances the diagnostic capabilities of multimodal text-image models facing limited labeled image datasets but also facilitates learning and understanding of correlations between text and images to better interpret medical images, thereby improving diagnostic accuracy and reliability. Evaluation on the PneumoCXR dataset demonstrates the effectiveness of the MSK-PT model, outperforming several state-of-the-art multimodal techniques across multiple evaluation metrics. Furthermore, through ablation studies, we provide empirical evidence of the unique contributions of each component within the method.
Footnotes
Author contributions
Xueting Ren: Conceptualization, Methodology, Data collection, Experimental deployment, Writing - original draft. Guohua Ji: Data collection & Experimental deployment. Surong Chu: Methodology & Data collection. Shinichi YOSHIDA: Supervision, review & editing. Juanjuan Zhao: Supervision, Project administration, review & editing. Baoping Jia: Data collection & annotation & curation, Clinical support. Yan Qiang: Methodology, Project administration, review & editing.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research is supported by National Natural Science Foundation of China [grant numbers 62376183, U21A20469]; National Health Commission Key Laboratory of Pneumoconiosis open project [grant numbers YKFKT004]; NHC Key Laboratory of Pneumoconiosis Shanxi China [grant numbers 2020-PT320-005]; Shanxi Provincial Science and Technology Innovation Talent Team Special Plan [grant numbers 202304051001009]; The Central Government Guides Local Science and Technology Development Funding Projects [grant numbers YDZJSX2022C004]; China Scholarship Council [grant numbers 202306930020].
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Data availability
The pneumoconiosis datasets supporting the findings of this study are not publicly available due to the confidentiality policy of the Chinese National Health Council and institutional patient privacy regulations. However, they are available from the corresponding authors upon request.
