Harnessing the wisdom of a radiologist: Texture-aware curriculum self-supervised learning for thorax disease classification

Abstract

With the rapid advancement of deep learning technologies, self-supervised learning utilizing large-scale unlabeled datasets has emerged as a dominant learning paradigm across multiple fields. This paradigm aligns well with the nature of medical imaging data, which has led to significant research efforts in applying self-supervised learning methods to this domain. However, many of these approaches fail to fully consider the unique characteristics of medical imaging, particularly the critical role that texture information plays in the diagnosis of thorax diseases. To address this gap, we propose a novel texture-aware self-supervised learning framework that leverages the Gray-Level Co-occurrence Matrix (GLCM) as an auxiliary signal to strengthen the model’s capacity to extract disease-relevant texture features. Additionally, we integrate curriculum learning into our approach, which gradually emphasizes texture information throughout the training process. This method enables the model to more effectively capture the inherent characteristics of medical imaging data. Our qualitative and quantitative experimental results show that our approach surpasses the current state-of-the-art methods on both the NIH CXR and Stanford CheXpert datasets.

Keywords

Self-supervised learning; medical image analysis; thorax disease diagnosis

1. Introduction

With the rapid progression of deep learning techniques, state-of-the-art methods have achieved notable success across various domains, including natural language processing (NLP) and computer vision (CV). Most deep learning approaches are heavily data-dependent, requiring large amounts of accurately labeled data to train models that are both precise and reliable. In this context, self-supervised learning (SSL) methods have emerged as a leading paradigm in deep learning, as they leverage intrinsic signals within the data itself to guide the learning process [21,25].

SSL approaches can be broadly classified into two main categories: contrastive methods and reconstruction-based methods. Contrastive learning focuses on bringing similar data points closer in the feature space while pushing dissimilar points apart, resulting in a discriminative semantic feature representation. Notable examples of contrastive methods include Contrastive Predictive Coding (CPC) [22], SimCLR [4], and MoCo (Momentum Contrast) [13]. Reconstruction-based methods, on the other hand, train models by recovering the original input from its corrupted or transformed versions, thereby learning semantic features that are closely aligned with the original data. Recently, BEiT [2] and Masked Autoencoder (MAE) [12] have become popular self-supervised learning methods in the field of CV. Regardless of the specific approach, the effectiveness of self-supervised learning has been demonstrated by extensive validation results on various downstream tasks.

SSL methods have been widely employed in the medical imaging domain, leading to several well-performing models. [29,37] apply a contrastive approach to enable the encoder to learn discriminative representations. [35] employ the MAE strategy to pre-train CNN-based and ViT-based encoders. [10] combine these two strategies to further enhance representation ability. However, many of these methods are adaptations of approaches originally intended for natural images, and often overlook the unique characteristics of medical imaging. This oversight can lead to several challenges: (1) the training data are typically randomly sampled, lacking a structured or progressive learning process; (2) medical images often exhibit high inter-class similarity, making it difficult for methods designed for natural images to significantly improve model performance. Therefore, there remains substantial potential for developing models that better integrate the specific characteristics of medical images, thereby enhancing their interpretative capacity in the field of medical imaging.

We observe that the training of radiologists follows a progressive sequence from simple to difficult cases. Moreover, although medical images exhibit high inter-class similarity, radiologists are skilled at accurately capturing low-level features such as texture and shape that distinguish between images of different patients. [38] attempted to utilize a GAN to decouple texture information from the overall representation of samples, resulting in an encoder that focuses on texture information. However, this method requires another encoder and a discriminator to assist the texture encoder in the decoupling process, which introduces a large number of redundant parameters into training. [23] try to mimic the progressive training scheme to enhance the model’s ability to detect abnormalities. However, their method is limited to abnormality detection in radiographs and does not explore broader applications.

Motivated by these observations, we propose a curriculum texture-aware self-supervised learning framework for thorax disease classification [9]. Our approach allows the model to better capture the intrinsic characteristics of medical imaging data, as shown in Figure 1. First, we organize the entire dataset into several curricula of increasing difficulty, based on the complexity of texture information. Then, similar to [35], our method uses the reconstruction of masked radiographs for pre-training. We use the Gray-Level Co-occurrence Matrix (GLCM) [11] of the original input as an auxiliary supervisory signal to constrain the texture information in the reconstructed images. In other words, our approach aims not only to reconstruct the original input but also to reproduce texture features similar to those of the original radiograph. To extract as much texture information from radiographs as possible, we replace the linear projection of the patch embedding in the original MAE [12] with a set of lightweight CNN networks. In short, texture information serves as the motivation, the process, and the objective of our method.

Fig. 1.

Comparison of our proposed curriculum-based pretraining strategy with conventional random arrangement methods.

Afterward, we fine-tuned and validated our pre-trained model on several thorax disease classification datasets. The experimental results demonstrated that our method outperforms previous state-of-the-art (SOTA) approaches. In addition, we performed visualization and analysis of the reconstructed images and extracted feature representations of our method. From the visualization results, it is evident that our method indeed enhances the attention of ViT towards texture details.

In summary, our research makes four-fold contributions:

– Our proposed training framework mimics the radiologist training curriculum, from easy samples to complicated ones. We measure the texture complexity of radiographic samples in the pre-training dataset and use this information to divide the entire dataset into several pretraining curricula of increasing difficulty. This progressive process gradually enhances the model’s perception of texture information.

– We have carefully considered the differences between ViT and CNN and introduced a CNN-based patch embedding module into ViT. This allows us to extract texture information from the input radiograph to a greater extent and alleviates ViT’s tendency to overly focus on the shape characteristics of the input samples.

– The constraint of the gray-level co-occurrence matrix (GLCM) during the reconstruction process serves to supervise the entire framework and preserve the texture information.

– Quantitative and qualitative experimental results on public datasets illustrate that our method improves the quality of the pretrained representation over SOTA methods.

2. Related work

2.1. Curriculum learning

Curriculum learning is a learning paradigm inspired by the human cognitive process, in which easy patterns are recognized before hard patterns. This paradigm was first proposed and applied by Bengio et al. [3]. Its main idea is to train the model progressively, where progression can occur either at the data level, from easy to hard samples, or at the task level, from simple to complex tasks. Several studies show that this paradigm results in better generalization and requires fewer training epochs to converge.

In practical applications, [32] combine a speech-text translation task with a task on the sample’s internal understanding, creating a simple-to-hard curriculum at the task level. [20] utilize both original data and strongly augmented data to construct a curriculum, gradually enhancing the 3D point cloud representation capability of the model. [15] apply this strategy to a multimodal retrieval task. [19] All these previous works have shown that curriculum learning can speed up convergence and improve model effectiveness. Our method differs from these previous works in two ways: (1) our method measures the texture information of the input samples and creates a texture-based curriculum; (2) we apply curriculum learning in the pretraining stage, on the pretext task, which gradually enhances the model’s ability to capture texture.

2.2. Self-supervised learning in medical image analysis

Self-supervised learning aims to leverage internal supervision signals inherent within the data samples themselves to guide deep learning methods in learning rich and meaningful semantic features. Recent self-supervised learning protocols can be divided into two categories: contrastive methods [4,5,8,13] and restoration methods [2,12,18,36]. Contrastive methods [4,5,13] maximize the mutual information between different augmented views of the same image and minimize the mutual information between representations of different images, encouraging the deep encoder to extract discriminative representations. Restoration methods aim to encode the input sample into a latent representation and decode that latent representation back to the original image. MAE [12] and SimMIM [36] mask a large portion of the input image and use a pixel reconstruction loss to guide the encoder to learn rich semantic representations. BEiT [2] encodes image patches into discrete semantic tokens so that the BERT [6] pretraining protocol can be applied to continuous visual information. These self-supervised methods show competitive results on several natural image tasks.

Recently, a few studies [10,29,30,35,37] have employed visual self-supervised learning for medical image analysis. Swin-UNETR [30] utilizes self-supervised learning to enhance the Swin Transformer’s low-level and high-level feature extraction ability in order to obtain accurate organ segmentation results. [35] explored the impact of different hyperparameters on classification performance after pretraining on in-domain chest X-ray data. [10] employ multiple self-supervised learning strategies to enhance DenseNet’s performance on several medical image benchmarks, including both 2D and 3D medical images.

However, most of these methods directly apply self-supervised learning frameworks designed for natural images to medical image analysis, without incorporating prior knowledge of the medical imaging domain. Our proposed work differs from these works in two aspects: (1) the overall pretraining process mimics the radiologist training procedure, from easy patterns to hard patterns; (2) we emphasize human-observed statistical texture features to guide ViT to learn specific texture patterns.

3. Methodology

In this section, we provide a detailed explanation of our proposed method and the preliminary concept involved. In Section 3.1, we introduce the preliminary concept of the Gray-Level Co-occurrence Matrix (GLCM). Section 3.2 describes the overall structure of our proposed model. In Section 3.3, we present our curriculum learning approach, which leverages the complexity of image textures. In Section 3.4, we introduce the texture-aware patch embedding module. In Section 3.5, we detail the Masked Radiograph Modeling method, which reconstructs the original input radiograph using an asymmetric decoder and calculates the L2 (MSE) loss. Finally, the GLCM constraint and the loss function employed by our whole architecture are explained in Section 3.6.

3.1. Preliminary: Gray-level co-occurrence matrix

Before presenting our method, we briefly introduce the Gray-Level Co-occurrence Matrix (GLCM), a statistical tool for extracting the texture features of a visual signal.

The Gray-Level Co-occurrence Matrix (GLCM) [11], also known as the Gray-Level Spatial Dependency Matrix (GLSDM), is a widely used texture analysis method in image processing and computer vision. It provides a statistical representation of the spatial relationships between pairs of pixels in an image, revealing the statistical texture pattern of an input image.

Consider a grayscale image $I \in R^{M \times N}$ with $K$ gray levels. The gray level of a single pixel is denoted as $I (i, j)$, with $i$ representing the row index and $j$ representing the column index. The GLCM of image $I$ is defined as $G (d, θ) \in R^{K \times K}$, where $d$ and $θ$ denote the pixel distance and angle, respectively. The element $g_{d}^{θ} (x, y)$ in the $x$-th row and $y$-th column of $G (d, θ)$ is the number of pixel pairs, separated by distance $d$ along direction $θ$, whose gray levels are $x$ and $y$. It is calculated as:

g_{d}^{θ} (x, y) = \sum_{i = 1}^{M} \sum_{j = 1}^{N} δ (x - I (i, j)) \cdot δ (y - I (i + d \cos (θ), j + d \sin (θ)))

(1)

Here $δ (\cdot)$ is an indicator function that equals 1 when its argument is 0, and 0 otherwise. In our method, we choose $θ_{1} = 0^{\circ}$, $θ_{2} = 45^{\circ}$, $θ_{3} = 90^{\circ}$, and $θ_{4} = 135^{\circ}$ as our four angles and $d = 1$ and $d = 2$ as our two distances to calculate eight GLCMs. $G$ is used to represent the final mean GLCM, and $g (x, y)$ represents an element within $G$.

In practical applications, various statistical features, including energy and entropy, can be extracted from the GLCM to effectively describe the texture characteristics of an entire image. This process is usually conducted within local patches of the image, allowing us to capture the relationships between different local textures and structures. A common approach involves partitioning the image into patches and utilizing the GLCM features of these patches to represent the central pixel. This procedure enables the construction of a GLCM graph, as illustrated in our pipeline figure. Moreover, in our study, we adopt global statistical GLCM features from chest radiographs to quantify the texture complexity of input samples, thereby facilitating the creation of a pretraining curriculum. Further details about the curriculum strategy are provided in the subsequent section.
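To make Eq. (1) concrete, the following is a minimal pure-Python sketch of the co-occurrence count and of the pooled GLCM over the four angles and two distances. It is an illustration under stated assumptions, not the authors' implementation: the names `UNIT_STEPS`, `glcm`, and `mean_glcm` are hypothetical, the image is assumed to be already quantized to integer gray levels in `[0, levels)`, the angles are discretized to integer pixel steps (since $d \cos (θ)$ is not an integer at 45°), pairs that fall outside the image are skipped, and the pooled matrix is normalized so that its entries sum to 1.

```python
# Integer (row, col) unit steps for the four angles. Assumption: the usual
# image-coordinate discretisation, because d*cos(theta) is not an integer at 45 degrees.
UNIT_STEPS = {0: (0, 1), 45: (-1, 1), 90: (-1, 0), 135: (-1, -1)}


def glcm(image, d, angle, levels):
    """Count co-occurrences g_d^theta(x, y) for one distance and one angle."""
    dr, dc = (step * d for step in UNIT_STEPS[angle])
    rows, cols = len(image), len(image[0])
    counts = [[0] * levels for _ in range(levels)]
    for i in range(rows):
        for j in range(cols):
            ni, nj = i + dr, j + dc
            # Pairs whose second pixel falls outside the image are skipped.
            if 0 <= ni < rows and 0 <= nj < cols:
                counts[image[i][j]][image[ni][nj]] += 1
    return counts


def mean_glcm(image, levels, distances=(1, 2), angles=(0, 45, 90, 135)):
    """Pool the GLCMs over all distances and angles, then normalise to sum to 1."""
    pooled = [[0.0] * levels for _ in range(levels)]
    for d in distances:
        for angle in angles:
            g = glcm(image, d, angle, levels)
            for x in range(levels):
                for y in range(levels):
                    pooled[x][y] += g[x][y]
    total = sum(map(sum, pooled))
    if total == 0:
        raise ValueError("image is too small to contain any pixel pair")
    return [[value / total for value in row] for row in pooled]


if __name__ == "__main__":
    toy = [[0, 0, 1], [1, 2, 2]]  # a 2 x 3 image with K = 3 gray levels
    print(glcm(toy, d=1, angle=0, levels=3))
```

Normalizing the pooled counts gives the same matrix as averaging the eight count matrices and then normalizing, so the result can be read as co-occurrence probabilities.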

3.2. Overall architecture

Our overall method can be divided into two parts: the curriculum selection module and the texture-aware MAE pre-training. The curriculum selection module first reorders the entire unordered pre-training dataset in ascending order of the texture complexity of each sample. Subsequently, the selection module selects the corresponding curriculum subset from the entire dataset based on the current pre-training curriculum stage. As the curriculum stage progresses, the size of the curriculum subset increases, and the subset includes a larger number of samples with complex textures. In the final curriculum stage, the curriculum selection module uses the entire dataset as the current curriculum dataset. The curriculum selection process is illustrated in Figure 2. Once the selection module has chosen the current pre-training dataset, the sampler draws batches from it to serve as input for the subsequent model.

Fig. 2.

Overall structure of our proposed method, where the curriculum selection module selects an appropriate subset from the pre-training dataset, tailored to the current stage of the training process.

The input radiograph first passes through the CNN-based texture-aware patch embedding module, yielding an $N \times L$ patch sequence, which is then fed into the ViT-based encoder to model the contextual information between patches. Similar to the original MAE framework, we input only the visible (unmasked) visual tokens and the learnable global $CLS$ token into the ViT backbone, and positional encodings are added to the remaining local patches, enabling the modeling of both contextual and spatial information for the tokens. After passing the sequence through the encoder, we insert learnable mask tokens at the original positions of the masked patches. Then, the sequence is input into the lightweight decoder for reconstruction. During the pretraining phase, we optimize the parameters of the encoder-decoder using the reconstruction loss from MAE and the GLCM constraint loss. During the fine-tuning and inference stages, we remove the decoder and connect classification heads directly to the encoder to accomplish various downstream tasks.

3.3. Radiologist-like curriculum learning

Our curriculum strategy is based on the texture complexity of the input radiograph samples. As shown in Figure 1, the curriculum selection module determines the training data subset of each training stage, such that the size and overall difficulty of the subsets gradually increase throughout the training process. In the last pre-training curriculum stage, the entire dataset is given to the model as the final curriculum.

We merge two commonly used chest radiograph datasets [14,34] into one large-scale pretraining dataset, obtaining 336,436 chest radiographs. However, these 336,436 chest radiographs are randomly ordered and not arranged according to the inherent information within each image. The entropy of the global GLCM is commonly used to describe the texture of a visual signal. Based on this observation, we rearrange the entire dataset in ascending order according to the entropy and energy values of each radiograph’s GLCM. The entropy value of the GLCM is formulated as:

H (\cdot) = - \sum_{x = 1}^{K} \sum_{y = 1}^{K} g (x, y) \log_{2} (g (x, y))

(2)

$H$ denotes the entropy value of the radiograph’s GLCM. We use $H (\cdot)$ to score each sample in the pretraining dataset.

We treat the whole training process as $T$ stages, and let $C = \{Q_{1}, \dots, Q_{t}, \dots, Q_{T}\}$ denote the overall curriculum, where $Q_{t}$ is the $t$-th curriculum. We divide the whole dataset $D$ into $T$ splits, $D = \{d_{1}, \dots, d_{t}, \dots, d_{T}\}$, such that the mean GLCM entropy of a split with a higher index is greater than that of a split with a lower index. At curriculum $Q_{t}$, the pretraining dataset is $D_{t} = d_{1} \cup d_{2} \cup \dots \cup d_{t}$. Our texture-enhanced MAE is pre-trained in this stage-wise manner.
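The scoring and selection steps above can be sketched as follows. This is a minimal illustration, not the authors' code: `glcm_entropy` and `build_curricula` are hypothetical names, the GLCM passed to the entropy function is assumed to be normalized to probabilities (zero entries are taken to contribute nothing to Eq. (2)), only the entropy score is used for ordering, and the splits $d_{1}, \dots, d_{T}$ are assumed to be of near-equal size, which the text does not specify.

```python
import math


def glcm_entropy(p):
    """Eq. (2) on a normalised GLCM; entries equal to 0 contribute nothing."""
    return -sum(v * math.log2(v) for row in p for v in row if v > 0)


def build_curricula(entropies, num_stages):
    """Return, for each stage t, the sample indices of D_t = d_1 U ... U d_t.

    Samples are ordered by ascending entropy. Assumption: the T splits are of
    near-equal size, so stage t keeps the first t/T fraction of the ordering.
    """
    order = sorted(range(len(entropies)), key=lambda idx: entropies[idx])
    n = len(order)
    bounds = [(t * n) // num_stages for t in range(1, num_stages + 1)]
    return [order[:bound] for bound in bounds]


if __name__ == "__main__":
    scores = [0.3, 0.1, 0.2, 0.4]  # toy entropy scores for four radiographs
    for stage, subset in enumerate(build_curricula(scores, num_stages=2), start=1):
        print(f"stage {stage}: {len(subset)} samples")
```

Because each stage is a prefix of the same entropy-sorted ordering, every later curriculum contains all earlier ones, and the last stage always covers the full dataset.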

3.4. Texture-aware patch embedding

Inspired by [7], we propose a lightweight CNN-based texture-aware patch embedding module to mitigate the preference of ViT towards structural and geometric information. In the vanilla MAE structure, the patch embedding step is achieved through a carefully designed Conv2D layer, so that the shape of the embedded patch sequence equals $N \times L$. Here $N$ denotes the number of patches, and $L$ denotes the embedding dimension of each patch.

As demonstrated by previous research, a CNN-based encoder tends to focus on texture information, while a ViT-based encoder emphasizes structural and shape information. Therefore, we begin by utilizing a texture-sensitive light-weight CNN to encode the input radiograph into a feature map of the same size as the original image, capturing rich texture information. After that, we patchify the feature map into $16 \times 16$ patches, and use a linear projection layer to obtain the patch embeddings.

3.5. Masked radiograph modeling

Similar to the original MAE framework, we employ an asymmetric decoder to reconstruct the original input radiograph from a sequence composed of mask tokens and visible tokens. Here, we use the L2 (MSE) loss to measure the distance between the reconstructed patch and the original patch:

L_{MRM} = \frac{1}{M \times N} \sum_{i = 1}^{M} \sum_{j = 1}^{N} {(\hat{I} (i, j) - I (i, j))}^{2}

(3)

Here $\hat{I}$ denotes the original input radiograph and $I$ denotes its reconstruction. As in MAE, we calculate the L2 loss only on the masked patches, i.e., between each masked original patch and the corresponding reconstructed patch.

3.6. GLCM constraint

After the decoder reconstructs the input sample, we use Eq. (1) to calculate the mean GLCM of the reconstructed image, denoted $G_{reconstruction}$, alongside the mean GLCM $\hat{G}$ of the original input. Just as we use the pixel-wise L2 distance to model the pixel-level reconstruction loss, we use the L2 distance between the GLCM of the original input sample and that of the reconstructed sample as a constraint term. This term constrains the distance between the global texture features of the original input and of the reconstructed image, and can be formulated as:

L_{global}^{Con} = \frac{1}{K^{2}} \sum_{y = 1}^{K} \sum_{x = 1}^{K} {(\hat{P} (x, y) - P (x, y))}^{2}

(4)

Here $\hat{P} (x, y)$ denotes the element in row $x$ and column $y$ of the input radiograph’s normalized GLCM, i.e., the co-occurrence probability of gray levels $x$ and $y$, and $P (x, y)$ denotes the corresponding element for the reconstructed image.

We also compute the L2 loss between the local GLCMs of each original masked patch and of the corresponding reconstructed patch, which can be formulated as:

L_{local}^{Con} = \frac{1}{N \times K^{2}} \sum_{i = 1}^{N} \sum_{y = 1}^{K} \sum_{x = 1}^{K} {({\hat{P}}_{i} (x, y) - P_{i} (x, y))}^{2}

(5)

Here $i$ denotes the $i$-th patch, and $N$ denotes the number of masked patches.

The whole GLCM constraint loss is:

L_{Constraint} = L_{global}^{Con} + L_{local}^{Con}

(6)

In summary, the overall pretraining loss function is:

L_{Overall} = L_{MRM} + λ L_{Constraint}

(7)

Here $λ$ is set to 0.2 to balance the scale of $L_{Constraint}$.
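As a numerical illustration of Eqs. (4) to (7), the sketch below evaluates the loss terms on small hand-written matrices. The function names (`mse`, `glcm_constraint`, `overall_loss`) are hypothetical, and the normalized GLCMs are assumed to be given as nested lists. The sketch only computes loss values; how gradients are propagated through the GLCM computation during training is not described in the text and is not addressed here.

```python
def mse(a, b):
    """Mean squared difference between two equally sized matrices."""
    flat_a = [v for row in a for v in row]
    flat_b = [v for row in b for v in row]
    return sum((x - y) ** 2 for x, y in zip(flat_a, flat_b)) / len(flat_a)


def glcm_constraint(p_orig, p_rec, patch_p_orig, patch_p_rec):
    """Eq. (6): global GLCM term (Eq. 4) plus local per-patch term (Eq. 5)."""
    global_term = mse(p_orig, p_rec)
    # Eq. (5): average the per-patch K x K squared error over the N masked patches.
    local_term = sum(
        mse(po, pr) for po, pr in zip(patch_p_orig, patch_p_rec)
    ) / len(patch_p_orig)
    return global_term + local_term


def overall_loss(l_mrm, l_constraint, lam=0.2):
    """Eq. (7), with the weight lambda = 0.2 stated in the text."""
    return l_mrm + lam * l_constraint


if __name__ == "__main__":
    same = [[0.5, 0.5], [0.0, 0.0]]
    patch_orig = [[[1.0, 0.0], [0.0, 0.0]]]  # one masked patch, K = 2
    patch_rec = [[[0.0, 0.0], [0.0, 1.0]]]
    constraint = glcm_constraint(same, same, patch_orig, patch_rec)
    print(overall_loss(l_mrm=1.0, l_constraint=constraint))
```

In this toy case the global GLCMs agree, so only the local term contributes to the constraint.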

4. Experiments

In this section, we first introduce the setup of our pretraining process. We then evaluate the pretrained model on various downstream tasks, including multilabel thorax disease classification and disease location detection. After that, we conduct an ablation study on the proposed modules of GXMAE. Finally, the visualization results show that our method can capture the texture information within radiograph images.

4.1. Pretrain setups

Pretraining dataset

We merge the NIH Chest X-ray [34] and Stanford CheXpert [14] datasets into one large-scale unlabeled image set as the pretraining dataset. The NIH Chest X-ray 14 dataset includes 112,120 radiographs from 30,805 unique patients, labeled with 14 thorax disease classes. The Stanford CheXpert dataset comprises 224,316 chest radiographs (191,028 frontal-view and 33,288 lateral-view) from 65,240 patients. In total, our pretraining dataset consists of 336,436 chest radiographs. For data preprocessing, we applied random resized cropping and horizontal flipping as our data augmentation strategy to ensure variability in the training data, which helps the model generalize better to unseen cases.

Implementation details The parameters of the ViT blocks are initialized with Xavier uniform initialization.

4.2. Evaluations on downstream tasks

We validated our method on two commonly used datasets for lung disease classification. The experimental results demonstrate that our approach outperforms previous methods, indicating its ability to extract low-level and high-level features highly relevant to the diseases. This validates the effectiveness of our improvements and underscores the significance of our approach in extracting disease-related information. Moreover, we fine-tuned our pretrained encoder on the NIH subset for thorax disease localization; the results show that our method achieves acceptable performance compared with other single-task thorax disease detection methods.

4.2.1. Thorax disease classification

NIH Chest X-ray We take the official training split of the NIH CXR dataset as the fine-tuning dataset. The final result is evaluated on the official validation split of the NIH dataset. Because the NIH dataset is multilabel, we use the Area Under the ROC Curve (AUROC, AUC) to evaluate our framework on the disease classification task. As shown in Table 1, our approach improves the mean AUC by 0.8 percentage points over the original MAE method (83.5% vs. 82.7%), underscoring the role that texture-aware design plays in enhancing the overall performance of the model. This supports the inclusion of GLCM and curriculum learning as components of our framework. Due to the lack of inductive biases and the data hunger of ViT, we observed that the classification results of a randomly initialized ViT were even worse than those of a CNN under the same conditions. However, the self-supervised pretrained ViT outperforms the self-supervised pretrained CNN networks. This indicates the crucial importance of in-domain self-supervised pretraining for ViT. The fact that our method outperforms other self-supervised strategies also highlights the importance of texture information for thorax disease classification.
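For reference, the mean AUC metric used in this evaluation can be sketched in plain Python as the per-pathology AUROC averaged over pathologies. This is a hedged illustration rather than the evaluation script used in the paper: `auroc` and `mean_auroc` are hypothetical names, labels are assumed to be binary (0 or 1) for each pathology, and the quadratic pairwise formulation is chosen for readability, not efficiency.

```python
def auroc(labels, scores):
    """Probability that a random positive outranks a random negative.

    Ties between a positive and a negative score count as 0.5.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mean_auroc(label_matrix, score_matrix):
    """Average the per-class AUROC over the columns of a multilabel problem."""
    num_classes = len(label_matrix[0])
    per_class = [
        auroc([row[c] for row in label_matrix], [row[c] for row in score_matrix])
        for c in range(num_classes)
    ]
    return sum(per_class) / num_classes


if __name__ == "__main__":
    y_true = [[0, 1], [0, 0], [1, 1], [1, 0]]  # 4 radiographs, 2 pathologies
    y_score = [[0.1, 0.9], [0.4, 0.2], [0.35, 0.8], [0.8, 0.1]]
    print(mean_auroc(y_true, y_score))
```

Each pathology is scored independently, so a radiograph with several positive labels contributes to several per-class AUROC values.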

Table 1.
Fine-tuned classification results on the NIH CXR dataset. The official train split was used for fine-tuning the pre-trained ViT encoder, and validation was conducted using the corresponding official validation split.

	Supervised	Self-Supervised
Pathology	CXR8-R50 [34]	ChestNet [33]	CheXNet [26]	SwinCheX [31]	${Enc}_{t}$ [38]	XProtoNet [17]	DiRA [10]	MAE [35]	Ours
Cardiomegaly	81.4	87.4	90.1	87.5	85.7	88.7	88.9	89.9	89.1
Emphysema	75.3	82.2	87.1	91.4	85.5	94.1	92.5	92.1	91.7
Edema	72.8	83.2	83.4	84.8	83.3	84.0	83.7	82.8	84.3
Hernia	74.9	89.9	86.7	85.5	88.5	90.9	91.5	92.2	90.5
Pneumothorax	78.9	80.9	82.3	87.1	84.9	87.1	87.3	87.6	89.8
Effusion	73.6	81.1	81.4	82.4	82.1	83.5	83.9	84.5	83.9
Mass	56.0	78.3	79.6	82.2	79.7	83.1	82.9	83.5	81.7
Fibrosis	70.4	80.4	79.8	82.6	80.2	81.5	83.3	84.7	82.2
Atelectasis	70.6	74.3	75.1	78.1	75.2	78.0	78.5	79.5	83.8
Consolidation	68.2	72.5	71.7	74.8	72.9	74.7	75.3	76.6	77.5
Pleural Thicken	68.7	75.1	75.8	77.8	74.6	79.9	80.9	81.9	82.1
Nodule	71.6	69.7	73.4	78.0	73.5	80.4	77.1	75.8	76.3
Pneumonia	63.3	69.5	70.5	71.3	69.0	73.4	73.9	74.9	84.7
Infiltration	61.2	67.7	68.5	70.1	67.9	71.0	71.5	72.2	71.2
Mean(%)	70.5	78.0	79.0	81.0	78.8	82.2	82.2	82.7	83.5

Stanford CheXpert Stanford CheXpert [14] is a large-scale chest X-ray image dataset containing 191,028 frontal-view and 33,288 lateral-view radiographs of 65,240 patients. The training set images are annotated with 14 thorax disease labels. In this study, we evaluated our fine-tuned model on the official validation set of CheXpert, using five common thorax diseases as the benchmark. This allows for a direct comparison with existing methods on the same set of diseases. As shown in Table 2, we use the mean AUC to compare our method with other popular methods.

Table 2.

Comparison of fine-tuned classification performance on the CheXpert dataset. Various methods and backbones are evaluated, with mAUC serving as the performance metric.

Method	Backbone	mAUC(%)
Allaouzi et al. [1]	DenseNet-121	82.8
Pham et al. [24]		89.4
Hosseinzadeh et al. [28]		87.1
DiRA [10]		87.6
Kang et al. [16]		89.0
MAE [35]	ViT-B/16	89.3
Ours	ViT-B/16	89.6

4.3. Ablation studies

Using the official validation split of the NIH Chest X-ray 14 thorax disease classification dataset, we conducted an ablation study of the GLCM constraint and the curriculum learning strategy. As shown in Table 3, the vanilla ViT-B model trained from scratch achieves a mean AUC of 74.1%. With the vanilla MAE pretraining setting, the ViT-B model fine-tuned on NIH achieves 82.3%. Adding only the curriculum sampler raises this to 83.1%. Combining the GLCM constraint, curriculum learning, and MAE pretraining, the ViT-B model achieves the highest mAUC of 83.5%.

Table 3.
Ablation study results for the key components of the proposed architecture. The table compares the impact of including or excluding GLCM constraint, curriculum learning, and MAE on mAUC, demonstrating the benefits of the full model over baselines.

GLCM Constraint	Curriculum	MAE	mAUC(%)
w/o	w/o	w/o	74.1
w/o	w/o	✓	82.3
w/o	✓	✓	83.1
✓	✓	✓	83.5

In addition, we conducted ablation experiments based on different pre-training strategies to compare our pre-training strategy with others. As shown in Table 4, our pre-training strategy achieved the highest mAUC value in all experiments, demonstrating its effectiveness.

Table 4.

Ablation study results comparing various pre-training strategies and their impact on model performance (measured by mAUC) across different network architectures.

Method	Pretraining strategy	Pretraining data	mAUC(%)
CXR8-R50	Supervised	ImageNet	69.6
DenseNet-121	Supervised	ImageNet	78.0
DenseNet-121	MoCo	NIH CXR	79.3
DenseNet-121	MAE	NIH CXR	81.5
ViT-B	Supervised	ImageNet	74.1
ViT-B	MoCo	NIH CXR	81.0
ViT-B	MAE	NIH CXR	82.3
ViT-B	Ours	NIH CXR	83.5

4.4. Visualization and analysis

In this section, we analyze our model using a series of visualization methods. First, we visualize the features extracted by our method using t-SNE for dimensionality reduction. Then, we employ a GLCM-based method to visualize the texture in our reconstructed images. Finally, we use Grad-CAM [27] to extract the attention of the last layer of our ViT encoder and generate attention heatmaps on the input radiograph, visualizing the regions the model attends to when predicting disease.

Reconstruction GLCM visualization We use the visualization of the GLCM feature graph to analyze the texture information of the reconstructed image. As shown in Figure 3, the GLCM visualization is obtained by dividing the image into 2 × 2 patches, calculating the GLCM and its features for each patch, and then mapping the feature values onto the central pixel of the patch. The radiograph reconstructed by our method contains sharper edges, which illustrates that our method preserves more texture details.

Fig. 3.

GLCM visualization comparison. Our method reconstructs more texture detail than the vanilla MAE method.

4.5. Conclusion

We present a curriculum self-supervised CXR representation learning method based on texture information. Leveraging a data sampler that reorders the entire pretraining data according to the complexity of texture in each learning sample, we construct different subsets (referred to as “curricula”) arranged from easy to difficult. Our improved MAE CXR representation learning framework is pretrained in a progressively increasing difficulty manner. The results from various downstream experiments demonstrate that our approach achieves superior performance compared to the baseline method while addressing the issue of ViT’s limited sensitivity to texture in the original MAE. Furthermore, the proposed module indeed enhances the effectiveness of the baseline method according to the results of downstream experiments. In the future, we will explore methods to enhance the causal interpretability of our approach and improve its credibility in clinical applications.

Acknowledgements

This work was supported by the National Natural Science Foundation of China under Grant 22033002 and Grant 92370127.

Conflict of interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

The datasets used in this study are publicly available and can be accessed from the following sources:

– NIH Chest X-ray Dataset: This dataset contains 112,120 frontal-view X-ray images of 30,805 unique patients, annotated with 14 disease labels. It is available at https://nihcc.app.box.com/v/ChestXray-NIHCC.

– Stanford CheXpert Dataset: This dataset includes 224,316 chest radiographs of 65,240 patients, annotated with 14 disease labels. It is available at https://stanfordmlgroup.github.io/competitions/chexpert/.

References

1. Allaouzi and M.B. Ahmed, A novel approach for multi-label chest X-ray classification of common thorax diseases, IEEE Access 7 (2019), 64279–64288. doi:10.1109/ACCESS.2019.2916849.

2. Bao, Dong and Wei, BEiT: BERT pre-training of image transformers, 2021, CoRR, arXiv:2106.08254.

3. Bengio, Louradour, Collobert and Weston, Curriculum learning, in: Proceedings of the 26th Annual International Conference on Machine Learning, ICML 2009, Montreal, Quebec, Canada, June 14–18, 2009, A.P. Danyluk, Bottou and M.L. Littman, eds, ACM International Conference Proceeding Series, Vol. 382, 2009, pp. 41–48. doi:10.1145/1553374.1553380.

4. Chen, Kornblith, Norouzi and G.E. Hinton, A simple framework for contrastive learning of visual representations, in: Proceedings of the 37th International Conference on Machine Learning, ICML 2020, Virtual Event, 13–18 July 2020, Proceedings of Machine Learning Research, Vol. 119, 2020, pp. 1597–1607, http://proceedings.mlr.press/v119/chen20j.html.

5. Chen, Exploring simple Siamese representation learning, in: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021, Virtual, June 19–25, 2021, 2021, pp. 15750–15758, https://openaccess.thecvf.com/content/CVPR2021/html/Chen_Exploring_Simple_Siamese_Representation_Learning_CVPR_2021_paper.html. doi:10.1109/CVPR46437.2021.01549.

6. Devlin, Chang, Lee and Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2–7, 2019, Vol. 1 (Long and Short Papers), Burstein, Doran and Solorio, eds, 2019, pp. 4171–4186. doi:10.18653/v1/n19-1423.

7. Geirhos, Rubisch, Michaelis, Bethge, F.A. Wichmann and Brendel, ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness, in: 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6–9, 2019, 2019, https://openreview.net/forum?id=Bygh9j09KX.

8. Grill, Strub, Altché, Tallec, P.H. Richemond, Buchatskaya, Doersch, B.Á. Pires, Guo, M.G. Azar, Piot, Kavukcuoglu, Munos and Valko, Bootstrap your own latent – a new approach to self-supervised learning, in: Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, Virtual, December 6–12, 2020, Larochelle, Ranzato, Hadsell, Balcan and Lin, eds, 2020. https://proceedings.neurips.cc/paper/2020/hash/f3ada80d5c4ee70142b17b8192b2958e-Abstract.html.

9. Guo, S.A. Somayajula, Hosseini and Xie, Improving image classification of gastrointestinal endoscopy using curriculum self-supervised learning, Scientific Reports 14(1) (2024), 6100. doi:10.1038/s41598-024-53955-8.

10. Haghighi, M.R.H. Taher, M.B. Gotway and Liang, DiRA: Discriminative, restorative, and adversarial learning for self-supervised medical image analysis, in: IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18–24, 2022, 2022, pp. 20792–20802. doi:10.1109/CVPR52688.2022.02016.

11. R.M. Haralick, K.S. Shanmugam and Dinstein, Textural features for image classification, IEEE Trans. Syst. Man Cybern. 3(6) (1973), 610–621. doi:10.1109/TSMC.1973.4309314.

12. Chen, Xie, Dollár and R.B. Girshick, Masked autoencoders are scalable vision learners, in: IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18–24, 2022, 2022, pp. 15979–15988. doi:10.1109/CVPR52688.2022.01553.

13. Fan, Xie and R.B. Girshick, Momentum contrast for unsupervised visual representation learning, in: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13–19, 2020, 2020, pp. 9726–9735. doi:10.1109/CVPR42600.2020.00975.

14. Irvin, Rajpurkar, Ciurea-Ilcus, Chute, Marklund, Haghgoo, R.L. Ball, K.S. Shpanskaya, Seekins, D.A. Mong, S.S. Halabi, J.K. Sandberg, Jones, D.B. Larson, C.P. Langlotz, B.N. Patel, M.P. Lungren and A.Y. Ng, CheXpert: A large chest radiograph dataset with uncertainty labels and expert comparison, in: The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI 2019, the Thirty-First Innovative Applications of Artificial Intelligence Conference, IAAI 2019, the Ninth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2019, Honolulu, Hawaii, USA, January 27–February 1, 2019, 2019, pp. 590–597. doi:10.1609/aaai.v33i01.3301590.

15. Jiang, Meng, Mitamura and A.G. Hauptmann, Easy samples first: Self-paced reranking for zero-example multimedia search, in: Proceedings of the ACM International Conference on Multimedia, MM ’14, Orlando, FL, USA, November 03–07, 2014, K.A. Hua, Rui, Steinmetz, Hanjalic, Natsev and Zhu, eds, 2014, pp. 547–556. doi:10.1145/2647868.2654918.

16. Kang, A.L. Yuille and Zhou, Data, assemble: Leveraging multiple datasets with heterogeneous and partial labels, 2021, CoRR, arXiv:2109.12265.

17. Kim, Seo and Yoon, XProtoNet: Diagnosis in chest radiography with global and local explanations, in: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021, Virtual, June 19–25, 2021, 2021, pp. 15719–15728, https://openaccess.thecvf.com/content/CVPR2021/html/Kim_XProtoNet_Diagnosis_in_Chest_Radiography_With_Global_and_Local_Explanations_CVPR_2021_paper.html. doi:10.1109/CVPR46437.2021.01546.

18. D.P. Kingma and Welling, Auto-encoding variational Bayes, in: 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14–16, 2014, Conference Track Proceedings, Bengio and LeCun, eds, 2014. http://arxiv.org/abs/1312.6114.

19. Lhoneux, Zhang and Søgaard, Zero-shot dependency parsing with worst-case aware automated curriculum learning, in: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), ACL 2022, Muresan, Nakov and Villavicencio, eds, Dublin, Ireland, May 22–27, 2022, 2022, pp. 578–587. doi:10.18653/v1/2022.acl-short.64.

20. Wei and Chen, PointSmile: Point self-supervised learning via curriculum mutual information, 2023, CoRR, arXiv:2301.12744. doi:10.48550/arXiv.

21. Liu, Zhang, Hou, Mian, Wang, Zhang and Tang, Self-supervised learning: Generative or contrastive, IEEE Trans. Knowl. Data Eng. 35(1) (2023), 857–876. doi:10.1109/TKDE.2021.3090866.

22. Oord and Vinyals, Representation learning with contrastive predictive coding, 2018, CoRR, arXiv:1807.03748.

23. Park, Cho, S.M. Lee, Y.-H. Cho, E.S. Lee, K.H. Lee, J.B. Seo and Kim, A curriculum learning strategy to enhance the accuracy of classification of various lesions in chest-PA X-ray screening for pulmonary abnormalities, Scientific Reports 9(1) (2019), 1–9. doi:10.1038/s41598-019-56882-1.

24. H.H. Pham, T.T. Le, D.Q. Tran, D.T. Ngo and H.Q. Nguyen, Interpreting chest X-rays via CNNs that exploit hierarchical disease dependencies and uncertainty labels, Neurocomputing 437 (2021), 186–194. doi:10.1016/j.neucom.2020.03.127.

25. Pouget and Dedieu, Applying self-supervised learning to image quality assessment in chest CT imaging, Bioengineering 11(4) (2024), 335. doi:10.3390/bioengineering11040335.

26. Rajpurkar, Irvin, Zhu, Yang, Mehta, Duan, D.Y. Ding, Bagul, C.P. Langlotz, K.S. Shpanskaya, M.P. Lungren and A.Y. Ng, CheXNet: Radiologist-level pneumonia detection on chest X-rays with deep learning, 2017, CoRR, arXiv:1711.05225.

27. R.R. Selvaraju, Cogswell, Das, Vedantam, Parikh and Batra, Grad-CAM: Visual explanations from deep networks via gradient-based localization, in: IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22–29, 2017, 2017, pp. 618–626. doi:10.1109/ICCV.2017.74.

28. M.R.H. Taher, Haghighi, Feng, M.B. Gotway and Liang, A systematic benchmarking analysis of transfer learning for medical image analysis, in: Domain Adaptation and Representation Transfer, and Affordable Healthcare and AI for Resource Diverse Global Health – Third MICCAI Workshop, DART 2021, and First MICCAI Workshop, FAIR 2021, Held in Conjunction with MICCAI 2021, Proceedings, Strasbourg, France, September 27 and October 1, 2021, Albarqouni, M.J. Cardoso, Dou, Kamnitsas, Khanal, Rekik, Rieke, Sheet and S.A. Tsaftaris, eds, Lecture Notes in Computer Science, Vol. 12968, 2021, pp. 3–13. doi:10.1007/978-3-030-87722-4_1.

29. M.R.H. Taher, Haghighi, M.B. Gotway and Liang, CAiD: Context-aware instance discrimination for self-supervised learning in medical imaging, in: International Conference on Medical Imaging with Deep Learning, MIDL 2022, 6–8 July 2022, Konukoglu, B.H. Menze, Venkataraman, C.F. Baumgartner, Dou and Albarqouni, eds, Proceedings of Machine Learning Research, Vol. 172, Zurich, Switzerland, 2022, pp. 535–551, https://proceedings.mlr.press/v172/hosseinzadeh-taher22a.html.

30. Tang, Yang, H.R. Roth, B.A. Landman, Nath and Hatamizadeh, Self-supervised pre-training of Swin transformers for 3D medical image analysis, in: IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18–24, 2022, 2022, pp. 20698–20708. doi:10.1109/CVPR52688.2022.02007.

31. Taslimi, Fathi, Salehi and M.H. Rohban, SwinCheX: Multi-label classification on chest X-ray images with transformers, 2022, CoRR, arXiv:2206.04246. doi:10.48550/arXiv.

32. Wang, Liu, Zhou and Yang, Curriculum pre-training for end-to-end speech translation, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5–10, 2020, Jurafsky, Chai, Schluter and J.R. Tetreault, eds, 2020, pp. 3728–3738. doi:10.18653/v1/2020.acl-main.344.

33. Wang and Xia, ChestNet: A deep neural network for classification of thoracic diseases on chest radiography, 2018, CoRR, arXiv:1807.03058.

34. Wang, Peng, Bagheri and Summers, ChestX-ray8: Hospital-scale chest X-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases, in: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 3462–3471. doi:10.1109/CVPR.2017.369.

35. Xiao, Bai, A.L. Yuille and Zhou, Delving into masked autoencoders for multi-label thorax disease classification, in: IEEE/CVF Winter Conference on Applications of Computer Vision, WACV 2023, Waikoloa, HI, USA, January 2–7, 2023, 2023, pp. 3577–3589. doi:10.1109/WACV56688.2023.00358.

36. Xie, Zhang, Cao, Lin, Bao, Yao and Dai, SimMIM: A simple framework for masked image modeling, in: IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18–24, 2022, 2022, pp. 9643–9653. doi:10.1109/CVPR52688.2022.00943.

37. Zhang, Jiang, Miura, C.D. Manning and C.P. Langlotz, Contrastive learning of medical visual representations from paired images and text, in: Proceedings of the Machine Learning for Healthcare Conference, MLHC 2022, 5–6 August 2022, Z.C. Lipton, Ranganath, M.P. Sendak, M.W. Sjoding and Yeung, eds, Proceedings of Machine Learning Research, Vol. 182, Durham, NC, USA, 2022, pp. 2–25, https://proceedings.mlr.press/v182/zhang22a.html.

38. Zhou, Bae, Liu, Singh, Green, Samaras and Prasanna, Chest radiograph disentanglement for COVID-19 outcome prediction, in: Medical Image Computing and Computer Assisted Intervention – MICCAI 2021 – 24th International Conference, Proceedings, Part VII, Strasbourg, France, September 27–October 1, 2021, Bruijne, P.C. Cattin, Cotin, Padoy, Speidel, Zheng and Essert, eds, Lecture Notes in Computer Science, Vol. 12907, 2021, pp. 345–355. doi:10.1007/978-3-030-87234-2_33.

	Supervised				Self-Supervised
Pathology	CXR8-R50 [34]	ChestNet [33]	CheXNet [26]	SwinCheX [31]	${Enc}_{t}$ [38]	XProtoNet [17]	DiRA [10]	MAE [35]	Ours
Cardiomegaly	81.4	87.4	90.1	87.5	85.7	88.7	88.9	89.9	89.1
Emphysema	75.3	82.2	87.1	91.4	85.5	94.1	92.5	92.1	91.7
Edema	72.8	83.2	83.4	84.8	83.3	84.0	83.7	82.8	84.3
Hernia	74.9	89.9	86.7	85.5	88.5	90.9	91.5	92.2	90.5
Pneumothorax	78.9	80.9	82.3	87.1	84.9	87.1	87.3	87.6	89.8
Effusion	73.6	81.1	81.4	82.4	82.1	83.5	83.9	84.5	83.9
Mass	56.0	78.3	79.6	82.2	79.7	83.1	82.9	83.5	81.7
Fibrosis	70.4	80.4	79.8	82.6	80.2	81.5	83.3	84.7	82.2
Atelectasis	70.6	74.3	75.1	78.1	75.2	78.0	78.5	79.5	83.8
Consolidation	68.2	72.5	71.7	74.8	72.9	74.7	75.3	76.6	77.5
Pleural Thicken	68.7	75.1	75.8	77.8	74.6	79.9	80.9	81.9	82.1
Nodule	71.6	69.7	73.4	78.0	73.5	80.4	77.1	75.8	76.3
Pneumonia	63.3	69.5	70.5	71.3	69.0	73.4	73.9	74.9	84.7
Infiltration	61.2	67.7	68.5	70.1	67.9	71.0	71.5	72.2	71.2
Mean(%)	70.5	78.0	79.0	81.0	78.8	82.2	82.2	82.7	83.5
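As a quick consistency check on the table above, the mean row can be recomputed from the per-pathology scores. The short script below does this for the MAE [35] and Ours columns only. The values are copied from the table in row order, and the variable names are illustrative.

```python
from statistics import fmean

# Per-pathology scores copied from the "MAE [35]" and "Ours" columns, in table row order.
mae = [89.9, 92.1, 82.8, 92.2, 87.6, 84.5, 83.5, 84.7, 79.5, 76.6, 81.9, 75.8, 74.9, 72.2]
ours = [89.1, 91.7, 84.3, 90.5, 89.8, 83.9, 81.7, 82.2, 83.8, 77.5, 82.1, 76.3, 84.7, 71.2]

mean_mae = round(fmean(mae), 1)
mean_ours = round(fmean(ours), 1)
# Number of pathologies on which the "Ours" column is strictly higher than the MAE column.
wins = sum(o > m for o, m in zip(ours, mae))

print(mean_mae, mean_ours, wins)
```

With these inputs the recomputed means agree with the reported 82.7 and 83.5, and the Ours column is higher on 7 of the 14 pathologies, with the largest gains on Pneumonia and Atelectasis.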