LungNet-ViT: Efficient lung disease classification using a multistage vision transformer model from chest radiographs

Abstract

This research introduces a Multistage-Vision Transformer (Multistage-ViT) model for precisely classifying various lung diseases using chest radiographic (CXR) images. The dataset in the proposed method includes four classes: Normal, COVID-19, Viral Pneumonia and Lung Opacity. This model demonstrates its efficacy on imbalanced and balanced datasets by enhancing classifier accuracy through deep feature extraction. It integrates backbone models with the ViT architecture, creating rigorously hybrid configurations compared to their standalone counterparts. These hybrid models utilize optimized features for classification, significantly improving their performance. Notably, the multistage-ViT model achieved accuracies of 99.93% on an imbalanced dataset and 99.97% on a balanced dataset using the InceptionV3 combined with the ViT model. These findings highlight the superior accuracy and robustness of multistage-ViT models, underscoring their potential to enhance lung disease classification through advanced feature extraction and model integration techniques. The proposed model effectively demonstrates the benefits of employing ViT for deep feature extraction from CXR images.

Keywords

lung disease classification deep features extraction chest X-rays vision transformer

Introduction

Lung diseases, which include COVID-19, pneumonia and lung opacities, pose a significant global health challenge due to their widespread prevalence and potential for severe health impacts. COVID-19, resulting from the SARS-CoV-2 virus, primarily targets the respiratory system, causing symptoms that range from mild cough to severe acute respiratory distress syndrome (ARDS).¹ Pneumonia is a condition where an infection causes the air sacs in one or both lungs to become inflamed. It can result from bacteria, viruses, or fungi, leading to symptoms such as chest pain, fever and breathing difficulties. Lung opacities, seen as areas of increased density on radiographic images, can indicate various conditions including infections, inflammations or tumors. Early and accurate detection of these conditions through imaging techniques such as chest X-rays (CXR) and computer tomography (CT) scans are essential for effective treatment and management, underscoring the importance of advancements in medical imaging and diagnostic technologies.^2,3 Expert evaluation of these images remains crucial, with fluctuations in density indicative of disease severity.

The rapid advancements in artificial intelligence (AI) and deep learning have revolutionized the field of medical image analysis. The accurate and timely diagnosis of lung diseases such as COVID-19, viral pneumonia and lung opacity is critical for effective treatment and patient management. Traditional convolutional neural networks (CNN) have demonstrated significant success in various image classification tasks.⁴ The recent development of Vision Transformers has introduced new possibilities for enhancing the accuracy and reliability of medical image analysis.^5,6

The Vision Transformer (ViT)⁷ utilizes self-attention mechanisms and significantly outperforms CNN.^8–10 ViT has demonstrated significant potential in the field of medical imaging, particularly in the detection and diagnosis of lung diseases using CXR and CT images. Leveraging the transformer architecture's self-attention mechanisms, ViT models can effectively capture global contextual information from medical images, which is crucial for identifying subtle anomalies and complex patterns associated with lung conditions.^11,12 ViT has been utilized for the automated detection of tuberculosis from CXR, showcasing superior performance over traditional CNN in terms of accuracy and robustness.¹³ Additionally, in the context of COVID-19, ViT models have been employed to analyze chest CT scans, providing accurate segmentation and classification of infected regions, thereby aiding in rapid and reliable diagnosis.¹⁴

Furthermore, the application of ViT in lung disease diagnosis extends to enhancing image quality and generating high-resolution images from low-quality scans.¹⁵ This capability is particularly beneficial in resource-limited settings where high-quality imaging equipment may not be available. ViT models have been used to enhance CT images, improving the visibility of fine details and supporting better diagnosis and treatment planning.¹⁶ Moreover, ViT has been applied to segment lung tissues and nodules from CT images, facilitating early detection and classification of lung cancer.¹⁷ Integrated ViT with advanced image processing techniques, researchers are continually pushing the boundaries of medical imaging, making significant strides in the early detection and management of lung diseases.¹⁸

This work offers the following significant contributions:

Propose a Multistage-Vision Transformer (Multistage-ViT) model to accurately classify various lung diseases using chest X-ray (CXR) images, ensuring suitability for both imbalanced and balanced datasets.

Improve classifier accuracy by leveraging deep feature extraction from CXR images through a combination of backbone models and the ViT model.

Conduct experiments with four different combinations of backbone models (VGG19, VGG16, InceptionV3 and ResNet50) integrated with the ViT model.

Perform extensive comparisons between these hybrid models and standalone backbone models, utilizing optimized features extracted from the hybrid models for classification.

Related work

The integration of AI and deep learning technologies in medical diagnostics has significantly advanced the accuracy and efficiency of identifying lung diseases. Vision Transformers (ViTs) have developed as a powerful alternative to traditional CNNs, offering superior capabilities in capturing long-term dependencies and extracting detailed semantic information.¹⁹ ViTs captures long-range dependencies, providing a more comprehensive analysis of chest X-ray (CXR) images. Tang et al. 2021 demonstrated that ViTs surpassed traditional models such as ResNet and VGG in terms of accuracy, sensitivity and specificity, highlighting their potential in medical image analysis.²⁰

ViTs have shown consistent performance across various datasets, including those with diverse image qualities and patient demographics. This robustness is essential for real-world applications, where data heterogeneity is common. Wang et al. 2023 found that ViTs maintain high diagnostic accuracy regardless of variations in the dataset, making them particularly valuable for clinical decision support systems.²¹ These systems can be seamlessly integrated into hospital information systems, providing real-time diagnostic assistance and streamlining clinical workflows, as described by.²²

ViTs have been highly effective in pneumonia detection, particularly when combined with CNN backbones like VGG16 and ResNet50.²³ Gao et al. 2021 demonstrated that hybrid models enhance classification accuracy and reduce false positives compared to standalone CNNs.²⁴ Lin et al. 2022 highlighted the effectiveness of ViT in detecting lung opacities, particularly in the early stages of conditions like ARDS. Their study showed that ViTs handle imbalanced datasets with remarkable precision and recall, making them a reliable tool for accurate clinical diagnostics.²⁵

The combination of ViTs with pre-trained CNNs, has led to higher accuracy and faster convergence, suitable for real-time critical care settings. This synergy exemplifies AI's future in medical imaging, offering tools for timely and accurate diagnoses, including COVID-19, pneumonia and lung opacity.²⁶ Recent developments include CAD systems using both CNN and ViT models for COVID-19 and other lung diseases. Models like VGG-16 and ViT achieved notable accuracies on balanced dataset. Ukwuoma et al. 2022 developed a deep learning model addressing blurriness and low contrast, using an ensemble technique and global second-order pooling with ViTs for detailed analysis.²⁷

Optimizing ViT models with different optimizers, including Adam, AdamW and RAdam, has significantly improved the accuracy of lung disease classification. RAdam showed the best performance on a balanced dataset, while FastViT with NAdam excelled on an imbalanced dataset.²⁸ This demonstrates the importance of choosing the right optimization strategy for enhancing model accuracy and robustness in handling diverse data distributions.

The exploration of EfficientNet, multiscale vision transformers (MViT) and other architectures has advanced COVID-19 diagnosis, though challenges like overfitting, especially in multiscale models, have been noted.²⁹ Fine-tuned ViT models have shown superior accuracy and robustness, enhancing clinical interpretation. Studies combining Convolutional Vision Transformers and CNNs have demonstrated high sensitivity and low false negative rates, proving effective in distinguishing various lung conditions.³⁰ ViT architecture has also achieved impressive results in pneumonia detection, confirming its efficacy in accuracy, sensitivity and specificity.³¹

The continuous integration and development of ViTs alongside existing technologies have enhanced their clinical applicability and reliability. The literature emphasizes the importance of ViTs in accurately classifying lung diseases, particularly through self-attention mechanisms.³² The combination of ViTs with CNNs, data augmentation and optimization techniques has significantly improved diagnostic accuracy.³³ This research focuses on refining these models, addressing overfitting and expanding their use across a broader spectrum of diseases.

Methodology

The proposed deep learning models outlined in this section follow standard procedures, including collecting CXR data, pre-processing images, splitting data, training, testing and analyzing performance. Figure 1 provides a block diagram of the proposed model which includes dataset distribution, feature extraction: primary and advanced deep features and classification.

Figure 1.

Block diagram for the proposed model.

Dataset

The dataset for this study is constructed using the COVID-19 Radiography database from Kaggle.³⁴ It comprises chest X-rays depicting normal, COVID-19, viral pneumonia and lung opacity (non-Covid lung infection) images as shown in Figure 2. The dataset includes X-ray images and their corresponding masks. Table 1 outlines the class distributions in two datasets, imbalance and balance, for training and testing a machine learning model in medical image classification. Imbalance dataset, which mimics real-world class distributions, has 16,930 training cases (2892 COVID-19, 4809 Lung Opacity, 8153 Normal and 1076 Viral Pneumonia) and 4235 testing cases (724 COVID-19, 1203 Lung Opacity, 2039 Normal and 269 Viral Pneumonia). Conversely, the balanced dataset ensures equal representation with 4304 training cases and 1076 testing cases, uniformly divided across all classes (1076 each for training and 269 each for testing). This configuration helps in assessing model performance in both realistic and controlled, balanced conditions. These CXR images are in PNG (Portable Network Graphics) format and have a resolution of 299 * 299 pixels.

Figure 2.

Sample CXR images of (a) COVID-19, (b) Lung opacity, (c) Normal and (d) Viral pneumonia.

Table 1.

Dataset distribution of the proposed method.

Dataset Distribution	Split	COVID-19	Lung Opacity	Normal	Viral pneumonia	Total
Imbalance	Training	2892	4809	8153	1076	16930
Imbalance	Testing	724	1203	2039	269	4235
Balance	Training	1076	1076	1076	1076	4304
Balance	Testing	269	269	269	269	1076

Feature extraction

The proposed model employs a two-stage approach for efficient feature extraction. In the first stage, a backbone model is utilized to extract features from CXR images and reduce their dimensionality, thereby decreasing the processing time required for the subsequent stage. In the second stage, the Vision Transformer (ViT) model is applied to these initial features for a more detailed and comprehensive feature extraction.

Stage 1: primary features

In the first stage, various backbone models are used to extract primary features from the input CXR images. To assess the model's effectiveness, different backbone models such as ResNet50, VGG16, VGG19 and InceptionV3 are evaluated. These models are initialized with weights from the ImageNet dataset and then applied to extract features from both imbalanced and balanced datasets.³⁵

VGG models

VGG16 and VGG19 are deep CNNs that excel in feature extraction, characterized by their structured architecture and small convolutional filters.³⁶ In this study, the CXR image size in the dataset is 299*299*3 pixels, resized to 256*256*3 with three color channels (RGB) to ensure standardized input for consistent feature extraction. The process begins with a series of convolutional layers using 3*3 filters, designed to extract various features such as edges and textures from the images. These layers produce feature maps that capture different aspects of the input images. After the convolutional layers, max-pooling layers are employed to downsample the spatial dimensions of the feature maps by half at each stage, thus progressively shrinking the feature map sizes to 128*128, 64*64, 32*32, 16*16 and finally 8*8 after the fifth pooling layer. The output from the last convolutional layer consists of feature maps sized 8*8*512 as shown in Table 2.

Table 2.

VGG16 and VGG19 model architectures: layer-wise architecture and parameters.

VGG Models
VGG19 Layer	VGG16 Layer	Filter size	Output size for VGG Models	Parameters
Input	Input	-	256* 256*3	0
2 Conv	2 Conv	3364	25625664	36928
Max pool	Max pool	Stride 2	12812864	0
2 Conv	2 Conv	33128	128128128	147584
Max pool	Max pool	Stride 2	6464128	0
4 Conv	3 Conv	33256	6464256	590080
Max pool	Max pool	Stride 2	3232256	0
4 Conv	3 Conv	33512	3232512	2359808
Max pool	Max pool	Stride 2	1616512	0
4 Conv	3 Conv	33512	1616512	2359808
Max pool	Max pool	Stride 2	88512	0

In the context of feature extraction, the fully connected layers typically present in the complete VGG16 and VGG19 architectures are not utilized. Instead, the feature set from the final convolutional layer serves as the extracted features. These models are pre-trained on the ImageNet dataset, which provides a robust foundation of learned features, enhancing their ability to extract relevant features from test CXR images. The primary features extracted by VGG16 and VGG19 are then used to extract deeper features, leveraging their detailed representation of the input images. The selection between VGG16 and VGG19 depends on the application specific requirements. VGG19, with its deeper architecture, provides more detailed feature extraction and increased processing capabilities.

ResNet50

ResNet50 is a deep learning model with 50 layers, designed to mitigate the vanishing gradient problem. It uses residual blocks with skip connections, allowing input to bypass intermediate layers and connect directly to the output.³⁷ Initially, the input images are 299*299*3 and then resized to 256*256*3 for compatibility with ResNet50. The network starts with an input layer for these resized images. The initial Conv1 layer uses 64 filters of 7*7, producing an output of 128*128*64, followed by a max pooling layer that reduces the size to 64*64*64, capturing basic features like edges and textures. The network progresses through four main stages of residual blocks: Conv2 (blocks 1–3), Conv3 (blocks 1–4), Conv4 (blocks 1–6) and Conv5 (blocks 1–3).

Each block uses a bottleneck design with 1*1, 3*3 and 1*1 convolutions, with skip connections facilitating easier gradient flow and deeper network training. The final output from the Conv5 stage, specifically block 3, is an 8*8*2048 output feature map as shown in Table 3. ResNet50 captures detailed features at various levels, from basic edges to complex patterns, making it highly effective for image analysis tasks. Its ability to learn comprehensive feature representations allow it to accurately and efficiently process diverse visual data. These extracted primary features are then passed into stage 2 for further deep feature extraction.

Table 3.

ResNet50: layer-wise architecture and parameters.

Resnet 50 Model
Layer Group	Layer	Filter size	Output size	Parameters
Input	Input	-	256* 256*3	0
1 Conv	Conv2D	7*7, Stride 2	12812864	9472
	Max pool	3*3, Stride 2	646464	0
2 Conv	Conv2D (Block 1–3)	11,33,1*1	6464256	225920
3 Conv	Conv2D (Block 1–4)	11,33,1*1	3232512	1158400
4 Conv	Conv2D (Block 1–6)	11,33,1*1	16161024	7197312
5 Conv	Conv2D (Block 1–3)	11,33,1*1	882048	14771712

InceptionV3

InceptionV3 is a deep CNN designed to efficiently extract rich feature representations from images. It begins with an input layer for images of size 299*299*3, followed by a series of convolutional layers that progressively increase in filter depth to capture various levels of feature abstraction. The early layers focus on detecting edges and textures, while the deeper layers identify more complex patterns.³⁸ The architecture includes several distinctive Inception modules (labeled A, B, C, D and E), which are tailored to extract multi-scale features concurrently. These modules combine 1*1, 3*3, 5*5 and sometimes 7*7 convolutions with pooling operations to capture a wide range of spatial patterns as shown in Table 4. This design enhances computational efficiency and effectiveness in recognizing features across different scales and complexities. Additionally, the model incorporates max pooling layers to reduce the spatial dimensions of feature maps, minimizing computational load while retaining essential information. The final layers aggregate these primary features, which are then used in the subsequent stage of the proposed work.

Table 4.

InceptionV3: layer-wise architecture and parameters.

InceptionV3 Model
Layer Group	Layer	Filter size	Output size	Parameters
Input	Input	-	2992993	0
1 Conv	Conv2D	3*3, Stride 2	14914932	864
2 Conv	Conv2D	3*3, Stride 1	14714732	9216
3 Conv	Conv2D	3*3, Stride 1	14714764	18432
	Max pool	3*3, Stride 2	737364	0
4 Conv	Conv2D	3*3, Stride 1	737380	5120
5 Conv	Conv2D	3*3, Stride 1	7171192	138240
	Max pool	3*3, Stride 2	3535192	0
Inception A	Mixed Conv2D	11,55,3*3	3535256	varies
Inception B	Mixed Conv2D	11,77,3*3	1717768	varies
Inception C	Mixed Conv2D	11,77,3*3	1717768	varies
Inception D	Mixed Conv2D	11,77,3*3	881280	varies
Inception E	Mixed Conv2D	11,33,13,31	882048	varies

Stage 2: advanced feature extraction using ViT

During this stage, ViT model is utilized to enhance the extraction of deep features from the primary features of CXR images. Initially, these features are derived from backbone models like VGG16, VGG19, ResNet50 and InceptionV3, resulting in feature maps with dimensions 8∗8∗512 for VGG models and 8∗8∗2048 for ResNet50 and InceptionV3. The ViT processes these maps to capture intricate patterns and long-range dependencies,³⁹ as detailed in Figure 1.

The process is initiated by reducing the dimensions of these feature maps to make them suitable for the ViT architecture. An average pooling layer is employed to halve the spatial dimensions, resulting in reduced feature maps such as (B, 512, 4, 4) for the VGG models and (B, 2048, 4, 4) for ResNet50 and InceptionV3, where B denotes the batch size. These reduced feature maps are divided into patches, with each patch representing a portion of the spatial region of the feature map. Each patch with dimensions (4 *4*C) (where C is the number of channels) is converted into flattened vector as shown in Eq. (1)

{Patch}_{Feature Vector} = {Flatten}_{Vector} (4 \times 4 \times C)

(1)

These flattened vectors are then linearly projected into a lower-dimensional space D to create patch embeddings (Patch _Embedding) in Eq. (2).

{Patch}_{Embedding} = w_{p} {Patch}_{Feature Vector} + b_{p}

(2)

Here, w_p is the weight matrix, b_p is the bias term and D represents the embedding dimension. The resulting patch embeddings, with dimensions (B, N, D), where N is the number of patches, are then combined with position embeddings (Pos_Embedding) to retain the spatial context of each patch within the image as in Eq. (3).

{Final}_{Embedding} = {Patch}_{Embedding} + {Pos}_{Embedding}

(3)

These combined embeddings are fed into the transformer encoder, the core of the ViT model is shown in Figure 3. It utilizes layers of Multi-Head Self-Attention (MHSA) and Feed-Forward Networks (FFN) to capture complex relationships within the feature map, allowing the model to focus on different parts of the CXR image simultaneously.⁴⁰ This is particularly beneficial in medical imaging, where subtle details can be critical for accurate diagnosis.

Figure 3.

Transformer encoder.

The output of the transformer encoder is given by Eq. (4).

\begin{aligned} {Encoder}_{Output} & = Norm (MHSA ({Final}_{Embedding}) \\ + FFN ({MHSA}_{Output})) \end{aligned}

(4)

Residual connections and layer normalization are incorporated to enhance the stability of the training process and improve model performance, which is vital when dealing with complex CXR images.

In the final step, a Multi-Layer Perceptron (MLP) head is used for classifying the lung CXR images. The deep feature vectors from the last transformer encoder block are passed into the MLP classifier, which includes two hidden layers followed by an output layer. The operations within the MLP classifier are described by Eq`s. (5, 6 & 7).

h_{1} = σ (w_{1} \cdot x + b_{1})

(5)

h_{2} = σ (w_{2} \cdot h_{1} + b_{2})

(6)

{MLP}_{output} = w_{3} \cdot h_{2} + b_{3}

(7)

where, w₁, w₂ and w₃ represent the weight matrices, b₁, b₂ and b₃ are the biases and σ is the ReLU activation function. The four final output layer produces logits that correspond to the classes in the dataset. The class with the highest probability determines the diagnosis, ensuring that the ViT effectively classifies lung CXR images based on the deep, context-rich features extracted by the model.

Results

The proposed model, which integrates a backbone model with the ViT, is assessed using metrics such as accuracy, loss, precision, F1-score and recall. To evaluate the multistage-ViT model's effectiveness, four different backbone models are tested. Initially, the CNN model's weights are set using pre-trained weights from a transfer learning model trained on the ImageNet dataset. This approach bypasses the need for extensive training during the primary feature extraction stage, addressing the challenge of long processing times typically associated with training complex models from scratch, especially with a dataset exceeding 20,000 CXR images.

The backbone models serve two crucial functions: (1) Extracting primary features from the input CXR images while significantly reducing the number of trainable parameters in the first stage, and (2) Decreasing the dimensionality of the input CXR images to optimize the ViT's performance in the subsequent stage, enabling it to concentrate on the most essential features. Following this, the ViT model is trained to extract deep features and classify lung diseases using the training sets from both the imbalanced and balanced datasets. Finally, the overall model's performance is evaluated on the testing sets.

Experimental setup

All experiments in this work were conducted on Windows 11 PCs version 21H2, featuring a 64-bit operating system, an Intel(R) Xeon(R) 3.31 GHz processor and 32 GB of RAM. CUDA version 11.7 was utilized on an NVIDIA Quadro P4000 GPU. The experiments were programed in Python 3.11.5 and executed using Jupyter Notebook, with various libraries such as Keras and TensorFlow. The dataset was split into 80% for training and 20% for validating and testing, providing approximately 4235 images for testing the model. A consistent epoch count of 50 was chosen to maintain fairness in the comparison of all experiments. Given the dataset's multi-class classification requirements, categorical cross-entropy was employed as the loss function to accurately measure the difference between the predicted and true probability distributions. The Adam optimizer was selected for its efficiency and minimal need for parameter adjustments. The backbone models produced primary feature dimensions of 8*8*C, leading to the choice of a patch size of four in the ViT model, corresponding to half of the original feature size. The remaining parameters were fine-tuned based on the specific demands of the experiments as shown in Table 5.

Table 5.

Experimental Hyperparameters and Settings.

Parameters	Values
Learning rate	0.0001
Batch size	32
Epochs	50
Patch size	(4,4)
Drop rate	0.1
Number of heads	4
Embedded Dimensions	64
Transformer Layers	8
Input size	(224*224)
Optimizers	Adam
Loss function	Cross Entropy Loss
MLP size	256

Decision on backbone models

Figure 4 showcases the classification performance of backbone models like VGG16, VGG19, ResNet50 and InceptionV3 on an imbalanced dataset under two different learning rates. Key performance metrics such as accuracy, precision, recall and F1-score are illustrated for each model. A series of experiments were conducted with various learning rates, and 0.001 and 0.0001 were observed to deliver optimal performance, as shown in the Figure 4. In Figure 4 (a), with a learning rate of 0.001, VGG16 shows strong performance with 0.92 accuracy, 0.90 precision, 0.91 recall and 0.92 F1-score, indicating stability across all metrics. VGG19 follows closely VGG16, suggesting marginally reduced effectiveness. ResNet50 and InceptionV3 both display good, consistent performance, with scores around 0.89 for accuracy and slight variations in precision, recall and F1-score. In Figure 4 (b), with a learning rate of 0.0001, VGG16 achieves the highest scores across all metrics, with 0.94 accuracy, 0.92 precision, 0.92 recall and 0.94 F1-score. VGG19 also performs well, with scores slightly lower than VGG16. ResNet50 and InceptionV3 both show consistent results, with 0.91 accuracy and balanced performance in other metrics. Overall, VGG16 exhibits the most notable improvement at the lower learning rate, while all models benefit from the more conservative learning rate, demonstrating strong and balanced performance. The backbone models showed superior performance when Adam optimizers was used with a learning rate of 0.0001, compared to a learning rate of 0.001.

Figure 4.

Classification performance result of the backbone models using imbalance dataset for (a) Learning rate =0.001 and (b) Learning rate =0.0001.

Classification outcomes for the imbalanced dataset

Table 6 compares the performance of backbone models and multistage-ViT on an imbalanced dataset, focusing on training and validation accuracy as well as loss. Backbone models VGG16, VGG19, ResNet50 and InceptionV3 show varying results. VGG16 achieved 93.35% training and 90.18% validation accuracy, with loss of 0.1890 and 0.2782 for training and validation. VGG19 performed better, with 97.91% training and 95.72% validation accuracy and loss of 0.0757 and 0.1337. ResNet50 presented 96.75% training and 95.17% validation accuracy, with loss of 0.1009 and 0.1408. InceptionV3 led among the backbone models, with 97.56% training and 94.52% validation accuracy and the lowest loss at 0.0792 and 0.1482. Multistage-ViT model significantly improved performance across all models. VGG16 with ViT, reached 99.56% training and 95.07% validation accuracy, with a much lower loss of 0.0165 and 0.1800. VGG19 with ViT achieved 99.40% training and 94.89% validation accuracy, with loss of 0.0198 and 0.1911. ResNet50, under this approach, had 98.12% training and 94.33% validation accuracy, with loss of 0.0481 and 0.1954. InceptionV3 with ViT topped the charts, with 99.93% training and 96.88% validation accuracy and the lowest loss of 0.0028 and 0.1528. Multistage-ViT approach significantly enhances the performance of all backbone models, particularly in handling imbalanced datasets, leading to superior classification results with better generalization, which is illustrated in Figure 5.

Figure 5.

The performance of Multistage-ViT models using an imbalance dataset. Loss and Accuracy of (a) VGG16 + ViT (b) VGG19 + ViT (c) ResNet50 + ViT and (d) InceptionV3 + ViT.

Table 6.

Performance comparison of backbone models and multistage-ViT on an imbalanced dataset: evaluating training and validation accuracy and loss.

Model	Backbone Models				Multistage-ViT
	Accuracy (%)		Loss		Accuracy (%)		Loss
	Training	Validation	Training	Validation	Training	Validation	Training	Validation
VGG16	93.35	90.18	0.1890	0.2782	99.56	95.07	0.0165	0.1800
VGG19	97.91	95.72	0.0757	0.1337	99.40	94.89	0.0198	0.1911
ResNet50	96.75	95.17	0.1009	0.1408	98.12	94.33	0.0481	0.1954
InceptionV3	97.56	94.52	0.0792	0.1482	99.93	96.88	0.0028	0.1528

Furthermore, the results from the four experiments show that the InceptionV3 + ViT model excels in the training phase, achieving an accuracy of 99.93% with a loss of 0.0028. The model also performs strongly during validation, registering an accuracy of 96.88% and a loss value of 0.1528.

Figure 6 shows the model performance rankings on an imbalanced dataset across four metrics: train accuracy, validation accuracy, train loss and validation loss. The chart compares the rankings of various models, including VGG16, VGG19, ResNet50, InceptionV3 and their respective versions with ViT integration. In Table 7, the Multistage-ViT model consistently outperforms the backbone models in precision, recall, and F1-score, with the InceptionV3 + ViT combination delivering the highest performance. This highlights the superior capability of the Multistage-ViT approach in effectively managing imbalanced datasets for accurate lung disease classification. Figure 7 displays the confusion matrices for the proposed model, tested with VGG16, VGG19, ResNet50 and InceptionV3, each combined with a ViT-based model.

Figure 6.

Comparison of model performance rankings on imbalanced dataset.

Figure 7.

The confusion matrix for the imbalanced dataset for the models (a) VGG16 + ViT (b) VGG19 + ViT (c) ResNet50 + ViT and (d) InceptionV3 + ViT.

Table 7.

Comparative performance metrics of backbone models and multistage-ViT on imbalanced datasets: F1-score, precision and recall.

Model	Backbone Models			Multistage-ViT
Model	Precision	Recall	F1-score	Precision	Recall	F1-score
VGG16	0.8962	0.8961	0.8959	0.9012	0.9015	0.9017
VGG19	0.9135	0.9115	0.9118	0.9251	0.9249	0.9249
ResNet50	0.9051	0.9041	0.9038	0.9370	0.9359	0.9355
InceptionV3	0.8945	0.8942	0.8942	0.9659	0.9656	0.9656

Classification outcomes for the balanced dataset

Table 8 presents a detailed performance comparison between backbone models and multistage-ViT models on a balanced dataset, focusing on training and validation metrics. The VGG16 backbone model reached a success rate of 94.35% during training and 91.15% during validation, accompanied by an error rate of 0.0603 in training and 0.2300 in validation. VGG19 showed slightly lower effectiveness, with a training phase success rate of 92.78% and a validation phase rate of 89.04%, and corresponding error metrics of 0.1901 and 0.3124. ResNet50 recorded a training phase accuracy of 92.74% and a validation set accuracy of 89.47%, with higher training and validation phase errors of 0.2017 and 0.3773, respectively. InceptionV3 performed better, achieving a training phase success rate of 94.65%, a validation success rate of 90.64%, and the lowest error during training among the backbone models at 0.0238, though with a slightly higher validation error of 0.2998.

Table 8.

Performance comparison of backbone models and multistage-ViT on a balanced dataset: evaluating training and validation accuracy and loss.

Model	Backbone Models				Multistage-ViT
	Accuracy (%)		Loss		Accuracy (%)		Loss
	Training	Validation	Training	Validation	Training	Validation	Training	Validation
VGG16	94.35	91.15	0.0603	0.2300	98.82	88.12	0.0366	0.6931
VGG19	92.78	89.04	0.1901	0.3124	97.77	92.49	0.0698	0.2290
ResNet50	92.74	89.47	0.2017	0.3773	99.07	94.85	0.0170	0.2232
InceptionV3	94.65	90.64	0.0238	0.2998	99.97	96.31	0.0013	0.2152

The multistage-ViT models demonstrated significant improvements across all metrics. The VGG16 + ViT configuration reached a success rate of 98.82% during training and 88.12% during validation, with a training phase error of 0.0366 and a validation phase error of 0.6931. The VGG19 + ViT model showed further improvement, achieving a training phase accuracy of 97.77% and a validation success rate of 92.49%, with reduced error metrics of 0.0698 and 0.2290. ResNet50 + ViT delivered strong results, with a training success rate of 99.07% and a validation phase accuracy of 94.85%, paired with minimal error rates of 0.0170 and 0.2232. The InceptionV3 + ViT model stood out, reaching the highest training phase accuracy of 99.97% and a validation phase rate of 96.31%, with the lowest error in training at 0.0013 and a validation error of 0.2152. This comparison underscores the superior performance and effectiveness of the multistage-ViT models over the standalone backbone models in handling the balanced dataset, resulting in better generalization with reduced loss and these are illustrated by the corresponding loss and accuracy curves for the multistage-ViT models for balanced dataset in Figure 8. Figure 8 highlights the effectiveness of the Multistage-ViT model in lung disease classification on a balanced dataset. The results show a significant improvement in accuracy and reduction in loss across all tested backbone configurations, including VGG16, VGG19, ResNet50, and InceptionV3. Among these, the InceptionV3 + ViT combination stands out with the highest performance, achieving a training accuracy of 99.97% and a validation accuracy of 96.31%, along with minimal loss. This demonstrates the ability of the proposed model to effectively leverage both primary and advanced feature extraction stages, ensuring robust classification and reduced overfitting in balanced dataset conditions.

Figure 8.

The performance of Multistage-ViT models using balanced dataset. Loss and Accuracy of (a) VGG16 + ViT (b) VGG19 + ViT (c) ResNet50 + ViT and (d) InceptionV3 + ViT.

Moreover, the four experiments show that the InceptionV3 + ViT model delivers excellent performance during the training phase, achieving an accuracy of 99.97% and a low loss of 0.0013. The model also performs impressively in the validation phase, with an accuracy of 96.31% and a loss of 0.2152.

Figure 9 illustrates the performance rankings of various backbone models and their corresponding Multistage-ViT integrations on a balanced dataset, evaluated across training accuracy, validation accuracy, training loss, and validation loss. The Multistage-ViT models consistently outperform their standalone backbone counterparts, with significantly higher accuracy and lower loss metrics. Among the configurations, the InceptionV3 + ViT model achieves the best performance across all criteria, demonstrating its exceptional capability to extract relevant features and generalize well. This underscores the superior robustness and reliability of the Multistage-ViT approach in accurately classifying lung diseases from chest X-rays.

Figure 9.

Comparison of model performance rankings on balanced dataset.

In Table 9, the Multistage-ViT model consistently outperforms the backbone models in precision, recall and F1-score on balanced datasets, with particularly strong results for VGG16 and InceptionV3 + ViT. This highlights the superior accuracy and reliability of the Multistage-ViT approach in lung disease classification, even with balanced data.

Table 9.

Comparative performance metrics of backbone models and multistage-ViT on balanced datasets: F1-score, precision and recall.

Model	Backbone Models			Multistage-ViT
Model	Precision	Recall	F1-score	Precision	Recall	F1-score
VGG16	0.9457	0.9452	0.9452	0.9518	0.9519	0.9517
VGG19	0.8945	0.8940	0.8941	0.9490	0.9489	0.9487
ResNet50	0.9376	0.9368	0.9369	0.9405	0.9405	0.9405
InceptionV3	0.9519	0.9517	0.9518	0.9536	0.9535	0.9534

Figure 10 presents confusion matrices for the Multistage-ViT models combined with different backbone architectures (VGG16, VGG19, ResNet50, and InceptionV3) on a balanced dataset. The results highlight the strong classification performance of all configurations, with high precision and recall across all classes (Normal, COVID-19, Viral Pneumonia, and Lung Opacity). Among the models, InceptionV3 + ViT demonstrates the most accurate predictions, with minimal misclassification errors across classes. This indicates the proposed model's ability to handle complex feature relationships effectively and achieve reliable performance in lung disease classification.

Figure 10.

The confusion matrix for the balanced dataset for the models (a) VGG16 + ViT (b) VGG19 + ViT (c) ResNet50 + ViT and (d) InceptionV3 + ViT.

Figure 11 illustrates the ROC curves for the Multistage-ViT model (InceptionV3 + ViT configuration) on both imbalanced and balanced datasets, showing classification performance across four classes (COVID-19, Lung Opacity, Normal, and Viral Pneumonia). The high AUC values across all classes and the micro-average curve indicate excellent discriminative ability of the model. The model performs consistently better in both data scenarios, with slightly higher AUC values for the balanced dataset. This highlights the model's robustness and its effectiveness in handling diverse class distributions while maintaining strong classification accuracy.

Figure 11.

ROC curves showing classification performance: (a) Imbalanced dataset and (b) Balanced dataset.

Figure 12 presents Grad-CAM visualizations comparing the feature representations of InceptionV3 and InceptionV3 + ViT models on chest X-ray images. The input CXR images are shown in Figure 12(a), followed by Grad-CAM outputs from the InceptionV3 model in Figure 12(b) and the InceptionV3 + ViT model in Figure 12(c). The InceptionV3 + ViT model generates more precise and focused feature maps, effectively highlighting critical lung regions associated with the classification of COVID-19, lung opacity, and viral pneumonia. This enhanced visualization demonstrates the model's ability to capture subtle and complex features, offering superior localization of disease-related areas compared to InceptionV3 alone. The results highlight the strength of the Multistage-ViT model in accurately identifying pathological regions with greater precision.

Figure 12.

Grad-CAM analysis of the backbone and multistage-ViT Model on a balanced dataset: (a) Input CXR images (b) Inception V3 and (c) Inception V3 + ViT.

Discussion

The analysis of both imbalanced and balanced datasets reveals substantial improvements in both accuracy and loss metrics when comparing backbone models to multistage-ViT models. The integration of Vision Transformer (ViT) significantly enhances the performance of the models. In the imbalanced dataset, notable enhancements are observed. The VGG16 model training accuracy rises from 93.35% to 99.56% and its training loss drops from 0.1890 to 0.0165. Similar improvements are noted across other models, demonstrating the advanced feature extraction capabilities of ViT, which effectively enhance the learning process, even with imbalanced data. Additionally, validation accuracy generally improves, showcasing the models’ robustness and ability to maintain strong performance on unseen data. In the case of the balanced dataset, the multistage-ViT models exhibit even more impressive gains. InceptionV3 training accuracy is increased from 94.65% to 99.97%, with a significant reduction in training loss from 0.0238 to 0.0013. These results highlight the superior capability of the ViT-enhanced models to extract and utilize relevant features, ensuring higher accuracy and lower loss. The consistent improvements in validation accuracy further indicate the model's ability to generalize well. Overall, the analysis shows that the multistage-ViT models perform exceptionally well on the balanced dataset, achieving consistently higher training and validation accuracy and a significant loss reduction. This suggests that the balanced dataset is more favorable for evaluating the proposed models, as it better supports their ability to generalize and achieve consistent, superior performance across all metrics.

It is also observed that multistage-ViT models consistently outperform standalone backbone models across both imbalanced and balanced datasets, as indicated by higher precision, recall and F1-scores. The improvements are more pronounced in the imbalanced dataset, where the addition of ViT significantly enhances the model's ability to handle class imbalances and improve classification accuracy. On the balanced dataset, both sets of models perform well, but the multistage-ViT configurations still show slight enhancements, ensuring robust performance across all classes. Finally, the InceptionV3 + ViT model demonstrates the highest metrics overall, highlighting its greater capability in feature extraction and classification. These findings underscore the efficacy of integrating ViT with backbone models, particularly in complex medical image classification tasks.

Table 10 and Table 11 compare the accuracy and execution time required by the backbone models and the Multistage-ViT for both imbalanced and balanced datasets over 10 epochs. On the imbalanced dataset in Table 10, InceptionV3 as a backbone model achieved 92% accuracy in 98 min 25 s, whereas the proposed Multistage-ViT with InceptionV3 reached 94% accuracy in 71 min 47 s Similarly, VGG16 as a backbone model achieved 90% accuracy in 83 min 50 s, while the Multistage-ViT using VGG16 improved accuracy to 95% in 65 min 34 s On the balanced dataset in Table 11, the reduction in execution time is even more pronounced. ResNet50 as a backbone model achieved 87% accuracy in 15 min 40 s, whereas the Multistage-ViT with ResNet50 achieved 93% accuracy in just 11 min 27 s Similarly, VGG19 took 26 min 32 s to achieve 89% accuracy as a backbone model, while the Multistage-ViT using VGG19 reached 91% accuracy in 22 min 59 s These results demonstrate that the Multistage-ViT not only achieves higher accuracy but also requires fewer epochs and less execution time compared to backbone models, making it more efficient and effective for both imbalanced and balanced datasets.

Table 10.

Performance Comparison of Backbone Models and Multistage-ViT for 10 Epochs on an imbalanced Dataset: Accuracy and Execution Time.

Model	Backbone Models		Multistage-ViT
Model	Appx. Accuracy (%)	Time	Appx. Accuracy (%)	Time
VGG16	90	83 min 50 s	95	65 min 34 s
VGG19	85	89 min 09 s	89	67 min 56 s
ResNet50	91	69 min 37 s	93	43 min 21 s
InceptionV3	92	98 min 25 s	94	71 min 47 s

Table 11.

Performance Comparison of Backbone Models and Multistage-ViT for 10 Epochs on a balanced Dataset: Accuracy and Execution Time.

Model	Backbone Models		Multistage-ViT
Model	Appx. Accuracy (%)	Time	Appx. Accuracy (%)	Time
VGG16	91	21 min 58 s	93	17 min 18 s
VGG19	89	26 min 32 s	91	22 min 59 s
ResNet50	87	15 min 40 s	90	11 min 27 s
InceptionV3	90	31 min 12 s	94	27 min 13 s

Conclusion

This study presents a multistage-ViT model specifically designed for lung disease classification, leveraging the impact of a backbone model followed by a vision transformer. The dataset includes both imbalanced and balanced datasets, with the Imbalance set consisting of 16,930 training images and 4235 testing images, while the balance set standardizes to 1076 images per category for both training and testing phases. The proposed model follows a two-stage feature extraction process: first, primary features are extracted using a backbone model fine-tuned on the ImageNet dataset, followed by the extraction of deep features using the Vision Transformer. The classification of CXR images into Normal, COVID-19, Viral Pneumonia and Lung Opacity is then finalized using an MLP head classifier.

The performance of the multistage-ViT model was rigorously tested across four different backbone models and evaluated on metrics such as accuracy, loss, precision, F1-score and recall. The findings demonstrate the model's exceptional accuracy in lung disease classification, particularly in configurations involving InceptionV3 + ViT model, demonstrated exceptional training accuracy, reaching 99.93% on the imbalanced dataset and 99.97% on the balanced dataset, with corresponding validation losses of 0.0028 and 0.0013, respectively. Furthermore, the two-stage feature extraction approach improves accuracy and significantly reduces computational time by minimizing redundant parameters during the primary feature extraction stage. This performance underscores the efficacy of using a vision transformer for deep feature extraction, as evidenced by the optimal results compared to standalone backbone models. The study highlights the potential of the multistage-ViT approach in significantly enhancing the accuracy and reliability of lung disease classification, suggesting a promising direction for future research and application in this field.

Footnotes

ORCID iD

V.Padmavathi

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

References

WHO (2023) Coronavirus disease (COVID-19), https://www.who.int/emergencies/diseases/novel-coronavirus-2019/question-and-answers-hub/q-a-detail/coronavirus-disease-covid-19 .

Vineth Ligi

Kundu

Kumar

, et al. Radiological analysis of COVID-19 using computational intelligence: a broad gauge study. J Healthc Eng 2022; 2022: 5998042.

Al Rahhal

Bazi

Jomaa

, et al. COVID-19 Detection in CT/X-ray imagery using vision transformers. J Pers Med 2022; 12: 10.

Jalali

SMJ

Ahmadian

, et al. X-ray image based COVID-19 detection using evolutionary deep learning approach. Expert Syst Appl 2022; 201: 116942.

Selvano

Paulindino

Elwirehardja

, et al. Evaluating self-supervised pre-trained vision transformer on imbalanced data for lung disease classification. ICIC Express Lett 2024; 15: 2302–004.

Sahoo

Sharma

Saha

, et al. A federated multi-stage light-weight vision transformer for respiratory disease detection. In: Lecture Notes in Networks and Systems, 2023, p. 10.1007/978-981-99-8141-0_23.

Dosovitskiy

Beyer

Kolesnikov

, et al. An image is worth 16(16 words: Transformers for image recognition at scale. Preprint arXiv:2010.11929, 2020.

Khan

Shah

Bhat

. Coronet: a deep neural network for detection and diagnosis of COVID-19 from chest X-ray images. Comput Methods Programs Biomed 2020; 196: 105581.

Wang

Lin

Wong

. COVID-Net: a tailored deep convolutional neural network design for detection of COVID-19 cases from chest X-ray images. Sci Rep 2020; 10: 19549.

10.

Rahaman

Yao

, et al. Identification of COVID-19 samples from chest X-Ray images using deep learning: a comparison of transfer learning approaches. J Xray Sci Technol 2020; 28: 821–839.

11.

Vaswani

, et al. Attention is all you need. In Proceedings of the 31st international conference on neural information processing systems (NIPS’17). Curran Associates Inc., Red Hook, NY, USA, 2017, pp.6000–6010.

12.

Chaudhari

Mithal

Polatkan

, et al. An attentive survey of attention models. ACM Trans Intell Syst Technol 2021; 12: 1–32.

13.

Duong Tuan

Tran

, et al. Detection of tuberculosis from chest X-ray images: boosting the performance with vision transformer and transfer learning. Expert Syst Appl 2021; 184: 115519.

14.

Chen

Zhang

, et al. Deep learning-based model for detecting 2019 novel coronavirus pneumonia on high-resolution computed tomography. Sci Rep 2020; 10: 19196.

15.

Chetoui

Akhloufi

. Explainable vision transformers and radiomics for COVID-19 detection in chest X-rays. J Clin Med 2022; 11: 3013.

16.

Mkindu

Zhao

. Lung nodule detection in chest CT images based on vision transformer network with Bayesian optimization. Biomed Signal Process Control 2023; 85: 104866.

17.

Jiang

Wang

, et al. CTBP-Net: lung nodule segmentation model based on the cross-transformer and bidirectional pyramid. Biomed Signal Process Control 2023; 82: 104528.

18.

Huang

Yang

, et al. Research and implementation of multi-disease diagnosis on chest X-ray based on vision transformer. Quant Imaging Med Surg 2024; 14: 2539–2555.

19.

Shamrat

FMJM

Azam

Karim

, et al. Lungnet22: a fine-tuned model for multiclass classification and prediction of lung disease using X-ray images. J Pers Med 2022; 12: 80.

20.

Tang

Wang

Nie

, et al. EDL-COVID: ensemble deep learning for COVID-19 case detection from chest X-ray images. IEEE Trans Ind Inform 2021; 17: 6539–6549.

21.

Wang

Nie

Wang

, et al. Pneunet: deep learning for COVID-19 pneumonia diagnosis on chest X-ray image analysis using vision transformer. Med Biol Eng Comput 2023; 61: 1395–1408.

22.

Chen

Liu

Yang

, et al. A more effective CT synthesizer using transformers for cone-beam CT-guided adaptive radiotherapy. Front Oncol 2022; 12: 988800.

23.

Labbaf Khaniki

Mirzaeibonehkhater

Manthouri

. Enhancing pneumonia detection using vision transformer with dynamic mapping re-attention mechanism. In: Proceedings of the 2023 11th international conference on computer and knowledge engineering (ICCKE), 2023, pp. 144–149. 10.1109/ICCKE60553.2023.10326313

24.

Gao

Qian

Gao

. COVID-VIT: Classification of COVID-19 from 3D CT chest images based on vision transformer models. In: 2022 International conference on next generation computing applications (NextComp), 2022, pp. 1–4.

25.

Lin

Wang

Liu

, et al. A survey of transformers. AI Open 2022; 3: 111–132.

26.

Toroghi

Sheikh

Irani

. Classification of COVID-19 and lung opacity using vision transformer on chest x-ray images. J Phys Conf Ser 2023; 2622: 012016.

27.

Ukwuoma

Qin

Heyat

MBB

, et al. Automated lung-related pneumonia and COVID-19 detection based on novel feature extraction framework and vision transformer approaches using chest X-ray images. Bioengineering (Basel) 2022; 9: 09.

28.

Park

Woo

. Optimization of vision transformer-based detection of lung diseases from chest X-ray images. BMC Med Inform Decis Mak 2024; 24: 91.

29.

Chen

Philippi

Phan

, et al. High-accuracy fine-tuned vision transformer model for diagnosing COVID-19 from chest X-ray images. TechRxiv, 2024.

30.

Escobar-Ortiz

Amezquita-Dussan

Galindo-Sanchez

, et al. Data-set class-balancing and the convolutional vision transformer: an analysis of chest radiographs for SARS-CoV-2 detection. Biomed Signal Process Control 2024; 93: 106190.

31.

Singh

Kumar

, et al. Efficient pneumonia detection using vision transformers on chest X-rays. Sci Rep 2024; 14: 2487.

32.

Mezina

Burget

. Detection of post-COVID-19-related pulmonary diseases in X-ray images using vision transformer-based neural network. Biomed Signal Process Control 2024; 87: 105380.

33.

Nafisah

Muhammad

Hossain

, et al. A comparative evaluation between convolutional neural networks and vision transformers for COVID-19 detection. Mathematics 2023; 11: 1489.

34.

Chowdhury

MEH

Rahman

Khandakar

, et al. COVID-19 radiography database, https://www.kaggle.com/datasets/tawsifurrahman/covid19-radiography-database (2020).

35.

Badrish

GSSV

Prabhanjali

KGN

Raghuvira Pratap

. Comparative analysis of CNN models with vision transformer on lung infection classification. In: Malhotra

Sumalatha

Yassin

SMW

Patgiri

Muppalaneni

(eds) High performance computing, smart devices and networks. Lecture Notes in Electrical Engineering, vol 1087. Singapore: Springer, 2024. 10.1007/978-981-99-6690-5_12

36.

Tang

. Image classification based on improved VGG network. In: 2021 IEEE 6th international conference on signal and image processing (ICSIP), Nanjing, China, 2021, pp. 316–320. 10.1109/ICSIP52628.2021.9688778

37.

Khan

Zhang

Sharif

, et al. Pixels to classes: intelligent learning framework for multiclass skin lesion localization and classification. Comput Electr Eng 2021; 90: 1–20.

38.

Dutta

Sathi

Hossain

, et al. Conv-ViT: a convolution and vision transformer-based hybrid feature extraction method for retinal disease detection. J Imaging 2023; 9: 40.

39.

Shome

Kar

Mohanty

, et al. COVID-Transformer: interpretable COVID-19 detection using vision transformer for healthcare. Int J Environ Res Public Health 2021; 18: 11086.

40.

Park

Kim

, et al. Multi-task vision transformer using low-level chest X-ray feature corpus for COVID-19 diagnosis and severity quantification. Medical Image Analysis 2022; 75: 102299.