CycleGAN generated pneumonia chest x-ray images: Evaluation with vision transformer

Abstract

The use of generative models in image synthesis has become increasingly prevalent. Synthetic medical imaging data is of particular importance because real medical imaging data is scarce, costly, and encumbered by legal considerations pertaining to patient confidentiality. Synthetic medical images offer a potential answer to these issues. The predominant evaluation approaches primarily assess the quality of the images and their degree of resemblance to the original images used to generate them. The central idea of this work can be summarized in the question: are the Frechet Inception Distance (FID) and Inception Score (IS) performance metrics of a Cycle-consistent Generative Adversarial Network (CycleGAN) model adequate to determine how real a generated chest x-ray pneumonia image is? In this study, a CycleGAN model was employed to produce artificial images depicting 3 classes of chest x-ray pneumonia images: general (any type), bacterial, and viral pneumonia. The quality of the images was evaluated by assessing and contrasting 3 criteria: the performance metrics of the CycleGAN model, the clinical assessment of respiratory experts, and the classification results of a visual transformer (ViT). The overall results showed that the evaluation metrics of the CycleGAN are insufficient to establish realism in generated medical images.

Keywords

Synthetic chest x-ray, cycle generative adversarial network, pneumonia, image-to-image translation, visual transformer

1 Introduction

In recent years, studies have increasingly focused on generative models for image synthesis [5]. These advances have influenced areas of medical research where artificial intelligence (AI) has become common in the handling and processing of images, text, and sound. However, the large volumes of data these models require present a critical problem: collecting this information is costly in time and resources. The issue also entails the challenge of acquiring substantial quantities of annotated data in order to train convolutional neural networks (CNNs) that yield superior performance. CycleGANs [1] have been widely used in various domains, medical imaging being one of the most relevant [4, 7–9]. Among the important works on CycleGANs for medical image generation and quality assessment, Wolterink et al. [21] apply GANs, including concepts similar to CycleGANs, to generate Computed Tomography (CT) images from Magnetic Resonance Imaging (MRI) images, showcasing the capability of such networks in the domain of medical imaging. Recent works such as those of Wang et al. [22] and Malygina et al. [23] have explored CycleGANs as a form of data augmentation for improving CNN pneumonia classification. However, few works, such as that of Joyce et al. [24], focus on the relationship between the performance metrics of a GAN model and the quality of the generated image, and these do not address its realism. The open question is: does synthetic image quality translate into how real the image is?

In this work, a 256×256 CycleGAN model [1] was used to generate chest X-ray images associated with general pneumonia [25], bacterial pneumonia, and viral pneumonia. To evaluate whether the quality of the generated images, as measured by FID and IS [2], translates into how real the pneumonia images are, the FID and IS metrics were contrasted with the evaluation of 3 respiratory care specialists, collected through a series of questionnaires. Likewise, the generated images were evaluated using a Visual Transformer (ViT) model [18–20] trained for the classification of chest x-ray images into 2 classes: pneumonia and normal.

The contributions of this work are: 1) FID and IS metrics are shown to be lacking [2] as an objective way of assessing the realism of pneumonia chest X-ray images generated by a CycleGAN model. 2) Medical expert assessments are suitable and necessary in a clinical setting for decision making and diagnostics, but predictive models such as ViT are tools that can help ensure that synthetic images are realistic enough to use in learning models or other practical applications.

This paper is structured as follows: Section 2 presents the materials and methods employed in this work. Section 3 describes the experimental results. Section 4 discusses the results obtained and Section 5 gives the conclusions and future work.

2 Materials and methods

This section introduces GANs, CycleGANs and ViTs. The CycleGAN model was used for the generation of synthetic images of pneumonia, while the ViT model was used for the classification of the images. This section also describes the 256×256 CycleGAN architecture used in this work, the ViT model, the dataset used in the experiments, and the questionnaires applied to the specialists.

2.1 Dataset

The dataset used for this work is Labeled Optical Coherence Tomography (OCT) and Chest X-Ray Images for Classification [16]. The dataset is organized into 3 folders (train, test, val) and contains subfolders for each image category (Pneumonia/Normal). There are 5,863 X-Ray images (JPEG) and 2 categories (Pneumonia/Normal). Each Pneumonia image is also labeled as being of viral or bacterial origin.

2.2 GANs

A generative adversarial network (GAN) [3, 4] is a type of deep learning model consisting of two neural networks: a generator network and a discriminator network. The GAN framework was introduced by Goodfellow in 2014.

The main objective of a GAN is to produce synthetic data that exhibits a high degree of realism, particularly in the context of generating images, which closely resemble samples from a specified target dataset. The generator network accepts either random noise or a latent input as its input and endeavors to produce samples that closely resemble the distribution of the target data. Conversely, the discriminator network undergoes training to differentiate between authentic samples provided from the target dataset and synthetic samples produced by the generator.

The training process of a GAN entails a competitive interplay between the generator and discriminator neural networks. The primary objective of the generator is to generate samples that exhibit a higher degree of realism in order to deceive the discriminator. Conversely, the discriminator’s primary goal is to accurately distinguish between real and counterfeit samples. The learning process is driven by the antagonistic connection between the two networks.

2.3 CycleGANs

CycleGANs are generative models designed for unsupervised image-to-image translation tasks: they learn mappings between two image domains without needing paired training data [1]. The fundamental concept underlying CycleGANs is the cycle consistency loss, which enforces that the original content is preserved when an image is translated from one domain to another and subsequently translated back to its original domain. By integrating cycle consistency, the model is able to learn meaningful mappings between the domains even in the absence of paired training examples. The CycleGAN architecture differs from other GANs in that it contains 2 mapping functions (G and F) that act as generators, together with their corresponding discriminators (D_x and D_y). The generator mapping functions are as follows:

G : X \to Y, F : Y \to X

(1)

where X is the input image distribution and Y is the desired output distribution. The cost function used is the sum of the two adversarial losses and the cycle-consistency loss, the latter weighted by λ:

L (G, F, D_{x}, D_{y}) = L_{advers} (G, D_{y}, X, Y) + L_{advers} (F, D_{x}, Y, X) + λ L_{cycl} (G, F, X, Y)

(2)

with an objective function with the form of:

\min_{G, F} \max_{D_{x}, D_{y}} L (G, F, D_{x}, D_{y})

(3)

The training process of a CycleGAN involves two main components:

Adversarial Loss: The generators and discriminators are trained through adversarial learning. The primary objective of the generators is to produce images that deceive the discriminators by causing them to classify the generated images as authentic. Conversely, the discriminators strive to accurately differentiate between real images and those that have been generated. The adversarial loss contributes to both the quality and the realism of the generated images. Since G maps into domain Y and F maps into domain X, the output of G is judged by D_y and the output of F by D_x: $L_{advers} (G, D_{y}, X, Y) = \frac{1}{m} \sum_{i = 1}^{m} {(1 - D_{y} (G (x_{i})))}^{2}$ and $L_{advers} (F, D_{x}, Y, X) = \frac{1}{m} \sum_{i = 1}^{m} {(1 - D_{x} (F (y_{i})))}^{2}$

Cycle-Consistency Loss: This principle asserts that an image that undergoes translation from one domain to another and subsequently back should have a high degree of similarity to the original input image. The loss function enforces consistency of the mapping between the images in both directions, thereby aiding in the preservation of the original image content. $L_{cycl} (G, F, X, Y) = \frac{1}{m} \sum_{i = 1}^{m} [ | F (G (x_{i})) - x_{i} | + | G (F (y_{i})) - y_{i} | ]$
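The two loss components and their combination in Eq. (2) can be sketched numerically as follows. This is a minimal NumPy illustration, not the training code used in this work: it assumes the least-squares form of the adversarial loss, an L1 (absolute-difference) cycle penalty as in Zhu et al. [1], equal batch sizes for both domains, and a hypothetical weight `lam = 10.0`; all function names are illustrative.

```python
import numpy as np


def adversarial_loss(disc_scores_on_fake):
    """Least-squares generator loss: mean over the batch of (1 - D(fake))^2."""
    scores = np.asarray(disc_scores_on_fake, dtype=float)
    return float(np.mean((1.0 - scores) ** 2))


def cycle_consistency_loss(x, x_cycled, y, y_cycled):
    """Mean over the batch of |F(G(x)) - x| + |G(F(y)) - y| (L1 per image)."""
    x, x_cycled = np.asarray(x, dtype=float), np.asarray(x_cycled, dtype=float)
    y, y_cycled = np.asarray(y, dtype=float), np.asarray(y_cycled, dtype=float)
    m = x.shape[0]  # batch size, assumed equal for both domains
    forward = np.abs(x_cycled - x).reshape(m, -1).sum(axis=1)
    backward = np.abs(y_cycled - y).reshape(m, -1).sum(axis=1)
    return float(np.mean(forward + backward))


def total_generator_loss(dy_on_gx, dx_on_fy, x, x_cycled, y, y_cycled, lam=10.0):
    """Eq. (2): both adversarial terms plus the lambda-weighted cycle term."""
    return (
        adversarial_loss(dy_on_gx)      # G is judged by D_y
        + adversarial_loss(dx_on_fy)    # F is judged by D_x
        + lam * cycle_consistency_loss(x, x_cycled, y, y_cycled)
    )
```

A generator that fully fools its discriminator (score 1 on every fake) contributes zero adversarial loss, and a perfect round trip contributes zero cycle loss, so the total is minimized only when both conditions hold.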

A fundamental weakness of the CycleGAN model is that it learns deterministic mappings. In CycleGAN and other similar models [4, 6], the conditionals between domains correspond to delta functions: $\hat{p} (a ∣ b) = δ (G_{BA} (b))$ and $\hat{p} (b ∣ a) = δ (G_{AB} (a))$, and cycle consistency forces the learned mappings to be inverses of each other. When confronted with intricate inter-domain connections, CycleGAN tends to learn an artificial one-to-one correspondence instead of accurately reflecting the genuine, structured conditional distribution. Deterministic mappings make it hard to satisfy cycle consistency, particularly when the domains differ significantly in complexity. In such scenarios, the true mapping from one domain to the other is typically one-to-many.

2.3.1 256×256 CycleGAN architecture

The CycleGAN generator is composed of three distinct components, namely the Encoder, Transformer, and Decoder. A U-Net architecture is employed for the generator. To construct the generator, downsample and upsample operations are defined. Downsampling reduces the two-dimensional size of an image, its width and height, by a factor given by the stride. The stride is the distance covered by each step of the filter. With a stride of 2, the filter is applied to alternate pixels, reducing both the width and height by a factor of 2. In this work, instance normalization was employed as an alternative to batch normalization.
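The effect of the stride on the spatial size can be illustrated with a short calculation. The sketch below assumes "same"-style padding, so that a strided layer outputs ceil(size / stride) pixels per side; the number of layers (`n_layers = 3`) is a hypothetical value chosen for illustration and is not taken from the architecture of this work.

```python
import math


def downsampled_size(size, stride=2):
    """Side length after one strided convolution with 'same'-style padding."""
    return math.ceil(size / stride)


def encoder_sizes(size=256, n_layers=3, stride=2):
    """Side lengths of the feature maps through successive downsampling layers."""
    sizes = [size]
    for _ in range(n_layers):
        sizes.append(downsampled_size(sizes[-1], stride))
    return sizes


print(encoder_sizes())  # side length halves at every stride-2 layer
```

Upsampling in the decoder reverses this progression, doubling the side length at each stride-2 transposed convolution until the 256×256 resolution is restored.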

The discriminator uses the PatchGAN architecture. The distinction between a PatchGAN and a conventional GAN discriminator lies in their respective mapping functions. A standard GAN discriminator maps a 256×256 image to a single scalar output that indicates authenticity (real or fake). The PatchGAN instead maps a 256×256 image to an N × N (here 64×64) array of outputs X, as shown in Fig. 1, where each X_ij indicates whether patch ij of the image is real or fake. The discriminator is built from 4×4 Convolution-InstanceNorm-LeakyReLU layers with 64, 128, 256 and 512 filters and a stride of 2; InstanceNorm is not applied to the first layer of 64 filters. After the last layer, a convolution is applied to produce a 1×1 output.

Fig. 1

Architecture of 256×256 CycleGAN model.

2.4 Performance metric

The following performance metrics were used to evaluate the CycleGAN generated images [2]:

FID = {\| μ_{1} - μ_{2} \|}^{2} + Tr (σ_{1} + σ_{2} - 2 \sqrt{σ_{1} σ_{2}})

(4)

where μ₁ and σ₁ refer to the mean and covariance of the train data and μ₂ and σ₂ refer to the mean and covariance of the test data and Tr refers to the trace.
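As an illustration of how Eq. (4) is evaluated, the following NumPy sketch computes the Fréchet distance between two sets of feature vectors using its standard squared-distance form. It is a simplified sketch under stated assumptions: in practice the features come from a pre-trained Inception network, whereas here any array of feature vectors can be passed, and the function names are illustrative. The trace of the matrix square root is obtained from the eigenvalues of the product of the two covariance matrices.

```python
import numpy as np


def feature_statistics(features):
    """Mean vector and covariance matrix of a (n_samples, n_features) array."""
    features = np.asarray(features, dtype=float)
    return features.mean(axis=0), np.cov(features, rowvar=False)


def frechet_distance(mu1, sigma1, mu2, sigma2):
    """||mu1 - mu2||^2 + Tr(sigma1 + sigma2 - 2 * sqrt(sigma1 @ sigma2))."""
    mu1, mu2 = np.asarray(mu1, dtype=float), np.asarray(mu2, dtype=float)
    sigma1 = np.asarray(sigma1, dtype=float)
    sigma2 = np.asarray(sigma2, dtype=float)
    mean_term = float(np.sum((mu1 - mu2) ** 2))
    # For positive semi-definite covariances the eigenvalues of the product
    # are real and non-negative, so Tr(sqrt(.)) is the sum of their roots.
    eigenvalues = np.linalg.eigvals(sigma1 @ sigma2)
    sqrt_trace = float(np.sum(np.sqrt(np.clip(eigenvalues.real, 0.0, None))))
    return mean_term + float(np.trace(sigma1) + np.trace(sigma2)) - 2.0 * sqrt_trace
```

Two identical feature distributions give a distance of zero, and the value grows as the means or covariances drift apart, which is why lower FID values are read as better quality.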

IS = \exp ( \frac{1}{N} \sum_{i = 1}^{N} D_{KL} ( p (y | x^{i}) \, \| \, \hat{p} (y) ) )

(5)

where p (y | x^{i}) is the conditional class probability predicted for the generated image x^{i}, \hat{p} (y) is the marginal class probability over the N generated images, and D_{KL} refers to the KL divergence between these two probabilities.
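Eq. (5) can be sketched in a few lines of NumPy. The example below assumes the class probabilities have already been produced by a classifier (an Inception network in the usual definition) and are supplied as an array with one row per generated image; the function name and the small constant `eps` used to avoid log(0) are illustrative choices.

```python
import numpy as np


def inception_score(class_probs, eps=1e-12):
    """exp of the mean KL divergence between p(y|x_i) and the marginal p(y)."""
    p = np.asarray(class_probs, dtype=float)   # shape (N, n_classes)
    marginal = p.mean(axis=0)                  # estimate of p(y) over the N images
    kl_per_image = np.sum(p * (np.log(p + eps) - np.log(marginal + eps)), axis=1)
    return float(np.exp(kl_per_image.mean()))
```

If every image receives the same prediction, the conditional and marginal distributions coincide and the score is 1, its minimum. The score rises when individual predictions are confident and the predicted classes are spread across images, up to the number of classes.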

2.5 Vision transformers

ViTs have become a central topic in the deep learning community due to their effectiveness in image classification tasks [15–17], previously dominated by CNNs. The basic idea is that instead of using the traditional convolutional layers as in CNNs, ViTs leverage the transformer architecture, which was originally designed for natural language processing tasks, to handle images.

To prepare the input, the images are divided into fixed-size non-overlapping patches. Each patch is flattened and linearly embedded into a vector. Positional embeddings are then added to these vectors to retain the spatial information of the patches, since transformers do not inherently understand the spatial layout. The sequence of embedded patches is then fed into the transformer.

The core of the transformer architecture is the self-attention mechanism, which weighs input elements differently based on their content and relative positions. Multi-head attention and feed-forward neural networks, alongside normalization and residual connections, constitute a transformer block. Several such blocks are stacked to form the complete transformer model. Some challenges must also be considered [15]: ViTs, especially larger models, require substantial amounts of data and computing resources for training from scratch, and they can be slower and more memory-intensive than CNNs, especially for smaller input images. Vision Transformers approach image understanding in a way fundamentally different from CNNs. Instead of focusing on local patterns and hierarchical structures, ViTs break down images into patches and use transformer mechanisms to capture both local and global contexts.
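The self-attention mechanism described above can be sketched as single-head scaled dot-product attention. This is a minimal NumPy illustration with randomly initialised projection matrices, not the multi-head implementation of the ViT used in this work; the token count, dimensions and function names are hypothetical.

```python
import numpy as np


def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a token sequence."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])   # similarity of every token pair
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ v, weights


rng = np.random.default_rng(0)
tokens = rng.normal(size=(5, 8))              # 5 patch tokens of width 8
w_q, w_k, w_v = (rng.normal(size=(8, 4)) for _ in range(3))
output, weights = self_attention(tokens, w_q, w_k, w_v)
```

Every output token is a weighted average of all value vectors, which is how a patch in one corner of the image can draw on information from any other patch in a single layer.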

2.6 ViT architecture

The ViT model used in this work was pre-trained on ImageNet-21k (14 million images, 21,843 classes) at resolution 224×224. It was introduced in the work of Dosovitskiy et al. [13] and first released at https://github.com/google-research/vision_transformer. The ViT is a transformer encoder model (BERT-like) pre-trained in a supervised fashion on this large image collection. Images are presented to the model as a sequence of fixed-size patches (resolution 16×16), which are linearly embedded (see Fig. 2). The model was trained on TPUv3 hardware (8 cores). All model variants were trained with a batch size of 4096 and a learning rate warm-up of 10k steps. For ImageNet, the authors found it beneficial to additionally apply gradient clipping at global norm 1. Note that for fine-tuning, the best results were obtained with a higher resolution (384×384).

Fig. 2

Architecture of ViT.
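The patch preparation for a 224×224 input with 16×16 patches can be sketched as follows. This is an illustrative NumPy example: the RGB channel count, the embedding width of 32 and the random projection and positional embeddings are assumptions made for the sketch and are far smaller than in the actual model.

```python
import numpy as np


def image_to_patches(image, patch=16):
    """Split an (H, W, C) image into flattened non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    grid = image.reshape(h // patch, patch, w // patch, patch, c)
    grid = grid.transpose(0, 2, 1, 3, 4)      # (rows, cols, patch, patch, C)
    return grid.reshape(-1, patch * patch * c)


def embed_patches(patches, projection, positional):
    """Linear patch embedding followed by additive positional embeddings."""
    return patches @ projection + positional


rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))
patches = image_to_patches(image)             # 14 x 14 = 196 patches
projection = rng.normal(size=(patches.shape[1], 32))
positional = rng.normal(size=(patches.shape[0], 32))
tokens = embed_patches(patches, projection, positional)
```

With these sizes the image becomes a sequence of 196 tokens, each built from 16 × 16 × 3 = 768 pixel values, and it is this sequence that the transformer encoder processes.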

2.7 Questionnaire

Three electronic questionnaires were made with the Google Forms application. Each questionnaire consisted of 100 images, of which 80 were generated and 20 were real. The images were divided into four sections for evaluation. In the questionnaires, the experts were asked to choose whether each image was real or fake and whether it corresponded to general pneumonia (any possible cause of pneumonia, such as fungal) [25], bacterial pneumonia, viral pneumonia, or did not correspond to pneumonia.

3 Experiments

This section presents the experiments and the results obtained through the evaluation of the generated images. The performance metrics FID and IS of the generated images were contrasted with the results obtained by the ViT classification of the generated images and the results of the applied questionnaire to the medical experts.

3.1 Data pre-processing

The images of the dataset described in Section 2.1 were divided into 3 files: 1341 normal X-ray images, 2531 bacterial pneumonia images and 1345 viral pneumonia images. The images were resized to 256×256 pixels. Then the images were processed into TFRecord files with a batch size of 25. Additionally, the pixel values were scaled to the [–1, 1] interval.
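The scaling step can be sketched as below, assuming 8-bit input images with pixel values in [0, 255]. The function names are illustrative, and the TFRecord serialisation used in this work is not reproduced here.

```python
import numpy as np


def scale_to_unit_range(image_uint8):
    """Map 8-bit pixel values from [0, 255] to [-1, 1]."""
    return np.asarray(image_uint8, dtype=np.float32) / 127.5 - 1.0


def unscale(image):
    """Inverse mapping from [-1, 1] back to 8-bit pixel values."""
    return np.clip((np.asarray(image) + 1.0) * 127.5, 0, 255).round().astype(np.uint8)
```

The inverse mapping is needed when generator outputs in [–1, 1] are saved or displayed as ordinary images.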

3.2 Results and analysis

The CycleGAN model was trained for 100 epochs with a total of 1,341 synthetic images generated of general pneumonia (GP), bacterial pneumonia (BP), and viral pneumonia (VP). The quality of the generated images was assessed at epochs 25, 50, 75 and 100. For each set of generated images, the corresponding FID and IS values were calculated (see Table 1), together with the corresponding loss functions (see Fig. 3). Four loss functions are shown: one for the generator of the pneumonia-like images, another for the generator of the normal chest X-ray images, and two more for their corresponding discriminators.

Table 1
FID and IS evaluation of CycleGAN generated images

Dataset	Epochs	FID	IS
General Pneumonia	25	64.629	1.8011
	50	93.1049	1.8217
	75	106.5786	1.8568
	100	86.5787	2.3461
Bacterial Pneumonia	25	68.9245	2.1217
	50	64.5383	1.9673
	75	78.4502	1.824
	100	67.4041	1.296
Viral Pneumonia	25	54.4719	2.2069
	50	63.9624	2.2485
	75	77.9493	2.489
	100	60.9601	2.5063

Fig. 3

Loss function of 256×256 CycleGAN model during 100 epochs of training.

Visually, the quality of the generated images (see Fig. 4) is difficult to assess for someone without medical knowledge. According to the FID values, the best quality corresponds to the viral pneumonia images generated after 25 epochs of training, with an FID of 54.4719. According to the IS metric, the best quality also corresponds to viral pneumonia images, but after 100 training epochs, with an IS of 2.5063. On their own, these metrics do not tell us much about the medical usefulness of the generated images. Contrasting them with the results of the questionnaires (see Table 2), it is possible to observe that good FID and IS values do not translate into a good evaluation by the medical experts.

Fig. 4

Results obtained from CycleGAN: A) Original Image, B) General pneumonia, C) Bacterial pneumonia, D) Viral pneumonia.

For the GP generated images, the best expert assessment average was 0.58 at 25 epochs of training, which matches the best FID but not the best IS of its group. The best BP generated images were obtained at 75 epochs of training, with an average expert assessment of 0.9 that matches neither the best FID nor the best IS of its group. Overall, this group obtained the best expert assessment. For the VP generated images, the best expert assessment average of 0.53 was obtained at 100 epochs of training, which matches the best IS but not the best FID of its group. These results give evidence that the FID or IS score of the generated images does not translate into their ability to pass as images of real pneumonia [2]. The evaluation by the experts is important but also presents problems, since the number of images shown to the experts has to be reasonable while still being representative of the almost limitless images that the CycleGAN model can generate.

Table 2

Expert evaluation of generated images

Dataset	Epochs	Expert 1		Expert 2		Expert 3
		SR*	RR*	SR	RR	SR	RR
GP	25	0.55	0.8	0.6	0.8	0.6	0.6
	50	0.35	0.8	0	0.8	0.35	0.8
	75	0.45	1	0.4	0.8	0.35	0.8
	100	0.45	0.6	0.3	0.6	0.45	0.8
BP	25	0.7	0.6	0.6	1	0.7	0.6
	50	0.7	0.8	0.45	0.6	0.65	0.4
	75	0.8	0.2	0.95	1	0.95	0.8
	100	0	0.6	0.2	0.2	0.15	0.8
VP	25	0.55	1	0.6	1	0.35	1
	50	0.35	1	0.8	0.6	0.35	0.6
	75	0.68	0.8	0.55	1	0.25	1
	100	0.55	0.6	0.7	1	0.35	1

*SR: synthetic image and class identified as real, RR: real image and class identified as real.
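The comparison between the expert assessment and the metrics can be recomputed directly from the tables. The sketch below copies the SR values from Table 2 and the FID and IS values from Table 1, averages the three experts per epoch, and reports the epoch preferred by each criterion (highest expert average, lowest FID, highest IS). Only the variable and function names are introduced here for illustration.

```python
# SR (synthetic image judged real) for Experts 1-3, copied from Table 2.
SR = {
    "GP": {25: (0.55, 0.6, 0.6), 50: (0.35, 0.0, 0.35),
           75: (0.45, 0.4, 0.35), 100: (0.45, 0.3, 0.45)},
    "BP": {25: (0.7, 0.6, 0.7), 50: (0.7, 0.45, 0.65),
           75: (0.8, 0.95, 0.95), 100: (0.0, 0.2, 0.15)},
    "VP": {25: (0.55, 0.6, 0.35), 50: (0.35, 0.8, 0.35),
           75: (0.68, 0.55, 0.25), 100: (0.55, 0.7, 0.35)},
}
# FID (lower is better) and IS (higher is better), copied from Table 1.
FID = {
    "GP": {25: 64.629, 50: 93.1049, 75: 106.5786, 100: 86.5787},
    "BP": {25: 68.9245, 50: 64.5383, 75: 78.4502, 100: 67.4041},
    "VP": {25: 54.4719, 50: 63.9624, 75: 77.9493, 100: 60.9601},
}
IS = {
    "GP": {25: 1.8011, 50: 1.8217, 75: 1.8568, 100: 2.3461},
    "BP": {25: 2.1217, 50: 1.9673, 75: 1.824, 100: 1.296},
    "VP": {25: 2.2069, 50: 2.2485, 75: 2.489, 100: 2.5063},
}


def best_epochs(group):
    """Epoch preferred by the experts, by FID and by IS for one image group."""
    expert_mean = {epoch: sum(v) / len(v) for epoch, v in SR[group].items()}
    best_expert = max(expert_mean, key=expert_mean.get)
    return {
        "expert": best_expert,
        "expert_mean": round(expert_mean[best_expert], 2),
        "fid": min(FID[group], key=FID[group].get),
        "is": max(IS[group], key=IS[group].get),
    }


for group in ("GP", "BP", "VP"):
    print(group, best_epochs(group))
```

Listing the three preferred epochs side by side makes it easy to see, for each group, where the expert judgment and the two metrics agree and where they diverge.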

Although the judgment of experts cannot be substituted, it is important to have a way to evaluate the huge number of images that can be generated using a more objective criterion: whether or not the generated image is capable of passing for a chest x-ray image with pneumonia. Since this is a classification problem, the trained and validated ViT model vit-xray-pneumonia-classification was used, which is a fine-tuned version of the ViT architecture (see Fig. 2) trained on the NIH Chest X-rays dataset. The performance of the trained ViT model (see Fig. 5) was: Loss: 0.0868, Accuracy: 0.9742.

Fig. 5

Performance of ViT pneumonia classification model: A) Accuracy, B) Training loss, C) Evaluation loss.

For a baseline evaluation of the ViT model, 4 batches of 1000 real images each were given to the ViT model to classify (see Table 3): general pneumonia, bacterial pneumonia, viral pneumonia and normal chest x-ray images. Since the ViT model used in this work only classifies into 2 classes (pneumonia and normal), with a probability associated with each class, a classification probability above 0.98 was established as the selection threshold.

Table 3

Evaluation of ViT on real images

Dataset	Positive	Prob.	Negative	Prob.
Pneumonia	0.975	0.992	0.025	0.797
Bacterial P.	0.963	0.993	0.037	0.885
Viral P.	0.945	0.991	0.055	0.887
Normal	0.983	0.988	0.017	0.887

*Positive: positive identified class, Prob.: mean probability for classification in given class, Negative: negative identified class.

Table 4

Evaluation of ViT on the generated images

Dataset	Epochs	Pos.	Prob.	Neg.	Prob.
GP	25	0.761	0.9929	0.239	0.462
	50	0.95	0.9933	0.05	0.5524
	75	0.916	0.9928	0.084	0.7311
	100	0.903	0.992	0.097	0.711
BP	25	0.94	0.9935	0.06	0.4415
	50	0.93	0.9933	0.07	0.2751
	75	0.953	0.9934	0.047	0.376
	100	0.888	0.9892	0.112	0.6842
VP	25	0.579	0.9925	0.43	0.3571
	50	0.584	0.9914	0.416	0.6282
	75	0.677	0.9913	0.323	0.6562
	100	0.533	0.9922	0.467	0.3139

Fig. 6

Overall results of probability classification of ViT of generated images: a) General pneumonia positive, b) Bacterial pneumonia positive, c) Viral pneumonia positive, d) General pneumonia negative, e) Bacterial pneumonia negative, f) Viral pneumonia negative.

Fig. 7

Example of applying Normal Chest X-Ray generator from General Pneumonia trained model: A) Pneumonia Image, B) Normal Chest X-Ray.

For general and bacterial pneumonia, on average more than 80% of the generated images were identified as pneumonia by the ViT model. The generated images of viral pneumonia had the worst evaluation by the ViT model, with almost half of the generated images not identified as pneumonia. This stands in contrast to the fact that these images had the best FID and IS values, and it seems to coincide with the experts’ assessment, where on average 50% of the generated images were identified as images of viral pneumonia.
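The averages discussed above can be recomputed from the positive fractions in Table 4. The short sketch below copies the "Pos." column for the four training checkpoints of each group; the variable and function names are illustrative.

```python
# Fraction of generated images classified as pneumonia by the ViT model
# at epochs 25, 50, 75 and 100, copied from the "Pos." column of Table 4.
POSITIVE = {
    "GP": [0.761, 0.95, 0.916, 0.903],
    "BP": [0.94, 0.93, 0.953, 0.888],
    "VP": [0.579, 0.584, 0.677, 0.533],
}


def mean_positive(group):
    """Average positive fraction over the four checkpoints of one group."""
    values = POSITIVE[group]
    return sum(values) / len(values)


for group in POSITIVE:
    print(group, round(mean_positive(group), 3))
```

The general and bacterial groups both average well above 0.8, while the viral group averages close to 0.6, which is the gap the ViT evaluation exposes despite the favourable FID and IS values of the viral images.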

4 Discussion

The use of CycleGAN in this context offers a two-fold benefit. Firstly, it aids in the augmentation of pneumonia datasets, potentially improving the generalizability of models trained on them [6, 10, 11, 21–23]. Secondly, the bidirectional nature of CycleGAN provides a way of understanding the differences between healthy and pneumonia chest x-rays [24]. However, challenges persist: distinguishing between real and subtle synthetic features is critical, and the generated images must be used carefully in clinical applications [7]. A further advantage of the CycleGAN model lies in the training of two generators to guarantee the cycle consistency of the model. In this work, the expert evaluation and the classification of a ViT model were contrasted with the FID and IS values of the generated images. This comparison showed that the FID and IS values associated with the quality of the generated images are insufficient to determine how realistic a generated image is, which is striking since these metrics measure proximity to the distribution of the dataset used for training. This can have several explanations, ranging from the type of GAN model used to the difficulty the model has in capturing the complexity of the distribution to be learned, which is especially pronounced for images.

5 Conclusion

The application of CycleGANs to generating synthetic pneumonia chest x-rays showcases the potential of generative models in medical imaging. While promising, care must be taken to validate the clinical relevance of the generated images, ensuring they serve as a boon, not a bane, to medical diagnostics. This work shows that the FID and IS values associated with the quality of the generated images are insufficient to determine how realistic the generated chest x-ray pneumonia images are. This suggests the need for a better assessment metric that allows generated images to be used more reliably and helpfully. For future work, a modified ViT model is proposed for classifying more classes. To help avoid the mode collapse problem, multimodal models such as Augmented CycleGAN [4] are proposed, and to obtain better image quality, the architecture of this work is to be extended to 512×512 and 1024×1024.

5.1 Dataset & Code

The CycleGAN code used in this paper and the dataset are available at: https://github.com/Lugo1025/PneumoCGAN. The pretrained ViT model for pneumonia classification is available at: https://huggingface.co/lxyuan/vit-xray-pneumonia-classification. The dataset used for the training of the pretrained ViT model for pneumonia classification is available at: https://www.kaggle.com/datasets/nih-chest-xrays/data.

Acknowledgments

We gratefully acknowledge the help with the professional assessment of the images provided by Dr. Yazmin Guillen Dolores from the National Institute of Cardiology, and Dr. Gustavo Lugo Goytia and Dr. Sergio Gustavo Monasterios López from the National Institute of Respiratory Diseases. The authors also wish to thank the support of the Instituto Politécnico Nacional (COFAA, SIP-IPN, Grant SIP 20240610) and the Mexican Government (CONAHCyT, SNI).

References

1. Zhu J.Y., Park, Isola, Efros A.A., Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE international conference on computer vision (pp. 2223–2232). (2017).

2. Borji, Pros and cons of GAN evaluation measures: New developments, Computer Vision and Image Understanding 215 (2022), 103329.

3. Goodfellow, Pouget-Abadie, Mirza, Warde-Farley, Ozair, ..., Bengio, Generative adversarial networks, Communications of the ACM 63(11) (2020), 139–144.

4. Pang, Lin, Qin, Chen, Image-to-image translation: Methods and applications, IEEE Transactions on Multimedia 24 (2021), 3859–3881.

5. Morís D.I., de Moura, Novo, Ortega, Cycle generative adversarial network approaches to produce novel portable chest X-rays images for COVID-19 diagnosis. In ICASSP 2021–2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 1060–1064). IEEE. (2021, June).

6. Antoniou, Storkey, Edwards, Data augmentation generative adversarial networks, arXiv preprint arXiv:1711.04340. (2017).

7. Segal, Rubin D.M., Rubin, Pantanowitz, Evaluating the clinical realism of synthetic chest x-rays generated using progressively growing gans, SN Computer Science 2(4) (2021), 321.

8. Zebin, Rezvy, COVID-19 detection and disease progression visualization: Deep learning on chest X-rays for classification and coarse localization, Applied Intelligence 51 (2021), 1010–1021.

9. Sanchez, Hinojosa, Arguello, Kouamé, Meyrignac, Basarab, Cx-dagan: Domain adaptation for pneumonia diagnosis on a small chest x-ray dataset, IEEE Transactions on Medical Imaging 41(11) (2022), 3278–3288.

10. Malygina, Ericheva, Drokin, Data augmentation with GAN: Improving chest X-ray pathologies prediction on class-imbalanced cases. In International conference on analysis of images, social networks and texts (pp. 321–334). Cham: Springer International Publishing. (2019, July).

11. Buragadda, Rani K.S., Vasantha S.V., Chakravarthi M.K., HCUGAN: Hybrid Cyclic UNET GAN for Generating Augmented Synthetic Images of Chest X-Ray Images for Multi Classification of Lung Diseases, International Journal of Engineering Trends and Technology 70(2) (2022), 229–238.

12. Sharma, Jain J.S., Bansal, Gupta, Feature extraction and classification of chest x-ray images using CNN to detect pneumonia. In 2020 10th international conference on cloud computing, data science & engineering (Confluence) (pp. 227–231). IEEE. (2020, January).

13. Dosovitskiy, Beyer, Kolesnikov, Weissenborn, Zhai, Unterthiner, ..., Houlsby, An image is worth 16×16 words: Transformers for image recognition at scale, arXiv preprint arXiv:2010.11929. (2020).

14. Vashisht, Lamba, Sharma, Pneumonia Classification using CNN-GAN. In 2023 International Conference on Sustainable Computing and Data Communication Systems (ICSCDS) (pp. 456–461). IEEE. (2023, March).

15. Vision Transformer: Vit and its Derivatives, arXiv e-prints, arXiv-2205. (2022).

16. Kermany, Zhang, Goldbaum, Labeled optical coherence tomography (oct) and chest x-ray images for classification, Mendeley Data 2(2) (2018), 651.

17. Tyagi, Pathak, Nijhawan, Mittal, Detecting Pneumonia using Vision Transformer and comparing with other techniques. In 2021 5th International Conference on Electronics, Communication and Aerospace Technology (ICECA) (pp. 12–16). IEEE. (2021, December).

18. Raghu, Unterthiner, Kornblith, Zhang, Dosovitskiy, Do vision transformers see like convolutional neural networks?, Advances in Neural Information Processing Systems 34 (2021), 12116–12128.

19. Dai, Wan, Zhang, Yan, ..., Vajda, Visual transformers: Token-based image representation and processing for computer vision, arXiv preprint arXiv:2006.03677. (2020).

20. Deng, Dong, Socher, L.J., Fei-Fei, Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition (pp. 248–255). IEEE. (2009, June).

21. Wolterink J.M., Dinkla A.M., Savenije M.H., Seevinck P.R., van den Berg C.A., Išgum, Deep MR to CT synthesis using unpaired data. In Simulation and Synthesis in Medical Imaging: Second International Workshop, SASHIMI 2017, Held in Conjunction with MICCAI 2017, Québec City, QC, Canada, September 10, 2017, Proceedings 2 (pp. 14–23). Springer International Publishing. (2017).

22. Liu, Wang, Zhang, Traverso, Dekker, Zhang, Chen, CycleGAN Clinical Image Augmentation Based on Mask Self-Attention Mechanism, IEEE Access 10 (2022), 105942–105953.

23. Malygina, Ericheva, Drokin, GANs’ N Lungs: improving pneumonia prediction, arXiv preprint arXiv:1908.00433. (2019).

24. Joyce, Chartsias, Tsaftaris S.A., Robust multimodal MR image synthesis. In Medical Image Computing and Computer Assisted Intervention MICCAI 2017: 20th International Conference, Quebec City, QC, Canada, September 11–13, 2017, Proceedings, Part III 20 (pp. 347–355). Springer International Publishing. (2017).

25. Cordier J.F., Organising pneumonia, Thorax 55(4) (2000), 318–328.
