Abstract
Background
Deep learning-based decision support systems require synthetic images generated by adversarial networks, which require clinical evaluation to ensure their quality.
Objective
The study evaluates the perceived realism of high-dimension synthetic spine fracture CT images generated by Progressive Growing Generative Adversarial Networks (PGGANs).
Conclusion
The study reveals that AI-based PGGAN can generate realistic synthetic spine fracture CT images up to 512 × 512 pixels, making them difficult to distinguish from real images, and improving the automatic spine fracture type detection system.
Keywords
Introduction
Artificial Intelligence (AI) has dominated the area of computer vision. The most innovative computer vision techniques currently available are based on deep learning (DL). DL networks have the advantage of being able to extract features automatically, allowing researchers to describe images without building intricate manual features. Medical image analysis is a significant area for research in this regard and is receiving more attention.1–4 It can reasonably be predicted that DL-based medical image analysis will offer many research possibilities in the near future. DL models mainly require large training datasets to prevent overfitting. Overfitting is the term used to describe the situation where a network learns a function with extremely high variance in order to model the training data perfectly. It is quite challenging to gather large datasets in medical applications. Data augmentation is one solution for limited datasets. The term “data augmentation” refers to a group of methods used to increase the amount of training data so that better deep learning models can be built.
A modern technique called Generative Adversarial Networks (GANs) creates synthetic but realistic-looking images. Despite the objections to the use of synthetic images in the medical area, GANs have received significant attention in radiological research due to their undeniable benefits. 5 The issue of preserving patients’ privacy is always brought up when diagnostic radiological images are used in the public domain. Deep learning researchers have found this to be a major challenge. These privacy worries might be addressed using GANs.
In GANs, a generative model is trained in an adversarial manner. The approach is based on a simple yet effective concept: it is composed of a generator network and a discriminator network. The generator tries to generate realistic synthetic images that can deceive the discriminator, which in turn tries to distinguish real from synthetic images. Ideally, the distribution of the samples that the generator produces from a latent code matches the distribution of the real samples used for training. The discriminator network is trained to assess the generated samples. Additionally, a gradient can be used to direct both networks in the correct direction. The discriminator, which acts as an adaptive loss function, is typically of secondary importance and is no longer considered once the generator has been trained. The absence of high-quality training data with expert labels has put standard supervised learning techniques under pressure. Building high-quality data needs a significant amount of effort from experts and results in large expenditures. 6 GAN training is unstable and prone to collapse, particularly at higher resolutions. High-quality realistic images at higher resolutions can be generated by PGGANs. 7 PGGANs have shown that, by using progressive growth techniques, high-resolution images (1024 × 1024) can be produced. 7
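The adversarial game described above is commonly written as the minimax objective of the original GAN formulation. It is given here only as a reference sketch: the symbols $G$, $D$, $p_{\text{data}}$, and $p_z$ are notation assumed for illustration, and the model trained in this study uses the WGAN-GP loss rather than this exact objective.

```latex
\min_{G}\,\max_{D}\; V(D, G)
  = \mathbb{E}_{x \sim p_{\text{data}}(x)}\bigl[\log D(x)\bigr]
  + \mathbb{E}_{z \sim p_{z}(z)}\bigl[\log\bigl(1 - D(G(z))\bigr)\bigr]
```

Here $z$ is the latent code, $G(z)$ is a synthetic image, and $D(\cdot)$ is the discriminator's estimate of the probability that its input is real.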
The technical and contextual reasons for preferring PGGANs in spine CT synthesis are:
First, as noted, PGGANs are known for producing high-quality images, which can be useful in multiple fields, including computer graphics and medical imaging. Furthermore, they are also available as pre-trained models, trained on CelebA-HQ or ImageNet, which are much less computationally expensive and time-consuming to train for small custom datasets. Second, PGGANs are effective at capturing complex data distributions, especially those present in medical images. They are capable of learning and modelling complex structures and patterns in the input data and are hence suitable for the generation of spine fracture CT images. In conclusion, PGGAN was selected for data synthesis in the present study because it meets specific application requirements such as computational power, dataset size, image quality, and the ability to learn complex distributions.8–10 Moreover, we intend to check the perceived realism across different image dimensions during spine fracture CT synthesis. Specifically, the study aims to determine whether PGGANs can produce realistic spine fracture CT images that are indistinguishable from actual ones, even by subject experts; to assess the quality of the generated spine fracture CT images with Visual Turing Tests that check the perceived realism of PGGAN-generated synthetic images from the perspective of spine surgeons; and to determine whether PGGAN methods can be helpful in domains such as spine fracture CT imaging, where images must be processed at relatively high resolutions (512 × 512) to reveal fine structural details such as fracture lines, which are of great diagnostic importance.
The quality assessment of synthetic images is difficult. Various evaluation metrics have been suggested in the literature, including the sliced Wasserstein distance, the Fréchet inception distance, and the inception score. 7 However, they are mostly relevant for comparing various GAN synthesis techniques and are not suited for evaluation of image realism. Our research's major goal is to
Literature review
Deep learning techniques such as deep neural networks have seen quick success in the field of medical imaging. GANs,11–13 a type of deep neural network, are used to generate realistic synthetic images. Data augmentation, image reconstruction, image segmentation, and image transformation across different modalities are all applications of GANs.14–17 PGGANs have been used to illustrate that this technique is capable of generating realistic medical images in a number of different domains. The generation of synthetic medical images for data augmentation to enhance classification performance with sparse data has also been demonstrated in a number of studies, such as neuronal imaging, stain-free cancer cell imaging, and CT imaging of liver lesions,8–10 X-rays in Covid-19 diagnosis, 18 OCT for glaucoma, 19 skin lesion generation, 20 cephalometric X-rays, 21 lung cytological images, 22 ocular surface image generation, 23 AS-OCT image generation, 24 X-ray synthesis, 25 and brain tumor MRI. 26 More specifically, in the field of radiological applications, GANs have been used for data augmentation,27–29 translation across various radiological imaging modalities,30–34 data segmentation,35–38 and image reconstruction and denoising.39–42
There is active discussion concerning how to assess GAN synthetic images. The generator and discriminator models are trained concurrently to maintain an equilibrium in GANs, unlike other DL models that are trained using a loss function until convergence. Therefore, it is impossible to determine the model's relative or absolute quality just based on loss.
GAN-generated retinal images, 43 body computed tomography images, 44 myocardial perfusion images, 45 lung cancer images, 46 and synthetic brain MR images 47 have been evaluated with visual Turing tests. These studies suggest that a visual Turing test can judge the quality of GAN-generated images effectively.
In summary, despite the progress made by GANs in medical image synthesis, all of these findings point to a large research gap for PGGAN-powered spine fracture CT image generation. The problem of a limited training dataset for classifying spine fracture types can be addressed by the suggested PGGAN technique. Research is lacking on the clinical evaluation of synthetic CT images by visual Turing tests to determine the perceived realism of PGGAN-generated images, and on the statistical analysis of the VTT results to identify the interobserver reliability across different image sizes. The identification of these research gaps prompted the current research.
Data collection
The institutional ethics committee of the tertiary care hospital approved this retrospective study protocol (No. 503/2020). CT scans of patients aged 18–60 years who had undergone spine CT between July 2017 and June 2020 were considered. In 456 patients, five images from each patient were considered. A total of 2820 images belonging to 8 sub-types of spine fractures, namely A0 – Minor injuries, A1 – Wedge compression injuries, A2 – Split injuries, A3 – Incomplete burst injuries, A4 – Complete burst injuries, B1 – Chance fractures, B2 – Posterior tension band disruption injuries, and C – Translation injuries, were used in the model development. Figure 1 shows sample VCF CT images of the 8 sub-types collected.

Representative CT images illustrating eight distinct types of spinal fractures collected from tertiary care hospital.
PGGAN augmentation
PGGANs have shown that, by using progressive growth techniques, high-resolution images (1024 × 1024) can be produced. 7 Therefore, high-resolution 512 × 512-pixel synthetic spine fracture CT images were generated using PGGANs. Previous GAN models, including deep convolutional GANs, were able to produce synthetic images only at a relatively modest resolution (256 × 256), in contrast to PGGAN. 7 The PGGAN architecture used to generate the spine fracture CT images comprises two neural networks: the generator (G) and the discriminator (D). Using a 512-dimensional latent vector, spine fracture CT images were generated from a low resolution of 4 × 4 pixels up to 512 × 512 pixels. The image resolution is raised by a factor of 2 in each step. To double the resolution, a new block of layers is smoothly added to both networks (G and D) without disturbing the existing layers. The generator and discriminator grow synchronously. Throughout the training of PGGAN, all the layers remain trainable. During this incremental growth, large-scale features are learned first, followed by finer-scale details, rather than all scales being learned at once. The implemented PGGAN architecture is shown in Figure 2.
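The doubling schedule and the smooth addition of new layers can be illustrated with a minimal sketch. It assumes that "smoothly added" refers to the fade-in blending of the original PGGAN formulation, in which the upsampled output of the previous stage is mixed with the output of the new block using a weight `alpha` that ramps from 0 to 1. The function names and the list-of-lists image representation are hypothetical and chosen only for illustration; this is not the study's implementation.

```python
def resolution_schedule(start=4, final=512):
    """List the square resolutions visited when each stage doubles the last."""
    if start < 1 or final < start:
        raise ValueError("need 1 <= start <= final")
    stages = [start]
    while stages[-1] < final:
        stages.append(stages[-1] * 2)
    if stages[-1] != final:
        raise ValueError("final must equal start times a power of two")
    return stages


def upsample_nearest(image):
    """Double the height and width of a 2-D list image by repeating pixels."""
    doubled = []
    for row in image:
        wide = [value for value in row for _ in range(2)]
        doubled.append(wide)
        doubled.append(list(wide))
    return doubled


def fade_in(previous_output, new_output, alpha):
    """Blend the upsampled previous-stage image with the new block's image.

    alpha = 0 keeps only the old pathway; alpha = 1 keeps only the new block.
    (Assumed fade-in rule; the source only says layers are "smoothly added".)
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    upsampled = upsample_nearest(previous_output)
    return [
        [(1.0 - alpha) * old + alpha * new for old, new in zip(old_row, new_row)]
        for old_row, new_row in zip(upsampled, new_output)
    ]
```

With the defaults, `resolution_schedule()` yields the eight stages from 4 × 4 to 512 × 512 described above, so seven doubling steps separate the first and last stage.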

Implemented PGGAN architecture 7 for generation of synthetic spine fracture CT images.
During training, a batch size of 8 was used for 512 × 512 images and a batch size of 16 for 256 × 256 and lower-dimension images, to avoid memory problems. A learning rate of 0.001, the Leaky ReLU activation function, the WGAN-GP loss function, and the Adam optimizer with β1 = 0, β2 = 0.99, and ε = 1 × 10⁻⁸ were used to generate synthetic images of excellent quality with a clear view of the spine fracture. The PGGAN-generated images in 128 × 128, 256 × 256, and 512 × 512 dimensions are shown in Figure 3.
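For reference, the WGAN-GP loss named above is usually written as follows. This is the standard formulation from the WGAN-GP literature, not a detail taken from this study: the gradient-penalty weight $\lambda$ is not reported in the text, and the symbols $\mathbb{P}_r$, $\mathbb{P}_g$, and $\hat{x}$ are notation assumed for illustration.

```latex
L_D = \mathbb{E}_{\tilde{x} \sim \mathbb{P}_g}\bigl[D(\tilde{x})\bigr]
    - \mathbb{E}_{x \sim \mathbb{P}_r}\bigl[D(x)\bigr]
    + \lambda\, \mathbb{E}_{\hat{x} \sim \mathbb{P}_{\hat{x}}}
      \Bigl[\bigl(\lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1\bigr)^2\Bigr],
\qquad
L_G = -\,\mathbb{E}_{\tilde{x} \sim \mathbb{P}_g}\bigl[D(\tilde{x})\bigr]
```

Here $\hat{x} = \epsilon x + (1 - \epsilon)\tilde{x}$ with $\epsilon \sim U[0, 1]$ is a random interpolation between a real image $x$ and a generated image $\tilde{x}$; this $\epsilon$ is unrelated to the Adam optimizer's epsilon.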

PGGAN-generated synthetic spine fracture CT images: (a) 128 × 128, (b) 256 × 256, and (c) 512 × 512 generated fracture CT images of types A0, A1, A2, A3, A4, B1, B2, and C (presented from the top left cell of the table), respectively.

Prediction accuracy in the identification of real and generated spine fracture CT images by spine surgeons.
Global features in the generated spine fracture CT images can be seen in Figure 3. The images realistically reflect the standard spine fracture features. Fracture subtypes A0, A1, A2, A3, A4, B1, B2, and C are accurately reconstructed by PGGANs. Bony structures are correctly positioned, and the fracture lines are clearly shown. The quality assessment of the generated images by the spine surgeons further supports the quality of the generated images. Figure 4 depicts this quality assessment. 12 The Visual Turing Test (VTT) is used to distinguish the generated synthetic spine fracture images from the original spine fracture images, and the Fracture Identification Test (FIT) is used to identify the spine fracture type in the generated images. The VTT involved three spine surgeons. All spine surgeons in the spine clinic received type A0, A1, A2, A3, A4, B1, B2, and C fracture CT images as part of the VTT trial to assess the generated spine fracture images. The spine surgeons were unaware of one another's evaluations.
The VTT consists of 192 images in total, 96 of which are high-quality PGGAN-generated spine fracture images that correspond to 8 different fracture subtypes and 96 of which are real images. All spine surgeons were given the option to alter the image's angle of view or zoom in and out during the VTT. The evaluation process did not take the patient's information into account. Spine surgeons were informed that the image collection for an evaluation trial might consist entirely of real images, entirely of synthetic images, or a combination of real and generated images.
In the FIT experiment, all spine surgeons were given three types of (A, B, and C) generated images to assess their ability to recognise the type of spine fracture. The spine surgeons were unaware of one another's evaluations. The FIT consists of 192 images in total, 64 high-quality generated images of three different types (A, B, C). The spine surgeons were given the option to alter the image's angle of view or zoom in and out during FIT. Spine surgeons were briefed and requested to determine the type of spine fracture after receiving all of the synthetic fracture images for examination.
Statistical analysis to identify the interobserver reliability for VTT and FIT
Statistical analysis to measure interobserver reliability for VTT and FIT is an important tool for measuring the consistency and agreement among many observers in the interpretation and assessment of VTT. This study provides substantial insights on the dependability and uniformity of the assessments conducted during the study by quantifying the degree of agreement across several observers. Identifying any inconsistencies in the observations made by multiple people contributes to the quality and robustness of the findings drawn from these assessments. Furthermore, by using appropriate statistical methods such as Fleiss kappa coefficient, this study allows for the construction of trustworthy measures of agreement, which improves the overall reliability and robustness of the research findings. As a result, incorporating statistical analysis to evaluate interobserver reliability is a critical first step in ensuring the precision and reliability of the VTT data collected and reviewed for the study.
Interobserver agreement between the three spine surgeons for the VTT and FIT is evaluated using the Fleiss kappa method. The clinical evaluation observations meet each of the conditions given below for applying Fleiss kappa:
The observations of the three independent spine surgeons are categorical and nominal. The response variable contains the same number of categories for all spine surgeons. The categories evaluated by the spine surgeons are mutually exclusive (Real – R and Generated – G).
The Fleiss kappa scores are used to categorise the interobserver agreement between the spine surgeons. A score below 0.00 implies poor agreement, a score in the range 0.00–0.20 implies slight agreement, 0.21–0.40 implies fair agreement, 0.41–0.60 implies moderate agreement, 0.61–0.80 implies substantial agreement, and 0.81–1.00 implies almost perfect agreement.
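The Fleiss kappa computation and the agreement bands above can be sketched as follows. The input layout (one row per image, one count per category) and the function names are assumptions made for illustration; the study does not state which software was used.

```python
def fleiss_kappa(ratings):
    """Fleiss kappa for a table of per-item category counts.

    ratings[i][j] is the number of raters who assigned item i to category j.
    Every row must sum to the same number of raters.
    """
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in ratings):
        raise ValueError("each item needs the same number (>= 2) of ratings")
    n_categories = len(ratings[0])

    # Share of all assignments that went to each category.
    p_category = [
        sum(row[j] for row in ratings) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    # Observed agreement among rater pairs for each item.
    p_item = [
        (sum(count * count for count in row) - n_raters)
        / (n_raters * (n_raters - 1))
        for row in ratings
    ]
    p_observed = sum(p_item) / n_items
    p_expected = sum(p * p for p in p_category)
    if p_expected == 1.0:
        raise ValueError("kappa is undefined when only one category is used")
    return (p_observed - p_expected) / (1.0 - p_expected)


def agreement_level(kappa):
    """Map a kappa value to the agreement bands listed in the text."""
    if kappa < 0.0:
        return "poor"
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"
```

For the VTT, each row would hold two counts (Real, Generated) that sum to three, one per surgeon. Applied to the kappa values reported in this study, `agreement_level(0.546)` falls in the moderate band and `agreement_level(0.831)` in the almost perfect band.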
Quantitative methods of the PGGAN generated images:
The evaluation of PGGAN-generated images is often based on two metrics: Fréchet Inception Distance (FID) and Inception Score (IS). FID measures the similarity between generated images and real images by comparing feature vectors from a pre-trained Inception network. Lower FID scores suggest higher-quality generation, while higher IS values indicate diverse and high-quality images. These metrics are crucial for assessing the model's ability to produce realistic, high-resolution images across different scales. FID is more sensitive to image quality and diversity, while IS focuses on the distinguishability and variety of generated images. PGGAN, which progressively increases image resolution during training, relies on these metrics to evaluate its performance. The quantitative evaluation of the PGGAN model at different resolutions is presented in Table 2.
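The two metrics are conventionally defined as below. These are the standard definitions, given as a reference sketch; the symbols are assumed notation rather than taken from this study: $\mu_r, \Sigma_r$ and $\mu_g, \Sigma_g$ are the mean and covariance of Inception features for real and generated images, $p(y \mid x)$ is the Inception class distribution for image $x$, and $p(y)$ is its average over generated images.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\Bigl(\Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2}\Bigr)

\mathrm{IS} = \exp\!\Bigl(\mathbb{E}_{x \sim p_g}\bigl[\,
  D_{\mathrm{KL}}\bigl(p(y \mid x)\,\Vert\,p(y)\bigr)\bigr]\Bigr)
```

FID is zero when the two feature distributions have identical means and covariances, which is why lower values indicate generated images that are statistically closer to the real ones.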
Result
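The observations of the three spine surgeons for the VTT are shown in Table 1. The prediction accuracy in distinguishing between real and generated images is calculated from the confusion matrix values in Table 1 using the equation below, where TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives, respectively:
\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \]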
VTT confusion matrix.
Quantitative evaluation of PGGAN model at different resolutions.
The data evaluates spine surgeons’ performance in a Visual Turing Test (VTT) to distinguish between real and GAN-generated CT images across three resolutions: 128 × 128, 256 × 256, and 512 × 512. The results show a clear trend: as the resolution of the images increases, the surgeons’ accuracy decreases, and their false positive rate (FPR) rises. At the lowest resolution (128 × 128), surgeons achieved an average accuracy of 61.98% with a 46.87% false positive rate, indicating that they could identify real and GAN-generated images relatively well, though still with notable errors. However, at 256 × 256 resolution, their accuracy dropped to 58.33%, while the FPR increased to 56.25%, reflecting a growing difficulty in differentiating the images. The trend continues at 512 × 512 resolution, where accuracy falls further to 54.16%, and the FPR climbs to 63.54%, showing that the surgeons are now more frequently misclassifying real images as GAN-generated. The declining F1-scores across the resolutions (from 65.11% to 61.05%) reinforce this pattern of reduced performance as the image resolution increases. This decrease in accuracy and rise in false positives suggest that GAN-generated images are becoming increasingly realistic at higher resolutions, to the point where even trained surgeons struggle to differentiate them from real CT images. The improving quality of GAN-generated images, as indicated by these metrics, demonstrates the effectiveness of GANs in producing realistic medical images, particularly at higher resolutions (512 × 512), and highlights the challenge in visually detecting synthetic images in medical diagnostics as the technology advances. This is further justified by the quantitative evaluation of the PGGAN model at different resolutions, which is presented in Table 2.
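The accuracy, false positive rate, and F1-score discussed above can all be derived from the four confusion-matrix counts. The sketch below assumes that "generated" is the positive class, so that a false positive is a real image labelled as generated, consistent with the description above. The function name and the counts in the usage example are hypothetical and are not the values of Table 1.

```python
def vtt_metrics(tp, tn, fp, fn):
    """Accuracy, false positive rate, and F1 from confusion-matrix counts.

    Positive class: generated image (assumption).
    tp: generated images labelled generated; tn: real images labelled real;
    fp: real images labelled generated;      fn: generated images labelled real.
    """
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("confusion matrix is empty")
    accuracy = (tp + tn) / total
    # Fraction of real images that were wrongly flagged as generated.
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    # F1 is the harmonic mean of precision and recall, i.e. 2TP / (2TP + FP + FN).
    f1_denominator = 2 * tp + fp + fn
    f1 = 2 * tp / f1_denominator if f1_denominator else 0.0
    return {"accuracy": accuracy, "fpr": fpr, "f1": f1}


# Hypothetical counts for illustration only (not the study's data).
example = vtt_metrics(tp=40, tn=30, fp=20, fn=10)
```

Under this convention, a rising false positive rate means that more real images are being taken for synthetic ones, which is the pattern the surgeons showed as resolution increased.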
The results of the FIT, shown in Figure 5, offer important information about fracture type recognition at various image resolutions. Compared with lower resolutions (128 × 128 and 256 × 256 pixels), the accuracy of fracture type identification among all observers is much greater for images with a resolution of 512 × 512 pixels. This finding emphasizes the significance of image resolution for tasks involving the identification of fracture types. Higher resolutions preserve the finer details and complexities of the fractures more accurately and improve the visualization of complex elements such as fracture lines, patterns, and textures, which makes the images easier to interpret. The study thus highlights the pivotal impact of image resolution and the importance of optimizing imaging parameters for fracture classification tasks.

The prediction accuracy of spine surgeons in fracture type identification among PGGAN-generated CT images.
Detailed A, B, and C type identification at 512 × 512 is shown in Figure 6. Of the 48 images at 512 × 512, 16 belong to each of Types A, B, and C. Of the 16 images of each type, roughly 13–15 were correctly identified by the spine surgeons, indicating that the PGGAN images are realistic and depict the fracture line more clearly at 512 × 512 than at lower dimensions.

Detail prediction accuracy of spine surgeons in A, B, and C type identification in PGGAN 512 × 512 size generated CT images.
A moderate level of inter-observer agreement was found when Fleiss kappa statistics were used to evaluate the differentiation between real and generated images at a resolution of 512 × 512 pixels (k-value 0.546, p-value 0.0001). As Table 3 illustrates, this degree of agreement is significantly lower than in the assessments carried out at lower resolutions. The remarkable quality of the images generated by the PGGANs, which makes them very similar to real images, is the main cause of the merely moderate agreement. As a result, the spine surgeons entrusted with this differentiation found it very challenging to distinguish between real and synthetic images, especially when faced with images at the greater resolution of 512 × 512 pixels. This challenge derives from the fine detail and fidelity achieved by the PGGAN-generated images, which blur the distinction between synthetic and real data.
Interobserver reliability in identification of real and PGGAN generated images.
For A, B, and C type identification in the generated 512 × 512 images, Fleiss kappa demonstrates highly significant inter-observer agreement (k-value 0.831, p-value <0.0001), which is higher than at lower resolutions, as highlighted in Table 4. This is because the 512 × 512 generated images display the fracture lines of the A, B, and C fracture types more clearly. This statistical analysis demonstrates that PGGAN can produce 512 × 512 synthetic images that show the fracture lines more clearly than lower-resolution images.
Interobserver reliability in fracture type identification in PGGAN generated images.
Our study demonstrates PGGAN's outstanding ability to generate realistic, high-resolution (512 × 512) CT scans of spine fractures. PGGAN produces synthetic CT scans with image quality equivalent to that of the original CT slices. Three experienced spine surgeons performed a visual Turing test to assess the realism of the generated images. The evaluation results show that spine surgeons can discriminate between real and generated images with an accuracy of 61.98%, 58.33%, and 54.16% at resolutions of 128 × 128, 256 × 256, and 512 × 512, respectively. Notably, even at the greatest resolution of 512 × 512, the difficulty of distinguishing between synthetic and real spine fracture images is apparent, with accuracy only slightly lower than at lesser resolutions. These findings demonstrate the realism of the generated images, emphasizing the difficulty that spine surgeons encounter in distinguishing between synthetic and real spine fracture CT scans, particularly when presented with images in the 512 × 512 dimension. This demonstrates PGGAN's ability to generate high-fidelity synthetic medical images that closely resemble their real-world counterparts, indicating considerable potential for applications in spine CT imaging research and clinical practice. PGGAN augmentation also increases the performance of the automatic classification of spine fractures, as shown in Table 5. 48
Accuracy of VGG16 model with and without PGGAN augmentation in identification of B1 and B2 fractures.
The inter-observer agreement in distinguishing between real and synthetic images at a resolution of 512 × 512 was moderate, indicating that images generated by PGGANs are of outstanding quality and closely resemble real images. Notably, spine surgeons struggled to distinguish between real and generated images, especially at 512 × 512 resolution.
Furthermore, all evaluators reported a significant increase in accuracy with increasing image size when identifying fracture types (classified as A, B, or C). Surprisingly, inter-observer agreement achieved a high level for fracture type identification, notably in the 512 × 512 image category. This demonstrates PGGANs’ capacity to generate synthetic images with a resolution of 512 × 512 that clearly depict fracture lines, outperforming images with lower resolutions.
To increase the number of samples available for deep learning model training, researchers use data augmentation. In computer vision, augmentation is done by generating synthetic images resembling the real dataset. Small, imbalanced datasets pose additional difficulties in medical applications. The performance of a classifier can be greatly enhanced by the production of numerous high-resolution synthetic images. This work shows how useful the PGGAN approach is for generating 512 × 512 spine fracture CT images. High-resolution 512 × 512 images were more realistic, with moderate inter-observer agreement, indicating that PGGAN-generated images are of excellent quality and closely resemble real images, and that spine surgeons had difficulty in distinguishing real from generated images. The inter-observer agreement was highly significant for fracture type identification in 512 × 512 images. This indicates that AI-based PGGAN can produce 512 × 512 synthetic images that show the fracture lines more clearly than lower-resolution images, making them difficult to distinguish from real images. The automatic spine fracture type detection system performs better with these generated CT images.
Footnotes
Acknowledgements
We express our heartfelt gratitude to the Hospital for providing the data. We also thank the spine surgeons who took part in the Visual Turing Test.
Ethics approval
The Institutional Ethics Committee gave their approval to this investigation. The researchers affirm that the methods used in this study adhere to the institutional research committee's ethical guidelines.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
