Abstract
Objective:
This study compared two deep learning architectures, a generative adversarial network (GAN) and a convolutional neural network combined with implicit neural representations (CNN–INR), for generating cranial synthetic computed tomography (sCT) volumes from biplanar radiographs.
Methods:
Three cranial CT datasets comprising 235 subjects were used for training and evaluation. The GAN model used a dual-view generator and 3D PatchGAN discriminator, whereas the CNN–INR learned a coordinate-to-intensity mapping conditioned on radiographic features. Digitally reconstructed radiographs (DRRs) generated from non-contrast CTs provided standardized 2D inputs. Both models were trained for 170 epochs under identical preprocessing, normalization, and optimization conditions. Quantitative evaluation employed the structural similarity index measure (SSIM), peak signal-to-noise ratio (PSNR), and cosine similarity (CS), complemented by qualitative assessment of anatomical fidelity.
Results:
The GAN achieved higher SSIM and PSNR in both internal (0.739 ± 0.016; 16.70 ± 0.40 dB) and external (0.730 ± 0.012; 15.61 ± 0.27 dB) validations compared with the CNN–INR (0.686 ± 0.014; 16.41 ± 0.27 dB and 0.673 ± 0.012; 15.23 ± 0.16 dB, respectively). CS values were similar across models. Qualitatively, GAN-generated sCTs exhibited greater resemblance to ground-truth CTs, while CNN–INR reconstructions showed smoother spatial transitions.
Conclusions:
Both architectures demonstrated feasibility for volumetric reconstruction from limited radiographic projections. The GAN’s adversarial training enhanced perceptual realism and structural fidelity, whereas the CNN–INR maintained spatial continuity. Although neither model produced clinically viable sCTs, both represent promising approaches for future development of data-efficient tomographic reconstruction.
Background and Purpose
Computed tomography (CT) remains a cornerstone of modern neurosurgical imaging, routinely employed for cranial assessment in emergency diagnostics, preoperative planning, and postoperative monitoring.1,2 Its capacity to deliver high-resolution, three-dimensional (3D) anatomical detail is indispensable, particularly in the evaluation of complex intracranial pathologies. Nevertheless, CT imaging presents several challenges: it exposes patients to ionizing radiation, may be unavailable in rural or resource-limited settings, and often requires the transport of critically ill patients, such as those in intensive care units (ICUs), which introduces additional procedural risks.3–6
An emerging solution to these limitations is the generation of synthetic CTs (sCTs) derived from conventional biplanar radiographs. This technique leverages deep learning to reconstruct volumetric information from two or more X-ray projections, thereby reducing the need for repeated CT acquisitions. Intraoperatively, such sCTs could assist in surgical navigation by utilizing mobile C-arm imaging to generate volumetric reconstructions from minimal radiation exposure, while postoperatively, they could serve as a lower-dose alternative for follow-up imaging, for instance, after ventriculoperitoneal shunt placement.7–9 Cone beam CT, while increasingly available, remains limited to advanced surgical or interventional suites, results in far greater radiation doses, and requires significant time to acquire and generate 3D images; it will therefore not replace biplanar C-arms for relatively simple, fast, low-dose procedures. Similarly, bedside imaging in ICUs could be facilitated by this approach, eliminating the risks associated with patient transfer for conventional CT scanning.6,10
Recent advances in deep learning have opened new avenues for this task.11,12 Various neural network architectures have been explored for volumetric image synthesis, including convolutional neural networks (CNNs), implicit neural representations (INRs), generative adversarial networks (GANs), and diffusion-based models.12–16 These frameworks have demonstrated remarkable potential across multiple imaging domains. However, to our knowledge, synthetic CT generation from biplanar radiographs has not yet been investigated with a specific focus on the reconstruction of intracranial and parenchymal structures rather than solely osseous anatomy. 11
The rationale behind this approach is that deep learning models can aggregate weak but physically measurable variations present in X-ray images, including attenuation- and geometry-related cues, which are difficult for human observers to integrate and can support inference of latent structural information. 17 In this proof-of-concept study, we directly compare two advanced neural architectures, a GAN-based model and a CNN–INR hybrid model, for the generation of cranial sCTs from orthogonal radiographs. The objective is to assess their relative performance in reconstructing volumetric cranial anatomy and determine which modeling strategy shows greater promise for future applications in neuroradiology and neurosurgical imaging.
Methods
Overview
Two neural network architectures were implemented and compared for the generation of sCT volumes from biplanar cranial radiographs: a GAN based on the X2CT-GAN framework and a CNN–INR model.14,18 Both models were trained and evaluated using non-contrast cranial CT datasets acquired at the University Hospital Zurich (USZ) and from the publicly available CQ-500 dataset.19 A third, independent dataset from Emilia-Romagna, Italy, served as external validation.
To ensure strict comparability, the two models were trained on identical datasets, under identical preprocessing and normalization conditions, and for the same number of epochs. Model performance was evaluated quantitatively using structural and perceptual metrics, complemented by qualitative assessment of anatomical fidelity.
Figure 1 illustrates a schematic view of the general pipeline for both models.

Overview of GAN and CNN–INR Training and Validation Pipelines for Cranial Synthetic CT Generation. Figure legend: Schematic representation of the training and validation workflow for the GAN (upper panel) and CNN–INR (lower panel) models used to generate cranial sCT volumes from biplanar DRRs. Both architectures are shown with their respective input, feature extraction, and reconstruction components, as well as the evaluation metrics applied during validation. CT, computed tomography; DRR, digitally reconstructed radiograph; Ant., anterior; Post., posterior; sCT, synthetic computed tomography; CNN, convolutional neural network; INR, implicit neural representation; GAN, generative adversarial network; SSIM, structural similarity index measure; PSNR, peak signal-to-noise ratio; CS, cosine similarity.
CNN-INR architecture
The CNN–INR model reconstructs a volumetric sCT by learning a continuous mapping between three-dimensional (3D) spatial coordinates and corresponding Hounsfield Unit (HU) intensities, conditioned on features extracted from two orthogonal input radiographs. The network design follows the image-conditioned implicit representation paradigm proposed by Yu et al. 20 and extended by Sun et al., 14 adapted here for cranial imaging.
Each input projection
For a queried 3D coordinate
This defines a continuous function
Model supervision was performed using paired CT and digitally reconstructed radiograph (DRR) data. Random coordinates were sampled from the training volumes, and the network was optimized by minimizing the voxel-wise mean squared error:
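In generic notation, assumed here for illustration and not taken from the source, a voxel-wise mean squared error over N sampled coordinates can be written as follows. The symbols f_θ (the conditioned network), x_i (a sampled 3D coordinate), P_AP and P_LAT (the two input projections), and I(x_i) (the ground-truth CT intensity at x_i) are hypothetical names.

```latex
\mathcal{L}_{\mathrm{MSE}}(\theta)
  = \frac{1}{N} \sum_{i=1}^{N}
    \bigl( f_\theta(\mathbf{x}_i \mid P_{\mathrm{AP}}, P_{\mathrm{LAT}}) - I(\mathbf{x}_i) \bigr)^2
```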
By conditioning the reconstruction on the projection geometry and encoding spatial information through sinusoidal positional embeddings, the CNN–INR captures high-frequency details of cranial anatomy while maintaining global spatial coherence.
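As a minimal sketch of a sinusoidal positional embedding, the function below maps each coordinate component to sine and cosine features at doubling frequencies. The frequency schedule (powers of two times π), the number of frequencies, and the function name are assumptions for illustration; the exact embedding used by the model is not specified in this text.

```python
import math

def positional_encoding(coord, num_frequencies=4):
    """Encode each component c of a normalized coordinate as
    [sin(2^k * pi * c), cos(2^k * pi * c)] for k = 0 .. num_frequencies - 1.
    The schedule and num_frequencies are illustrative assumptions."""
    features = []
    for c in coord:
        for k in range(num_frequencies):
            angle = (2.0 ** k) * math.pi * c
            features.append(math.sin(angle))
            features.append(math.cos(angle))
    return features

# A 3D coordinate in normalized [-1, 1] space becomes 3 * 4 * 2 = 24 features.
encoded = positional_encoding((0.5, -0.25, 0.0), num_frequencies=4)
```

Higher frequencies let a coordinate-based network represent sharper intensity changes than the raw coordinates alone would allow.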
Overview of the GAN architecture
For comparison, a GAN model was implemented following the X2CT-GAN
In the generator, two parallel encoder–decoder branches independently process the anterior–posterior and lateral radiographs. Each branch employs a DenseNet-inspired encoder and a symmetric up-convolutional decoder. 22 Three specialized modules bridge the dimensionality gap between the 2D inputs and 3D output: (i) Connection A: a fully connected 2D to 3D bottleneck that projects the latent feature maps into a volumetric space; (ii) Connection B: convolutional skip connections that replicate channels along the depth dimension, preserving fine spatial information; and (iii) Connection C: a fusion block that averages and aligns decoder features from both views before passing them to a 3D fusion decoder producing a 128 × 128 × 128 volume.
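The idea behind Connections B and C can be sketched on toy data: a 2D feature map is replicated along the axis that its view projects through, and the two resulting volumes are averaged in a shared coordinate frame. This is a simplified illustration with single-channel features and an assumed axis convention (z, y, x); the actual modules operate on learned multi-channel features with convolutions, and all function names here are hypothetical.

```python
def expand_ap(feature_2d, depth):
    """Replicate an anteroposterior map F[z][x] along y, giving V[z][y][x]."""
    return [[list(row) for _ in range(depth)] for row in feature_2d]

def expand_lateral(feature_2d, width):
    """Replicate a lateral map F[z][y] along x, giving V[z][y][x]."""
    return [[[value] * width for value in row] for row in feature_2d]

def fuse_average(vol_a, vol_b):
    """Average two volumes that already share the same (z, y, x) layout."""
    return [
        [[(a + b) / 2.0 for a, b in zip(row_a, row_b)]
         for row_a, row_b in zip(plane_a, plane_b)]
        for plane_a, plane_b in zip(vol_a, vol_b)
    ]

ap_features = [[1.0, 2.0], [3.0, 4.0]]        # toy F_ap[z][x]
lat_features = [[10.0, 20.0], [30.0, 40.0]]   # toy F_lat[z][y]
fused = fuse_average(expand_ap(ap_features, depth=2),
                     expand_lateral(lat_features, width=2))
```

Each fused voxel combines the one anteroposterior pixel and the one lateral pixel whose rays pass through it, which is why the two views must be aligned before averaging.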
All layers use instance normalization. ReLU activations were used in the generator and LeakyReLU (α = 0.2) in the discriminator, following the reference design. 23
The discriminator is a conditional 3D PatchGAN, composed of four convolutional layers (kernel = 4; strides = 2, 2, 2, 1). 21 It assesses local 3D realism conditioned on the two input projections, emphasizing anatomical consistency and texture fidelity rather than only global HU statistics.
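To make the patch-based behaviour concrete, the sketch below computes the per-axis output size and receptive field of four stacked convolutions with kernel 4 and strides 2, 2, 2, 1. The padding of 1 and the 128-voxel input size are assumptions; the text does not state the padding.

```python
def conv_output_size(size, kernel, stride, padding):
    """Standard output-size formula for a convolution along one axis."""
    return (size + 2 * padding - kernel) // stride + 1

def patch_discriminator_geometry(input_size, kernel=4, strides=(2, 2, 2, 1), padding=1):
    """Return (output size, receptive field) along one axis.
    padding=1 is an assumed value, not taken from the source."""
    size, receptive_field, jump = input_size, 1, 1
    for stride in strides:
        size = conv_output_size(size, kernel, stride, padding)
        receptive_field += (kernel - 1) * jump   # growth scaled by accumulated stride
        jump *= stride
    return size, receptive_field

out_size, receptive_field = patch_discriminator_geometry(128)
print(out_size, receptive_field)  # 15 46
```

Under these assumptions each discriminator output judges a local 46-voxel-wide cube and not the whole volume, which matches the emphasis on local realism described above.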
Training follows a least-squares conditional GAN formulation. The total objective combines adversarial, reconstruction, and projection-consistency losses:
The adversarial component enhances edge sharpness and perceptual realism, whereas the projection-consistency term enforces geometric alignment between generated and reference volumes.
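For orientation, a least-squares conditional GAN objective of this kind can be sketched as below. The notation is assumed for illustration: x denotes the pair of input projections, y the reference CT, G the generator, D the conditional discriminator, and the weights λ are placeholders whose values are not given in this text. The distance measures inside the reconstruction and projection terms are likewise left unspecified.

```latex
\begin{aligned}
\mathcal{L}_D &= \tfrac{1}{2}\,\mathbb{E}_{x,y}\bigl[(D(y \mid x) - 1)^2\bigr]
               + \tfrac{1}{2}\,\mathbb{E}_{x}\bigl[D(G(x) \mid x)^2\bigr], \\
\mathcal{L}_{\mathrm{adv}} &= \tfrac{1}{2}\,\mathbb{E}_{x}\bigl[(D(G(x) \mid x) - 1)^2\bigr], \\
\mathcal{L}_G &= \lambda_{\mathrm{adv}}\,\mathcal{L}_{\mathrm{adv}}
               + \lambda_{\mathrm{rec}}\,\mathcal{L}_{\mathrm{rec}}
               + \lambda_{\mathrm{proj}}\,\mathcal{L}_{\mathrm{proj}}.
\end{aligned}
```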
Experimental training conditions
Both models were trained under identical conditions to ensure a fair comparison. Training was performed for 170 epochs using the Adam optimizer (batch size = 1). Learning rates were set to 3 × 10⁻⁵ for the CNN–INR and for the GAN.
All experiments were implemented in Python 3.10.6
Preprocessing and DRR generation
In a first step, to ensure spatial consistency across datasets, each CT volume was rigidly aligned to a standardized cranial CT template described by Rorden et al., 24 which was generated from 35 high-resolution, pathology-free head CTs. Rigid coregistration was performed using an intensity-based mutual information algorithm optimized over six degrees of freedom (three translations, three rotations) to align each CT to the template while preserving cranial geometry. A multi-resolution Gaussian pyramid accelerated convergence and prevented local minima. Registration accuracy was visually verified using Difference-of-Gaussian edge-enhanced overlays, ensuring precise anatomical correspondence across subjects.
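As an illustration of the similarity measure that drives such a registration, the sketch below estimates mutual information from a joint intensity histogram of two equally sized voxel lists. The bin count, the intensity range, and the function name are assumptions; this is not the registration software used in the study, and a full pipeline would also include the six-parameter optimizer and the multi-resolution pyramid.

```python
import math
from collections import Counter

def mutual_information(fixed, moving, bins=8, lo=-1000.0, hi=1000.0):
    """Mutual information (in nats) between two aligned intensity lists.
    bins, lo, and hi are illustrative choices."""
    def to_bin(value):
        value = min(max(value, lo), hi)
        return min(int((value - lo) / (hi - lo) * bins), bins - 1)

    pairs = [(to_bin(a), to_bin(b)) for a, b in zip(fixed, moving)]
    n = len(pairs)
    joint = Counter(pairs)
    marginal_fixed = Counter(a for a, _ in pairs)
    marginal_moving = Counter(b for _, b in pairs)

    mi = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        mi += p_ab * math.log(p_ab / ((marginal_fixed[a] / n) * (marginal_moving[b] / n)))
    return mi

template = [-900.0, -900.0, 900.0, 900.0]                       # toy "air" and "bone" voxels
aligned = mutual_information(template, template)                # intensities co-occur perfectly
misaligned = mutual_information(template, [-900.0, 900.0, -900.0, 900.0])
```

An optimizer would search over the three translations and three rotations for the pose that maximizes this value.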
Following registration, all scans were resampled to isotropic 1 × 1 × 1 mm³ voxel spacing and intensity-clipped to the range [−1000, 1000] HU. The images were then cropped or padded to a matrix size of 256 × 256 × 256 voxels, yielding uniform dimensions across the entire dataset. A brain parenchyma window was applied with a window level of 40 HU and a window width of 40 HU, enhancing tissue contrast in preparation for synthetic data generation.
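The intensity steps described above can be sketched on flat lists of HU values. The rescaling of the windowed values to [0, 1], the air-valued padding of −1000 HU, and the centered crop are assumptions for illustration, as are all function names.

```python
def clip_hu(values, lo=-1000.0, hi=1000.0):
    """Clip HU values to [lo, hi]."""
    return [min(max(v, lo), hi) for v in values]

def apply_window(values, level, width):
    """Clip to [level - width/2, level + width/2] and rescale to [0, 1]
    (the rescaling is an assumed convention)."""
    lower, upper = level - width / 2.0, level + width / 2.0
    return [(min(max(v, lower), upper) - lower) / (upper - lower) for v in values]

def crop_or_pad_1d(values, target, fill=-1000.0):
    """Center-crop or symmetrically pad one axis to the target length.
    fill=-1000.0 (air) is an assumed padding value."""
    n = len(values)
    if n >= target:
        start = (n - target) // 2
        return list(values[start:start + target])
    before = (target - n) // 2
    return [fill] * before + list(values) + [fill] * (target - n - before)

clipped = clip_hu([-2000.0, 0.0, 3000.0])
windowed = apply_window([0.0, 40.0, 100.0], level=40.0, width=40.0)
padded = crop_or_pad_1d([1.0, 2.0], target=5, fill=0.0)
```

With a level of 40 HU and a width of 40 HU, everything below 20 HU maps to 0 and everything above 60 HU maps to 1.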
Because paired cranial X-rays and CTs are rarely available, DRRs were generated from the CTs to serve as model inputs. DRRs are widely used in synthetic imaging research because they provide standardized, geometry-consistent 2D projections that approximate true radiographic acquisition while avoiding patient- and scanner-specific variability. All DRRs were created using Plastimatch 25 with a fixed source-to-detector geometry that was kept consistent between training and evaluation. For each CT volume, two orthogonal projections, anteroposterior and lateral, were synthesized at a resolution of 128 × 128 pixels. For visualization purposes, DRRs were displayed with a window width of 80 HU and window level of 40 HU. After DRR generation, the CTs were also resampled to 128 × 128 × 128 voxel resolution.
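The principle behind a DRR can be sketched with a strongly simplified parallel-beam model: HU values are converted to attenuation coefficients, summed along the ray direction, and passed through the Beer-Lambert law. This is not how Plastimatch computes DRRs (it uses a perspective source-to-detector geometry), and the water attenuation value, the axis convention (z, y, x), and the function names are assumptions.

```python
import math

MU_WATER_PER_MM = 0.02  # assumed, illustrative linear attenuation of water (1/mm)

def hu_to_mu(hu):
    """Convert HU to a linear attenuation coefficient; air (-1000 HU) maps to 0."""
    return max(0.0, MU_WATER_PER_MM * (1.0 + hu / 1000.0))

def parallel_drr(volume, axis, spacing_mm=1.0):
    """Transmitted intensity for parallel rays through volume[z][y][x].
    axis='y' yields an anteroposterior image[z][x]; axis='x' a lateral image[z][y]."""
    nz, ny, nx = len(volume), len(volume[0]), len(volume[0][0])
    if axis == "y":
        sums = [[sum(hu_to_mu(volume[z][y][x]) for y in range(ny)) for x in range(nx)]
                for z in range(nz)]
    elif axis == "x":
        sums = [[sum(hu_to_mu(volume[z][y][x]) for x in range(nx)) for y in range(ny)]
                for z in range(nz)]
    else:
        raise ValueError("axis must be 'x' or 'y'")
    return [[math.exp(-s * spacing_mm) for s in row] for row in sums]

# Toy 1 x 2 x 3 volume (z, y, x): one row of air and one row of water.
volume = [[[-1000.0, -1000.0, -1000.0], [0.0, 0.0, 0.0]]]
ap_view = parallel_drr(volume, axis="y")
lateral_view = parallel_drr(volume, axis="x")
```

Because each pixel is a line integral, depth information along the ray is lost, which is what the reconstruction models have to infer from the second, orthogonal view.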
Dataset
Three cranial CT datasets from independent centers were used for model development and evaluation.
The Zurich dataset comprised 114 non-contrast cranial CT scans from adult patients who underwent ventriculoperitoneal shunt implantation or revision for hydrocephalus at the USZ between September 2020 and December 2022. Inclusion required the availability of postoperative CT and skull X-ray imaging acquired within 48 h after surgery.
The CQ-500 dataset included 71 retrospectively collected head CTs from multiple institutions in New Delhi, covering a range of pathologies including intraparenchymal, subdural, epidural, and subarachnoid hemorrhages as well as calvarial fractures. Duplicate or poor-quality scans were removed, leaving one representative scan per patient. 19
The external validation dataset consisted of 50 non-contrast cranial CTs from patients with normal-pressure hydrocephalus who underwent shunt placement at hospitals in the Bologna region (Emilia-Romagna, Italy) between 2016 and 2019.
Model training was performed on a total of 150 CT–DRR pairs from Zurich and CQ-500 combined, with 34 cases reserved for internal validation and all 50 Bologna cases used exclusively for external validation.
Validation and evaluation metrics
Model performance was evaluated on both the internal and external validation cohorts. Quantitative image similarity was assessed using the structural similarity index measure (SSIM)
SSIM measures perceptual similarity based on luminance, contrast, and structural components, ranging from −1 to 1, with 1 indicating perfect similarity.26,27 PSNR quantifies signal-to-noise ratio in decibels, where higher values correspond to better fidelity. 27 CS evaluates spatial similarity by computing the cosine of the angle between voxel-intensity vectors, ranging from −1 to 1.
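The three metrics can be sketched on flattened voxel lists as follows. The SSIM shown here is a single-window (global) variant with the commonly used constants k1 = 0.01 and k2 = 0.03; whether the study used a sliding-window implementation, and which data range it assumed, is not stated in this text, so these choices are assumptions.

```python
import math

def psnr(reference, generated, data_range):
    """Peak signal-to-noise ratio in dB; infinite for identical inputs."""
    mse = sum((r - g) ** 2 for r, g in zip(reference, generated)) / len(reference)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(data_range ** 2 / mse)

def cosine_similarity(a, b):
    """Cosine of the angle between two intensity vectors (undefined for all-zero input)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def global_ssim(x, y, data_range, k1=0.01, k2=0.03):
    """SSIM computed over one window covering the whole input (a simplification)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    var_x = sum((v - mean_x) ** 2 for v in x) / n
    var_y = sum((v - mean_y) ** 2 for v in y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    c1, c2 = (k1 * data_range) ** 2, (k2 * data_range) ** 2
    return ((2 * mean_x * mean_y + c1) * (2 * cov + c2)) / (
        (mean_x ** 2 + mean_y ** 2 + c1) * (var_x + var_y + c2))

reference = [0.0, 1.0]
generated = [0.0, 0.9]
example_psnr = psnr(reference, generated, data_range=1.0)
example_cs = cosine_similarity([1.0, 2.0], [2.0, 4.0])
example_ssim = global_ssim(reference, reference, data_range=1.0)
```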
Results are reported as mean ± 95% confidence interval (CI) and as median with interquartile range (IQR). Statistical comparisons between models were performed using the Wilcoxon rank-sum test, with significance defined as p < 0.05.
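A minimal sketch of these summary statistics and of a rank-sum comparison is given below. The study performed its statistics in R; this Python version is an illustration only. It assumes a normal-approximation CI (1.96 standard errors) and a normal-approximation rank-sum p-value without continuity or tie-variance correction, none of which is confirmed by the text. The input values are made-up toy numbers, not study data.

```python
import math
import statistics

def mean_ci95(values):
    """Mean and half-width of a normal-approximation 95% CI (assumed method)."""
    half_width = 1.96 * statistics.stdev(values) / math.sqrt(len(values))
    return statistics.mean(values), half_width

def median_iqr(values):
    """Median with first and third quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return statistics.median(values), q1, q3

def rank_sum_test(a, b):
    """Wilcoxon rank-sum (Mann-Whitney U) statistic and two-sided p-value
    via the normal approximation, using average ranks for ties."""
    combined = sorted((value, group) for group, data in enumerate((a, b)) for value in data)
    ranks = [0.0] * len(combined)
    i = 0
    while i < len(combined):
        j = i
        while j + 1 < len(combined) and combined[j + 1][0] == combined[i][0]:
            j += 1
        for k in range(i, j + 1):
            ranks[k] = (i + j) / 2.0 + 1.0   # average rank of the tied block
        i = j + 1
    n1, n2 = len(a), len(b)
    rank_sum_a = sum(r for r, (_, group) in zip(ranks, combined) if group == 0)
    u = rank_sum_a - n1 * (n1 + 1) / 2.0
    z = (u - n1 * n2 / 2.0) / math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    return u, math.erfc(abs(z) / math.sqrt(2.0))

mean_value, half_width = mean_ci95([1.0, 2.0, 3.0, 4.0, 5.0])       # toy values
u_stat, p_value = rank_sum_test([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])   # toy values
```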
Metric computation was performed in Python 3.10.6, and statistical analyses and visualizations were carried out in RStudio 2024.09.0+375.
Ethical considerations
Patient data were handled in accordance with the ethical standards outlined in the Declaration of Helsinki and its amendments. The use of this data received approval from the institutional review boards in Zurich (IRB, Cantonal Ethics Committee Zürich, BASEC 2023-00689). For data collected in Bologna, ethical approval was granted by the ethics committee of the greater area of Emilia-Romagna, Italy (No 94-2025-OSS-AUSLBO).
Results
Qualitative evaluation
Figure 2 illustrates representative sCTs from both models compared with the ground truth CT in axial, sagittal, and coronal planes from the internal validation dataset. Qualitatively, the GAN model produced reconstructions with higher visual realism than the CNN–INR model. However, the GAN model also generated several anatomically inaccurate structures and artifacts. Spatial fidelity cannot be fully assessed by visual inspection alone.

Qualitative Comparison of GAN- and CNN–INR–Generated Synthetic CTs for Cranial Reconstruction. Figure legend: Representative example of sCT volumes generated by the GAN and CNN–INR models. The far-left panels show the input DRRs from the lateral
Quantitative evaluation
Table 1 summarizes the quantitative evaluation metrics (SSIM, PSNR, and CS) for internal and external validation datasets. Results are reported as mean with 95% CIs and as median with IQR.
Performance of the Two Architectures on Generation of Synthetic Cranial Tomographic Imaging Using Three Different Quantitative Measurements on an Internal and External Validation Dataset
GAN, generative adversarial network; CNN, convolutional neural network; INR, implicit neural representation; SSIM, structural similarity index; PSNR, peak signal-to-noise ratio; CS, cosine similarity.
For the GAN model, internal validation showed a mean SSIM of 0.739 (±0.016), PSNR of 16.70 dB (±0.40), and CS of 0.826 (±0.011). External validation showed a mean SSIM of 0.730 (±0.012), PSNR of 15.61 dB (±0.27), and CS of 0.791 (±0.014).
For the CNN-INR model, internal validation reported a mean SSIM of 0.686 (±0.014), PSNR of 16.41 dB (±0.27), and CS of 0.834 (±0.010). External validation reported a mean SSIM of 0.673 (±0.012), PSNR of 15.23 dB (±0.16), and CS of 0.784 (±0.010).
Inter-model comparisons (Fig. 3) demonstrated statistically significant differences for SSIM in both internal and external validation (p < 0.001) and for PSNR in the external validation set (p = 0.007).

Quantitative Evaluation of GAN and CNN–INR Model Performance Across Validation Sets. Figure legend: Box plots illustrating the quantitative evaluation metrics for GAN and CNN–INR models on internal and external validation datasets.
Discussion
This study compared two fundamentally different deep learning frameworks, a GAN and a CNN–INR, for generating sCTs from biplanar radiographs. Both networks successfully reconstructed volumetric anatomy from limited two-dimensional projections, but their underlying learning principles resulted in distinct output characteristics and performance profiles.
Architectural determinants of performance
The GAN-based approach demonstrated consistently higher SSIM and PSNR values across both internal and external validations, suggesting superior perceptual and structural fidelity. This finding aligns with prior evidence showing that adversarial training improves high-frequency detail reproduction and texture realism in medical image synthesis tasks.15,28 In GANs, the discriminator provides a powerful perceptual prior, enforcing anatomical plausibility beyond pixelwise similarity and counteracting the smoothing predisposition of purely regression-based losses. This mechanism has been shown to enhance local edge sharpness and tissue interface delineation in reconstruction and super-resolution applications, particularly in MRI and CT synthesis.28,29
By contrast, the CNN–INR model produced spatially coherent but perceptually smoother sCTs, particularly in regions of low-contrast parenchyma. This outcome is consistent with the known spectral bias of implicit neural representations, which prioritize low-frequency components and struggle to reconstruct sharp transitions or discontinuous boundaries without explicit high-frequency conditioning.30,31 The coordinate-based formulation of the CNN–INR enforces global continuity but lacks the localized contextual learning of convolutional GANs. As demonstrated by Li et al., 32 standard INRs tend to converge toward overly smooth solutions unless augmented by explicit regularization, such as Laplacian Dirichlet energy terms or, in our case, a CNN extraction model, to counteract their implicit low-rank bias.14,32
Spatial fidelity represents a critical aspect of synthetic imaging, particularly given potential clinical applications such as neuronavigation, where accurate coregistration is essential. In this study, spatial fidelity was primarily assessed through SSIM and CS metrics, which yielded comparable and only moderately reliable results. Although the GAN produced images with higher visual realism, this does not necessarily imply superior spatial fidelity (see subsection Quantitative–qualitative correspondence). Qualitatively, no consistent advantage in spatial accuracy could be determined between the two models. The quantitative findings support this observation but should be interpreted cautiously, as current evaluation methods may not fully capture the geometric precision required for neuronavigation. At present, it cannot be conclusively stated that one architecture achieves better spatial fidelity than the other.
Quantitative–qualitative correspondence
Quantitative results reflected the architectural divergence only partially. Although the numerical metrics showed differing trends between the two models, the visual differences were far more pronounced and were not proportionally captured by the quantitative measures. Quantitative evaluation metrics remain essential in synthetic image generation but cannot be interpreted as absolute indicators of image quality, as they fail to fully capture the clinical relevance of the generated images. SSIM, for instance, while reflecting global structural alignment, is highly sensitive to uniform background intensity, which may artificially inflate its absolute values. These inconsistencies were also noted in a previous study of ours. 33
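The background effect on SSIM can be illustrated with a toy example: when identical, uniform background is appended to both signals, the mean of per-window SSIM scores rises even though the foreground disagreement is unchanged. The non-overlapping 1D windows, the constants, and the made-up signal values are simplifying assumptions, not the evaluation code of this study.

```python
def window_ssim(x, y, data_range=1.0, k1=0.01, k2=0.03):
    """SSIM for a single window (same formula as standard SSIM, no Gaussian weighting)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    var_x = sum((v - mean_x) ** 2 for v in x) / n
    var_y = sum((v - mean_y) ** 2 for v in y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    c1, c2 = (k1 * data_range) ** 2, (k2 * data_range) ** 2
    return ((2 * mean_x * mean_y + c1) * (2 * cov + c2)) / (
        (mean_x ** 2 + mean_y ** 2 + c1) * (var_x + var_y + c2))

def mean_windowed_ssim(x, y, window=4):
    """Average SSIM over non-overlapping windows (a simplification of sliding windows)."""
    scores = [window_ssim(x[i:i + window], y[i:i + window])
              for i in range(0, len(x) - window + 1, window)]
    return sum(scores) / len(scores)

foreground_ref = [0.2, 0.8, 0.4, 0.6]   # toy "anatomy" in the reference
foreground_gen = [0.8, 0.2, 0.6, 0.4]   # toy reconstruction that gets it wrong
background = [0.0] * 12                 # uniform background, identical in both

tight_score = mean_windowed_ssim(foreground_ref, foreground_gen)
padded_score = mean_windowed_ssim(foreground_ref + background,
                                  foreground_gen + background)
```

Every all-background window scores exactly 1, so the padded score is pulled upward; restricting the evaluation to a head mask is one way to reduce this effect.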
Model-specific limitations
This proof-of-concept study has several constraints. First, the dataset size was modest for data-driven deep learning models; limited scale and heterogeneity are known to hinder generalization, particularly for architectures that depend on rich sampling to capture multi-scale structure. 15 Second, the relatively low input resolution of DRRs reduces the recoverable high-frequency content, which disproportionately affects soft-tissue depiction where X-ray contrast is intrinsically limited; INR-style models are especially sensitive to sparse or low-information sampling and tend to regress toward smoother solutions without additional priors.30,32 Third, the evaluation relied on SSIM, PSNR, and CS; while standard, these metrics incompletely reflect perceptual realism and clinically relevant structures. Finally, only CT-derived DRRs were used as inputs. Although DRRs enable strict standardization, translation to real radiographs will introduce domain shift (acquisition geometry, scatter, noise) that typically degrades performance without domain adaptation or adversarial translation strategies common in medical GAN pipelines. 34
Future research
Several directions follow directly from these limitations. Increasing sample size and diversity should improve robustness for both models; this need is well recognized for medical generative models and downstream tasks. Higher-resolution inputs and/or more angular views can supply the high-frequency constraints that sparse biplanar inputs lack. INR-based architectures are particularly well-suited for multi-view conditioning and have demonstrated improved reconstruction quality when trained with three or four input projections compared to only two biplanar radiographs. The field remains highly dynamic, with ongoing advancements in architectural design and the integration of hybrid frameworks, such as the CNN–INR model in this study, as well as progressive deepening and optimization of existing network structures.14,31–33
Architecturally, hybrid models that integrate adversarial supervision (for perceptual sharpness and local texture fidelity) with implicit coordinate-based decoders (for continuous, resolution-independent volumes) merit systematic study; the literature indicates complementary strengths, GANs for perceptual realism and INRs for continuous field modeling, that could be combined within a single framework.28,30 Finally, evaluation should move beyond SSIM/PSNR/CS toward perceptual or task-oriented measures (e.g., radiologist scoring, detection/segmentation performance on sCT) that better track clinical fidelity.
Conclusion
This proof-of-concept study demonstrated the feasibility of generating sCT volumes from biplanar radiographs using GAN and CNN–INR architectures. The GAN model achieved higher perceptual realism and superior SSIM and PSNR values, while the CNN–INR model produced smoother yet spatially coherent reconstructions due to its low-frequency bias. Despite improved visual fidelity, GAN outputs showed anatomical inaccuracies, and spatial fidelity in both models could not be conclusively verified. Neither model achieved clinical-grade quality, but both show potential for refinement through larger datasets, multi-view inputs, and hybrid architectures to enhance future clinical applicability in neurosurgical imaging.
Data Availability Statement
Model architectures and analysis code can be made available upon reasonable request. Sharing of data that are not publicly available is restricted in accordance with the approvals of the relevant ethics committees.
Informed Consent
In accordance with institutional policies and approvals from the relevant ethics committees, the use of patient data for research was permitted under a waiver of individual informed consent due to the retrospective design and use of fully de-identified data. All data were anonymized at the source prior to analysis.
Ethical Statement
This retrospective study analyzed de-identified human imaging data in accordance with the Declaration of Helsinki and applicable institutional and national regulations. Ethical approval was obtained from the Cantonal Ethics Committee Zürich (BASEC 2023-00689) and the Emilia-Romagna Regional Ethics Committee (No. 94-2025-OSS-AUSLBO); use of the publicly available CQ-500 dataset complied with its data-use terms. All data were de-identified, and the requirement for individual informed consent was waived. No prospective interventions, randomization, animal experiments, unusual hazards, or patient-identifiable images were involved. The work is original and reported in line with accepted reporting standards. Public data are available under their respective licenses, and analysis code will be shared upon reasonable request, subject to ethical approval. All authors meet authorship criteria and approved the final article. No generative artificial intelligence tools were used to create, modify, or analyze data, figures, or results.
Authors’ Contributions
M.B.: Conceptualization, data curation, formal analysis, methodology, visualization, writing—original draft, writing—reviewing and editing. I.K.: Methodology, writing—original draft, writing—review and editing. O.Z.: Data curation, methodology, writing—review and editing. R.D.M.: Data curation, writing—review and editing. A.C.: Data curation, writing—review and editing. G.P.: Data curation, writing—review and editing. D.M.: Data curation, writing—review and editing. L.R.: Writing—review and editing. C.S.: Writing—review and editing. V.E.S.: Conceptualization, project administration, writing—original draft, writing—review and editing.
Footnotes
Author Disclosure Statement
The authors declare no conflicts of interest.
Funding Information
V.E.S. is supported by the Prof. Dr. Max Cloetta Foundation. The other authors declare that no funds, grants, or other support were received during the preparation of this article. No specific grant was provided to support this project.
