Enhancing intraoperative pedicle screw planning accuracy via diffusion-based synthetic CT

Abstract

Intraoperative cone-beam computed tomography (CBCT) is critical for pedicle screw planning; however, its image quality is frequently compromised by artifacts and low contrast, potentially leading to adverse clinical outcomes. To address these limitations, we propose the Spatiotemporal Adaptive Warm-Start Diffusion Model (STADW-M), a novel framework designed to generate high-quality synthetic CT (sCT) images from CBCT data, thereby enhancing surgical precision. STADW-M integrates an Artifact-Aware Adaptive Diffusion Module to mitigate localized artifact distributions and a Dually-Guided Structural Consistency Module to preserve anatomical integrity. Furthermore, we employ a CBCT Warm-Start strategy alongside composite loss functions to optimize textural fidelity and accelerate model convergence. Quantitative experiments demonstrated significant improvements over the original CBCT images: RMSE decreased from 890.1 to 152.9 HU, MAE decreased from 859.7 to 102.6 HU, and PSNR increased from 13.6 to 27.9 dB. Crucially, the generated sCTs maintained high anatomical consistency with reference CTs. In clinical validation, automated screw planning based on sCTs achieved a 100% Grade A standard, with 94.7% of screws placed without cortical breach and 5.3% exhibiting only minor (<2 mm) erosion. The proposed method effectively synthesizes high-quality CT images, preserving vertebral anatomy and significantly improving the accuracy and safety of intraoperative pedicle screw planning.

Keywords

CBCT-generated CT imaging; image quality improvement; denoising diffusion probabilistic model; pedicle screw path planning

Introduction

Different imaging modalities capture distinct information about tissues, thereby enabling a comprehensive evaluation of anatomical structures and functions in the human body. This, in turn, helps improve the accuracy and precision of subsequent diagnostic and therapeutic tasks.¹ Consequently, multi-modal imaging is widely employed to support diagnosis and treatment decisions, for example in image-guided surgery and radiation therapy.² Cone-beam computed tomography (CBCT) is instrumental in image-guided diagnostic and therapeutic systems in clinical practice, owing to its capacity for on-site, high-resolution volumetric imaging of patients undergoing therapy.² A primary application of CT- and CBCT-guided treatment is robot-assisted spinal surgery. Numerous studies indicate that surgery guided by CT or CBCT images enhances screw placement accuracy and reduces radiation exposure time.^3–8 CBCT is typically a necessary imaging tool for image-guided or robot-assisted spinal surgery, with screw paths planned by surgeons either directly on intraoperative CBCT or by registering preoperative CT images to intraoperative CBCT images. Nonetheless, the physical imaging properties of CBCT frequently lead to severe image artifacts, inadequate contrast, and compromised image quality.^9–12 Degraded CBCT images may increase uncertainty in screw planning, potentially causing improper screw placement and serious complications. Consequently, obtaining high-quality intraoperative CBCT images is crucial for reliable image-guided or robot-assisted spinal surgery.

Recently, numerous model-based methods have been developed and studied to improve the image quality of CBCT.^13–17 Alongside these conventional model-based methods for enhancing image quality, techniques employing Convolutional Neural Networks (CNN) have also been widely used to augment CBCT image quality and to convert CBCT into CT. UNET is an easily implemented, stable generative network for synthesizing CT from CBCT. Kida, Chen, Thummerer, and Yuan et al.^12,18–20 utilized a UNET-based network model and trained it with paired data to generate two-dimensional synthesized CT images of various anatomical regions. Recently, Generative Adversarial Networks (GAN) models have been extensively employed in image generation tasks. Owing to the adversarial loss, GAN network models and Cycle GAN network models exhibit remarkable realism in image synthesis, while eliminating the need for strictly paired data.^21–23 Using GAN networks trained with paired data, Barateau, Zhang, and Dahiya et al.^24–26 successfully generated synthesized CT images for the head, neck, pelvis, and other anatomical locations, thereby significantly enhancing the quality of CBCT images. Kida, Kurz, Lei, Liu, Gao, Xue, and Lemus et al.^27–33 improved upon the Cycle GAN network model and created synthesized CT images for various regions, such as the pelvis, abdomen, brain, chest, and nasopharynx, from CBCT images. Their methods substantially increased the realism and observability of the synthesized CT images. Despite the promising experimental results achieved by these methods, they still possess limitations. The UNET-based generative model, while easy to implement, necessitates strict paired data and suffers from anatomical misalignment issues, which negatively impact model accuracy and image authenticity. The GAN generative model exhibits unstable convergence, slow training, and weak retention of anatomical structures for unpaired data. 
In a similar vein, the Cycle GAN model also struggles with convergence, slow training, and challenges in deployment and implementation.

Serving as a recent alternative to the aforementioned methods, deep diffusion models have gained extensive attention in generative modeling within the field of computer vision.^34,35 Initiating with pure noise samples, diffusion models create image samples from the target distribution through iterative denoising. This denoising is executed via trained neural network architectures, aiming to maximize data correlation. Owing to the incremental random sampling process and explicit likelihood features, diffusion models can offer enhanced sample quality and diversity. Recognizing this potential, diffusion-based approaches have lately been employed in tasks such as single-modal image generation^36–40 and unconditional image generation.⁴¹ Since diffusion models commence with pure noise samples and produce image samples through successive noise removal, it becomes apparent that distinct images are generated for various noise initializations. Different structures and pixel intensities in CT images possess explicit physiological significance, and random initialization noise may lead to alterations in generated CT structures, thereby introducing erroneous information.

In this study, we propose a novel method for synthesizing CT images from CBCT data, referred to as STADW-M. This model employs a multitask framework that integrates modules for deformable image registration, dually-guided structural consistency, and artifact-aware adaptive diffusion to generate high-quality synthetic CT images. By incorporating efficient attention mechanisms and a Warm-Start initialization, STADW-M improves both the accuracy and computational efficiency of the image generation process. Moreover, the model effectively mitigates artifacts and preserves anatomical consistency.

Materials and method

Materials

Dataset 1 (utilized for training the STADW-M network): We obtained intraoperative CBCT and preoperative CT images for 44 patients undergoing lumbar pedicle screw implantation from Beijing Jishuitan Hospital. The CBCT images have a pixel size of 0.5 mm $\times$ 0.5 mm, an axial dimension of 256 $\times$ 256, and a slice thickness of 0.5 mm. The CT images feature an axial dimension of 512 $\times$ 512, with pixel sizes of 0.5 mm $\times$ 0.5 mm and a slice thickness of 0.8 mm. Due to the differences in acquisition times, positions, and pixel sizes between the intraoperative CBCT and preoperative CT images, we resampled the original CT images to match the CBCT resolution, specifically 0.5 mm $\times$ 0.5 mm $\times$ 0.5 mm. Subsequently, we employed an affine transformation to align the original CT images with the CBCT images.

Dataset 2 (assessing the utility of STADW-M): We obtained intraoperative CBCT images for 36 patients from Beijing Jishuitan Hospital. The CBCT images have a pixel size of 0.5 mm $\times$ 0.5 mm, an axial dimension of 256 $\times$ 256, and a slice thickness of 0.5 mm.

Denoising diffusion probabilistic models

Denoising Diffusion Probabilistic Models (DDPM) are a type of generative model that utilizes a Markov diffusion process with T time steps to establish a mutual mapping between real images and pure noise. In the forward direction, small amounts of Gaussian noise are repeatedly added to the real image $x_{0} \sim q (x_{0})$

\begin{matrix} x_{t} = \sqrt{1 - β_{t}} x_{t - 1} + ϵ \sqrt{β_{t}}, ϵ \sim N (0, I) \\ q (x_{t} ∣ x_{t - 1}) : = N (x_{t}; \sqrt{1 - β_{t}} x_{t - 1}, β_{t} I) \\ q (x_{1 : T} ∣ x_{0}) : = \prod_{t = 1}^{T} q (x_{t} ∣ x_{t - 1}) \end{matrix}

(1)

The sample $x_{t}$ is obtained by scaling the previous sample $x_{t - 1}$ by $\sqrt{1 - β_{t}}$ and adding i.i.d. Gaussian noise with variance $β_{t} I$, following the variance schedule $\{β_{t} \in (0, 1)\}_{t = 1}^{T}$.³⁵ One important feature of the forward process is that $x_{t}$ can be sampled in closed form at any time step $t$. To express this, we introduce the notations $α_{t} := 1 - β_{t}$ and ${\bar{α}}_{t} := \prod_{s = 1}^{t} α_{s}$ and obtain:

q (x_{t} | x_{0}) = N (x_{t}; \sqrt{{\bar{α}}_{t}} x_{0}, (1 - \bar{α_{t}}) I)

(2)

The reverse process $p_{θ} (x_{0 : T})$ is defined as a Markov chain that starts from $p (x_{T}) = N (x_{T}; 0, I)$:

\begin{matrix} p_{θ} (x_{0 : T}) : = p (x_{T}) \prod_{t = 1}^{T} p_{θ} (x_{t - 1} | x_{t}) \\ p_{θ} (x_{t - 1} | x_{t}) : = N (x_{t - 1}; μ_{θ} (x_{t}, t), Σ_{θ} (x_{t}, t)) \end{matrix}

(3)

Diffusion models generally represent each reverse diffusion step as a mapping by a neural network that provides estimates of $μ_{θ}$ and/or $Σ_{θ}$. The network is then trained by minimizing the variational bound on the negative log-likelihood:

L_{vb} = E_{q} [- \log p_{θ} (x_{0 : T}) + \log q (x_{1 : T} | x_{0})]

(4)

Here, $E_{q}$ denotes the expectation over $q$, and $p_{θ}$ is the neural network approximation of the reverse transition probability. The bound can be further decomposed as follows:

\begin{aligned} L_{vb} = E_{q} [ & - \log p_{θ} (x_{0} ∣ x_{1}) \\ & + \sum_{t = 2}^{T} KL (q (x_{t - 1} ∣ x_{t}, x_{0}) ‖ p_{θ} (x_{t - 1} ∣ x_{t})) \\ & + KL (q (x_{T} ∣ x_{0}) ‖ p (x_{T}))] \end{aligned}

(5)

Here, KL refers to Kullback-Leibler divergence. Ho et al.³⁵ recommend predicting the cumulative noise $ϵ_{θ}$ added to the current intermediate image $x_{t}$ as the optimal method for parameterizing the model. Therefore, we obtain the following parameterization for the predicted mean value $μ_{θ} (x_{t}, t)$ :

μ_{θ} (x_{t}, t) = \frac{1}{\sqrt{α_{t}}} (x_{t} - \frac{β_{t}}{\sqrt{1 - \bar{α_{t}}}} ϵ_{θ} (x_{t}, t))

(6)

Consequently, the loss function has been redefined in a simpler form^35,42–45 as follows, where $c$ is the conditional information:

L_{DDPM} = E_{q} [‖ ϵ - ϵ_{θ} (x_{t}, t, c) ‖^{2}]

(7)

For all experiments, the diffusion process uses $T = 200$ steps with a linear noise schedule $β_{t} \in [1 \times 10^{- 4}, 2 \times 10^{- 2}]$ .
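The forward process of equations (1) and (2) with the stated schedule can be sketched in a few lines of NumPy. This is only an illustration: the source does not say whether the linear schedule includes both endpoints, so `np.linspace` with inclusive endpoints is an assumption, and the toy volume and the name `q_sample` are hypothetical.

```python
import numpy as np

T = 200
betas = np.linspace(1e-4, 2e-2, T)   # linear schedule beta_1 .. beta_T (inclusive endpoints assumed)
alphas = 1.0 - betas                 # alpha_t = 1 - beta_t
alpha_bars = np.cumprod(alphas)      # alpha_bar_t = prod_{s<=t} alpha_s


def q_sample(x0, t, rng):
    """Draw x_t ~ q(x_t | x_0) in closed form, equation (2); t is 1-indexed."""
    a_bar = alpha_bars[t - 1]
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps
    return x_t, eps


rng = np.random.default_rng(0)
# Toy stand-in for a CT volume, intensities rescaled to [-1, 1] (assumption).
x0 = rng.uniform(-1.0, 1.0, size=(8, 8, 8))
x_T, eps = q_sample(x0, T, rng)
print(f"alpha_bar_T = {alpha_bars[-1]:.3f}")
```

Under these assumptions $\bar{α}_{T}$ comes out at roughly 0.13 rather than near zero, so with $T = 200$ the terminal state still retains a visible fraction of the signal.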

Generating sCT from CBCT using spatiotemporal adaptive warm-start diffusion models

Model structure

The Spatiotemporal Adaptive Warm-Start Diffusion Model (STADW-M) is a multitask model comprising three task-specific modules: the Deformable Image Registration (REG) module, the Dually-Guided Structural Consistency Module (DGSC), and the Artifact-Aware Adaptive Diffusion Module (ADM). The REG module mitigates the impact of misalignment on the results. The DGSC module combines a dual-domain guidance mechanism, incorporating spatial and Fourier domain regularization, to provide precise anatomical structure constraints. This ensures global structural integrity and low-frequency information consistency in the generated images, effectively mitigating registration errors. Meanwhile, the ADM employs an artifact-aware adaptive diffusion process that dynamically adjusts the trajectories of forward noising and reverse denoising based on the spatial variation characteristics of artifacts in CBCT images, thereby focusing the model’s attention on the regions most in need of restoration.

In STADW-M, we adopt a U-Net⁴⁶ + Self-Attention (SA)⁴⁷ network architecture. This structure comprises four downsampling modules as the encoder and four upsampling modules as the decoder. Each downsampling module includes a 3D convolution module, a max-pooling module, and an SA module, while each upsampling module is composed of a 3D convolution module, an upsampling layer, and an SA module. For an input of size $N$, the computational complexity of self-attention is approximately $O (N^{2})$, which can become extremely costly for the high-resolution 3D task of generating sCT from CBCT. Inspired by the Interlaced Sparse Self-Attention (ISSA) and Axial Attention (ASA) mechanisms,^48,49 we modify the original SA structure to obtain a more concise and efficient SA scheme (as illustrated in Figure 1(b)). Specifically, we combine the ISSA and ASA modules by replacing the SA structure in the ISSA module with the more concise ASA structure. To allow attention to propagate globally, we perform ASA operations along the axial, sagittal, and coronal directions.
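To make the cost argument concrete, the following NumPy sketch applies self-attention along one spatial axis at a time and then chains the three axes. It is not the authors' ISSA+ASA block: learned query/key/value projections, multi-head splitting, and the interlaced sparse grouping are all omitted, and the function names are hypothetical. Which array axis corresponds to the axial, sagittal, or coronal direction is likewise an assumption.

```python
import numpy as np


def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def axial_self_attention(feat, axis):
    """Self-attention restricted to one spatial axis of a (D, H, W, C) feature volume.

    Q = K = V = feat here; learned projections are omitted (simplifying assumption).
    """
    x = np.moveaxis(feat, axis, -2)                               # (..., L, C)
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(x.shape[-1])    # (..., L, L)
    out = softmax(scores, axis=-1) @ x                            # (..., L, C)
    return np.moveaxis(out, -2, axis)


def tri_axial_attention(feat):
    """Apply axial attention along the three spatial axes in turn, with residuals."""
    for axis in (0, 1, 2):
        feat = feat + axial_self_attention(feat, axis)
    return feat


feat = np.random.default_rng(0).standard_normal((4, 5, 6, 3))
out = tri_axial_attention(feat)
```

For a volume with $N = D \cdot H \cdot W$ voxels, each pass stores $N \cdot L$ attention scores ($L$ being the length of the attended axis) instead of the $N^{2}$ scores of full self-attention, while three chained passes still let every voxel influence every other one.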

Figure 1.

The overall algorithm flow: (a) the model framework, with an Artifact Characterization Module (ACM) for estimating artifact severity, a Dually-Guided Structural Consistency Module (DGSC) for providing spatial and frequency-domain guidance, and an Artifact-Aware Adaptive Diffusion Module (ADM) for progressive denoising and sCT generation; and (b) the self-attention structure, which combines Interlaced Sparse Self-Attention (ISSA) and Axial Attention (ASA) to reduce computational complexity.

Artifact-aware adaptive diffusion module

To focus the model’s attention on the regions most in need of restoration, we introduce an Artifact Characterization Module (ACM) within the ADM. This module employs a lightweight CNN $f_{ACM}$ that takes the low-quality CBCT image $x_{CBCT}$ as input and outputs an Artifact Map $M_{d}$:

\begin{matrix} M_{d} = f_{A C M} (x_{C B C T}), M_{d} \in R^{h \times w \times d} \end{matrix}

(8)

The value of each voxel in $M_{d}$ represents the severity of the artifact at the corresponding spatial location, where $h$, $w$, and $d$ denote the image height, width, and depth, respectively.

We then use the Artifact Map $M_{d}$ to modulate the forward noising process so that artifact-ridden regions are corrupted more rapidly. Specifically, we replace the global noise schedule $β_{t}$ with a spatially variant schedule $β_{t}^{'} (p)$, where $p$ is the spatial coordinate:

β_{t}^{'} (p) = β_{t} (1 + γ M_{d} (p))

(9)

where $γ = 0.7$ is a hyper-parameter that controls the strength of the adaptation. This results in a spatially variant ${\bar{α}}_{t}^{'} (p)$, with ${\bar{α}}_{t}^{'} (p) = \prod_{s = 1}^{t} (1 - β_{s}^{'} (p))$. During training, the noising step is modified to:

x_{t} (p) = \sqrt{{\bar{α}}_{t}^{'} (p)} x_{C T} (p) + \sqrt{1 - {\bar{α}}_{t}^{'} (p)} ϵ (p)

(10)

where $x_{CT}$ denotes the clean target volume (corresponding to $x_{0}$ in the forward diffusion). In the reverse process, the denoising network $ϵ_{θ}$ takes the Artifact Map $M_{d}$ as an additional input to perceive the local artifact severity:

ϵ_{p r e d} = ϵ_{θ} (x_{t}, t, c, M_{d})

(11)
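The spatially variant noising of equations (9) and (10) can be sketched as follows. The Artifact Map here is a hand-made array rather than the output of $f_{ACM}$, and its values are assumed to lie in $[0, 1]$ so that $β_{t}^{'} (p)$ stays below 1; the schedule endpoints follow the values stated earlier for $T = 200$, and the function names are hypothetical.

```python
import numpy as np

T, gamma = 200, 0.7
betas = np.linspace(1e-4, 2e-2, T)   # global linear schedule (inclusive endpoints assumed)


def adaptive_alpha_bar(artifact_map, t):
    """alpha_bar'_t(p) = prod_{s<=t} (1 - beta_s * (1 + gamma * M_d(p))); t is 1-indexed."""
    scale = 1.0 + gamma * artifact_map                      # shape (h, w, d)
    beta_prime = betas[:t].reshape(-1, 1, 1, 1) * scale     # shape (t, h, w, d), equation (9)
    return np.prod(1.0 - beta_prime, axis=0)


def adaptive_q_sample(x_ct, artifact_map, t, rng):
    """Noise the clean CT with the voxel-wise schedule, equation (10)."""
    a_bar = adaptive_alpha_bar(artifact_map, t)
    eps = rng.standard_normal(x_ct.shape)
    return np.sqrt(a_bar) * x_ct + np.sqrt(1.0 - a_bar) * eps


rng = np.random.default_rng(0)
x_ct = rng.uniform(-1.0, 1.0, size=(4, 4, 4))   # toy clean volume
M_d = np.zeros((4, 4, 4))
M_d[0] = 1.0                                    # hypothetical map: first slab is artifact-heavy
a_bar = adaptive_alpha_bar(M_d, 100)
x_t = adaptive_q_sample(x_ct, M_d, 100, rng)
```

Voxels with a larger map value end up with a smaller $\bar{α}_{t}^{'}$ at the same step, which is the sense in which artifact-heavy regions are corrupted faster.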

Dually-guided structural consistency module

To ensure that the generated sCT remains anatomically consistent with the patient, we introduce a dual-domain guidance mechanism spanning both the spatial and frequency domains. First, we employ a 3D U-Net ($f_{ϕ}$), whose architecture is the same as that of the denoising network in the ADM, to provide fine-grained anatomical structure constraints. This module takes the low-quality CBCT as input and outputs a normalized weight feature map, $W_{feat}$, which is used to explicitly enhance high-frequency fine structures, such as the bone cortex:

\begin{matrix} W_{r a w} = f_{ϕ} (x_{C B C T}) \\ W_{f e a t} = σ (γ_{ϕ} \cdot BN (W_{r a w}) + β_{ϕ}) \in (0, 1) \end{matrix}

(12)

where $σ$ denotes the sigmoid activation function, BN refers to instance normalization over voxels, and $γ_{ϕ} = 0.5$ and $β_{ϕ} = 0.5$ are trainable scale and shift parameters.
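The normalization step in equation (12) can be illustrated with a small NumPy function. The raw map below is random data standing in for the U-Net output $f_{ϕ} (x_{CBCT})$, the stabilizing constant `eps` is an assumption, and the scale and shift are held at the stated value 0.5 rather than trained.

```python
import numpy as np


def weight_feature_map(w_raw, gamma_phi=0.5, beta_phi=0.5, eps=1e-5):
    """Instance-normalize over voxels, apply scale/shift, then squash with a sigmoid."""
    normed = (w_raw - w_raw.mean()) / np.sqrt(w_raw.var() + eps)
    z = gamma_phi * normed + beta_phi
    return 1.0 / (1.0 + np.exp(-z))


# Random stand-in for the raw U-Net output (hypothetical data).
w_raw = np.random.default_rng(0).standard_normal((4, 4, 4)) * 300.0
w_feat = weight_feature_map(w_raw)
```

Because the sigmoid is monotone, the ordering of voxels in the raw map is preserved while the values are confined to $(0, 1)$.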

To mitigate minor misregistration between the reference CT and the CBCT, we introduce a frequency-domain guidance term. In the Fourier domain, low-frequency components encode the global, macro-structural content of an image. At each sampling step $t$, we first estimate the clean CT image ${\hat{x}}_{0}$ corresponding to that step. We then compute the three-dimensional fast Fourier transforms (3D FFTs) of ${\hat{x}}_{0}$ and the reference $x_{ref}$, and define a guidance term that penalizes discrepancies confined to the low-frequency bands:

L_{f r e q} = ‖ M a s k_{l o w} (F ({\hat{x}}_{0})) - M a s k_{l o w} (F (x_{r e f})) ‖_{1}

(13)

where $F$ denotes the FFT operator and $Mask_{low}$ is a low-pass filter. During sampling, we steer the denoising update using the gradient $\nabla_{x_{t}} L_{freq}$, thereby enforcing global structural consistency.
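A minimal NumPy sketch of the low-frequency term in equation (13) is given below. The shape of the low-pass mask is not specified in the text, so a cube of half-width `radius` around the DC component is assumed; `low_pass_mask` and `freq_loss` are hypothetical names. The guidance gradient $\nabla_{x_{t}} L_{freq}$ would require an automatic-differentiation framework and is not shown.

```python
import numpy as np


def low_pass_mask(shape, radius):
    """Boolean mask keeping frequencies within `radius` bins of DC on every axis.

    Assumes a centred spectrum (after fftshift), where DC sits at index n // 2.
    """
    grids = np.meshgrid(*[np.arange(n) - n // 2 for n in shape], indexing="ij")
    return np.all([np.abs(g) <= radius for g in grids], axis=0)


def freq_loss(x0_hat, x_ref, radius=2):
    """L1 distance between the low-frequency 3D spectra of two volumes."""
    mask = low_pass_mask(x0_hat.shape, radius)
    f_hat = np.fft.fftshift(np.fft.fftn(x0_hat))
    f_ref = np.fft.fftshift(np.fft.fftn(x_ref))
    return np.abs(mask * (f_hat - f_ref)).sum()


rng = np.random.default_rng(0)
x_ref = rng.standard_normal((8, 8, 8))
i, j, k = np.indices(x_ref.shape)
checker = (-1.0) ** (i + j + k)               # pure highest-frequency pattern
loss_high = freq_loss(x_ref + checker, x_ref)  # high-frequency change only
loss_low = freq_loss(x_ref + 1.0, x_ref)       # constant offset changes the DC term
```

In this toy case the checkerboard perturbation lies entirely outside the mask and leaves the loss at numerical zero, whereas a constant intensity offset is penalized, which matches the intent of constraining only global, low-frequency structure.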

Loss function

We optimize sCT synthesis with a composite objective:

\begin{aligned} L_{total} = {} & λ_{ddpm} L_{DDPM} + λ_{perc} L_{perc} \\ & + λ_{eagle} L_{EAGLE} + λ_{cyc} L_{cyc} \end{aligned}

(14)

Here, $L_{DDPM}$ is the standard diffusion loss, while the remaining terms enforce volumetric perceptual coherence, edge sharpness, and cross-domain robustness, respectively; $λ_{ddpm}, λ_{perc}, λ_{eagle}, λ_{cyc}$ are scalar hyper-parameters tuned on the validation set and fixed to $1.0, 0.10, 0.10, 0.05$ in all experiments. Let ${\hat{x}}_{CT}$ denote the synthesized CT and $x_{CT}$ the ground-truth CT. To preserve 3D anatomical consistency, the perceptual loss operates on orthogonal slice stacks. For each anatomical view $v \in \{\mathrm{ax}, \mathrm{sag}, \mathrm{cor}\}$, let $S_{v} (\cdot)_{i}$ extract the $i$-th 2D slice within view $v$. With a pretrained 2D VGG feature extractor $ϕ$, intermediate activations $ϕ_{ℓ} (\cdot)$, and nonnegative layer weights $η_{ℓ}$, the 3D perceptual loss is:

\begin{aligned} L_{perc} = \sum_{v \in \{\mathrm{ax}, \mathrm{sag}, \mathrm{cor}\}} E [ & \sum_{ℓ \in L} η_{ℓ} \\ & \cdot ‖ ϕ_{ℓ} (S_{v} ({\hat{x}}_{CT})_{i}) - ϕ_{ℓ} (S_{v} (x_{CT})_{i}) ‖_{1}] \end{aligned}

(15)

To enhance edge fidelity, we introduce the edge-aware gradient-spectrum loss $L_{EAGLE}$. Given an image $x$, we compute its gradient map $G (x)$, partition $G (x)$ into non-overlapping $n \times n$ patches with the operator $U \{G (x), n\}$, and compute the per-patch variance to obtain a variance map $V (x) = Var (U \{G (x), n\})$. After applying a 3D high-pass filter $HPF (\cdot)$ followed by a 3D FFT $F (\cdot)$, the loss compares the magnitude spectra:

\begin{aligned} L_{EAGLE} = ‖ & | F (HPF (V ({\hat{x}}_{CT}))) | \\ & - | F (HPF (V (x_{CT}))) | ‖_{1} \end{aligned}

(16)

Finally, to ensure the robustness of the mapping between the CBCT and CT domains, we incorporate a cycle consistency constraint. STADW-M first generates the sCT $x_{CT}^{'}$ from the CBCT $x_{CBCT}$. This $x_{CT}^{'}$ is then mapped back to the CBCT domain via a learnable degradation model $G_{deg}$, yielding $x_{CBCT}^{″} = G_{deg} (x_{CT}^{'})$. The loss minimizes the distance between the reconstructed CBCT and the original:

L_{c y c} = ‖ x_{C B C T} - x_{C B C T}^{″} ‖_{1}

(17)

Together, these terms complement the diffusion objective by aligning high-level appearance across orthogonal views, sharpening structural boundaries in the frequency domain of gradient statistics, and regularizing the bidirectional mapping between CBCT and CT domains.

Generation of sCT from CBCT via warm-start

DDPM usually starts with a sample from $N (x_{T}; 0, I)$ and gradually transforms Gaussian noise into images through a series of reverse diffusion processes. It is evident that different noise initializations lead to distinct images, and for CT images, varying structures and pixel intensities have specific physiological meanings. Random initialization noise may cause variations in generated CT structures, resulting in erroneous information. To address this issue, we adopt a CBCT Warm-Start:

x_{T}^{(cbct)} \sim q_{ws} (x_{T} ∣ x_{CBCT}) = N (\sqrt{{\bar{α}}_{T}} x_{CBCT}, (1 - {\bar{α}}_{T}) I)

(18)

while keeping the reverse dynamics unchanged, i.e., the same learned conditional kernel $p_{θ} (x_{t - 1} ∣ x_{t}, c)$. Because the reverse kernel is fixed by training, changing only the initial law $p (x_{T})$ leaves the target conditional distribution unaltered; the effect is confined to numerical transients such as convergence speed and sampling variance.
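A minimal sketch of drawing the warm-start initial state is shown below. It follows the closed form of equation (2), that is, the CBCT is scaled by $\sqrt{{\bar{α}}_{T}}$ and Gaussian noise with variance $1 - {\bar{α}}_{T}$ is added. The toy CBCT volume, its intensity scaling, and the function name `warm_start` are assumptions; the reverse sampler itself is not shown.

```python
import numpy as np

T = 200
betas = np.linspace(1e-4, 2e-2, T)        # linear schedule (inclusive endpoints assumed)
alpha_bar_T = np.prod(1.0 - betas)        # terminal cumulative product


def warm_start(x_cbct, alpha_bar_T, rng):
    """Initial reverse-process state centred on the scaled CBCT instead of on zero."""
    eps = rng.standard_normal(x_cbct.shape)
    return np.sqrt(alpha_bar_T) * x_cbct + np.sqrt(1.0 - alpha_bar_T) * eps


rng = np.random.default_rng(0)
x_cbct = rng.uniform(-1.0, 1.0, size=(8, 8, 8))   # toy CBCT, rescaled to [-1, 1] (assumption)
x_T_ws = warm_start(x_cbct, alpha_bar_T, rng)     # replaces x_T ~ N(0, I)
```

The reverse loop would then start from `x_T_ws` and apply the trained kernel unchanged, which is the only difference from standard sampling.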

To quantify the benefit of the Warm-Start, let the true terminal random variable be:

x_{T}^{(*)} = \sqrt{{\bar{α}}_{T}} x_{CT} + \sqrt{1 - {\bar{α}}_{T}} ϵ, ϵ \sim N (0, I)

(19)

For the Warm-Start initial state $x_{T}^{(cbct)}$, assuming it shares the same noise realization $ϵ$, the expected squared error with respect to the true terminal state is:

E [‖ x_{T}^{(cbct)} - x_{T}^{(*)} ‖^{2}] = {\bar{α}}_{T} ‖ x_{CBCT} - x_{CT} ‖^{2}

(20)

In contrast, the error associated with the standard initial point $x_{T}^{(0)} \sim N (0, I)$ is:

E [‖ x_{T}^{(0)} - x_{T}^{(*)} ‖^{2}] = D + {\bar{α}}_{T} ‖ x_{CT} ‖^{2} + D (1 - {\bar{α}}_{T})

(21)

Under clinically reasonable preprocessing, steps such as image registration typically ensure that the misalignment error $‖ x_{CBCT} - x_{CT} ‖^{2}$ is significantly smaller than $‖ x_{CT} ‖^{2} + 2 D / {\bar{α}}_{T}$, where $D$ denotes the number of voxels (dimensionality) of the CBCT/CT image. The warm-start error in equation (20) is then smaller than the standard-initialization error in equation (21).

An equivalent perspective is given by Gaussian KL divergences. Denote the true terminal law by $q^{*} (x_{T}) = N (\sqrt{{\bar{α}}_{T}} x_{CT}, (1 - {\bar{α}}_{T}) I)$. The KL divergence from $q^{*}$ to the warm-start law $p_{ws} = N (\sqrt{{\bar{α}}_{T}} x_{CBCT}, (1 - {\bar{α}}_{T}) I)$ has the closed form:

KL (q^{*} ∥ p_{ws}) = \frac{{\bar{α}}_{T}}{2 (1 - {\bar{α}}_{T})} ‖ x_{CT} - x_{CBCT} ‖^{2},

(22)

whereas the KL divergence to the standard prior $p_{0} = N (0, I)$ is:

\begin{aligned} KL (q^{*} ∥ p_{0}) = \frac{1}{2} ( & D (1 - {\bar{α}}_{T}) + {\bar{α}}_{T} ‖ x_{CT} ‖^{2} \\ & - D - D \log (1 - {\bar{α}}_{T})) . \end{aligned}

(23)

Similarly, $KL (q^{*} ∥ p_{ws})$ is typically much smaller than $KL (q^{*} ∥ p_{0})$. The warm start therefore reduces both the initial error and the initial KL divergence significantly, implying fewer reverse steps and improved numerical stability without altering the learned reverse dynamics or the target conditional distribution.
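The two KL divergences can be compared numerically on toy data. The sketch below uses the standard closed forms for Gaussians with isotropic covariance and assumes the mean scaling $\sqrt{{\bar{α}}_{T}}$ of equation (2), under which the warm-start prefactor is ${\bar{α}}_{T} / (2 (1 - {\bar{α}}_{T}))$. The volumes, the misalignment level, and the value of ${\bar{α}}_{T}$ are illustrative assumptions, not measurements from the study.

```python
import numpy as np


def kl_warm_start(x_ct, x_cbct, a_bar):
    """KL between two Gaussians with equal covariance (1 - a_bar) I and means
    sqrt(a_bar) * x_ct and sqrt(a_bar) * x_cbct."""
    return a_bar * np.sum((x_ct - x_cbct) ** 2) / (2.0 * (1.0 - a_bar))


def kl_standard(x_ct, a_bar):
    """KL from N(sqrt(a_bar) * x_ct, (1 - a_bar) I) to the standard normal N(0, I)."""
    d = x_ct.size
    var = 1.0 - a_bar
    return 0.5 * (d * var + a_bar * np.sum(x_ct ** 2) - d - d * np.log(var))


rng = np.random.default_rng(0)
x_ct = rng.uniform(-1.0, 1.0, size=(16, 16, 16))          # toy CT in [-1, 1]
x_cbct = x_ct + 0.1 * rng.standard_normal(x_ct.shape)     # hypothetical, mildly degraded CBCT
a_bar_T = 0.13                                            # illustrative terminal alpha_bar

kl_ws = kl_warm_start(x_ct, x_cbct, a_bar_T)
kl_0 = kl_standard(x_ct, a_bar_T)
print(kl_ws < kl_0)
```

How large the gap is in practice depends on the intensity normalization and on the residual CT-CBCT mismatch, so the toy numbers should not be read as representative.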

Evaluation criteria

In order to evaluate the performance of different synthetic image generation models in terms of pixel-level Hounsfield Units (HU) accuracy, noise level, and structural similarity, we utilize four evaluation metrics, namely Mean Absolute Error (MAE), Root Mean Square Error (RMSE), Peak Signal-to-Noise Ratio (PSNR), and Structural Similarity Index (SSIM). Given three images, $x_{s C T}$ , $x_{C B C T}$ , and $x_{C T}$ , each consisting of $M$ pixels, we define and provide comprehensive formulas for the aforementioned metrics.

MAE:

M A E = \frac{1}{M} \sum | x_{s C T} - x_{C T} |

(24)

RMSE:

R M S E = \sqrt{\frac{1}{M} \sum {| x_{s C T} - x_{C T} |}^{2}}

(25)

PSNR:

\begin{aligned} PSNR = {} & 20 \cdot \log_{10} MAX (x_{CT}) \\ & - 10 \cdot \log_{10} (\frac{1}{M} \sum {| x_{sCT} - x_{CT} |}^{2}) \end{aligned}

(26)

SSIM:

S S I M = \frac{(2 μ_{sCT} μ_{CT} + c_{1}) (2 σ_{sCT, CT} + c_{2})}{(μ_{sCT}^{2} + μ_{CT}^{2} + c_{1}) (σ_{sCT}^{2} + σ_{CT}^{2} + c_{2})}

(27)

Here, $μ$ denotes the image mean, $σ$ the image standard deviation, and $σ_{sCT, CT}$ the covariance between the two images; $c_{1} = (0.01 L)^{2}$ and $c_{2} = (0.03 L)^{2}$, where $L$ is the image’s dynamic range.
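The four metrics in equations (24) to (27) can be written directly in NumPy, as sketched below. Two simplifications are assumed: SSIM is computed once over the whole volume rather than over local windows (the text does not state which variant was used), and the reference is assumed to have a positive maximum so that the PSNR logarithm is defined. The test volumes are synthetic.

```python
import numpy as np


def mae(x, ref):
    return np.mean(np.abs(x - ref))


def rmse(x, ref):
    return np.sqrt(np.mean((x - ref) ** 2))


def psnr(x, ref):
    # Assumes ref.max() > 0; HU data may need shifting before this is meaningful.
    return 20.0 * np.log10(ref.max()) - 10.0 * np.log10(np.mean((x - ref) ** 2))


def ssim_global(x, ref, dynamic_range):
    """Single-window SSIM over the whole volume (simplifying assumption)."""
    c1 = (0.01 * dynamic_range) ** 2
    c2 = (0.03 * dynamic_range) ** 2
    mu_x, mu_r = x.mean(), ref.mean()
    cov = np.mean((x - mu_x) * (ref - mu_r))
    num = (2.0 * mu_x * mu_r + c1) * (2.0 * cov + c2)
    den = (mu_x ** 2 + mu_r ** 2 + c1) * (x.var() + ref.var() + c2)
    return num / den


ref = np.linspace(0.0, 1000.0, 1000).reshape(10, 10, 10)   # synthetic reference volume
sct = ref + 10.0                                           # constant 10-unit offset
print(mae(sct, ref), rmse(sct, ref), psnr(sct, ref), ssim_global(sct, ref, 1000.0))
```

For the constant 10-unit offset used here, MAE and RMSE both equal 10 and the PSNR is $20 \log_{10} 1000 - 10 \log_{10} 100 = 40$ dB, which gives a quick sanity check of the implementation.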

Implementation details

During the experimental phase, the proposed model was trained on Dataset 1, using 80% of the data for training, 10% for testing, and the remaining 10% for validation. The Adam optimizer was used to train all networks with a learning rate of $10^{- 4}$ for 500 epochs. All experiments were conducted on an HP workstation with dual NVIDIA RTX 8000 GPUs and an Intel Xeon Gold 6240 CPU. The proposed model was implemented in PyTorch V1.11.0+cu113, while all other programs were written in Python V3.7.

We established various algorithmic combinations to evaluate the impact of individual components on the resultant outputs: (1) DDPM+ADM+DGSC+WS (STADW-M); (2) DDPM+DGSC+WS (STDW-M); (3) DDPM+ADM+WS (STAW-M); (4) DDPM+ADM+DGSC (STAD-M).

We additionally compared the performance of medical image generation approaches based on CycleGAN improved by Dong et al.,⁵⁰ Conditional diffusion model,⁵¹ EGDiff,⁵² and our proposed method in the task of generating sCT from CBCT.

It is important to note that the CBCT volumes in this study are acquired and reconstructed using a vertebra-centered small-FOV protocol. Consequently, the native reconstructed CBCT field of view (FOV) inherently covers only the anatomical region around the vertebrae and contains no background air voxels. Therefore, evaluation on the native CBCT FOV is effectively equivalent to evaluation within an anatomical/body mask, and does not rely on any post-hoc truncation or ROI selection that could bias the quantitative metrics. For clarity, all metrics reported in this work are computed over the entire native CBCT FOV, i.e., the complete valid reconstructed anatomical volume. All images were visualized using the same CT window width (WW) $=$ 2000 and window level (WL) $=$ 0.

Results

Results of CT-CBCT registration

To achieve more accurate alignment between CT and CBCT, we adopted an unsupervised deformable fine-registration approach to mitigate the adverse impact of residual misregistration on downstream results. According to the quantitative evaluation, the mean surface distance (MSD) of the finely registered CT (rCT) in the sagittal, coronal, and axial planes was 1.61, 1.27, and 1.19 mm, respectively, while the average surface distance (ASD) was reduced to 0.75, 0.43, and 0.69 mm, respectively. In addition, the Dice similarity coefficient (DSC) was 0.93, 0.93, and 0.95, and the Jaccard similarity coefficient (JSC) was 0.87, 0.87, and 0.90 for the three planes, respectively (Table 1). Collectively, these objective metrics demonstrate that the proposed unsupervised deformable fine-registration method achieves high-precision registration. As shown in Figure 2, after unsupervised deformable fine registration, rCT and CBCT exhibit improved fine-scale correspondence, and the structural discrepancies between the two modalities are further reduced.

Figure 2.

Results of CT-CBCT registration. The columns show CT, CBCT, and registered CT (rCT) while the rows are the cross-sections at different axial levels. Following fine registration, rCT exhibits improved correspondence with CBCT with reduced local structural discrepancy.

Table 1.

Registration results for deformable CT-CBCT alignment.

	MSD (mm)	ASD (mm)	DSC	JSC
Sagittal	1.61	0.75	0.93	0.87
Coronal	1.27	0.43	0.93	0.87
Axial	1.19	0.69	0.95	0.90

Evaluation of sCT image quality with objective metrics

As depicted in Figure 3, both the image quality and spatial uniformity of sCT have experienced significant improvements in comparison to CBCT, while preserving the anatomical structure. Generally, the image quality of sCT is markedly superior to that of CBCT. Due to registration errors, the anatomical structures between CBCT and CT may differ; however, sCT’s anatomical structure aligns with CBCT, indicating that sCT effectively retains the anatomical information of CBCT, specifically the structure of the vertebrae. This approach can infer missing components based on the known vertebral information in the CBCT image (Figure 3, third row). sCT not only reduces artifacts and restores fine image structures but also reestablishes the CT values of CBCT. Figure 3 demonstrates that noise and artifacts throughout the image have been efficiently eliminated, with the discrepancy between sCT and CT being considerably smaller than that between CT and CBCT.

Figure 3.

Comparison of image quality among CBCT, sCT, and CT. The left, middle, and right columns show the CBCT, sCT, and CT, respectively, while the upper, middle, and bottom rows correspond to slices at different axial levels. Compared with CBCT, sCT demonstrates superior image quality and enhanced spatial uniformity.

Table 2 presents the quantitative results for all test cases, comparing the differences between CBCT and CT with those between sCT and CT. Relative to CBCT versus CT, the RMSE and MAE between sCT and CT decreased to 152.9 $\pm$ 24.4 HU and 102.6 $\pm$ 23.6 HU, respectively, while the PSNR increased to 27.9 $\pm$ 1.36 dB, indicating a higher similarity between sCT and CT. Nonetheless, the increase in SSIM values is limited, as factors such as acquisition time, environment, and patient positioning typically vary between CT and CBCT images, resulting in minor discrepancies in the anatomical structures depicted on CT and CBCT images.

Table 2.

The mean and standard deviation of the objective evaluation results of CBCT-CT and sCT-CT for all test data.

	RMSE (HU)	SSIM	PSNR (dB)	MAE (HU)
sCT-CT	152.9 $\pm$ 24.4	0.96 $\pm$ 0.02	27.9 $\pm$ 1.36	102.6 $\pm$ 23.6
CBCT-CT	890.1 $\pm$ 81.4	0.41 $\pm$ 0.03	13.6 $\pm$ 2.01	859.7 $\pm$ 75.3

Evaluation of sCT image quality with CT values

Regarding CT value enhancement, the overall CT value distribution of sCT closely aligns with that of CT. Figure 4 illustrates the CT value distribution along two red line paths, assessing the improvement in CT values. Line 64 crosses soft tissue and bone tissue regions; with the exception of the red arrow position, sCT’s CT values at other locations remain highly consistent with CT’s values, demonstrating a significant difference compared to the relationship between CBCT and CT values. At the red arrow location, the original CBCT image is devoid of the anatomical structure of the CT image; thus, this structure is not reconstructed in sCT but preserves the anatomical information of the CBCT image. This inconsistency between sCT and CT emphasizes the superiority of the developed algorithm in preserving CBCT anatomical structures. The path (profiles) of line 32 solely passes through the soft tissue region, and as depicted in the figure, sCT’s CT values are not only corrected to align with CT’s values, but the corrected CT values also display smoothness comparable to CT’s values.

Figure 4.

HU profiles on the selected lines. The upper images show the axial slices from CBCT, sCT, and CT with two selected lines (line 32 and line 64). The lower figures show the corresponding HU profiles along the lines. Overall, sCT exhibits a CT-value distribution closer to that of CT than CBCT, in both soft-tissue and bone-containing regions, while remaining anatomically consistent with CBCT.

Figure 5 illustrates the HU distribution in the red rectangular area for CBCT, sCT, and CT. Compared with the difference between CT and CBCT, the CT value difference between CT and sCT is extremely small. Within the ROI, the CT value distribution of sCT resembles that of CT, demonstrating that the CT values of the generated sCT are highly consistent with those of CT.

Figure 5.

HU distribution within the ROI indicated by the red box for CBCT, sCT, and CT. The upper images show the selected ROIs, and the lower figures present the corresponding HU distributions. Compared with CBCT, the HU distribution of sCT is closer to that of CT.

Ablation experiment

We conducted a series of ablation studies to systematically evaluate the contribution of the main components of our algorithm. As shown in Table 3, the STADW-M method generates the highest-quality sCT images. Because the diffusion model starts from pure noise and generates images through iterative denoising, the random noise initialization may bias the CT values of the generated sCT; consequently, the CT values of the sCT generated by the STAD-M method contain some inaccuracies. The CBCT images used in this study are of low quality, with significant noise in the CT values, and soft tissue and some cancellous bone are not clearly distinguished, which results in errors between the CT values of the sCT generated by the STA-M method and the actual CT values. Compared with STD-M and STA-M, the objective evaluation metrics of the sCT generated by the STAD-M method, which incorporates the ADM and DGSC modules, show a significant improvement, indicating that these modules enhance the performance of the model. Overall, compared with CBCT, the image quality and visibility of the sCT generated by our method are significantly improved, and the vertebrae and soft tissues can be clearly distinguished (as shown in Figure 6). These ablation studies demonstrate the impact of the individual components on sCT generation performance and guided the choice of the optimal settings for sCT image quality and visibility.

Figure 6.

Comparison of ablation experiment results. Compared with the other ablation settings, the full STADW-M model produces sCT with clearer vertebral and soft-tissue depiction, reduced artifacts, and a visual appearance closer to CT.

Table 3.

Comparison of ablation test results.

	RMSE (HU)	SSIM	PSNR (dB)	MAE (HU)
DDPM	251.7 $\pm$ 46.6	0.79 $\pm$ 0.05	20.9 $\pm$ 1.63	171.6 $\pm$ 36.7
STD-M	225.2 $\pm$ 69.6	0.91 $\pm$ 0.03	23.7 $\pm$ 1.72	161.1 $\pm$ 82.1
STA-M	227.0 $\pm$ 18.9	0.93 $\pm$ 0.04	26.9 $\pm$ 1.59	205.1 $\pm$ 19.2
STAD-M	150.2 $\pm$ 26.6	0.96 $\pm$ 0.02	27.9 $\pm$ 1.50	86.9 $\pm$ 18.5
STADW-M	152.9 $\pm$ 24.4	0.96 $\pm$ 0.02	27.9 $\pm$ 1.36	102.6 $\pm$ 23.6
CBCT-CT	890.1 $\pm$ 81.4	0.41 $\pm$ 0.03	13.6 $\pm$ 2.01	859.7 $\pm$ 75.3

Table 4.

Comparison of results of different methods for generating sCT.

	RMSE (HU)	SSIM	PSNR (dB)	MAE (HU)
CycleGAN	$211.3 \pm 58.0$	$0.87 \pm 0.12$	$24.1 \pm 2.37$	$173.7 \pm 58.8$
Conditional DDPM	$191.1 \pm 47.5$	$0.92 \pm 0.13$	$23.9 \pm 2.38$	$149.5 \pm 43.6$
EGDiff	$172.3 \pm 33.0$	$0.90 \pm 0.03$	$26.1 \pm 1.43$	$131.9 \pm 41.2$
SMU-MedVision	$156.2 \pm 31.0$	$0.91 \pm 0.13$	$25.1 \pm 1.73$	$104.5 \pm 24.2$
Proposed Method	$152.9 \pm 24.4$	$0.96 \pm 0.02$	$27.9 \pm 1.36$	$102.6 \pm 23.6$
CBCT–CT	$890.1 \pm 81.4$	$0.41 \pm 0.03$	$13.6 \pm 2.01$	$859.7 \pm 75.3$

Comparison with state-of-the-art techniques

The aim of this study is to convert CBCT images into high-quality synthetic CT (sCT) images resembling standard-dose CT, thereby enhancing the clinical diagnostic and therapeutic applications of CBCT. To this end, we compared the performance of three network models, namely CycleGAN,⁵⁰ Conditional DDPM,⁵¹ and EGDiff,⁵² for CBCT-to-CT conversion and then compared them with our proposed method. Moreover, motivated by the SynthRAD2023 CBCT-to-CT challenge report⁵³ and to ensure a fair reference to a strong challenge-derived baseline, we additionally reproduced the champion method SMU-MedVision on our dataset and included it in the comparison. As shown in Table 4, the sCT images generated by our method achieve the best values on all four metrics (RMSE, SSIM, PSNR, and MAE) among the five approaches. The sCT images generated by the four comparison methods, including the champion SMU-MedVision, exhibit noise in soft-tissue contrast, image blurring, and excessive smoothing (Figure 7). In contrast, the sCT images synthesized by our proposed method display lower noise and fewer artifacts, higher authenticity, and better preservation of anatomical structures, particularly bone tissue. In summary, our proposed method demonstrates excellent performance in medical image conversion tasks, providing an effective conversion technique for low-dose CBCT images in clinical applications and offering practical value for medical image diagnosis and treatment, especially image-guided spinal surgery.

Figure 7.

Comparison of results from different methods for generating sCT.

Effect of sampling steps on performance and runtime

Fast inference is critical for intraoperative applications. For diffusion-based CBCT-to-CT synthesis, inference latency is largely determined by the number of sampling steps. Accordingly, we analyzed the speed–quality trade-off by varying the sampling steps $N \in \{6, 12, 25, 50, 100, 200\}$ while keeping all other settings unchanged. All experiments were conducted on a single NVIDIA RTX 3090 GPU (24 GB VRAM), and the reported time denotes the average inference time per case.

As shown in Table 5, increasing the number of sampling steps consistently improves image quality, as evidenced by lower RMSE/MAE and higher SSIM/PSNR. Specifically, MAE decreases from $116.7 \pm 24.5$ HU at $N = 6$ to $104.9 \pm 23.9$ HU at $N = 50$ , and further to $102.6 \pm 23.6$ HU at $N = 200$ . Meanwhile, SSIM increases from $0.90 \pm 0.12$ ( $N = 6$ ) to $0.95 \pm 0.11$ ( $N = 50$ ) and reaches $0.96$ at larger $N$ . PSNR exhibits a similar trend. These results indicate that performance improves monotonically with increasing $N$ , but the gains become marginal once $N$ reaches a moderate level. For example, increasing $N$ from 50 to 200 yields only an additional $\sim$ 2.3 HU reduction in MAE and a limited increase in PSNR, suggesting that performance is close to saturation (Figure 8).

Figure 8.

Qualitative comparison of CBCT-to-CT synthesis with different sampling steps.

Table 5.

Effect of sampling steps on CBCT-to-CT synthesis performance and runtime (mean $\pm$ std).

	RMSE (HU)	SSIM	PSNR (dB)	MAE (HU)	Time (s)
Steps=6	166.3 $\pm$ 32.7	0.90 $\pm$ 0.12	24.2 $\pm$ 1.70	116.7 $\pm$ 24.5	0.66 $\pm$ 0.14
Steps=12	164.0 $\pm$ 29.0	0.91 $\pm$ 0.13	24.7 $\pm$ 1.56	115.2 $\pm$ 23.8	1.33 $\pm$ 0.06
Steps=25	161.0 $\pm$ 27.5	0.93 $\pm$ 0.12	25.3 $\pm$ 1.55	109.1 $\pm$ 23.9	2.14 $\pm$ 0.07
Steps=50	157.9 $\pm$ 25.7	0.95 $\pm$ 0.11	26.9 $\pm$ 1.51	104.9 $\pm$ 23.9	3.82 $\pm$ 0.04
Steps=100	155.7 $\pm$ 25.2	0.96 $\pm$ 0.09	27.3 $\pm$ 1.64	103.6 $\pm$ 23.7	7.31 $\pm$ 0.05
Steps=200 (STADW-M)	152.9 $\pm$ 24.4	0.96 $\pm$ 0.02	27.9 $\pm$ 1.36	102.6 $\pm$ 23.6	14.0 $\pm$ 0.09
CBCT-CT	890.1 $\pm$ 81.4	0.41 $\pm$ 0.03	13.6 $\pm$ 2.01	859.7 $\pm$ 75.3	–

In contrast, runtime grows approximately linearly with $N$ : inference time increases from $0.66 \pm 0.14$ s at $N = 6$ to $3.82 \pm 0.04$ s at $N = 50$ , and reaches $14.0 \pm 0.09$ s at $N = 200$ . Balancing accuracy and efficiency, $N = 50$ provides a favorable operating point, achieving near-saturated image quality while substantially reducing latency compared with $N = 200$ . Unless otherwise specified, we use $N = 200$ as the default setting to maximize fidelity, and additionally report results at $N = 50$ to demonstrate the feasibility of faster intraoperative deployment.
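The approximately linear growth of runtime can be checked directly from the mean inference times in Table 5. The following minimal sketch fits a straight line to those values by ordinary least squares; the helper name `fit_line` and the variable names are illustrative assumptions, and the standard deviations reported in the table are ignored.

```python
# Sampling steps and mean inference times (s) copied from Table 5.
steps = [6, 12, 25, 50, 100, 200]
times = [0.66, 1.33, 2.14, 3.82, 7.31, 14.0]


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


slope, intercept = fit_line(steps, times)

# Ratio of the mean runtimes at N = 200 and N = 50.
speedup_50_vs_200 = times[steps.index(200)] / times[steps.index(50)]

print(f"about {slope:.3f} s per sampling step, {intercept:.2f} s fixed overhead")
print(f"N = 50 is about {speedup_50_vs_200:.1f}x faster than N = 200")
```

Under this fit, each additional sampling step costs a roughly constant amount of time on top of a small fixed overhead, which is consistent with the trade-off described above.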

Screw trajectory planning based on sCT

This study also performed screw trajectory planning tests to evaluate the practical application value of the algorithm. Trajectories were generated using a geometry-driven planning framework adapted from Zhang et al.⁵⁴ Briefly, a binary vertebral mask is obtained to delineate anatomical and avoidance boundaries. Trajectory planning is then cast as a max-min clearance optimization, selecting the screw axis that maximizes the minimum distance to the avoidance boundary. The axis is estimated from orthogonal axial and sagittal projections: in each plane, boundaries are identified via contour detection, and the optimal line is solved using a maximum-margin (SVM-based) formulation. The final 3D trajectory is reconstructed by jointly enforcing the two orthogonal projection constraints.
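As an illustration of the max-min clearance idea in a single projection plane, the following minimal sketch searches for the straight line with the largest margin between two sets of boundary points. It is not the planning framework of ref. 54: a brute-force search over line orientations stands in for the SVM-based solver, and the function name `max_margin_line`, the two point sets, and the angular resolution are assumptions made for illustration.

```python
import math


def max_margin_line(side_a, side_b, n_angles=180):
    """Brute-force stand-in for a hard-margin linear separator in 2D.

    The line is {p : n . p = offset}, with unit normal n = (cos t, sin t).
    Returns (margin, t, offset), or None if no sampled orientation
    separates the two point sets.
    """
    best = None
    for k in range(n_angles):
        t = math.pi * k / n_angles
        nx, ny = math.cos(t), math.sin(t)
        proj_a = [nx * x + ny * y for x, y in side_a]
        proj_b = [nx * x + ny * y for x, y in side_b]
        # Try both orderings of the two sets along the normal direction.
        for lo, hi in ((max(proj_a), min(proj_b)), (max(proj_b), min(proj_a))):
            margin = (hi - lo) / 2.0  # minimum distance to either set
            if margin > 0.0 and (best is None or margin > best[0]):
                best = (margin, t, (hi + lo) / 2.0)
    return best


# Hypothetical contour points: two parallel walls 6 mm apart.
left_wall = [(-3.0, float(y)) for y in range(11)]
right_wall = [(3.0, float(y)) for y in range(11)]
margin, theta, offset = max_margin_line(left_wall, right_wall)
print(margin, theta, offset)
```

For two parallel walls, the selected line runs midway between them, so the clearance equals half the wall spacing. A dedicated SVM solver would find the same separator without discretizing the orientation.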

In this test, we collected CBCT and CT data from 36 patients, covering 94 screws: 6 at L1, 14 at L2, 16 at L3, 28 at L4, and 30 at L5. We first converted the CBCT images to sCT and then performed screw trajectory planning on CT, CBCT, and sCT. The quality of each planned pedicle screw trajectory was evaluated using the distance from the screw surface to the pedicle cortex (DSOS), with MGM denoting DSOS > 0 mm, CAE denoting $-2$ mm < DSOS < 0 mm, and non-acceptable denoting DSOS < $-2$ mm. The results demonstrated that screw trajectory planning performed directly on CBCT images achieved only 68.0% (64/94) MGM cases, with 27.7% (26/94) classified as CAE and 4.3% (4/94) as non-acceptable, underscoring the limitations of CBCT attributable to image artifacts and low soft-tissue contrast. Conversely, following conversion of CBCT to sCT using the proposed method, all planned trajectories were clinically acceptable (DSOS > $-2$ mm), with 94.7% (89/94) classified as MGM and 5.3% (5/94) as CAE (Figure 9 and Table 6). These outcomes closely approximate those obtained with high-quality CT imaging (95.7% MGM, 4.3% CAE, 0% non-acceptable), demonstrating that this method can provide sufficiently clear image information to support pedicle screw fixation surgery, thereby enhancing the accuracy of screw trajectory planning and placement (Figure 10).
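The DSOS grading rule can be written as a short function. The sketch below is illustrative only: the function names and the example DSOS values are hypothetical, and the handling of the exact boundary values (DSOS = 0 mm graded as CAE, DSOS = -2 mm graded as non-acceptable) is an assumption, since only the strict inequalities are defined above.

```python
from collections import Counter

GRADES = ("MGM", "CAE", "Non-acceptable")


def classify_dsos(dsos_mm):
    """Grade one trajectory by its DSOS value in millimetres.

    Boundary handling is an assumption: DSOS == 0 is graded CAE and
    DSOS == -2 is graded Non-acceptable.
    """
    if dsos_mm > 0.0:
        return "MGM"
    if dsos_mm > -2.0:
        return "CAE"
    return "Non-acceptable"


def summarize(dsos_values):
    """Return {grade: (count, percentage)} for a list of DSOS values."""
    counts = Counter(classify_dsos(d) for d in dsos_values)
    n = len(dsos_values)
    return {g: (counts[g], round(100.0 * counts[g] / n, 1)) for g in GRADES}


# Hypothetical DSOS values (mm), not data from this study.
example = [1.8, 0.6, -0.4, 2.3, -2.5]
summary = summarize(example)
print(summary)
```

Applied to the per-screw DSOS values of one imaging modality, `summarize` yields counts and percentages in the same layout as Table 6.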

Figure 9.

Distribution of DSOS for calculated screw trajectories. (a) Screw path planning on CBCT. (b) Screw path planning on CT. (c) Screw path planning on sCT.

Figure 10.

Screw trajectory planning based on sCT.

Table 6.

Numbers (percentages) of MGM, CAE, and non-acceptable trajectories for screw trajectory planning on sCT, CT, and CBCT.

	MGM	CAE	Non-acceptable
sCT	89 (94.7%)	5 (5.3%)	0 (0.0%)
CT	90 (95.7%)	4 (4.3%)	0 (0.0%)
CBCT	64 (68.0%)	26 (27.7%)	4 (4.3%)

Discussion and conclusion

In this study, we proposed the Spatiotemporal Adaptive Warm-Start Diffusion Model (STADW-M), which integrates a registration module, dual-domain structural guidance, and an artifact-aware mechanism to synthesize high-quality sCT from intraoperative CBCT. Quantitative evaluations and screw trajectory planning assessments indicated that our method outperformed state-of-the-art techniques in both image quality and anatomical fidelity, achieving clinical utility approaching that of diagnostic CT.

Compared with traditional generative methods, the proposed STADW-M addresses limitations inherent in CBCT-to-CT synthesis. Whereas GAN-based approaches frequently exhibit training instability and structural hallucinations, and standard DDPM-based approaches suffer from anatomical inconsistencies caused by random initialization, our method incorporates targeted innovations. Specifically, the Artifact-Aware Adaptive Diffusion Module and the Dually-Guided Structural Consistency Module leverage prior feature maps to distinguish between anatomical structures and noise, effectively suppressing the inherent artifacts of CBCT while preserving cortical bone details. Crucially, to mitigate the stochastic deformation caused by pure Gaussian noise initialization in the reverse diffusion process, we implemented a Warm-Start strategy. By replacing the standard prior $p(x_{T}) = N(0, I)$ with a CBCT-informed distribution $p(x_{T}) = N(\sqrt{\bar{\alpha}_{T}}\, x_{CBCT}, (1 - \bar{\alpha}_{T}) I)$, we keep the generative process anchored to the patient’s actual anatomy, thereby enhancing structural fidelity. Furthermore, the Deformable Image Registration module mitigates the impact of imperfect alignment in the training data. The clinical validity of these technical advancements is evidenced by the screw trajectory planning results: the success rate (MGM) improved from 68.0% on the original CBCT to 94.7% on our sCT, closely approximating the 95.7% achieved on gold-standard CT. This confirms that, despite the inherent noise in soft tissues, our method reconstructs vertebral anatomy with sufficient precision for robot-assisted spinal surgery.
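Drawing from the CBCT-informed prior amounts to scaling the CBCT intensities by $\sqrt{\bar{\alpha}_{T}}$ and adding Gaussian noise with variance $1 - \bar{\alpha}_{T}$. The following minimal sketch shows this for a flattened list of normalized intensities; the linear $\beta$ schedule, its end points, the value $T = 200$, and the function names are assumptions for illustration and are not taken from our implementation.

```python
import math
import random


def alpha_bar(T, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for an assumed linear schedule."""
    prod = 1.0
    for t in range(T):
        beta = beta_start + (beta_end - beta_start) * t / max(T - 1, 1)
        prod *= 1.0 - beta
    return prod


def warm_start(x_cbct, abar_T, rng):
    """Sample x_T ~ N(sqrt(abar_T) * x_cbct, (1 - abar_T) * I)."""
    scale = math.sqrt(abar_T)
    sigma = math.sqrt(1.0 - abar_T)
    return [scale * v + sigma * rng.gauss(0.0, 1.0) for v in x_cbct]


rng = random.Random(0)
abar_T = alpha_bar(200)
x_cbct = [0.5] * 20000            # hypothetical normalized CBCT intensities
x_T = warm_start(x_cbct, abar_T, rng)

sample_mean = sum(x_T) / len(x_T)
print(f"abar_T = {abar_T:.3f}, sample mean of x_T = {sample_mean:.3f}")
```

Because the mean of $x_{T}$ is a scaled copy of the CBCT image rather than zero, the reverse process starts from a state that already carries the patient-specific anatomy; how much of it survives depends on how small $\bar{\alpha}_{T}$ is under the chosen schedule.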

Although the proposed algorithm can generate high-quality sCT images, it has limitations. First, the DDPM requires paired CT and CBCT images, but ideal CBCT-CT pairs are difficult to obtain because the two scans are not acquired simultaneously. To address this issue, we incorporated the Reg module into the algorithm to align CT and CBCT. However, because registration errors prevent complete alignment, the generated sCT may exhibit slight deformations. Second, CBCT has considerable errors in CT values and cannot differentiate soft tissue from regions obscured by noise, so the accuracy of soft-tissue structures in the generated sCT remains uncertain.

Although the accuracy of the soft tissue regions in the generated sCT may be disputed due to the quality of CBCT images, the anatomical structure of the vertebrae is preserved completely and clearly. For pedicle screw implantation, a complete and accurate vertebral anatomical structure is more critical, and the results of screw trajectory planning demonstrate the clinical value of this method. In summary, the method proposed in this paper performs well in medical image transformation tasks, providing an effective conversion method for CBCT images in clinical applications, and has practical application value for robot-assisted spinal surgery.

Footnotes

Ethics statement

The study was ethically approved by the Ethical Committee of Capital Medical University, Beijing, China (No. Z2024SY064). Written consent from the participants was waived due to the retrospective design of the study. The experimental design abided by the principles of the Helsinki Declaration.

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: Beijing Natural Science Foundation, Grant Nos. 1S24093 and L241029; Beijing Municipal Health Commission, Grant No. 2024-2-2076; National Natural Science Foundation of China, Grant No. 61827809.

Declaration of competing interest

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

ORCID iD

Yunxian Zhang

References

1. Balakrishnan, Zhao, Sabuncu, et al. VoxelMorph: a learning framework for deformable medical image registration. IEEE Trans Med Imaging 2019; 38: 1788–1800.

2. Guo. Multi-modal image registration with unsupervised deep learning. PhD Thesis, Massachusetts Institute of Technology, 2019.

3. Mason, Paulsen, Babuska, et al. The accuracy of pedicle screw placement using intraoperative image guidance systems: a systematic review. J Neurosurg Spine 2014; 20: 196–203.

4. Lee, Carass, Jog, et al. Multi-atlas-based CT synthesis from conventional MRI with patch-based refinement for MRI-based radiotherapy planning. In Medical imaging 2017: image processing. Vol. 10133. SPIE, pp. 434–439.

5. Sorcini, Tilikidis. Clinical application of image-guided radiotherapy, IGRT (on the Varian OBI platform). Cancer/Radiothérapie 2006; 10: 252–257.

6. Lee. Image-guided pedicle screws using intraoperative cone-beam CT and navigation. A cost-effectiveness study. J Clin Neurosci 2020; 72: 68–71.

7. Dea, Fisher, Batke, et al. Economic evaluation comparing intraoperative cone beam CT-based navigation and conventional fluoroscopy for the placement of spinal pedicle screws: a patient-level data cost-effectiveness analysis. Spine J 2016; 16: 23–31.

8. Rivkin, Yocom. Thoracolumbar instrumentation with CT-guided navigation (O-arm) in 270 consecutive patients: accuracy rates and lessons learned. Neurosurg Focus 2014; 36: E7.

9. Cai, Wang, Liu, et al. Automatic path planning for navigated pedicle screw surgery based on deep neural network. In 2019 WRC symposium on advanced robotics and automation (WRC SARA). IEEE, pp. 62–67.

10. Harms, Lei, Wang, et al. Paired cycle-GAN-based image correction for quantitative cone-beam computed tomography. Med Phys 2019; 46: 3998–4009.

11. Jaju, Jain, Singh, et al. Artefacts in cone beam CT. Open J Stomatol 2013; 3: 292.

12. Schulze, Heil, Groß, et al. Artefacts in CBCT: a review. Dentomaxillofac Radiol 2011; 40: 265–273.

13. Silbermann, Riese, Allam, et al. Computer tomography assessment of pedicle screw placement in lumbar and sacral spine: comparison between free-hand and O-arm based navigation techniques. Eur Spine J 2011; 20: 875–881.

14. Zhang, Han, et al. Robotic navigation during spine surgery. Expert Rev Med Devices 2020; 17: 27–32.

15. Zhu, Xie, Wang, et al. Scatter correction for cone-beam CT in radiation therapy. Med Phys 2009; 36: 2258–2268.

16. Bai, Yan, et al. A practical cone-beam CT scatter correction method with optimized Monte Carlo simulations for image-guided radiation therapy. Phys Med Biol 2015; 60: 3567.

17. Keil, Constantin, et al. Metal artifact correction for x-ray computed tomography using KV and selective MV imaging. Med Phys 2014; 41: 121910.

18. Bechara, Moore, McMahan, et al. Metal artefact reduction with cone beam CT: an in vitro study. Dentomaxillofac Radiol 2012; 41: 248–253.

19. Garrett, Chen. Reduction of beam hardening artifacts in cone-beam CT imaging via SMART-RECON algorithm. In Proceedings of SPIE–the international society for optical engineering. Vol. 9783. p. 97830W.

20. Kida, Nakamoto, Nakano, et al. Cone beam computed tomography image quality improvement using a deep convolutional neural network. Cureus 2018; 10: e2548.

21. Thummerer, De Jong, Zaffino, et al. Comparison of the suitability of CBCT- and MR-based synthetic CTs for daily adaptive proton therapy in head and neck patients. Phys Med Biol 2020; 65: 235036.

22. Yuan, Dyer, Rao, et al. Convolutional neural network enhancement of fast-scan low-dose cone-beam CT images for head and neck radiotherapy. Phys Med Biol 2020; 65: 035003.

23. Dar, Yurt, Karacan, et al. Image synthesis in multi-contrast MRI with conditional generative adversarial networks. IEEE Trans Med Imaging 2019; 38: 2375–2388.

24. Nie, Trullo, Lian, et al. Medical image synthesis with deep convolutional adversarial networks. IEEE Trans Biomed Eng 2018; 65: 2720–2730.

25. Armanious, Jiang, Fischer, et al. MedGAN: medical image translation using GANs. Comput Med Imaging Graph 2020; 79: 101684.

26. Barateau, De Crevoisier, Largent, et al. Comparison of CBCT-based dose calculation methods in head and neck cancer radiotherapy: from Hounsfield unit to density calibration curve to deep learning. Med Phys 2020; 47: 4683–4693.

27. Zhang, Yue, et al. Improving CBCT quality to CT level using deep learning with generative adversarial network. Med Phys 2021; 48: 2816–2826.

28. Dahiya, Alam, Zhang, et al. Multitask 3D CBCT-to-CT translation and organs-at-risk segmentation using physics-based data augmentation. Med Phys 2021; 48: 5130–5141.

29. Kida, Kaji, Nawa, et al. Visual enhancement of cone-beam CT by use of CycleGAN. Med Phys 2020; 47: 998–1010.

30. Kurz, Maspero, Savenije, et al. CBCT correction using a cycle-consistent generative adversarial network and unpaired training to enable photon and proton dose calculation. Phys Med Biol 2019; 64: 225004.

31. Liang, Chen, Nguyen, et al. Generating synthesized computed tomography (CT) from cone-beam computed tomography (CBCT) using CycleGAN for adaptive radiation therapy. Phys Med Biol 2019; 64: 125002.

32. Liu, Lei, Wang, et al. CBCT-based synthetic CT generation using deep-attention CycleGAN for pancreatic adaptive radiotherapy. Med Phys 2020; 47: 2472–2483.

33. Gao, Xie, et al. Generating synthetic CT from low-dose cone-beam CT by using generative adversarial networks for adaptive radiotherapy. Radiat Oncol 2021; 16: 1–16.

34. Xue, Ding, Shi, et al. Cone beam CT (CBCT) based synthetic CT generation using deep learning methods for dose calculation of nasopharyngeal carcinoma radiotherapy. Technol Cancer Res Treat 2021; 20: 15330338211062415.

35. Lemus OMD, Wang, et al. Dosimetric assessment of patient dose calculation on a deep learning-based synthesized computed tomography image for adaptive radiotherapy. J Appl Clin Med Phys 2022; 23: e13595.

36. Dhariwal, Nichol. Diffusion models beat GANs on image synthesis. Adv Neural Inf Process Syst 2021; 34: 8780–8794.

37. Jain, Abbeel. Denoising diffusion probabilistic models. Adv Neural Inf Process Syst 2020; 33: 6840–6851.

38. Jalal, Arvinte, Daras, et al. Robust compressed sensing MRI with deep generative priors. Adv Neural Inf Process Syst 2021; 34: 14938–14954.

39. Chung. Score-based diffusion models for accelerated MRI. Med Image Anal 2022; 80: 102479.

40. Song, Shen, Xing, et al. Solving inverse problems in medical imaging with score-based generative models. arXiv preprint arXiv:2111.08005 (2021).

41. Güngör, Dar, Öztürk, et al. Adaptive diffusion priors for accelerated MRI reconstruction. Med Image Anal 2023; 88: 102872.

42. Chung, Lee. MR image denoising and super-resolution using regularized reverse diffusion. IEEE Trans Med Imaging 2022; 42: 922–934.

43. Pinaya WHL, Tudosiu, Dafflon, et al. Brain imaging generation with latent diffusion models. In MICCAI workshop on deep generative models, 2022, pp. 117–126.

44. Lugmayr, Danelljan, Romero, et al. RePaint: inpainting using denoising diffusion probabilistic models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 11461–11471.

45. Sasaki, Willcocks, Breckon. UNIT-DDPM: UNpaired image translation with denoising diffusion probabilistic models. arXiv preprint arXiv:2104.05358 (2021).

46. Ronneberger, Fischer, Brox. U-Net: convolutional networks for biomedical image segmentation. In Medical image computing and computer-assisted intervention–MICCAI 2015: 18th international conference, Munich, Germany, 5–9 October 2015, proceedings, part III 18. Springer, pp. 234–241.

47. Vaswani, Shazeer, Parmar, et al. Attention is all you need. Adv Neural Inf Process Syst 2017; 30: 5998–6008.

48. Huang, Yuan, Guo, et al. Interlaced sparse self-attention for semantic segmentation. arXiv preprint arXiv:1907.12273 (2019).

49. Kalchbrenner, Weissenborn, et al. Axial attention in multidimensional transformers. arXiv preprint arXiv:1912.12180 (2019).

50. Dong, Zhang, Liang, et al. A deep unsupervised learning model for artifact correction of pelvis cone-beam CT. Front Oncol 2021; 11: 686875.

51. Peng, Qiu, Wynne, et al. CBCT-based synthetic CT image generation using conditional denoising diffusion probabilistic model. Med Phys 2024; 51: 1847–1859.

52. Cai, et al. Energy-guided diffusion model for CBCT-to-CT synthesis. Comput Med Imaging Graph 2024; 113: 102344.

53. Huijben, Terpstra, Galapon, et al. Generating synthetic computed tomography for radiotherapy: SynthRAD2023 challenge report. Med Image Anal 2024; 97: 103276.

54. Zhang, Liu, Zhao, et al. Improving pedicle screw path planning by vertebral posture estimation. Phys Med Biol 2023; 68: 185011.