Enhanced X-ray image denoising via the synergy of linear attention and convolution

Abstract

X-ray imaging technology, as a core non-invasive inspection method, plays an irreplaceable role in industrial non-destructive testing and medical diagnosis. However, during signal acquisition, the imaging system is subject to multiple interferences, such as quantum noise and electronic noise. These significantly lower the image’s signal-to-noise ratio and seriously affect the accuracy of hazardous material identification and lesion detection. Existing X-ray image denoising methods have two major limitations. First, in physical model-driven denoising methods, the assumed noise models deviate significantly from real noise, resulting in poor denoising results. Second, in mainstream deep learning-based methods, Convolutional Neural Networks (CNNs) have limited ability to capture long-range dependencies, while Transformer models with a global receptive field have high computational complexity. To address these challenges, a physically grounded noise model is designed for synthesizing realistic noisy X-ray images, and the public mainstream X-ray security inspection datasets are augmented with the resulting hybrid real-synthetic data. Based on this, a novel denoising model, XDenoiser, is proposed in this paper. It incorporates Receptance Weighted Key-Value (RWKV), an attention mechanism with linear complexity, into a Transformer-based image restoration structure and combines it with CNNs to support both global and local receptive fields. Experiments on the expanded mainstream X-ray image security inspection datasets demonstrate the soundness and effectiveness of XDenoiser.

Keywords

Image denoising; X-ray image; linear attention; collaborative structure

Introduction

X-ray imaging is a mature non-destructive testing technique. It is widely used in medical diagnosis and industrial inspection due to its rapid imaging capability and non-invasive nature.^1–3 However, image quality is often compromised by complex noise in practical applications. This noise arises from multiple sources. These include quantum statistical fluctuations of X-ray photons, electronic noise from detectors, and the low signal-to-noise ratio inherent in low-dose imaging which is commonly employed to protect patients and operational staff. Such noise not only degrades visual quality but also impairs the accuracy of subsequent quantitative analysis and diagnosis. Therefore, effective denoising is an essential preprocessing step in X-ray image analysis.

At present, research on X-ray image denoising confronts several key challenges. The primary difficulty involves the complexity of the noise model. X-ray image noise does not follow a simple additive Gaussian model. Instead, it typically exhibits Poisson-Gaussian mixture characteristics.^4–6 This complexity substantially limits the effectiveness of traditional denoising algorithms designed for simpler noise models. A second challenge lies in the inherent trade-off between detail preservation and noise suppression. In medical imaging, excessive smoothing during denoising can obscure subtle structures indicative of early-stage lesions. Such obscuration increases the risk of misdiagnosis and missed diagnoses.⁷ Furthermore, the widespread adoption of low-dose imaging imposes additional demands on denoising algorithms. A critical practical problem involves reconstructing diagnostically useful images through post-processing while maintaining reduced radiation exposure.⁸ This challenge requires particular attention in clinical applications.

Given these challenges, the development of advanced X-ray image denoising techniques is critically needed. In medical applications, image quality directly influences diagnostic accuracy. Effective denoising methods are essential to unlock the clinical potential of low-dose imaging while ensuring patient safety. In industrial inspection, particularly in high-precision sectors such as aerospace and new energy, defect tolerance is extremely low. Reliable denoising algorithms serve as crucial safeguards against missing subtle internal defects in structural components.⁹

Recent advances in deep learning have demonstrated remarkable success in image denoising. These data-driven approaches learn complex nonlinear input-output mappings through extensive training on large datasets. However, their superior performance typically requires massive amounts of high-quality annotated data. For X-ray applications, the scarcity of such datasets significantly limits denoising model performance. Challenges in obtaining large-scale real X-ray images stem from medical privacy concerns, industrial data sensitivity, and high annotation costs. To address this limitation, researchers have developed synthetic data-driven methods that generate realistic noise-clean image pairs based on physical noise characteristics.¹⁰ The accuracy of these noise synthesis models is critical, and they must precisely simulate the noise distribution present in actual X-ray imaging. In real X-ray systems, image noise primarily originates from equipment physics and environmental conditions, including sensor quantum noise and electronic thermal noise.⁵ Poisson noise constitutes a major component, exhibiting photon-count-dependent behavior. This results in weak noise in low absorption regions and strong noise in high absorption areas, creating a significant non-uniform noise distribution across the image. Another dominant component is Gaussian noise, caused by random signal interference from irregular electron motion in electronic systems. This noise type commonly appears in physical systems, electronic devices, and signal transmission processes. Therefore, these noise patterns severely degrade image quality and compromise subsequent analysis accuracy.

Current X-ray noise modeling approaches often demonstrate limited accuracy, compromising the validity of data-driven denoising results. Most existing denoising methods based on physical models rely exclusively on Gaussian noise assumptions. While some studies incorporate Gaussian-Poisson mixture models, these still fail to capture the complete noise complexity in real-world scenarios, particularly neglecting key physical characteristics of X-ray noise.¹¹ Consequently, developing accurate noise representations remains a critical challenge in X-ray denoising, and a more realistic X-ray image noise modeling approach is needed. Previous work⁵ demonstrates that Poisson-Gaussian distributions can effectively approximate both signal-dependent and stationary noise components. Building upon this foundation, a novel noise model is developed in this paper and integrated with the original datasets to synthesize training data. This solution helps mitigate the detrimental effects of inadequate noise modeling on training performance.

Recent deep learning denoising methods primarily employ either convolutional neural networks (CNNs)^12–14 or transformer architectures^15,16 to learn noise-to-clean signal mappings through data-driven approaches. While CNNs utilize fixed convolution kernels for feature extraction, they struggle to capture long-range dependencies. In contrast, transformers excel at extracting global contextual features through self-attention mechanisms. Hybrid architectures combining both approaches¹⁷ demonstrate superior performance in balancing global denoising and local detail preservation. However, the quadratic computational complexity of standard self-attention presents significant limitations when handling complex noise distributions while maintaining computational efficiency. Recent advances in sequence modeling, particularly linear attention mechanisms like the Receptance Weighted Key Value (RWKV) model,¹⁸ have shown promise for efficient long-sequence processing in natural language tasks. This work investigates the integration of linear-complexity RWKV attention with image denoising models to achieve an optimal balance between computational efficiency and denoising performance. The key challenge lies in designing an effective hybrid architecture that optimally combines local feature extraction with global structure preservation. Inspired by Restore RWKV,¹⁹ a novel dual-branch RWKV-CNN architecture is proposed for X-ray image denoising. This framework combines convolutional networks with linear attention mechanisms, leveraging CNNs for local feature enhancement while maintaining RWKV’s global dependency modeling. A dynamic fusion module is introduced to maximize the integration of global and local information while providing effective training supervision. Experimental validation on the mainstream X-ray image security inspection datasets demonstrates the proposed XDenoiser’s effectiveness compared to state-of-the-art denoising methods.

The main contributions of this paper are summarized as follows:

The mainstream X-ray image security inspection datasets are expanded by integrating original and composite images using a newly proposed physical noise model. To the best of our knowledge, this represents the first work on X-ray image synthesis incorporating noise modeling.

A global receptive field denoiser, XDenoiser, is introduced to enhance denoising efficiency with linear computational complexity. This is the first X-ray image denoising model based on the RWKV architecture.

While pure global modeling may degrade local details, a dual-branch architecture combining RWKV and convolution is designed to strengthen local dependency modeling. This yields a novel denoising model with both global and local receptive fields. Additionally, a dynamic fusion module is proposed to effectively combine global and local information. By analyzing regional noise characteristics and structural complexity, the module adaptively adjusts the fusion weights of the global and local branches.

Related work

Traditional filtering methods

Spatial domain methods were the first widely adopted strategy for image denoising, suppressing noise by smoothing pixel neighborhood information. Representative techniques include the median filter²⁰ and Gaussian filter.²¹ The median filter effectively removes salt-and-pepper noise, while the Gaussian filter is suitable for smoothing Gaussian noise. Both methods benefit from computational simplicity and high efficiency. However, they often blur image details, particularly in edge regions, leading to structural information loss. Transform domain methods utilize sparse image representations in domains such as frequency or wavelet space to enhance significant information while suppressing noise. Wavelet transform²² is widely used in multi-scale image processing, effectively distinguishing between image details and noise for threshold-based denoising. Additionally, principal component analysis (PCA)²³ is frequently employed to reduce image redundancy, improving the separation between signal and noise.
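As a minimal illustration of the spatial-domain filtering described above, the following sketch applies a median filter to a small image stored as a list of rows. The function name `median_filter`, the window radius, and the toy image are assumptions made for this example only and are not taken from the cited works.

```python
from statistics import median

def median_filter(img, radius=1):
    """Replace each pixel by the median of its (2*radius+1)^2 neighborhood.

    Windows are clipped at the image border, so edge pixels use fewer samples.
    """
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [img[j][i]
                      for j in range(max(0, y - radius), min(h, y + radius + 1))
                      for i in range(max(0, x - radius), min(w, x + radius + 1))]
            out[y][x] = median(window)
    return out

# Toy example: a flat region corrupted by a single "salt" impulse.
flat = [[0.5] * 5 for _ in range(5)]
flat[2][2] = 1.0
restored = median_filter(flat)
```

Because the impulse is an outlier within every window that contains it, the median discards it entirely, which is why this filter suits salt-and-pepper noise.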

Model-based denoising methods

Model-based image denoising methods construct interpretable optimization objectives by explicitly modeling the statistical properties or physical mechanisms of noise and signals. These methods can be categorized into four main approaches: (1) Total Variation (TV) Models: Based on gradient sparsity, TV methods suppress noise by penalizing the total gradient magnitude while preserving edge structures. The classical TV denoising method proposed by Rudin et al.²⁴ is a representative example. (2) Matrix Modeling Methods (Low-Rank & Sparse Decomposition): The Non-Local Means (NLM) algorithm²⁵ improves texture and structure preservation by searching for structurally similar patches across the image and performing weighted averaging. However, NLM is noise-sensitive and prone to matching failures in complex textures or high-noise conditions, degrading reconstruction quality. (3) Block-Matching Methods: BM3D²⁶ (Block-Matching and 3D Filtering) achieves strong denoising performance while preserving details through patch matching and collaborative 3D filtering. It is widely used in image processing, particularly for scenes with repetitive structures. However, BM3D suffers from performance degradation under non-uniform noise or blurred edges, and its multi-stage pipeline incurs high computational costs, limiting efficiency. (4) Sparse Representation & Dictionary Learning: Methods like K-SVD²⁷ employ trained dictionaries for sparse coding, performing well under additive white Gaussian noise (AWGN). However, they heavily depend on training data, and their iterative optimization processes are computationally intensive.

In summary, while model-based methods provide strong theoretical guarantees and interpretability, they face challenges in handling complex real-world noise patterns and meeting efficiency requirements.

Deep learning based methods

Recent advances in deep learning-based image denoising can be broadly categorized into supervised and self-supervised approaches. Supervised methods utilize paired noisy-clean images to train end-to-end networks. CNN-based techniques like RED-CNN¹² and FFDNet²⁸ employ local convolutional operations for efficient feature extraction, offering fast convergence but limited performance on long-range structures and weakly textured regions due to constrained receptive fields. To address this limitation, Transformer architectures have been introduced, leveraging self-attention mechanisms for improved global modeling. Methods such as Restormer¹⁵ demonstrate superior performance in natural image denoising while preserving spatial details. However, Transformer-based approaches face challenges including high computational costs, significant memory requirements, and slow inference speeds, hindering deployment on resource-constrained devices.

Self-supervised denoising methods eliminate the need for clean reference images, making them suitable for real-world scenarios where ground truth data is unavailable. Noise2Noise²⁹ trains models using pairs of independently noisy images, leveraging statistical averaging of noise patterns. Noise2Void³⁰ generates pseudo-supervised signals from single noisy images by masking central pixels and predicting their values from surrounding neighborhoods. However, these approaches typically require specific noise assumptions (e.g., zero-mean, independent noise) and demonstrate limited effectiveness for structural or signal-dependent noise. Their reconstruction fidelity generally underperforms supervised methods. Consequently, integrating explicit noise modeling with strongly supervised frameworks represents a critical research direction for enhancing X-ray image denoising performance.

Methods

Modelling X-ray image noise

Recently, deep learning methods have made significant progress in image denoising, particularly through CNN and Transformer-based networks. Guo et al.³¹ demonstrate that accurate noise modeling plays a critical role in enhancing denoising effectiveness. When noise models fail to precisely characterize actual noise distributions, even sophisticated neural networks may achieve suboptimal results. This understanding has prompted increasing research into hybrid approaches combining physical modeling with deep learning, including Poisson-Gaussian distribution-based neural networks and self-supervised methods that adapt to real noise patterns. Effective noise modeling not only improves denoising quality but also ensures more reliable data for downstream X-ray image analysis tasks.

Conventional denoising methods, such as DnCNN,³² are designed mainly for natural images acquired by visible-light imaging. That imaging process differs from X-ray imaging, and so does the corresponding noise model. As a result, methods developed for visible-light image denoising are less effective on X-ray images. Figure 1 shows the result of directly applying a Gaussian noise model: the region of complex structural information marked by the green ellipse may become overly smooth. Most X-ray image denoising methods¹¹ focus mainly on Gaussian or Poisson noise with a single fixed parameter. However, in X-ray images, the noise level is not constant across the output image, because it is caused by random physical processes. Image noise in X-ray imaging originates from two independent physical processes. First, the arrival of X-ray photons at the detector follows a random process governed by Poisson statistics from quantum physics. The resulting photon-counting fluctuations produce Poisson noise. A key characteristic of this noise type is its signal dependence: noise intensity correlates with signal intensity. Second, the electronic readout circuitry of the imaging system introduces additive Gaussian noise. This noise component arises from electronic disturbances such as thermal noise and dark current. Unlike Poisson noise, it remains independent of the signal. Consequently, an accurate noise model for X-ray image degradation must account for both of these fundamental noise sources. The Poisson-Gaussian mixture noise model employed in this paper builds upon this physical mechanism. Its validity is established at two levels. In X-ray imaging, the model has been directly verified and adopted through multiple physics-based studies and experiments.^4–6 Furthermore, this model extends beyond X-ray imaging. As a classical framework for handling mixed signal-dependent and signal-independent noise, the Poisson-Gaussian model has been widely validated across various imaging domains, including fluorescence microscopy³³ and general digital imaging.³⁴ Collectively, this evidence supports the model’s effectiveness and accuracy in characterizing real noise behavior in X-ray imaging.

Figure 1.

Results of the denoising method based on a simplified noise model.

In X-ray imaging, Poisson noise arises from the quantum nature of photon detection. The number of incident photons at each pixel location adheres to Poisson statistics. For a pixel with an expected photon count of $\lambda$, the actual number of detected photons $k$ is a Poisson random variable. Its distribution is described by the following probability mass function:

$$P(k) = \frac{\lambda^{k} e^{-\lambda}}{k!} \tag{1}$$

where $k = 0, 1, 2, \dots$ denotes the actual number of photons detected at a pixel, and $\lambda$ ($\lambda > 0$) represents the expected photon count at that pixel. The symbol $e$ is the base of the natural logarithm. In the imaging model, the expected photon count $\lambda(x, y)$ at a pixel position $(x, y)$ is linearly proportional to the incident X-ray intensity.
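To make the signal dependence of equation (1) concrete, the sketch below evaluates the probability mass function numerically and recovers its mean and variance, both of which equal $λ$, so the signal-to-noise ratio grows as $\sqrt{λ}$. The helper names and the example photon counts are assumptions chosen for illustration.

```python
import math

def poisson_pmf(k: int, lam: float) -> float:
    """P(k) = lam**k * exp(-lam) / k!, evaluated in log space for stability."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))

def poisson_moments(lam: float, k_max: int = 400):
    """Sum the PMF over k = 0..k_max to recover the mean and the variance."""
    probs = [poisson_pmf(k, lam) for k in range(k_max + 1)]
    mean = sum(k * p for k, p in enumerate(probs))
    var = sum((k - mean) ** 2 * p for k, p in enumerate(probs))
    return mean, var

# Hypothetical expected photon counts for a dark, a medium and a bright pixel.
for lam in (4.0, 25.0, 100.0):
    mean, var = poisson_moments(lam)
    snr = mean / math.sqrt(var)
    print(f"lambda={lam:6.1f}  mean={mean:.3f}  var={var:.3f}  SNR={snr:.2f}")
```

Pixels that receive few photons therefore have a lower signal-to-noise ratio than pixels that receive many, even though their absolute noise variance is smaller.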

Although the true photon count $k$ cannot be directly measured, the signal intensity can be estimated from the original image pixel values $I(x, y)$. Physically, the noise-free pixel value $I(x, y)$ is proportional to the expected number of incident photons $\lambda(x, y)$ at that location as follows:

$$I(x, y) = g \cdot \lambda(x, y) \tag{2}$$

where $g$ represents the system gain coefficient. To construct the algorithmic model, the following assumption is made: the original image data $I(x, y)$ is normalized, and its value directly corresponds to the expected photon count. This is equivalent to setting the gain $g = 1$, yielding the relation:

$$I(x, y) = \lambda(x, y) \tag{3}$$

Therefore, the observed image $I_{p}$ is a random variable corrupted by Poisson noise. It is described by the following imaging model:

$$I_{p}(x, y) \sim \mathrm{Poisson}(I(x, y)) \tag{4}$$

The expected value $E[I_{p}]$ is defined as the average outcome from infinitely repeated measurements under identical imaging conditions, that is, with a fixed $I(x, y)$. This is expressed as:

$$E[I_{p}(x, y)] = I(x, y) \tag{5}$$

To enable finer control over noise intensity, a dimensionless scaling factor $\alpha(x, y)$ is introduced. This factor adjusts the strength of Poisson noise within local regions. It is computed based on local image signal characteristics and modulates the expected parameter of the Poisson distribution:

$$I_{p}(x, y) \sim \mathrm{Poisson}(\alpha(x, y) \cdot I(x, y)) \tag{6}$$

where $\alpha(x, y)$ is a dimensionless scalar coefficient. At any pixel position $(x, y)$, its value is determined by the surrounding image signal. This coefficient directly controls the absolute noise strength in the model. The value of $\alpha$ is computed by analyzing the normalized average intensity of local image patches. It is inversely proportional to the signal strength, thereby amplifying noise effects in weak-signal regions and suppressing noise in strong-signal areas. A normalization step is introduced to maintain the expected signal value while adjusting noise intensity:

$$I_{p}(x, y) = \frac{\mathrm{Poisson}(\alpha(x, y) \cdot I(x, y))}{\alpha(x, y)} \tag{7}$$

This modeling approach preserves the signal-dependent characteristics of noise while enabling locally adaptive intensity control through $\alpha(x, y)$.
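The effect of the normalization in equation (7) can be checked with a short derivation. It is a sketch in which the pixel coordinates are dropped for brevity and $K \sim \mathrm{Poisson}(\alpha I)$ is an auxiliary symbol introduced here only for illustration, so that $I_{p} = K / \alpha$ and $E[K] = \mathrm{Var}[K] = \alpha I$.

$$\begin{aligned} E[I_{p}] &= \frac{E[K]}{\alpha} = I, \\ \mathrm{Var}[I_{p}] &= \frac{\mathrm{Var}[K]}{\alpha^{2}} = \frac{I}{\alpha}, \\ \mathrm{SNR} &= \frac{E[I_{p}]}{\sqrt{\mathrm{Var}[I_{p}]}} = \sqrt{\alpha I}. \end{aligned}$$

The normalization therefore leaves the expected signal unchanged, consistent with equation (5), while $\alpha$ rescales the variance: after division by $\alpha$, values of $\alpha$ below 1 increase the noise relative to plain Poisson sampling and values above 1 decrease it.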

In addition to Poisson noise, X-ray imaging systems also exhibit electronic noise that primarily follows an additive Gaussian distribution. This noise typically originates from three sources: sensor thermal noise, amplifier circuit noise, and quantization noise during digitization. These components collectively behave as spatially uniform, signal-independent noise. Let $\mu$ represent the mean and $\sigma$ the standard deviation of the Gaussian noise. The resulting pixel value $I_{g}(x, y)$ affected by Gaussian noise can then be expressed as:

$$I_{g}(x, y) = I(x, y) + N(x, y) \tag{8}$$

where $N(x, y) \sim \mathcal{N}(\mu, \sigma^{2})$ follows a normal distribution with mean $\mu$ and standard deviation $\sigma$. The standard deviation $\sigma$ of the Gaussian noise is adaptively adjusted based on imaging conditions to simulate varying noise levels across different detectors and environments. By integrating both Poisson and Gaussian noise models, a joint noise model is established in which the Poisson-corrupted image $I_{p}$ takes the place of the clean image $I$ in equation (8). This model produces the final degraded X-ray image:

$$I_{g}(x, y) = I_{p}(x, y) + N(x, y) \tag{9}$$

This noise modeling approach effectively captures the combined influence of signal-dependent Poisson noise and additive electronic noise in X-ray imaging. The resulting synthetic data provides more realistic noise representations for developing noise reduction algorithms. Figure 2 demonstrates the effect of three noise models.
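The synthesis procedure of equations (7) and (9) can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in this paper: the mapping from local mean intensity to $α$ (`base_count / (local_mean + eps)`), the patch radius, and all default parameter values are hypothetical, and the image is a plain list of rows with values normalized to [0, 1].

```python
import math
import random

def sample_poisson(lam, rng):
    """Draw one Poisson(lam) sample with Knuth's multiplication method.

    For large lam the method underflows, so a rounded normal approximation
    is used instead (an approximation chosen for this sketch).
    """
    if lam <= 0.0:
        return 0
    if lam > 500.0:
        return max(0, round(rng.gauss(lam, math.sqrt(lam))))
    limit = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def local_mean(img, y, x, radius):
    """Mean intensity of the patch around (y, x), clipped at the border."""
    h, w = len(img), len(img[0])
    vals = [img[j][i]
            for j in range(max(0, y - radius), min(h, y + radius + 1))
            for i in range(max(0, x - radius), min(w, x + radius + 1))]
    return sum(vals) / len(vals)

def add_xray_noise(clean, base_count=50.0, eps=0.1, mu=0.0, sigma=0.02,
                   radius=1, seed=0):
    """Apply normalized Poisson noise, eq. (7), then Gaussian noise, eq. (9)."""
    rng = random.Random(seed)
    noisy = []
    for y, row in enumerate(clean):
        noisy_row = []
        for x, value in enumerate(row):
            # Hypothetical alpha: inversely proportional to the local mean.
            alpha = base_count / (local_mean(clean, y, x, radius) + eps)
            i_p = sample_poisson(alpha * value, rng) / alpha
            noisy_row.append(i_p + rng.gauss(mu, sigma))
        noisy.append(noisy_row)
    return noisy

clean = [[0.5] * 20 for _ in range(20)]
noisy = add_xray_noise(clean)
```

Averaging the noisy pixels of the flat test image gives a value close to 0.5, reflecting the fact that the division by $α$ keeps the expected signal unchanged when the Gaussian mean is zero.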

Figure 2.

The results of different noises on the original image. (a) Original image, (b) Poisson noise, (c) Gaussian noise, (d) Combined noise.

Network structure

During X-ray image acquisition, image quality is often degraded by Poisson noise and Gaussian noise. To address these two noise types, the XDenoiser network proposed in this paper models the degradation characteristics caused by each of them with a dual-branch parallel structure. Figure 3 illustrates the XDenoiser architecture, which comprises three core modules: (1) Global Linear Attention Branch (GLAB), (2) Local Convolution Branch (LCB), and (3) Dynamic Fusion (DF). The GLAB centers on the RWKV module with linear complexity. It has global modeling capabilities similar to those of a Transformer and can capture long-range dependencies in images within a large receptive field. This gives it an inherent advantage in restoring global structure damaged by Poisson noise. For instance, in weak-signal regions where Poisson noise masks the complete object contour, GLAB can reconstruct missing structural features using contextual information. The LCB is based on a CNN architecture. It uses multilayer local convolution operations to enhance image details. Since Gaussian noise is a local spatial disturbance, LCB can effectively remove high-frequency noise and restore texture details through efficient local smoothing and edge preservation. The DF module coordinates complementary information from the two branches. By learning a content-dependent pixel-level weight map, it dynamically adjusts the contributions of GLAB and LCB to handle different regional noise types. For example, in areas with a clear global structure but local interference, the module favors the LCB output. In areas with a blurred structure, it relies more on GLAB for structural restoration.

Figure 3.

Overview of the XDenoiser architecture.

Global linear attention branch

The attention mechanism has demonstrated strong performance in computer vision (CV) and natural language processing (NLP). However, its scalability is constrained by the quadratic computational complexity of the self-attention mechanism in Transformers. Recent studies have explored linear-complexity operators^18,35 to optimize global attention mechanisms. The Receptance Weighted Key Value (RWKV) model,¹⁸ originally developed for NLP, serves as an efficient alternative to Transformers. Unlike standard Transformers,³⁶ RWKV combines the linear complexity of recurrent neural networks (RNNs) with the parallel processing benefits of Transformers. Recently, Vision RWKV³⁷ extended this architecture from NLP to visual tasks, achieving superior performance over Vision Transformers while maintaining lower computational complexity. The linear complexity enables activation across a broader range of pixels, making it well-suited for image denoising tasks. The proposed GLAB framework is illustrated in Figure 4. The global branch follows a three-level U-shaped encoder-decoder architecture. Initially, shallow base features are extracted using a $3 \times 3$ convolution. Subsequently, spatial information is progressively compressed through a three-level downsampling encoder, while global dependencies are modeled. The X-RWKV block substantially reduces computational complexity by replacing the conventional attention mechanism with a recurrent weight kernel. In the decoder, a symmetrical upsampling structure gradually restores fine details. Multi-scale encoder features are fused via cross-level skip connections to enhance high-frequency reconstruction. Finally, deep features are mapped into a residual output, and the restored result is generated through global residual learning. The structure of the X-RWKV block is illustrated in Figure 5. The X-RWKV block follows the original RWKV architecture,¹⁹ consisting of two key components: a spatial mixing block and a channel mixing block. The spatial mixing block enables feature interaction across spatial dimensions, while the channel mixing block performs feature fusion along channel dimensions. Two residual connections further enhance feature representation. The enhanced features are computed as:

$$\begin{aligned} I_{s} &= I_{in} + \mathrm{SpatialMix}(I_{in}) \\ I_{\text{X-RWKV}} &= I_{s} + \mathrm{ChannelMix}(I_{s}) \end{aligned} \tag{10}$$

where $I_{in}$ denotes the input feature map, $I_{s}$ represents the spatially mixed features, and $I_{\text{X-RWKV}}$ is the final output. The operators $\mathrm{SpatialMix}(\cdot)$ and $\mathrm{ChannelMix}(\cdot)$ perform spatial and channel mixing operations, respectively.
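To illustrate why an RWKV-style operator scales linearly with sequence length, the sketch below compares a direct quadratic evaluation of the weighted key-value (WKV) aggregation with an equivalent recurrence that keeps two running sums. It uses the causal, single-channel form from the RWKV language model¹⁸ as an assumption; the spatial mixing inside the X-RWKV block is not specified at this level of detail here and may differ (for example, by aggregating in both directions). The function names, the decay `w`, and the bonus `u` are illustrative, and no numerical stabilization is applied.

```python
import math

def wkv_quadratic(k, v, w, u):
    """Direct evaluation: each position sums over all earlier ones, O(T^2)."""
    out = []
    for t in range(len(k)):
        bonus = math.exp(u + k[t])           # extra weight on the current token
        num, den = bonus * v[t], bonus
        for i in range(t):
            weight = math.exp(-(t - 1 - i) * w + k[i])
            num += weight * v[i]
            den += weight
        out.append(num / den)
    return out

def wkv_linear(k, v, w, u):
    """Same output using two decayed running sums, O(T)."""
    out = []
    a = b = 0.0                              # sums of exp(k_i)*v_i and exp(k_i)
    decay = math.exp(-w)
    for k_t, v_t in zip(k, v):
        bonus = math.exp(u + k_t)
        out.append((a + bonus * v_t) / (b + bonus))
        a = decay * a + math.exp(k_t) * v_t  # fold the current token into the past
        b = decay * b + math.exp(k_t)
    return out

keys = [0.1, -0.3, 0.5, 0.0]
values = [1.0, 2.0, 3.0, 4.0]
slow = wkv_quadratic(keys, values, w=0.5, u=0.2)
fast = wkv_linear(keys, values, w=0.5, u=0.2)
```

Both functions return the same sequence, but the recurrent form touches each position once, which is the property that makes global aggregation affordable on long flattened pixel sequences.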

Figure 4.

The global linear attention branch architecture.

Figure 5.

The X-RWKV block structure.

Local convolution branch

Restore RWKV¹⁹ employs a single-branch encoder-decoder architecture. Its core components include RWKV spatial mixing and channel mixing modules. Efficient recurrent attention computation is achieved through custom CUDA kernels. A key advantage of this model is its use of RWKV units for global dependency modeling and efficient feature propagation. Additionally, an OmniShift module enables multi-scale convolutional reparameterization to expand the spatial receptive field. However, three main limitations were identified. First, the single-path design requires the RWKV module to handle both global features and local noise modeling, lacking specialized mechanisms for noise estimation and detail recovery. Second, the large parameter count reduces computational efficiency. Third, training instability can occur with larger batch sizes. To address these limitations, significant improvements were made to the Restore RWKV framework. A parallel convolutional branch was introduced, comprising a noise perception module and a detail enhancement denoising module. This branch specializes in estimating input image noise distribution and restoring local textures and high-frequency details. Combined with the original RWKV backbone, this forms a global-local dual-branch architecture that effectively compensates for the original model’s shortcomings in local noise detail modeling. Additionally, a computationally efficient and easily optimizable convolutional structure is adopted to balance denoising performance and computational cost, ensuring better global-local information trade-offs. Consequently, a Convolutional Denoising Network is proposed, as shown in Figure 6, comprising a Noise-Aware Module and a Detail-Enhanced Denoising Module, designed to improve both local feature extraction and overall denoising efficacy.

Figure 6.

Overview of the local convolution branch architecture.

The Detail-Enhanced Denoising Module employs a U-Net architecture to extract multi-scale features. By integrating skip connections and local convolutional operations, it strengthens local dependencies while preserving texture details, enabling high-quality denoising. A key challenge in denoising is the variability of noise levels across input images. Fixed-parameter denoising methods often struggle to adapt to different noise intensities, resulting in either insufficient denoising (noise residue) or over-smoothing (texture loss). To address this, a Fully Convolutional Network (FCN) is introduced, utilizing a lightweight pixel-wise regression network to dynamically estimate noise levels for adaptive denoising.

$$N = f_{FCN}(I) \tag{11}$$

where $I$ is the input noisy image, $f_{FCN}$ represents the noise perception network, and $N$ is the estimated noise level map. $N$ is then concatenated with the original input $I$ along the channel dimension to generate the enhanced input:

$$I' = \mathrm{Concat}(N, I) \tag{12}$$

Thus, the approach enables the network to adaptively adjust its processing strategy during denoising. Regions with higher noise levels receive more aggressive denoising, while low-noise areas preserve finer details through gentler processing.

The detail-enhanced denoising branch employs a modified U-Net architecture for denoising. Unlike conventional U-Net designs, this implementation uses only two downsampling stages to minimize high-resolution information loss while maintaining an adequate receptive field. In the encoder pathway, local features are extracted through $3 \times 3$ convolutions, complemented by pooling operations to capture global context. The decoder pathway utilizes transposed convolutions for upsampling, with skip connections preserving fine-grained details from shallow layers. This design ensures the direct influence of low-level local features on the final denoising output. The complete U-Net computational flow operates as follows:

$$\begin{aligned} F_{enc} &= f_{enc}(I') \\ F_{dec} &= f_{dec}(F_{enc}) \end{aligned} \tag{13}$$

where $f_{enc}$ and $f_{dec}$ denote the mapping functions of the U-Net encoder and decoder, respectively, and $I'$ is the enhanced input from equation (12). The denoising output follows a residual learning paradigm, where the denoising branch’s output $D$ is added to the original input $I$:

$$Y = D + I \tag{14}$$

where $D = f_{\text{U-Net}}(I')$ and $f_{\text{U-Net}}$ denotes the U-Net denoising branch. The residual learning approach minimizes alterations to low-frequency components, enabling the network to concentrate primarily on noise removal. This strategy effectively preserves the original image’s structural details.
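The data flow of equations (11) to (14) can be summarized in a few lines. In the sketch below the two networks are passed in as callables, and the stand-ins `constant_noise_estimate` and `subtract_noise_level` are hypothetical placeholders that only make the example runnable; they do not represent the trained noise perception network or the U-Net.

```python
def local_branch(image, estimate_noise, denoise):
    """Data flow of eqs. (11) to (14) for a single-channel image (list of rows)."""
    noise_map = estimate_noise(image)    # eq. (11): per-pixel noise level map
    enhanced = [noise_map, image]        # eq. (12): stack along the channel axis
    correction = denoise(enhanced)       # residual D predicted by the denoiser
    # eq. (14): add the predicted residual back onto the noisy input
    return [[d + x for d, x in zip(d_row, x_row)]
            for d_row, x_row in zip(correction, image)]

# Hypothetical stand-ins so that the sketch runs without a trained network.
def constant_noise_estimate(image, level=0.1):
    return [[level for _ in row] for row in image]

def subtract_noise_level(channels):
    noise_map, _image = channels
    return [[-n for n in row] for row in noise_map]

noisy_patch = [[0.6, 0.4], [0.5, 0.7]]
restored_patch = local_branch(noisy_patch, constant_noise_estimate,
                              subtract_noise_level)
```

Because the denoiser only predicts a correction, a zero correction returns the input unchanged, which is the sense in which residual learning leaves low-frequency content largely untouched.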

Dynamic fusion module

To fully utilize the complementary advantages of different feature representations in the dual branch structure, a lightweight adaptive fusion module was designed for weighted fusion of the output results of the two denoising branches, as shown in Figure 7. This module can adaptively generate the weight map based on the input image features and achieve fine dynamic feature integration. Compared to the fixed convolutional fusion in Restore RWKV, the proposed fusion mechanism significantly enhances model generalization and stability across diverse noise scenarios. First, the outputs of the two branches are concatenated into a channel tensor $F_{c o n c a t}$ .

F_{c o n c a t} = C o n c a t (F_{G L A B}, F_{L C B}) \in R^{2 C \times H \times W}

(15)

Subsequently, the tensor is fed into a three-layer convolutional network for feature extraction and weight prediction. A fusion weight map for the two channels is then output.

\begin{aligned} W & = C o n v_{1 \times 1} (S i L U (C o n v_{3 \times 3} (S i L U (C o n v_{3 \times 3} (F_{c o n c a t}))))) \\ W & = S o f t m a x (W) \end{aligned}

(16)

$w_{1}$ and $w_{2}$ are obtained by splitting $W$ along the channel dimension, serving as dynamic weighting coefficients for the global and local branch outputs, respectively. The final fused output is obtained by weighting the outputs of the two branches pixel by pixel:

F_{f u s e d} = w_{1} \cdot F_{G L A B} + w_{2} \cdot F_{L C B}

(17)

The proposed fusion mechanism enables dynamic balancing between detail enhancement and global structure restoration, while simultaneously improving adaptation to diverse noise patterns. Experimental results demonstrate significant performance gains across multiple denoising metrics after incorporating this module.
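The weighting step in equations (16) and (17) can be illustrated with a minimal NumPy sketch. It assumes that the softmax is taken across the two weight channels, so that $w_{1} + w_{2} = 1$ at every pixel, and it takes the raw two-channel output of the weight-prediction network as a given array instead of implementing the convolutions. The names `dynamic_fusion` and `logits` are hypothetical.

```python
import numpy as np

def softmax(z, axis=0):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def dynamic_fusion(f_glab, f_lcb, logits):
    """Fuse two branch outputs with per-pixel weights.

    f_glab, f_lcb : arrays of shape (C, H, W), the branch outputs.
    logits        : array of shape (2, H, W), standing in for the raw
                    output of the weight-prediction network (assumption).
    """
    w = softmax(logits, axis=0)      # assumed: softmax over the 2 channels
    w1, w2 = w[0:1], w[1:2]          # (1, H, W), broadcast over C
    fused = w1 * f_glab + w2 * f_lcb
    return fused, w
```

With equal logits the module reduces to a plain average of the two branches, while a strongly dominant logit makes the fused output follow one branch almost exclusively.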

Figure 7.

Flowchart of the dynamic fusion module, which takes the outputs of the global and local branches as input and generates the fusion weights for the two branches.

Loss function

To optimize the network’s denoising performance, the loss function incorporates errors from the global branch, local branch, and final fused output. The total loss is defined as follows:

L = L_{G L A B} + L_{L C B} + L_{f u s e d}

(18)

where $L_{G L A B} = ‖ X_{G L A B} - X_{g t} ‖_{1}$ denotes the $L_{1}$ loss of the global linear attention branch, which enhances global denoising capability, and $L_{f u s e d} = ‖ X_{f u s e d} - X_{g t} ‖_{1}$ optimizes the fused output’s denoising ability. $L_{L C B}$ consists of three components: pixel-level error, adaptive noise estimation error, and a gradient regularization term, expressed as:

L_{L C B} = ‖ X_{f u s e d} - X_{g t} ‖_{2}^{2} + 0.5 L_{a s y m} + 0.05 L_{t v}

(19)

where $L_{a s y m}$ adaptively weights the noise estimation error. In practice, underestimating noise levels may result in insufficient denoising, leaving residual noise in the output. Thus, the loss function imposes a higher penalty on underestimation errors, defined as:

\begin{aligned} L_{asym} = {} & E [γ \cdot | 0.3 - I (X_{est\_noise} < X_{gt\_noise}) | \\ & \cdot (X_{gt\_noise} - X_{est\_noise})^{2}] \end{aligned}

(20)

where $X_{gt\_noise}$ and $X_{est\_noise}$ represent the true and estimated noise, respectively. $I$ is an indicator function that returns $1$ if the estimated noise is below the true noise, and $0$ otherwise. A balancing factor ensures stronger penalties for underestimation (a weight of $| 0.3 - 1 | = 0.7$ versus $0.3$ for overestimation), encouraging the network to avoid underestimating noise levels during training and improving denoising performance. Additionally, $L_{t v}$ is a total variation (TV) regularization term that enhances output smoothness:

\begin{aligned} L_{tv} = {} & \frac{1}{N_{h}} \sum_{i, j} (x_{est\_noise, i + 1, j} - x_{est\_noise, i, j})^{2} \\ & + \frac{1}{N_{w}} \sum_{i, j} (x_{est\_noise, i, j + 1} - x_{est\_noise, i, j})^{2} \end{aligned}

(21)

where $N_{h}$ and $N_{w}$ denote the image height and width, respectively. The TV constraint promotes smooth noise estimation by penalizing gradient variations in horizontal and vertical directions, reducing high-frequency noise artifacts and improving the denoised image’s visual quality.
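The local-branch loss of equations (19) to (21) can be sketched in NumPy as follows. The sketch weights underestimation (estimate below the true noise) by $0.7$ and overestimation by $0.3$, in line with the stated aim of penalizing underestimation more heavily. Several choices are assumptions, not taken from the paper: the expectation is computed as a mean over pixels, the value of $γ$ is not given and defaults to `1.0`, inputs are single-channel 2-D arrays, and all function names are hypothetical.

```python
import numpy as np

def asymmetric_loss(est_noise, gt_noise, alpha=0.3, gamma=1.0):
    """Asymmetric noise-estimation loss; gamma=1.0 is an assumed default."""
    under = (est_noise < gt_noise).astype(np.float64)  # 1 where underestimated
    weight = np.abs(alpha - under)                     # 0.7 if under, else 0.3
    return float(np.mean(gamma * weight * (gt_noise - est_noise) ** 2))

def tv_loss(est_noise):
    """Total variation of a 2-D noise map, normalized by height and width."""
    n_h, n_w = est_noise.shape
    d_vert = est_noise[1:, :] - est_noise[:-1, :]      # (i + 1, j) - (i, j)
    d_horz = est_noise[:, 1:] - est_noise[:, :-1]      # (i, j + 1) - (i, j)
    return float(np.sum(d_vert ** 2) / n_h + np.sum(d_horz ** 2) / n_w)

def local_branch_loss(x_out, x_gt, est_noise, gt_noise):
    """Pixel term plus 0.5 * asymmetric term plus 0.05 * TV term."""
    pixel = float(np.sum((x_out - x_gt) ** 2))         # squared L2 norm
    return (pixel
            + 0.5 * asymmetric_loss(est_noise, gt_noise)
            + 0.05 * tv_loss(est_noise))
```

For the same absolute error, an underestimate therefore costs $0.7 / 0.3 \approx 2.3$ times as much as an overestimate.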

Experimental results and analyses

Dataset setting

Benchmark dataset: This paper utilizes three publicly available datasets: GDXRay,³⁸ HiXray,³⁹ and SIXray.⁴⁰ GDXRay is a widely recognized benchmark library of X-ray images for industrial applications. It offers diverse cross-domain samples suitable for multi-target detection tasks in security screening scenarios. The dataset comprises five distinct categories: castings, welds, baggage, natural objects, and scenes. The baggage subset, consisting of $8,150$ X-ray images organized into $77$ series, was selected for training. The HiXray dataset focuses on prohibited item detection and identification in security screening scenarios. It offers higher image resolution than GDXRay and contains more complex occlusions and multi-object scenes. These characteristics better represent the noise distributions and texture features encountered in real security inspection environments. SIXray is a large-scale security X-ray benchmark comprising over one million images. This dataset exhibits significant class imbalance and complex overlapping structures. Such properties present substantial challenges for verifying model robustness under high-noise conditions and complex occlusions.

Training and test sets: For GDXRay, the training set comprises $8,002$ randomly selected baggage X-ray images from the Baggages catalog. These images cover diverse angles, densities, and occlusion conditions. To enhance model robustness, the training data were augmented via rotation and scaling. All images were then resized to a uniform resolution of $256 \times 256$ pixels. The test set contains $179$ randomly chosen casting defect detection images from the Castings catalog. Because HiXray and SIXray use pseudocolor mapping, all of their images are converted to grayscale during data preprocessing. For HiXray, $7,993$ images are extracted from the original training set. These images cover diverse luggage combinations involving different angles, densities, and occlusion conditions. A center crop is applied to achieve a resolution of $256 \times 256$ pixels, removing redundant white borders. The test set consists of $500$ images selected from the original test directory. For SIXray, $8,325$ images are extracted from the raw data and divided into $8,000$ images for the training set and $325$ for the test set. All images are resized to a uniform resolution of $256 \times 256$ pixels.

Noise settings: As shown in Figure 8, six different noise configurations are used to simulate the complex noise characteristics commonly found in real X-ray images. Each configuration includes a combination of Gaussian noise and Poisson noise. The Gaussian noise follows a normal distribution with a mean of 0 and standard deviations of $σ = 10$ , $25$ , and $50$ , representing three noise levels. The intensity of the Poisson noise is controlled by the photon count, denoted as $n$ . Two typical values are selected for $n$ , namely $0.05$ and $5$ , representing high-intensity and low-intensity Poisson noise, respectively. A smaller $n$ corresponds to a lower photon count and thus higher Poisson noise intensity. Specifically, $n = 0.05$ simulates a low-photon statistical condition, while $n = 5$ corresponds to a conventional condition. According to the Poisson noise calculation formula described in Section “Modelling X-ray image noise”, an excessively small normalization parameter $λ$ may cause severe image distortion, negatively impacting both training and evaluation. Conversely, an excessively large $λ$ results in images that are nearly noise-free, offering limited value for denoising tasks. The values $0.05$ and $5$ strike a balance between noise intensity and image structure preservation, enabling the model to perform robustly across a variety of noise scenarios. Six distinct noise environments are constructed by combining three Gaussian noise levels with two Poisson noise levels. Synthetic noisy X-ray images are generated using the hybrid Poisson-Gaussian noise model described in Section “Modelling X-ray image noise”.
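The six noise environments can be sketched as follows. The exact formula is given in Section “Modelling X-ray image noise”, which is not reproduced here, so the sketch assumes a common form of the hybrid model: the clean intensity is scaled by $n$, Poisson-sampled, rescaled by $1 / n$, and zero-mean Gaussian noise with standard deviation $σ$ is then added. The $[0, 255]$ pixel range, the clipping step, and the names `add_poisson_gaussian_noise` and `NOISE_CONFIGS` are likewise assumptions.

```python
import numpy as np

# Three Gaussian levels combined with two Poisson levels: six settings.
NOISE_CONFIGS = [(sigma, n) for sigma in (10, 25, 50) for n in (0.05, 5)]

def add_poisson_gaussian_noise(clean, n, sigma, rng):
    """Corrupt a clean image with assumed Poisson-Gaussian hybrid noise.

    clean : array with values in [0, 255] (assumed range).
    n     : photon-count factor; smaller n gives stronger Poisson noise.
    sigma : standard deviation of the additive Gaussian noise.
    rng   : numpy.random.Generator used for reproducibility.
    """
    clean = np.asarray(clean, dtype=np.float64)
    photons = rng.poisson(n * clean)            # signal-dependent counts
    poisson_part = photons / n                  # back to intensity scale
    gaussian_part = rng.normal(0.0, sigma, size=clean.shape)
    return np.clip(poisson_part + gaussian_part, 0.0, 255.0)
```

Under this form the Poisson component has variance $x / n$ at clean intensity $x$, which is why $n = 0.05$ yields far stronger signal-dependent noise than $n = 5$.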

Figure 8.

Examples of the six noise settings.

Evaluation metrics

Objective evaluation metrics play a critical role in quantifying the performance of image denoising algorithms. This paper employs three core metrics: Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index (SSIM), and Root Mean Square Error (RMSE). These metrics assess denoising performance from different perspectives, including pixel-level accuracy, structural fidelity, and perceptual quality.

Peak Signal-to-Noise Ratio (PSNR): PSNR quantifies the pixel-level difference between a denoised image and its clean reference by calculating the Mean Squared Error (MSE). It is defined as:

\begin{aligned} M S E & = \frac{1}{N} \sum_{i = 1}^{N} (x_{i} - y_{i})^{2} \\ PSNR & = 10 \cdot \log_{10} (\frac{M A X^{2}}{M S E}) \end{aligned}

(22)

where $M A X$ denotes the maximum possible pixel value (e.g., $255$ for $8$ -bit images), and $x_{i}$ and $y_{i}$ represent the pixel values of the denoised image and the reference image, respectively. A higher PSNR value (measured in decibels, dB) indicates lower image distortion and better denoising performance.
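Equation (22) translates directly into a few lines of NumPy. The sketch below is illustrative; the function name `psnr` and the default `max_val=255.0` (8-bit images) are assumptions.

```python
import numpy as np

def psnr(denoised, reference, max_val=255.0):
    """Peak signal-to-noise ratio in dB between two same-sized images."""
    # Cast to float first so that uint8 inputs do not wrap around
    # when the difference is taken.
    x = np.asarray(denoised, dtype=np.float64)
    y = np.asarray(reference, dtype=np.float64)
    mse = np.mean((x - y) ** 2)
    if mse == 0:
        return float("inf")          # identical images
    return float(10.0 * np.log10(max_val ** 2 / mse))
```

Because of the logarithm, halving the MSE raises the PSNR by about $3$ dB.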

Structural Similarity Index (SSIM): SSIM measures the similarity between a denoised image and a reference image by evaluating luminance, contrast, and structural information. This metric is designed to align more closely with human visual perception. It is defined as:

SSIM (x, y) = \frac{(2 μ_{x} μ_{y} + C_{1}) (2 σ_{x y} + C_{2})}{(μ_{x}^{2} + μ_{y}^{2} + C_{1}) (σ_{x}^{2} + σ_{y}^{2} + C_{2})}

(23)

where $μ_{x}$ and $μ_{y}$ are the local means of the two images, $σ_{x}$ and $σ_{y}$ are the standard deviations, and $σ_{x y}$ is the covariance. $C_{1}$ and $C_{2}$ are stability constants that prevent division by zero. SSIM ranges from $0$ to $1$ , where values closer to $1$ indicate higher structural similarity. This metric is particularly sensitive to high-frequency information such as edges and textures, making it effective for evaluating perceptual image quality. However, SSIM has relatively high computational complexity compared to simple pixel-based metrics.
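Equation (23) can be illustrated with a simplified sketch that evaluates the formula once over the whole image. Standard SSIM implementations instead evaluate it in local sliding windows and average the resulting map, so this is not the windowed metric itself. The constants $C_{1} = (0.01 \cdot MAX)^{2}$ and $C_{2} = (0.03 \cdot MAX)^{2}$ are the conventional defaults and are an assumption here, as the paper does not state its values; the name `ssim_global` is hypothetical.

```python
import numpy as np

def ssim_global(x, y, max_val=255.0, k1=0.01, k2=0.03):
    """SSIM formula evaluated on whole-image statistics (single window)."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    c1 = (k1 * max_val) ** 2                    # assumed conventional constant
    c2 = (k2 * max_val) ** 2                    # assumed conventional constant
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = np.mean((x - mu_x) * (y - mu_y))
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return float(numerator / denominator)
```

The first factor compares mean luminance and the second combines contrast and structure, so additive noise lowers the score mainly through the inflated variance in the denominator.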

Root Mean Square Error (RMSE): RMSE quantifies the overall pixel-wise error by computing the square root of the mean squared differences between the denoised image and the reference image. It is mathematically defined as:

RMSE (X, Y) = \sqrt{\frac{1}{N} \sum_{i = 1}^{N} (x_{i} - y_{i})^{2}}

(24)

where $N$ represents the total number of pixels in the image. The variables $x_{i}$ and $y_{i}$ denote the pixel values at corresponding positions within the two images being compared. Lower RMSE values indicate that the denoising result is closer to the ground truth. This metric provides an intuitive measure of the global error magnitude. However, the RMSE cannot distinguish between different types of errors, such as blurring and residual noise. Therefore, it should be combined with other metrics for comprehensive evaluation.
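A minimal sketch of equation (24) follows, together with the identity $PSNR = 20 \log_{10} (M A X / RMSE)$, which follows from equations (22) and (24) for a single image pair. The function names are hypothetical. The identity holds per image; it does not carry over to dataset averages, because the mean of per-image PSNR values generally differs from the PSNR computed from the mean RMSE.

```python
import numpy as np

def rmse(denoised, reference):
    """Root mean square error between two same-sized images."""
    x = np.asarray(denoised, dtype=np.float64)
    y = np.asarray(reference, dtype=np.float64)
    return float(np.sqrt(np.mean((x - y) ** 2)))

def psnr_from_rmse(rmse_value, max_val=255.0):
    """Per-image PSNR in dB from a nonzero per-image RMSE."""
    return float(20.0 * np.log10(max_val / rmse_value))
```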

Experimental setup

The model is trained with an initial learning rate of $1.6 \times 10^{- 3}$ and a batch size of $64$ . The learning rate decreases progressively during training. XDenoiser is optimized using the AdamW optimizer for $100$ epochs. All experiments are conducted on a server equipped with an NVIDIA A800 80G GPU and an Intel Xeon Gold 6346 CPU.

Ablation study

A set of ablation studies was designed to evaluate the impact of the local branch and fusion modules on model performance. Three key modules, namely the Noise-Aware Module (NAM), the Detail-Enhanced Denoising Module (DEDM), and the Dynamic Fusion Module (DFM), were sequentially removed. The performance of each ablated model was then compared against the complete model and a benchmark. As shown in Table 1, the absence of any module leads to a decline in overall performance, confirming the indispensability of each component. The most significant drop occurs when the Detail-Enhanced Denoising Module is removed, with performance falling slightly below the benchmark level. This result highlights the distinct roles and collaborative mechanism of the modules. In the proposed architecture, the Noise-Aware Module predicts the noise distribution but does not directly reconstruct the image. The Detail-Enhanced Denoising Module then uses this prior noise information to actively restore key texture and detail components. Removing this module causes the local convolution branch to lose its core denoising capability. The remaining noise estimate from the Noise-Aware Module can no longer be utilized effectively and instead interferes with the global branch’s output via the fusion module, leading to significant performance degradation. This finding underscores the critical role of the detail enhancement module within the denoising pipeline. Furthermore, the Noise-Aware Module enables a more targeted detail enhancement process, as evidenced by the performance drop following its removal. Finally, ablating the Dynamic Fusion Module also resulted in a performance decline, emphasizing its importance as an adaptive coordinator that effectively balances and integrates features from different branches to achieve superior global denoising.

Table 1.

Ablation study on the GDXRay dataset at noise level $σ = 10$ , $n = 0.05$ , and $e p o c h = 45$ .

Base model	NAM	DEDM	DFM	PSNR $↑$	SSIM $↑$	RMSE $↓$
✓				35.85	0.9417	4.3691
✓	✓		✓	35.77	0.9346	4.4077
✓		✓	✓	36.24	0.9417	4.2016
✓	✓	✓		36.17	0.9414	4.2229
✓	✓	✓	✓	$36.65$	$0.9424$	$3.9860$

In summary, these ablation studies clearly demonstrate that the Noise-Aware Module, Detail-Enhanced Denoising Module, and dynamic fusion module perform distinct yet complementary functions, forming the foundation of XDenoiser’s effective denoising capability.

To validate the design of the loss function, ablation experiments were conducted as summarized in Table 2. The results show that removing any loss component—whether $L_{G L A B}$ , $L_{L C B}$ , or $L_{f u s e d}$ —degrades performance on objective metrics such as PSNR and SSIM. These findings confirm that the three loss terms are functionally complementary and mutually reinforcing. Together, they form an integrated framework that enables the model to produce denoised outputs suitable for both visual interpretation and downstream analytical tasks.

Table 2.

Ablation study of the loss functions.

$L_{G L A B}$	$L_{L C B}$	$L_{f u s e d}$	PSNR $↑$	SSIM $↑$	RMSE $↓$
✓	✓		36.24	0.9418	4.1833
✓		✓	36.41	0.9417	4.1030
	✓	✓	36.23	0.9395	4.2054
✓	✓	✓	$36.65$	$0.9424$	$3.9860$

Experimental analyses

XDenoiser is compared with five representative methods, including both model-driven and deep learning-based approaches. For the model-driven method, BM3D²⁶ is selected. For deep learning-based methods, CBDNet,³¹ Restormer,¹⁵ DCANet,⁴¹ and Restore-RWKV¹⁹ are chosen. CBDNet represents convolutional denoising, Restormer is a widely used self-attention-based method, DCANet combines convolution and attention, and Restore-RWKV adopts linear attention mechanisms. Tables 3 to 5 present a quantitative comparison of the proposed XDenoiser against these methods on the GDXRay, SIXray, and HiXray datasets, respectively. The best results are highlighted in bold, while the second-best are underlined.

Table 3.

Comparative results of denoising methods on the GDXRay dataset.

Case	Index	Noisy	BM3D²⁶	CBDNet³¹	DCANet⁴¹	Restormer¹⁵	Restore-RWKV¹⁹	Ours
$σ = 10$ $n = 0.05$	PSNR $↑$	25.93 $\pm 3.3$	30.52 $\pm 4.8$	36.54 $\pm 3.2$	35.77 $\pm 2.6$	35.58 $\pm 2.8$	36.30 $\pm 3.0$	36.68 $\pm 2.9$
	SSIM $↑$	0.4984 $\pm 0.2$	0.7952 $\pm 0.3$	0.9396 $\pm 0.04$	0.9466 $\pm 0.07$	0.9526 $\pm 0.04$	0.9446 $\pm 0.04$	0.9438 $\pm 0.04$
	RMSE $↓$	14.05 $\pm 7.2$	9.3343 $\pm 8.2$	4.0629 $\pm 1.5$	4.3573 $\pm 1.6$	4.4410 $\pm 1.4$	4.1464 $\pm 1.5$	3.9552 $\pm 1.4$
$σ = 10$ $n = 5$	PSNR $↑$	28.02 $\pm 0.49$	38.18 $\pm 2.0$	38.30 $\pm 2.2$	38.34 $\pm 5.1$	39.59 $\pm 2.9$	40.15 $\pm 2.9$	40.52 $\pm 2.9$
	SSIM $↑$	0.5216 $\pm 0.07$	0.9507 $\pm 0.05$	0.9412 $\pm 0.05$	0.9474 $\pm 0.05$	0.9534 $\pm 0.04$	0.9582 $\pm 0.03$	0.9589 $\pm 0.03$
	RMSE $↓$	10.14 $\pm 0.54$	3.2316 $\pm 0.77$	3.2061 $\pm 0.82$	3.8524 $\pm 3.6$	2.8168 $\pm 0.89$	2.6416 $\pm 0.84$	2.5417 $\pm 0.84$
$σ = 25$ $n = 0.05$	PSNR $↑$	17.01 $\pm 1.1$	22.37 $\pm 2.0$	34.52 $\pm 2.8$	34.59 $\pm 2.80$	34.04 $\pm 2.60$	34.40 $\pm 2.80$	34.60 $\pm 2.70$
	SSIM $↑$	0.1360 $\pm 0.05$	0.5306 $\pm 0.2$	0.9263 $\pm 0.05$	0.9355 $\pm 0.04$	0.9224 $\pm 0.05$	0.9272 $\pm 0.04$	0.9293 $\pm 0.05$
	RMSE $↓$	36.24 $\pm 4.5$	19.92 $\pm 4.4$	5.0468 $\pm 1.7$	5.0127 $\pm 1.7$	5.2584 $\pm 1.7$	5.1222 $\pm 1.7$	4.9847 $\pm 1.6$
$σ = 25$ $n = 5$	PSNR $↑$	20.94 $\pm 0.58$	33.91 $\pm 2.6$	36.55 $\pm 2.6$	36.83 $\pm 2.7$	37.07 $\pm 2.8$	36.80 $\pm 2.7$	37.11 $\pm 2.7$
	SSIM $↑$	0.2293 $\pm 0.06$	0.9181 $\pm 0.07$	0.9370 $\pm 0.04$	0.9378 $\pm 0.05$	0.9405 $\pm 0.05$	0.9400 $\pm 0.04$	0.9413 $\pm 0.04$
	RMSE $↓$	22.93 $\pm 1.5$	5.3764 $\pm 1.7$	3.9615 $\pm 1.2$	3.8575 $\pm 1.2$	3.7516 $\pm 1.2$	3.8672 $\pm 1.2$	3.7307 $\pm 1.2$
$σ = 50$ $n = 0.05$	PSNR $↑$	14.15 $\pm 0.75$	25.68 $\pm 2.4$	32.60 $\pm 2.8$	32.58 $\pm 2.8$	32.50 $\pm 2.7$	32.45 $\pm 2.7$	32.76 $\pm 2.6$
	SSIM $↑$	0.0821 $\pm 0.03$	0.7744 $\pm 0.1$	0.9055 $\pm 0.06$	0.9149 $\pm 0.05$	0.9143 $\pm 0.05$	0.9040 $\pm 0.07$	0.9107 $\pm 0.06$
	RMSE $↓$	50.19 $\pm 4.3$	13.74 $\pm 3.6$	6.2894 $\pm 2.1$	6.3269 $\pm 2.2$	6.3446 $\pm 2.0$	6.3958 $\pm 2.1$	6.1531 $\pm 1.9$
$σ = 50$ $n = 5$	PSNR $↑$	15.50 $\pm 0.59$	28.07 $\pm 3.0$	33.82 $\pm 2.8$	33.56 $\pm 2.8$	32.35 $\pm 2.4$	33.59 $\pm 2.7$	33.82 $\pm 2.7$
	SSIM $↑$	0.1032 $\pm 0.04$	0.8636 $\pm 0.08$	0.9150 $\pm 0.06$	0.9269 $\pm 0.04$	0.9029 $\pm 0.06$	0.9122 $\pm 0.06$	0.9157 $\pm 0.06$
	RMSE $↓$	42.90 $\pm 2.9$	10.654 $\pm 3.4$	5.4603 $\pm 1.8$	5.6396 $\pm 1.9$	6.4000 $\pm 1.8$	5.5992 $\pm 1.8$	5.4457 $\pm 1.7$

As shown in Tables 3 to 5, the proposed method demonstrates superior performance across various noise levels compared to mainstream algorithms. It achieves the best or near-optimal results in most test configurations for PSNR, SSIM, and RMSE metrics, indicating strong capabilities in structural consistency preservation and detail restoration. Standard deviation results further confirm the model’s output stability. Across most experimental settings, the proposed method exhibits significantly lower performance fluctuations than other approaches, demonstrating insensitivity to input noise variations and robust generalization characteristics. To validate the reliability of performance improvements, paired t-tests were conducted under identical noise configurations. The performance advantage proves statistically significant ( $p < 0.05$ ) in the vast majority of experimental scenarios, aligning with the observed trends in quantitative results. This indicates that the improvements stem from the model architecture and loss design rather than random variations. As shown in Table 3, in a few low-noise scenarios ( $σ = 10$ , $σ = 25$ , or $σ = 50$ , with $n = 0.05$ ), differences between the proposed method and DCANet in PSNR/SSIM metrics are minimal, with statistical tests showing non-significant differences ( $p > 0.05$ ). This likely occurs because existing methods approach the performance upper bound under mild degradation conditions, making statistical distinctions challenging. Nevertheless, the proposed method consistently achieves results comparable to the best performers even in these extreme scenarios.

Experimental results on the SIXray and HiXray datasets, presented in Tables 4 and 5, involve substantially increased scene complexity and sample diversity compared to GDXRay. These characteristics introduce additional challenges, requiring models to suppress noise while maintaining structural consistency amid complex textures and overlapping objects. Quantitative results show that the proposed method achieves leading performance across almost all noise levels and evaluation metrics (PSNR, SSIM, RMSE), with advantages becoming more pronounced in complex scenarios. This trend indicates enhanced feature separation and detail preservation capabilities when processing actual security inspection images with complex structures and significant signal interference. The method effectively balances global structural consistency with local texture details, maintaining stable output even under high noise and overlapping interference conditions. Collectively, these results demonstrate that the proposed method not only delivers superior denoising performance on simpler datasets but also effectively addresses challenges posed by complex backgrounds and object overlaps in security inspection imagery. This provides a more reliable image quality foundation for subsequent automated recognition and detection tasks in security screening scenarios.
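The paired t-test mentioned above can be sketched with the Python standard library alone. The sketch assumes that the paired samples are per-image metric values (for example, PSNR) of two methods on the same test images under one noise configuration; the paper does not state which implementation was used. The two-sided $p$-value is obtained by numerically integrating the Student $t$ density with Simpson's rule, and all function names are hypothetical.

```python
import math

def t_pdf(x, df):
    """Density of the Student t distribution with df degrees of freedom."""
    log_c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    c = math.exp(log_c) / math.sqrt(df * math.pi)
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def two_sided_p(t, df, steps=2000):
    """Two-sided p-value via Simpson's rule on [0, |t|]; steps must be even."""
    upper = abs(t)
    if upper == 0:
        return 1.0
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, df)
    area = total * h / 3                 # P(0 < T < |t|)
    return max(0.0, 1.0 - 2.0 * area)

def paired_t_test(scores_a, scores_b):
    """Return (t, df, p) for paired samples with non-constant differences."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)   # sample variance
    t = mean / math.sqrt(var / n)
    df = n - 1
    return t, df, two_sided_p(t, df)
```

A difference between two methods would then be reported as significant when the returned $p$ falls below $0.05$.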

Table 4.

Comparative results of denoising methods on the SIXray dataset.

Case	Index	Noisy	BM3D²⁶	CBDNet³¹	DCANet⁴¹	Restormer¹⁵	Restore-RWKV¹⁹	Ours
$σ = 10$ $n = 0.05$	PSNR $↑$	18.03 $\pm 0.36$	18.28 $\pm 0.4$	29.03 $\pm 1.7$	28.95 $\pm 1.7$	28.98 $\pm 1.7$	28.91 $\pm 1.7$	29.12 $\pm 1.7$
	SSIM $↑$	0.2382 $\pm 0.06$	0.2439 $\pm 0.06$	0.8809 $\pm 0.04$	0.8787 $\pm 0.04$	0.8780 $\pm 0.04$	0.8805 $\pm 0.04$	0.8874 $\pm 0.04$
	RMSE $↓$	31.99 $\pm 1.4$	31.119 $\pm 1.5$	9.1838 $\pm 1.7$	9.2726 $\pm 1.7$	9.2405 $\pm 1.7$	9.3099 $\pm 1.8$	9.0865 $\pm 1.7$
$σ = 10$ $n = 5$	PSNR $↑$	27.43 $\pm 0.03$	34.11 $\pm 1.4$	34.89 $\pm 1.6$	35.09 $\pm 1.6$	35.25 $\pm 1.6$	34.96 $\pm 1.5$	35.24 $\pm 1.6$
	SSIM $↑$	0.5639 $\pm 0.07$	0.9519 $\pm 0.02$	0.9532 $\pm 0.01$	0.9557 $\pm 0.01$	0.9576 $\pm 0.01$	0.9554 $\pm 0.02$	0.9575 $\pm 0.01$
	RMSE $↓$	10.833 $\pm 0.04$	5.0836 $\pm 0.8$	4.6645 $\pm 0.8$	4.5597 $\pm 0.8$	4.4784 $\pm 0.8$	4.6227 $\pm 0.8$	4.4873 $\pm 0.8$
$σ = 25$ $n = 0.05$	PSNR $↑$	16.75 $\pm 0.33$	22.30 $\pm 1.3$	28.10 $\pm 1.7$	28.11 $\pm 1.7$	28.15 $\pm 1.7$	28.02 $\pm 1.7$	28.28 $\pm 1.7$
	SSIM $↑$	0.2052 $\pm 0.06$	0.5423 $\pm 0.04$	0.8593 $\pm 0.04$	0.8638 $\pm 0.04$	0.8639 $\pm 0.05$	0.8641 $\pm 0.05$	0.8725 $\pm 0.04$
	RMSE $↓$	37.069 $\pm 1.4$	19.774 $\pm 3.0$	10.220 $\pm 1.9$	10.206 $\pm 1.9$	10.159 $\pm 1.9$	10.313 $\pm 1.9$	10.010 $\pm 1.9$
$σ = 25$ $n = 5$	PSNR $↑$	20.50 $\pm 0.09$	29.69 $\pm 1.5$	30.78 $\pm 1.7$	30.77 $\pm 1.7$	30.84 $\pm 1.7$	30.76 $\pm 1.7$	30.95 $\pm 1.7$
	SSIM $↑$	0.3128 $\pm 0.08$	0.8950 $\pm 0.03$	0.9087 $\pm 0.03$	0.9086 $\pm 0.03$	0.9087 $\pm 0.03$	0.9098 $\pm 0.03$	0.9142 $\pm 0.03$
	RMSE $↓$	24.053 $\pm 0.27$	8.4849 $\pm 1.5$	7.5026 $\pm 1.4$	7.5198 $\pm 1.4$	7.4534 $\pm 1.4$	7.5237 $\pm 1.4$	7.3661 $\pm 1.4$
$σ = 50$ $n = 0.05$	PSNR $↑$	14.11 $\pm 0.25$	23.79 $\pm 0.9$	26.40 $\pm 1.6$	26.35 $\pm 1.6$	26.31 $\pm 1.6$	26.41 $\pm 1.6$	26.60 $\pm 1.7$
	SSIM $↑$	0.1461 $\pm 0.05$	0.7960 $\pm 0.06$	0.8197 $\pm 0.05$	0.8265 $\pm 0.06$	0.8260 $\pm 0.05$	0.8325 $\pm 0.05$	0.8345 $\pm 0.05$
	RMSE $↓$	50.254 $\pm 1.5$	16.559 $\pm 1.6$	12.412 $\pm 2.3$	12.488 $\pm 2.2$	12.537 $\pm 2.2$	12.398 $\pm 2.2$	12.138 $\pm 2.2$
$σ = 50$ $n = 5$	PSNR $↑$	15.48 $\pm 0.20$	25.20 $\pm 1.1$	27.50 $\pm 1.7$	27.44 $\pm 1.7$	27.34 $\pm 1.7$	27.43 $\pm 1.6$	27.61 $\pm 1.7$
	SSIM $↑$	0.1781 $\pm 0.05$	0.8156 $\pm 0.05$	0.8506 $\pm 0.05$	0.8498 $\pm 0.05$	0.8434 $\pm 0.05$	0.8550 $\pm 0.05$	0.8578 $\pm 0.05$
	RMSE $↓$	42.912 $\pm 0.98$	14.122 $\pm 1.8$	10.946 $\pm 2.0$	11.031 $\pm 2.1$	11.148 $\pm 2.1$	11.023 $\pm 2.0$	10.800 $\pm 2.0$

To further validate the denoising effectiveness of each method, visual results are presented in Figures 9, 10, and 11, which show representative results on GDXRay, SIXray, and HiXray, respectively. In the region marked by the black box in Figure 9, numeric characters are displayed. This area is highly susceptible to blurring due to Gaussian-Poisson hybrid noise, making it a challenging region for detail recovery. The comparison shows that BM3D, CBDNet, and Restore-RWKV fail to preserve the character shapes effectively. Their denoised outputs exhibit distorted contours and broken strokes, resulting in illegible digits.

Figure 9.

Noise removal results of the GDXRay dataset at Poisson noise level $n = 5$ and Gaussian noise level $σ = 10$ .

Figure 10.

Noise removal results of the SIXray dataset at Poisson noise level $n = 0.05$ and Gaussian noise level $σ = 25$ .

Figure 11.

Noise removal results of the HiXray dataset at Poisson noise level $n = 5$ and Gaussian noise level $σ = 50$ .

Although DCANet improves edge clarity to some extent, it introduces pseudo-structures and false strokes at certain pixel locations. This adds high-frequency information that was not originally present, compromising the structural authenticity of the image. In contrast, the proposed XDenoiser accurately restores the digit contours and stroke structures without introducing artifacts or false details, while effectively suppressing noise. The numeric regions appear clear and natural, exhibiting higher readability and realism. This improvement is attributed to the dynamic fusion mechanism, which allows global structure modeling and local detail enhancement to complement each other, resulting in more robust feature reconstruction in high-noise areas.

Figure 10 presents denoising results on the SIXray dataset with Poisson noise level $n = 0.05$ and Gaussian noise level $σ = 25$ . The red box highlights a region containing both background and a luggage compartment corner. Comparative analysis reveals that BM3D fails to effectively suppress noise under high Poisson noise conditions, retaining substantial residual noise. Both CBDNet and DCANet exhibit noticeable artifacts in background regions, with the contour of the luggage appearing incomplete and blurred after denoising. While Restormer and Restore-RWKV demonstrate moderate improvement over the previous methods, the contour of the luggage remains discontinuous. More critically, these methods introduce artificial structures by incorporating high-frequency components absent from the original image, compromising structural authenticity. The edges of two chopstick-like objects appear particularly distorted and irregular. In contrast, the proposed XDenoiser effectively suppresses noise while accurately recovering regional contours and structures. The method introduces neither artifacts nor false details, yielding superior readability and structural fidelity. This performance advantage stems from the dynamic fusion mechanism, which successfully integrates global structural modeling with local detail enhancement to achieve robust feature reconstruction in high-noise regions.

Figure 11 presents denoising results on the HiXray dataset with Poisson noise level $n = 5$ and Gaussian noise level $σ = 50$ . The red box highlights a region containing zipper-like objects. Comparative analysis reveals that BM3D, DCANet, Restormer, and Restore-RWKV completely eliminate the original zipper structure during denoising. CBDNet shows moderate improvement over the other methods, though the zipper structure still suffers substantial information loss compared to the original image. Overall, all comparison methods exhibit distorted contours and residual noise in various regions. In contrast, the proposed XDenoiser effectively suppresses noise while maximally preserving structural and contour information. This demonstrates superior denoising capability, particularly in maintaining fine structural details under challenging noise conditions.

To systematically assess differences in computational resource consumption and inference efficiency, five image denoising models are compared on three measures. The number of parameters (Params) is the total number of learnable weights and biases in the model, measured in millions (M); it influences the model’s capacity, storage requirements, training cost, and generalization ability. Computational complexity (FLOPs) is the total count of floating-point operations required during model inference, measured in giga floating-point operations (G); higher FLOPs indicate greater computational complexity and increased processing demands. The Average Inference Time per Image (AIT) is the average time taken to perform inference on a single image, measured in milliseconds (ms). The reported AIT values exclude data transfer time between CPU and GPU: timing starts when the input tensor becomes available in GPU memory and ends upon completion of model computation. All AIT measurements were obtained at an input resolution of $256 \times 256$ pixels with a fixed batch size of $1$ during testing. The hardware platform used was an NVIDIA A800 80G GPU. The results are presented in Table 6.

Regarding parameter size, both Restore-RWKV and Restormer are typical large-scale attention-based architectures, containing $27.80$ M and $26.10$ M parameters, respectively. These counts are significantly higher than those of other models. While they demonstrate strong global modeling capabilities and effectively capture long-range dependencies affected by Poisson noise, their computational and memory overheads are considerably greater. Traditional convolutional denoising methods like CBDNet offer faster inference times but suffer from a limited receptive field, making it challenging to effectively restore the global structure in images corrupted by Poisson noise. Conversely, DCANet has a very small parameter count but exhibits relatively high FLOPs for its size, resulting in a less competitive inference speed. This suggests that its model architecture has limitations in computational efficiency.

Table 5.

Comparative results of denoising methods on the HiXray dataset.

Case	Index	Noisy	BM3D²⁶	CBDNet³¹	DCANet⁴¹	Restormer¹⁵	Restore-RWKV¹⁹	Ours
$σ = 10$ $n = 0.05$	PSNR $↑$	18.20 $\pm 1.4$	18.69 $\pm 1.8$	28.76 $\pm 3.4$	28.89 $\pm 3.8$	28.68 $\pm 3.5$	28.78 $\pm 3.8$	28.98 $\pm 3.6$
	SSIM $↑$	0.2677 $\pm 0.06$	0.3176 $\pm 0.06$	0.8034 $\pm 0.08$	0.8096 $\pm 0.09$	0.7955 $\pm 0.09$	0.8066 $\pm 0.09$	0.8073 $\pm 0.09$
	RMSE $↓$	31.725 $\pm 4.7$	30.234 $\pm 5.7$	9.9098 $\pm 3.0$	9.8631 $\pm 3.2$	10.033 $\pm 3.1$	9.9587 $\pm 3.1$	9.7096 $\pm 3.0$
$σ = 10$ $n = 5$	PSNR $↑$	27.84 $\pm 0.4$	33.44 $\pm 2.0$	34.57 $\pm 3.1$	34.51 $\pm 2.8$	34.34 $\pm 2.6$	—	34.80 $\pm 4.4$
	SSIM $↑$	0.6482 $\pm 0.08$	0.9277 $\pm 0.03$	0.9305 $\pm 0.03$	0.9286 $\pm 0.03$	0.9313 $\pm 0.03$	—	0.9320 $\pm 0.03$
	RMSE $↓$	10.348 $\pm 0.5$	5.5657 $\pm 1.2$	5.0182 $\pm 1.4$	5.0228 $\pm 1.4$	5.0738 $\pm 1.3$	—	4.9568 $\pm 1.4$
$σ = 25$ $n = 0.05$	PSNR $↑$	16.75 $\pm 1.1$	20.95 $\pm 2.1$	28.39 $\pm 3.5$	28.33 $\pm 3.5$	28.21 $\pm 3.3$	28.32 $\pm 3.4$	28.52 $\pm 3.9$
	SSIM $↑$	0.2171 $\pm 0.06$	0.5927 $\pm 0.17$	0.7928 $\pm 0.09$	0.7887 $\pm 0.09$	0.7819 $\pm 0.09$	0.7947 $\pm 0.09$	0.7998 $\pm 0.09$
	RMSE $↓$	37.352 $\pm 4.5$	23.496 $\pm 5.3$	10.350 $\pm 3.2$	10.431 $\pm 3.2$	10.521 $\pm 3.2$	10.411 $\pm 3.2$	10.240 $\pm 3.1$
$σ = 25$ $n = 5$	PSNR $↑$	20.88 $\pm 0.6$	28.46 $\pm 1.4$	30.66 $\pm 3.2$	30.69 $\pm 3.6$	30.58 $\pm 3.0$	30.61 $\pm 3.9$	30.84 $\pm 4.0$
	SSIM $↑$	0.3593 $\pm 0.09$	0.8388 $\pm 0.07$	0.8604 $\pm 0.06$	0.8574 $\pm 0.06$	0.8589 $\pm 0.06$	0.8549 $\pm 0.06$	0.8640 $\pm 0.06$
	RMSE $↓$	23.075 $\pm 1.6$	9.7446 $\pm 1.6$	7.9215 $\pm 2.4$	7.9509 $\pm 2.4$	7.9313 $\pm 2.3$	8.0397 $\pm 2.4$	7.8316 $\pm 2.3$
$σ = 50$ $n = 0.05$	PSNR $↑$	13.93 $\pm 0.7$	22.48 $\pm 0.9$	26.72 $\pm 3.0$	26.76 $\pm 3.2$	26.69 $\pm 3.2$	26.80 $\pm 3.3$	27.00 $\pm 3.6$
	SSIM $↑$	0.1408 $\pm 0.05$	0.6922 $\pm 0.1$	0.7347 $\pm 0.1$	0.7365 $\pm 0.1$	0.7313 $\pm 0.1$	0.7434 $\pm 0.1$	0.7522 $\pm 0.1$
	RMSE $↓$	51.453 $\pm 4.3$	19.246 $\pm 1.9$	12.298 $\pm 3.6$	12.384 $\pm 3.7$	12.476 $\pm 3.7$	12.329 $\pm 3.6$	12.102 $\pm 3.5$
$σ = 50$ $n = 5$	PSNR $↑$	15.43 $\pm 0.6$	24.03 $\pm 1.0$	27.78 $\pm 3.2$	27.69 $\pm 3.2$	27.63 $\pm 3.1$	27.76 $\pm 3.1$	27.90 $\pm 3.5$
	SSIM $↑$	0.1800 $\pm 0.06$	0.7327 $\pm 0.1$	0.7759 $\pm 0.09$	0.7705 $\pm 0.1$	0.7671 $\pm 0.1$	0.7806 $\pm 0.1$	0.7845 $\pm 0.09$
	RMSE $↓$	43.267 $\pm 3.1$	16.126 $\pm 1.8$	11.035 $\pm 3.3$	11.136 $\pm 3.3$	11.194 $\pm 3.3$	11.0251 $\pm 3.2$	10.902 $\pm 3.2$

— Due to training instability in Restore-RWKV, characterized by gradient explosion, the model failed to converge. Consequently, no test results are available for this method under this noise setting.

Table 6.

Comparative analysis of model computational complexity.

Model	Params(M)	FLOPs(G)	AIT(ms)
CBDNet	4.365	10.07	4.64 $\pm 0.04$
DCANet	1.391	18.94	11.79 $\pm 0.08$
Restormer	26.10	35.25	71.6 $\pm 0.05$
Restore-RWKV	27.80	37.46	132.6 $\pm 1.09$
Ours	7.936	23.87	58.05 $\pm 0.16$

XDenoiser, the method proposed in this paper, integrates global modeling with local structure enhancement. It employs a linear attention mechanism to address the Poisson noise typical of low-dose imaging, while lightweight local branches suppress detail perturbations dominated by Gaussian noise. Its inference time (58.05 ms) is higher than that of the pure CNN models CBDNet (4.64 ms) and DCANet (11.79 ms). However, XDenoiser requires far fewer parameters and FLOPs (7.936 M, 23.87 G) than Restormer (26.10 M, 35.25 G) and Restore-RWKV (27.80 M, 37.46 G), and it also runs faster than both. This indicates a favorable balance between computational efficiency and denoising performance.
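The trade-off described above can be quantified directly from Table 6. The short Python sketch below computes the percentage reduction of the proposed model relative to each larger baseline. The values are copied from the table, while the dictionary layout and the helper names are illustrative choices rather than part of the paper.

```python
# Values copied from Table 6: (Params in M, FLOPs in G, AIT in ms).
TABLE_6 = {
    "CBDNet": (4.365, 10.07, 4.64),
    "DCANet": (1.391, 18.94, 11.79),
    "Restormer": (26.10, 35.25, 71.6),
    "Restore-RWKV": (27.80, 37.46, 132.6),
    "Ours": (7.936, 23.87, 58.05),
}


def percent_reduction(ours, baseline):
    """Percentage by which `ours` is smaller than `baseline`."""
    return 100.0 * (1.0 - ours / baseline)


def compare(baseline_name, table=TABLE_6):
    """Return (params, FLOPs, AIT) reductions of 'Ours' vs. a baseline, in %."""
    ours = table["Ours"]
    base = table[baseline_name]
    return tuple(percent_reduction(o, b) for o, b in zip(ours, base))


for name in ("Restormer", "Restore-RWKV"):
    params, flops, ait = compare(name)
    print(f"vs {name}: params -{params:.1f}%, FLOPs -{flops:.1f}%, AIT -{ait:.1f}%")

# Slowdown factor relative to the pure CNN baselines.
for name in ("CBDNet", "DCANet"):
    factor = TABLE_6["Ours"][2] / TABLE_6[name][2]
    print(f"vs {name}: inference time x{factor:.1f}")
```

On these figures, the proposed model uses roughly 70% fewer parameters than either larger baseline, while its inference time is several times that of the two CNN models.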

Conclusion

In this paper, a Poisson-Gaussian hybrid noise model is designed to reflect the noise characteristics of X-ray images, addressing the bias found in traditional synthetic noise distributions. Additionally, a two-branch cooperative denoising model is proposed, which balances global noise modeling and local detail preservation by integrating a linear-complexity RWKV module with a local convolutional branch. The model’s effectiveness in handling the complex noise distributions typical of X-ray images is demonstrated. Extensive experiments confirm that the proposed method achieves significant improvements in both quantitative metrics and the visual quality of reconstructed images.

Footnotes

Acknowledgements

This work was supported by the Key R&D Program Project of Zhejiang Province (2024C01232).

ORCID iDs

Xiaolong Zheng

Liang Zheng

Ethical approval

Not applicable.

Authors’ contributions

Yue Fei: Validation, Data curation, Formal analysis, Investigation, Writing - original draft. Xiaolong Zheng: Methodology, Investigation, Formal analysis, Visualization, Writing - original draft & review & editing. Wangyang Tong: Formal analysis, Investigation. Ji Hu: Data curation, Investigation. Huanhuan Wu: Data curation, Investigation. Liang Zheng: Conceptualization, Investigation, Formal analysis. All authors reviewed the manuscript.

Funding

This work was supported by the Key R&D Program Project of Zhejiang Province (2024C01232).

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Availability of data and materials

The datasets employed in this study were obtained from publicly available repositories, accessible via their corresponding references.

References

1. Chen, et al. Recent development in X-ray imaging technology: Future and challenges. Research 2021; 1: 1–18.

2. Pang, et al. Reconfigurable perovskite X-ray detector for intelligent imaging. Nat Commun 2024; 15: 1769.

3. Mery, Saavedra, Prasad. X-ray baggage inspection with computer vision: A survey. IEEE Access 2020; 8: 145620–145633.

4. Lee, Kang. Poisson-Gaussian noise reduction for X-ray images based on local linear minimum mean square error shrinkage in nonsubsampled contourlet transform domain. IEEE Access 2021; 9: 100637–100651.

5. Lee, Kang. Poisson-Gaussian noise analysis and estimation for low-dose X-ray images in the NSCT domain. Sensors 2018; 18: 1019.

6. Juneja, Minhas, Singla, et al. Denoising techniques for cephalometric X-ray images: A comprehensive review. Multimed Tools Appl 2024; 83: 49953–49991.

7. Hariharan, Kaethner, Strobel, et al. Learning-based X-ray image denoising utilizing model-based image simulations. In: International conference on medical image computing and computer-assisted intervention, 2019, pp.549–557.

8. Chandra, Verma. Analysis of quantum noise-reducing filters on chest X-ray images: A review. Measurement 2020; 153: 107426.

9. Dong, Taylor, Cootes. A random forest-based automatic inspection system for aerospace welds in X-ray images. IEEE Trans Autom Sci Eng 2020; 18: 2128–2141.

10. Levesque, Merritt, Flippo, et al. Neural network denoising of X-ray images from high-energy-density experiments. Rev Scient Instrum 2024; 95: 063508.

11. Liu, Sun, Liu, et al. Enhanced data augmentation for denoising and super-resolution reconstruction of radiation images. IEEE Trans Nucl Sci 2023; 70: 2183–2190.

12. Chen, Zhang, Kalra, et al. Low-dose CT with a residual encoder-decoder convolutional neural network. IEEE Trans Med Imaging 2017; 36: 2524–2535.

13. Yang, Yan, Zhang, et al. Low-dose CT image denoising using a generative adversarial network with Wasserstein distance and perceptual loss. IEEE Trans Med Imaging 2018; 37: 1348–1357.

14. Kang. Wavelet domain residual network (WavResNet) for low-dose X-ray CT reconstruction. arXiv preprint arXiv:1703.01383, 2017.

15. Zamir, Arora, Khan, et al. Restormer: Efficient transformer for high-resolution image restoration. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp.5728–5739.

16. Liang, Cao, Sun, et al. SwinIR: Image restoration using Swin transformer. In: Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp.1833–1844.

17. Gao, Zhou, et al. Hybrid convolutional and attention network for hyperspectral image denoising. IEEE Geosci Remote Sens Lett 2024; 21: 1–5.

18. Peng, Alcaide, Anthony, et al. RWKV: Reinventing RNNs for the transformer era. arXiv preprint arXiv:2305.13048, 2023.

19. Yang, Zhang, et al. Restore-RWKV: Efficient and effective medical image restoration with RWKV. IEEE J Biomed Health Inform 2026; 30: 513–526.

20. Huang, Yang, Tang. A fast two-dimensional median filtering algorithm. IEEE Trans Acoust 1979; 27: 13–18.

21. Canny. A computational approach to edge detection. IEEE Trans Pattern Anal Mach Intell 1986; PAMI-8: 679–698.

22. Donoho, Johnstone. Adapting to unknown smoothness via wavelet shrinkage. J Am Stat Assoc 1995; 90: 1200–1224.

23. Malladi SRS, Ram, Rodríguez. Image denoising using superpixel-based PCA. IEEE Trans Multimedia 2020; 23: 2297–2309.

24. Rudin, Osher, Fatemi. Nonlinear total variation based noise removal algorithms. Physica D: Nonlinear Phenomena 1992; 60: 259–268.

25. Buades, Coll, Morel J-M. Non-local means denoising. Image Process On Line 2011; 1: 208–212.

26. Dabov, Foi, Katkovnik, et al. Image denoising by sparse 3-D transform-domain collaborative filtering. IEEE Trans Image Process 2007; 16: 2080–2095.

27. Elad, Aharon. Image denoising via sparse and redundant representations over learned dictionaries. IEEE Trans Image Process 2006; 15: 3736–3745.

28. Zhang, Zuo, Zhang. FFDNet: Toward a fast and flexible solution for CNN-based image denoising. IEEE Trans Image Process 2018; 27: 4608–4622.

29. Lehtinen, Munkberg, Hasselgren, et al. Noise2Noise: Learning image restoration without clean data. arXiv preprint arXiv:1803.04189, 2018.

30. Krull, Buchholz T-O, Jug. Noise2Void - learning denoising from single noisy images. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp.2129–2137.

31. Guo, Yan, Zhang, et al. Toward convolutional blind denoising of real photographs. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp.1712–1722.

32. Zhang, Zuo, Chen, et al. Beyond a Gaussian denoiser: Residual learning of deep CNN for image denoising. IEEE Trans Image Process 2017; 26: 3142–3155.

33. Yang, Lee B-U. Poisson-Gaussian noise reduction using the hidden Markov model in contourlet domain for fluorescence microscopy images. PLoS ONE 2015; 10: 0136964.

34. Foi, Trimeche, Katkovnik, et al. Practical Poissonian-Gaussian noise modeling and fitting for single-image raw-data. IEEE Trans Image Process 2008; 17: 1737–1754.

35. Dao. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023.

36. Vaswani, Shazeer, Parmar, et al. Attention is all you need. Adv Neural Inf Process Syst 2017; 30: 6000–6010.

37. Duan, Wang, Chen, et al. Vision-RWKV: Efficient and scalable visual perception with RWKV-like architectures. arXiv preprint arXiv:2403.02308, 2024.

38. Mery, Riffo, Zscherpel, et al. GDXray: The database of X-ray images for nondestructive testing. J Nondest Eval 2015; 34: 42.

39. Tao, Wei, Jiang, et al. Towards real-world X-ray security inspection: A high-quality benchmark and lateral inhibition module for prohibited items detection. In: Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp.10923–10932.

40. Miao, Xie, Wan, et al. SIXray: A large-scale security inspection X-ray benchmark for prohibited item discovery in overlapping images. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp.2119–2128.

41. Duan, et al. Dual convolutional neural network with attention for image blind denoising. Multimedia Syst 2024; 30: 263.