Compound feature attention network with edge enhancement for low-dose CT denoising

Abstract

BACKGROUND:

Low-dose CT (LDCT) images usually contain serious noise and artifacts, which weaken the readability of the image.

OBJECTIVE:

To solve this problem, we propose a compound feature attention network with edge enhancement for LDCT denoising (CFAN-Net), which consists of an edge-enhanced module and a proposed compound feature attention block (CFAB).

METHODS:

The edge enhancement module extracts edge details with a trainable Sobel convolution. CFAB consists of an interactive feature learning module (IFLM), a multi-scale feature fusion module (MFFM), and a joint attention module (JAB), and it removes noise from LDCT images in a coarse-to-fine manner. First, in IFLM, the noise is initially removed by cross-latitude interactive judgment learning. Second, in MFFM, multi-scale and pixel attention are integrated to explore fine noise removal. Finally, in JAB, we focus on key information, extract useful features, and improve the efficiency of network learning. To construct a high-quality image, we repeat the above operation by cascading CFABs.

RESULTS:

On the test set of the 2016 NIH AAPM-Mayo LDCT challenge dataset, CFAN-Net achieves a peak signal-to-noise ratio (PSNR) of 33.9692 dB and a structural similarity (SSIM) of 0.9198.

CONCLUSIONS:

Compared with several existing LDCT denoising algorithms, CFAN-Net effectively preserves the texture of CT images while removing noise and artifacts.

Keywords

LDCT, edge enhancement, interactive feature learning, multi-scale feature fusion, joint attention

1 Introduction

Computed tomography (CT) is a widely used screening and diagnostic tool. It allows clinicians to obtain high-resolution volumetric images of internal structures in a non-invasive manner and plays a very important role in modern medical diagnosis. However, the ionizing radiation used during CT imaging poses potential health risks to the human body [1, 2]. The radiation dose can be lowered by reducing the tube current or shortening the exposure time of the X-ray tube, but the resulting low-dose CT (LDCT) images contain serious noise and artifacts, which affect the doctor’s diagnosis.

To solve this problem, many algorithms have been developed to improve the quality of LDCT images. Among them, the classical image post-processing algorithms include the non-local means algorithm [3–5], the block-matching algorithm [6–8], and dictionary learning [9, 10]. Although these algorithms improve image quality to some extent, the denoised image is excessively smooth and may lose key local details. In recent years, with the development of deep learning in computer vision, various deep learning methods have been proposed for LDCT image denoising [11–14], and remarkable results have been achieved. In addition, some methods can simultaneously denoise and deblur [15–17]. At present, deep neural networks are widely applied to LDCT image denoising and have become the mainstream approach, with better performance than traditional methods. For CT denoising, convolutional neural networks (CNNs) show competitive performance [18–21]. Wang et al. [22] proposed the first convolution-based low-dose CT denoising framework, and various deep-learning methods were subsequently proposed [13, 23]. Chen et al. [24] proposed a residual encoder-decoder convolutional neural network (RED-CNN) to suppress noise and artifacts. The directional component of artifacts can be extracted by a directional wavelet transform. Kang et al. [25] built a residual network in the wavelet domain (WavResNet) based on the wavelet transform to restore texture. However, because noise and artifacts are irregularly distributed in LDCT images, wavelet-based networks tend to blur image edges and details after denoising. Yang et al. [26] used a generative adversarial network with Wasserstein distance (WGAN) and perceptual loss [27] to improve the quality of denoised images. Owing to the excellent performance of WGAN in generating realistic CT images and the role of perceptual loss in structural fidelity, the model alleviates excessive smoothing in denoised images. Kulathilake et al. [28] proposed a generative adversarial network (InNetGAN) with initial network modules, which reduces LDCT image noise while preserving its texture and fine structure. However, the GAN framework still suffers from unsatisfactory noise description capability and an unstable training process [29]. Fan et al. [30] constructed an LDCT image denoising autoencoder based on secondary neurons, which provides better fitting ability, higher robustness, and higher efficiency than primary convolution. This is the first autoencoder based on new neurons. Inspired by the success of the transformer in the image domain, some studies [31, 32] applied the transformer model to video recognition and achieved superior performance. Wang et al. [33] were the first to use a Transformer for LDCT image denoising, based on the encoder-decoder structure. It effectively eliminated noise and artifacts, but the number of parameters was large and the training time was long. Geng et al. [34] proposed a GAN-based content-noise complementary learning (CNCL) strategy for medical image denoising, which effectively removes noise and artifacts and generalizes well, but its training time is too long.

Although there are many models and algorithms, the task of low-dose CT image denoising has not been completely solved. Existing models still face problems such as over-smoothed results and the loss of edge and detail information. The features extracted by a convolution kernel are limited to small local areas. Even if the neural network obtains a larger receptive field by stacking convolution layers with small kernels, the number of irrelevant pixels introduced also increases, which reduces computational and statistical efficiency. This defect causes the edge information of the image to be lost and lowers image quality. Therefore, it is necessary to add an edge enhancement design to the network. Gou et al. [35] proposed a new gradient regularization method to enhance LDCT images and achieved good visual and quantitative results in experiments. Yi et al. [36] proposed a sharpness-aware low-dose CT denoising method based on conditional generative adversarial networks. Experiments on simulated and real datasets show that the results of this method have a small resolution loss. Gholizadeh et al. [37] proposed a low-dose CT denoising method based on perceptual loss and an edge detection layer. This method introduces an untrainable edge detection layer to extract edges in the horizontal, vertical, and diagonal directions. Liang et al. [38] used a trainable Sobel convolution layer to extract edge details and proposed the EDCNN network based on edge enhancement and dense connections, which significantly improved the quality of the processed image.

Inspired by EDCNN [38], and to better preserve the fine structure and details of the image after noise reduction, we propose a compound feature attention network with edge enhancement for LDCT denoising (CFAN-Net). The network adopts a residual structure with cascaded blocks and denoises LDCT images as a post-processing step. Experimental results show that this method achieves better results than several other noise reduction methods. The main contributions of this paper are as follows:

We propose an interactive feature learning module (IFLM), which uses the long-distance feature correlation in the channel and spatial dimensions to perform cross-latitude interactive judgment learning. It preliminarily removes a large amount of noise and can improve the representation ability of the model.

We propose a multi-scale feature fusion module (MFFM), which embeds the proposed residual pixel attention block (RPAB) and explores different levels of image features in a multi-scale manner. It removes fine noise and preserves detailed texture information.

We propose an improved joint attention module (JAB), which forces the network to focus on key information, extract useful image features, and improve learning efficiency.

Comparative experiments and ablation experiments show that our proposed CFAN-Net is superior to several advanced LDCT denoising methods in terms of accuracy and visual effects, and each improved module has a positive contribution.

The structure of this paper is as follows: Section 2 describes the related work. Section 3 introduces the design of CFAN-Net in detail and explains the contribution of each module. Section 4 presents the experimental configuration and the corresponding experimental analysis and results. Finally, Section 5 summarizes this paper and outlines directions for future work.

2 Related work

2.1 Noise reduction model

The denoising method proposed in this paper is an image post-processing method for low-dose CT denoising, so the experiments only involve CT image denoising in the image domain. Given a normal-dose CT (NDCT) image I_ND ∈ R^(w×h), the corresponding LDCT image I_LD ∈ R^(w×h) can be expressed as: $I_{LD} = T (I_{ND}),$ (1) where T : R^(w×h) → R^(w×h) denotes the degradation of the image. The denoising process can be formulated as finding the function f that solves: $\arg\min_{f} {\| f (I_{LD}) - I_{ND} \|}_{2}^{2},$ (2) where f represents the best approximation of the inverse mapping T^(-1) obtained by deep learning.
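As a small illustration of the objective in Eq. (2), the sketch below evaluates the squared L2 distance for two extreme choices of f. The toy images and the additive Gaussian noise used as a stand-in for the degradation T are assumptions made only for this example; real LDCT degradation is more complex.

```python
import numpy as np

def l2_objective(denoised, target):
    """Squared L2 distance ||f(I_LD) - I_ND||_2^2 from Eq. (2)."""
    diff = np.asarray(denoised, dtype=np.float64) - np.asarray(target, dtype=np.float64)
    return float(np.sum(diff ** 2))

rng = np.random.default_rng(0)
i_nd = rng.random((8, 8))                         # toy stand-in for an NDCT image
i_ld = i_nd + 0.1 * rng.standard_normal((8, 8))   # assumed toy degradation T

identity_cost = l2_objective(i_ld, i_nd)   # f = identity keeps all the noise
perfect_cost = l2_objective(i_nd, i_nd)    # an ideal f = T^-1 would reach zero
```

Training a network amounts to searching for an f whose cost lies as close as possible to the second case.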

2.2 Sobel trainable convolution block

The basic idea of an edge detection operator is to represent the brightness change of the image by calculating the differential over a local area, and thereby to detect the edges of the image. Edge detection is essentially a filtering algorithm; the methods differ in the choice of filter. The Sobel operator extracts image edges using a first-derivative operator. It contains two 3×3 detection templates. One template extracts edges in the horizontal direction of the image and responds strongly to them. The other extracts edges in the vertical direction and responds strongly to them. The detection templates are shown in Fig. 1.

Fig. 1

Sobel operator template.
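For readers without access to the figure, the following sketch applies the textbook 3×3 Sobel kernels to a toy step edge. Sign and orientation conventions vary between references, so these kernels are the standard ones and are not guaranteed to match Fig. 1 entry for entry. The helper name `filter2d_same` is hypothetical.

```python
import numpy as np

# Textbook Sobel kernels (assumed; conventions differ between references).
SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=np.float64)
SOBEL_Y = SOBEL_X.T  # [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]

def filter2d_same(image, kernel):
    """3x3 cross-correlation with zero padding; output keeps the input size."""
    h, w = image.shape
    padded = np.pad(image, 1)
    out = np.zeros((h, w))
    for di in range(3):
        for dj in range(3):
            out += kernel[di, dj] * padded[di:di + h, dj:dj + w]
    return out

image = np.zeros((6, 6))
image[:, 3:] = 1.0                     # intensity step between columns 2 and 3

gx = filter2d_same(image, SOBEL_X)     # responds to the left-to-right change
gy = filter2d_same(image, SOBEL_Y)     # no top-to-bottom change in interior rows
```

In this toy case `gx` is non-zero only next to the step (and at the zero-padded right border), while `gy` vanishes on all interior rows.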

To extract richer image edge feature information, a trainable Sobel operator is designed in EDCNN [38], as shown in Fig. 2. Unlike traditional convolutions, the kernel of the Sobel convolution is predefined by four types of operators: vertical, horizontal, and two diagonal operators. Note that each Sobel operator is scaled by a learnable parameter “a”. The initial value of “a” is set to 1, and it is adjusted adaptively during training through back propagation. The Sobel convolution shares the same learnable factor “a” within a single channel, but the values of “a” on different channels are no longer the same after training. Sobel convolution acts on the input image channel by channel. Because the “a” of each channel is different, the output feature map contains edge information of different intensities.

Fig. 2

Trainable Sobel convolution.
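The scaling by a per-channel factor "a" can be sketched as follows. The diagonal kernels, the cyclic assignment of the four operator types to output channels, and the name `sobel_bank` are assumptions for illustration; the text only states that four operator types are used and that "a" starts at 1.

```python
import numpy as np

BASE_KERNELS = np.array([
    [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]],    # vertical-edge operator
    [[-1, -2, -1], [0, 0, 0], [1, 2, 1]],    # horizontal-edge operator
    [[-2, -1, 0], [-1, 0, 1], [0, 1, 2]],    # diagonal operator (assumed form)
    [[0, 1, 2], [-1, 0, 1], [-2, -1, 0]],    # anti-diagonal operator (assumed form)
], dtype=np.float64)

def sobel_bank(num_channels, a=None):
    """Return (num_channels, 3, 3) kernels: base operators scaled by factor a."""
    if a is None:
        a = np.ones(num_channels)             # 'a' is initialised to 1
    idx = np.arange(num_channels) % len(BASE_KERNELS)   # assumed cyclic assignment
    return np.asarray(a)[:, None, None] * BASE_KERNELS[idx]

bank_init = sobel_bank(64)
a_trained = np.linspace(0.5, 1.5, 64)         # hypothetical values after training
bank_trained = sobel_bank(64, a_trained)
```

In a real network "a" would be a trainable parameter updated by back propagation; here it is simply passed in to show that only the edge strength per channel changes, not the kernel shape.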

2.3 Attention module

Channel attention block (CAB) [39] aims to exploit the inter-channel dependence of the convolutional features, as shown in Fig. 3(a). It first performs a squeeze operation to encode the spatial global context, which is then followed by an excitation operation to fully capture channel-wise relationships. The squeeze operation is realized by applying global average pooling (GAP) on feature maps M, thus yielding a descriptor s ∈ R^1×1×C. The excitation operator is generated using two convolution layers to recalibrate the descriptor s and then generate the channel attention map S ∈ R^1×1×C by sigmoid. Finally, the channel attention feature is obtained by multiplying S and M.

Fig. 3

(a) CAB, and (b) SAB.

As shown in Fig. 3(b), the spatial attention block (SAB) [40] uses the spatial relationships between features to calculate the spatial attention map D and then uses the map to recalibrate the incoming feature M. To generate the spatial attention map, we first independently apply average pooling and max pooling operations on features M along the channel dimension and concatenate the output maps to form a spatial feature descriptor d ∈ R^H×W×2. This is followed by a convolution and sigmoid activation to obtain the spatial attention map D ∈ R^H×W×1. Finally, the spatial attention feature is obtained by multiplying D and M.
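The two attention blocks described above can be sketched in NumPy as follows, using the channel-first (C, H, W) layout of Table 1. Several details are assumptions because the text does not fix them: the reduction ratio r = 16, the ReLU between the two excitation layers, and the use of a 1×1 convolution (two scalar weights) in place of the SAB convolution, whose kernel size is not stated.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(m, w1, w2):
    """CAB: squeeze by GAP, excite with two 1x1 convs, then rescale M."""
    s = m.mean(axis=(1, 2))                   # descriptor s, shape (C,)
    hidden = np.maximum(w1 @ s, 0.0)          # first 1x1 conv + ReLU (assumed)
    att = sigmoid(w2 @ hidden)                # channel attention map S, shape (C,)
    return att[:, None, None] * m

def spatial_attention(m, w_avg, w_max, bias=0.0):
    """SAB: pool along channels, mix the two maps, sigmoid, then rescale M."""
    d_avg = m.mean(axis=0)                    # average pooling over channels, (H, W)
    d_max = m.max(axis=0)                     # max pooling over channels, (H, W)
    att = sigmoid(w_avg * d_avg + w_max * d_max + bias)   # attention map D
    return att[None, :, :] * m

rng = np.random.default_rng(0)
C, H, W, r = 64, 64, 64, 16
m = rng.standard_normal((C, H, W))
w1 = 0.1 * rng.standard_normal((C // r, C))
w2 = 0.1 * rng.standard_normal((C, C // r))

cab_out = channel_attention(m, w1, w2)
sab_out = spatial_attention(m, w_avg=0.5, w_max=0.5)
```

Because both attention maps lie in (0, 1), each block can only attenuate feature magnitudes; the network learns which channels or locations to keep close to 1.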

3 The proposed CFAN-Net

3.1 CFAN-Net architecture

The LDCT image denoising model proposed in this paper, called the compound feature attention network with edge enhancement for low-dose CT denoising (CFAN-Net), is shown in Fig. 4. The whole model consists of four main modules: the edge enhancement module (Sobel_Edge enhancement), the shallow feature extraction module (SF), the deep feature extraction module (DF) based on the proposed compound feature attention block (CFAB), and the image reconstruction module (RC).

Fig. 4

The architecture of CFAN-Net.

Suppose x is the input LDCT image, and y is the denoised output image. Firstly, the edge enhancement module is used to extract the shallow edge feature e₀ from the input LDCT image x:

$e_{0} = E (x),$ (3) where E (·) is the trainable Sobel convolution operation. Next, e₀ and x are concatenated along the channel dimension, denoted as f_i, and used as the input of SF to extract the initial image feature f_p: $f_{i} = Cat (x, e_{0}),$ (4) $f_{p} = M_{SF} (f_{i}),$ (5) where M_SF (·) performs a 1×1 convolution followed by ReLU activation. The feature f_p is then sent to the cascaded CFABs for further extraction, termed M_DF (·), which can be formulated as: $f_{g} = M_{DF} (f_{p}),$ (6) where f_g is the learned feature. To make full use of the extracted edge information and the original input, f_i and f_g are concatenated along the channel dimension, denoted as f_c, and used as the input of the reconstruction module. The module, called M_RC, first fuses f_c by a 1×1 point-wise convolution, applies the ReLU activation function, and then reconstructs the image with a 3×3 convolution layer. $f_{c} = Cat (f_{i}, f_{g}),$ (7) $f_{r} = M_{RC} (f_{c}),$ (8) where f_r is the output of the reconstruction module.

To accelerate the convergence rate of the model and simplify the task of the main structure of the model, we let the model learn the noise distribution directly. Therefore, the output f_r of the reconstruction module is added to the original LDCT image x to obtain the final denoised image y. $y = f_{r} + x,$ (9)
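The data flow of Eqs. (3)-(9) can be traced with the shape-level sketch below. It is not the authors' implementation: the weights are random, the cascaded CFABs are replaced by an identity placeholder, the edge module is an ordinary 3×3 convolution, and the 64 channels after the 1×1 fusion are an assumption. The channel counts 64, 65 and 129 follow Table 1; a 16×16 patch is used instead of 64×64 to keep the example fast.

```python
import numpy as np

def relu(t):
    return np.maximum(t, 0.0)

def conv1x1(x, weight):
    """Point-wise convolution: x is (C_in, H, W), weight is (C_out, C_in)."""
    return np.einsum('oc,chw->ohw', weight, x)

def conv3x3(x, weight):
    """Zero-padded 3x3 convolution: weight is (C_out, C_in, 3, 3)."""
    _, h, w = x.shape
    padded = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((weight.shape[0], h, w))
    for di in range(3):
        for dj in range(3):
            out += conv1x1(padded[:, di:di + h, dj:dj + w], weight[:, :, di, dj])
    return out

def cfan_forward(x, p, deep_features=lambda f: f):
    e0 = conv3x3(x, p['edge'])                    # Eq. (3): edge features, 64 channels
    f_i = np.concatenate([x, e0], axis=0)         # Eq. (4): 65 channels
    f_p = relu(conv1x1(f_i, p['sf']))             # Eq. (5): 64 channels
    f_g = deep_features(f_p)                      # Eq. (6): placeholder for the CFABs
    f_c = np.concatenate([f_i, f_g], axis=0)      # Eq. (7): 129 channels
    f_r = conv3x3(relu(conv1x1(f_c, p['fuse'])), p['rec'])   # Eq. (8): 1 channel
    return f_r + x                                # Eq. (9): residual output y

rng = np.random.default_rng(0)
params = {
    'edge': rng.standard_normal((64, 1, 3, 3)),
    'sf': 0.1 * rng.standard_normal((64, 65)),
    'fuse': 0.1 * rng.standard_normal((64, 129)),
    'rec': 0.1 * rng.standard_normal((1, 64, 3, 3)),
}
x = rng.random((1, 16, 16))
y = cfan_forward(x, params)
```

Because of the global skip connection in Eq. (9), a reconstruction layer with all-zero weights returns the input unchanged, which is why the main branch only has to model the correction to x.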

3.2 The proposed compound feature attention block (CFAB)

Inspired by [41], we nest a residual connection inside the residual structure to construct the proposed compound feature attention block (CFAB). This design can improve denoising performance by allowing a deeper network. In this paper, however, we cascade only four CFABs, which repeat the coarse-to-fine denoising process to construct a finer image. The residual structure is used to avoid information loss, ensure the smooth flow of information, and enhance the learning ability of the network. Each CFAB is mainly composed of an interactive feature learning module (IFLM), a multi-scale feature fusion module (MFFM), and a joint attention module (JAB). The following sections introduce its components in detail.

3.2.1 Interactive Feature Learning Module (IFLM)

To remove most of the noise and artifacts in image features, inspired by [42, 43], we design an interactive feature learning module (IFLM), as shown in Fig. 5(d). The module uses the long-distance feature correlation in the channel and spatial dimensions to make the network focus on features with more information. It performs cross-latitude interactive discriminant learning to achieve the initial removal of noise.

Fig. 5

(a) 3d_CA, (b) 3d_SA, (c) CARB, and (d) IFLM.

The channel attention residual block (CARB) is composed of a residual block and a three-dimensional channel attention module (3d_CA), as shown in Fig. 5(c). Convolutional blocks are used to extract features, and 3d_CA is used to enhance the ability of feature learning. We take the channel-wise attention of 3d_CA ((H, W) dimension) as an example. The feature descriptors are first obtained in the (H, W) dimension using the average pooling operation. The descriptor is then forwarded to the shared multilayer perceptron (MLP) and the Sigmoid function to generate the channel attention map A_c. Finally, the channel attention feature F_c is obtained by multiplying the attention map with the input feature F_in. A similar process is used to generate the row-wise attention features F_w of the (C, H) dimension and the column-wise attention features F_h of the (C, W) dimension. Finally, a convolution and the Sigmoid function are used to generate the three-dimensional channel attention feature map C_3d.

$C_{3 d} = η (C_{3 \times 3} (F_{c} + F_{w} + F_{h} + F_{in})),$ (10) where η denotes the Sigmoid function. The whole process of CARB can be expressed as: $F_{CARB} = M_{f} (F_{in}) \times C_{3 d} + F_{in},$ (11) where M_f (·) denotes the convolutional feature extraction process.
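Eqs. (10) and (11) can be sketched as below. This is one possible reading, not the authors' code: each attention branch is assumed to average-pool over the two dimensions named in the text and keep the third, a single linear map stands in for the shared MLP, and the 3×3 convolution and the feature extractor M_f are pluggable callables that default to the identity so that the example stays short.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def axis_attention(f_in, keep_axis, weight):
    """Average-pool over the two other axes, map the descriptor, rescale f_in."""
    pooled_axes = tuple(ax for ax in range(3) if ax != keep_axis)
    desc = f_in.mean(axis=pooled_axes)          # 1-D descriptor along keep_axis
    att = sigmoid(weight @ desc)                # linear map stands in for the MLP
    shape = [1, 1, 1]
    shape[keep_axis] = -1
    return att.reshape(shape) * f_in

def carb(f_in, weights, conv3x3=lambda t: t, m_f=lambda t: t):
    f_c = axis_attention(f_in, 0, weights['c'])       # pooled over (H, W)
    f_w = axis_attention(f_in, 2, weights['w'])       # pooled over (C, H)
    f_h = axis_attention(f_in, 1, weights['h'])       # pooled over (C, W)
    c_3d = sigmoid(conv3x3(f_c + f_w + f_h + f_in))   # Eq. (10)
    return m_f(f_in) * c_3d + f_in                    # Eq. (11)

rng = np.random.default_rng(0)
C, H, W = 8, 6, 5
f_in = rng.standard_normal((C, H, W))
weights = {
    'c': 0.1 * rng.standard_normal((C, C)),
    'w': 0.1 * rng.standard_normal((W, W)),
    'h': 0.1 * rng.standard_normal((H, H)),
}
f_carb = carb(f_in, weights)
```

With the identity placeholders the output equals F_in scaled by a factor between 1 and 2, which makes the role of the residual term in Eq. (11) easy to see; real convolutions would be passed in through `conv3x3` and `m_f`.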

The interactive feature learning module (IFLM) is composed of the channel attention residual block (CARB) and the three-dimensional spatial attention module (3d_SA) and applies the same structure as the channel attention residual block (CARB), as shown in Fig. 5(d). The three-dimensional spatial attention module (3d_SA) and the three-dimensional channel attention module (3d_CA) learn the interdependence between features through similar cross-latitude interactions, but 3d_SA can provide different feature information for the discriminative representation of 3d_CA. In 3d_SA, we still take channel-wise attention ((H, W) dimension) as an example. We first apply the average-pooling operation along the channel axis to generate an efficient feature descriptor. The convolutional layer and sigmoid function are then applied to generate a spatial attention map A_cha of the locations of the emphasized or suppressed feature information. Finally, we perform a multiplication operation between A_cha and F_in to obtain spatial-wise attentive features F_cha. A similar process is applied in the (C, H) and (C, W) dimensions to generate the row-to-row attention features F_row and the column-to-column attention features F_col. Then, we add three-dimensional spatial attention features and input F_in, using 3×3 convolution and Sigmoid function to generate the output S_3d of 3d_SA. $S_{3 d} = η (C_{3 \times 3} (F_{cha} + F_{row} + F_{col} + F_{in})),$ (12) The whole process of IFLM can be expressed as: $F_{out} = M_{fc} (F_{in}) \times S_{3 d} + F_{in},$ (13) where M_fc (·) denotes the operation process of two CARBs and a 3×3 convolution.

3.2.2 Multi-scale Feature Fusion Module (MFFM)

The internal structure information of the image, such as the outer contour, lines and curves of the object, is an important factor in evaluating image quality. To improve the quality of the image, we design a multi-scale feature fusion block (MFFM), as shown in Fig. 6(c). Based on retaining the original information, MFFM uses three groups of the proposed residual pixel attention block (RPAB) (as shown in Fig. 6 (b)) to explore multi-scale image feature information. It extracts rich structural texture information, reconstructs image details, and improves image quality. It achieves the removal of fine noise.

Fig. 6

(a) PAB, (b) RPAB, and (c) MFFM.

Assume that the input feature is F_in and the number of channels is 64. Before each RPAB, the last 16 channels of the current feature maps are set aside, and this is repeated until no feature maps remain to be processed, as shown in the dashed box in Fig. 6(c). After that, the retained 16-channel feature maps are concatenated along the channel dimension to form new features, which are fused by a 1×1 convolution, activated by the ReLU function, and then reconstructed by a 3×3 convolution. Finally, the result is added to the input feature F_in for feature compensation to obtain the output F_out. The whole process can be expressed as: $F_{out} = C_{3 \times 3} (R (C_{1 \times 1} ([f_{0}, f_{1}, f_{2}, f_{3}]))) + F_{in},$ (14) where f_i are the 16-channel feature maps retained at each step, [·] denotes concatenation along the channel dimension, and R denotes ReLU.
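The channel bookkeeping behind Eq. (14) can be sketched as follows. The reading used here is an assumption: with 64 input channels and three RPABs, 16 channels are set aside before each RPAB and the 16 channels leaving the last RPAB form f_3. The RPABs and the 3×3 convolution are identity placeholders, and the function name `mffm` is hypothetical.

```python
import numpy as np

def mffm(f_in, w_fuse, rpabs, conv3x3=lambda t: t, keep=16):
    kept, rest = [], f_in
    for rpab in rpabs:
        kept.append(rest[-keep:])          # set aside the last 16 channels
        rest = rpab(rest[:-keep])          # remaining channels pass through an RPAB
    kept.append(rest)                      # final 16 channels form f_3
    cat = np.concatenate(kept, axis=0)     # [f_0, f_1, f_2, f_3], 64 channels
    fused = np.maximum(np.einsum('oc,chw->ohw', w_fuse, cat), 0.0)  # 1x1 conv + ReLU
    return conv3x3(fused) + f_in, kept     # Eq. (14)

rng = np.random.default_rng(0)
f_in = rng.standard_normal((64, 8, 8))
w_fuse = 0.1 * rng.standard_normal((64, 64))
identity_rpab = lambda t: t                # placeholder for a real RPAB
f_out, kept = mffm(f_in, w_fuse, [identity_rpab] * 3)
```

The progressively narrower inputs (48, 32 and 16 channels) mean that each retained group has passed through a different number of RPABs, which is what gives the concatenated feature its multi-level character.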

3.2.3 Joint Attention Module (JAB)

To quickly extract more useful local details from the many denoised image features and construct high-quality images, we propose a joint attention block (JAB), as shown in Fig. 7(b). The goal of JAB is to focus on key information and ignore irrelevant information. It refines useful information and improves learning efficiency. JAB first extracts local features by a residual block consisting of two shift-conv layers [44] and a simple ReLU activation. The shift-conv consists of a set of shift operations and a 1×1 convolution, as shown in Fig. 7(a). Specifically, we divide the input features into five groups. The first four groups of features are shifted along different spatial directions (left, right, top and bottom), with 12 channels in each group. The remaining features form the last group. The 1×1 convolution can then use the information of adjacent pixels to extract local features more effectively without adding much computation. Next, the image features are recalibrated by the CAB and SAB attention mechanisms separately, and the results are concatenated along the channel dimension. Finally, the calibrated feature maps are fused by a 1×1 convolution and summed with the input features to obtain the final output. The operation of CAB and SAB is described in Section 2.3. The overall process of JAB is:

Fig. 7

(a) shift-conv, and (b) JAB.

$M = W_{l} (F_{in}) + F_{in},$ (15) $F_{out} = C_{1 \times 1} ([CAB (M), SAB (M)]) + F_{in},$ (16) where F_in is the input feature, and W_l (·) is the residual block for extracting local features.
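The shift operation inside shift-conv can be illustrated as follows. The shift distance of one pixel and the zero filling at the borders are assumptions; the text only specifies five groups, four of which are moved left, right, up and down. The group size of 12 is read here as 12 channels.

```python
import numpy as np

def shift_channels(x, group=12):
    """Shift four channel groups by one pixel each; the rest stay in place."""
    g = group
    out = np.zeros_like(x)
    out[0 * g:1 * g, :, :-1] = x[0 * g:1 * g, :, 1:]    # shift left
    out[1 * g:2 * g, :, 1:] = x[1 * g:2 * g, :, :-1]    # shift right
    out[2 * g:3 * g, :-1, :] = x[2 * g:3 * g, 1:, :]    # shift up
    out[3 * g:4 * g, 1:, :] = x[3 * g:4 * g, :-1, :]    # shift down
    out[4 * g:] = x[4 * g:]                             # last group is unchanged
    return out

def shift_conv(x, weight, group=12):
    """Shift, then mix channels with a 1x1 convolution (weight is (C_out, C_in))."""
    return np.einsum('oc,chw->ohw', weight, shift_channels(x, group))

x = np.zeros((64, 5, 5))
x[[0, 12, 24, 36, 48], 2, 2] = 1.0      # one marker pixel in each of the five groups
shifted = shift_channels(x)
mixed = shift_conv(x, np.eye(64))       # identity weights leave the shifted maps as-is
```

After the shift, a given spatial position holds values that originally belonged to its four neighbours in different channels, so the following 1×1 convolution sees a small neighbourhood even though it has no spatial extent itself.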

3.3 Loss function

In this paper, we directly adopt the MS-SSIM+L1 loss function proposed in [45] as the main loss, which is defined as: $L_{Maj} = α \cdot L_{MS - SSIM} + (1 - α) \cdot G_{σ_{G}^{M}} \cdot L_{1},$ (17) where α = 0.84, and $G_{σ_{G}^{M}}$ is the Gaussian weight. Image denoising neural networks using perceptual loss are more robust to many potential problems (such as over-smoothing and distortion) [46]. Perceptual loss measures the similarity between images by comparing the features that a convolutional neural network extracts from them. In this paper, we introduce the autoencoder perceptual loss proposed in SACNN [47], defined as: $L_{AE} (g) = E_{(I_{LD}, I_{ND})} [\frac{{∥ φ (g (I_{LD})) - φ (I_{ND}) ∥}_{F}^{2}}{CHW}],$ (18) where g is the denoising network, φ is a pre-trained encoder network, and C, H, W represent the depth, height, and width of the CT image.

The total loss L_total in this paper is composed of L_Maj and L_AE (g): $L_{total} = L_{Maj} + β L_{AE} (g),$ (19) where β is the weight of the perceptual loss. In this paper, we set β = 0.1.
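The way the three terms are weighted in Eqs. (17)-(19) can be sketched as follows. The MS-SSIM term and the Gaussian-weighted L1 term are taken as precomputed scalars, since implementing them is beyond this example, and `ae_loss` normalizes by the element count of the arrays it receives, which is an assumption about how CHW is applied. All function names are hypothetical.

```python
import numpy as np

def major_loss(ms_ssim_loss, weighted_l1, alpha=0.84):
    """Eq. (17): mix of the MS-SSIM loss and the Gaussian-weighted L1 loss."""
    return alpha * ms_ssim_loss + (1.0 - alpha) * weighted_l1

def ae_loss(feat_denoised, feat_target):
    """Eq. (18) for one image pair: squared Frobenius distance per element."""
    diff = np.asarray(feat_denoised, dtype=np.float64) - np.asarray(feat_target, dtype=np.float64)
    return float(np.sum(diff ** 2) / diff.size)

def total_loss(ms_ssim_loss, weighted_l1, feat_denoised, feat_target, beta=0.1):
    """Eq. (19): main loss plus the weighted perceptual term."""
    return major_loss(ms_ssim_loss, weighted_l1) + beta * ae_loss(feat_denoised, feat_target)

# Hypothetical values, chosen only to make the arithmetic easy to follow.
l_maj = major_loss(0.1, 0.05)                  # 0.84 * 0.1 + 0.16 * 0.05
l_ae = ae_loss(np.ones((2, 2, 2)), np.zeros((2, 2, 2)))
l_total = total_loss(0.1, 0.05, np.ones((2, 2, 2)), np.zeros((2, 2, 2)))
```

With α = 0.84 the structural term dominates the main loss, and β = 0.1 keeps the perceptual term as a comparatively small regularizer.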

4 Experimental results and analysis

4.1 Experimental settings

In our experiments, we used the 2016 NIH AAPM-Mayo Clinic Low-Dose CT Grand Challenge dataset [48], which contains paired CT images of 10 anonymous patients. Each LDCT image has a corresponding NDCT image, and each image is 512×512 pixels. The dataset was split before training: the CT images of 9 randomly selected patients (812 pairs) were used as the training set, and the images of the remaining patient (35 pairs) were used as the test set. During training, patches of size 64×64 are randomly cropped, and the network is optimized using the Adam optimizer with default parameters. In the experiment, the batch size is 4, the learning rate is 0.00004, and the number of iterations is 400. The experiments are implemented in the PyTorch framework, and the network is trained and tested on a computer equipped with an NVIDIA GeForce RTX 2080 SUPER GPU. The parameters of the comparison methods are set according to the suggestions of the original papers. The dimensions of the variables in Figs. 3–7 are shown in Table 1.

Table 1
Dimensions of the variables in Figs. 3–7 (C: channel, H: height, W: width)

Figure	Variables	Dimensions (C×H×W)
Fig. 3	M	64×64×64
	s / S	64×1×1
	d	2×64×64
	D	1×64×64
Fig. 4	x	1×64×64
	e₀	64×64×64
	f_i	65×64×64
	f_p / f_g	64×64×64
	f_c	129×64×64
	f_r	1×64×64
	y	1×64×64
Fig. 5	F_in / F_c / F_w / F_h / C_3d / F_cha / F_col / F_row / S_3d / F_out	64×64×64
Fig. 6	F_in	64×64×64
	f_0 / f_1 / f_2 / f_3	16×64×64
	F_out	64×64×64
Fig. 7	F_in / M / F_out	64×64×64

To test the effectiveness of the proposed CFAN-Net, we compare it with REDCNN [24], EDCNN [38], QAE [30], CTformer [33], and CNCL [34] in terms of visual effects and quantitative indicators. The objective evaluation indexes used in the quantitative analysis are structural similarity (SSIM) based on structural difference, peak signal-to-noise ratio (PSNR) based on pixel gray level difference, gradient magnitude similarity deviation (GMSD) based on gradient value change, feature similarity index measure (FSIM) based on feature difference, and visual information fidelity (VIFs) based on visual perception. Except for GMSD, higher values of these metrics indicate better quality of the denoised CT images.
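As an example of how the pixel-based metric is computed, the sketch below implements the standard PSNR definition, 10·log10(MAX² / MSE). The data range used to normalize the images is not stated in the text, so `data_range` is an explicit argument and the values here are illustrative only.

```python
import numpy as np

def psnr(reference, test, data_range):
    """Peak signal-to-noise ratio in dB between a reference and a test image."""
    reference = np.asarray(reference, dtype=np.float64)
    test = np.asarray(test, dtype=np.float64)
    mse = np.mean((reference - test) ** 2)
    if mse == 0.0:
        return float('inf')               # identical images
    return float(10.0 * np.log10(data_range ** 2 / mse))

ndct = np.zeros((4, 4))                   # toy reference image
denoised = np.full((4, 4), 0.1)           # toy result with a constant error of 0.1
value_db = psnr(ndct, denoised, data_range=1.0)   # MSE = 0.01, so about 20 dB
```

Since PSNR depends only on the mean squared error, it rewards smooth results, which is one reason the paper also reports structure- and gradient-based metrics.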

4.2 The Mayo experiment

4.2.1 Visual effect

The results of processing two representative slices (denoted Case 1 and Case 2) from the Mayo test dataset with the different methods are shown in Figs. 8 and 9. All CT images in the axial view are displayed in the [-160HU, 240HU] window. To show the noise reduction effect more clearly, we enlarged the regions of interest (ROIs) and marked the artifacts and lesions with red arrows, yellow dotted ellipses and blue circles, as shown in Figs. 10 and 11. There are obvious noise and artifacts in the LDCT images (Figs. 8(a) and 9(a)) compared with the NDCT images (Figs. 8(b) and 9(b)), in which lesions and tissue structure are clearly visible. Judging from the overall noise reduction effect and the locally enlarged ROI images, all six compared methods remove noise and artifacts in the LDCT image to some extent.

Fig. 8

Comparison of Case 1.

Fig. 9

Comparison of Case 2.

Fig. 10

The zoomed ROIs in Fig. 8.

Fig. 11

The zoomed ROIs in Fig. 9.

Although REDCNN eliminates some noise, the image is blurred (as shown in Figs. 10(c1) and 11(c4)), which can be attributed to the use of mean square error (MSE) as the loss function. QAE and CTformer remove most of the speckle noise, but still cannot effectively remove artifacts (as shown in Figs. 10(e1) – (f1)). CNCL effectively removes noise and artifacts in LDCT images, but suffers from over-smoothing and detail loss (as shown in Figs. 10(g1) and 11(g3)). Comparing the small lesions marked by blue circles in Fig. 11, the lesion contours produced by EDCNN and CFAN-Net are the clearest, while the other methods show obvious blurring. This result confirms the role of the edge enhancement module. Judging from the other markers in Figs. 10 and 11, CNCL and CFAN-Net have the strongest ability to remove noise and artifacts, while the other methods leave obvious residual artifacts. Overall, CFAN-Net achieves better results in noise and artifact removal and in tissue detail preservation. In the ROI images, after the noise and artifacts are removed, its tissue details (as shown in Figs. 10(h1) and 11(h3)) and texture (as shown in Fig. 11(h4)) are the clearest. This result shows that the compound feature attention module has strong noise and artifact removal capabilities. In summary, CFAN-Net performs well at removing noise and artifacts and at restoring image details and textures.

4.2.2 Quantitative assessment

To illustrate the effectiveness of the proposed algorithm, we use five objective indicators to evaluate and compare the six denoising algorithms. The values are marked in the upper-left and upper-right corners of Figs. 8 and 9. In addition, Table 2 summarizes the average PSNR, SSIM, GMSD, FSIM, and VIFs values obtained by the different methods on the 35 LDCT images in the test dataset. The optimal and suboptimal values are represented by red and blue, respectively. Table 2 shows that among REDCNN, EDCNN and CTformer, the PSNR of CTformer is slightly higher and the SSIM of EDCNN is the lowest, while the gaps in the other indicators are very small. QAE achieved suboptimal PSNR, GMSD and FSIM, but its GMSD and FSIM scores were not much different from those of REDCNN, EDCNN and CTformer. CNCL performed worst on PSNR, GMSD, and FSIM, but its SSIM and VIFs values were suboptimal. The comprehensive comparison shows that the proposed method achieves the best score on all five indicators, which indicates that CFAN-Net achieves good noise and artifact suppression (PSNR) and feature information preservation (FSIM and VIFs), and that its visual effect is closer to the NDCT images (GMSD and SSIM).

Table 2
Quantitative index values (mean±standard deviation) of noise reduction results of different algorithms on the Mayo testing set

Method	PSNR (dB)↑	SSIM↑	GMSD↓	FSIM↑	VIFs↑
LDCT	30.2805±0.5197	0.8585±0.0124	0.0876±0.0063	0.9452±0.0039	0.4643±0.0157
REDCNN	33.3298±0.5819	0.9131±0.0087	0.0585±0.0025	0.9610±0.0029	0.5168±0.0255
EDCNN	33.3322±0.4042	0.9103±0.0079	0.0600±0.0032	0.9622±0.0024	0.5256±0.0135
QAE	33.6436±0.4001	0.9161±0.0074	0.0581±0.0030	0.9625±0.0023	0.5302±0.0137
CTformer	33.3980±0.5546	0.9141±0.0087	0.0589±0.0031	0.9624±0.0024	0.5259±0.0189
CNCL	33.1521±0.5604	0.9186±0.0064	0.0636±0.0040	0.9598±0.0022	0.5371±0.0107
CFAN-Net	33.9692±0.4049	0.9198±0.0070	0.0580±0.0030	0.9638±0.0023	0.5420±0.0130

The boxplots of PSNR and SSIM in Fig. 12 compare the distributions of the different methods on the test dataset. A boxplot summarizes a set of data through five statistics: maximum, minimum, lower and upper quartiles, and median. From the size of the boxes in Fig. 12 and the distribution ranges of PSNR and SSIM, we can conclude that CFAN-Net has good robustness. According to the gray line (median) in the boxplots, the order of the denoising methods is as follows: (PSNR) CNCL < REDCNN < EDCNN < CTformer < QAE < CFAN-Net, (SSIM) EDCNN < REDCNN < CTformer < QAE < CNCL < CFAN-Net. The highest median value is consistent with the average quantitative performance of CFAN-Net.

Fig. 12

Boxplot of denoised results using different denoising methods on AAPM testing set.

Since doctors tend to focus on ROIs in practical clinical applications, we quantify the ROI results obtained by the different methods using PSNR and SSIM, as shown in Fig. 13. Comparing the PSNR and SSIM of each method in the same ROI shows that CFAN-Net always has the leading quantitative performance. Overall, the quantitative indicators of EDCNN are poor, while the other methods differ little from each other. In the comparison of PSNR, CNCL performs worst in ROI1, and CFAN-Net is significantly ahead in all ROIs. In the comparison of SSIM, EDCNN is clearly the worst in ROI1 and ROI3, and CFAN-Net is significantly ahead in ROI1 and ROI2. When evaluating image quality, we consider the performance of the algorithms on all quantitative indicators. In summary, CFAN-Net is superior to the other comparison methods in terms of visual effects and quantitative index analysis.

Fig. 13

Quantitative performance of ROI in Figs. 10, 11.
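The ROI evaluation can be sketched as cropping the same rectangle from the reference image and from the denoised image and then computing PSNR on the crops. The helper names `crop` and `psnr`, the ROI coordinates, and the `data_range` value below are assumptions made for illustration; the paper does not specify them. SSIM is omitted to keep the sketch short.

```python
import math


def crop(image, top, left, height, width):
    """Cut a rectangular ROI out of an image stored as a list of rows."""
    return [row[left:left + width] for row in image[top:top + height]]


def psnr(reference, test, data_range):
    """Peak signal-to-noise ratio in dB: 10 * log10(data_range^2 / MSE)."""
    squared_errors = [
        (r - t) ** 2
        for ref_row, test_row in zip(reference, test)
        for r, t in zip(ref_row, test_row)
    ]
    mse = sum(squared_errors) / len(squared_errors)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(data_range ** 2 / mse)


# Toy 4x4 images with hypothetical intensities in [0, 10].
ndct = [[0, 0, 0, 0], [0, 5, 5, 0], [0, 5, 5, 0], [0, 0, 0, 0]]
denoised = [[0, 0, 0, 0], [0, 6, 6, 0], [0, 6, 6, 0], [0, 0, 0, 0]]

# Hypothetical ROI: the central 2x2 block.
roi_ref = crop(ndct, top=1, left=1, height=2, width=2)
roi_out = crop(denoised, top=1, left=1, height=2, width=2)
print(psnr(roi_ref, roi_out, data_range=10))
```

Because the error is averaged only over the ROI pixels, a method can rank differently inside an ROI than on the full slice, which is why the ROIs are reported separately.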

4.3 Real clinical experiments

This section uses real clinical data, and the study has been approved by the ethics committee of our hospital. A representative slice (Case 3) is selected to verify the robustness of the method on real clinical CT images. Two ROIs (marked with rectangular boxes in Fig. 14(a)) are selected from the slice for better comparison. They are displayed in the upper left and upper right corners of the slice, and the artifacts and lesions are marked with red arrows and blue circles, respectively, as shown in Fig. 14. The LDCT image contains serious noise and artifacts, which strongly hinder the observation of texture and lesions. REDCNN and QAE suppress noise and artifacts to a certain extent, but the processed images are blurred (Fig. 14(b) and 14(d)). Compared with CTformer, the image processed by EDCNN contains more artifacts (Fig. 14(c5)), but the contour of the lesion in the blue circle is clearer (Fig. 14(c6)). CNCL and CFAN-Net produce better results, but obvious artifacts remain in the image processed by CNCL (Fig. 14(f5)), and its brightness still needs to be improved. CFAN-Net suppresses noise most effectively while preserving the lesion and tissue texture well. In summary, the images processed by CFAN-Net facilitate diagnosis and have practical application value.

Fig. 14

Comparison of Case 3.

4.4 Ablation studies

In this section, ablation experiments are performed on the Mayo dataset to analyze the impact of each contribution in CFAN-Net. The four ablation experiments are ‘w/o the compound loss’ (the compound loss is replaced with the L1 loss), ‘w/o the interactive feature learning module’ (the interactive feature learning module is removed), ‘w/o the multi-scale feature fusion module’ (the multi-scale feature fusion module is removed), and ‘w/o the joint attention module’ (the joint attention module is removed).
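One way to organize these four variants alongside the full model is a small configuration object with one switch per component. This is only a sketch: the class `AblationConfig`, its field names, and the loss labels are hypothetical and do not come from the authors' code.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class AblationConfig:
    """Hypothetical switches for the components studied in the ablation."""
    use_iflm: bool = True   # interactive feature learning module
    use_mffm: bool = True   # multi-scale feature fusion module
    use_jab: bool = True    # joint attention module
    loss: str = "compound"  # "compound" or "l1" (assumed labels)


FULL = AblationConfig()

# Each variant changes exactly one setting relative to the full model.
VARIANTS = {
    "w/o the compound loss": replace(FULL, loss="l1"),
    "w/o the interactive feature learning module": replace(FULL, use_iflm=False),
    "w/o the multi-scale feature fusion module": replace(FULL, use_mffm=False),
    "w/o the joint attention module": replace(FULL, use_jab=False),
    "CFAN-Net": FULL,
}

for name, config in VARIANTS.items():
    print(name, config)
```

Changing a single setting per variant keeps the comparison attributable to one component at a time.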

As shown in Fig. 15, we select another representative slice (Case 4) from the Mayo dataset for comparison, with a display window of [-160 HU, 240 HU]. Fig. 15(c) and (d) show that without the interactive feature learning module or the multi-scale feature fusion module, the results retain residual noise to varying degrees. Without the joint attention module, the results are blurred and contain artifacts, as shown in Fig. 15(e). Although the visual difference between Fig. 15(f) and 15(g) is subtle, CFAN-Net is slightly better in artifact suppression and detail preservation.

Fig. 15

The denoised results obtained by performing the ablation experiments for Case 4 on the Mayo testing set.
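The display window used for Case 4 can be illustrated with a short sketch that clips Hounsfield units to [-160 HU, 240 HU] and rescales them linearly to [0, 1] for display. The function name `apply_window` and the choice of a [0, 1] output range are assumptions; the paper states only the window limits.

```python
def apply_window(hu_values, low=-160.0, high=240.0):
    """Clip HU values to [low, high] and rescale them linearly to [0, 1]."""
    span = high - low
    return [min(max((value - low) / span, 0.0), 1.0) for value in hu_values]


# Air (-1000 HU) and dense bone (1000 HU) saturate at the window limits,
# while soft tissue around 40 HU falls in the middle of the gray scale.
print(apply_window([-1000, -160, 40, 240, 1000]))
```

A narrow window such as this one spreads the soft-tissue range over the full gray scale, which makes residual noise and subtle artifacts easier to see in the ablation comparison.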

Table 3 lists the average metric values of each model in the ablation experiment over the 35 image pairs of the testing set. The optimal and sub-optimal values are shown in red and blue, respectively. The proposed CFAN-Net achieves the best score for every metric. The model without the multi-scale feature fusion module achieves the sub-optimal value for every metric except PSNR, while the model without the interactive feature learning module scores worst on every metric, which shows that the interactive feature learning module has a positive effect on the model. The model trained with only the L1 loss achieves sub-optimal values except for SSIM and VIFs. Compared with CFAN-Net, the model without the joint attention module has lower PSNR, SSIM, FSIM and VIFs and an unchanged GMSD, which shows the contribution of the joint attention module. In summary, the improvements to the network proposed in this paper are effective for improving the quality of denoised LDCT images.

Table 3

Quantitative performance of ablation studies (mean)

	PSNR↑	SSIM↑	GMSD↓	FSIM↑	VIFs↑
w/o the compound loss	33.8766	0.9190	0.0581	0.9636	0.5394
w/o the interactive feature learning module	33.4569	0.9182	0.0586	0.9630	0.5380
w/o the multi-scale feature fusion module	33.6976	0.9192	0.0581	0.9636	0.5407
w/o the joint attention module	33.8465	0.9188	0.0580	0.9634	0.5392
CFAN-Net	33.9692	0.9198	0.0580	0.9638	0.5420
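The ranking read from Table 3 can be checked with a few lines of code. The values below are copied from Table 3; the shortened row labels and the helper `best_models` are introduced here for illustration only. GMSD is treated as lower-is-better and the other metrics as higher-is-better, following the arrows in the table header.

```python
# Mean values copied from Table 3, in the order PSNR, SSIM, GMSD, FSIM, VIFs.
TABLE3 = {
    "w/o compound loss": (33.8766, 0.9190, 0.0581, 0.9636, 0.5394),
    "w/o IFLM": (33.4569, 0.9182, 0.0586, 0.9630, 0.5380),
    "w/o MFFM": (33.6976, 0.9192, 0.0581, 0.9636, 0.5407),
    "w/o JAB": (33.8465, 0.9188, 0.0580, 0.9634, 0.5392),
    "CFAN-Net": (33.9692, 0.9198, 0.0580, 0.9638, 0.5420),
}

# (metric name, True if a higher value is better)
METRICS = (("PSNR", True), ("SSIM", True), ("GMSD", False),
           ("FSIM", True), ("VIFs", True))


def best_models(table, column, higher_is_better):
    """Return the sorted names of all models tied for the best score."""
    scores = {name: row[column] for name, row in table.items()}
    target = max(scores.values()) if higher_is_better else min(scores.values())
    return sorted(name for name, score in scores.items() if score == target)


for column, (metric, higher_is_better) in enumerate(METRICS):
    print(metric, best_models(TABLE3, column, higher_is_better))
```

CFAN-Net is the sole best model on PSNR, SSIM, FSIM and VIFs, and it ties with the variant without the joint attention module on GMSD, in line with the discussion above.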

5 Conclusion

In this paper, we propose a compound feature attention network with edge enhancement for low-dose CT denoising (CFAN-Net), which addresses the loss of texture detail and the over-smoothing that occur after LDCT image denoising. Combining the advantages of residual structure and attention, we propose a compound feature attention block (CFAB), which includes an interactive feature learning module (IFLM), a multi-scale feature fusion module (MFFM) and a joint attention module (JAB). CFAB removes image noise from coarse to fine, and CFABs are cascaded to deepen the network. The experimental results show that the proposed CFAN-Net clearly removes noise and artifacts from LDCT images. Compared with several other advanced LDCT image denoising networks, it achieves leading results both qualitatively and quantitatively. However, CFAN-Net does not perfectly restore the tissue details of the image, and its training requires paired labels. In the future, we will further improve the performance of the model and extend it towards unsupervised or semi-supervised learning.

Acknowledgments

We would like to thank the editors and reviewers for improving the content of this article, and the Mayo Clinic and the Ethics Committee for providing the data used. This work was supported in part by the National Natural Science Foundation of China (61801438), in part by the Science and Technology Innovation Project of Colleges and Universities of Shanxi Province (2020L0282), and in part by the Natural Science Foundation of Shanxi Province of China (201901D111153, 202103021224204).
