Abstract
BACKGROUND:
Low-dose computed tomography (LDCT) is an effective method for reducing radiation exposure. However, reducing the radiation dose introduces considerable noise into the reconstructed image, which can affect the doctor’s judgment.
OBJECTIVE:
To solve this problem, this study proposes a local total variation and improved wavelet residual convolutional neural network (LTV-WRCNN) denoising model.
METHODS:
The model first uses local total variation (LTV) to decompose the LDCT image into a cartoon image and a texture image. Next, the texture image is filtered using non-local means (NLM). Then, the cartoon image is added to the filtered texture image to obtain the pre-processed image. Finally, the pre-processed image is fed into the improved wavelet residual convolutional neural network (WRCNN) to obtain the denoised image. In addition, we introduce a compound loss in the wavelet domain that combines the mean squared error loss with a directional regularization loss to separate structural details from noise more thoroughly.
RESULTS:
The peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) of the CT images processed by the proposed model are 33.4229 dB and 0.9158, respectively, which are higher than those of the state-of-the-art methods used for comparison. The study also shows that the proposed model obtains better results both visually and numerically, especially in preserving structural details.
CONCLUSIONS:
The proposed new model is feasible and effective in improving the quality of LDCT images.
Keywords
Introduction
Computed tomography (CT) is a commonly used medical imaging method [1]. It scans the patient with X-rays from multiple angles to produce cross-sectional images of the human body, which are used to detect pathological abnormalities such as tumors, pulmonary nodules, and vascular diseases [2]. However, ionizing radiation from X-ray CT may induce cancer and other genetic diseases in patients [3]. It is therefore recommended that patients be scanned with a lower radiation dose. However, low-dose computed tomography (LDCT) images contain many streak artifacts, which degrade image quality. Special image reconstruction and processing algorithms are therefore required to address the noise and artifacts in LDCT images.
Currently, methods for removing noise and artifacts from LDCT images fall into three main categories [4, 5]: (1) sinogram domain filtration, (2) iterative reconstruction, and (3) image domain post-processing. Sinogram filtering directly processes raw data before reconstruction (i.e., filtered back projection [FBP]). Iterative reconstruction usually integrates prior knowledge into the objective function as a penalty term in order to smooth the noise in the image. Both approaches suffer from difficulties in acquiring projection data and from poor algorithm portability. Moreover, iterative reconstruction is computationally intensive and requires a large amount of storage. In contrast, image domain post-processing does not depend on the projection data, which avoids the shortcomings of the above methods. This kind of algorithm has therefore become a popular research topic in LDCT image noise reduction; examples include block-matching and three-dimensional filtering (BM3D) [6], non-local means (NLM) [7], K-singular value decomposition (K-SVD) [8], and an improved weighted nuclear norm minimization (WNNM) [9]. These methods have achieved good results, but problems remain, such as missing texture details or the introduction of new artifacts.
In recent years, researchers have proposed many Convolutional Neural Network (CNN)-based and Generative Adversarial Network (GAN)-based methods for LDCT image denoising because of their powerful feature learning and mapping capabilities [10–15]. However, compared to CNN models, the training process of GAN-based deep learning models is unstable and suffers from poor generalization and inadequate feature extraction [16]. Specifically, Chen et al. [17] utilized a deep CNN to map LDCT images to their corresponding normal-dose counterparts in a patch-by-patch fashion, which has great potential for reducing artifacts and protecting structures. Subsequently, Chen et al. proposed the residual encoder-decoder convolutional neural network (RED-CNN) [18] to improve LDCT images. Maryam et al. [19] proposed a deep neural network for LDCT denoising that used dilated convolution instead of standard convolution to capture more context information in fewer layers. Although these methods can effectively remove noise, they lead to blurred edges. Kang et al. [20] applied the wavelet transform to noisy images and proposed a cascaded convolutional framework in the wavelet domain [21], which significantly improved the network performance and preserved detailed textures. To solve the problems that arise as the number of network layers increases, Kang et al. proposed a wavelet domain residual network (WavResNet) [22] to recover LDCT images. Chen et al. [23] combined the wavelet transform and a residual network to process noisy images, which achieved a better denoising effect. Jifara et al. [24] designed an efficient deep CNN model that combines residual learning and batch normalization for image denoising. To retain more details, Liang et al. [25] proposed an edge enhancement-based densely connected convolutional neural network (EDCNN), which designs an edge enhancement module using Sobel convolution and constructs a densely connected model to fuse the extracted edge information for LDCT image denoising.
In recent decades, noise reduction methods based on image cartoon-texture decomposition have been widely studied. For LDCT images, normal tissue structures mostly lie in the cartoon part, while image edges, noise, and artifacts are distributed in the texture part. Accordingly, Fu et al. [26] proposed a medical image denoising algorithm based on total variation cartoon-texture decomposition, and Wang et al. [27] proposed a medical image filtering algorithm based on local total variation (LTV) cartoon-texture decomposition. These decomposition-based noise reduction methods can maintain the structural integrity of the cartoon part and the fine structure of the texture part while reducing noise. However, because noise and artifacts in low-dose CT originate in the projections, decomposition-based denoising methods have difficulty removing them by processing only the decomposed texture image.
To solve the above problems, we also remove noise and artifacts from the decomposed cartoon images. Building on the study of image decomposition and the wavelet residual CNN [23], we propose a denoising model that combines LTV with an improved wavelet residual convolutional neural network (LTV-WRCNN). Specifically, the LTV model is first used to decompose the LDCT image into a cartoon image and a texture image. Second, NLM is used to process the texture image. Then, the pre-processed LDCT image is obtained by adding the cartoon image and the filtered texture image. Finally, the pre-processed image and the normal-dose CT (NDCT) image are input into the improved wavelet residual convolutional neural network (WRCNN) to obtain the denoised image. The main contributions of this study are as follows:
(1) The LTV model is used for image decomposition, and NLM is used to process the texture image.
(2) An LTV-WRCNN framework is proposed, which combines the image decomposition of the LTV model, the learning ability of the CNN, and the feature extraction ability of the wavelet transform.
(3) A directional regularization loss function is defined, which effectively describes the smoothing of streak noise in LDCT images and further preserves the texture information.
Related works
Image decomposition
Image decomposition splits an image into cartoon and texture components using mathematical tools such as spatial characterization and sparse representation [28]. It can be stated as follows: given a known image f(x, y), two components u(x, y) and v(x, y) are sought such that f(x, y) = u(x, y) + v(x, y), where u(x, y) represents the cartoon component and v(x, y) represents the texture component. Usually, the general form of the image decomposition model can be described as:
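A generic form that is common in the variational decomposition literature is sketched below. The functionals $F_1$ and $F_2$, the function spaces $X_1$ and $X_2$, and the weight $\lambda$ are notation assumed here for illustration and may differ from the notation used by the authors.

```latex
\inf_{(u,v)\,\in\, X_1 \times X_2}
\bigl\{\, F_1(u) + \lambda\, F_2(v) \;:\; f = u + v \,\bigr\}
```

In this reading, $F_1$ penalizes oscillation in the cartoon component $u$ (for example, its total variation), $F_2$ is a norm that stays small for oscillatory textures $v$, and $\lambda > 0$ balances the two terms.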
Among them, the LTV image decomposition method has received increasing attention. The method uses a pair of nonlinear low-pass and high-pass filters to quickly compute an approximate solution of the original variational problem. That is, the LTV around each image pixel is calculated as a local indicator for that pixel and compared with the LTV after low-pass filtering; this comparison determines whether the pixel belongs to the cartoon or the texture component [33]. According to the theory of variation and scale space, the cartoon part has a smaller TV, whereas the texture part has a smaller norm owing to its higher oscillation frequency. In a cartoon region, the LTV of a pixel changes only slightly after convolution with the low-pass filter, so the pixel is classified as a cartoon point. In a texture region, the LTV of a pixel decays rapidly after low-pass filtering owing to its oscillatory nature, so the pixel is classified as a texture point. The texture component v is obtained by subtracting the cartoon component u from the original image f. In this study, the LTV model was used for image decomposition.
The discrete wavelet transform (DWT) is a discretization of the scales and translations of the fundamental wavelet [34]. A two-dimensional DWT decomposes an image into one low-frequency component and three high-frequency components in different directions: horizontal, vertical, and diagonal.
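The four-sub-band decomposition can be illustrated with a single-level orthonormal Haar transform. This is a minimal NumPy sketch for illustration only; the function names `haar_dwt2` and `haar_idwt2` are hypothetical, and the naming of the detail sub-bands varies between libraries.

```python
import numpy as np

def haar_dwt2(img):
    """Single-level orthonormal 2-D Haar DWT of an array with even height and width.

    Returns (approx, row_diff, col_diff, diag), each half the size of img.
    """
    a = img[0::2, 0::2]  # top-left pixel of every 2x2 block
    b = img[0::2, 1::2]  # top-right
    c = img[1::2, 0::2]  # bottom-left
    d = img[1::2, 1::2]  # bottom-right
    approx = (a + b + c + d) / 2.0    # low-frequency component
    row_diff = (a + b - c - d) / 2.0  # top row minus bottom row (horizontal edges)
    col_diff = (a - b + c - d) / 2.0  # left column minus right column (vertical edges)
    diag = (a - b - c + d) / 2.0      # diagonal detail
    return approx, row_diff, col_diff, diag

def haar_idwt2(approx, row_diff, col_diff, diag):
    """Inverse of haar_dwt2: rebuild the full-resolution image."""
    h, w = approx.shape
    img = np.empty((2 * h, 2 * w), dtype=np.float64)
    img[0::2, 0::2] = (approx + row_diff + col_diff + diag) / 2.0
    img[0::2, 1::2] = (approx + row_diff - col_diff - diag) / 2.0
    img[1::2, 0::2] = (approx - row_diff + col_diff - diag) / 2.0
    img[1::2, 1::2] = (approx - row_diff - col_diff + diag) / 2.0
    return img

rng = np.random.default_rng(0)
image = rng.random((8, 8))
bands = haar_dwt2(image)
reconstructed = haar_idwt2(*bands)
```

Each sub-band has half the height and width of the input, and applying the inverse transform to the four sub-bands recovers the original image up to floating-point error.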
Residual learning was originally proposed to address the problem of performance degradation as the number of layers increases [35]. The residual network learns a residual mapping, which effectively alleviates the vanishing gradient problem and allows much deeper networks to be trained. Figure 1 shows the residual learning structure. Assuming that the input is X and the output is Y, the original mapping can be expressed as H(X) = Y and the residual mapping as F(X) = Y - X, where F(X) is learned by the residual unit. The original mapping can then be expressed as H(X) = Y = F(X) + X. The image denoising algorithm based on the wavelet transform and CNN proposed in [23] combines the advantages of the wavelet transform and residual learning. In the training process, the decomposed image is input into the residual network, and the mean squared error (MSE) loss is back-propagated until the network converges. Residual learning can transfer image detail information during the noise reduction process, which helps preserve image details and recover the image, and greatly improves the efficiency of the algorithm [36].
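The relation H(X) = F(X) + X can be made concrete with a toy example. The sketch below is only an illustration: `residual_unit` is a hypothetical helper, and the "ideal" residual function is constructed by hand rather than learned.

```python
import numpy as np

def residual_unit(x, residual_fn):
    """Compute H(x) = F(x) + x, with residual_fn standing in for F."""
    return residual_fn(x) + x

# Toy denoising view: if the input is clean + noise, the ideal residual is -noise.
rng = np.random.default_rng(0)
clean = rng.random((4, 4))
noise = 0.1 * rng.standard_normal((4, 4))
noisy = clean + noise

def ideal_residual(x):
    """Hypothetical perfectly trained F for this single example."""
    return -noise

restored = residual_unit(noisy, ideal_residual)
identity = residual_unit(noisy, np.zeros_like)  # F = 0 passes the input through
```

When F outputs zero the unit reduces to the identity, which is why adding residual units should not, in principle, make a deeper network harder to fit than a shallower one.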

Fig. 1. Illustration of the residual learning structure.
In this section, the LTV-WRCNN framework is introduced. Figure 2 shows the denoising flowchart. First, the LDCT image is decomposed using the LTV model to obtain cartoon and texture images, and the texture image is denoised by NLM. Then, the pre-processed LDCT image is obtained by combining the denoised texture image with the cartoon image. Next, the pre-processed image and the NDCT image are each subjected to the DWT to obtain four sub-band coefficients, which are concatenated into a single tensor as the input of the proposed WRCNN. The wavelet coefficient prediction network then outputs four artifact-suppressed sub-band coefficients. Finally, the denoised result is reconstructed in the spatial domain using the inverse DWT (IDWT). The LTV model and the proposed WRCNN are described in Sections 3.1 and 3.2, respectively.

Fig. 2. Illustration of the LTV-WRCNN framework.
Since LTV can effectively extract the texture component of an image while keeping edges clear, it is suitable for cartoon-texture separation of noisy images. The specific process of cartoon-texture decomposition based on the LTV is as follows: for any pixel x in image f, its LTV is defined as:
Thereafter, x → λ_σ(x) is defined as the relative rate of change of the LTV at point x in the image.
Equation (4) shows that when λ_σ → 0, the reduction of LTV after low-pass filtering is small and the low-pass filter has little effect on the LTV of the point, then the point of the part belongs to the cartoon component u; conversely, when λ_σ → 1, the relative rate of change is large and the reduction is fast, at which time the point can be judged to be the texture component v. Therefore, a pair of fast nonlinear high-pass and low-pass filters are obtained from the weighted average of L_σ * f and f according to the relative rate of change of the LTV of the points in the image, denoted as
It should be noted that the texture part contains both important detail information and artifacts. Therefore, the texture part needs to be denoised to remove noise and artifacts before it is added back to the decomposed cartoon component u to obtain the final pre-processed image. Because the NLM method exploits the self-similarity of the image and makes full use of its redundant information, it is used to process the texture image and achieve preliminary denoising.
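The LTV-based decomposition described above can be sketched as follows. This is a minimal NumPy illustration that follows the commonly cited fast cartoon-plus-texture filter formulation, not necessarily the exact equations used by the authors: the Gaussian low-pass filter, the forward-difference gradient, the value of `sigma`, and the soft-threshold parameters `a1` and `a2` are all assumptions, and the function names are hypothetical.

```python
import numpy as np

def gaussian_blur(img, sigma):
    """Separable Gaussian low-pass filter (stand-in for L_sigma), reflect padding."""
    radius = max(1, int(3 * sigma))
    t = np.arange(-radius, radius + 1)
    kernel = np.exp(-(t ** 2) / (2.0 * sigma ** 2))
    kernel /= kernel.sum()
    padded = np.pad(img, radius, mode="reflect")
    rows = np.apply_along_axis(lambda r: np.convolve(r, kernel, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda c: np.convolve(c, kernel, mode="valid"), 0, rows)

def gradient_magnitude(img):
    """Forward-difference gradient magnitude |Df| (zero difference at the far border)."""
    gx = np.diff(img, axis=1, append=img[:, -1:])
    gy = np.diff(img, axis=0, append=img[-1:, :])
    return np.hypot(gx, gy)

def ltv_decompose(f, sigma=2.0, a1=0.25, a2=0.5, eps=1e-8):
    """Split f into cartoon u and texture v with f = u + v."""
    f = np.asarray(f, dtype=np.float64)
    low = gaussian_blur(f, sigma)                             # L_sigma * f
    ltv_f = gaussian_blur(gradient_magnitude(f), sigma)       # LTV of f
    ltv_low = gaussian_blur(gradient_magnitude(low), sigma)   # LTV after low-pass
    # Relative reduction rate of the LTV: near 0 for cartoon, near 1 for texture.
    lam = np.clip((ltv_f - ltv_low) / (ltv_f + eps), 0.0, 1.0)
    # Soft threshold turning lam into a blending weight in [0, 1].
    w = np.clip((lam - a1) / (a2 - a1), 0.0, 1.0)
    u = w * low + (1.0 - w) * f   # weighted average of L_sigma * f and f
    v = f - u                     # texture is the remainder
    return u, v

# A fine vertical stripe pattern is highly oscillatory, so it should land in v.
stripes = np.tile(np.array([1.0, -1.0]), (32, 16))
u_stripes, v_stripes = ltv_decompose(stripes)
```

In the full pipeline the texture image `v` would then be filtered with NLM and added back to `u`; that step is omitted here to keep the sketch short.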
The WRCNN model is mainly composed of a DWT, convolutional layers, residual blocks, and an IDWT. Except for the last convolutional layer, each convolutional layer uses ReLU as the activation function, as shown in Fig. 2. In WRCNN, the ‘haar’ wavelet is used to decompose the pre-processed image. The down-sampling in the DWT effectively expands the receptive field and helps restore image details, and the wavelet decomposition significantly reduces the network complexity. The wavelet prediction network consists of 11 convolutional layers, which can be grouped into two parts. The first part, marked by the red dashed box, contains 3 convolutional layers with kernel sizes of 9×9, 1×1, and 5×5 and with 64, 32, and 1 channels, respectively. Using convolution kernels of different sizes captures features at several scales, so more structural information can be obtained and noise in different directions can be removed, which improves the performance of the network. The second part, marked by the green dashed box, contains 8 convolutional layers: a first “Conv+ReLU” layer, followed by 3 residual blocks, and then a last layer with only “Conv”. The 3 residual blocks have the same structure, each consisting of 2 identical convolutional layers. All convolution kernels in the second part are 3×3, and the number of channels is 64, except for the last layer (one channel). The residual connections in this part effectively transfer information from the bottom layers to the top layers and enhance texture details.
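The layer bookkeeping in this description can be checked with a short sketch. It assumes, for illustration only, that the two parts are applied one after the other with stride-1, undilated convolutions; how the parts are actually connected is defined by Fig. 2, and the names below are hypothetical.

```python
from typing import List, Tuple

# (kernel size, output channels, followed by ReLU)
Layer = Tuple[int, int, bool]

# First part: 9x9, 1x1, and 5x5 convolutions with 64, 32, and 1 channels.
PART1: List[Layer] = [(9, 64, True), (1, 32, True), (5, 1, True)]

# Second part: Conv+ReLU, then 3 residual blocks of 2 convolutions each, then Conv.
RES_BLOCKS: List[Layer] = [(3, 64, True)] * (3 * 2)
PART2: List[Layer] = [(3, 64, True)] + RES_BLOCKS + [(3, 1, False)]

def receptive_field(layers: List[Layer]) -> int:
    """Receptive field (in input samples) of stacked stride-1 convolutions."""
    return 1 + sum(kernel - 1 for kernel, _, _ in layers)

all_layers = PART1 + PART2
num_conv_layers = len(all_layers)
num_relu_layers = sum(1 for _, _, relu in all_layers if relu)
rf_wavelet_domain = receptive_field(all_layers)
```

Under these assumptions the stack has 11 convolutional layers, 10 of which are followed by ReLU, and a receptive field of 29×29 wavelet coefficients. Because each Haar coefficient summarizes a 2×2 pixel block, this corresponds to a larger area in the image domain.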
In the training phase, the network uses the MSE loss function in the wavelet domain:
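A minimal sketch of an MSE loss evaluated on wavelet coefficients is given below, assuming an orthonormal single-level Haar transform and sub-bands stacked along the first axis. The helper names are hypothetical, and only the MSE term is sketched; the directional regularization term is left out.

```python
import numpy as np

def haar_subbands(img):
    """Stack the four single-level orthonormal Haar sub-bands, shape (4, H/2, W/2)."""
    a, b = img[0::2, 0::2], img[0::2, 1::2]
    c, d = img[1::2, 0::2], img[1::2, 1::2]
    return np.stack([a + b + c + d, a + b - c - d, a - b + c - d, a - b - c + d]) / 2.0

def wavelet_mse(pred_coeffs, target_coeffs):
    """Mean squared error between predicted and reference wavelet coefficients."""
    diff = np.asarray(pred_coeffs, dtype=np.float64) - np.asarray(target_coeffs, dtype=np.float64)
    return float(np.mean(diff ** 2))

rng = np.random.default_rng(0)
ndct = rng.random((64, 64))                           # stand-in for an NDCT patch
ldct = ndct + 0.05 * rng.standard_normal((64, 64))    # stand-in for a noisy patch

loss = wavelet_mse(haar_subbands(ldct), haar_subbands(ndct))
image_mse = float(np.mean((ldct - ndct) ** 2))
```

Because the orthonormal Haar transform preserves energy, the MSE over all four sub-bands equals the image-domain MSE. The benefit of working in the wavelet domain comes from the sub-band structure seen by the network and from any per-sub-band terms added to the loss.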
Dataset and parameter settings
To demonstrate the denoising ability of the LTV-WRCNN framework on LDCT images, we used clinical CT images from the “2016 NIH-AAPM-Mayo Clinic Low Dose CT Grand Challenge” as the dataset [37]. The dataset contains abdominal NDCT images and their corresponding simulated low-dose (quarter-dose) abdominal images, with a size of 512×512 and a slice thickness of 3.0 mm. Specifically, we applied LTV to the unprocessed DICOM images, processed the texture image using NLM, and added it to the cartoon image; we then normalized the resulting image and fed it to the WRCNN model. In the experiment, 762 pairs of images were used as the training set, and 35 pairs of images were used as the test set. To expand the dataset, we cropped the training images with a stride of 64 to generate 46,768 patches of size 64×64.
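The patch grid can be sketched with a small helper. This is an illustrative calculation that assumes a regular grid with no padding and no patch filtering; `patches_per_image` and `patch_origins` are hypothetical names.

```python
def patches_per_image(height, width, patch=64, stride=64):
    """Number of full patches on a regular grid with the given stride."""
    rows = (height - patch) // stride + 1
    cols = (width - patch) // stride + 1
    return rows * cols

def patch_origins(height, width, patch=64, stride=64):
    """Top-left (row, col) coordinates of every patch on the grid."""
    return [
        (r, c)
        for r in range(0, height - patch + 1, stride)
        for c in range(0, width - patch + 1, stride)
    ]

per_image = patches_per_image(512, 512)
total_for_training_set = 762 * per_image
```

Under these assumptions a 512×512 image yields 64 patches, so 762 training images would give 48,768 patches. The reported 46,768 is slightly lower, which may indicate that some patches were discarded or that the cropping details differ from this sketch.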
In the training phase, the LTV-WRCNN framework was trained using Keras, which supports GPU computing. We set a series of parameters and adjusted them manually so that the processed CT images achieved the best visual quality and objective scores. The model was optimized using adaptive moment estimation (Adam) with β1 = 0.9 and β2 = 0.999. The batch size was set to 32. The learning rate was initially set to 0.001 and was then halved every 10 epochs. The maximum number of training epochs was set to 100.
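The step decay described above can be written as a small function. This sketch assumes zero-indexed epochs and that the first halving happens at epoch 10; the function name is hypothetical.

```python
def learning_rate(epoch, base_lr=1e-3, factor=0.5, step=10):
    """Step decay: multiply the base rate by `factor` once every `step` epochs."""
    return base_lr * factor ** (epoch // step)

schedule = [learning_rate(epoch) for epoch in range(100)]
```

A function of this shape could be passed to a Keras `LearningRateScheduler` callback, although the training code used by the authors is not shown here.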
Specifically, the algorithms were run in the following environment: Windows 10 operating system, Intel(R) Core(TM) i9-10900X CPU @ 3.70 GHz, 32 GB RAM, and 64-bit Python 3.8.
Experimental results
To evaluate our method, two typical LDCT images (Figs. 3(a) and 4(a)) from the test set were selected for demonstration. LTV-WRCNN was compared with NLM, local total variation with non-local means (LTV-NLM), RED-CNN, WavResNet, and EDCNN. Both subjective and objective evaluation metrics were used to assess the experimental results.

Fig. 3. Results of Fig. 3(a) using different models. In all images, the blue arrows indicate the flat area and the orange arrows point to the edge/detail structure.
The subjective aspects primarily evaluate the noise reduction effect by assessing noise attenuation, edge retention, and area smoothness. Figures 3 and 4 show the results of the two test images processed by different methods.
Figures 3(a) and 4(a) are the LDCT images, and Figs. 3(b) and 4(b) are the corresponding NDCT images. The NLM results in Figs. 3(c) and 4(c) are over-smoothed. The results in Figs. 3(d) and 4(d) show that considerable noise and artifacts remain after the LTV-NLM method. RED-CNN effectively suppresses noise and artifacts, but because it uses MSE as the loss function, it tends to lose details and produce blurred results (see Figs. 3(e) and 4(e)). Images (f) and (g) in Figs. 3 and 4 show that EDCNN produces fewer streak artifacts than WavResNet. In contrast, Figs. 3(h) and 4(h) show that the proposed LTV-WRCNN method has the best overall visual effect and retains the edge information while removing the noise.
For closer inspection, three regions of interest (ROIs) are marked by red squares in Figs. 3(a) and 4(a). The upper-right inset in Fig. 3 shows tissue containing edges and details, and the lower-right inset shows tissue with a flat area. The upper-right inset in Fig. 4 marks the lesion. In Figs. 3 and 4, the blue arrows indicate the flat area, and the orange arrows indicate the edge/detail structure. In Figs. 3(c) and 4(c), the ROIs lose some details due to over-smoothing; see the arrows in the lower-right inset in Fig. 3(c) and the lesion marked by the circle in Fig. 4(c). The ROIs in Figs. 3(d) and 4(d) show that the LTV-NLM method cannot effectively remove noise and artifacts, as marked by the arrows. Although RED-CNN and WavResNet suppress noise well, they cause blurring and feature loss, as shown in Figs. 3(e)–(f) and 4(e)–(f). In contrast, the edge details in Figs. 3(g) and 4(g) are clearer. However, compared to Figs. 3(g) and 4(g), Figs. 3(h) and 4(h) contain less noise and fewer artifacts.

Fig. 4. Results of Fig. 4(a) using different models. In all images, the blue arrows indicate the flat area and the orange arrows point to the edge/detail structure.
To illustrate the advantages of the proposed method, Figs. 5 and 6 display the absolute difference images (relative to the NDCT images) of the results obtained by the different methods. The difference images of LTV-WRCNN contain less residual noise and artifact information, which indicates that the proposed method suppresses artifacts and noise better and produces images closer to the NDCT images.

Fig. 5. Absolute difference image relative to the corresponding NDCT image in Fig. 3.

Fig. 6. Absolute difference image relative to the corresponding NDCT image in Fig. 4.
Six objective indexes are used to evaluate the proposed method: peak signal-to-noise ratio (PSNR), structural similarity (SSIM), feature similarity (FSIM), gradient magnitude similarity deviation (GMSD), visual information fidelity (VIF), and average gradient (Avegrad). The higher the PSNR value, the less noise in the image. SSIM and FSIM both measure the similarity of two images. SSIM measures image similarity in terms of luminance, contrast, and structure, while FSIM extracts features by combining phase congruency and gradient magnitude and uses feature similarity for quality evaluation. For both, a larger value indicates a higher degree of similarity. GMSD indicates the pixel-wise similarity of the gradient magnitude maps of the reference and processed images; in general, a good result has a low GMSD value. VIF quantifies the information that could ideally be extracted by the brain, and its score indicates the amount of information shared between the reference and processed images [38]. The average gradient measures the sharpness of the image and reflects the contrast of small details and the texture variation in the image. Therefore, the larger the average gradient, the sharper the image and the better the texture preservation.
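Two of these indexes are simple enough to sketch directly. The code below assumes images normalized to a known `data_range` and uses one common definition of the average gradient (the mean of the root-mean-square of forward differences); the exact definitions and normalization used in the experiments are not given here, so these are assumptions.

```python
import numpy as np

def psnr(reference, test, data_range=1.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(data_range^2 / MSE)."""
    ref = np.asarray(reference, dtype=np.float64)
    tst = np.asarray(test, dtype=np.float64)
    mse = np.mean((ref - tst) ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(data_range ** 2 / mse))

def average_gradient(img):
    """Mean of sqrt((gx^2 + gy^2) / 2) using forward differences."""
    img = np.asarray(img, dtype=np.float64)
    gx = img[:-1, 1:] - img[:-1, :-1]   # horizontal forward difference
    gy = img[1:, :-1] - img[:-1, :-1]   # vertical forward difference
    return float(np.mean(np.sqrt((gx ** 2 + gy ** 2) / 2.0)))

reference = np.zeros((4, 4))
shifted = np.full((4, 4), 0.1)                    # uniform error of 0.1
ramp = np.tile(np.arange(8, dtype=float), (8, 1)) # intensity rises by 1 per column

psnr_value = psnr(reference, shifted)
ramp_sharpness = average_gradient(ramp)
```

A uniform error of 0.1 on a unit data range gives an MSE of 0.01 and therefore a PSNR of 20 dB, while a perfectly flat image has an average gradient of zero.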
Table 1 shows the average values of PSNR, SSIM, FSIM, GMSD, VIF, and Avegrad for all images in the test set after denoising with the different methods, where bold indicates the best result. Table 1 shows that the LTV-WRCNN framework has the highest PSNR and SSIM values and the lowest GMSD value, indicating that the resulting image has less noise and is more similar to the NDCT image. The FSIM values of the different methods are close; the FSIM value of WavResNet is 0.0001 higher than that of LTV-WRCNN, indicating that all methods have roughly the same ability to extract feature information. The VIF value of LTV-WRCNN is the largest, which indicates that the obtained image shares the most information with the NDCT image. The average gradient of the proposed model is also the largest, which means that the image processed by the LTV-WRCNN model retains more details. In conclusion, the model proposed in this study yields good results in terms of artifact suppression and structure protection.
Quantitative results of the different
To determine the optimal structure for the proposed LTV-WRCNN model, the network structure and the directional regularizer are discussed in detail in this section.
(1) Structure and Module
To explore the effect of each component of the LTV-WRCNN model, we conducted an ablation experiment on its structure. First, we designed a basic model (WRCNN) by removing the LTV-NLM module from the structure shown in Fig. 2. Then, we designed a model named LTV-RCNN by removing the DWT module. Finally, we removed the residual connections from the proposed model to obtain a model named LTV-WCNN. All models were trained and tested using the same strategy so that their capabilities could be compared fairly. The results of the ablation experiments are shown in Fig. 7. Table 2 shows the PSNR, SSIM, FSIM, GMSD, VIF, and Avegrad of the test images.

Fig. 7. Denoising results of different network structures.
Table 2. Quantitative evaluation indexes of the different network structure experiments
Table 2 and Fig. 7 together show that the results of the WRCNN model still contain noise and artifacts, so structural information is not retained well. Because the DWT is removed from the LTV-RCNN model, contour information of the LDCT image is easily lost. The LTV-WCNN model has difficulty retaining details because the residual connections are removed. Compared with the other models, the LTV-WRCNN model achieves a better noise reduction effect.
(2) Directional Regularizer Loss (DRL)
To verify the effectiveness of the proposed directional regularizer, we trained two versions of the LTV-WRCNN model, using the loss function with and without the directional regularizer. The experimental results are shown in Fig. 8. Table 3 shows the PSNR, SSIM, FSIM, GMSD, VIF, and Avegrad of the test images.

Fig. 8. Denoising results of the proposed LTV-WRCNN model with and without directional regularizer.
Table 3. Quantitative evaluation indexes of the different loss function experiments
Table 3 shows that the LTV-WRCNN model trained with the loss function that includes the directional regularizer achieves significantly higher PSNR, SSIM, FSIM, VIF, and Avegrad values, as well as a smaller GMSD. As can be seen from Fig. 8(a) and (b), both loss functions efficiently remove the streak noise. On closer inspection, however, the MSE-only loss used in Fig. 8(a) makes the processed image too smooth, whereas in Fig. 8(b) the model trained with the compound loss function effectively separates the structural texture information of the CT image from the streak noise. This experiment further demonstrates the effectiveness of the proposed directional regularizer for detail preservation.
In this study, we proposed the LTV-WRCNN framework to suppress noise and streak artifacts in LDCT images. The innovation of the algorithm lies in combining the LTV model with a CNN that incorporates the DWT and residual connections. The LTV model effectively decomposes LDCT images into a texture component, which contains high-frequency information such as details and noise/artifacts, and a cartoon component, which contains structural information. In WRCNN, we use the learning ability of the CNN and the feature extraction ability of the wavelet transform to reduce noise. In addition, we defined a directional regularization loss function to remove noise/artifacts effectively and further preserve the texture information. Extensive experiments on the Mayo data demonstrate that LTV-WRCNN achieves good results in both subjective visual assessment and objective evaluation. Therefore, the proposed algorithm is expected to be of clinical importance for the further promotion and development of LDCT imaging technology.
Some issues remain to be addressed. Besides the LTV model, other filters, such as bilateral filtering, can also effectively decompose LDCT images; we will therefore explore more filters for LDCT image decomposition and compare their efficiency. Furthermore, our method requires paired CT images, which are not easy to obtain in practice, so in the future we will also explore moving from supervised to unsupervised denoising architectures.
Acknowledgments
This work was supported by the National Natural Science Foundation of China (61801438), the Science and Technology Innovation Project of Colleges and Universities of Shanxi Province (2020L0282), and the Natural Science Foundation of Shanxi Province (201901D111153).
