Abstract
BACKGROUND:
Coronary artery segmentation is a prerequisite in computer-aided diagnosis of Coronary Artery Disease (CAD). However, segmentation of coronary arteries in Coronary Computed Tomography Angiography (CCTA) images faces several challenges. Current segmentation approaches do not address these challenges effectively and still suffer from problems such as the need for manual interaction and low segmentation accuracy.
OBJECTIVE:
A Multi-scale Feature Learning and Rectification (MFLR) network is proposed to tackle the challenges and achieve automatic and accurate segmentation of coronary arteries.
METHODS:
The MFLR network introduces a multi-scale feature extraction module in the encoder to effectively capture contextual information under different receptive fields. In the decoder, a feature correction and fusion module is proposed, which employs high-level features containing multi-scale information to correct and guide low-level features, achieving fusion between the two-level features to further improve segmentation performance.
RESULTS:
The MFLR network achieved the best performance on the dice similarity coefficient, Jaccard index, Recall, F1-score, and 95% Hausdorff distance, for both in-house and public datasets.
CONCLUSION:
Experimental results demonstrate the superiority and good generalization ability of the MFLR approach. This study contributes to the accurate diagnosis and treatment of CAD, and it also informs other segmentation applications in medicine.
Introduction
The coronary arteries supply oxygen and nutrients to the heart, sustaining its continuous beating. Coronary Artery Disease (CAD) is currently a leading contributor to global mortality [1, 2]. Coronary Computed Tomography Angiography (CCTA) is widely used in the diagnosis of CAD because it produces high-resolution images, is non-invasive, and is cost-effective [3, 4]. Based on the coronary arteries segmented from CCTA images, the degree of vessel stenosis can be quantified, the fractional flow reserve value [5] can be calculated, and the anatomical names of the coronary arteries can be identified [6], all of which assist radiologists in judging the severity of CAD and planning treatment. Accurate coronary artery segmentation is therefore crucial for diagnosing and treating CAD. Because manual segmentation by radiologists is time-consuming and laborious, automatic segmentation approaches are urgently needed.
Segmentation of coronary arteries in CCTA faces several challenges. The first is a severe inter-class imbalance, as shown in Fig. 1 (a) and (b). The coronary arteries (the red labels in Fig. 1 (a) and (b)) occupy an extremely small proportion (usually less than 0.1%) of the whole CCTA image, so the segmented coronary arteries are prone to discontinuities. The second is an intra-class imbalance, as shown in Fig. 1 (a). The proximal branches near the ascending aorta have a larger diameter than the other branches, so the smaller branches are more easily missed by a segmentation approach. Third, the coronary artery branches are scattered throughout the CCTA image (as shown in Fig. 1 (b)), and their shape and structure vary significantly from person to person, as shown in Fig. 1 (c) and (d). This makes it difficult for an algorithm to identify coronary arteries accurately.

Challenges in coronary artery segmentation. (a) and (b) present the coronary arteries from a cross-section perspective, and (c) and (d) show the 3D coronary arteries. The red labels in the sub-figures indicate coronary arteries. In Fig. 1 (a), the yellow arrow and green arrow show the proximal and distal branches, respectively, and it can be seen that the proximal branch has a larger diameter than the other branches (e.g. the distal branch).
Some researchers have adopted traditional segmentation approaches, such as the threshold approach [7], the tracking method [8, 9], and the level-set algorithm [10, 11], to depict the coronary artery contour. These traditional approaches do not require substantial computing resources. However, they suffer from problems such as low segmentation accuracy, the need for manual intervention, and the need to construct a complex model. Shams et al. [7] first applied Hessian-based filtering to enhance the arteries and then used the Otsu threshold approach [12] to delineate the vessels. Although the threshold approach is easy to execute, the many tissues present in CCTA images make it difficult to achieve high segmentation accuracy with thresholding alone. Chen et al. [8] combined the geometric moment approach and the snake algorithm to track arteries. Zhou et al. [9] exploited the region-growing approach to track vessels. However, these tracking algorithms require the seed points for the coronary arteries to be placed manually. Khokhar et al. [10] and Ge et al. [11] integrated a curvature feature constraint and an area constraint, respectively, into the level-set approach to segment coronary arteries. These algorithms usually require a complex model and constraints to obtain promising performance. Gao et al. [13] first combined the Hough transform and the level-set approach to segment the aorta, then applied the region-growing algorithm to obtain the connection between the coronary artery and the aorta, and finally segmented the arteries using projection and dynamic programming approaches. Although the arteries segmented by Gao's approach agree well with the ground truth, the execution process of the algorithm is complex.
With the improvement of computer hardware, researchers have increasingly adopted deep learning to delineate coronary arteries [14]. Moeskops et al. [15] showed that a single convolutional neural network is suitable for depicting the contour of arteries. Kong et al. [16] combined the Fully Convolutional Network (FCN) [17] with a recurrent neural network to segment arteries. Shen et al. [18] integrated an attention gate into the FCN to attenuate irrelevant regions and further refined the segmentation results using the level-set algorithm. Tian et al. [19] first adopted a V-shaped network to segment coronary arteries and then exploited the region-growing approach to smooth the contour of the arteries. However, Shen's and Tian's approaches treated the level-set and region-growing algorithms, respectively, as an additional post-processing step, thereby increasing the complexity of execution. Multi-scale feature representations can help networks capture objects of different sizes and better learn the contextual information contained in images [20, 21], thereby improving performance in tasks such as detection [22, 23] and small target segmentation [24]. Zhu et al. [25] presented a Feature Fusion Network (FFNet) to segment coronary arteries. FFNet introduces dilated convolution [26] at the bottom of Unet [27] to utilize multi-scale information and applies a deep supervision strategy at the output to further improve performance. Dual Attention Unet (DAUnet) [28] exploits a dual-attention mechanism that fuses features between adjacent levels in the skip connection to enhance the ability to identify vessels. Dong et al. [29] proposed a Coronary Artery Segmentation Network (CAS-Net), which extracts multi-scale information in one of the encoder layers.
The exploitation of multi-scale information in these existing networks is insufficient, and there is still room for improvement in their segmentation accuracy. This has prompted us to explore the performance of multi-scale feature learning in coronary artery segmentation.
The contributions of this paper are as follows: (1) To address the challenges of coronary artery segmentation in CCTA images, this study proposes a novel Multi-scale Feature Learning and Rectification (MFLR) network to achieve automatic segmentation of coronary arteries; (2) The MFLR network introduces multi-scale modules in both the encoder and decoder to fully collect the contextual information in the image and better capture the scale and shape changes of coronary arteries; (3) Experimental results show that the MFLR network outperforms the comparison approaches, demonstrating the usefulness and necessity of fully exploiting multi-scale features in coronary artery segmentation.
Dilated convolution can expand the field of view by setting a large dilation rate (DR), and expressive features can be collected by using multiple dilated convolutions with different DRs [30, 31]. The shortcut connection in Res-Net [32] is beneficial for the propagation of gradients in the network and can improve the reusability of features [33, 34]. Taking guidance from the dilated convolution and Res-Net, the MFLR network was presented to solve the challenges faced in coronary artery segmentation. The structure of the MFLR network is shown in Fig. 2. The MFLR network mainly consists of two parts, i.e., the left encoder and the right decoder. In the encoder, we designed a Multi-scale Feature Extraction (MFE) module that employs dilated convolutions to produce rich feature representations from multiple receptive fields. In the decoder, a Feature Rectification and Fusion (FRF) module is proposed, which exploits high-level features with richer semantic information to guide and correct low-level features [35], and achieves the aggregation between the two-level features.

Diagram of multi-scale feature learning and rectification network.
To strengthen the learning and extraction of multi-scale features, the MFE module is proposed in the encoder; its structure is shown in Fig. 3. For the input feature map, the MFE module first performs a 3 × 3 × 3 convolution with a DR of 1, Instance Normalization (IN), and Rectified Linear Unit (ReLU) activation. The input feature map is also passed through a shortcut connection and summed with the convolved feature map to promote gradient propagation and feature reuse. Then, a 1 × 1 × 1 convolution, IN, and ReLU are applied to fuse the two feature maps. This process can be formulated as:
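The equation itself is not reproduced in this text. A form consistent with the description above might read as follows, where the symbol names are assumptions introduced for illustration: $f_{in}$ is the input feature map, $f_{1}$ is the fused output, and $\mathrm{Conv}^{d}_{k}$ is a convolution with kernel size $k$ and dilation rate $d$.

```latex
% Sketch only: reconstructed from the prose description, not copied from the source.
f_{1} = \mathrm{ReLU}\Big(\mathrm{IN}\Big(\mathrm{Conv}_{1\times1\times1}\big(
        f_{in} + \mathrm{ReLU}\big(\mathrm{IN}\big(\mathrm{Conv}^{d=1}_{3\times3\times3}(f_{in})\big)\big)
        \big)\Big)\Big)
```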

Multi-scale feature extraction module.
Next, a process similar to Equation (1) is repeated with dilated convolutions of different DRs. That is, on the aggregated feature map, a 3 × 3 × 3 convolution, IN, ReLU activation, summation, and a 1 × 1 × 1 convolution are performed in turn for each DR in {2, 3, 4}:
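The equations for these stages are not reproduced in this text. As a minimal PyTorch sketch of the MFE module as described in the prose, the block below chains four stages with DRs of {1, 2, 3, 4}. The class names `MFEStage` and `MFEModule` are hypothetical, and the sketch assumes the channel count stays constant inside the module and that padding equals the dilation rate so that the shortcut sum is shape-compatible.

```python
import torch
import torch.nn as nn


class MFEStage(nn.Module):
    """One stage: dilated 3x3x3 conv -> IN -> ReLU, shortcut sum, 1x1x1 fusion."""

    def __init__(self, channels: int, dilation: int):
        super().__init__()
        # Assumption: padding == dilation keeps the spatial size of a 3x3x3 kernel.
        self.dilated = nn.Sequential(
            nn.Conv3d(channels, channels, kernel_size=3,
                      padding=dilation, dilation=dilation),
            nn.InstanceNorm3d(channels),
            nn.ReLU(),
        )
        # 1x1x1 convolution, IN, and ReLU fuse the summed feature maps.
        self.fuse = nn.Sequential(
            nn.Conv3d(channels, channels, kernel_size=1),
            nn.InstanceNorm3d(channels),
            nn.ReLU(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Shortcut connection: add the input to the convolved feature map.
        return self.fuse(x + self.dilated(x))


class MFEModule(nn.Module):
    """Cascade of stages with increasing dilation rates (hypothetical layout)."""

    def __init__(self, channels: int, dilation_rates=(1, 2, 3, 4)):
        super().__init__()
        self.stages = nn.Sequential(
            *[MFEStage(channels, d) for d in dilation_rates]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.stages(x)
```

Applied to a tensor of shape `(batch, channels, depth, height, width)`, the module returns a tensor of the same shape; in a full encoder, a separate layer would change the channel count and resolution between levels.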
Coronary arteries are scattered in CCTA images and the segmentation of the arteries suffers from severe inter-class and intra-class imbalance problems. Therefore, for coronary artery segmentation, it is necessary to learn richer contextual information to obtain superior segmentation performance [37, 38]. To further exploit multi-scale information, we proposed the FRF module, whose structure is shown in Fig. 4.

Feature rectification and fusion module.
Let $f_{high}$ and $f_{low}$ represent the input high-level and low-level feature maps, respectively. Generally speaking, the features in high-level feature maps are more abstract and have smaller spatial resolution, but have a larger receptive field. In contrast to high-level feature maps, low-level feature maps retain more detailed information in the image, but the receptive field is relatively small. For high-level feature maps, the proposed FRF module adopts Global Average Pooling (GAP) [35], 3 × 3 × 3 convolution, and 1 × 1 × 1 convolution to obtain $f_{multi}$, which can capture richer scale information [39].
After the GAP, 3 × 3 × 3 convolution, and 1 × 1 × 1 convolution operations, $f_{multi}$ captures information at different scales. Then, a 1 × 1 × 1 convolution fuses the information from the different scales, and the attention map $f_{att}$ is obtained through the Sigmoid activation function.
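The equations of the FRF module are not reproduced in this text, so the following PyTorch block is only a rough sketch under several stated assumptions. It assumes the three high-level branches (GAP, 3 × 3 × 3 convolution, 1 × 1 × 1 convolution) are concatenated to form the multi-scale feature, that the attention map is upsampled trilinearly to the low-level resolution, and that rectification is an element-wise product followed by a sum with the upsampled high-level features. The class name `FRFSketch` and all layer names are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FRFSketch(nn.Module):
    """Hypothetical feature rectification and fusion block (assumed wiring)."""

    def __init__(self, high_channels: int, low_channels: int):
        super().__init__()
        # Three branches over the high-level feature map.
        self.gap = nn.AdaptiveAvgPool3d(1)
        self.gap_proj = nn.Conv3d(high_channels, low_channels, kernel_size=1)
        self.conv3 = nn.Conv3d(high_channels, low_channels, kernel_size=3, padding=1)
        self.conv1 = nn.Conv3d(high_channels, low_channels, kernel_size=1)
        # 1x1x1 convolution that fuses the concatenated branches.
        self.fuse = nn.Conv3d(3 * low_channels, low_channels, kernel_size=1)
        # Assumed projection so high-level features can be added to low-level ones.
        self.high_proj = nn.Conv3d(high_channels, low_channels, kernel_size=1)

    def forward(self, f_high: torch.Tensor, f_low: torch.Tensor) -> torch.Tensor:
        size_high = tuple(f_high.shape[2:])
        size_low = tuple(f_low.shape[2:])
        # Broadcast the globally pooled branch back to the high-level grid.
        pooled = self.gap_proj(self.gap(f_high)).expand(-1, -1, *size_high)
        f_multi = torch.cat([pooled, self.conv3(f_high), self.conv1(f_high)], dim=1)
        # Attention map in [0, 1] from the fused multi-scale feature.
        f_att = torch.sigmoid(self.fuse(f_multi))
        f_att = F.interpolate(f_att, size=size_low, mode="trilinear",
                              align_corners=False)
        f_up = F.interpolate(self.high_proj(f_high), size=size_low,
                             mode="trilinear", align_corners=False)
        # Assumed rectification (product) and fusion (sum).
        return f_low * f_att + f_up
```

The output has the shape of the low-level input, which is what a decoder level would pass on; other fusion choices (for example, concatenation instead of summation) would fit the prose description equally well.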
Because coronary arteries are small targets in CCTA images, a combination of Dice Loss [40] and Cross-Entropy (CE) Loss, which is suitable for small target segmentation [41], was adopted as the loss function, i.e.,
In Equation (9), $L_{Dice}$ and $L_{CE}$ represent the Dice Loss and the CE Loss, respectively. For the Dice Loss, there is:
In Equation (10), $y_i$ denotes the label value at position $i$.
In Equation (11), $y_i$ denotes the label at position $i$.
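Equations (9) to (11) are not reproduced in this text. A minimal NumPy sketch of a combined Dice and CE loss for a binary foreground/background task is given below. It assumes equal weighting of the two terms, a soft Dice with a linear (not squared) denominator, and small smoothing constants; the function names and the `eps` parameters are hypothetical and may differ from the formulation used in the paper.

```python
import numpy as np


def dice_loss(prob, label, eps=1e-6):
    """Soft Dice loss: 1 - 2*sum(p*y) / (sum(p) + sum(y)). eps is an assumed smoother."""
    prob = np.asarray(prob, dtype=np.float64).ravel()
    label = np.asarray(label, dtype=np.float64).ravel()
    intersection = np.sum(prob * label)
    return float(1.0 - (2.0 * intersection + eps) / (prob.sum() + label.sum() + eps))


def cross_entropy_loss(prob, label, eps=1e-7):
    """Binary cross-entropy averaged over voxels; probabilities are clipped for stability."""
    prob = np.clip(np.asarray(prob, dtype=np.float64).ravel(), eps, 1.0 - eps)
    label = np.asarray(label, dtype=np.float64).ravel()
    return float(-np.mean(label * np.log(prob) + (1.0 - label) * np.log(1.0 - prob)))


def combined_loss(prob, label):
    """Assumed equal-weight sum of the Dice and CE terms."""
    return dice_loss(prob, label) + cross_entropy_loss(prob, label)
```

The Dice term depends on the overlap ratio rather than on the raw voxel count, which is why it is commonly paired with CE when the foreground occupies a very small fraction of the volume.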
Dataset
There are two datasets in this study. The first is the in-house dataset, which comprises 81 CCTA scans collected from the General Hospital of North Theater Command, Shenyang, China. The mean, maximum, and minimum ages of the subjects are 56, 80, and 28 years, respectively. Among the 81 subjects, 41 are male and 40 are female. The research conducted in this study received approval from the Biology and Medical Ethics Committee of Northeastern University under the ethical review approval number NEU-EC-2023B018 S. Each scan has a size of 512 × 512 in the X and Y directions and consists of 300∼400 slice images in the Z direction. The distance between two adjacent slice images is 0.45 mm. The whole dataset was partitioned randomly, with 57 cases for training and 24 cases for testing. The second is the public dataset ASOCA [42], which has 40 CCTA scans. The 40 cases were randomly divided, with 28 cases used for training and 12 cases used for testing.
To reduce the interference, improve the contrast of coronary arteries in the images, and enhance the generalization ability of the model, pre-processing was executed. Pre-processing includes two steps. First, the CT value of the voxels in the collected 3D format data was truncated to [–230, 760] HU. Then, the volume images were normalized with the Z-score normalization, i.e.,
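The normalization equation is not reproduced in this text. The two pre-processing steps can be sketched in NumPy as follows, assuming that the mean and standard deviation are computed per scan over the truncated volume (the source does not state this); the function name `preprocess_ccta` is hypothetical.

```python
import numpy as np


def preprocess_ccta(volume_hu, hu_min=-230.0, hu_max=760.0):
    """Truncate CT values to [hu_min, hu_max] HU, then apply Z-score normalization."""
    clipped = np.clip(np.asarray(volume_hu, dtype=np.float32), hu_min, hu_max)
    mean = clipped.mean()
    std = clipped.std()
    # Guard against a constant volume, where the standard deviation is zero.
    return (clipped - mean) / (std if std > 0 else 1.0)
```

After this step every scan has zero mean and unit variance, and voxels outside the [-230, 760] HU window (for example, air or dense bone) no longer dominate the intensity range.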

Comparison of images before (a) and after (b) pre-processing. The yellow arrow indicates the left anterior descending branch, and the orange box represents the background area with no coronary arteries.
The experiments described here are implemented using Python 3.6 and PyTorch 1.10.1 + cu111 on an NVIDIA GeForce RTX 3090 GPU. The Batch Size (BS) and Patch Size (PS) are set to 2 and [224, 224, 128], respectively. The network is trained with the stochastic gradient descent optimizer. The initial Learning Rate ($LR_{init}$) is set to 0.1, and it is updated by
Comparative experimental results
To verify the segmentation performance of the proposed MFLR network, this study compared it with mainstream networks in the field of medical image segmentation, i.e., the 3D U-Net [43], VoxResNet [44], ResUnet [45], CAS-Net [29], CS2-Net [46], FFNet [25], and DAUnet [28]. The evaluation indexes are the Dice Similarity Coefficient (DSC), Jaccard Index (JI), Recall, Precision, F1-score, and 95% Hausdorff Distance (HD95). The quantitative comparison results on the in-house dataset are shown in Table 1. According to Table 1, CS2-Net achieved the best overall segmentation performance among the seven comparison algorithms. Owing to its more comprehensive use of the multi-scale information in the images, the MFLR method improved DSC by 0.92%, JI by 1.38%, Recall by 0.27%, Precision by 1.48%, and F1-score by 0.0085, and reduced HD95 by 12.1997, compared with CS2-Net.
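For reference, the overlap-based indexes can be computed from a binary prediction and ground-truth mask as in the NumPy sketch below. It uses the standard voxel-count definitions; the function name `overlap_metrics` and the convention of returning 0 for an empty denominator are assumptions, and HD95 is omitted because it requires surface distances rather than voxel counts.

```python
import numpy as np


def overlap_metrics(pred, truth):
    """Return DSC, Jaccard index, recall, and precision for two binary masks."""
    pred = np.asarray(pred).astype(bool)
    truth = np.asarray(truth).astype(bool)
    tp = np.count_nonzero(pred & truth)    # voxels correctly labeled as vessel
    fp = np.count_nonzero(pred & ~truth)   # background voxels labeled as vessel
    fn = np.count_nonzero(~pred & truth)   # vessel voxels that were missed

    def safe_div(num, den):
        # Assumed convention: an empty denominator yields 0.0.
        return num / den if den else 0.0

    return {
        "dsc": safe_div(2 * tp, 2 * tp + fp + fn),
        "jaccard": safe_div(tp, tp + fp + fn),
        "recall": safe_div(tp, tp + fn),
        "precision": safe_div(tp, tp + fp),
    }
```

For example, a prediction `[1, 1, 0, 0]` against a ground truth `[1, 0, 1, 0]` has one true positive, one false positive, and one false negative, giving a DSC of 0.5 and a Jaccard index of 1/3.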
Comparison of the MFLR network and various algorithms on the in-house dataset. ↑ indicates the higher the better, and ↓ represents the lower the better
Figure 6 shows the segmentation results of various algorithms from a cross-section perspective. In Fig. 6, the brown-yellow region denotes the overlap between the predicted results of the network and the ground truth. The proposed algorithm obtained more brown-yellow regions overall, indicating better segmentation performance than the other methods. Because of its multi-scale modules, the MFLR network can segment vessels with both larger and smaller diameters simultaneously. Figure 7 shows the 3D prediction results of various algorithms. As shown by the blue circles in Fig. 7, the MFLR network can better identify small branches and segment more complete coronary arteries compared with the other algorithms.

On the in-house dataset, segmentation results in cross-section. The coronary artery slice images in the basal and apical directions are displayed in the first and third columns, respectively, while the coronary artery slice images between the basal and apical directions are shown in the second column. Green represents the ground truth, red indicates the predicted results, and brown-yellow denotes the overlap region between the predicted results and the ground truth.

On the in-house dataset, 3D segmentation results of different approaches. The areas highlighted by the blue circles indicate regions where the proposed method obtains more complete branches compared with the comparison approaches. The areas highlighted by the yellow circles indicate regions where the proposed method fails to detect vessels when compared with the ground truth.
Table 2 presents the quantitative results of different algorithms on the ASOCA dataset. It indicates that the MFLR network obtains the best performance on DSC, JI, Recall, F1-score, and HD95.
Comparison of the MFLR approach and various algorithms on the ASOCA dataset. ↑ indicates the higher the better, and ↓ represents the lower the better
Figures 8 and 9 show the 2D and 3D segmentation results of various algorithms on the ASOCA dataset, respectively. The figures show that the MFLR network segments more complete vessels than the comparison algorithms, which supports the effectiveness of the MFLR algorithm in addressing the challenges of coronary artery segmentation.

On the ASOCA dataset, segmentation results in cross-section. The coronary artery slice images in the basal and apical directions are displayed in the first and third columns, respectively, while the coronary artery slice images between the basal and apical directions are shown in the second column. Green represents the ground truth, red indicates the predicted results, and brown-yellow denotes the overlap region between the predicted results and the ground truth.

3D segmentation results of different approaches on the ASOCA dataset. The areas highlighted by the blue circles indicate regions where the proposed network obtains more complete branches compared with the comparison approaches. The areas highlighted by the yellow circles indicate regions where the proposed network fails to detect vessels when compared with the ground truth.
Table 3 presents the results of using different numbers of dilated convolutions in the MFE module. The table shows that the evaluation indexes of the MFE module with 4 dilated convolutions and DRs of {1, 2, 3, 4} are better than those of the MFE module with 3 dilated convolutions and DRs of {1, 2, 3}. This is because increasing the number of dilated convolutions enables the network to leverage information at more scales in the feature map. However, the overall performance of the MFE module with 5 dilated convolutions is not as good as that of the module with 4 dilated convolutions. This may be mainly because a larger receptive field introduces some useless information [29]. Therefore, in this study, the MFE module uses 4 dilated convolutions with DRs of {1, 2, 3, 4}, respectively.
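To make the receptive-field argument concrete, the short sketch below computes the largest per-axis receptive field of a cascade of stride-1 dilated 3 × 3 × 3 convolutions, where each layer adds (kernel size − 1) × DR voxels. The function name is hypothetical, the 1 × 1 × 1 fusion layers are ignored because they do not enlarge the receptive field, and the DR set {1, 2, 3, 4, 5} for the five-convolution variant is an assumption, since the text does not list it.

```python
def cascaded_receptive_field(dilation_rates, kernel_size=3):
    """Per-axis receptive field of stacked stride-1 dilated convolutions."""
    rf = 1
    for d in dilation_rates:
        # Each dilated layer extends the field by (kernel_size - 1) * dilation.
        rf += (kernel_size - 1) * d
    return rf


for rates in [(1, 2, 3), (1, 2, 3, 4), (1, 2, 3, 4, 5)]:
    print(rates, cascaded_receptive_field(rates))
```

Under these assumptions the three configurations cover 13, 21, and 31 voxels per axis, respectively, so the fifth convolution enlarges the field by roughly half again relative to the four-convolution module.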
Number of dilation convolutions in the MFE module. ↑ indicates the higher the better, and ↓ represents the lower the better
Table 4 shows the results of the ablation experiment, where the Cascade Dilated Convolution (CDC) module [30] is composed of dilated convolutions with dilation rates of {1, 2, 3, 4}, with no feature fusion operation between adjacent dilated convolutions. In Table 4, EN (1-5) represents the use of the corresponding module in each layer of the encoder, while DE (1-4) denotes the use of the corresponding module in each layer of the decoder. Table 4 shows that the proposed method, which combines the MFE module and the FRF module, achieved the best segmentation performance.
Ablation study. ↑ indicates the higher the better
In recent years, deep learning has found widespread application in medical image segmentation. In this study, the MFLR network was designed to fully learn the contextual information in images and better capture the scale and shape changes of coronary arteries. Tables 1 and 2 show that: (1) CAS-Net extracts multi-scale features in one layer of the encoder and performs better than the 3D U-Net, but its collection of multi-scale features is insufficient, so it does not obtain the best performance; (2) CS2-Net adopts the attention mechanism, but because it lacks multi-scale feature learning, its performance is still not as good as that of the MFLR network. In Table 4, the CDC module cascades multiple dilated convolutions with different DRs to encode multi-scale context. However, the feature fusion between different scales in the CDC module is insufficient, so the MFE module outperforms the CDC module. Although the MFLR network obtains better overall performance than the existing approaches, the yellow circles in Figs. 7 and 9 show that some vessels segmented by the MFLR network remain incomplete. We attribute this observation primarily to inter-class imbalance. Given that the majority of voxels in CCTA scans represent the background, the network tends to gather more information from the background than from the foreground. Although we have adopted a multi-scale strategy to mitigate the influence of class imbalance, the challenge remains. In future work, approaches that can further alleviate the class imbalance, e.g., extracting the heart VOI [47, 48] and designing a loss function that is more suitable for small targets [49, 50], will be studied with the aim of further enhancing the network's performance.
Conclusion
This study designs a multi-scale feature learning and rectification network to automatically segment coronary arteries. The network gathers multi-scale information from different fields of view in the encoder. In the decoder, it introduces a feature rectification and fusion module, which uses high-level features with multi-scale information to guide low-level features. The experimental results show that the proposed network achieves the best performance on most of the evaluation metrics compared with other state-of-the-art networks, on both the in-house and public datasets. This indicates the potential of the proposed network for use in clinical practice. In the future, additional types of blood vessels in CCTA images will be segmented to validate and enhance the overall performance of the proposed network.
Supplementary material
Based on the in-house dataset, we analyzed the effects of $LR_{init}$, PS, BS, and the number of network layers in the MFLR network. Table 5 shows the quantitative results with BS = 2 and PS = [96, 96, 96], with $LR_{init}$ set to 0.001, 0.01, and 0.1, respectively. Table 5 indicates that a larger $LR_{init}$ leads to higher metrics for the MFLR approach. If $LR_{init}$ is greater than 0.1, a gradient explosion problem occurs. Therefore, the largest $LR_{init}$ is set to 0.1. Table 6 shows the metrics with BS = 2 and PS = [224, 224, 128], from which the same conclusion can be drawn. These results support setting $LR_{init}$ = 0.1 in the MFLR network.
The influence of $LR_{init}$ on the MFLR network with BS = 2, PS = [96, 96, 96]. ↑ indicates the higher the better, and ↓ represents the lower the better
The influence of $LR_{init}$ on the MFLR network with BS = 2, PS = [224, 224, 128]. ↑ indicates the higher the better, and ↓ represents the lower the better
Then, we analyzed the influence of different PS values on the MFLR approach with $LR_{init}$ = 0.1 and BS = 2, with the PS set to [96, 96, 96], [128, 128, 128], and [224, 224, 128], respectively. Table 7 shows that the MFLR algorithm achieves better performance as the PS increases. This may be mainly because a larger PS contains more complete context information from the CCTA image. Because of the limitation of GPU memory, the maximum PS is set to [224, 224, 128].
The influence of the PS on the MFLR network with $LR_{init}$ = 0.1 and BS = 2. ↑ indicates the higher the better, and ↓ represents the lower the better
Based on Table 7, we further increased the BS, and the results are shown in Table 8. Compared with Table 7, Table 8 shows that increasing the BS resulted in higher evaluation metrics for the experiments with PS = [96, 96, 96] and PS = [128, 128, 128]. However, these results are still not as good as those obtained with $LR_{init}$ = 0.1, BS = 2, and PS = [224, 224, 128]. Therefore, $LR_{init}$ = 0.1, BS = 2, and PS = [224, 224, 128] are used in the proposed network. Additionally, to avoid network normalization failure and the gradient problem, BS = 1 is not used in this paper.
The influence of the BS and PS on the MFLR network with $LR_{init}$ = 0.1. ↑ indicates the higher the better, and ↓ represents the lower the better
With $LR_{init}$ = 0.1, BS = 2, and PS = [224, 224, 128], we also analyzed the influence of the number of layers in the proposed approach, as shown in Table 9. Table 9 indicates that the performance of the MFLR network improves as the number of layers increases. This may be mainly because more layers enhance the learning ability of the network. When the number of network layers is set to 6, a dimension problem caused by the down-sampling occurs. Therefore, the number of layers in the MFLR network is set to 5.
The influence of the layer number for the MFLR network. ↑ indicates the higher the better, and ↓ represents the lower the better
Use of AI tools declaration
The authors declare they have not used Artificial Intelligence (AI) tools in the creation of this article.
Conflict of interest
The authors declare there is no conflict of interest.
Acknowledgments
This work was supported by the National Natural Science Foundation of China (No. 62273082, No. 61773110, and No. 11801065), the Natural Science Foundation of Liaoning Province (No. 20170540312 and No. 2021-YGJC-14), the Basic Scientific Research Project (Key Project) of Liaoning Provincial Department of Education (LJKZ00042021), and Fundamental Research Funds for the Central Universities (No. N2119008). This work was also supported by the Shenyang Science and Technology Plan Fund (No. 21-104-1-24, No. 20-201-4-10, and No. 201375), and the Member Program of Neusoft Research of Intelligent Healthcare Technology, Co. Ltd. (No. MCMP062002).
