Abstract
Plaque and coronary artery stenosis are the main causes of coronary heart disease, and plaque detection and coronary artery segmentation have become the first choice for detecting coronary artery disease. The purpose of this study is to investigate a new method for plaque detection and for automatic segmentation and diagnosis of coronary arteries, and to test the feasibility of applying it to clinical medical image diagnosis. A multi-model fusion method for coronary CT angiography (CTA) vessel segmentation based on deep learning is proposed. The method includes three network models: an original 3-dimensional fully convolutional network (3D FCN) and two networks that embed the attention gating (AG) model in the original 3D FCN. The prediction results of the three networks are then merged with a majority voting algorithm to obtain the final network prediction. In the post-processing stage, a level set function is used to iteratively refine the fused prediction. The Jaccard index (JI) and Dice similarity coefficient (DSC) are calculated to evaluate the accuracy of the vessel segmentations. The segmentation accuracy of the FCN, FCN-AG1 and FCN-AG2 networks and of the fusion method is tested on a CTA dataset of 20 patients. The average (JI, DSC) values of the three individual networks are (0.7962, 0.8843), (0.8154, 0.8966) and (0.8119, 0.8936), respectively. With the new fusion method, the average JI and DSC increase to (0.8214, 0.9005), which is better than the best result obtained with the FCN, FCN-AG1 or FCN-AG2 model alone.
Introduction
With the improvement of living standards, cardiovascular diseases continue to spread and the age of onset is getting younger, seriously threatening human health. Coronary heart disease (CHD) is a common heart disease in which deposits on the coronary artery wall cause vascular stenosis, resulting in myocardial ischemia and leading to dysfunction and organic disease. It is therefore also called ischemic heart disease (IHD) [1]. The vast majority of cases are caused by atherosclerosis of the coronary arteries, so coronary heart disease is customarily also called coronary atherosclerotic heart disease. A critical condition in patients with coronary heart disease is a stenosis of 50% to 70%. In this case, whether a stent needs to be placed is determined according to the structure of the coronary artery and the damage to the patient’s myocardial function. For the same type of disease, different doctors give different treatment strategies based on the results of coronary angiography: some patients have coronary stenosis of more than 50% without significant myocardial ischemia, and for others the reverse holds. This shows that guidance based on the patient’s coronary angiography is subjective. Therefore, accurate segmentation of coronary arteries through deep learning-based intelligent assisted diagnosis has important clinical significance for the early detection and treatment of coronary heart disease.
Segmentation is a highly relevant task in medical image analysis. Image segmentation is the process of dividing an image into smaller partitions based on features of its pixels. In medical imaging, these partitions can be the regions of certain tissues, organs or other related structures [2]. Segmentation is mostly used for quantitative analysis and diagnosis. The gold standard in medical image segmentation is manual segmentation by clinical experts. This is very time-consuming, because modern medical imaging modalities such as computed tomography (CT) and magnetic resonance imaging (MRI) generate large amounts of data in the form of 3D image volumes, and manual segmentation is subject to deviations and human error. Some semi-automatic methods have been used in clinical diagnosis to accelerate the segmentation process, but clinical experts are still needed to initialize or guide the segmentation. The importance of fully automated segmentation methods grows as the amount of available patient data increases.
In recent years, level set-based segmentation algorithms have been widely used and have become the preferred algorithms for medical image segmentation [3, 4]. The level set method integrates different types of regularization (smoothing terms) and priors, and performs segmentation by solving an energy minimization problem [5]. Such methods can handle changes in the topology of the segmented contour, but their disadvantage is that they require proper contour initialization to obtain effective segmentation results. More recently, convolutional neural networks (CNNs) based on deep learning have been successfully applied to the analysis of medical images, especially for segmentation and detection tasks [6–9]. Unlike level set-based methods, deep learning can automatically learn appearance models from a large amount of training data and extract features of complex structures and patterns, and these learned features are then used for prediction.
In addition, the segmentation of medical images is more challenging than the segmentation of natural images. First, patient data is extremely diverse, and the pattern of the same pathology varies from patient to patient. Second, small and incomplete medical data sets make CNN training more prone to overfitting. Overfitting occurs when the model fits the training set too closely; the model then generalizes poorly and fails to identify new examples that are not in the training set. Steps to reduce overfitting include adding more data, using data augmentation, choosing a model structure that generalizes better, adding regularization (in most cases dropout; L1/L2 regularization is also possible), and reducing model complexity. Nevertheless, recently proposed CNN architectures show better performance than other machine learning-based medical image segmentation algorithms [10]. The fully convolutional network (FCN), proposed by Long et al. of the University of California, Berkeley, extended the original CNN structure and can perform pixel-level prediction without fully connected layers [11]. Ronneberger et al. proposed the U-net neural network based on FCN, which realizes the automatic segmentation of biological cell images. Unlike the traditional FCN, U-net uses skip connections to connect the down-sampling layers with the up-sampling layers, which makes pixel localization more accurate. U-net performs very well in the field of medical image segmentation, and many scholars use it as the basic framework [12]. However, medical images such as MRI or CT are usually 3D volumes, while the existing segmentation networks are mostly 2D in nature. These 2D segmentation networks are applied slice by slice, thereby ignoring the spatial information in the third dimension [13].
Due to their computational complexity and memory requirements, 3D CNNs have often been discouraged for model training. To address these problems, 3D FCNs have recently been proposed to segment and detect structures in MRI or CT images: the entire volume is used as input, and a 3D prediction is obtained directly in a single forward pass, thereby reducing computational complexity [14]. Unlike methods based on 2D segmentation networks, they use 3D convolution kernels, which share spatial information across all three dimensions.
Although 3D FCN has a good ability to process 3D medical volume data, when the shape and size of the target organs differ greatly between patients, some methods rely heavily on multi-level cascaded CNNs. A cascaded framework first extracts a region of interest (ROI) and then makes dense predictions on that ROI. Application fields mainly include cardiac [15] and abdominal segmentation [16] and lung nodule detection and classification [17]. However, this approach wastes computing resources and inflates the number of model parameters; for example, similar low-level features are extracted repeatedly by all models in the cascade. To solve this problem, a simple and effective solution has recently been proposed, namely the attention gating (AG) model [18, 19]. A CNN model with AGs can be trained in the standard way, and the AGs automatically learn to focus on the target structures without additional supervision. During testing, these gates implicitly and dynamically generate candidate regions and highlight salient features useful for the specific task. In addition, they do not incur a high computational cost and do not need to learn a large number of model parameters, as a multi-model framework does. By suppressing feature activations in irrelevant regions, AGs can improve model sensitivity and the accuracy of dense label prediction. In this way, external organ localization models can be dispensed with while high prediction accuracy is maintained.
In this study, we investigate a new coronary CTA vessel segmentation method based on deep learning multi-model fusion. The main contributions are as follows: (1) a 3D FCN (three-dimensional fully convolutional network) is used to process three-dimensional coronary artery CTA images, so that the network can fully learn three-dimensional spatial features; (2) the AG (attention gate) model is embedded into the 3D FCN to suppress feature activation in irrelevant regions, which improves the prediction accuracy of the network; (3) a multi-model fusion method is proposed, in which the results of the original network and the two improved networks are fused with a majority voting algorithm, effectively reducing the false negatives and false positives that are likely to occur when a single network model is used for vessel segmentation.
Methods
The overall workflow of this research is shown in Fig. 1. The framework includes:
The original 3D FCN network;
The network (3D FCN-AG1) that embeds one attention gating model in the original 3D FCN;
The network (3D FCN-AG2) that embeds two attention gating models in the original 3D FCN.

Overall flow chart.
In the training phase, the three networks are trained separately to obtain their weights. In the testing phase, the three trained networks predict the same test data, and their predictions are fused with the majority voting algorithm to obtain the final network prediction. Finally, the level set function is used to iteratively optimize this prediction and produce the final segmentation result. The basic idea of the level set method is to regard the interface as the zero level set of a function ψ (called the level set function) in a higher-dimensional space, so that the evolution of the interface is also extended to the higher-dimensional space. The level set function is evolved, or iterated, according to the evolution equation it satisfies. As the level set function evolves, the corresponding zero level set changes with it. When the evolution becomes stable, it stops and the interface shape is obtained [20].
The structure of the 3D fully convolutional network used in the experiment is shown in Fig. 2. This structure is similar to the mainstream 3D U-net [21] and V-net [22] network structures. 3D convolutions are used to extract features from the 3D CTA image data, and the resolution is adjusted with an appropriate stride at the end of each stage. The left part of the network is the encoding path, while the right part restores the data to its original size through the decoding path.

3D FCN network structure.
Assume that
wherein, α= x-m, β= y-n, χ= z-t,
The network uses the PReLU activation function, which can be written as Equation (2):

$$\sigma(F_i) = \max(0, F_i) + \alpha_i \min(0, F_i) \quad (2)$$

wherein $F_i$ represents the input, $\sigma(F_i)$ represents the output, and $\alpha_i$ is a learnable parameter that controls the slope of the negative part of $F_i$; $\alpha_i$ is fixed at zero in ReLU. Therefore, PReLU can adjust the rectifier according to the input, thereby improving the accuracy of the network at almost no extra computational cost and with little risk of overfitting.
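As a rough illustration of the activation described above, the following NumPy sketch assumes the standard PReLU form, max(0, F) + α·min(0, F). The function name `prelu` and the slope value 0.25 are illustrative assumptions and are not taken from the paper, where α is learned during training.

```python
import numpy as np

def prelu(f, alpha):
    """PReLU: identity for positive inputs, slope `alpha` for negative inputs."""
    f = np.asarray(f, dtype=float)
    return np.maximum(0.0, f) + alpha * np.minimum(0.0, f)

x = np.array([-2.0, -0.5, 0.0, 1.5])
relu_out = prelu(x, alpha=0.0)    # alpha = 0 recovers the ordinary ReLU
prelu_out = prelu(x, alpha=0.25)  # 0.25 is only an illustrative slope
```

With `alpha=0.0` the negative inputs are clipped to zero, while a non-zero slope lets a scaled copy of them pass through.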
The left side of the network is divided into stages that operate at different resolutions. Each stage contains one to three convolutional layers. The input of each stage is both processed by the convolutional layers and their non-linearities and added to the output of the last convolutional layer of that stage, so that the stage learns a residual function [24]. The advantage of incorporating residual learning into the network structure is that the network converges in a shorter time during training.
In each stage, the convolutions use kernels of size 5×5×5 with a stride of 1. As the data passes through the stages of the encoding path, its resolution is gradually reduced by convolutions with a kernel size of 2×2×2 and a stride of 2. Because this second operation extracts features from non-overlapping 2×2×2 patches only, the size of the resulting feature maps is halved. Using strided convolution to halve the feature maps replaces the pooling operations commonly used in earlier CNNs [25]. In addition, the number of feature channels is doubled at each stage of the encoding path, so the number of feature maps doubles as the resolution is halved.
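The halving and doubling described above can be checked with the usual convolution output-size formula. The sketch below is only arithmetic; the 16 starting channels and the four downsampling steps are assumptions borrowed from V-net-style networks, since the paper does not state them, and the 128×128×160 input is the network input size reported in the experiments.

```python
def conv_output_size(size, kernel, stride, padding=0):
    """Output length of a convolution along one axis."""
    return (size + 2 * padding - kernel) // stride + 1

def encoder_shapes(input_shape, base_channels, num_downsamplings):
    """Spatial shape and channel count after each 2x2x2, stride-2 convolution."""
    shape, channels = tuple(input_shape), base_channels
    shapes = [(shape, channels)]
    for _ in range(num_downsamplings):
        shape = tuple(conv_output_size(s, kernel=2, stride=2) for s in shape)
        channels *= 2  # feature channels double as the resolution is halved
        shapes.append((shape, channels))
    return shapes

# base_channels=16 and num_downsamplings=4 are assumed values, not from the paper.
shapes = encoder_shapes((128, 128, 160), base_channels=16, num_downsamplings=4)

# A 5x5x5 kernel with stride 1 keeps the size only if the input is padded by 2.
same_size = conv_output_size(128, kernel=5, stride=1, padding=2)
```

Under these assumptions, the deepest stage would operate on 8×8×10 feature maps with 256 channels.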
The PReLU non-linear activation function is applied throughout the network, and batch normalization is applied before each non-linear activation [26]. Batch normalization forces the distribution of the inputs to each layer back towards a standard normal distribution with a mean of 0 and a variance of 1, which keeps gradients larger, mitigates vanishing gradients, and accelerates network convergence. Using convolutions instead of pooling in the encoding path also lets the network occupy less memory during training. In the encoding path, downsampling reduces the size of the input and enlarges the receptive field of the features computed in subsequent layers, and the number of features computed at each stage is twice that of the previous stage. The decoding path extracts features and expands the lower-resolution feature maps so that the necessary information is collected and combined.
After each stage of the decoding path, a deconvolution operation is used to increase the size of its input, followed by one to three convolutional layers that use half as many 5×5×5 kernels as the previous layer. Like the encoding path, this part also learns residual functions in its convolution stages, which accelerates the convergence of the network model. In the last convolutional layer, a kernel of size 1×1×1 is used to generate a feature map of the same size as the input volume, which is converted into foreground and background probabilities by applying the sigmoid activation function.
In this work, an objective function based on the Dice coefficient, whose value lies between 0 and 1, is used. The goal is to maximize the Dice coefficient. The Dice coefficient D between two binary volumes can be written as Equation (3):

$$D = \frac{2\sum_{i}^{N} p_i g_i}{\sum_{i}^{N} p_i^{2} + \sum_{i}^{N} g_i^{2}} \quad (3)$$

wherein the sums run over the N individual voxels of the predicted binary segmentation volume, $p_i \in P$, and of the ground truth binary volume, $g_i \in G$.
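A minimal NumPy sketch of a Dice-based objective follows. It assumes the V-net-style form with squared terms in the denominator; the small constant `eps` and the choice of 1 − D as the quantity to minimize are assumptions added for illustration, not details given in the paper.

```python
import numpy as np

def dice_coefficient(pred, truth, eps=1e-7):
    """Dice coefficient between a predicted probability volume and a binary ground truth."""
    p = np.asarray(pred, dtype=float).ravel()
    g = np.asarray(truth, dtype=float).ravel()
    # eps (an assumption) avoids division by zero when both volumes are empty.
    return (2.0 * np.sum(p * g) + eps) / (np.sum(p ** 2) + np.sum(g ** 2) + eps)

def dice_loss(pred, truth):
    """Maximizing the Dice coefficient is equivalent to minimizing 1 - D."""
    return 1.0 - dice_coefficient(pred, truth)

perfect = dice_coefficient([1, 1, 0, 0], [1, 1, 0, 0])
partial = dice_coefficient([1, 1, 0, 0], [1, 0, 0, 0])  # 2*1 / (2 + 1)
```

Because the expression is differentiable in the predicted probabilities, it can serve directly as a training objective.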
The shape and size of the blood vessels in CTA vary from slice to slice, and enhancing the vessels is very important for eliminating impurity regions in the CTA slices. In a standard CNN model, the feature map grid is down-sampled gradually to obtain a sufficiently large receptive field, so that semantic context information is captured. In this way, the features on the coarse spatial grid model the location of, and the relationship between, tissues at a global scale. However, it remains difficult to reduce false positive predictions for small objects that show large shape variability by downsampling alone. To improve accuracy, most current segmentation frameworks split the task into separate localization and subsequent segmentation steps [27]. In this work, the same goal is achieved by integrating AGs into the standard CNN model. In contrast to the localization model in a multi-level CNN, this requires neither training multiple models nor a large number of additional model parameters. The key property of AGs is that they progressively suppress feature responses in irrelevant background regions, so there is no need to crop an ROI with a cascaded network.
In the standard attention gating model, the output of an AG is the element-wise multiplication of the input feature map and the attention coefficients, as in Equation (4):

$$\hat{x}^{l}_{i,c} = x^{l}_{i,c} \cdot \alpha^{l}_{i} \quad (4)$$

The attention coefficient $\alpha_i \in [0,1]$ identifies salient image regions and prunes feature responses so that only the activations relevant to the specific task are retained. In general, a single scalar attention value is calculated for each pixel vector $x^{l}_{i} \in \mathbb{R}^{F_l}$. Multiple semantic classes are handled by learning multi-dimensional attention coefficients, so that each AG learns to focus on a subset of the target structures. A gating vector $g_i \in \mathbb{R}^{F_g}$ is used for each pixel i to determine the focus region. The gating vector contains contextual information used to prune lower-level feature responses [28]. By comparing the performance of multiplicative attention [29] and additive attention [30], additive attention is finally used to obtain the gating coefficient. Although this is computationally more expensive, experiments have shown that it can achieve higher accuracy than multiplicative attention. Additive attention is shown in Equations (5) and (6):

$$q^{l}_{att} = \psi^{T}\left(\sigma_1\left(W_x^{T} x^{l}_{i} + W_g^{T} g_i + b_g\right)\right) + b_\psi \quad (5)$$

$$\alpha^{l}_{i} = \sigma_2\left(q^{l}_{att}\left(x^{l}_{i}, g_i; \Theta_{att}\right)\right) \quad (6)$$

wherein the AG is characterized by a set of parameters $\Theta_{att}$ containing the linear transformations $W_x \in \mathbb{R}^{F_l \times F_{int}}$, $W_g \in \mathbb{R}^{F_g \times F_{int}}$, $\psi \in \mathbb{R}^{F_{int} \times 1}$ and the bias terms $b_\psi \in \mathbb{R}$, $b_g \in \mathbb{R}^{F_{int}}$. Attention based on vector concatenation is used: the linear transformations are computed with 1×1×1 convolutions on the input tensors, and the concatenated features $x^{l}$ and $g$ are linearly mapped to an $\mathbb{R}^{F_{int}}$-dimensional space.
The first non-linearity (σ1) is the PReLU activation function, because PReLU can adjust the rectifier according to the input and thus improves accuracy compared with the widely used ReLU activation function. The second non-linearity (σ2), which normalizes the attention coefficients, is the sigmoid activation function rather than softmax: sequential use of softmax produces sparse activations at the output, whereas sigmoid lets the parameters of the AG converge better during training. The overall process of the AG is shown in Fig. 3.

Workflow of attention gate.
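The additive attention gate described above can be sketched in NumPy as follows. This is a simplified illustration under several assumptions: the feature map x and the gating signal g are taken to be already at the same spatial resolution (so the resampling steps are omitted), each 1×1×1 convolution is written as a matrix product over the channel axis, the PReLU slope is fixed at 0.25 instead of being learned, and all weights are random placeholders.

```python
import numpy as np

def prelu(z, slope=0.25):
    # sigma_1; the slope would be learned in practice, 0.25 is a placeholder.
    return np.maximum(0.0, z) + slope * np.minimum(0.0, z)

def sigmoid(z):
    # sigma_2; maps the attention logits into [0, 1].
    return 1.0 / (1.0 + np.exp(-z))

def attention_gate(x, g, W_x, W_g, b_g, psi, b_psi):
    """x: (D, H, W, F_l), g: (D, H, W, F_g) -> gated features and coefficients."""
    hidden = prelu(x @ W_x + g @ W_g + b_g)   # (D, H, W, F_int)
    q_att = hidden @ psi + b_psi              # (D, H, W, 1)
    alpha = sigmoid(q_att)                    # one scalar coefficient per voxel
    return x * alpha, alpha                   # element-wise gating of x

rng = np.random.default_rng(0)
F_l, F_g, F_int = 8, 16, 6                    # illustrative channel counts
x = rng.standard_normal((4, 4, 4, F_l))
g = rng.standard_normal((4, 4, 4, F_g))
W_x = rng.standard_normal((F_l, F_int))
W_g = rng.standard_normal((F_g, F_int))
b_g = np.zeros(F_int)
psi = rng.standard_normal((F_int, 1))
b_psi = 0.0

gated, alpha = attention_gate(x, g, W_x, W_g, b_g, psi, b_psi)
```

Since every coefficient lies in [0, 1], gating can only attenuate a feature response, which is how responses in irrelevant regions are suppressed.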
In this work, AGs are embedded into the 3D FCN architecture to highlight the salient features passed through the skip connections; the structure is shown in Fig. 4.

3D FCN integrating attention gate.
This structure gates the coarsely extracted information to eliminate irrelevant and noisy responses in the skip connections. In addition, AGs filter neuron activations during both the forward and the backward pass: gradients originating from background regions are down-weighted during backpropagation, so that model parameters in shallower layers are updated mainly on the basis of the spatial regions relevant to the given task. In each sub-AG, complementary information is extracted and fused to define the output of the skip connection. To reduce the number of trainable parameters and the computational complexity of the AGs, the linear transformations are performed with 1×1×1 convolution kernels and the input feature map is downsampled to the resolution of the gating signal, similar to non-local blocks [31]. The corresponding linear transformations couple the feature maps and map them to a lower-dimensional space for the gating operation. Furthermore, low-level feature maps are not used in the gating function, because they do not represent the input data in a high-dimensional space. Therefore, two 3D FCN models with embedded AGs are used: one embeds an AG in the last skip connection of the network (as shown in Fig. 4(a)), and the other embeds AGs in the last two skip connections (as shown in Fig. 4(b)), thereby enhancing the learning of relevant features in the entire network.
If only a single model is used, false negatives and false positives easily occur in the prediction results. Therefore, a model fusion method is adopted in the next stage. The predictions of the three network models (3D FCN, 3D FCN-AG1 and 3D FCN-AG2) are combined by majority voting, and the class with more votes is taken as the final classification. Specifically, for each pixel of the test data, the three network models produce three predictions. If two or more of them label the pixel as blood vessel, the final prediction for that pixel is blood vessel; otherwise it is background.
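The voting rule above can be sketched in a few lines of NumPy. The function name `majority_vote` and the tiny 2×2 masks are illustrative assumptions; in practice the inputs would be the three binarized prediction volumes.

```python
import numpy as np

def majority_vote(masks):
    """Fuse binary masks: a voxel is foreground if more than half of the models say so."""
    stacked = np.stack([np.asarray(m).astype(bool) for m in masks], axis=0)
    votes = stacked.sum(axis=0)        # number of models voting "vessel"
    return votes * 2 > len(masks)      # strict majority (2 of 3 for three models)

# Toy predictions standing in for 3D FCN, 3D FCN-AG1 and 3D FCN-AG2.
fcn = np.array([[1, 0], [1, 0]])
ag1 = np.array([[1, 1], [0, 0]])
ag2 = np.array([[0, 1], [1, 0]])

fused = majority_vote([fcn, ag1, ag2])
```

With an odd number of models a tie cannot occur, which is one practical reason for fusing three networks.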
The segmentation results of the final network prediction are obtained through the above method. On inspection, the edges of the segmented blood vessels are rough. To solve this problem, the level set method is used in the post-processing stage to iteratively optimize the contours of the blood vessels.
The basic idea of the level set method is to express a closed planar curve implicitly as a level set of a two-dimensional surface function [32], that is, as a set of points with the same function value; the motion of the curve is then solved implicitly through the evolution of the level set function surface. The evolution of the level set function satisfies the following basic Equation (7):

$$\frac{\partial \varphi_s}{\partial t} + F\,|\nabla \varphi_s| = 0 \quad (7)$$

wherein $\varphi_s$ is the level set function, whose zero level set represents the target contour curve, i.e., $\Gamma(t) = \{x \mid \varphi_s(x, t) = 0\}$; $|\nabla \varphi_s|$ is the norm of the gradient of the level set function; and F is the speed function in the normal direction of the surface, which controls the movement of the curve.
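To make the evolution concrete, the sketch below performs a few explicit time steps of a level set equation of the assumed form ∂φ/∂t + F|∇φ| = 0 on a 2D grid. It is an illustration only: the constant speed F = 1, the time step, and the use of central differences via `np.gradient` are simplifying assumptions, whereas practical implementations normally use an image-dependent speed function and upwind differencing. The paper does not specify these details.

```python
import numpy as np

def evolve_level_set(phi, speed, dt=0.5, steps=5):
    """Explicit Euler steps of phi_t + F * |grad(phi)| = 0 (central differences)."""
    phi = np.array(phi, dtype=float)
    for _ in range(steps):
        gy, gx = np.gradient(phi)                 # derivatives along rows and columns
        grad_norm = np.sqrt(gx ** 2 + gy ** 2)
        phi = phi - dt * speed * grad_norm        # phi <- phi - dt * F * |grad(phi)|
    return phi

# Signed distance to a circle of radius 10: negative inside, positive outside.
yy, xx = np.mgrid[0:64, 0:64]
phi0 = np.sqrt((xx - 32.0) ** 2 + (yy - 32.0) ** 2) - 10.0

phi1 = evolve_level_set(phi0, speed=1.0)          # constant F is an assumption
area_before = int((phi0 < 0).sum())
area_after = int((phi1 < 0).sum())
```

With a positive constant speed and this sign convention, the region where φ < 0 grows, so the zero level set moves outward; the contour itself is read off as the set where φ changes sign.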
The accuracy of segmentation is compared by calculating the JI (Jaccard index) and DSC (Dice similarity coefficient) scores [33] between each blood vessel segmentation and the ground truth. The JI score is the ratio of the correctly predicted area to the union of the two areas, and the DSC score is the ratio of twice the correctly predicted area to the sum of the two areas. Both values range between 0 and 1; the higher the value, the better the accuracy of segmentation. The calculation formulas of JI and DSC are Equations (8) and (9):

$$JI = \frac{|Y \cap Y_p|}{|Y \cup Y_p|} \quad (8)$$

$$DSC = \frac{2\,|Y \cap Y_p|}{|Y| + |Y_p|} \quad (9)$$

wherein Y stands for the ground truth (GT) and $Y_p$ stands for the predicted segmentation.
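The two scores can be computed from binary masks as in the NumPy sketch below. The function name and the convention of returning 1.0 when both masks are empty are assumptions made for illustration.

```python
import numpy as np

def jaccard_and_dice(truth, pred):
    """Return (JI, DSC) for two binary masks of the same shape."""
    y = np.asarray(truth).astype(bool)
    yp = np.asarray(pred).astype(bool)
    intersection = np.logical_and(y, yp).sum()
    union = np.logical_or(y, yp).sum()
    total = y.sum() + yp.sum()
    # Returning 1.0 for two empty masks is a convention assumed here.
    ji = intersection / union if union else 1.0
    dsc = 2.0 * intersection / total if total else 1.0
    return float(ji), float(dsc)

# Toy example: 2 voxels overlap, 4 voxels in the union, 3 + 3 voxels in total.
ji, dsc = jaccard_and_dice([1, 1, 1, 0, 0], [0, 1, 1, 1, 0])
```

For a single pair of masks the two scores are tied by DSC = 2·JI / (1 + JI), so they always rank a pair of segmentations of the same case in the same order.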
Experimental data collection
The coronary CTA image data in the experiment comprised 70 groups of patient data, with between 250 and 350 slices in each group. Because the coronary CTA data are complete scans, there are useless slice sequences at the beginning and at the end of each group (slices that contain only the aorta, or in which the blood vessels have disappeared). As shown in Fig. 5, Fig. 5(a) is the kind of slice needed for the experiment: it contains the aorta and the coronary vessels (their positions are indicated by the arrows). Figure 5(b) contains only the aorta, as the coronary vessels have not yet appeared, and in Fig. 5(c) the blood vessels have all disappeared. Statistics show that, in every group of patient data, the entire span from the appearance of the coronary artery to its complete disappearance lies within 150 slices. Therefore, the slice in which the coronary artery first appears is identified in each group by manual screening and used as the reference; the starting slice is set 10 slices before this reference, and 160 slices are selected as the experimental data for each patient. There are about 11200 CTA images in total. The size of the CTA image data of each group is 512×512×160. 50 groups of patient data are used as the training set, and the remaining 20 groups are used as the test set.

Data sample of various slices.
In the experiment, the Keras library is used to implement the model [34], and the Adam optimization algorithm is used to optimize the network [35]. The learning rate was initially set to 10⁻⁵, and the entire model was trained for 500 epochs on a single NVIDIA GPU (Nvidia GTX 1080Ti). An epoch is one complete forward and backward pass over all batches of the training data; in other words, the number of epochs is how many times the training data is passed through the network during training. The training process takes about 10 hours. Due to memory limitations, the input size of the model is 128×128×160, so the original CTA data is reduced to the size required for network input.
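The paper does not say how the 512×512×160 volumes are reduced to 128×128×160. One simple possibility, shown here purely as an assumption, is to average non-overlapping 4×4 in-plane blocks while leaving the slice axis untouched. The function name `downsample_in_plane` is hypothetical, and a small toy volume is used so that the result is easy to check.

```python
import numpy as np

def downsample_in_plane(volume, factor=4):
    """Average non-overlapping factor x factor blocks in each slice (assumed method)."""
    v = np.asarray(volume, dtype=float)
    h, w, d = v.shape
    if h % factor or w % factor:
        raise ValueError("in-plane size must be divisible by the factor")
    blocks = v.reshape(h // factor, factor, w // factor, factor, d)
    return blocks.mean(axis=(1, 3))    # slice axis d is left unchanged

# Toy 8x8x2 volume; a 512x512x160 volume would become 128x128x160 in the same way.
toy = np.arange(8 * 8 * 2).reshape(8, 8, 2)
small = downsample_in_plane(toy, factor=4)
```

Interpolation-based resizing would be an equally plausible choice; block averaging is used here only because it needs no additional library.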
Experiment and discussion
The segmentation accuracy of the different algorithms is shown in Table 1. First, the segmentation accuracy of the original 3D FCN on coronary vessels was tested on the data sets of 20 patients. The average values of JI and DSC reach 0.7962 and 0.8843. The aorta is accurately segmented, but some small coronary vessels are lost. Second, we tested the networks with attention gating models embedded in one and in two skip connections of the original network. The mean JI and DSC of both improved networks exceed those of the original network, reaching (0.8154, 0.8966) and (0.8119, 0.8936), respectively. In particular, they perform well on coronary vessels whose brightness is not especially pronounced. As shown in Fig. 6 stage (1), segmentation with the original network can easily lose small coronary vessels (the vessel part indicated by the yellow arrow in the figure is not segmented), whereas the networks with embedded attention gating models compensate for this frequent problem of the original network and improve the overall segmentation accuracy.
Comparison of segmentation accuracy for various algorithms

Segmentation results of different methods for three stages of coronary CTA data.
Then, while visualizing the experimental results, it was found that although the embedded attention gating model improves the overall result, some segmentation problems remain. As shown in Fig. 6 stage (2), the original network segments well, but the network with the embedded attention gating model misjudges vessels and mistakenly segments impurities as vessels (the yellow arrow in the figure marks the incorrectly segmented part), so that in some cases its result is even worse than that of the original network. In response to this problem, a multi-model fusion method is used: the predictions of the original network and of the two improved networks with embedded attention gating models are fused with the majority voting algorithm to obtain a new prediction. The experimental results show that the average JI and DSC of the segmentation obtained by this method are 0.8214 and 0.9005, which are better than the best results among the three individual models. Finally, the above experiments show that when only the deep learning method is used, the skeleton of the blood vessel is segmented very well, but details, especially the segmented contour, appear relatively rough and are not smooth. A post-processing step is therefore added at the end, in which the level set algorithm further iteratively optimizes the contours of the already segmented vessels, as shown in Fig. 6 stage (3).
From the three-dimensional visualization results, the segmentation completeness of three methods, FCN, FCN-AG2 and FCN(MV)+LS, is compared on three-dimensional models of the coronary vessels. In this comparison, we pay attention to the segmented length of the vessels and to breaks in the coronary model. In Fig. 7, the actual 3D model of the coronary artery (red) is used as the base, and the 3D model segmented by each network is overlaid in off-white; the visible red parts therefore indicate regions of the coronary artery that the network did not segment. The figure shows three patient examples. Each row shows the three-dimensional coronary models segmented from the data of the same patient with the different networks, each from two viewing angles.

Comparison of three-dimensional models.
These three examples show that although FCN can segment the main stems of the three major vessels, it segments the distal vessels and the small vessels outside the main stems poorly. FCN-AG2 is better than FCN in terms of segmented vessel length, while FCN(MV)+LS not only retains this advantage in segmented length but also repairs the breaks in the models of the former two. For example, in the third example, both the FCN and FCN-AG2 models are broken at the circled location in the figure. Such breaks in the three-dimensional model are very unfavorable for the subsequent diagnosis of coronary artery disease. The three-dimensional coronary model obtained by FCN(MV)+LS is very complete, and the breaks of the former two are repaired. This comparison shows that the FCN(MV)+LS method established in this study has advantages in segmented vessel length, repairs breaks, and yields a high-precision three-dimensional coronary vessel model that is more in line with clinical needs.
To facilitate the analysis of the segmentation results for each group of patient data in the test set, the per-case results are also summarized as box plots in Fig. 8. The per-case results and the mean of the original 3D FCN are comparatively low. The mean values of the two networks with embedded AGs show a clear improvement over the original 3D FCN. After fusion of the three models, both the per-case results and the mean improve further. The final segmentation results are then further optimized from the model fusion results by the level set method.

Boxplots of experimental data results.
Coronary heart disease is one of the biggest health problems in the world, so its early prevention and diagnosis are very important. At present, manual segmentation of coronary arteries is time-consuming and depends on the operator’s subjective judgment. In this study, a deep learning multi-model fusion approach is proposed to segment coronary CTA vessels. The method includes three network models: an original 3D FCN, an end-to-end network that processes three-dimensional coronary CTA volume data for prediction, and two networks that embed the AG model in the original 3D FCN. During training, the AG model suppresses feature activation in irrelevant regions, which improves model sensitivity and the accuracy of dense label prediction. The prediction results of the three networks are then fused by a majority voting algorithm, which reduces the false negatives and false positives that easily occur in segmentation, to give the final network prediction. This prediction is then passed to the level set function for iterative optimization of the edge contour, and the final segmentation result is obtained. Compared with the original network, the method proposed in this study provides better segmentation accuracy in coronary artery segmentation, as measured by the JI and DSC scores used as performance metrics. This method could be applied in clinical studies to analyze CTA images automatically and help judge whether cardiovascular disease is present.
Acknowledgments
This work is sponsored by the Scientific Research Project (No. 20C1073, No. 20B337) of Hunan Provincial Education Department, China.
