Abstract
BACKGROUND:
The distribution of pulmonary vessels in computed tomography (CT) and computed tomography angiography (CTA) images of the lung is important for diagnosing disease, formulating surgical plans and conducting pulmonary research.
PURPOSE:
Based on the pulmonary vascular segmentation task of the International Symposium on Image Computing and Digital Medicine 2020 challenge, this paper reviews 12 pulmonary vascular segmentation algorithms for lung CT and CTA images and objectively evaluates and compares their performance.
METHODS:
First, we present the annotated reference dataset of lung CT and CTA images. A subset of the dataset consisting of 7,307 slices for training and 3,888 slices for testing was made available to participants. Second, by comparing the performance of convolutional neural networks from 12 different institutions for pulmonary vascular segmentation, we summarize the reasons for their defects and improvements. The models are mainly based on U-Net, attention, GAN, and multi-scale fusion networks. Performance is measured in terms of the Dice coefficient, over-segmentation rate and under-segmentation rate. Finally, we discuss several proposed methods for improving pulmonary vessel segmentation results using deep neural networks.
RESULTS:
Compared with the annotated ground truth for both lung CT and CTA images, most of the 12 deep neural network algorithms perform well in pulmonary vascular extraction and segmentation, with Dice coefficients ranging from 0.70 to 0.85. The Dice coefficients of the top three algorithms are about 0.80.
CONCLUSIONS:
The results show that methods which consider spatial information, fuse multi-scale feature maps, or apply effective post-processing within the training and optimization of deep neural networks are important for further improving the accuracy of pulmonary vascular segmentation.
Introduction
Computed tomography (CT) is an important part of the diagnosis and treatment of lung diseases. With the development of medical image processing techniques, many studies have shown that pulmonary vascular segmentation can significantly improve the diagnosis and detection of pulmonary nodules, pulmonary hypertension (PH), pulmonary embolism (PE) and other pulmonary diseases. Study [1] confirmed that extracting pulmonary vascular structures from CT can increase the sensitivity of locating nodules by 20%; nodules are the main cause of lung cancer and have no typical symptoms. PH refers to the rise of pulmonary arterial pressure caused by loss and obstructive remodelling of the pulmonary vascular bed [2]. In clinical diagnosis, if the diameter of the main pulmonary artery is greater than or equal to 28.6 mm, the specificity of PH diagnosis is higher. The work in [3] showed that automatic detection of the lung vascular tree can provide clinically relevant measures of blood vessel morphology and thereby improve the diagnosis of PH. PE is a pathological and clinical condition caused by the obstruction of a pulmonary artery or one of its branches, which blocks the blood supply to tissues. Research [4] shows that integrating a pulmonary vascular tree extraction algorithm into a computer-aided detection (CAD) system reduces the false positive rate of PE detection by 16.2%, which saves radiologists a lot of time.
Different medical imaging techniques are currently used in clinical practice, which results in differences in resolution, noise, vessel contrast and slice thickness, so the segmentation algorithm must be chosen to suit the characteristics of the adopted imaging technique [5]. As a result, most pulmonary blood vessel segmentation methods differ subtly from one another and lack reusability. Pulmonary diseases such as pulmonary hypertension place strict requirements on the vascular structure, which requires pulmonary vascular segmentation results to be as accurate as possible. Therefore, research on pulmonary vascular segmentation in computer-aided diagnosis has attracted the attention of many scholars.
Traditional image segmentation methods [6] are generally built on basic image processing of pixel intensities or textural features [11], including thresholding, vesselness filters, and tree growing or tracking [12]. Threshold segmentation is a classical approach in medical image segmentation. It divides pixels into several categories by setting appropriate thresholds. The most common filter-based methods for vessel segmentation rely on the Hessian matrix. Because the pulmonary vessels have a tree-like structure, tree growing methods such as region growing [13] have been studied in more depth.
With the rapid development of deep learning, its algorithms have been applied to many research fields. Their application to medical images started from the success of convolutional neural networks (CNNs) in classification, which prompted researchers to try deep learning for image segmentation [15]. Subsequently, Long et al. proposed the fully convolutional network (FCN) [16], which achieves pixel-level classification. U-Net was then proposed on the basis of the FCN, with a different feature fusion operation. In recent years, deep learning has been widely used in medical image processing [17] and has made breakthroughs in pulmonary vessel segmentation [19] of CT images.
Yajun Xu et al. proposed a phased convolutional network [21] that consists of CNN-based lung segmentation followed by FCN-based pulmonary vessel segmentation, in order to learn rich pulmonary vessel features hierarchically. Hejie Cui et al. proposed a 2.5D U-Net++ method combined with orthogonal fusion [22], which optimizes the representation of intra-slice and inter-slice features for fully automated pulmonary vascular segmentation. The work in [23] proposed a stacked fully convolutional network for pulmonary vessel segmentation, which combines a stacked FCN with an orientation-based region growing method to address the discontinuity problem caused by blurry boundaries and complicated pulmonary structures. However, the CNN-based approaches ignore context information, the 2.5D U-Net++ method requires a long training time, and the stacked FCN method loses fractured vessels.
The purpose of the International Symposium on Image Computing and Digital Medicine (ISICDM) 2020 challenge is to provide a public and fair platform on which researchers can apply their work to real data. In addition, it yields intuitive results and allows the reasons behind the differences between methods to be analyzed objectively for reference and discussion.
This paper is organized as follows. Section 2 introduces the data used in the ISICDM 2020 challenge and presents the methods of the 12 teams from different institutions that took part in the pulmonary vascular segmentation task. Section 3 provides the results and analysis, and the paper then concludes.
Materials and methods
Dataset
The dataset was labeled with internal software provided by the Key Laboratory of Medical Imaging Intelligent Computing, Ministry of Education. Four students and an expert in the field of medical imaging were responsible for labeling and verification. The students were trained to annotate all images. The expert continually evaluated and unified the annotation results using professional knowledge of anatomy to ensure accuracy. Moreover, all images of the dataset were collected from multiple sources for authenticity.
The lung imaging dataset in the ISICDM 2020 challenge included 16 sets of chest CT plain scan images and 16 sets of contrast-enhanced computed tomography angiography (CTA) images with the following acquisition and reconstruction parameters: section thickness, < 2 mm; resolution, 512×512. These images were divided into a training set and three test sets. The training set contains 10 cases of CT and CTA scans, and the test sets contain 6 cases of CT and CTA scans. To ensure the accuracy of the results and the fairness of the challenge, the challenge is divided into three stages (warm-up, qualifying and final), and a different test set is provided for each stage. The test sets for the three stages contain 1 case, 4 cases and 1 case of CT and CTA, respectively. In the warm-up stage, the results are not counted; the single CT and CTA case is provided so that participants can familiarize themselves with the dataset and the test equipment. The ranking is based on the evaluation in the qualifying stage. In the final stage, participants demonstrate their results and explain their algorithms on site. This paper only discusses the training and testing results of the 12 segmentation algorithms on the second-stage data. Each case in the training set is composed of a dcm and a mask: the dcm is the original image in DICOM format, and the mask is the annotation image in JPG format.
Network architecture
U-Net
The U-Net network [24] is composed of an encoder and a decoder (Fig. 1). The encoder extracts the low-level features of the image through down-sampling. The decoder recovers high-level information from the low-level information by up-sampling. In addition, skip connections fuse the low-level and high-level features to prevent the loss of information during encoding. Medical image processing is usually troubled by small amounts of data, a massive number of pixels per slice and fuzzy edges caused by noise. Because the multi-scale fusion of U-Net can alleviate these problems, it is widely used for segmentation in medical image processing.

3D U-net architecture.
3D U-Net [25] is a semantic segmentation network based on the structure of the U-Net network (Fig. 2). All 2D operations of the U-Net architecture are replaced by their 3D counterparts in 3D U-Net, such as 3D convolution, 3D up-convolution and 3D max pooling layers. Most medical imaging data is 3D and carries information along the z-axis, unlike 2D data. Because 3D convolution has a larger number of parameters, 3D volumes are usually cropped into fixed-size patches. When a 2D network structure is used, the information in adjacent slices is not considered, which reduces the accuracy and efficiency of the network [26]. Replacing 2D convolution with 3D convolution in the 3D U-Net model allows the network to extract more local information.
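The cropping of a 3D volume into fixed-size patches can be sketched as follows. This is only an illustration and not the procedure of any participating team: the function names `patch_starts` and `patch_corners`, the volume shape, the patch size and the stride are all assumptions chosen for the example.

```python
from itertools import product


def patch_starts(length, patch, stride):
    """Start offsets that tile one axis; the last patch is shifted back to fit."""
    if patch > length:
        raise ValueError("patch is larger than the volume along this axis")
    starts = list(range(0, length - patch + 1, stride))
    if starts[-1] != length - patch:
        starts.append(length - patch)  # cover the tail of the axis
    return starts


def patch_corners(shape, patch_shape, stride_shape):
    """Corner (z, y, x) of every fixed-size 3D patch covering the volume."""
    axes = [patch_starts(n, p, s) for n, p, s in zip(shape, patch_shape, stride_shape)]
    return list(product(*axes))


# Hypothetical volume: 300 slices of 512 x 512 pixels, cut into 128^3 patches.
corners = patch_corners((300, 512, 512), (128, 128, 128), (128, 128, 128))
print(len(corners))  # 3 offsets along z times 4 along y times 4 along x
```

Shifting the last patch back so that it ends at the volume border avoids padding, at the cost of some overlap between the final two patches along that axis.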

3D U-net architecture.
Based on the original U-Net, the nnU-Net framework (Fig. 3) optimizes the pre-processing, loss, data augmentation, patch-based strategy, model ensembling and post-processing. Adjusting the network structure too much may lead to over-fitting on the training dataset, so the nnU-Net framework does not change the network structure substantially. The pre-processing comprises cropping, resampling and normalization. The problem of limited data is addressed with a large variety of data augmentation techniques. The framework applies random rotations, random scaling, random elastic deformations, gamma correction augmentation and mirroring to prevent over-fitting [27]. In post-processing, the largest connected component is retained.
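Two of the listed augmentations, mirroring and gamma correction, can be sketched on a tiny 2D image as follows. This is a simplified illustration and not the nnU-Net implementation: the function names, the 0.5 mirroring probability and the gamma range are assumptions made for the example, and intensities are assumed to be normalized to [0, 1].

```python
import random


def gamma_correct(image, gamma):
    """Apply intensity**gamma to an image whose values lie in [0, 1]."""
    return [[pixel ** gamma for pixel in row] for row in image]


def mirror(image, horizontal=True):
    """Flip a 2D image left-right (horizontal) or top-bottom."""
    if horizontal:
        return [row[::-1] for row in image]
    return image[::-1]


def augment(image, mask, rng):
    """Randomly mirror image and mask together, then jitter image contrast."""
    if rng.random() < 0.5:
        image, mask = mirror(image), mirror(mask)
    gamma = rng.uniform(0.7, 1.5)  # assumed range, not taken from the paper
    return gamma_correct(image, gamma), mask


image = [[0.0, 0.25], [0.5, 1.0]]
mask = [[0, 0], [1, 1]]
aug_image, aug_mask = augment(image, mask, random.Random(0))
```

Geometric transforms such as mirroring must be applied to the image and its mask together, whereas intensity transforms such as gamma correction apply to the image only.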

nnU-net architecture. Stage 1: a 3D U-Net processes downsampled data, and the resulting segmentation maps are upsampled to the original resolution. Stage 2: these segmentations are concatenated as one-hot encodings to the full resolution data and refined by a second 3D U-Net.
Neural network structures with an attention model offer good performance and interpretability. Multi-scale features are a significant part of semantic segmentation. One common way to extract multi-scale features is to feed multiple resized input images to a shared deep network, whereas attention makes the model focus on the features of interest by putting different weights on objects of different scales [28]. The principle of the attention mechanism is to divide the task into several subtasks and assign them different weights.
GAN
A GAN is composed of a generator and a discriminator. The generator aims to increase the judgment error of the discriminator by generating more realistic segmentations that are hard to distinguish from manual contours. The discriminator aims to decrease its own judgment error and improve its ability to tell real from fake [29]. When the discriminator can no longer distinguish real images from generated images, the network has converged. In medical image segmentation tasks, U-Net or other network structures are usually used as the generator for segmentation. The discriminator, composed of an FCN or another network structure, updates its weights by distinguishing the segmentation result from the ground truth data. WGAN is an improved version of GAN that addresses the vanishing gradient of the generator. The sigmoid function of the GAN discriminator and the log function used to calculate the loss are removed, which avoids the situation in which the optimal discriminator assigns probabilities of 1 and 0 to real and generated samples.
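The difference between the two loss formulations can be sketched on plain lists of discriminator outputs. This is a minimal illustration under stated assumptions and not the code of team 12: the function names are hypothetical, and the inputs stand in for the outputs of a real discriminator network.

```python
import math


def gan_discriminator_loss(real_probs, fake_probs):
    """Standard GAN loss: D outputs sigmoid probabilities, loss uses logs."""
    real_term = sum(math.log(p) for p in real_probs) / len(real_probs)
    fake_term = sum(math.log(1.0 - p) for p in fake_probs) / len(fake_probs)
    return -(real_term + fake_term)


def wgan_critic_loss(real_scores, fake_scores):
    """WGAN loss: raw critic scores, no sigmoid and no log."""
    mean_real = sum(real_scores) / len(real_scores)
    mean_fake = sum(fake_scores) / len(fake_scores)
    return mean_fake - mean_real


def wgan_generator_loss(fake_scores):
    """The generator tries to raise the critic score of its segmentations."""
    return -sum(fake_scores) / len(fake_scores)


# Scores for two annotated masks and two generated segmentations (made-up values).
print(wgan_critic_loss([2.0, 4.0], [1.0, 1.0]))
print(wgan_generator_loss([1.0, 3.0]))
```

The original WGAN formulation additionally constrains the critic, for example by clipping its weights to a small range; that step is omitted from this sketch.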
Post-processing
Conditional random field (CRF)
To remove isolated islands of falsely predicted pixels, the CRF algorithm is used as a post-processing tool on the output of the trained neural network. A CRF is a kind of random field that shares characteristics with the maximum entropy model and the Markov random field (MRF). Each labeled pixel in the image is regarded as a random variable of an MRF. Every pixel i has a category label x_i and an observation value y_i. The energy of a labeling, made up of the unary potential function and the binary (pairwise) potential function, is used to predict the category of each label.
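How the unary and binary terms combine into an energy can be sketched with a deliberately simplified model. The sketch assumes a Potts penalty between 4-connected neighbours and a negative log-probability unary term; it is not the CRF used by any team (team 7, for example, reports Gaussian and bilateral binary potentials), and the function name and weight are hypothetical.

```python
import math


def crf_energy(labels, probs, pairwise_weight=1.0):
    """Energy = sum of unary terms + Potts penalty on 4-connected neighbours.

    labels[r][c] is the label x_i (0 or 1); probs[r][c] is the network's
    vessel probability for the pixel, standing in for the observation y_i.
    """
    rows, cols = len(labels), len(labels[0])
    energy = 0.0
    for r in range(rows):
        for c in range(cols):
            p = probs[r][c] if labels[r][c] == 1 else 1.0 - probs[r][c]
            energy += -math.log(max(p, 1e-12))      # unary potential
            for dr, dc in ((0, 1), (1, 0)):         # right and down neighbours
                nr, nc = r + dr, c + dc
                if nr < rows and nc < cols and labels[r][c] != labels[nr][nc]:
                    energy += pairwise_weight       # binary (pairwise) potential
    return energy


# Background with one weakly supported "island" pixel in the centre.
probs = [[0.1, 0.1, 0.1],
         [0.1, 0.6, 0.1],
         [0.1, 0.1, 0.1]]
island = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
removed = [[0, 0, 0], [0, 0, 0], [0, 0, 0]]
print(crf_energy(island, probs) > crf_energy(removed, probs))
```

In this toy case the labeling without the island has the lower energy, because the four pairwise penalties outweigh the slightly better unary fit of the isolated pixel.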
Maximum connected domain processing
The result is processed by connected domain analysis: the vascular regions are selected and the non-vascular regions are discarded, which effectively reduces the over-segmentation rate. The method uses breadth-first search with a visit flag to find all connected domains. Only the largest connected domain, or the few largest, are kept.
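The breadth-first search with visit flags can be sketched on a small 2D binary mask. This is an illustrative sketch rather than any team's code: it assumes 4-connectivity and a 2D slice, whereas the challenge data are 3D volumes, and the function names are hypothetical.

```python
from collections import deque


def connected_components(mask):
    """Label 4-connected foreground regions of a 2D binary mask with BFS."""
    rows, cols = len(mask), len(mask[0])
    visited = [[False] * cols for _ in range(rows)]
    components = []
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] == 0 or visited[r][c]:
                continue
            visited[r][c] = True
            queue, region = deque([(r, c)]), []
            while queue:
                cr, cc = queue.popleft()
                region.append((cr, cc))
                for nr, nc in ((cr - 1, cc), (cr + 1, cc), (cr, cc - 1), (cr, cc + 1)):
                    if 0 <= nr < rows and 0 <= nc < cols and mask[nr][nc] and not visited[nr][nc]:
                        visited[nr][nc] = True
                        queue.append((nr, nc))
            components.append(region)
    return components


def keep_largest(mask, k=1):
    """Return a copy of mask that keeps only the k largest components."""
    kept = sorted(connected_components(mask), key=len, reverse=True)[:k]
    out = [[0] * len(mask[0]) for _ in mask]
    for region in kept:
        for r, c in region:
            out[r][c] = 1
    return out


mask = [[1, 1, 0, 0],
        [1, 0, 0, 1],
        [0, 0, 0, 0],
        [0, 1, 1, 0]]
cleaned = keep_largest(mask, k=1)
```

Setting `k=2` corresponds to keeping the two largest regions, as team 1 describes; the right value depends on how many separate vascular trees are expected in the volume.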
Morphological operation
Dilation of blood vessels can connect vessels that are broken by small gaps. The dilation operation convolves the image with a kernel to find the local maximum and assigns this maximum value to the pixel specified by the reference point.
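Dilation as a local maximum can be sketched on a 2D binary mask. The sketch assumes a square structuring element centred on the reference pixel; the function name `dilate` and the `radius` parameter are illustrative and not taken from any team's implementation.

```python
def dilate(mask, radius=1):
    """Binary dilation: each output pixel is the maximum over a square window."""
    rows, cols = len(mask), len(mask[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            window = [
                mask[nr][nc]
                for nr in range(max(0, r - radius), min(rows, r + radius + 1))
                for nc in range(max(0, c - radius), min(cols, c + radius + 1))
            ]
            out[r][c] = max(window)
    return out


# A vessel broken by a one-pixel gap in the middle row.
broken = [[0, 0, 0, 0, 0],
          [1, 1, 0, 1, 1],
          [0, 0, 0, 0, 0]]
joined = dilate(broken)
print(joined[1])
```

The gap in the middle row is closed, but the vessel also becomes thicker, so dilation alone can raise the over-segmentation rate.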
Participation methods
Team 1 uses the U-Net network model with added pre-processing and post-processing. The pixel values are normalized to discard the highest and lowest values, and the images are inverted. The loss function combines cross-entropy loss and Dice loss. In their post-processing algorithm, they first dilate the pulmonary vessels to connect broken vessels and then select the two largest connected regions. To speed up the convergence of network training, they use a patch-sampling method when training the U-Net network.
Team 2 applies the 3D U-Net model to pulmonary vascular extraction. In image pre-processing, the lung is cropped to its minimal bounding cube with a fixed size of 128×128×128. Then, data augmentation techniques such as zoom, translation, rotation, gamma transform, flip, elastic deformation and Gaussian noise are used to mitigate the shortage of training data, which improves the generalization ability of the model. During model training, the loss function is the weighted sum of Dice loss and BCE loss. Finally, five-fold cross-validation is used to evaluate the prediction performance of the model.
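A weighted sum of Dice loss and BCE loss can be sketched on flat lists of predicted probabilities and 0/1 labels. The weights used by team 2 are not given in the text, so the equal weights below, like the function names and smoothing constants, are assumptions made for illustration.

```python
import math


def dice_loss(pred, target, eps=1e-6):
    """Soft Dice loss on flat lists of probabilities and 0/1 labels."""
    intersection = sum(p * t for p, t in zip(pred, target))
    return 1.0 - (2.0 * intersection + eps) / (sum(pred) + sum(target) + eps)


def bce_loss(pred, target, eps=1e-7):
    """Mean binary cross-entropy; predictions are clamped away from 0 and 1."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(pred)


def combined_loss(pred, target, dice_weight=0.5, bce_weight=0.5):
    """Weighted sum of Dice and BCE; the weights here are assumed values."""
    return dice_weight * dice_loss(pred, target) + bce_weight * bce_loss(pred, target)


target = [1, 0, 1, 0]
confident = [0.9, 0.1, 0.9, 0.1]
uncertain = [0.5, 0.5, 0.5, 0.5]
print(combined_loss(confident, target) < combined_loss(uncertain, target))
```

The Dice term responds to region overlap and is less sensitive to the imbalance between vessel and background voxels, while the BCE term provides a per-voxel gradient; combining the two is a common choice for thin structures.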
The method of team 3 is based on nnU-Net, which can automatically adapt to the specifics of a dataset and achieves a higher mean Dice on multiple datasets for different scenarios. They resample the training images to obtain full-resolution vessel images and low-resolution lung images. The network architecture is trained on the low-resolution images (stage 1) and the full-resolution images (stage 2). Finally, they improve the results using the maximum connected domain post-processing method. To display the results better, they extract the pulmonary vascular centerline from the pulmonary vessel segmentation result.
Team 4 uses the nnU-Net network model to train on the full-resolution images and on low-resolution images obtained by cubic spline interpolation. In pre-processing, they reprocess the source mask images, calculate the overlap of trachea and vessels and normalize the dataset. Considering the relationship between trachea and vessels, they train on the trachea and blood vessels together, and label markers with less than 5% overlap as vessels. They add deep supervision at each level of the decoder to increase the gradient scalability of the supervision signal.
Team 5 employs a Multi-scale Fusion Network. First, the images are processed by Gaussian filtering and image enhancement, which change the light and dark contrast. Then, the parallel encoder-decoder structure mixes low-resolution and high-resolution information, which is better than U-Net.
Team 6 adds an attention model to U-Net. The attention module is added to the down-sampling process: in the down-sampling stage of the network, attention concentrates the results of the up-sampling stage on the vascular area before the skip connection is made.
Team 7 adds an attention module to the U-Net neural network. They augment the data by rotation, Gaussian blur and contrast changes. The loss function is the weighted sum of the cross-entropy loss and the mean square error loss. It is defined as
Finally, they use a CRF to fill holes, constructing a binary Gaussian potential function and a binary bilateral potential function to improve the segmentation results.
Team 8 takes the U-Net network as the backbone. Because the vessels as a whole form a tree structure, the vessels in a 2D slice are scattered and may appear in any area of the lung, which makes segmentation difficult. Therefore, they add an attention module to improve the segmentation accuracy for narrow vessels. Their model outputs three channels at the same time, representing the lung, pulmonary vessels and airway, respectively. They add a shape prior loss to constrain model training. This shape prior loss specifically targets the connectivity of tubular structures, in order to maintain the continuity of blood vessels and trachea as far as possible and to avoid over-segmentation and under-segmentation by
They extract the pulmonary vascular skeleton from the model output and then prune the branches to remove over-segmented points and connect incomplete branches.
Team 9 adopts two segmentation methods based on attention networks. Because the blood vessels are scattered across the image, they choose a multi-scale feature segmentation method with a position attention module and a channel attention module. In this model, the feature maps at different scales obtained by down-sampling are fused. The position attention module computes each feature as a weighted sum of the features at all positions. The channel attention module integrates the features across all channel maps.
Team 10 uses the Inf-Net network [30]. Because segmentation is difficult in low-contrast slices, they pre-process the data by eliminating abnormal points, limiting the Hounsfield unit (HU) values of the CT images to between –250 and 200, normalizing and removing noise. An edge attention module extracts edge information from the low-level feature map. Further convolution operations produce the high-level feature map. The decoder produces a coarse segmentation result, which is then combined with the edge features and high-level features for adaptive learning in the reverse attention module. In the encoding stage, the high-level features of cubic convolution are aggregated by parallel connection. The results of the reverse attention module are fed back to the upper convolution layer for the same processing. For the loss function, it has
Team 11 uses a network with a scale-aware pyramid fusion (SAPF) module [31] (Fig. 4). The key factor in the performance of this segmentation network is the scale-aware pyramid fusion module introduced between the encoder and decoder of U-Net. It not only captures multi-scale context information but also highlights the specific information of the target. They normalize the image to the 0–1 range and convert the data into a single channel to facilitate network training. In the SAPF module, three parallel dilated convolutions with different dilation rates and shared weights capture the multi-scale context information. The SAPF module uses a self-learned spatial attention mechanism and adaptively highlights the scale-specific information of the target for each image. The encoder uses ResNet to improve feature extraction, which makes the network easy to converge. A single model can segment three objects at the same time, which is reproducible and stable. According to anatomical knowledge, the pulmonary vessels lie within the lung parenchyma. Therefore, they train a two-stage end-to-end segmentation network and crop the original image after obtaining the lung parenchyma. The two stages can be trained at the same time and the model has few parameters, which is convenient for clinical application. No complex ensemble learning or pre-processing is required.

Scale aware pyramid fusion module.
Comparative analysis of the proposed method
Team 12 mainly uses two models: TS-WGAN and U-Net-WGAN. The TS network is the training model for CT image prediction, and U-Net is the generator network for CTA image prediction. WGAN is used as the discriminator network. The proximal pulmonary vessels are thicker and occupy a larger segmented area, whereas the distal vessels are smaller and more complex. Therefore, they designed a two-stage network model with different sampling depths.
We describe the results of the competition from the numerical results and the intuitive results. In the numerical results, we assess the performance by using three different measures. In the intuitive results, we display the visual results of all vascular segmentation methods and compare them with the ground truth.
Numerical results
To evaluate the segmentation task, we analyze the segmentation results by calculating the Dice coefficient, over-segmentation rate (OR) and under-segmentation rate (UR). The Dice coefficient evaluates the overlap between the ground truth and the segmented area. OR represents the proportion of the segmented area that lies outside the ground truth. UR evaluates the proportion of the ground truth that is missing from the segmentation. The Dice coefficient, OR and UR are defined as
where TP means a vessel annotated as true is segmented, FP means a region annotated as false is segmented, and FN means a vessel annotated as true is not segmented. Figure 5(a) and Table 2 show the quantitative evaluation results for CT images. Figure 5(b) and Table 3 show the quantitative evaluation results for CTA images.
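The three measures can be sketched from voxel-wise counts as follows. The Dice expression is the standard one; the OR and UR expressions are assumptions for illustration, since several variants exist in the literature and the challenge's exact formulas are not reproduced in this text. The sketch uses the usual confusion-matrix convention: TP is annotated and segmented, FP is segmented but not annotated, FN is annotated but not segmented.

```python
def confusion_counts(pred, truth):
    """Count TP, FP, FN over flat 0/1 lists of predicted and annotated voxels."""
    tp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 1)
    return tp, fp, fn


def dice(tp, fp, fn):
    """Standard Dice coefficient: 2TP / (2TP + FP + FN)."""
    return 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 1.0


def over_segmentation_rate(tp, fp):
    # Assumed definition: share of the predicted region lying outside the annotation.
    return fp / (tp + fp) if tp + fp else 0.0


def under_segmentation_rate(tp, fn):
    # Assumed definition: share of the annotated region the prediction misses.
    return fn / (tp + fn) if tp + fn else 0.0


truth = [1, 1, 1, 1, 0, 0, 0, 0]
pred = [1, 1, 1, 0, 1, 0, 0, 0]
tp, fp, fn = confusion_counts(pred, truth)
print(dice(tp, fp, fn), over_segmentation_rate(tp, fp), under_segmentation_rate(tp, fn))
```

Under these assumed definitions, a lower OR means fewer voxels are segmented outside the annotation and a lower UR means fewer annotated voxels are missed, which matches how the two rates are interpreted in the results.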

Dice coefficients and OR, UR results by each team. a-c.
Dice coefficient and OR, UR results of CT images segmented by each team
Dice coefficient and OR, UR results of CTA images segmented by each team
Most Dice coefficients for CT images are between 0.7 and 0.8, but two groups score about 0.5. Most teams whose Dice coefficient is close to 0.8 benefit from a post-processing step added after network training to reconnect broken vessels and remove redundant parts. If the Dice coefficient is too low, it is likely that the network is not well optimized, which leads to instability and poor generalization. As a result, some cases in the test dataset are severely over-segmented or not segmented successfully. The 3D U-Net method achieves the best Dice coefficient, which shows that it is necessary to consider the valuable information along the z-axis. The lowest OR is 0.268, obtained by the method of team 4. This benefits from the data augmentation of nnU-Net and from considering the connection between pulmonary vessels and trachea. In terms of UR, the best result is obtained by the Multi-scale Fusion Network of team 5, which indicates that fusing multi-scale feature maps can also improve accuracy.
Most Dice coefficients for CTA images are between 0.6 and 0.8, but the Dice coefficient of the sixth group is about 0.3. Team 1 achieves the best Dice coefficient. The main reason is the use of a morphological operation that connects broken vessels. Team 10 achieves the lowest OR, because its method combines the edge feature map and the high-level feature map for adaptive learning in the reverse attention module.
Based on the values in Table 2 and Table 3, seven groups achieve higher Dice coefficients on CT images than on CTA images, while the other five groups achieve higher Dice coefficients on CTA images than on CT images (Fig. 5(c)). CTA images acquired with contrast medium show better vascular enhancement, which benefits vascular segmentation, but also more noise and less visible detail. Because the dataset is small and the annotation affects the results, the comparison may not be widely applicable, so this paper does not compare the two modalities further.
In Fig. 6, we show the three-dimensional display of the pulmonary vascular segmentation results of the 12 teams in the final stage of the ISICDM 2020 challenge. Among the results, the methods using 3D U-Net and U-Net achieve better results. The majority of methods extract the pulmonary vascular tree completely, except for the CTA02 result of team 6 and the CTA04 result of team 10. The results of team 6, which uses only the Attention U-Net architecture, are seriously over-segmented, which leaves insufficient memory to display case CTA02 normally. According to the prediction results of the model of team 10, the result for case CTA04 is empty. This may be due to the poor generalization ability of the Inf-Net architecture used by team 10. Team 7, which also uses the Attention U-Net architecture, produces incomplete segmentations for many cases.

Three-dimensional display for pulmonary vascular segmentation results by each team. a-b. CT01-CT04 respectively represents four datasets for CTA images. The CTA02 result by team 6 is over segmented, which leads to insufficient memory to display normally, and the CTA04 result for team 10 is blank.
In this section, we discuss the advantages and disadvantages of each method by the numerical results and intuitive results. According to the results, we compare the 12 models with similar or different network structures and analyze the reasons that result in differences.
As shown in Fig. 6, the results of the first, second and fifth groups are clearly better than those of the third and fourth groups. Among them, the segmentation results of the first group, which dilates the segmentation and then keeps the largest connected domains during post-processing, are the best. Unlike other post-processing methods, morphological dilation can connect broken small vessels and is also a common method of filling holes. The results also show that a 3D network can improve segmentation accuracy to a certain extent by exploiting the spatial information of the image. The results of team 3 and team 4, which were obtained with nnU-Net, show that it is more suitable for training on different datasets. Since 3D networks perform much better than 2D networks, 2D networks are generally considered for segmenting thick vessels and 3D networks for segmenting narrow vessels. In the nnU-Net network model, extensive data augmentation helps to avoid over-fitting.
Of all groups, only the twelfth uses the U-Net network model with a GAN. The main problem with its results is that there are many broken branches at the ends of vessels; together with the lack of connectivity processing, this causes under-segmentation. From the 3D display, we can see that most of the segmentation results of team 12 are relatively complete and correctly segment relatively small branches. The GAN has good generative ability and can produce more realistic results.
Among the results of the network models with attention, those of team 8 and team 11 are clearly better than those of the other groups. In the eighth group, the attention module is used to deal with the segmentation of narrow vessels. A shape prior loss and connectivity completion are added to the model to ensure the connectivity of the vascular segmentation, which reduces the under-segmentation rate. At the same time, post-pruning reduces the over-segmentation rate. Team 11 not only adds an attention module to highlight the target area but also uses the scale-aware pyramid fusion module to capture context information. Team 6 and team 7 do not optimize the network, so their models could not adapt well. Teams 6–8 all use attention based on U-Net. However, they achieve different results, which may be related to the different loss functions, learning rates and batch sizes they chose when training the model.
We can clearly see that the segmentation results of the fifth, eleventh and twelfth groups are better than those of the other groups. Team 5 and team 11 use scale fusion to extract different feature maps so that the model can learn more information during training. The fifth group segments more regions, but its over-segmentation is serious. In addition to using U-Net as the main network structure, multi-scale fusion, attention and GAN can improve the accuracy of segmentation tasks to a certain extent. Besides, morphological operations, shape prior loss constraints, maximum connectivity processing, pruning and similar methods are needed to fill holes, connect broken vessels and reduce the false positive rate, which can effectively improve segmentation accuracy. When using a better network model, we should also consider how to optimize the model so that it suits the current segmentation task, as well as the adaptive ability of the model so that it suits other data.
Conclusion
As an important part of lung tissue segmentation in the challenge, we have published the lung dataset with voxel-based annotation. On this basis, we summarize and analyze the segmentation algorithms of the 12 teams participating in the final stage of the challenge and compare the overall results. The best algorithm on CT scans, the 3D U-Net of team 2, achieves a Dice coefficient of 79.7%, which shows the importance of spatial information. The best method on CTA scans, the U-Net of team 1, achieves a Dice coefficient of 81.6%, which shows that U-Net performs stably as the mainstream network. The Attention U-Net architecture performs poorly, which leaves room for further improvement. We can draw the following conclusions: (1) It is very difficult to obtain a standard dataset, and labeling the lung tissue area is still a great challenge. (2) In image segmentation, pre-processing, algorithm optimization and post-processing are all important steps, and post-processing in particular has a great impact on the results. (3) Because the segmentation result of a medical image is a 3D image, accuracy can be improved by considering spatial information in the neural network algorithm. (4) In medical image segmentation, objects at different scales produce different feature maps. Therefore, methods that fuse multi-scale feature maps can improve the accuracy of the segmentation results.
Acknowledgments
In the process of experiment and writing, the contributors get a lot of writing suggestions and work guidance from editors and readers, which helps us to make the content more rigorous and easier to understand. Here, we would like to express our most sincere thanks to you. Again, this work is supported by the National Natural Science Foundation of China (61971118) and Fundamental Research Funds for the Central Universities (N182410001, N2104008).
