Abstract
BACKGROUND:
The results of medical image segmentation can provide reliable evidence for clinical diagnosis and treatment. U-Net has been widely used in the field of medical image segmentation. Its encoder extracts semantic features of different scales at different stages, but it does not process the features of each scale in a scale-specific way.
OBJECTIVE:
To improve the feature expression ability and segmentation performance of U-Net, we propose a feature supplement and optimization U-Net (FSOU-Net).
METHODS:
First, we put forward the view that semantic features of different scales should be treated differently. Based on this view, we classify the semantic features automatically extracted by the encoder into two categories: shallow semantic features and deep semantic features. Then, we propose the shallow feature supplement module (SFSM), which obtains fine-grained semantic features through up-sampling to supplement the shallow semantic information. Finally, we propose the deep feature optimization module (DFOM), which uses dilated convolutions with different receptive fields to obtain multi-scale features and then performs multi-scale feature fusion to optimize the deep semantic information.
RESULTS:
The proposed model is evaluated on three public medical image segmentation datasets, and the experimental results support the proposed idea. The segmentation performance of the model is higher than that of advanced models for medical image segmentation. Compared with the baseline network U-Net, the Dice coefficient, the main metric, is 0.75% higher on the RITE dataset, 2.3% higher on the Kvasir-SEG dataset, and 0.24% higher on the GlaS dataset.
CONCLUSIONS:
The proposed method can improve the feature representation ability and segmentation performance of the model.
Introduction
In clinical practice, medical image analysis [1] can provide physicians with digital and quantified medical information to help them make objective and accurate diagnoses. Medical image segmentation is important for medical image analysis and can be used for image-guided interventions, radiotherapy, or improved radiological diagnosis.
With the rapid development of artificial intelligence, especially deep learning [2], image segmentation methods based on deep learning have achieved remarkable results. Compared with traditional machine learning and computer vision methods, deep learning-based image segmentation methods have great advantages in terms of segmentation accuracy and speed. They can effectively help doctors locate lesion sites and evaluate the effect before and after treatment, thus greatly reducing the workload of medical personnel. Therefore, image segmentation methods based on deep learning have become the first choice in medical image segmentation.
In the field of medical image segmentation based on convolutional neural networks, most architectures are convolutional networks with an encoder-decoder structure. The advantage of this architecture is that the convolutional network has excellent feature extraction capability and good feature representation [3]. It does not require manual extraction of image features, but provides an end-to-end image segmentation solution. U-Net [4], designed by Ronneberger et al., was presented at the 2015 MICCAI conference and has been widely used due to its excellent performance in medical image segmentation. In addition, networks built on U-Net also perform well in many respects. In general, the main directions for improving U-Net variants fall into three categories:
Adding rich connectivity, usually including dense connectivity (idea derived from DenseNet [5]) and residual connectivity (idea derived from ResNet [6]), to improve the segmentation performance by increasing the interaction and fusion between extracted features [7, 8, 9, 10, 11, 12, 13].
Using different ways of up-sampling and down-sampling to adjust the feature scale and then fusing the features between different stages to improve the segmentation performance by multi-scale fusion [8, 9, 10].
Simulating biological visual properties, designing attention mechanisms with different dimensions or different computation methods, and giving different attention weights to information in features to improve segmentation performance [14, 15, 16].
Of course, the improvement of a particular network is not limited to one direction; a network can be improved in multiple directions simultaneously. The above improvements to U-Net have proved effective. However, previous work did not process the feature information automatically extracted at different stages of the U-Net encoder in a targeted manner. The common practice is to use the same approach for the features of all stages. It has been shown that the features extracted at different stages of U-Net have different semantics, with shallow features tending to carry concrete semantic information and deep features tending to carry abstract semantic information. From this perspective, this paper argues that the semantic features at different stages should be treated differently. Based on this point of view, this paper proposes a feature supplement and optimization U-Net (FSOU-Net) for medical image segmentation and verifies the validity of the proposed model on three public medical image segmentation datasets. Experimental results show that the proposed model has higher segmentation performance than U-Net and some of its variants.
Specifically, we make the following contributions:
In this paper, we propose the idea of processing the shallow semantic information automatically extracted by the U-Net encoder separately from the deep semantic information.
A multi-scale based shallow feature supplementation method is proposed to supplement the fine-grained semantic information of shallow features, which in turn improves the feature representation capability of the model.
A deep semantic feature optimization method based on multi-scale is proposed. Ablation experiments demonstrate that the method can work in concert with the shallow feature supplementation method to jointly improve the segmentation performance of the model.
The proposed FSOU-Net achieves better segmentation results than the compared models on three public datasets: RITE, Kvasir-SEG and GlaS.
In the rest of the paper, Section 2 mainly reviews some related works, and in Section 3 we propose a new image segmentation framework and discuss each module of our framework. Section 4 reports the experimental results of three public datasets in detail and provides a comparative analysis. Finally, Section 5 summarizes the content of the article.
Related work
Biomedical image segmentation
Medical image segmentation plays an important role in imaging medicine. It is an indispensable means of extracting quantitative information about specific tissues in medical images, and it is also a pre-processing step and prerequisite for visualization. Earlier medical image segmentation usually used manual or semi-manual segmentation techniques. These methods are usually either based on edge detection or template matching [17, 18, 19, 20, 21, 22] or learning based [23, 24, 25]. Their drawback lies in the use of hand-crafted features to obtain the segmentation results. On the one hand, obtaining segmentation results from hand-crafted features is inefficient and generalizes poorly. On the other hand, these tasks not only need to be performed manually by experienced and specialized medical practitioners [26], but are also tedious and error-prone. To improve efficiency and reduce the workload of medical personnel in clinical scenarios, automatic segmentation methods with high accuracy are needed.
CNN-based deep learning models provide an end-to-end solution for image segmentation. Long et al. created FCN [27] based on classification networks, which pioneered the use of convolutional operations for semantic segmentation. The emergence of FCN provided ideas for solving segmentation problems using convolutional operations. SegNet [28] is a semantic segmentation network based on FCN that was proposed by researchers at Cambridge. Unlike FCN, instead of up-sampling with deconvolution operations, it uses the max-pooling indices (positions) recorded by the encoder to perform nonlinear up-sampling of the decoder’s input. There are many other FCN-based designs. Christ et al. [29] proposed the cascaded FCN (CFCN), in which each model builds on the prediction map of the previous model through context features. This method improves the accuracy of segmentation. In order to reduce the number of false positives in medical images caused by the imbalanced ratio of background and foreground pixels, Zhou et al. [30] proposed to apply the focal loss technique to FCN.
CNN-based segmentation methods can automatically learn semantic features from images, and their segmentation performance is higher than that of traditional segmentation methods, so they are widely used in image segmentation of human tissues such as cells [31], pancreas [32], liver tumors [33], and glands [34].
U-Net and its variant architecture
Based on FCN, Ronneberger et al. [4] proposed U-Net, built on a U-shaped encoder-decoder structure, in which the importance of skip connections for medical image segmentation was analyzed and demonstrated. UNet
Multi-scale information
Scale is an important factor that determines what kind of information is extracted by convolution operation, and feature scales with different granularity will also affect the ability of feature information representation. In general, fine-grained features can contain more detailed information, while coarse-grained features usually contain more overall information. Multi-scale is usually represented by feature size or receptive field of convolution module.
The Inception family of networks [37, 38, 39, 40] that solve the image classification problem obtains multi-scale feature information by varying the size of the convolutional kernels. The Inception V1 model proposed in [37] uses convolutional kernel sizes of 1, 3, and 5 for feature extraction at different scales as well as pooling to obtain information at multiple scales, respectively. The convolution layer of
Method
In this section, after analyzing the characteristics of semantic information at different scales extracted by the U-Net encoder, we present the ideas of this paper and give the specific structure of FSOU-Net. We then introduce the shallow feature supplement structure and the deep feature optimization module in the model.
Different scale semantic features of U-Net encoder
U-Net is a typical network with an encoder-decoder structure, which mainly consists of U-shaped channels and skip connections; the encoder and decoder together form the U-shaped channels. The encoder is used to extract features at different scales of the image, and it contains four sub-modules, each of which contains two convolutional layers. The different scales are obtained by down-sampling with pooling layers. In Fig. 1, we visualize the semantic features of each stage of the U-Net encoder. It can be seen that the first layer of features shows more detailed information, and the edges and contours of the target region are clearly visible. As the layers deepen, the fine-grained semantic information starts to become blurred. In the third layer, it is difficult to recognize the outline of the target area, but the general location of the target is still recognizable. In the fourth layer of semantic features, the number of target regions is no longer recognizable, while more information about the image as a whole is present. In the last layer of features, the feature maps of originally different input images look essentially the same to the human eye.
Visualization results of U-Net features at different stages. (a) is the original image, and (b) to (f) are visual images of semantic features at different stages.
From the above analysis, we can find that the semantic features extracted by the encoder at different scales contain different semantic information. We believe that there is more detail information in the shallow features and less overall information in the image; there is more overall information and less detail information in the deep features. Based on this, we categorize the first three layers of the semantic features extracted by the encoder as shallow semantic features and the last layer as deep semantic features, and use different processing for them. The implemented network, which we call FSOU-Net, has the architecture shown in Fig. 2.
Illustration of the proposed FSOU-Net. The input image is up-sampled to obtain fine-grained features and then the shallow semantic features are supplemented, and the deep semantic features are optimized by DFOM.
The U-Net encoder implements the scale transformation by down-sampling the image through the max pooling layer. Down-sampling changes the scale of the semantic features, but it also discards some fine-grained semantic information. This causes U-Net and its variant networks to perform poorly when segmenting fine-grained targets or target edges, as shown in Fig. 3.
(a) input image, (b) segmentation result of U-Net, (c) segmentation result of the proposed model, (d) ground truth.
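The loss of fine-grained information under pooling can be illustrated with a toy example. The following is a minimal sketch, assuming 2x2 max pooling with stride 2 (the pooling window is an assumption, not stated in the text): two inputs containing one-pixel-wide lines at different positions become indistinguishable after pooling, so the exact position of the thin structure cannot be recovered from the pooled map alone.

```python
def max_pool_2x2(image):
    """2x2 max pooling with stride 2 on a list-of-lists image with even side lengths."""
    pooled = []
    for r in range(0, len(image), 2):
        row = []
        for c in range(0, len(image[0]), 2):
            row.append(max(image[r][c], image[r][c + 1],
                           image[r + 1][c], image[r + 1][c + 1]))
        pooled.append(row)
    return pooled


# Two toy inputs: a one-pixel-wide vertical line in column 0 and in column 1.
line_in_col0 = [[1, 0, 0, 0] for _ in range(4)]
line_in_col1 = [[0, 1, 0, 0] for _ in range(4)]

pooled0 = max_pool_2x2(line_in_col0)
pooled1 = max_pool_2x2(line_in_col1)

print(pooled0)             # [[1, 0], [1, 0]]
print(pooled0 == pooled1)  # True: the one-pixel shift is lost after pooling
```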
Since the shallow semantic features mainly extract spatial information, such as target location and contour with fewer channels, this paper designs a multi-scale fusion method to supplement the encoder shallow semantic features.
In Fig. 2, in order to extract more fine-grained semantic features, we first use up-sampling to increase the size of the input image. Then two classical 3
The proposed deep feature optimization module (DFOM) is shown in Fig. 4. Inspired by ASPP, we designed a Pyramid Sampling Structure (PSS) in which dilated convolution is used to enlarge the sampling receptive field. It is composed of four parallel dilated convolution channels. After each parallel convolution channel, batch normalization and an activation function are applied for nonlinear transformation. Finally, the results of the parallel dilated convolution channels are concatenated for multi-scale fusion across different receptive fields.
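The parallel dilated branches can be sketched in plain Python. This is an illustrative toy under stated assumptions, not the authors' implementation: the 3x3 kernel size, the fixed all-ones kernel, the ReLU activation, and the function names `dilated_conv2d` and `pyramid_branches` are all hypothetical, batch normalization is omitted, and the dilation rates 1, 3, 5, and 7 are taken from the DFOM figure caption. The example feeds a single bright pixel through each branch to show how the sampled neighbourhood widens with the dilation rate.

```python
def dilated_conv2d(image, kernel, dilation):
    """Zero-padded, size-preserving 2D cross-correlation with a dilated square kernel."""
    height, width = len(image), len(image[0])
    k = len(kernel)          # odd side length assumed
    half = k // 2
    output = [[0.0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            total = 0.0
            for i in range(k):
                for j in range(k):
                    rr = r + (i - half) * dilation
                    cc = c + (j - half) * dilation
                    if 0 <= rr < height and 0 <= cc < width:  # outside counts as zero
                        total += kernel[i][j] * image[rr][cc]
            output[r][c] = total
    return output


def relu(feature_map):
    return [[max(0.0, v) for v in row] for row in feature_map]


def pyramid_branches(image, kernel, dilations=(1, 3, 5, 7)):
    """Run one dilated branch per rate and stack the outputs as channels."""
    return [relu(dilated_conv2d(image, kernel, d)) for d in dilations]


# Toy input: one bright pixel in the centre of a 15x15 map.
size = 15
image = [[0.0] * size for _ in range(size)]
image[7][7] = 1.0
kernel = [[1.0] * 3 for _ in range(3)]  # hypothetical fixed 3x3 kernel

channels = pyramid_branches(image, kernel)
for d, channel in zip((1, 3, 5, 7), channels):
    extent = 2 * d + 1  # side length covered by a 3x3 kernel at dilation d
    responding = sum(v > 0 for row in channel for v in row)
    print(f"dilation {d}: covers {extent}x{extent}, {responding} responding positions")
```

Each branch responds at nine positions, but they are spaced 1, 3, 5, and 7 pixels apart, so the stacked channels describe the same input at four different receptive-field sizes. In a trained network the kernels would be learned and the stacked channels would be fused by further layers.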
ResNet [6] uses a shortcut connection mechanism to mitigate the vanishing gradient and exploding gradient problems. It allowed neural networks to exceed a thousand layers for the first time. The process can be described by the following formula:
Where
Deep Feature Optimization Module (DFOM). Convolutions with dilation rates of 1, 3, 5, and 7 are adopted, so the structure can fuse features of different scales.
The connection structure of ResNet is optimized in Inception-ResNet [40]. Inspired by Inception-ResNet, we first add
Setup
Datasets
In order to evaluate the performance and robustness of the proposed method, three public datasets of medical images from different domains were selected for the experiments: the RITE dataset [45], the Kvasir-SEG dataset [44], and the GlaS dataset [43]. The experimental results show that the proposed method achieves better segmentation results on all three datasets, with different degrees of improvement compared to the current more advanced CNN networks. In the next sections, we first present the model performance evaluation metrics used for the experiments and the details of the experimental implementation; then we describe the significance of the datasets for clinical diagnosis in practice and show the experimental results on each dataset. Finally, to show that the two proposed methods can work together, we perform ablation experiments on the GlaS dataset for the proposed model. Details of the datasets, the number of training and testing samples used and their availability are shown in Table 1.
The medical datasets used in our experiments. All of these datasets are publicly available
Image segmentation is essentially a pixel-level classification task. Counting how many pixels of the segmentation result are classified correctly makes it possible to evaluate the segmentation performance of the model, which is a common way of evaluating medical image segmentation models. Four common semantic segmentation metrics are used in the experiments: IoU (Intersection over Union), Dice (Dice coefficient), Se (Sensitivity), and Acc (Accuracy). Among them, IoU and Dice are the main evaluation metrics. In their standard form, the formulas are as follows:
$$\mathrm{IoU} = \frac{TP}{TP + FP + FN}, \qquad \mathrm{Dice} = \frac{2\,TP}{2\,TP + FP + FN}$$
$$\mathrm{Se} = \frac{TP}{TP + FN}, \qquad \mathrm{Acc} = \frac{TP + TN}{TP + TN + FP + FN}$$
Here, each pixel of the segmentation result falls into one of four cases. TP (true positive) is a pixel predicted as positive that is also positive in the ground truth; TN (true negative) is a pixel predicted as negative that is also negative in the ground truth; FP (false positive) is a pixel predicted as positive that is negative in the ground truth; FN (false negative) is a pixel predicted as negative that is positive in the ground truth. The higher the values of IoU, Dice, Se, and Acc, the better the segmentation performance.
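The four metrics can be computed directly from the pixel counts. The following is a minimal sketch using the usual definitions of IoU, Dice, sensitivity, and accuracy in terms of TP, TN, FP, and FN; it assumes binary masks given as nested lists of 0 and 1 with at least one positive pixel, and the function name `segmentation_metrics` is hypothetical.

```python
def segmentation_metrics(pred, truth):
    """Compute IoU, Dice, Se and Acc from two binary masks of equal shape."""
    tp = tn = fp = fn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p == 1 and t == 1:
                tp += 1
            elif p == 0 and t == 0:
                tn += 1
            elif p == 1 and t == 0:
                fp += 1
            else:
                fn += 1
    return {
        "IoU": tp / (tp + fp + fn),
        "Dice": 2 * tp / (2 * tp + fp + fn),
        "Se": tp / (tp + fn),
        "Acc": (tp + tn) / (tp + tn + fp + fn),
    }


# Toy masks: TP = 2, FP = 1, FN = 1, TN = 2.
pred = [[1, 1, 0],
        [0, 1, 0]]
truth = [[1, 0, 0],
         [0, 1, 1]]

metrics = segmentation_metrics(pred, truth)
print(metrics["IoU"])   # 0.5
print(metrics["Dice"])  # 2*2 / (2*2 + 1 + 1) = 0.666...
```

Note that Dice is never lower than IoU for the same prediction, which is why the two main metrics move together in the result tables.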
Loss function
The loss function in the experiment uses Binary Cross Entropy Dice loss (BCEDiceLoss), which is a loss function that integrates Binary Cross Entropy loss (BCELoss) and Dice loss (DiceLoss). BCELoss is a commonly used loss function for binary classification. Assuming that
V-Net [46] proposed DiceLoss, which aims to cope with the scenario of strong imbalance between positive and negative samples in semantic segmentation. The calculation of the Dice metric has been introduced in the previous section; in fact, it is intuitive to understand that the dice coefficient calculates the similarity between the target region of the segmentation output by the model and the target region in ground truth. Assuming that
where
where
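The combined loss described in this section can be sketched as follows. This is an illustration under assumptions rather than the exact loss used in the experiments: the equal weighting of the two terms, the smoothing constant in the Dice term, the clipping constant `eps`, and all function names are hypothetical choices.

```python
import math


def bce_loss(probs, targets, eps=1e-7):
    """Mean binary cross entropy over flattened pixel probabilities."""
    total = 0.0
    for p, y in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clip to keep log() finite
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(probs)


def dice_loss(probs, targets, smooth=1.0):
    """One minus a smoothed soft Dice coefficient."""
    intersection = sum(p * y for p, y in zip(probs, targets))
    denominator = sum(probs) + sum(targets)
    return 1.0 - (2.0 * intersection + smooth) / (denominator + smooth)


def bce_dice_loss(probs, targets, bce_weight=1.0, dice_weight=1.0):
    """Weighted sum of BCE and Dice losses (equal weights are an assumption)."""
    return (bce_weight * bce_loss(probs, targets)
            + dice_weight * dice_loss(probs, targets))


targets = [1, 0, 1, 0]
good = bce_dice_loss([0.9, 0.1, 0.8, 0.2], targets)
bad = bce_dice_loss([0.2, 0.8, 0.1, 0.9], targets)
print(good < bad)  # True: the better prediction has the lower loss
```

The BCE term scores every pixel independently, while the Dice term depends only on the overlap between the predicted and true foreground, which is why combining them is a common choice when foreground pixels are rare.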
Implementation details
The models were trained on an NVIDIA GeForce RTX 3060, a GPU with 12 GB of memory. The implementation is based on PyTorch. For the three public datasets, we use Adam as the model optimizer with weight decay set to 1e-4. The learning rate is set to 1e-3 during training, and we use simultaneous BN to train the model. The batch size is set to 16 for the Kvasir-SEG dataset and to 8 for the other two datasets. Training continues until the loss on the validation set no longer decreases; the best-performing parameters are then saved and used to predict the test data.
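The checkpoint selection rule can be sketched in a framework-independent way. This is a minimal sketch, assuming a patience-based stopping criterion; the `patience` value and the function name `select_best_epoch` are hypothetical, since the text only states that training stops once the validation loss no longer decreases.

```python
def select_best_epoch(val_losses, patience=10):
    """Return (epoch, loss) of the lowest validation loss seen before stopping.

    Training is considered finished once the validation loss has not improved
    for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = None
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0  # this is where a checkpoint would be saved
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break
    return best_epoch, best_loss


# Placeholder validation losses for illustration only, not values from the experiments.
history = [0.9, 0.7, 0.6, 0.65, 0.62, 0.61, 0.7]
print(select_best_epoch(history, patience=3))  # (2, 0.6)
```

In PyTorch, the optimizer settings stated above correspond to `torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)`, where `model` stands for the network being trained.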
Results
Comparison on RITE
RITE (Retinal Images vessel Tree Extraction) is a database that enables comparative studies on the segmentation or classification of arteries and veins in retinal fundus images. It is built on the publicly available DRIVE database (Digital Retinal Images for Vessel Extraction). The angle, width and length of retinal blood vessels can be extracted from the segmented image. This information can be used to diagnose various ophthalmic diseases, such as atherosclerotic choroidal neovascularization [18].
The RITE dataset [45] is a subset of the DRIVE dataset. It contains 40 images, and the original dataset divides them into a training dataset and a test dataset, each containing 20 images. On the basis of the original training data set, we divide the training data set and validation data set, and the partition results are shown in Table 1. We extended the training data by using horizontal and vertical flips during training. The results of the experiments are shown in Table 2.
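Flip augmentation for segmentation has to apply the same transform to the image and to its mask. The following is a minimal sketch on nested lists; it assumes the flipped copies are simply added to the training pairs, which is one possible reading of the text (random on-the-fly flipping would be another), and the function names are hypothetical.

```python
def hflip(image):
    """Mirror each row (left-right flip)."""
    return [row[::-1] for row in image]


def vflip(image):
    """Reverse the row order (top-bottom flip)."""
    return [row[:] for row in image[::-1]]


def augment_with_flips(image, mask):
    """Return the original pair plus horizontally and vertically flipped pairs."""
    return [
        (image, mask),
        (hflip(image), hflip(mask)),
        (vflip(image), vflip(mask)),
    ]


image = [[1, 2],
         [3, 4]]
mask = [[1, 0],
        [0, 0]]

for aug_image, aug_mask in augment_with_flips(image, mask):
    print(aug_image, aug_mask)
# [[1, 2], [3, 4]] [[1, 0], [0, 0]]
# [[2, 1], [4, 3]] [[0, 1], [0, 0]]
# [[3, 4], [1, 2]] [[0, 0], [1, 0]]
```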
Experimental results on RITE dataset (take the mean of the 5-fold experiments)

The RITE dataset is characterized by extremely fine branches in the target region, which leads to generally low segmentation accuracy for most models. The repeated down-sampling used by U-Net may lose fine-grained semantic information, which is one of the drawbacks of the U-Net model, and it tends to fail to recognize small targets. The proposed model can supplement the semantic information of fine-grained targets in the shallow semantic features, and thus obtains better segmentation performance. As expected, the table shows that the proposed model obtains better segmentation performance than the original U-Net model, with the main metrics IoU and Dice being 0.9% and 0.7% higher than U-Net, respectively. The Se and Acc metrics are 2% and 0.04% higher, respectively.
Figure 5 visualizes the segmentation results of advanced semantic segmentation models on the three datasets. For image A of the RITE dataset, the proposed model can segment finer vascular branches than the other networks, which we believe is due to the shallow feature supplement structure in FSOU-Net. In the segmentation results of image B, a trunk branch of the vascular network is completely segmented by the proposed FSOU-Net, while the other advanced models identify this trunk but do not segment it completely. Overall, the proposed model identifies the targets better and segments them more completely than the other advanced networks.
Colorectal cancer is a common type of cancer in humans, and polyps are the precursor of colorectal cancer. The risk of polyps increases with age. Half of people who get colonoscopies at age 50 are found to have polyps. Early detection of the disease can greatly improve the survival rate of colon cancer patients, and polyp detection is particularly important [44].
Kvasir-SEG is an open-access dataset of gastrointestinal polyp images and corresponding segmentation masks, manually annotated by a medical doctor and then validated by an experienced gastroenterologist. The dataset has a total of 1000 original images [44]. After random shuffling, we divided them into a training set, a validation set and a test set in the ratio 8:1:1, and compared the proposed model with some advanced models on this dataset. The experimental results are shown in Table 3.
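An 8:1:1 split after random shuffling can be sketched as follows. This is a minimal illustration; the seed, the file-name pattern, and the function name `split_8_1_1` are hypothetical and are not taken from the experiments.

```python
import random


def split_8_1_1(items, seed=0):
    """Shuffle the items and split them into train/validation/test at 8:1:1."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # local generator keeps the split reproducible
    n_train = len(shuffled) * 8 // 10
    n_val = len(shuffled) // 10
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


image_ids = [f"image_{i:04d}" for i in range(1000)]  # hypothetical identifiers
train_ids, val_ids, test_ids = split_8_1_1(image_ids)
print(len(train_ids), len(val_ids), len(test_ids))  # 800 100 100
```

Fixing the seed makes the split reproducible, so that every compared model is trained and evaluated on the same partition.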
The boundaries of the target regions in the Kvasir-SEG dataset are difficult to determine, which places higher demands on the model. For image C of Fig. 5, the models in the experiment all detect the location of the target region, but FSOU-Net outperforms the other networks in segmenting the contours of the target region. Image D has four target regions that need to be segmented, and only FSOU-Net segments them all. The disadvantage is that two target regions are connected together, which is an area for improvement.
As can be seen from Table 3, our method has the highest performance on all four metrics: IoU reaches 76.3%, Dice reaches 84.1%, Se reaches 86.5%, and Acc reaches 94.63%. Compared with the original U-Net, the proposed method improves IoU by 2.8%, Dice by 2.3%, Se by 2.5%, and Acc by 0.14%.
Experimental results on Kvasir-SEG dataset
Glands are important histological structures in the human body. They exist in most organs and are the main structures that secrete carbohydrates and proteins. Malignant tumors arising from glandular epithelium are the most prevalent form of cancer. The morphology of glands is important for evaluating their degree of malignancy [43]. Accurate gland segmentation is essential to obtain reliable morphological statistics.
The GlaS dataset used in the experiments was provided by a team of pathologists at the University Hospitals Coventry and Warwickshire, UK. It is divided into 85 training images (37 benign and 48 malignant) and 80 test images (37 benign and 43 malignant). On the basis of the original training set, we split off a validation set; the partition results are shown in Table 1. We extended the training data by using horizontal and vertical flips during training. The experimental results are shown in Table 4.
Experimental results on GlaS dataset (take the mean of the 5-fold experiments)
Compared with the other two datasets, the GlaS dataset has relatively more segmentation targets, and the number of segmentation targets varies greatly between images, so this dataset requires the model to segment the boundaries of the target regions well. As can be seen from the visualization in Fig. 5, the proposed model segments boundary information better than the original U-Net, owing to the supplementation of fine-grained semantic information in the shallow features and a certain degree of optimization of the deep semantic features. The experimental results show that the main metrics IoU and Dice are improved by 0.3% and 0.2%, and Se and Acc are improved by 1.1% and 0.05%, respectively.
To verify that the two proposed methods can work together, we performed ablation experiments on the GlaS data set for the shallow feature supplement method and the deep feature optimization method. The experimental process adopts 5-fold cross validation, and the average value of the 5-fold result is taken as the final experimental result. The ablation study results are shown in Table 5.
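The 5-fold protocol can be sketched as follows. This is a minimal illustration with contiguous folds over sample indices; shuffling before splitting is omitted for brevity, the function names are hypothetical, and the per-fold scores in the usage example are placeholder values, not results from the experiments.

```python
def k_fold_indices(n_samples, k=5):
    """Return k (train_indices, val_indices) pairs covering all samples once as validation."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    indices = list(range(n_samples))
    folds = []
    start = 0
    for size in fold_sizes:
        val = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        folds.append((train, val))
        start += size
    return folds


def mean_over_folds(scores):
    """Average a per-fold metric to obtain the reported value."""
    return sum(scores) / len(scores)


folds = k_fold_indices(10, k=5)
for train, val in folds:
    print("validate on", val, "train on", len(train), "samples")

placeholder_scores = [0.80, 0.82, 0.81, 0.79, 0.83]  # illustrative values only
print(round(mean_over_folds(placeholder_scores), 2))  # 0.81
```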
Ablation experimental results on GlaS dataset (take the mean of the 5-fold experiments)
Training process on the different datasets. The IoU curves of the three best-performing models on the validation set of each dataset are shown.
From the table, it can be seen that adding either shallow feature supplementation or deep feature optimization alone improves the segmentation effect. Adding the shallow feature supplementation method alone improves the main metrics IoU and Dice by 0.2% and 0.16%, respectively, while adding the deep feature optimization method alone improves IoU and Dice by 0.24% and 0.19%, respectively. Adding the two proposed methods together gives a better segmentation effect than adding either one alone, with IoU and Dice improved by 0.34% and 0.24%, respectively. This indicates that the two proposed methods can work in concert to jointly improve the segmentation effect of the network.
As shown in Fig. 6, we compare the training process of the three best-performing models on each of the three datasets. The proposed model consistently maintains the highest IoU, and its IoU also converges quickly relative to the other best-performing models.
Conclusion
In this paper, we propose FSOU-Net, an improved U-Net model that supplements the shallow features automatically extracted by the network with fine-grained information and optimizes the deep features with a multi-scale method. The proposed model inherits the advantages of U-Net and overcomes some of its drawbacks. Specifically, the input image is up-sampled to obtain fine-grained feature information as a supplement to the low-level feature information, and the deep feature optimization module (DFOM) is used to optimize the deep feature information. Experimental results show that the proposed model achieves better segmentation results on all three datasets (i.e., the RITE, Kvasir-SEG, and GlaS datasets).
In future research, there are some directions that can be followed up. For example, semantic information is divided into two types in this paper, while different processing methods can be adopted for semantic information of each stage in future work. There are also some limitations that need to be addressed. For example, we need to enhance the interpretability of the deep feature optimization module. Meanwhile, the method is a general approach that can be applied to other 2D medical image segmentation tasks. In this paper, our method has been validated on 2D images and extending it to 3D data would be a possible future work.
Acknowledgments
This work was supported by the Natural Science Foundation of Xinjiang Province (No. 2021B03001-4).
Conflict of interest
The authors declare that there are no conflicts of interest related to this work.
