FSOU-Net: Feature supplement and optimization U-Net for 2D medical image segmentation

Abstract

BACKGROUND:

The results of medical image segmentation can provide reliable evidence for clinical diagnosis and treatment. The U-Net proposed previously has been widely used in the field of medical image segmentation. Its encoder extracts semantic features of different scales at different stages, but does not carry out special processing for semantic features of each scale.

OBJECTIVE:

To improve the feature expression ability and segmentation performance of U-Net, we propose a feature supplement and optimization U-Net (FSOU-Net).

METHODS:

First, we put forward the view that semantic features of different scales should be treated differently. Based on this view, we classify the semantic features automatically extracted by the encoder into two categories: shallow semantic features and deep semantic features. Then, we propose the shallow feature supplement module (SFSM), which obtains fine-grained semantic features through up-sampling to supplement the shallow semantic information. Finally, we propose the deep feature optimization module (DFOM), which uses dilated convolutions with different receptive fields to obtain multi-scale features and then performs multi-scale feature fusion to optimize the deep semantic information.

RESULTS:

The proposed model is evaluated on three public medical image segmentation datasets, and the experimental results support the proposed idea. Its segmentation performance exceeds that of advanced medical image segmentation models. Compared with the baseline U-Net, the main metric, the Dice coefficient, is 0.75 percentage points higher on the RITE dataset, 2.3 higher on the Kvasir-SEG dataset, and 0.24 higher on the GlaS dataset.

CONCLUSIONS:

The proposed method can greatly improve the feature representation ability and segmentation performance of the model.

Keywords

Deep learning algorithm, medical image analysis, semantic segmentation, multi-scale, convolutional neural networks

1. Introduction

In clinical practice, medical image analysis [1] can provide physicians with digital and quantified medical information to help them make objective and accurate diagnoses. Medical image segmentation is important for medical image analysis and can be used for image-guided interventions, radiotherapy, or improved radiological diagnosis.

With the rapid development of artificial intelligence, especially deep learning [2], deep learning-based methods have achieved remarkable results in image segmentation. Compared with traditional machine learning and computer vision methods, they have great advantages in segmentation accuracy and speed. They can effectively help doctors locate lesion sites and evaluate the effect before and after treatment, thus greatly reducing the workload of medical personnel. Therefore, image segmentation methods based on deep learning have become the first choice in medical image segmentation.

In the field of medical image segmentation based on convolutional neural networks, most architectures are convolutional networks with an encoder-decoder structure. The advantage of this architecture is that the convolutional network has excellent feature extraction capability and good feature representation [3]. It does not require manual extraction of image features, but provides an end-to-end image segmentation solution. U-Net [4], designed by Ronneberger et al., was presented at the 2015 MICCAI conference and has been widely used due to its excellent performance in medical image segmentation. Networks built on U-Net also perform well in many respects. In general, the main directions for improving U-Net variants fall into three categories:

Adding rich connectivity, usually including dense connectivity (idea derived from DenseNet [5]) and residual connectivity (idea derived from ResNet [6]), to improve the segmentation performance by increasing the interaction and fusion between extracted features [7, 8, 9, 10, 11, 12, 13].

Using different ways of up-sampling and down-sampling to adjust the feature scale and then fuse the features between different stages to improve the segmentation performance by multi-scale fusion [8, 9, 10].

Simulating biological visual properties, designing attention mechanisms with different dimensions or different computation methods, and giving different attention weights to information in features to improve segmentation performance [14, 15, 16].

Of course, a particular network need not be improved in only one direction; it can be improved in several directions simultaneously. The above improvements to U-Net have proved effective. However, previous work did not process the feature information automatically extracted at different stages of the U-Net encoder in a targeted manner; the common practice is to apply the same approach to the features of every stage. It has been shown that the features extracted at different stages of U-Net have different semantics: shallow features tend to carry concrete semantic information, while deep features tend to carry abstract semantic information. From this perspective, this paper argues that the semantic features at different stages should be treated differently. Based on this point of view, this paper proposes a feature supplement and optimization U-Net (FSOU-Net) for medical image segmentation and verifies the validity of the proposed model on three public medical image segmentation datasets. Experimental results show that the proposed model achieves higher segmentation performance than U-Net and some of its variants.

Specifically, we make the following contributions:

In this paper, we propose the idea of processing the shallow semantic information automatically extracted by the U-Net encoder separately from the deep semantic information.

A multi-scale based shallow feature supplementation method is proposed to supplement the fine-grained semantic information of shallow features, which in turn improves the feature representation capability of the model.

A deep semantic feature optimization method based on multi-scale is proposed. Ablation experiments demonstrate that the method can work in concert with the shallow feature supplementation method to jointly improve the segmentation performance of the model.

The proposed FSOU-Net achieves better segmentation results than the compared models on three public datasets: RITE, Kvasir-SEG and GlaS.

In the rest of the paper, Section 2 mainly reviews some related works, and in Section 3 we propose a new image segmentation framework and discuss each module of our framework. Section 4 reports the experimental results of three public datasets in detail and provides a comparative analysis. Finally, Section 5 summarizes the content of the article.

2. Related work

2.1 Biomedical image segmentation

Medical image segmentation plays an important role in imaging medicine. It is an indispensable means of extracting quantitative information about specific tissues in medical images, and it is also a pre-processing step and prerequisite for visualization. Earlier medical image segmentation usually relied on manual or semi-manual techniques. Segmentation methods were usually based on edge detection or template matching [17, 18, 19, 20, 21, 22] or were learning-based [23, 24, 25]. The drawback of these methods lies in their use of hand-crafted features to obtain the segmentation results. On the one hand, obtaining segmentation results from hand-crafted features is inefficient and generalizes poorly. On the other hand, these tasks must be performed manually by experienced and specialized medical practitioners [26], and they are tedious and error-prone. To improve efficiency and reduce the workload of medical personnel in clinical scenarios, automatic segmentation methods with high accuracy are needed.

CNN-based deep learning models provide an end-to-end solution for image segmentation. Long et al. created FCN [27] based on classification networks, which pioneered the use of convolutional operations for semantic segmentation. The emergence of FCN provided ideas for solving segmentation problems using convolutional operations. SegNet [28] is a semantic segmentation network based on FCN, proposed by researchers at the University of Cambridge. Unlike FCN, instead of up-sampling with deconvolution operations, it uses the max-pooling indices (positions) recorded by the encoder to perform nonlinear up-sampling of the decoder's input. There are many other FCN-based constructs. Christ et al. [29] proposed cascaded FCN (CFCN), in which each model uses contextual features from the prediction map of the previous model. This method improves the accuracy of segmentation. To reduce the number of false positives in medical images caused by the imbalanced ratio of background and foreground pixels, Zhou et al. [30] proposed applying the focal loss technique to FCN.

CNN-based segmentation methods can automatically learn semantic features from images, and their segmentation performance is higher than that of traditional segmentation methods, so they are widely used in image segmentation of human tissues such as cells [31], pancreas [32], liver tumors [33], and glands [34].

2.2 U-Net and its variant architecture

Based on FCN, Ronneberger et al. [4] proposed U-Net with an encoder-decoder U-shaped structure, in which the importance of skip connections for medical image segmentation was analyzed and demonstrated. UNet++ [9] and UNet 3+ [10] incorporate a series of nested dense connections with a multi-scale fusion structure, which aims to make full use of full-scale fine-grained and coarse-grained semantics while bridging the semantic gap between encoder and decoder. Inspired by the deep residual model [6], RCNN [35] and U-Net [4], R2U-Net was proposed in [36]. This approach combines residual connectivity and recurrent convolution to replace the original submodules in U-Net. Both residual connectivity and recurrent convolution have the advantage of not increasing the number of network parameters. Deeper networks can improve performance, but they may be harder to train and may suffer from the degradation problem. To overcome these problems, ResUnet [11] uses residual units instead of plain neural units as the basic units and removes the cropping operation. This idea of residual fusion has been well validated in several models.

2.3 Multi-scale information

Scale is an important factor that determines what kind of information is extracted by convolution operation, and feature scales with different granularity will also affect the ability of feature information representation. In general, fine-grained features can contain more detailed information, while coarse-grained features usually contain more overall information. Multi-scale is usually represented by feature size or receptive field of convolution module.

The Inception family of networks [37, 38, 39, 40], designed for image classification, obtains multi-scale feature information by varying the size of the convolutional kernels. The Inception V1 model proposed in [37] uses convolutional kernels of sizes 1, 3, and 5, together with pooling, to extract features at multiple scales. Convolution layers with $1\times 1$ kernels are then used to aggregate the information. Finally, the features are concatenated and output. Subsequent networks in the Inception series add other components, but multi-scale feature fusion is retained throughout. Traditional neural networks learn fine-grained features in the front layers and coarse-grained features in the back layers, so the front layers often lack coarse-grained features. MSDNet [41] uses down-sampling to generate multi-scale features and maintains these features throughout the network, while also adopting the dense connection pattern of DenseNet. It surpasses ResNet and DenseNet. Varying the dilation rate changes the receptive field and can therefore also produce multi-scale features. The atrous spatial pyramid pooling (ASPP) structure in the DeepLabV2 [42] model uses dilated convolutions with different dilation rates, which enhances the model's ability to perceive segmentation targets at different scales. For a given input, convolutions with different dilation rates are applied in parallel to capture contextual information at multiple scales.

3. Method

In this section, after analyzing the characteristics of semantic information at different scales extracted by the U-Net encoder, we present the ideas of this paper and give the specific structure of FSOU-Net. We then introduce the shallow feature supplement module and the deep feature optimization module of the model.

3.1 Different scale semantic features of U-Net encoder

U-Net is a typical network with an encoder-decoder structure. It mainly consists of a U-shaped path, formed by the encoder and decoder together, and skip connections. The encoder extracts image features at different scales; it contains four sub-modules, each with two convolutional layers, and the different scales are obtained by down-sampling with pooling layers. In Fig. 1, we visualize the semantic features of each stage of the U-Net encoder. It can be seen that the first layer of features shows more detailed information, and the edges and contours of the target region are clearly visible. As the layers deepen, the fine-grained semantic information starts to become blurred. In the third layer, it is difficult to recognize the outline of the target area, but the general location of the target is still recognizable. In the fourth layer of semantic features, the number of target regions is no longer recognizable, while more information related to the image as a whole is present. In the last layer of features, the feature maps of originally different input images look basically the same to the human eye.

Figure 1.

Visualization results of U-Net features at different stages. (a) is the original image, and (b) to (f) are visualizations of the semantic features at different stages.

From the above analysis, we can see that the semantic features extracted by the encoder at different scales contain different semantic information. We believe that the shallow features contain more detail information and less overall information about the image, whereas the deep features contain more overall information and less detail information. Based on this, we categorize the first three layers of the semantic features extracted by the encoder as shallow semantic features and the last layer as deep semantic features, and process them differently. The resulting network, which we call FSOU-Net, has the architecture shown in Fig. 2.

Figure 2.

Illustration of the proposed FSOU-Net. The input image is up-sampled to obtain fine-grained features and then the shallow semantic features are supplemented, and the deep semantic features are optimized by DFOM.

3.2 Multi-scale shallow feature supplement module

The U-Net encoder implements the scale transformation by down-sampling the image through the maximum pooling layer. On the one hand, down-sampling can change the scale of semantic features, and on the other hand, some fine-grained semantic information is lost. This causes U-Net and its variant networks to perform poorly when segmenting fine-grained targets or target edges. This is shown in Fig. 3.

Figure 3.

(a) input image, (b) segmentation result of U-Net, (c) segmentation result of the proposed model, (d) ground truth.

Since the shallow semantic features mainly extract spatial information, such as target location and contour with fewer channels, this paper designs a multi-scale fusion method to supplement the encoder shallow semantic features.

As shown in Fig. 2, in order to extract more fine-grained semantic features, we first use up-sampling to increase the size of the input image. Then two classical 3 $\times$ 3 convolution layers are used to extract 64 channels of large-scale image features. Finally, pooling layers are used to down-sample the large-scale semantic features to the same sizes as the first three shallow semantic features, so as to supplement the shallow semantic features extracted by the original encoder. In recent years, some models [37] have instead used a convolutional layer with a stride of 2 to implement down-sampling. In our experiments, we found that this approach may improve the performance of the model, but may also make it unstable. This is because the parameters learned by a convolutional layer used for down-sampling are not necessarily applicable to all patterns in the test data, while a parameter-free pooling layer more easily preserves the original information of the semantic features. In order to preserve as much large-scale semantic information as possible, we did not use a convolutional layer to adjust the number of feature channels after down-sampling.
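The size bookkeeping behind this supplement path can be sketched in a few lines. This is an illustration only, not the authors' implementation: the up-sampling factor of 2 and the helper names `max_pool2d` and `pool_factors` are assumptions, since the text does not state them. The 128 $\times$ 128 input size and the three shallow stages follow the description above.

```python
def max_pool2d(feature, k):
    """Parameter-free max pooling with a k x k window and stride k on a 2D list."""
    h, w = len(feature), len(feature[0])
    return [
        [max(feature[i + di][j + dj] for di in range(k) for dj in range(k))
         for j in range(0, w - k + 1, k)]
        for i in range(0, h - k + 1, k)
    ]


def pool_factors(input_size=128, up_factor=2, n_shallow=3):
    """Pooling factor that maps the up-sampled features onto each shallow stage.

    up_factor=2 is an assumed value; the stage sizes assume that each encoder
    stage halves the resolution, starting from the input size.
    """
    large = input_size * up_factor
    stage_sizes = [input_size // (2 ** s) for s in range(n_shallow)]
    return {size: large // size for size in stage_sizes}


factors = pool_factors()          # {128: 2, 64: 4, 32: 8}

# Toy 4 x 4 feature map with values 0..15, pooled with a 2 x 2 window.
toy = [[r * 4 + c for c in range(4)] for r in range(4)]
pooled = max_pool2d(toy, 2)       # [[5, 7], [13, 15]]
```

Because the pooling has no learned parameters, the supplement path only adds the two convolution layers that follow the up-sampling.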

3.3 Multi-scale deep feature optimization module

The proposed deep feature optimization module is shown in Fig. 4. Inspired by ASPP, we designed a Pyramid Sampling Structure (PSS), which uses dilated convolution to enlarge the sampling receptive field. It is composed of four parallel dilated convolution branches. After each parallel convolution, batch normalization and an activation function are applied for nonlinear transformation. Finally, the outputs of the parallel dilated convolution branches are concatenated to fuse multi-scale features from different receptive fields.

ResNet [6] uses a shortcut connection mechanism to avoid the vanishing-gradient and exploding-gradient problems. It allowed neural networks to exceed a thousand layers for the first time. The process can be described by the following formulas:

$\displaystyle z=h(x)+F(x,w)$

$\displaystyle y=\sigma(z)$

where $x$ and $y$ are the input and output of the residual module, $F(x,w)$ is the residual function, $z$ is the intermediate result of the residual module, $\sigma(\cdot)$ is the activation function, and $h(x)$ is the identity mapping function, typically $h(x)=x$.

Figure 4.

Deep Feature Optimization Module (DFOM). Convolutions with dilation rates of 1, 3, 5, and 7 are adopted, so the structure can fuse features of different scales.

The connection structure of ResNet is optimized in Inception-ResNet [40]. Inspired by Inception-ResNet, we first add a $1\times 1$ convolution layer to the PSS, and then use the entire PSS as the residual function of the residual structure. After residual fusion with the initial input, the optimized output is obtained.
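To make the effect of the four dilation rates concrete, the short sketch below computes the extent covered by each dilated kernel and checks that the branches keep the spatial size, which is needed for both the concatenation and the residual addition. The 3 $\times$ 3 kernel size, the padding equal to the dilation rate, and the 16 $\times$ 16 feature map are assumptions made for illustration; only the rates 1, 3, 5, and 7 come from the text.

```python
def effective_kernel(k, dilation):
    """Extent covered by a k x k kernel whose taps are `dilation` pixels apart."""
    return k + (k - 1) * (dilation - 1)


def conv_output_size(n, k, dilation, padding, stride=1):
    """Spatial output size of a convolution (standard size formula)."""
    return (n + 2 * padding - dilation * (k - 1) - 1) // stride + 1


rates = [1, 3, 5, 7]

# Assumed 3 x 3 kernels: the four branches cover 3, 7, 11 and 15 pixels.
extents = [effective_kernel(3, d) for d in rates]

# With padding equal to the dilation rate, every branch preserves the size
# of an assumed 16 x 16 deep feature map.
sizes = [conv_output_size(16, 3, d, padding=d) for d in rates]
```

Under these assumptions the branches see progressively larger contexts while producing feature maps of identical size.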

4. Experiments

4.1 Setup

4.1.1 Datasets

In order to evaluate the performance and robustness of the proposed method, three public datasets of medical images from different domains were selected for the experiments: the RITE dataset [45], the Kvasir-SEG dataset [44], and the GlaS dataset [43]. The experimental results show that the proposed method achieves better segmentation results on all three datasets, with different degrees of improvement compared to the current more advanced CNN networks. In the next sections, we first present the model performance evaluation metrics used for the experiments and the details of the experimental implementation; then we describe the significance of the datasets for clinical diagnosis in practice and show the experimental results on each dataset. Finally, to show that the two proposed methods can work together, we perform ablation experiments on the GlaS dataset for the proposed model. Details of the datasets, the number of training and testing samples used and their availability are shown in Table 1.

Table 1
The medical datasets used in our experiments, with the number of training, validation, and test images. All of these datasets are publicly available.

Dataset	Images	Input size	Train	Valid	Test
RITE	40	128 $\times$ 128	16	4	20
GlaS	165	128 $\times$ 128	69	16	80
Kvasir-SEG	1000	128 $\times$ 128	800	100	100

4.1.2 Evaluation metrics

Image segmentation is essentially a pixel-level classification task. The segmentation performance of a model can therefore be evaluated by statistically analyzing whether each pixel of the segmentation result is classified correctly. This is also a common way of evaluating medical image segmentation models. The common metrics of semantic segmentation, IoU (Intersection over Union), Dice (Dice coefficient), Se (Sensitivity), and Acc (Accuracy), are used as the evaluation metrics of the model in the experiments. Among them, IoU and Dice are the main evaluation metrics for semantic segmentation. Their formulas are as follows:

$\displaystyle\textit{IoU}=\frac{TP}{FP+TP+FN}$

$\displaystyle\textit{Dice}=\frac{2TP}{FP+2TP+FN}$

$\displaystyle\textit{Se}=\frac{TP}{TP+FN}$

$\displaystyle\textit{Acc}=\frac{TP+TN}{TP+TN+FP+FN}$

Here, for each pixel of the segmentation result, TP (true positive) is a pixel predicted as positive that is also positive in the ground truth; TN (true negative) is a pixel predicted as negative that is also negative in the ground truth; FP (false positive) is a pixel predicted as positive that is negative in the ground truth; and FN (false negative) is a pixel predicted as negative that is positive in the ground truth. The higher the values of IoU, Dice, Se, and Acc, the better the segmentation performance of the algorithm.
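As a worked example of these four formulas, the sketch below counts TP, TN, FP, and FN on a pair of small binary masks and evaluates the metrics. The function name `segmentation_metrics` and the toy masks are illustrative, and the code assumes that at least one positive pixel is present so that no denominator is zero.

```python
def segmentation_metrics(pred, truth):
    """Compute IoU, Dice, Se and Acc from two binary masks (lists of 0/1 rows)."""
    tp = tn = fp = fn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                tp += 1
            elif p and not t:
                fp += 1
            elif not p and t:
                fn += 1
            else:
                tn += 1
    return {
        "iou": tp / (fp + tp + fn),
        "dice": 2 * tp / (fp + 2 * tp + fn),
        "se": tp / (tp + fn),
        "acc": (tp + tn) / (tp + tn + fp + fn),
    }


# Toy 2 x 3 masks: TP = 2, FP = 1, FN = 1, TN = 2.
pred = [[1, 1, 0],
        [0, 1, 0]]
truth = [[1, 0, 0],
         [0, 1, 1]]
metrics = segmentation_metrics(pred, truth)
# IoU = 2/4, Dice = 4/6, Se = 2/3, Acc = 4/6
```

On the same counts Dice is never lower than IoU, which is worth keeping in mind when comparing the two columns of the result tables.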

4.1.3 Loss function

The experiments use the Binary Cross Entropy Dice loss (BCEDiceLoss), a loss function that combines the Binary Cross Entropy loss (BCELoss) and the Dice loss (DiceLoss). BCELoss is a commonly used loss function for binary classification. Let $P_{i}$ denote the prediction result of the model, $G_{i}$ the ground truth of the image, $G_{i,j}$ the pixel value at position $(i,j)$ of the ground truth, $P_{i,j}$ the pixel value at position $(i,j)$ of the model output image, $h$ the height of the image, and $w$ the width of the image. BCELoss can then be described as follows:

$\displaystyle L_{\textit{BCE}}=\textit{BCELoss}(P_{i},G_{i})=-\sum\limits_{i=1}^{w}\sum\limits_{j=1}^{h}\left[G_{i,j}\log P_{i,j}+(1-G_{i,j})\log(1-P_{i,j})\right]$

V-Net [46] proposed DiceLoss, which aims to cope with the scenario of strong imbalance between positive and negative samples in semantic segmentation. The calculation of the Dice metric has been introduced in the previous section; in fact, it is intuitive to understand that the dice coefficient calculates the similarity between the target region of the segmentation output by the model and the target region in ground truth. Assuming that $I$ denotes the intersection of the target region in the predicted image and the target region in ground truth, and $U$ denotes the union of the target region in the predicted image and the target region in ground truth, the DiceLoss can be expressed as follows:

$\displaystyle L_{\textit{dice}}=\textit{DiceLoss}(I,U)=1-\frac{2I+\varepsilon}{U+I+\varepsilon}$

where $\varepsilon$ is the smoothing factor. BCEDiceLoss can be expressed as follows:

$\displaystyle L_{\textit{BCEDice}}=\alpha L_{\textit{BCE}}+\beta L_{\textit{dice}}$

where $\alpha$ and $\beta$ are the weights of the two losses, respectively, and are not zero. In the experiment, we set $\alpha$ to 0.5 and $\beta$ to 1.
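A minimal sketch of the combined loss on plain Python lists is given below. It is not the training code used in the experiments. It assumes the Dice term takes the standard form $1-(2I+\varepsilon)/(U+I+\varepsilon)$, which matches the Dice metric of Section 4.1.2 because the predicted and ground-truth regions together contain $U+I$ pixels. The clamping constant, the value of `eps`, and the function name `bce_dice_loss` are further assumptions.

```python
import math


def bce_dice_loss(pred, truth, alpha=0.5, beta=1.0, eps=1e-5):
    """alpha * BCE + beta * Dice loss for one image.

    pred holds probabilities in (0, 1); truth holds 0/1 labels.
    """
    p_flat = [min(max(p, 1e-7), 1 - 1e-7) for row in pred for p in row]  # avoid log(0)
    g_flat = [g for row in truth for g in row]

    # Binary cross entropy, summed over all pixels as in the formula above.
    bce = -sum(g * math.log(p) + (1 - g) * math.log(1 - p)
               for p, g in zip(p_flat, g_flat))

    # Soft intersection I; sum(pred) + sum(truth) plays the role of U + I.
    intersection = sum(p * g for p, g in zip(p_flat, g_flat))
    dice = 1 - (2 * intersection + eps) / (sum(p_flat) + sum(g_flat) + eps)

    return alpha * bce + beta * dice


truth = [[1, 0], [1, 0]]
good = bce_dice_loss([[0.9, 0.1], [0.8, 0.2]], truth)   # close to the ground truth
bad = bce_dice_loss([[0.2, 0.9], [0.1, 0.7]], truth)    # mostly wrong
single = bce_dice_loss([[0.5]], [[1]])                  # about 0.68
```

The weights default to the values reported above, $\alpha=0.5$ and $\beta=1$.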

4.1.4 Implementation details

The models were trained on an NVIDIA GeForce RTX 3060 GPU with 12 GB of memory. The implementation is based on PyTorch. For the three public datasets, we use Adam as the optimizer with weight decay set to 1e-4. The learning rate is set to 1e-3 during training, and we use simultaneous BN to train the model. The batch size is set to 16 for the Kvasir-SEG dataset and to 8 for the other two datasets. The best parameters are saved once the loss on the validation set no longer decreases, and these parameters are then used to predict the test data.
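One way to read this checkpointing rule is sketched below: keep the parameters from the epoch with the lowest validation loss and stop once the loss has failed to improve for a few epochs. The `patience` value and the function name `select_best_epoch` are assumptions, as the text does not say how "no longer decreasing" is decided, and the loss values are made up for illustration.

```python
def select_best_epoch(val_losses, patience=3):
    """Return (epoch, loss) of the best validation loss seen before stopping.

    Training is considered finished once the loss has not improved for
    `patience` consecutive epochs (an assumed stopping rule).
    """
    best_loss = float("inf")
    best_epoch = -1
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, waited = loss, epoch, 0   # save a checkpoint here
        else:
            waited += 1
            if waited >= patience:
                break
    return best_epoch, best_loss


# Hypothetical validation losses: the best value before stopping is at epoch 2.
best_epoch, best_loss = select_best_epoch([0.9, 0.7, 0.6, 0.65, 0.62, 0.61, 0.5])
```

In a real training loop the marked line would store the model weights, and those weights would be reloaded before predicting the test set.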

4.2 Results

4.2.1 Comparison on RITE

RITE (Retinal Images vessel Tree Extraction) is a database that enables comparative studies on the segmentation or classification of arteries and veins in retinal fundus images. It is built on the publicly available DRIVE database (Digital Retinal Images for Vessel Extraction). The angle, width, and length of retinal blood vessels can be extracted from the segmented image. This information can be used to diagnose various ophthalmic diseases, such as atherosclerotic choroidal neovascularization [18].

The RITE dataset [45] is a subset of the DRIVE dataset. It contains 40 images, and the original dataset divides them into a training dataset and a test dataset, each containing 20 images. On the basis of the original training data set, we divide the training data set and validation data set, and the partition results are shown in Table 1. We extended the training data by using horizontal and vertical flips during training. The results of the experiments are shown in Table 2.
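The flip augmentation mentioned above can be sketched as follows. The essential point is that each flip is applied to the image and its mask together so that the labels stay aligned. The helper names are illustrative, and images are represented as nested lists to keep the example free of dependencies.

```python
def hflip(image):
    """Horizontal flip: reverse every row."""
    return [row[::-1] for row in image]


def vflip(image):
    """Vertical flip: reverse the order of the rows."""
    return [row[:] for row in image[::-1]]


def augment(image, mask):
    """Return the original pair plus its horizontally and vertically flipped copies."""
    return [
        (image, mask),
        (hflip(image), hflip(mask)),
        (vflip(image), vflip(mask)),
    ]


image = [[1, 2],
         [3, 4]]
mask = [[0, 1],
        [0, 0]]
pairs = augment(image, mask)
# pairs[1] == ([[2, 1], [4, 3]], [[1, 0], [0, 0]])
# pairs[2] == ([[3, 4], [1, 2]], [[0, 0], [0, 1]])
```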

Table 2
Experimental results on the RITE dataset (mean of the 5-fold experiments)

Method	IoU (%)	Dice (%)	Se (%)	Acc (%)
FCN8s [27]	21.68	35.53	42.34	87.35
SegNet [28]	32.65	49.00	46.36	92.03
DeepLabv3+ [47]	22.98	37.31	38.35	89.32
PSPNet [48]	17.55	29.74	27.02	89.45
UNet++ [9]	37.72	54.59	47.87	93.39
EfficientUNet++ [49]	34.93	51.59	41.97	93.48
ResUnet [11]	49.71	66.19	65.18	94.48
ResUnet++ [12]	51.16	67.46	65.79	94.74
U-Net [4]	54.07	70.04	67.48	95.22
Ours	54.97	70.79	69.51	95.26

Figure 5.

Segmentation performance of different models. (a) Input images, (b) DeepLabv3+ [47], (c) UNet++ [9], (d) EfficientUNet++ [49], (e) ResUnet [11], (f) ResUnet++ [12], (g) U-Net [4], (h) FSOU-Net (ours), (i) Ground truth.

The RITE dataset is characterized by extremely fine branches in the target region, which leads to generally low segmentation accuracy for most models. The multiple down-sampling steps used by U-Net may lose fine-grained semantic information, which is one of the drawbacks of the U-Net model, and it tends to fail to recognize small targets. The proposed model supplements the semantic information of fine-grained targets in the shallow semantic features and thus obtains better segmentation performance. As expected, Table 2 shows that the proposed model outperforms the original U-Net: the main metrics IoU and Dice are 0.90 and 0.75 percentage points higher than those of U-Net, respectively, and the Se and Acc metrics are 2.03 and 0.04 percentage points higher, respectively.

Figure 5 visualizes the segmentation results of advanced semantic segmentation models on the three datasets. For image A of the RITE dataset, the proposed model segments finer vascular branches than the other networks, which we believe is due to the shallow feature supplement structure in FSOU-Net. In the segmentation results of image B, a trunk branch of the vascular network is completely segmented by the proposed FSOU-Net, while the other advanced models identify this trunk but do not segment it completely. Overall, the proposed model identifies the vessels more reliably and segments them better than the other advanced networks.

4.2.2 Comparison on Kvasir-SEG

Colorectal cancer is a common type of cancer in humans, and polyps are the precursor of colorectal cancer. The risk of polyps increases with age. Half of people who get colonoscopies at age 50 are found to have polyps. Early detection of the disease can greatly improve the survival rate of colon cancer patients, and polyp detection is particularly important [44].

Kvasir-SEG is an open-access dataset of gastrointestinal polyp images and corresponding segmentation masks, manually annotated by a medical doctor and then validated by an experienced gastroenterologist. The dataset has a total of 1000 original images [44]. After random shuffling, we divided them into a training set, a validation set, and a test set in a ratio of 8:1:1. We then compared the proposed model with some advanced models on this dataset; the experimental results are shown in Table 3.
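The 8:1:1 partition after random shuffling can be reproduced with a few lines of standard-library Python. The seed and the function name `split_811` are assumptions for illustration; the text does not say which random seed was used.

```python
import random


def split_811(n_items, seed=0):
    """Shuffle the indices 0..n_items-1 and split them 8:1:1."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)      # seeded for reproducibility
    n_train = n_items * 8 // 10
    n_valid = n_items // 10
    train = indices[:n_train]
    valid = indices[n_train:n_train + n_valid]
    test = indices[n_train + n_valid:]
    return train, valid, test


# For the 1000 Kvasir-SEG images this gives 800 / 100 / 100 indices.
train_idx, valid_idx, test_idx = split_811(1000)
```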

The boundaries of the target regions in the Kvasir-SEG dataset are difficult to determine, which places higher demands on the model. For image C of Fig. 5, the models in the experiment all detect the location of the target region, but FSOU-Net outperforms the other networks in segmenting the contours of the target region. Image D has four target regions that need to be segmented, and only FSOU-Net segments them all. The disadvantage is that two target regions are connected together, which is an area for improvement.

As can be seen from Table 3, our method achieves the highest score on all four metrics: IoU reaches 76.32%, Dice reaches 84.12%, Se reaches 86.54%, and Acc reaches 94.63%. Compared with the original U-Net, the proposed method improves IoU by 2.76 percentage points, Dice by 2.30, Se by 2.46, and Acc by 0.14.

Table 3
Experimental results on the Kvasir-SEG dataset

Method	IoU (%)	Dice (%)	Se (%)	Acc (%)
FCN8s [27]	69.42	79.08	82.55	93.38
SegNet [28]	70.26	79.65	83.96	93.41
DeepLabv3+ [47]	73.97	82.91	84.34	94.55
PSPNet [48]	69.42	79.17	80.13	93.66
UNet++ [9]	74.02	82.40	85.32	94.24
EfficientUNet++ [49]	71.51	79.99	80.68	93.55
ResUnet [11]	62.22	72.95	75.81	92.49
ResUnet++ [12]	63.71	73.36	78.40	92.30
U-Net [4]	73.56	81.82	84.08	94.49
Ours	76.32	84.12	86.54	94.63

4.2.3 Comparison on GlaS

Glands are an important histological structure in the human body. They exist in most organs and are the main structures that secrete carbohydrates and proteins. Malignant tumors arising from glandular epithelium are the most prevalent form of cancer. The morphology of glands is important for evaluating the degree of malignancy [43]. Accurate gland segmentation is essential to obtain reliable morphological statistics.

The GlaS dataset used in the experiments was provided by a team of pathologists at the University Hospitals Coventry and Warwickshire, UK. It is divided into 85 training images (37 benign and 48 malignant) and 80 test images (37 benign and 43 malignant). From the original training set, we split off a validation set; the resulting partition is shown in Table 1. We augmented the training data with horizontal and vertical flips during training. The experimental results are shown in Table 4.

Table 4
Experimental results on the GlaS dataset (mean of the 5-fold experiments)

Method	IoU (%)	Dice (%)	Se (%)	Acc (%)
FCN8s [27]	71.76	82.95	85.94	83.16
SegNet [28]	60.67	74.76	79.70	75.08
DeepLabv3+ [47]	76.32	86.10	86.83	86.53
PSPNet [48]	72.95	83.94	85.09	84.04
UNet++ [9]	80.37	88.62	88.15	89.10
EfficientUNet++ [49]	77.33	86.71	86.79	87.10
ResUnet [11]	84.26	91.14	91.31	91.35
ResUnet++ [12]	80.45	88.74	89.44	89.20
U-Net [4]	84.98	91.54	91.62	91.88
Ours	85.32	91.78	92.72	91.93

Compared with the other two datasets, the GlaS dataset has relatively more segmentation targets per image, and the number of targets varies greatly between images, so this dataset requires the model to segment the boundaries of the target regions well. As can be seen from the visualization in Fig. 5, the proposed model segments boundaries better than the original U-Net, owing to the supplementation of fine-grained semantic information in the shallow features and the optimization of the deep semantic features. The experimental results show that the main metrics IoU and Dice are improved by 0.34 and 0.24 percentage points, and Se and Acc by 1.10 and 0.05 percentage points, respectively.

4.2.4 Ablation study

To verify that the two proposed modules work together, we performed ablation experiments on the GlaS dataset for the shallow feature supplement module (SFSM) and the deep feature optimization module (DFOM). The experiments use 5-fold cross-validation, and the mean over the five folds is reported as the final result. The ablation results are shown in Table 5.

Table 5
Ablation results on GlaS dataset (mean of the 5-fold experiments)

Method	IoU (%)	Dice (%)	Se (%)	Acc (%)
U-Net	84.98	91.54	91.62	91.88
Ours (SFSM)	85.17	91.70	92.71	91.87
Ours (DFOM)	85.22	91.73	91.99	91.87
Ours (SFSM + DFOM)	85.32	91.78	92.72	91.93
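
The 5-fold protocol behind Tables 4 and 5 can be sketched as below. The shuffling seed, the helper names, and the assumption that the 85 GlaS training images are partitioned into five validation folds of 17 are illustrative only; the actual partition is the one given in Table 1. The per-fold scores in the usage lines are placeholders, not results from the experiments.

```python
import random
from statistics import mean


def k_fold_splits(n_samples, k=5, seed=0):
    """Yield (train_indices, val_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seed is an arbitrary choice
    folds = [indices[i::k] for i in range(k)]  # k nearly equal folds
    for i, val in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, val


def mean_over_folds(fold_scores):
    """Average each metric over a list of per-fold score dictionaries."""
    return {name: mean(s[name] for s in fold_scores) for name in fold_scores[0]}


# Hypothetical usage: 85 images give 5 splits of 68 training / 17 validation.
splits = list(k_fold_splits(85))
placeholder_scores = [{"Dice": 90.0}, {"Dice": 92.0}]  # made-up values
print(mean_over_folds(placeholder_scores))
```

Each image serves as validation data exactly once, and the reported figure is the arithmetic mean of the per-fold metrics.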

Figure 6.

Training process on the different datasets. The IoU curves of the three best-performing models on the validation set of each dataset are shown.

Table 5 shows that adding either shallow feature supplementation or deep feature optimization alone improves segmentation. Adding SFSM alone improves the main metrics IoU and Dice by 0.19 and 0.16 percentage points, respectively, while adding DFOM alone improves them by 0.24 and 0.19 percentage points. Adding both modules gives a larger gain than adding either one alone: IoU and Dice improve by 0.34 and 0.24 percentage points, respectively. This indicates that the two proposed methods work in concert to improve the segmentation performance of the network.
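
The gains quoted here are absolute differences from the U-Net row of Table 5. As a quick check, the short script below recomputes them from the table values, rounded to two decimals; the dictionary layout and function name are only for illustration.

```python
# Values copied from Table 5 (GlaS, mean of the 5-fold experiments).
table5 = {
    "U-Net": {"IoU": 84.98, "Dice": 91.54, "Se": 91.62, "Acc": 91.88},
    "SFSM": {"IoU": 85.17, "Dice": 91.70, "Se": 92.71, "Acc": 91.87},
    "DFOM": {"IoU": 85.22, "Dice": 91.73, "Se": 91.99, "Acc": 91.87},
    "SFSM + DFOM": {"IoU": 85.32, "Dice": 91.78, "Se": 92.72, "Acc": 91.93},
}


def gains_over_baseline(results, baseline="U-Net"):
    """Return per-metric differences (percentage points) from the baseline."""
    base = results[baseline]
    return {
        name: {metric: round(value - base[metric], 2)
               for metric, value in scores.items()}
        for name, scores in results.items()
        if name != baseline
    }


gains = gains_over_baseline(table5)
for name, delta in gains.items():
    print(name, delta)
```

The recomputed differences also show that Acc is essentially unchanged (a 0.01-point decrease) when either module is added alone, so the single-module gains are confined to IoU, Dice, and Se.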

4.2.5 Visualization of training process

As shown in Fig. 6, we compare the training process of the three best-performing models on each of the three datasets. The proposed model consistently maintains the highest validation IoU, and its IoU also converges quickly relative to the other top models.

5. Conclusion

In this paper, we propose FSOU-Net, an improved U-Net that supplements the shallow features automatically extracted by the network with fine-grained information and optimizes the deep features with a multi-scale method. The proposed model inherits the advantages of U-Net and overcomes some of its drawbacks. Specifically, the input image is up-sampled to obtain fine-grained feature information that supplements the low-level features, and the deep feature optimization module (DFOM) is used to optimize the deep features. Experimental results show that the proposed model achieves better segmentation results on all three datasets (RITE, Kvasir-SEG, and GlaS).
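
The multi-scale idea behind DFOM relies on dilated convolutions with different receptive fields. As a worked illustration, a dilated convolution with kernel size k and dilation rate d spans k + (k - 1)(d - 1) pixels along each axis. The sketch below evaluates this for a 3x3 kernel; the dilation rates are hypothetical and are not taken from the DFOM configuration.

```python
def effective_kernel_size(kernel_size, dilation):
    """Extent (in pixels, per axis) covered by one dilated convolution."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


# Hypothetical dilation rates, chosen only to illustrate the formula.
rates = (1, 2, 4)
extents = {d: effective_kernel_size(3, d) for d in rates}
print(extents)  # {1: 3, 2: 5, 4: 9}
```

Parallel branches with different rates therefore see contexts of different sizes at the same parameter cost per branch, which is what makes their fusion multi-scale.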

Several directions remain for future research. For example, this paper divides semantic information into two types, whereas future work could adopt a different processing method for the semantic information of each encoder stage. Some limitations also need to be addressed; for instance, the interpretability of the deep feature optimization module should be improved. Finally, although the method is a general approach that can be applied to other 2D medical image segmentation tasks, it has been validated only on 2D images, and extending it to 3D data is a possible direction for future work.

Footnotes

Acknowledgments

This work was supported by the Natural Science Foundation of Xinjiang Province (No. 2021B03001-4).

Conflict of interest

The authors declare that there are no conflicts of interest related to this work.

References

1. Shen, Suk. Deep learning in medical image analysis. Annual Review of Biomedical Engineering. 2017; 19: 221–248.

2. Litjens, Kooi, Bejnordi, Setio AAA, Ciompi, Ghafoorian, Sánchez, et al. A survey on deep learning in medical image analysis. Medical Image Analysis. 2017; 42: 60–88.

3. Liu, Song, Liu, Zhang. A review of deep-learning-based medical image segmentation methods. Sustainability. 2021; 13(3): 1224.

4. Ronneberger, Fischer, Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, Cham, 2015, October, pp. 234–241.

5. Huang, Liu, Van Der Maaten, Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4700–4708.

6. Zhang, Ren, Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.

7. Jin, Meng, Pham, Chen, Wei. DUNet: A deformable network for retinal vessel segmentation. Knowledge-Based Systems. 2019; 178: 149–162.

8. Zhang, Jin, Zhang. Mdu-net: Multi-scale densely connected u-net for biomedical image segmentation. arXiv preprint arXiv:1812.00352. 2018.

9. Zhou, Siddiquee MMR, Tajbakhsh, Liang. Unet++: A nested u-net architecture for medical image segmentation. In Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support, Springer, Cham, 2018, pp. 3–11.

10. Huang, Lin, Tong, Zhang, Iwamoto, et al. Unet 3+: A full-scale connected unet for medical image segmentation. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2020, May, pp. 1055–1059.

11. Zhang, Liu, Wang. Road extraction by deep residual u-net. IEEE Geoscience and Remote Sensing Letters. 2018; 15(5): 749–753.

12. Jha, Smedsrud, Riegler, Johansen, De Lange, Halvorsen, Johansen. Resunet++: An advanced architecture for medical image segmentation. In 2019 IEEE International Symposium on Multimedia (ISM), IEEE, 2019, December, pp. 225–2255.

13. Ibtehaz, Rahman. MultiResUNet: Rethinking the U-Net architecture for multimodal biomedical image segmentation. Neural Networks. 2020; 121: 74–87.

14. Oktay, Schlemper, Folgoc, Lee, Heinrich, Misawa, Rueckert, et al. Attention u-net: Learning where to look for the pancreas. arXiv preprint arXiv:1804.03999. 2018.

15. Roy, Navab, Wachinger. Concurrent spatial and channel ‘squeeze & excitation’ in fully convolutional networks. In International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, Cham, 2018, September, pp. 421–429.

16. Cheng, Tian. Fully convolutional attention network for biomedical image segmentation. Artificial Intelligence in Medicine. 2020; 107: 101899.

17. Lee, Hara, Fujita, Itoh, Ishigaki. Automated detection of pulmonary nodules in helical CT images based on an improved template-matching technique. IEEE Transactions on Medical Imaging. 2001; 20(7): 595–604.

18. Staal, Abràmoff, Niemeijer, Viergever, Van Ginneken. Ridge-based vessel segmentation in color images of the retina. IEEE Transactions on Medical Imaging. 2004; 23(4): 501–509.

19. Aquino, Gegúndez-Arias, Marín. Detecting the optic disc boundary in digital fundus images using morphological, edge detection, and feature extraction techniques. IEEE Transactions on Medical Imaging. 2010; 29(11): 1860–1869.

20. Chen, Smith, Ward, Najarian. Automated ventricular systems segmentation in brain CT images by combining low-level segmentation and high-level template matching. BMC Medical Informatics and Decision Making. 2009; 9(1): 1–14.

21. Tsai, Yezzi, Wells, Tempany, Tucker, Fan, Willsky, et al. A shape-based approach to the segmentation of medical imagery using level sets. IEEE Transactions on Medical Imaging. 2003; 22(2): 137–154.

22. Kanimozhi, Bindu. Brain MR image segmentation using self organizing map. Brain. 2013; 2(10): 261–274.

23. Aganj, Harisinghani, Weissleder, Fischl. Unsupervised medical image segmentation based on the local center of mass. Scientific Reports. 2018; 8(1): 1–8.

24. Abramoff, Alward, Greenlee, Shuba, Kim, Fingert, Kwon. Automated segmentation of the optic disc from stereo color photographs using physiologically plausible features. Investigative Ophthalmology & Visual Science. 2007; 48(4): 1665–1673.

25. Cheng, Liu, Yin, Wong DWK, Tan, Wong, et al. Superpixel classification based optic disc and optic cup segmentation for glaucoma screening. IEEE Transactions on Medical Imaging. 2013; 32(6): 1019–1032.

26. Hwang, Park. Accurate lung segmentation via network-wise training of convolutional networks. In Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support, Springer, Cham, 2017, pp. 92–99.

27. Long, Shelhamer, Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3431–3440.

28. Badrinarayanan, Kendall, Cipolla. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2017; 39(12): 2481–2495.

29. Christ, Elshaer MEA, Ettlinger, Tatavarty, Bickel, Bilic, Menze, et al. Automatic liver and lesion segmentation in CT using cascaded fully convolutional neural networks and 3D conditional random fields. In International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, Cham, 2016, October, pp. 415–423.

30. Zhou, Shen, Riga, Yang, Lee. Focal fcn: Towards small object segmentation with limited training data. arXiv preprint arXiv:1711.01506. 2017.

31. Raza SEA, Cheung, Epstein, Pelengaris, Khan, Rajpoot. Mimo-net: A multi-input multi-output convolutional neural network for cell segmentation in fluorescence microscopy images. In 2017 IEEE 14th International Symposium on Biomedical Imaging (ISBI 2017), IEEE, 2017, April, pp. 337–340.

32. Roth, Farag, Shin, Liu, Turkbey, Summers. Deeporgan: Multi-level deep convolutional networks for automated pancreas segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, Cham, 2015, October, pp. 556–564.

33. Christ, Ettlinger, Grün, Elshaera MEA, Lipkova, Schlecht, Menze, et al. Automatic liver and tumor segmentation of CT and MRI volumes using cascaded fully convolutional neural networks. arXiv preprint arXiv:1702.05970. 2017.

34. Graham, Chen, Gamper, Dou, Heng, Snead, Rajpoot, et al. MILD-Net: Minimal information loss dilated network for gland instance segmentation in colon histology images. Medical Image Analysis. 2019; 52: 199–211.

35. Girshick, Donahue, Darrell, Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 580–587.

36. Alom, Yakopcic, Hasan, Taha, Asari. Recurrent residual U-Net for medical image segmentation. Journal of Medical Imaging. 2019; 6(1): 014006.

37. Szegedy, Liu, Jia, Sermanet, Reed, Anguelov, Rabinovich, et al. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1–9.

38. Ioffe, Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, PMLR, 2015, June, pp. 448–456.

39. Szegedy, Vanhoucke, Ioffe, Shlens, Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2818–2826.

40. Szegedy, Ioffe, Vanhoucke, Alemi. Inception-v4, inception-resnet and the impact of residual connections on learning. In Thirty-First AAAI Conference on Artificial Intelligence, 2017, February.

41. Huang, Chen, van der Maaten, Weinberger. Multi-scale dense networks for resource efficient image classification. arXiv preprint arXiv:1703.09844. 2017.

42. Chen, Papandreou, Kokkinos, Murphy, Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2017; 40(4): 834–848.

43. Sirinukunwattana, Pluim, Chen, Heng, Guo, Rajpoot, et al. Gland segmentation in colon histology images: The glas challenge contest. Medical Image Analysis. 2017; 35: 489–502.

44. Jha, Smedsrud, Riegler, Halvorsen, de Lange, Johansen. Kvasir-seg: A segmented polyp dataset. In International Conference on Multimedia Modeling, Springer, Cham, 2020, January, pp. 451–462.

45. Abràmoff, Garvin. Automated separation of binary overlapping trees in low-contrast color retinal images. In International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, Berlin, Heidelberg, 2013, September, pp. 436–443.

46. Milletari, Navab, Ahmadi. V-net: Fully convolutional neural networks for volumetric medical image segmentation. In 2016 Fourth International Conference on 3D vision (3DV), IEEE, 2016, October, pp. 565–571.

47. Chen, Zhu, Papandreou, Schroff, Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 801–818.

48. Zhao, Shi, Wang, Jia. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2881–2890.

49. Silva, Menezes, Rodrigues, Silva, Pinto, Oliveira. Encoder-Decoder Architectures for Clinically Relevant Coronary Artery Segmentation. arXiv preprint arXiv:2106.11447. 2021.
