MTC-Net: Multi-scale feature fusion network for medical image segmentation

Abstract

Image segmentation is critical in medical image processing for lesion detection, localisation, and subsequent diagnosis. Computer-aided diagnosis (CAD) now plays a significant role in improving diagnostic efficiency and accuracy. However, blurred lesion boundaries and irregular shapes make the segmentation task difficult. Because standard convolutional neural networks (CNNs) struggle to capture global contextual information, their segmentation results are often unsatisfactory. In this paper we propose a multi-scale feature fusion network (MTC-Net) that integrates depthwise separable convolution and self-attention modules in the encoder to achieve better local continuity of images and feature maps. In the decoder, a multi-branch multi-scale feature fusion module (MSFB) is used to improve the network’s feature extraction capability, and it is integrated with a global cooperative aggregation module (GCAM) to learn more contextual information and adaptively fuse multi-scale features. To develop rich hierarchical representations of irregular shapes, the proposed detail enhancement module (DEM) adaptively integrates local features with their global dependencies. To validate the effectiveness of the proposed network, we conducted extensive experiments on four public datasets covering skin, breast, thyroid and gastrointestinal tract images: ISIC2018, BUSI, TN3K and Kvasir-SEG. Comparison with recent methods also verifies the superiority of the proposed MTC-Net in terms of accuracy. Our code is available at https://github.com/gih23/MTC-Net.

Keywords

Medical image segmentation, multi-scale features, detail enhancement, feature fusion, deep learning

1 Introduction

Medical image segmentation is essential for assisting doctors with diagnosis. It can help with the analysis of dermoscopic images, polyp images, breast ultrasound images, and so forth. In comparison to traditional manual diagnosis methods, computer-aided diagnosis (CAD) automatically identifies diseases and segments lesion portions to achieve higher accuracy and assist doctors in making more objective and timely diagnoses.

With the ongoing development of deep learning, medical image segmentation methods based on convolutional neural networks (CNNs) have attracted a lot of attention in recent years. They avoid the limitations of earlier methods and greatly improve segmentation efficiency and accuracy. Well-known examples include fully convolutional networks (FCNs) [1] and U-Net [2], an encoder-decoder network that has demonstrated significant advantages in image segmentation tasks. UNet++ [3] improves the segmentation quality for objects of varying sizes by mitigating the unknown network depth through efficient fusion of U-Nets of varying depths. DoubleU-Net [4] uses two stacked U-Net architectures to generate more accurate segmentation masks, and ResU-Net [5] uses residual units to simplify deep network training. All of these methods demonstrate the utility of encoder-decoder networks in image segmentation tasks. However, due to limitations of the network structure, successive downsampling loses global context information, which is abundant and important for accurate localization of lesion boundaries, and this loss directly affects the segmentation accuracy. Furthermore, low-level feature maps are frequently overlooked, despite the fact that they contain a large amount of spatial information that is critical for the localization and delineation of lesion regions. As a result, significant advances are still required to improve the network’s ability to extract contextual information and learn long-range dependencies between image pixels.

In this paper, we introduce MTC-Net, a novel multi-scale feature fusion network for medical image segmentation. Unlike typical convolutional neural networks, we combine convolution and self-attention in the encoder to build multi-scale feature maps while attaining a global receptive field over the whole image. To improve the network’s feature extraction capability, a multi-branch multi-scale feature fusion module (MSFB) is added to the decoder and integrated with a global cooperative aggregation module (GCAM), which learns and adaptively fuses multi-scale features in the spatial domain. In addition, we propose a detail enhancement module (DEM) to improve the network’s ability to segment irregular boundaries. We experimented extensively on several medical image datasets and compared MTC-Net with recent models. MTC-Net achieves 93.17% Dice on the ISIC2018 dataset; 96.53% segmentation accuracy on the BUSI dataset; a 4.48% improvement in specificity (SP) on the TN3K dataset compared to U-Net; and 81.45% recall on the Kvasir-SEG dataset, which is higher than that of the compared models.

The contributions of this paper can be summarized as follows:

A detail enhancement module is proposed to capture the contextual information of local features through a spatial attention module and a channel attention module. The interdependence of channel mappings is exploited to enhance the contrast between class-related features and thereby improve the network’s ability to learn the irregular boundary features of lesion regions.

We include a multi-branch multi-scale feature fusion module in the decoder to extract richer hierarchical features through multi-branch convolutional layers, dilated convolution and hierarchical split blocks, producing more detailed feature representations. In addition, the global cooperative aggregation module fully fuses the feature information at different levels of the decoder through bilinear interpolation and a soft pooling operation to obtain more accurate segmentation results.

2 Related works

2.1 CNN-based methods

Convolutional neural networks are an important network framework in the field of deep learning, particularly in computer vision. CNNs began with LeNet in the 1990s, which addressed handwritten digit recognition with a basic architecture of convolutional, pooling, and fully-connected layers. Then came the AlexNet architecture, which has five convolutional layers and three fully-connected layers and uses ReLU instead of Sigmoid to accelerate SGD convergence. GoogLeNet replaced the final fully-connected layers with simple global average pooling, propelling convolutional neural network research to new heights. ResNet [6] implements a simple identity mapping through skip connections and has demonstrated advanced performance in classification and segmentation tasks.

CNNs are widely used in medical image classification and segmentation tasks due to their efficient and accurate performance. By fusing the features of all layers with an edge-guided module and modifying the skip-connection part of the traditional UNet, DEA-UNet addresses the problem of blurred lesion edges [7]. To better deal with complex images, [8] proposed CA-Unet++, which uses a channel module and an attention module to reduce the loss of feature information in the long skip connections and in the upsampling process, respectively, achieving better segmentation efficiency and accuracy. The MR-UNet model of [9] uses bilinear interpolation instead of deconvolution in the UNet model, and its encoder includes residual blocks to improve segmentation capability. [10] proposed an attention-based residual dense network that can better preserve image details by combining the attention mechanism and residual connections. Repeated downsampling in the encoder-decoder framework is prone to information redundancy, and [11] proposed an image change detection network based on a multi-feature self-attention fusion mechanism, which avoids this drawback by adding a multi-feature self-attention mechanism to obtain richer contextual information. Among skin lesion segmentation methods, [12] proposed a semi-automatic skin lesion segmentation method that overcomes the low contrast between the lesion region and the surrounding healthy skin by combining a fully convolutional network with multi-scale integration. Similarly, [13] proposed a dense pooling layer for accurate skin lesion boundary identification and improved segmentation accuracy. FCN (fully convolutional network) and U-Net are prone to parameter redundancy. [14] proposed an improved skin lesion segmentation model based on deformable 3D convolution and ResU-NeXt++, which is trained efficiently with standard backpropagation and improves the Jaccard index of skin lesion segmentation. Ms RED [15] improved the network’s learning capability by replacing the traditional convolutional layers in the encoder and decoder with a multi-channel feature fusion module. FAT-Net [16] introduced a feature adaptive transformer network based on the classical encoder-decoder architecture that can effectively capture long-range dependencies and global contextual information, and that improves feature fusion between adjacent levels by activating effective channels and suppressing irrelevant background noise. However, traditional CNN-based approaches remain limited in capturing rich global contextual information, and more progress is required.

2.2 Transformer

Since its introduction, the Transformer model has been widely used in natural language processing, computer vision, and many other fields. It is essentially an encoder-decoder architecture. The encoding component consists of multiple encoder layers, each composed of two sub-layers: a self-attention layer and a feed-forward network layer. Each decoder layer, on the other hand, is made up of three sub-layers: a multi-head self-attention layer, an additional multi-head attention layer that attends to the encoder output, and a fully connected layer. Many improved models have been developed in recent years as a result of the Transformer’s satisfactory results in image processing. The ViT model divides the image into multiple patches, each of which is mapped to a fixed dimension by a neural network layer and fed to a subsequent Transformer encoder; for image classification, a class token is added at the beginning of the sequence. When pre-trained on a sufficiently large amount of data, ViT can achieve good results. The TNT model [17] further subdivides the patches of the original ViT into sub-patches, treating each patch as a sentence and the elements obtained by further subdividing the patches as words. The model can find more similar sub-patch groups in the data, improving its learning ability and generalization. [18] proposed a Transformer model for pixel-level image tasks, incorporating the pyramid CNN concept into the Transformer and capturing finer-grained information by increasing the initial resolution and decreasing the resolution layer by layer, while reducing the running overhead. Swin Transformer [19] uses local attention: patches are grouped into windows, and attention between patches is only computed within each window to improve operational efficiency. Swin Transformer also proposes the shifted window method, which uses different window configurations in different layers; the window positions in the next layer are shifted horizontally and vertically by 2 patches so that patches that were in different windows in the previous layer can exchange information. The Transformer’s application in computer vision has gradually matured, making a significant contribution to computer vision and multimodal learning.

3 Proposed method

3.1 Network architectures

Figure 1 depicts the network’s overall design. For precise and dependable segmentation of lesion regions in medical images, we present a multi-scale feature fusion network (MTC-Net). To improve MTC-Net’s ability to segment lesions and deal with common issues such as blurred lesion boundaries, low contrast, and irregular shapes, we built a network with four key components: a fusion encoder, a global cooperative aggregation module (GCAM), a detail enhancement module (DEM), and a multi-branch multi-scale feature fusion module (MSFB). The fusion encoder extracts multi-scale, long-range dependent features from the input image while keeping rich global context information. The first two stages of the encoder extract high-resolution low-level features with a considerable amount of boundary information, which serves as a vital guide for the network’s boundary learning. The latter two stages extract extensive global contextual information. The decoder’s multi-branch multi-scale feature fusion module builds multi-scale feature representations through hierarchical split blocks, and GCAM fuses the features learned in each layer of the decoding stage to increase segmentation accuracy further. DEM accepts encoder features and improves inter-class discrimination and intra-class response via two parallel attention modules. Its output is subsequently sent to the decoder for improved segmentation performance. Together, these components allow the network to segment medical image lesions effectively.

Fig. 1

The proposed MTC-Net network framework.

3.2 Fusion encoder

As shown in Fig. 1, we use a fusion encoder based on MBConv blocks and self-attention that combines translation invariance, input-adaptive weighting, and a global receptive field. Shallow features mainly capture broad patterns such as texture, color, and orientation and are usually not global, whereas deep features represent object-specific information, which usually requires global context. Therefore, we build a four-stage encoder that uses MBConv, which performs depthwise convolution within inverted residual blocks, to capture spatial interactions in the first two stages. Both depthwise convolution and self-attention can be described as a weighted sum of the values within a pre-defined receptive field. The self-attention module (multi-head attention module and feed-forward network) is used in the latter two stages to provide additional contextual information through its broader receptive field.

3.3 Multi-branch and multi-scale feature fusion module

To enhance the ability of the network to extract richer hierarchical features, we combine the ideas from [20] and [21] and propose the multi-branch multi-scale feature fusion module (MSFB) shown in Fig. 2. Its internal structure consists of three components: a multi-branch convolutional layer, dilated convolution and a hierarchical split block. The multi-branch structure is composed of convolutional layers with kernels of different sizes. The split block divides the feature map into s groups, each with w channels. Only the first group is connected directly to the next layer. The second group of feature maps is sent to a 3×3 dilated convolution to extract features, and the output feature maps are then divided into two subgroups along the channel dimension. One subgroup is connected directly to the next layer, while the other is concatenated with the next group of input feature maps along the channel dimension. The concatenated feature maps are then processed by a 3×3 dilated convolution. This procedure is repeated until all of the input feature maps have been processed. Finally, the feature maps from all groups are concatenated and passed to a 1×1 convolution layer to reconstruct the features. The reconstructed feature maps provide a more detailed feature representation.
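The channel bookkeeping of the hierarchical split described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `hierarchical_split_plan` is hypothetical, each 3×3 dilated convolution is assumed to preserve its channel count, each intermediate output is assumed to be split in half (rounding down for the part sent onward to the output), and the last group is assumed to send all of its channels to the output. None of these details are specified in the text.

```python
def hierarchical_split_plan(s, w):
    """Return one (input_channels, channels_sent_to_output) pair per group.

    Assumptions (not taken from the paper): convolutions keep the channel
    count, intermediate outputs are split in half, and the last group sends
    everything to the output.
    """
    if s < 1 or w < 1:
        raise ValueError("s and w must be positive integers")
    plan = [(w, w)]  # group 1 bypasses the convolution entirely
    carry = 0        # channels handed over to the next group
    for group in range(2, s + 1):
        in_ch = w + carry  # this group's own channels plus the carried ones
        if group == s:
            out_ch, carry = in_ch, 0  # last group: nothing left to hand over
        else:
            out_ch = in_ch // 2       # subgroup connected to the next layer
            carry = in_ch - out_ch    # subgroup concatenated with the next group
        plan.append((in_ch, out_ch))
    return plan


if __name__ == "__main__":
    for index, (in_ch, out_ch) in enumerate(hierarchical_split_plan(4, 16), 1):
        print(f"group {index}: conv input {in_ch}, to output {out_ch}")
```

Under these assumptions the concatenated output always has s × w channels, so the final 1×1 convolution sees the same channel count as the module input.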

Fig. 2

Multi-branch and multi-scale feature fusion module structure.

3.4 Global cooperative aggregation module

We propose GCAM in order to fully fuse the feature information from different levels of the decoder and obtain more accurate segmentation results (see Fig. 1). The features at layer l of the decoder are denoted as D_l (l ∈ {1, 2, 3, 4}). They are resized to the size of D₁ using bilinear interpolation and a 3×3 convolution, and then concatenated to form the feature F. Soft pooling followed by a multilayer perceptron (MLP) is used to obtain a coefficient for each channel of F, giving the channel attention vector a ∈ [0, 1]. In addition, a 3×3 dilated convolution and a 1×1 convolution layer are used to generate the spatial attention coefficients b ∈ [0, 1]. GCAM’s output is given by: $Y_{GCAM} = F \cdot a \cdot b + F \cdot a$ (1)
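To make the gating in Eq. (1) concrete, the sketch below applies it to a small feature map stored as nested lists. It is a minimal illustration, not the authors' implementation: the function name `gcam_output` is hypothetical, `a` is assumed to hold one coefficient per channel and `b` one coefficient per spatial position, and the soft pooling, MLP and convolutions that produce `a` and `b` are not modelled.

```python
def gcam_output(F, a, b):
    """Apply Y = F*a*b + F*a elementwise.

    F: features as nested lists with shape [C][H][W]
    a: channel attention coefficients, length C (assumed broadcast over H, W)
    b: spatial attention coefficients, shape [H][W] (assumed broadcast over C)
    """
    return [
        [
            [F[c][h][w] * a[c] * b[h][w] + F[c][h][w] * a[c]
             for w in range(len(F[c][h]))]
            for h in range(len(F[c]))
        ]
        for c in range(len(F))
    ]


if __name__ == "__main__":
    F = [[[1.0, 2.0]], [[3.0, 4.0]]]  # C=2, H=1, W=2
    a = [0.5, 1.0]
    b = [[0.0, 1.0]]
    print(gcam_output(F, a, b))
```

Because Eq. (1) factors as F · a · (1 + b), the spatial coefficients only amplify channel-weighted features: a position with b = 0 keeps F · a unchanged rather than being zeroed out.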

3.5 Detail enhancement module

We introduce a detail enhancement module to improve the network’s ability to learn irregular boundary features in the lesion region (see Fig. 1). The DEM is made up of two modules: spatial attention and channel attention. The spatial attention module captures the broader contextual information of local features. Two convolutional layers, 1×3 and 3×1, generate the features Q_y and K_x, which extract edge information in the vertical and horizontal directions, respectively. A 1×1 convolution is also applied to the input features to obtain a new feature map V. The transpose of Q_y is matrix multiplied with K_x, so that similar spatial positions reinforce each other and dissimilar spatial positions suppress each other; a softmax layer is then applied to obtain the attention map describing the similarity of each position to all others. This attention map is matrix multiplied with the features V to obtain the enhanced features, and the final output features F_s of the spatial attention module are obtained by summing the result with the initial input features F. The channel attention module exploits the interdependence of channel mappings to enhance the contrast between class-dependent features. By matrix multiplying the input features F with their transpose, similar channels reinforce each other and different channels suppress each other. A softmax is then applied to the channel dependency matrix so that the network can better distinguish curve structures from the background. The output features F_c are obtained by multiplying this matrix with the features F and then summing the result with F. Finally, the attention feature maps generated in both dimensions are integrated by matrix multiplication and a convolution layer to produce the output of the DEM module. The final feature map is: $T = \mathrm{conv}(\mathrm{sab}(x)) \cdot \mathrm{cab}(x) + \mathrm{conv}(\mathrm{sab}(x))$ (2) where x is the input feature map, cab(x) is the output of the channel attention module, and sab(x) is the output of the spatial attention module.
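The spatial attention step of the DEM can be sketched in plain Python as follows. This is an illustrative sketch under assumptions rather than the authors' code: the function name `spatial_attention` is hypothetical, the feature maps are assumed to be flattened to C rows of N = H × W positions, the softmax is assumed to run over the second position index, and the 1×3, 3×1 and 1×1 convolutions that produce Q, K and V are taken as given inputs.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of numbers."""
    peak = max(row)
    exps = [math.exp(value - peak) for value in row]
    total = sum(exps)
    return [e / total for e in exps]


def spatial_attention(q, k, v, f):
    """Return (F_s, attention) for flattened feature maps.

    q, k: query/key features, lists of C_qk rows with N positions each
    v, f: value and input features, lists of C rows with N positions each
    """
    n = len(f[0])
    # energy[i][j] is entry (i, j) of Q^T K: similarity of positions i and j
    energy = [
        [sum(qc[i] * kc[j] for qc, kc in zip(q, k)) for j in range(n)]
        for i in range(n)
    ]
    attention = [softmax(row) for row in energy]
    # Each output position is an attention-weighted sum of V plus the input F
    f_s = [
        [sum(attention[i][j] * vc[j] for j in range(n)) + fc[i] for i in range(n)]
        for vc, fc in zip(v, f)
    ]
    return f_s, attention


if __name__ == "__main__":
    out, attn = spatial_attention([[0.0, 0.0]], [[0.0, 0.0]], [[2.0, 4.0]], [[1.0, 1.0]])
    print(out, attn)
```

With all-zero queries and keys the attention is uniform, so every position receives the mean of V plus its residual input, which is a convenient sanity check for the sketch.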

3.6 Loss functions

Lesion segmentation is a pixel-wise binary classification task. In this work we combine the binary cross-entropy (BCE) loss and the Dice coefficient (Dice) loss, which are defined as: $L_{BCE} = - \sum_{i} [(1 - G_{i}) \ln (1 - P_{i}) + G_{i} \ln (P_{i})]$ (3) $L_{Dice} = 1 - 2 \times \frac{\sum_{i} P_{i} G_{i}}{\sum_{i} P_{i} + \sum_{i} G_{i}}$ (4) where P_i is the predicted probability that the i-th pixel belongs to the segmented region and G_i is the ground truth value of the i-th pixel. The overall loss function is the sum of the two terms: $L_{loss} = L_{BCE} + L_{Dice}$ (5)
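Equations (3) to (5) can be written out directly for a flattened list of pixel probabilities. The sketch below is a plain-Python illustration, not the training code: the function name `bce_dice_loss` is hypothetical, and the clamping constant `eps` and the handling of an empty mask are assumptions added to avoid taking the logarithm of zero and dividing by zero.

```python
import math


def bce_dice_loss(probs, targets, eps=1e-7):
    """Sum of the BCE loss (Eq. 3) and the Dice loss (Eq. 4).

    probs:   predicted foreground probabilities P_i in [0, 1]
    targets: ground-truth labels G_i (0 or 1)
    eps:     clamping constant for the logarithms (an assumption)
    """
    bce = 0.0
    for p, g in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        bce -= (1 - g) * math.log(1 - p) + g * math.log(p)

    intersection = sum(p * g for p, g in zip(probs, targets))
    denominator = sum(probs) + sum(targets)
    # Assumption: an empty prediction on an empty mask counts as zero Dice loss
    dice = 1.0 - 2.0 * intersection / denominator if denominator > 0 else 0.0
    return bce + dice


if __name__ == "__main__":
    print(bce_dice_loss([0.9, 0.1], [1, 0]))
```

Note that Eq. (3) sums over pixels rather than averaging, so the BCE term grows with image size while the Dice term stays in [0, 1]; frameworks often average the BCE term instead, which changes the relative weighting of the two losses.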

4 Experiments

4.1 Datasets

We conducted extensive experiments on four public medical image segmentation datasets to validate the effectiveness of our method: the skin lesion image dataset ISIC 2018 [22], the breast ultrasound image dataset BUSI [23], the gastrointestinal polyp image dataset Kvasir-SEG [24], and the thyroid nodule dataset TN3K [25]. The ISIC 2018 dataset is a challenge dataset for "Skin lesion analysis for melanoma detection" sponsored by the International Symposium on Biomedical Imaging (ISBI) 2018. It consists of 2594 color RGB images, which we resampled to 224 × 224 pixels and randomly divided into a training set (70%), a validation set (10%) and a test set (20%). The BUSI dataset consists of 780 images in PNG format (133 normal images, 437 benign images, 210 malignant images), which we randomly divided into a training set, a validation set and a test set in a 7:2:1 ratio. The Kvasir-SEG dataset contains 1000 polyp images from the Kvasir dataset v2 and their corresponding ground truth. The resolution of the images in Kvasir-SEG ranges from 332×487 to 1920×1072 pixels; we resampled the dataset to 224 × 224 pixels and randomly divided it into a training set (70%), a validation set (10%), and a test set (20%). The TN3K dataset contains 3493 ultrasound images, which were also resampled to 224 × 224 pixels and randomly divided into a training set, a validation set and a test set in a 7:2:1 ratio.
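A random split of the kind described above can be sketched as follows. The function name `split_indices`, the fixed seed, and the rule of rounding the training and validation sizes down (with the remainder going to the test set) are assumptions made for illustration; the text does not state how fractional counts were handled or which seed was used.

```python
import random


def split_indices(n, ratios=(7, 1, 2), seed=0):
    """Shuffle range(n) and split it into train/validation/test index lists.

    ratios: integer train:val:test proportions, e.g. (7, 1, 2) or (7, 2, 1)
    seed:   seed for reproducibility (an assumption, not from the paper)
    """
    total = sum(ratios)
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_train = n * ratios[0] // total  # rounded down
    n_val = n * ratios[1] // total    # rounded down
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


if __name__ == "__main__":
    for name, n, ratios in [("ISIC 2018", 2594, (7, 1, 2)), ("BUSI", 780, (7, 2, 1))]:
        train, val, test = split_indices(n, ratios)
        print(name, len(train), len(val), len(test))
```

Integer ratios are used instead of floating-point fractions so that the subset sizes do not depend on rounding error.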

4.2 Experimental settings and evaluation criteria

Experimental settings. We implemented our network in PyTorch and trained it on a single NVIDIA GeForce RTX 2080Ti GPU, using adaptive moment estimation (Adam) as the optimizer with the initial learning rate set to 0.001 and the weight decay set to 0.00005, together with the CosineAnnealingWarmRestarts (T_0 = 10, T_mult = 2) learning rate strategy. The network was trained for 200 epochs, and the model with the highest Dice on the validation set was used for evaluation on the test set.
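To show what the warm-restart schedule looks like over 200 epochs, the sketch below evaluates the usual cosine annealing formula with restarts in plain Python. It is an illustration only: the function names are hypothetical, the learning rate is assumed to be updated once per epoch, and the minimum learning rate `eta_min` is assumed to be 0 since the text does not state it.

```python
import math


def lr_at_epoch(epoch, base_lr=1e-3, t0=10, t_mult=2, eta_min=0.0):
    """Cosine-annealed learning rate with warm restarts at a given epoch."""
    t_cur, t_i = epoch, t0
    while t_cur >= t_i:  # skip over completed cycles
        t_cur -= t_i
        t_i *= t_mult    # each cycle is t_mult times longer than the last
    return eta_min + (base_lr - eta_min) * (1 + math.cos(math.pi * t_cur / t_i)) / 2


def restart_epochs(total_epochs, t0=10, t_mult=2):
    """Epochs at which the learning rate jumps back to base_lr."""
    restarts, epoch, period = [], t0, t0
    while epoch < total_epochs:
        restarts.append(epoch)
        period *= t_mult
        epoch += period
    return restarts


if __name__ == "__main__":
    print(restart_epochs(200))
    print([round(lr_at_epoch(e), 6) for e in (0, 5, 9, 10, 20, 30)])
```

With T_0 = 10 and T_mult = 2 the cycles last 10, 20, 40 and 80 epochs, so a 200-epoch run ends partway through the fifth cycle.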

Evaluation Criteria. We used six widely used metrics to evaluate the accuracy of our proposed method: Dice coefficient (Dice), Accuracy (ACC), Sensitivity (SE), Specificity (SP), Precision, and intersection over union (IoU), defined as follows: $Dice = \frac{2 \cdot TP}{2 \cdot TP + FP + FN}$ (6) $ACC = \frac{TP + TN}{TP + TN + FP + FN}$ (7) $Sensitivity = \frac{TP}{TP + FN}$ (8) $Specificity = \frac{TN}{TN + FP}$ (9) $Precision = \frac{TP}{TP + FP}$ (10) $IoU = \frac{TP}{TP + FN + FP}$ (11) where TP, TN, FP, and FN are true positive, true negative, false positive, and false negative, respectively. TP is the number of correctly segmented lesion pixels, TN is the number of correctly segmented background pixels, FP is the number of background pixels that were incorrectly labeled as lesion pixels, and FN is the number of lesion pixels that were incorrectly labeled as background pixels.
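The metrics above can be computed from the four confusion counts as in the following sketch. The function name `segmentation_metrics` and the convention of returning 0.0 for an empty denominator are assumptions for illustration; accuracy is taken as the fraction of correctly classified pixels, (TP + TN) / (TP + TN + FP + FN).

```python
def _ratio(numerator, denominator):
    """Divide, returning 0.0 for an empty denominator (an assumed convention)."""
    return numerator / denominator if denominator else 0.0


def segmentation_metrics(tp, tn, fp, fn):
    """Compute the evaluation metrics from pixel-level confusion counts."""
    return {
        "Dice": _ratio(2 * tp, 2 * tp + fp + fn),
        "ACC": _ratio(tp + tn, tp + tn + fp + fn),
        "SE": _ratio(tp, tp + fn),        # sensitivity, also called recall
        "SP": _ratio(tn, tn + fp),        # specificity
        "Precision": _ratio(tp, tp + fp),
        "IoU": _ratio(tp, tp + fn + fp),
    }


if __name__ == "__main__":
    for name, value in segmentation_metrics(tp=8, tn=80, fp=2, fn=10).items():
        print(f"{name}: {value:.4f}")
```

When computed from the same counts, Dice and IoU are related by Dice = 2 · IoU / (1 + IoU), so the two metrics always rank a single prediction the same way; averages over images need not satisfy this identity.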

4.3 Comparison of experimental results on the ISIC 2018 dataset

For the ISIC 2018 dataset, we compared with 10 methods: three general-purpose segmentation networks, U-Net [2], AttU-Net [26], and DeepLabv3+ [27], and seven networks dedicated to skin lesion segmentation, CE-Net [28], BCDU-Net [29], CPF-Net [30], CA-Net [31], BA-Transformer [32], MFS-Net [33], and FAT-Net [16]. All experiments were run in the same computational environment, and no data augmentation was applied to the dataset. Table 1 reports the skin lesion segmentation results of the different methods on the ISIC 2018 dataset. Our method achieves the best Dice coefficient (Dice), Accuracy (ACC) and Specificity (SP), with 93.17%, 96.60% and 97.40%, respectively, and its Sensitivity (SE) of 91.78% is second only to CE-Net (92.22%). Figure 3 compares the segmentation results of the different networks on the ISIC 2018 dataset, where our network shows better segmentation quality.

Table 1
Performance of different networks for skin lesion segmentation on the ISIC 2018 dataset, with the best results shown in bold

Network	Dice	ACC	SE	SP
U-Net [2]	0.8680	0.9575	0.8951	0.9713
AttU-Net [26]	0.8846	0.9586	0.9032	0.9721
DeepLabv3+ [27]	0.8926	0.9590	0.9023	0.9733
CE-Net [28]	0.9307	0.9618	0.9222	0.9630
BCDU-Net [29]	0.9263	0.9594	0.9094	0.9730
CPF-Net [30]	0.9274	0.9584	0.9116	0.9628
CA-Net [31]	0.9261	0.9594	0.9082	0.9698
BA-Transformer [32]	0.9059	\	\	\
MFS-Net [33]	0.8952	\	\	\
FAT-Net [16]	0.9229	0.9614	0.9126	0.9681
Ours	0.9317	0.9660	0.9178	0.9740

Fig. 3

Comparison visualization on the ISIC2018 dataset.

4.4 Comparison of experimental results on the BUSI dataset

We compared our network to seven other approaches on the BUSI dataset: U-Net [2], AttU-Net [26], DeepLabv3+ [27], CE-Net [28], BCDU-Net [29], DA-Net [34], and MCF-Net [35]. Table 2 reports the experimental results. MTC-Net achieves the highest Dice coefficient (77.71%) and Accuracy (96.53%) among the compared networks, while DA-Net [34] and MCF-Net [35] obtain the highest Sensitivity (SE) and Specificity (SP), respectively. Compared with DeepLabv3+, MTC-Net improves Dice, ACC, and SP by 7.88, 1.41, and 0.35 percentage points, respectively. Figure 4 compares the segmentation results of the various networks on the BUSI dataset, where our network produces better segmentations.

Table 2
Breast nodule segmentation performance of different networks on the BUSI dataset, with the best results shown in bold

Network	Dice	ACC	SE	SP
U-Net [2]	0.6692	0.9435	0.6845	0.9671
AttU-Net [26]	0.6735	0.9533	0.5676	0.9893
DeepLabv3+ [27]	0.6983	0.9512	0.6788	0.9758
CE-Net [28]	0.7018	0.9541	0.7301	0.9803
BCDU-Net [29]	0.6335	0.9433	0.5893	0.9752
DA-Net [34]	0.6783	0.9365	0.8038	0.9801
MCF-Net [35]	0.7106	0.9568	0.7223	0.9924
Ours	0.7771	0.9653	0.6229	0.9793

Fig. 4

Comparison visualization on the BUSI dataset.

4.5 Comparison of experimental results on the Kvasir-SEG dataset

On the Kvasir-SEG dataset, we compared with six methods, namely U-Net [2], AttU-Net [26], CE-Net [28], BCDU-Net [29], DA-Net [34], and MCF-Net [35]. The detailed comparison results are shown in Table 3. Our network achieves the best results in all four evaluation metrics. Compared with DA-Net [34], our network improves the Dice coefficient (Dice), Accuracy (ACC), Sensitivity (SE) and Precision by 13.72, 1.95, 7.16 and 0.87 percentage points, respectively. The segmentation results of the above networks and our MTC-Net on the Kvasir-SEG dataset are shown in Fig. 5.

Table 3
Polyp segmentation performance of different networks on the Kvasir-SEG dataset, with the best results shown in bold

Network	Dice	ACC	SE	Precision
U-Net [2]	0.7112	0.9187	0.6088	0.9596
AttU-Net [26]	0.7389	0.9208	0.6822	0.9677
CE-Net [28]	0.7922	0.9369	0.7322	0.9771
BCDU-Net [29]	0.7088	0.8979	0.7558	0.9259
DA-Net [34]	0.7832	0.9324	0.7429	0.9697
MCF-Net [35]	0.7665	0.9265	0.7369	0.9721
Ours	0.9204	0.9519	0.8145	0.9784

Fig. 5

Comparison visualization on the Kvasir-SEG dataset.

4.6 Comparison of experimental results on the TN3K dataset

On the TN3K dataset, we compared with seven methods, namely U-Net [2], AttU-Net [26], BCDU-Net [29], SGUNet [36], TRFE [37], TRFE+ [38], and CPF-Net [30]. The detailed comparison results are shown in Table 4. Compared to U-Net, our network improves the Dice coefficient (Dice) and Specificity (SP) by 9.92% and 4.48%, respectively. The segmentation results of our MTC-Net on the TN3K dataset are shown in Fig. 6.

Table 4
The performance of different networks for thyroid nodule segmentation on the TN3K dataset, with the best results shown in bold

Network	Dice	ACC	SE	Precision
U-Net [2]	0.7951	0.9677	0.7867	0.9476
AttU-Net [26]	0.8032	0.9656	0.7912	0.9672
BCDU-Net [29]	0.7951	0.9682	0.7931	0.9639
SGUNet [36]	0.7955	0.9654	0.7936	\
TRFE [37]	0.8119	0.9671	0.8316	\
TRFE+ [38]	0.8330	0.9704	\	\
CPF-Net [30]	0.8270	0.9717	\	\
Ours	0.8943	0.9683	0.7593	0.9906

Fig. 6

Comparison visualization on the TN3K dataset.

4.7 Ablation study

To demonstrate the effectiveness of our proposed modules, we conducted ablation experiments on the ISIC2018 dataset. We compare the following models and present the results in Table 5:

Table 5
Ablation experiments for different parts of the proposed model on the ISIC2018 dataset

Network	Dice	IoU	ACC	SE	Precision
Baseline	0.9023	0.8242	0.9415	0.9043	0.9210
Model 1	0.9231	0.8277	0.9496	0.9120	0.9352
Model 2	0.9268	0.8315	0.9529	0.9148	0.9391
Model 3	0.9259	0.8291	0.9537	0.9157	0.9376
Model 4	0.9275	0.8324	0.9585	0.9169	0.9389
MTC-Net	0.9317	0.8337	0.9660	0.9178	0.9408

Baseline: Only the fusion encoder and the decoder network.

Model 1: The MSFB module is added to the baseline.

Model 2: The DEM module is added to the baseline.

Model 3: The GCAM module is added to the baseline.

Model 4: The GCAM module is added to Model 1.

The segmentation results before and after adding the detail enhancement module are shown in Fig. 7, which illustrates the contribution of the module to segmenting fuzzy lesion boundaries.

Fig. 7

Effect comparison of DEM on ISIC2018 dataset.

5 Conclusions

In this paper, we propose a multi-scale feature fusion network (MTC-Net) for medical image segmentation. Unlike traditional convolutional neural networks, we use a fusion encoder to extract multi-scale, long-range dependent features from the input image, preserving rich global contextual information. In addition, we use a multi-branch multi-scale feature fusion module (MSFB) to enhance the feature extraction capability of the network, and the global cooperative aggregation module (GCAM) adaptively fuses features from different scales to fully learn contextual information. Further detailed features are extracted using the detail enhancement module (DEM). Extensive experiments on four public medical image segmentation datasets (ISIC2018, BUSI, Kvasir-SEG, and TN3K) verify the effectiveness of the proposed MTC-Net in comparison with state-of-the-art segmentation methods.

Acknowledgment

This work was supported by the National Natural Science Foundation of China (Grant No. 61976126).

References

Long

Jonathan

, Shelhamer

Evan

and Darrell

Trevor

, Fully convolutional networks for semantic segmentation, In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3431–3440, 2015.

2.

Olaf Ronneberger, Philipp Fischer and Thomas Brox, U-net: Convolutional networks for biomedical image segmentation, In International Conference on Medical image computing and computer-assisted intervention, pages 234–241. Springer, 2015.

3.

Zongwei Zhou, Md Mahfuzur Rahman Siddiquee, Nima Tajbakhsh and Jianming Liang, Unet++: Redesigning skip connections to exploit multiscale features in image segmentation, IEEE Transactions on Medical Imaging 39(6) (2019), 1856–1867.

4.

Debesh Jha, Michael A. Riegler, Dag Johansen, Pål Halvorsen and Håvard D. Johansen, Doubleu-net: A deep convolutional neural network for medical image segmentation, In 2020 IEEE 33rd International symposium on computer-based medical systems (CBMS), pages 558–564. IEEE, 2020.

5.

Zhengxin Zhang, Qingjie Liu and Yunhong Wang, Road extraction by deep residual U-net, IEEE Geoscience and Remote Sensing Letters 15(5) (2018), 749–753.

6.

Kaiming He, Xiangyu Zhang, Shaoqing Ren and Jian Sun, Deep residual learning for image recognition, In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.

7.

Zhenhuan Zeng, Chaodong Fan, Leyi Xiao and Xilong Qu, Dea-Unet: a dense-edge-attention Unet architecture for medical image segmentation, Journal of Electronic Imaging 31(4) (2022), 043032.

8.

Bo Li, Fei Wu, Sikai Liu, Jinhong Tang, Guang Hui Li, Meiling Zhong and Xiaohui Guan, CA-Unet++: An improved structure for medical CT scanning based on the Unet++ architecture, International Journal of Intelligent Systems 37(11) (2022), 8814–8832.

9.

Zhengrong Wu, Like Zhao and Haixiao Zhang, Mr-Unet commodity semantic segmentation based on transfer learning, IEEE Access 9 (2021), 159447–159456.

10.

Zhiwei Qiao and Congcong Du, Rad-Unet: a residual, attention-based, dense Unet for CT sparse reconstruction, Journal of Digital Imaging (2022), 1–11.

11.

Gulnaz Alimjan, Yiliyaer Jiaermuhamaiti, Huxidan Jumahong, Shuangling Zhu and Pazilat Nurmamat, An image change detection algorithm based on multi-feature self-attention fusion mechanism Unet network, International Journal of Pattern Recognition and Artificial Intelligence 35(14) (2021), 2159049.

12.

Lei Bi, Jinman Kim, Euijoon Ahn, Dagan Feng and Michael Fulham, Semi-automatic skin lesion segmentation via fully convolutional networks, In 2017 IEEE 14th International Symposium on Biomedical Imaging (ISBI 2017), pages 561–564. IEEE, 2017.

13.

Ebrahim Nasr-Esfahani, Shima Rafiei, Mohammad H. Jafari, Nader Karimi, James S. Wrobel, Shadrokh Samavi and S.M. Reza Soroushmehr, Dense pooling layers in fully convolutional network for skin lesion segmentation, Computerized Medical Imaging and Graphics 78 (2019), 101658.

14.

Chen Zhao, Renjun Shuai, Li Ma, Wenjia Liu and Menglin Wu, Segmentation of dermoscopy images based on deformable 3D convolution and ResU-NeXt++, Medical & Biological Engineering & Computing 59(9) (2021), 1815–1832.

15.

Duwei Dai, Caixia Dong, Songhua Xu, Qingsen Yan, Zongfang Li, Chunyan Zhang and Nana Luo, Ms red: A novel multi-scale residual encoding and decoding network for skin lesion segmentation, Medical Image Analysis 75 (2022), 102293.

16.

Huisi Wu, Shihuai Chen, Guilian Chen, Wei Wang, Baiying Lei and Zhenkun Wen, Fat-net: Feature adaptive transformers for automated skin lesion segmentation, Medical Image Analysis 76 (2022), 102327.

17.

Kai Han, An Xiao, Enhua Wu, Jianyuan Guo, Chunjing Xu and Yunhe Wang, Transformer in transformer, Advances in Neural Information Processing Systems 34 (2021), 15908–15919.

18.

Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo and Ling Shao, Pyramid vision transformer: A versatile backbone for dense prediction without convolutions, In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 568–578, 2021.

19.

Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin and Baining Guo, Swin transformer: Hierarchical vision transformer using shifted windows, In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10012–10022, 2021.

20.

Pengcheng Yuan, Shufei Lin, Cheng Cui, Yuning Du, Ruoyu Guo, Dongliang He, Errui Ding and Shumin Han, Hs-resnet: Hierarchical-split block on convolutional neural network, arXiv preprint arXiv:2010.07621, 2020.

21.

Songtao Liu, Di Huang, et al., Receptive field block net for accurate and fast object detection, In Proceedings of the European conference on computer vision (ECCV), pages 385–400, 2018.

22.

Noel Codella, Veronica Rotemberg, Philipp Tschandl, M. Emre Celebi, Stephen Dusza, David Gutman, Brian Helba, Aadi Kalloo, Konstantinos Liopyris, Michael Marchetti, et al., Skin lesion analysis toward melanoma detection 2018: A challenge hosted by the international skin imaging collaboration (isic), arXiv preprint arXiv:1902.03368, 2019.

23.

Walid Al-Dhabyani, Mohammed Gomaa, Hussien Khaled and Aly Fahmy, Dataset of breast ultrasound images, Data in Brief 28 (2020), 104863.

24.

Debesh Jha, Pia H. Smedsrud, Michael A. Riegler, Pål Halvorsen, Thomas de Lange, Dag Johansen and Håvard D. Johansen, Kvasir-seg: A segmented polyp dataset, In MultiMedia Modeling: 26th International Conference, MMM 2020, Daejeon, South Korea, January 5–8, 2020, Proceedings, Part II 26, pages 451–462. Springer, 2020.

25.

Bailin Yang, Meiying Yan, Zaoming Yan, Changrui Zhu, Dong Xu and Fangfang Dong, Segmentation and classification of thyroid follicular neoplasm using cascaded convolutional neural network, Physics in Medicine & Biology 65(24) (2020), 245040.

26.

Ozan Oktay, Jo Schlemper, Loic Le Folgoc, Matthew Lee, Mattias Heinrich, Kazunari Misawa, Kensaku Mori, Steven McDonagh, Nils Y. Hammerla, Bernhard Kainz, et al., Attention u-net: Learning where to look for the pancreas, arXiv preprint arXiv:1804.03999, 2018.

27.

Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff and Hartwig Adam, Encoder-decoder with atrous separable convolution for semantic image segmentation, In Proceedings of the European conference on computer vision (ECCV), pages 801–818, 2018.

28.

Zaiwang Gu, Jun Cheng, Huazhu Fu, Kang Zhou, Huaying Hao, Yitian Zhao, Tianyang Zhang, Shenghua Gao and Jiang Liu, Ce-net: Context encoder network for 2d medical image segmentation, IEEE Transactions on Medical Imaging 38(10) (2019), 2281–2292.

29.

Reza Azad, Maryam Asadi-Aghbolaghi, Mahmood Fathy and Sergio Escalera, Bi-directional convlstm u-net with densley connected convolutions, In Proceedings of the IEEE/CVF international conference on computer vision workshops, pages 0–0, 2019.

30.

Shuanglang Feng, Heming Zhao, Fei Shi, Xuena Cheng, Meng Wang, Yuhui Ma, Dehui Xiang, Weifang Zhu and Xinjian Chen, Cpfnet: Context pyramid fusion network for medical image segmentation, IEEE Transactions on Medical Imaging 39(10) (2020), 3008–3018.

31.

Ran Gu, Guotai Wang, Tao Song, Rui Huang, Michael Aertsen, Jan Deprest, Sébastien Ourselin, Tom Vercauteren and Shaoting Zhang, Ca-net: Comprehensive attention convolutional neural networks for explainable medical image segmentation, IEEE Transactions on Medical Imaging 40(2) (2020), 699–711.

32.

Jiacheng Wang, Lan Wei, Liansheng Wang, Qichao Zhou, Lei Zhu and Jing Qin, Boundary-aware transformers for skin lesion segmentation, In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 206–216. Springer, 2021.

33.

Hritam Basak, Rohit Kundu and Ram Sarkar, Mfsnet: A multi focus segmentation network for skin lesion segmentation, Pattern Recognition 128 (2022), 108673.

34.

Jun Fu, Jing Liu, Haijie Tian, Yong Li, Yongjun Bao, Zhiwei Fang and Hanqing Lu, Dual attention network for scene segmentation, In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3146–3154, 2019.

35.

Lizhu Liu, Yexin Liu, Jian Zhou, Cheng Guo and Huigao Duan, A novel mcf-net: Multi-level context fusion network for 2d medical image segmentation, Computer Methods and Programs in Biomedicine 226 (2022), 107160.

36.

Huitong Pan, Quan Zhou and Longin Jan Latecki, SGUNET: Semantic guided UNET for thyroid nodule segmentation, In 2021 IEEE 18th International Symposium on Biomedical Imaging (ISBI), pages 630–634. IEEE, 2021.

37.

Haifan Gong, Guanqi Chen, Ranran Wang, Xiang Xie, Mingzhi Mao, Yizhou Yu, Fei Chen and Guanbin Li, Multi-task learning for thyroid nodule segmentation with thyroid region prior, In 2021 IEEE 18th International Symposium on Biomedical Imaging (ISBI), pages 257–261. IEEE, 2021.

38.

Haifan Gong, Jiaxin Chen, Guanqi Chen, Haofeng Li, Guanbin Li and Fei Chen, Thyroid region prior guided attention for ultrasound segmentation of thyroid nodules, Computers in Biology and Medicine 155 (2023), 106389.