Abstract
BACKGROUND:
The rapid development of deep learning has greatly improved the performance of medical image segmentation, and segmentation networks based on convolutional neural networks (CNNs) and Transformers are now widely used in this field. However, because convolution has a restricted receptive field and the self-attention mechanism in Transformers is weak at extracting fine local information, networks that use a purely convolutional or purely Transformer backbone still perform poorly in medical image segmentation.
METHODS:
In this paper, we propose FDB-Net (Fusion Double Branch Network), a double-branch medical image segmentation network that combines a CNN and a Transformer. The feature extraction backbone is a dual-path encoder consisting of a CNN built from g^nConv blocks and a Transformer built from Varied-Size Window Attention (VWA) blocks, which gives the network a global receptive field as well as access to the local detail features of the target. We also propose a new feature fusion module (Deep Feature Fusion, DFF), which fuses the features from the two structurally different encoders during encoding, ensuring effective fusion of the global and local information in the image.
CONCLUSION:
Our model achieves advanced results in all three typical tasks of medical image segmentation, which fully validates the effectiveness of FDB-Net.
Keywords
Introduction
Medical imaging is important for the diagnosis, treatment, and evaluation of diseases, and its non-invasive nature enables the localization, characterization, and quantitative analysis of diseases. It comprises several modalities, such as computed tomography (CT), X-ray, magnetic resonance imaging (MRI), and positron emission tomography (PET) [1]. Medical image segmentation is crucial for computer-aided diagnosis (CAD), treatment planning, and pre-surgical evaluation, and it effectively assists doctors in observing and analyzing images from these modalities. The accuracy of medical image segmentation has a direct impact on the clinician’s ability to quantify the focal area, assess the patient’s condition, and select treatment.
Since U-Net [2], a medical image segmentation network based on a fully convolutional neural network, was proposed in 2015, U-Net variants have become the mainstay of medical image segmentation [3–6]. U-Net and its variants consist of an encoder and a decoder arranged in a U shape. Compared with earlier fully convolutional networks, U-Net greatly alleviates the degradation caused by excessive network depth, thanks to its skip connections, which recover the detailed information and multi-granularity features lost during encoder downsampling. Although CNN-based models have achieved excellent performance in medical image segmentation, they often have difficulty obtaining a global receptive field because of the locality of the convolution operation, and their segmentation performance tends to be limited by the inherent inductive bias of convolution [7]. The Transformer is one of the most widely used models [8] in natural language processing (NLP) and has achieved great success in paraphrase generation [9], text-to-speech synthesis [10], and speech recognition [11]. Inspired by this, in 2020 Alexey Dosovitskiy et al. at Google introduced the Transformer to computer vision by proposing the Vision Transformer [12], which slices the input image into 16x16 patches and maps these two-dimensional patches to a one-dimensional sequence of tokens, mimicking the input of an NLP task. The token sequence is then fed into the Transformer encoder for global attention computation, so that the image features have a global receptive field with long-range dependencies that traditional CNNs lack.
Computing attention between patches rather than between individual pixels avoids a large increase in computational and storage cost, and it allows the advantages of the Transformer in NLP to be applied directly to computer vision. However, although the Transformer can effectively model long-range dependencies between image patches (tokens), it lacks the ability to extract local features, and its performance relies on pre-training on large-scale datasets to reach or exceed the level of CNNs. Because medical images must be annotated by hand by doctors, large-scale annotated medical image datasets are difficult to obtain, so the Transformer often falls short of expectations in medical image segmentation. In recent years, hybrid architectures combining CNN and Transformer have become a research hotspot in medical image segmentation. For example, Jieneng Chen et al. proposed TransUNet [13], which first extracts local feature maps of the image with a CNN and feeds them, in place of the raw image, into the original Vision Transformer; the local details from the CNN are then recovered through skip connections in the decoder. Guoping Xu et al. proposed LeViT-UNet [14], which integrates the Transformer and CNN into one encoder and passes the multi-scale feature maps generated by both to the decoder through skip connections, effectively reusing the spatial information of the feature maps. However, existing hybrid models still have some issues in medical image segmentation tasks: (1) The models fail to fully leverage the feature extraction capabilities of the two different networks during the encoding stage, making it difficult to effectively integrate global and local features of different scales.
(2) The models cannot utilize the extracted multi-scale features effectively. Motivated by these observations, we propose FDB-Net, a novel double-branch medical image segmentation model that integrates both CNN and Transformer architectures. FDB-Net leverages the local feature extraction capability of the CNN along with the global attention mechanism of the Transformer to achieve precise medical image segmentation. The proposed model comprises an encoding module, a decoding module, and a feature fusion module. The encoding module includes a CNN built on g^nConv and a Transformer variant featuring Varied-Size Window Attention (VWA).
Additionally, to effectively fuse global and local information and enhance the recovery of coarse- and fine-grained details across all scales, our model incorporates the novel feature fusion module DFF. This module dynamically adjusts feature combinations within the skip connections by introducing both channel and spatial attention mechanisms. Furthermore, it utilizes channel attention to normalize feature details, thereby suppressing irrelevant information and improving segmentation accuracy, particularly in key regions. Experimental results across various segmentation tasks, including abdominal multi-organ segmentation, skin lesion segmentation, and cardiac segmentation, demonstrate that our model outperforms state-of-the-art methods. The primary contributions of this paper are as follows:
(1) Considering the different advantages of CNN and Transformer, a new double-branch network, FDB-Net, is proposed to efficiently extract feature information from medical images.
(2) A new feature fusion module (DFF) is proposed to fuse the global and local features extracted by the two branches at different scales during encoding and to suppress feature responses unrelated to the segmentation target, so that they do not interfere with the segmentation results.
(3) The experimental results show that the proposed FDB-Net is effective and outperforms existing medical image segmentation methods.
CNN-based segmentation networks
In recent years, with increasing computational power, CNNs have become a standard in medical image segmentation, particularly since the introduction of U-Net [2], which has emerged as a mainstream approach for medical image segmentation tasks. In contrast to earlier fully convolutional networks (FCNs) [15], U-Net proposes a U-shaped encoder-decoder structure combined with skip connections. This architecture recovers detailed information lost during downsampling, providing both coarse and fine feature maps to the segmentation model. It accelerates convergence and proves highly effective in medical image segmentation, achieving higher segmentation accuracy than almost all previous state-of-the-art models. U-Net has become a reference for subsequent model designs, including Res-UNet [16], Res-UNet++ [17], Dense-UNet [18], U-Net++ [19], and U-Net3+ [6]. These models enhance the extraction of medical image features by refining the U-Net structure, incorporating additional skip connections, and redesigning the encoder-decoder structure. However, the limited receptive field of a purely convolutional network restricts its ability to establish global long-range dependencies and fully utilize contextual information.
To address the restricted receptive field of CNNs, Yu and Koltun proposed dilated convolution [20]. This technique enlarges the effective kernel size by inserting gaps (zeros) between kernel elements, which enlarges the receptive field and captures multi-scale contextual information. However, dilated convolution still suffers from the gridding effect, leading to insufficient correlation of long-distance information. Kaiming He et al. introduced the Spatial Pyramid Pooling (SPP) structure [21], which leverages the classical spatial pyramid feature extraction method to extract features at various scales from the same image and thereby improves segmentation accuracy. The spatial pyramid can also be combined with atrous (dilated) convolution to create Atrous Spatial Pyramid Pooling (ASPP) [22], further improving segmentation. Despite extensive efforts to optimize CNNs, the inherent deficiencies of the convolution operation persist, and optimization methods often introduce new challenges of their own.
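To make the receptive-field argument concrete, the following minimal sketch computes the effective kernel size of a dilated convolution, k + (k - 1)(d - 1), and applies a one-dimensional dilated convolution without padding. It is an illustration written for this text, not code from any cited work; the function names are hypothetical.

```python
def effective_kernel_size(kernel_size, dilation):
    """Span covered by a kernel whose taps are `dilation` positions apart."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def dilated_conv1d(signal, kernel, dilation=1):
    """1D dilated cross-correlation with no padding (valid mode)."""
    span = effective_kernel_size(len(kernel), dilation)
    output = []
    for start in range(len(signal) - span + 1):
        # Each kernel tap reads the input `dilation` positions after the previous one.
        value = sum(w * signal[start + i * dilation] for i, w in enumerate(kernel))
        output.append(value)
    return output


if __name__ == "__main__":
    samples = [1, 2, 3, 4, 5, 6]
    print(effective_kernel_size(3, 2))
    print(dilated_conv1d(samples, [1, 1, 1], dilation=2))
```

With dilation 2, a three-tap kernel spans five input positions but reads only every second one. The skipped positions are the source of the gridding effect mentioned above.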
Vision transformers
Transformers have long been a fundamental approach in Natural Language Processing (NLP). Recently, researchers have extended their application to Computer Vision (CV) [12], yielding excellent results and establishing Transformers as another cornerstone of CV alongside CNNs. Unlike CNNs, the design of Transformers enables them to capture long-range dependencies and perform sequence-to-sequence (seq2seq) tasks. A Transformer consists of an encoder and a decoder, both incorporating positional encoding, multi-head attention, Layer Normalization (LN) [12], Feed Forward Networks (FFNs), and skip connections. The decoder is similar to the encoder but includes a masked multi-head attention mechanism in its input layer. Despite their advantages, Transformers still face challenges such as high computational complexity and reliance on large-scale pre-training, which limit their application in CV, particularly in medical image processing. Touvron et al. proposed DeiT [24], introducing a distillation strategy that achieves competitive results with far fewer training resources. X. Dong et al. proposed CSWin Transformer [25], which reduces computational cost by computing self-attention in horizontal and vertical stripes in parallel, forming a cross-shaped window and mitigating the high computational cost of global self-attention. In addition, it introduces Locally-enhanced Positional Encoding (LePE) to better handle local position information, further improving network performance. Z. Liu et al. introduced the Swin Transformer [26], which builds hierarchical, multi-scale feature maps in a manner akin to CNNs. Moreover, it computes self-attention locally within non-overlapping windows, which effectively reduces computational complexity. Hu Cao et al. proposed Swin-Unet [27], a pure Transformer architecture within the U-Net framework for medical image segmentation.
In this architecture, tokenized image patches are fed into a Swin Transformer-based U-shaped encoder-decoder with skip connections for local-global semantic feature learning. This approach yields outstanding performance in medical image segmentation tasks.
Hybrid CNN-transformer architecture
Pure CNN structures, when used as the backbone of medical image segmentation networks, often cannot model long-range, global relationships in the image because of the inherently small receptive field of convolution, which makes it difficult for CNNs to reach the expected performance on medical images. Pure Transformer networks are poor at fine-grained feature extraction and rely heavily on pre-training with large-scale datasets, which conflicts with the scarcity of medical images. Recently, networks combining CNN and Transformer have attracted growing attention, and TransUNet [13] is the first network to apply a hybrid CNN-Transformer architecture to medical image segmentation. Unlike pure Transformers, TransUNet first uses CNN layers to extract local features from the input medical image and then feeds the extracted feature maps into the Transformer, taking full advantage of the Transformer’s ability to capture long-range dependencies and of the symmetric encoder-decoder structure. TransUNet achieves higher segmentation accuracy than pure CNN or pure Transformer networks. However, it does not effectively integrate the multi-scale features extracted by the two structurally different encoders, which limits further improvement. LeViT-UNet [14], on the other hand, embeds LeViT [28], a fast-inference Transformer, as the encoder of a U-Net-like structure, which effectively balances the accuracy and efficiency of the Transformer block.
Furthermore, LeViT-UNet employs skip connections to transmit multi-scale feature maps from its Transformer and convolutional blocks to the decoder, effectively reusing spatial information. However, this network only performs simple concatenation of the two feature maps in the skip connections, resulting in some loss of detail. TransFuse [29] introduces a parallel dual-branch structure to enhance feature extraction, avoiding the local detail loss caused by redundant deep networks. While TransFuse proposes the BiFusion module to fuse the multi-level features extracted by the Transformer and CNN branches, it only applies pooling and concatenation, without treating global and local features separately. TransNorm [30] redesigns the skip connections using a special attention mechanism to normalize channel and spatial information on each decoding path. However, TransNorm does not use a windowed self-attention mechanism, which limits global feature extraction and segmentation performance on medical images of different resolutions. HiFormer [31] utilizes Swin Transformer [26] and a CNN as encoders, designing two multi-scale feature representations to capture both local and global features. It introduces the DLF module to fuse different features with a cross-scale attention mechanism, which ensures feature consistency but may waste some features because cross-scale attention is computed on only two feature levels.
Method
Our objective is to construct a dual-branch network that combines CNN and Transformer architectures to achieve accurate segmentation of high-resolution medical images. As illustrated in Fig. 1, the encoder adopts a parallel structure comprising a CNN branch built on g^nConv and a Transformer branch built on Varied-Size Window Attention (VWA) Transformer blocks. These two branches focus on extracting the local detailed features and the global features of the input image, respectively. Additionally, at each encoding level, we incorporate a DFF feature fusion module to recombine and adaptively calibrate the features extracted by the two branches. The processed feature information is then passed to the decoding module. The decoder recombines and recovers the feature information extracted during downsampling, ultimately producing a clear segmentation result. Each component is described in detail in the following sections.

Fig. 1. Overview of the proposed FDB-Net. FDB-Net includes CNN and Transformer branches for extracting local and global features, respectively, as well as DFF blocks for feature fusion.
As depicted in Fig. 1, the model’s encoder part comprises two main branches and a series of cascaded feature fusion blocks (DFF), with the CNN and Transformer serving as the two branches. The input images are fed into these branches separately to fully exploit the advantages of both architectures. The CNN excels in capturing local correlations and accurately extracting local features from the images, while the Transformer is utilized to extract global features due to its unique mechanism for establishing global dependencies. Subsequently, the extracted features from both branches are fused at different levels through the feature fusion module DFF to ensure the effectiveness of the extracted features in the encoding process.
CNN module
The proposed model uses a CNN as one of its feature extraction branches. The CNN branch consists of four levels that extract the local spatial information of the image at different scales. Inspired by g^nConv [27], we adopt a CNN module (HorBlock) built on recursive gated convolution (g^nConv) to extract deeper local spatial information. The structure of this part is shown in Fig. 2.

Fig. 2. Structure of the CNN block. The lower diagram shows the internal implementation of the g^nConv block.
Ordinary convolution has no explicit spatial interaction, and modules based on self-attention realize only second-order spatial interactions. In contrast, g^nConv realizes arbitrary-order spatial interactions efficiently through gated convolution and recursion, exploiting the deep spatial semantic information of the feature layer using only simple operations such as convolutions and fully connected layers.
First, the gated convolution module gConv is utilized to perform 1st order spatial interactions on the input features
where f_k is a set of depth-wise convolutional layers and g_k is used to match the feature dimensions between successive orders.
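As a rough illustration of how gating and recursion raise the interaction order, the sketch below applies a recursive gated convolution to a single-channel one-dimensional signal. It is a toy written for this text and not the g^nConv implementation used in the network: the scalar projections in `proj_weights`, the single shared `kernel`, and the identity used in place of the dimension-matching step are all simplifying assumptions.

```python
def conv1d_same(signal, kernel):
    """1D cross-correlation with zero padding; assumes an odd-length kernel."""
    half = len(kernel) // 2
    padded = [0.0] * half + list(signal) + [0.0] * half
    return [
        sum(w * padded[i + j] for j, w in enumerate(kernel))
        for i in range(len(signal))
    ]


def recursive_gated_conv(signal, order, kernel, proj_weights):
    """Toy recursive gated convolution on a single-channel 1D signal.

    proj_weights holds order + 1 scalars: one for the initial feature p_0
    and one for each gate input q_0 .. q_{order-1} (hypothetical stand-ins
    for the learned input projection).
    """
    p = [proj_weights[0] * x for x in signal]
    for k in range(order):
        q_k = [proj_weights[k + 1] * x for x in signal]
        gate = conv1d_same(q_k, kernel)  # stand-in for the convolution f_k
        # Element-wise gating: every recursion step multiplies in one more
        # spatially mixed copy of the input, raising the interaction order by one.
        p = [g * v for g, v in zip(gate, p)]
    return p


if __name__ == "__main__":
    print(recursive_gated_conv([1.0, 2.0, 3.0], 2, [1.0, 1.0, 1.0], [1.0, 1.0, 1.0]))
```

With an identity kernel such as `[0, 1, 0]`, each recursion step simply multiplies the running feature by the input once more, which makes the growth of the interaction order easy to see.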
In the Vision Transformer, the model divides the input feature map into fixed-size patches and performs self-attention between all patches. In real image processing tasks, however, images are usually of high resolution, and the model’s performance tends to degrade because of the excessive computational complexity. Swin Transformer restricts the attention computation to non-overlapping local windows and uses shifted windows to allow cross-window connections, which brings higher efficiency and lower computational complexity.
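The complexity gap can be made concrete with the cost expressions given in the Swin Transformer paper: for an h x w token map with C channels, global multi-head self-attention costs 4hwC^2 + 2(hw)^2 C, while attention within M x M windows costs 4hwC^2 + 2M^2 hwC. The sketch below evaluates both. The example sizes (a 56 x 56 token map, 96 channels, window size 7) are illustrative assumptions and not settings reported in this paper.

```python
def global_attention_flops(height, width, channels):
    """Cost of global self-attention: quadratic in the number of tokens."""
    tokens = height * width
    return 4 * tokens * channels ** 2 + 2 * tokens ** 2 * channels


def window_attention_flops(height, width, channels, window):
    """Cost of attention within non-overlapping windows: linear in the number of tokens."""
    tokens = height * width
    return 4 * tokens * channels ** 2 + 2 * window ** 2 * tokens * channels


if __name__ == "__main__":
    h = w = 56      # assumed token-map size
    c = 96          # assumed channel width
    m = 7           # assumed window size
    full = global_attention_flops(h, w, c)
    windowed = window_attention_flops(h, w, c, m)
    print(f"global: {full:,}  windowed: {windowed:,}  ratio: {full / windowed:.1f}")
```

The two expressions share the same projection term and differ only in the attention term, which shrinks by a factor of hw / M^2 when attention is restricted to windows.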

Fig. 3. Structure of the Transformer block, which consists of conditional position embedding (CPE) and the Varied-Size Window Attention (VWA) block. The lower diagram shows the internal details of the VWA block.
For an input
Then the VWA module uniformly samples M features from each varied-size window on K and V to obtain K_{w,v} and V_{w,v}, which are fed into the MHSA module for self-attention computation together with Q_w. However, the relative position information of q, k, and v deviates because the window positions are offset during feature sampling, so the VWA module uses conditional position embedding (CPE) [34] to resolve this deviation in relative spatial position, as shown in Equation 8:
where Z^{l-1} denotes the features from the previous Transformer block, and the CPE is implemented by a depth-wise convolutional layer with a kernel size equal to the default window size and a stride of one.
The primary challenge in a two-branch network lies in effectively integrating the local and global features extracted from both the CNN and Transformer branches while maintaining feature consistency. A common yet ineffective approach is to simply sum the features extracted by the CNN and Transformer branches at each hierarchical level and feed them directly to the decoder for upsampling. However, this method fails to ensure feature consistency, resulting in subpar performance and underutilization of the advantages of two-branch networks. To address this issue, we have proposed a novel feature fusion (DFF) module. This module incorporates a dual-attention mechanism to adaptively recalibrate and integrate these two different types of features, as illustrated in Figure 4.
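The recalibration idea can be sketched in a few lines. The code below is not the DFF module itself: it is a parameter-free toy that gates the global-branch features per channel, gates the local-branch features per spatial position, and sums the results. Which branch receives which attention, the pooling-plus-sigmoid gates, and the final summation are all assumptions made for illustration. Features are stored as nested lists indexed `[channel][position]`.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def channel_attention(features):
    """Scale each channel by a sigmoid gate computed from its global average."""
    gated = []
    for channel in features:
        gate = sigmoid(sum(channel) / len(channel))
        gated.append([gate * value for value in channel])
    return gated


def spatial_attention(features):
    """Scale each position by a sigmoid gate computed from the mean over channels."""
    num_channels = len(features)
    num_positions = len(features[0])
    gates = [
        sigmoid(sum(features[c][i] for c in range(num_channels)) / num_channels)
        for i in range(num_positions)
    ]
    return [
        [gates[i] * features[c][i] for i in range(num_positions)]
        for c in range(num_channels)
    ]


def fuse(local_features, global_features):
    """Recalibrate each branch with its own attention, then add element-wise."""
    local = spatial_attention(local_features)
    global_ = channel_attention(global_features)
    return [
        [a + b for a, b in zip(local_row, global_row)]
        for local_row, global_row in zip(local, global_)
    ]


if __name__ == "__main__":
    cnn_features = [[0.5, -1.0], [2.0, 0.0]]          # hypothetical local features
    transformer_features = [[1.0, 1.0], [-2.0, 3.0]]  # hypothetical global features
    print(fuse(cnn_features, transformer_features))
```

In contrast to plain summation, each branch is rescaled by gates computed from its own statistics before the two are combined, so weak channels or positions contribute less to the fused feature.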

Fig. 4. Internal structure of the DFF block. Local and global features are taken as inputs, and each is processed by its corresponding attention mechanism.
In medical image segmentation tasks, the features extracted by CNN networks usually contain more local details and localization information, and for the input features
where σ denotes the sigmoid activation function and
Following the U-Net-like encoder-decoder architecture, we design a decoder that mirrors the encoder and combines the features of different resolutions obtained during encoding into a unified mask feature. For the input

Fig. 5. Design of the decoder module.
Datasets
Implementation details
All code for the experiments in this paper was implemented in the PyTorch framework. Training and testing of the proposed network were performed on an NVIDIA RTX 3090 GPU with 24 GB of video memory. To ensure a fair comparison with other state-of-the-art methods, we uniformly set the size of the training images to 224x224.
Table 1. Training parameter details
During training, we used the AdamW optimizer and a cosine annealing schedule to dynamically adjust the learning rate, with a base learning rate of 1e-4 and a minimum learning rate of 1e-6. More parameter settings are shown in Table 1.
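For reference, the closed-form cosine annealing schedule can be sketched as below, using the base and minimum learning rates stated above. The schedule length `total_steps` is a hypothetical placeholder, since the number of training steps is not given in this section.

```python
import math


def cosine_annealing_lr(step, total_steps, base_lr=1e-4, min_lr=1e-6):
    """Learning rate after `step` of `total_steps` under cosine annealing."""
    progress = step / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    total_steps = 100  # hypothetical schedule length
    for step in (0, 25, 50, 75, 100):
        print(step, f"{cosine_annealing_lr(step, total_steps):.3e}")
```

The rate starts at the base value, decays slowly at first, fastest around the midpoint, and flattens out as it approaches the minimum.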
We utilize the Dice Similarity Coefficient (DSC) and the 95% Hausdorff Distance (HD95) to assess the segmentation performance of the model on the Synapse dataset. To validate the model’s performance in skin lesion segmentation, we employ several widely used metrics, including accuracy (ACC), sensitivity (SE), specificity (SP), and DSC. On the ACDC dataset, we primarily use DSC to evaluate the model’s performance. These metrics provide different information: accuracy measures the overall correctness of the prediction results, sensitivity and specificity evaluate the model’s ability to correctly predict positive and negative samples, respectively, and DSC quantifies the overlap between predicted segmentation results and ground truth labels. Additionally, the 95% Hausdorff Distance (HD95) provides information about the degree of matching between segmentation results and the edges of true samples, making the performance evaluation more comprehensive. By considering these metrics comprehensively, we can better assess the segmentation performance of the model on three different datasets.
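The metrics above can be written down compactly. The sketch below computes DSC, SE, SP, and ACC from flat binary masks, and HD95 from two sets of boundary points. It is an illustration and not the evaluation code used in the experiments. Two choices are assumptions: the percentile uses the nearest-rank rule, and HD95 is taken as the larger of the two directed 95th-percentile distances. Published toolkits differ on both points.

```python
import math


def confusion_counts(pred, truth):
    """Return (tp, fp, tn, fn) for two flat binary masks of equal length."""
    tp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 0)
    tn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 0)
    fn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 1)
    return tp, fp, tn, fn


def dice(tp, fp, tn, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 1.0  # two empty masks overlap perfectly


def sensitivity(tp, fp, tn, fn):
    return tp / (tp + fn) if (tp + fn) else 0.0


def specificity(tp, fp, tn, fn):
    return tn / (tn + fp) if (tn + fp) else 0.0


def accuracy(tp, fp, tn, fn):
    return (tp + tn) / (tp + fp + tn + fn)


def hd95(points_a, points_b):
    """95th-percentile Hausdorff distance between two non-empty point sets."""
    def directed(src, dst):
        # Distance from every source point to its nearest destination point.
        return [min(math.dist(s, d) for d in dst) for s in src]

    def percentile95(values):
        ordered = sorted(values)
        rank = max(1, math.ceil(0.95 * len(ordered)))  # nearest-rank rule
        return ordered[rank - 1]

    return max(percentile95(directed(points_a, points_b)),
               percentile95(directed(points_b, points_a)))


if __name__ == "__main__":
    counts = confusion_counts([1, 1, 0, 0, 1, 0], [1, 0, 0, 0, 1, 1])
    print(dice(*counts), sensitivity(*counts), specificity(*counts), accuracy(*counts))
```

DSC, SE, SP, and ACC depend only on the pixel counts, whereas HD95 depends on where the boundary points lie, which is why the two kinds of metric complement each other.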
Results of synapse multi-organ segmentation
On the Synapse Multi-Organ Segmentation dataset, we benchmarked the proposed model on two metrics, Dice Similarity Coefficient (DSC) and 95% Hausdorff Distance (HD95), and the comparison of this method with the best previous (SOTA) method is shown in Table 2.
Table 2. Comparison results of the proposed method on the Synapse dataset. Blue indicates the best result, and red indicates the second-best.
The results in Table 2 show that our proposed method achieves the best segmentation accuracy for multiple abdominal organs. Compared with the previous state-of-the-art (SOTA) method, our approach improves the key metric, the Dice coefficient, by 1.19%. In particular, the absolute Dice coefficient increases by +3.93%, +0.66%, +0.50%, +0.11%, +2.49%, and +0.02% for the gallbladder, left kidney, right kidney, liver, pancreas, and spleen, respectively, compared with the second-best method.
This shows that our model has an advantage in capturing the features of segmentation targets and locating their boundaries, and is clearly ahead of previous CNN-based or Transformer-based methods in several metrics. This is because the model effectively utilizes the global and local features provided by the double-branch network, which significantly improves its feature representation capability compared with a traditional single-branch network; this capability is particularly important for segmentation tasks. Compared with recent hybrid CNN-Transformer networks, our approach still has a considerable lead in the Dice coefficient and in the segmentation accuracy of multiple major organs, which demonstrates the advantage of our proposed DFF block over the simple concatenation of CNN and Transformer features used in the compared models. Our model adopts a parallel structure in the encoding path. To prevent the interference caused by direct interaction between the feature information of the two branches, we use the DFF block to fuse the two kinds of features at each downsampling level. The DFF block fuses the features adaptively through a dual-attention mechanism, suppresses the noise introduced by complex backgrounds, improves the robustness of the model, and enables accurate segmentation of fine and complex organ structures.
In addition, Figure 6 provides a visual evaluation. It can be seen that our method is more consistent with the ground truth in the boundary segmentation of multiple abdominal organs, and it delineates the boundaries of the gallbladder and the left and right kidneys more clearly than previous CNN-based or Transformer-based models. This is because our model provides both fine local position information and global image features, which is particularly important in organ segmentation of medical images.

Fig. 6. Visual comparison of the segmentation results of our method and other advanced methods on the Synapse dataset.
On the ISIC2017 dataset, we tested the proposed model on four indicators: Dice Similarity Coefficient (DSC), Specificity (SP), Sensitivity (SE) and Accuracy (ACC). The comparison between this method and the previous best (SOTA) method is shown in Table 3.
Table 3. Comparison results of the proposed method on the ISIC2017 dataset. Blue indicates the best result, and red indicates the second-best.
As can be seen from the results in Table 3, our proposed double-branch model outperforms the existing SOTA methods on almost all metrics in the skin lesion segmentation task. On the key Dice coefficient (F1-score), our method is 2.55% higher than the previous SOTA method, and it is also superior to the existing methods in sensitivity (SE) and accuracy (ACC).
Figure 7 shows some of the skin lesion segmentation results produced by our model. The model accurately locates the boundary of the lesion and provides smooth and accurate segmentation results. Compared with weaker pure CNN models such as U-Net and hybrid networks such as TransUNet, our method generalizes better, thanks to the dual branches that capture features at different levels and the dual-attention mechanism that fuses features in the skip connections.

Fig. 7. Visualized segmentation results of our method on the ISIC2017 dataset.
To further explore the generality of our method on different datasets, we also conducted tests on the ACDC dataset. The comparison of this method with the previous best (SOTA) methods is shown in Table 4.
Table 4. Comparison results of the proposed method on the ACDC dataset. Blue indicates the best result, and red indicates the second-best.
It can be seen that our method achieves the best or second-best results on the three main segmentation targets: left ventricle (LV), right ventricle (RV), and myocardium (MYO). In average DSC, our method is 0.43% ahead of the current best model, MT-UNet. In right ventricle (RV) segmentation accuracy, we are clearly ahead of the second-best method, which indicates that our method has advantages in localizing and segmenting complex structures.
In order to further explore the effectiveness of the proposed method, in this section, we design ablation experiments to verify the influence of different design choices on the model segmentation effect.
Impact of DFF block
In this section, we investigate whether the proposed DFF fusion module benefits the target medical image segmentation task, that is, whether using the dual-attention mechanism at each level of the decoding path to fuse the high-level and low-level features captured by the dual-branch network helps the model output more accurate segmentation results. We conducted verification on the Synapse Multi-Organ Segmentation and ISIC2017 datasets. For the model without the DFF block, we adopted the U-Net skip connection scheme, which directly concatenates the features extracted by the dual-branch network at each level of the decoding path. The experimental results in Tables 5 and 6 demonstrate the effectiveness of using the DFF block to fuse the dual-branch network in the encoding path. The network with the DFF block outperformed the one without it by a large margin across multiple metrics, indicating that the performance of the dual-branch model depends on the feature fusion approach. Our proposed DFF block is capable of fusing the different-scale information hidden in the target image, which leads to a significant improvement in network performance.
Table 5. Influence of the DFF block on segmentation results on the Synapse Multi-Organ Segmentation dataset
Influence of DFF block on segmentation results on ISIC2017 dataset
To investigate the effectiveness of gⁿConv and the impact of different CNN backbones on model performance, we conducted experiments on the Synapse Multi-Organ Segmentation and ISIC2017 datasets. The results in Tables 7 and 8 show that the CNN block built from gⁿConv achieved the best performance among the CNN backbones compared. Specifically, on the Synapse Multi-Organ Segmentation dataset, the gⁿConv-based model outperformed the second-best CNN backbone by 0.80% in DSC and 0.45% in HD, while on the ISIC2017 dataset it achieved a significant improvement in the crucial DSC metric, exceeding the second-best CNN backbone by 1.44%. These results indicate the potential of the gⁿConv block for exploring high-order latent features in images.
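As a reminder of what the HD metric measures, the sketch below computes the plain symmetric Hausdorff distance between two sets of boundary points. It is an illustration of the standard definition only: the text here does not state which variant was used (a 95th-percentile variant is common in segmentation benchmarks), and the function names and toy points are assumptions.

```python
import numpy as np

def directed_hausdorff(a, b):
    """Largest distance from a point in a to its nearest point in b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    diff = a[:, None, :] - b[None, :, :]        # pairwise differences, shape (len(a), len(b), dim)
    dists = np.sqrt((diff ** 2).sum(axis=-1))   # pairwise Euclidean distances
    return float(dists.min(axis=1).max())

def hausdorff(a, b):
    """Symmetric Hausdorff distance: the larger of the two directed distances."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))

# Toy boundary point sets (illustrative coordinates).
pred_pts = [[0.0, 0.0], [1.0, 0.0]]
true_pts = [[0.0, 0.0], [4.0, 0.0]]
print(hausdorff(pred_pts, true_pts))  # 3.0
```

Unlike DSC, a lower HD is better, since it reports the worst-case boundary error rather than the overlap.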
Segmentation results on the Synapse Multi-Organ Segmentation dataset using different CNN backbones
Segmentation results on the ISIC2017 dataset using different CNN backbones
In this section, we explore the impact of different Transformer backbones on the model's results in medical image segmentation tasks. Tables 9 and 10 compare a Swin Transformer backbone, which has no varied-size windows, with the Transformer backbone containing the VWA block used in this paper. The results indicate that, in complex medical image segmentation tasks, improving the model's global modeling ability substantially improves segmentation accuracy, and that a Transformer branch with a stronger ability to extract global semantic features effectively improves the final segmentation results.
Segmentation results on the Synapse Multi-Organ Segmentation dataset using different Transformer backbones
Segmentation results on the ISIC2017 dataset using different Transformer backbones
In this paper, we propose a double-branch hybrid network model for medical image segmentation. Its performance on three datasets (Synapse Multi-Organ, Skin Lesion, and ACDC) demonstrates the effectiveness of the proposed method. By combining the different strengths of CNN and Transformer in image feature extraction and integrating them through a feature fusion module, our approach makes full use of the global and local features extracted at different levels, improves the model's generalization and robustness, and mitigates, to a certain extent, the problems of noise interference and low segmentation accuracy in medical image segmentation. We hope that this work encourages further exploration of combining CNN and Transformer, makes fuller use of features at different scales, and provides a reference for related work.
Ethical approval
This paper complies with ethical standards. The research does not involve human participants and/or animals.
Data availability
The data that support the findings of this study are openly available: Synapse at https://www.synapse.org/#!Synapse:syn3193805/files, ISIC2017 at https://challenge.isic-archive.com/data/#2017, and ACDC at https://www.creatis.insa-lyon.fr/Challenge/acdc/databases.html.
Conflict of interest
The authors declare that there is no conflict of interest.
Funding
National Natural Science Foundation of China (No. 62266011); the Science and Technology Foundation of Guizhou Province (ZK[2022] 119).
Author contributions
Jiang Z. and Wu Y. were responsible for the writing of the main body and experimental sections of this paper. Huang L. created all the figures in this paper. Gu M. organized the experimental results for this paper.
Acknowledgments
This work was financially supported in part by the National Natural Science Foundation of China (No. 62266011) and the Science and Technology Foundation of Guizhou Province (ZK[2022] 119).
