Abstract
BACKGROUND:
UNet has achieved great success in medical image segmentation. However, due to the inherent locality of convolution operations, UNet is deficient in capturing global features and long-range dependencies of polyps, resulting in less accurate polyp recognition for complex morphologies and backgrounds. Transformers, with their sequential operations, are better at perceiving global features but lack low-level details, leading to limited localization ability. If the advantages of both architectures can be effectively combined, the accuracy of polyp segmentation can be further improved.
METHODS:
In this paper, we propose an attention and convolution-augmented UNet-Transformer Network (ACU-TransNet) for polyp segmentation. This network is composed of the comprehensive attention UNet and the Transformer head, sequentially connected by the bridge layer. On the one hand, the comprehensive attention UNet enhances specific feature extraction through deformable convolution and channel attention in the first layer of the encoder and achieves more accurate shape extraction through spatial attention and channel attention in the decoder. On the other hand, the Transformer head supplements fine-grained information through convolutional attention and acquires hierarchical global characteristics from the feature maps.
RESULTS:
ACU-TransNet could comprehensively learn dataset features and enhance colonoscopy interpretability for polyp detection.
CONCLUSION:
Experimental results on the CVC-ClinicDB and Kvasir-SEG datasets demonstrate that ACU-TransNet outperforms existing state-of-the-art methods, showcasing its robustness.
Introduction
Endoscopic detection plays a vital role in pre-cancer screening and surgical intervention. Due to the diverse sizes, colors, and shapes of polyps, as well as their often low-contrast boundaries, the misdiagnosis rates in manual detection range from 14% to 30% [1]. Therefore, there is a clear demand for deep learning-based polyp segmentation techniques in clinical practice. These techniques not only assist physicians in the comprehensive evaluation of polyp size, shape, and quantity but also ensure more precise quantification.
Research presented at Digestive Disease Week USA (DDW2018) suggests that AI-assisted polyp detection can match the expertise of colonoscopy professionals. Since the introduction of UNet for medical image segmentation tasks, convolutional neural networks (CNNs) derived from UNet have become the primary methods in the field of polyp image segmentation. One method to enhance the polyp segmentation performance of UNet involves integrating multi-scale features and attention mechanisms [2]. Fan et al. [3] present a parallel architecture employing a reverse attention mechanism, utilizing an RA module to establish the relationship between region and boundary cues, thereby exploiting boundary information and improving segmentation accuracy. Numerous studies have emphasized the significance of spatial information in polyps. Research indicates that UNet utilizes skip connection operations to capture spatial features, and introducing convolutional operations within these skip connections can enhance the accuracy of polyp segmentation [4]. Yin et al. [5] aim to refine model generalization by reinforcing its spatial features, focusing particularly on the spatial contextual relationships within and among overlapping polyp images. Xu et al. [6] embed a large kernel convolution in the primary encoder through the PFC strategy, ensuring the preservation of low-level features and offsetting any reduction in spatial information. However, CNNs often suffer from segmentation errors, particularly when there are substantial variations in texture, shape, and size among polyps across different patients, due to their inability to capture long-range dependencies and global information within images.
Due to its unique self-attention mechanism, the Transformer [7] demonstrates significant advantages in capturing global features. This characteristic gives the Transformer immense potential in image-processing tasks, particularly in scenarios requiring a comprehensive understanding and analysis of global image information. However, the pure Transformer architecture performs poorly in medical image segmentation tasks due to its lack of spatial contextual information. Given the complexity and variability of image features involved in polyp segmentation tasks, such as texture, shape, and size, the integration of Transformers with CNNs has become the current mainstream research direction. For instance, TransNetR [8] employs a pre-trained ResNet50 in the encoder section to capture local detailed information of polyps and fuses the extracted features from the Transformer decoder at multiple levels. By leveraging both local details and global contextual information, it enables more precise localization and segmentation of polyps. Wang et al. [9] combined the advantages of CNN and Transformer through feature fusion between the encoder and decoder, thereby enhancing the performance of polyp segmentation. The introduction of CNN as a decoder in the Transformer architecture can also overcome the limitations in modeling local features of polyps [10]. Furthermore, Zhang et al. [11] incorporated the Transformer into the bottleneck layer of the U-shaped CNN to better extract spatial and semantic-related information of polyps.
Inspired by these studies, we propose ACU-TransNet, a hybrid architecture that combines the advantages of CNN-based and Transformer-based models. Unlike the conventional approach of merely adding Transformer modules to the encoder, decoder, or bottleneck layers, ACU-TransNet comprises a comprehensive attention UNet and a Transformer head. In the comprehensive attention UNet, we introduce a stem block and a hybrid attention decoder layer to assist in extracting richer multi-feature maps. These feature maps are then transmitted to the Transformer head through a bridge layer. Notably, we propose Deformable Separable Kernel Attention (DSKA) to mimic self-attention receptive fields and leverage Multi-scale Spatial Aggregation (MSA) to integrate local-global contextual information, which facilitates the utilization of fine-grained feature information present in medical images.
The main contributions of the article are as follows:
(1) We design a Comprehensive Attention UNet as the backbone. It improves UNet with a stem block and a hybrid attention decoder layer for more accurate polyp shape extraction.
(2) We design a Transformer head. It employs Deformable Separable Kernel Attention (DSKA) and Multi-scale Spatial Aggregation (MSA) to establish local-to-global context, capturing long-range dependencies and extracting fine-grained information from feature maps.
(3) We propose a hybrid network architecture called ACU-TransNet for polyp segmentation tasks. It consists of a sequentially connected U-shaped network and a Transformer head, capable of preserving both local and global contexts while capturing long-range relationships between elements.
Proposed method
To fully combine the advantages of CNN and Transformer, we propose ACU-TransNet.
As shown in Fig. 1, it consists of a comprehensive attention UNet and a Transformer head. First, the input image is passed through a U-shaped feature extraction architecture that compensates for high-level features lost by stepwise convolution and pooling operations through deformable convolution and channel-weighted stem blocks. The generated feature maps are fed into the bridge layer, after which a specially designed Transformer head processes these outputs and generates the final predicted binary image. We use group normalization [12] (GN) instead of batch normalization (BN) to better accommodate different datasets and GPU memory sizes while achieving better performance.
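The reason GN copes with small batches can be seen in a minimal sketch. It assumes a single sample stored as `sample[channel][pixel]` and omits the learnable scale and shift that a full GN layer would add. The function name `group_norm` and its parameters are illustrative, not taken from the authors' code.

```python
import math


def group_norm(sample, num_groups, eps=1e-5):
    """Normalize one sample shaped [channels][pixels] over channel groups.

    The mean and variance are computed per sample and per group, so the
    result does not depend on how many other samples share the batch.
    The learnable affine parameters of a full GN layer are omitted here.
    """
    channels = len(sample)
    if channels % num_groups != 0:
        raise ValueError("channels must be divisible by num_groups")
    group_size = channels // num_groups
    normalized = []
    for start in range(0, channels, group_size):
        group = sample[start:start + group_size]
        values = [v for channel in group for v in channel]
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        scale = 1.0 / math.sqrt(var + eps)
        for channel in group:
            normalized.append([(v - mean) * scale for v in channel])
    return normalized


# Hypothetical toy input: 4 channels with 2 pixels each, split into 2 groups.
toy_sample = [[1.0, 2.0], [3.0, 4.0], [10.0, 20.0], [30.0, 40.0]]
toy_output = group_norm(toy_sample, num_groups=2)
```

Because no statistic is shared across samples, the same output is obtained whether the batch holds 2 images or 32, whereas BN statistics become noisy as the batch shrinks.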

The architecture of the ACU-TransNet. A bridge layer is used to connect the CNN to the Transformer.
The comprehensive attention UNet represents the initial stage of feature encoding and decoding in ACU-TransNet, responsible for processing input images to generate multiple feature maps. Unlike the original UNet, ACU-TransNet employs an asymmetric structure, enhancing shape extraction accuracy through the stem block at the encoder end and the hybrid attention decoder layer at the decoder end.
Stem block
The first encoder layer of a medical image segmentation network extracts pivotal features, including texture, color, shape, and spatial relationships. The loss of these low-level features can compromise the accuracy of polyp segmentation. To address this problem, we introduce the stem block. As shown in Fig. 2, the stem block primarily consists of a 7×7 deformable convolution (DConv) [13] and an enhanced depthwise separable convolution. The depthwise separable convolution comprises a 7×7 depthwise convolution (DWConv), channel attention, and a 1×1 pointwise convolution (PConv). Depthwise separable convolution tends to underperform on low-dimensional features [14]. To mitigate this, we increase the input feature map dimension by placing a deformable convolution before the depthwise separable convolution. DConv facilitates the utilization of larger kernels without modifying the network’s parameter count. Notably, compared with standard convolution, deformable convolution better conforms to irregular polyp contours. The input image first passes through the DConv and then through the enhanced depthwise separable convolution to extract primary features. Channel attention is embedded between DWConv and PConv, further enhancing channel feature extraction capability. Channel features are enriched through weighted summation of global average pooling and max pooling. The residual structure improves feature propagation, reduces feature redundancy, and facilitates the acquisition of rich primary features, compensating for pixel loss due to convolution and pooling operations.

The architecture of the stem block. It is located in the first layer of the ACU-TransNet.
To capture channel-wise dependencies more fully, the channel attention placed between DWConv and PConv applies two pooling operations in parallel. Let AvgPool(·) denote global average pooling and MaxPool(·) denote global max pooling; the feature X undergoes both pooling operations to obtain the pooled channel descriptors.
The channel weight S for the different channels is then obtained by two 1×1 convolutional layers.
Contrary to channel attention mechanisms like SE [15] and CA [16], which typically rely on average pooling alone, our channel attention bolsters the network’s representational capacity by using both pooling types. It accentuates locally prominent features, including edges and textures, via global max pooling. Additionally, global average pooling captures the overall feature distribution, thus offering a richer and more holistic feature representation.
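The exact formulas for the pooled descriptors and the channel weight S are not reproduced in this text, so the following is only a sketch of one plausible reading. It assumes the two pooled descriptors are mixed by a weighted sum with a hypothetical coefficient `alpha`, passed through two 1×1 convolutions with a ReLU in between, and squashed by a sigmoid. On a 1×1 spatial map a 1×1 convolution reduces to a matrix-vector product, which is what `linear` computes. All names and the choice of ReLU and sigmoid are assumptions.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def linear(weights, vector):
    """A 1x1 convolution on a 1x1 spatial map is a matrix-vector product."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


def channel_weights(feature, w_reduce, w_expand, alpha=0.5):
    """Return one weight in (0, 1) per channel of feature[channel][pixel].

    alpha is a hypothetical mixing coefficient between the average-pooled
    and max-pooled descriptors; the paper's actual weighting is not given.
    """
    avg_desc = [sum(ch) / len(ch) for ch in feature]  # global average pooling
    max_desc = [max(ch) for ch in feature]            # global max pooling
    mixed = [alpha * a + (1.0 - alpha) * m for a, m in zip(avg_desc, max_desc)]
    hidden = [max(0.0, h) for h in linear(w_reduce, mixed)]  # 1x1 conv + ReLU
    return [sigmoid(s) for s in linear(w_expand, hidden)]    # 1x1 conv + sigmoid


def apply_channel_attention(feature, weights):
    """Rescale every channel by its attention weight."""
    return [[w * v for v in ch] for ch, w in zip(feature, weights)]


# Hypothetical toy input: 2 channels with 4 pixels each.
toy_feature = [[1.0, 2.0, 3.0, 4.0], [0.0, 0.0, 0.0, 8.0]]
toy_reduce = [[1.0, 1.0]]      # 2 channels -> 1 hidden unit
toy_expand = [[1.0], [-1.0]]   # 1 hidden unit -> 2 channels
toy_weights = channel_weights(toy_feature, toy_reduce, toy_expand)
toy_rescaled = apply_channel_attention(toy_feature, toy_weights)
```

In the toy input the second channel has a low mean but a single strong response. Max pooling keeps that peak visible in the descriptor, whereas average pooling alone would dilute it.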
The stem block better extracts low-level features through the use of large-kernel deformable convolution, channel attention, and depthwise separable convolution. This improvement facilitates a more efficient flow of information throughout the network, providing subsequent layers with richer and more precise feature representations. Consequently, ACU-TransNet retains the edge, texture, and color features of polyps more effectively in the initial layers, laying the foundation for the formation of more complex and abstract high-level features.
In colonoscopy images, the spatial location information of polyps is important [6] because polyps in the same region can differ in texture, color, shape, and size. Additionally, cascaded convolution operations lose the spatial details of tiny polyps. Therefore, we propose a hybrid attention decoder layer, which consists of channel attention and spatial attention to selectively enhance the desired features. As shown in Fig. 3, the block obtains spatial attention by employing a 1×1 convolution block, group normalization, and an activation function. It aggregates feature maps from different layers by merging adjacent features of different resolutions from channel attention and applying spatial attention. This method automatically learns to focus on specific lesion areas, thus eliminating the need for post-processing. Instance normalization enhances the focus on pixel information and yields better performance when combined with GELU.

The architecture of the hybrid attention decoder layer. Take the first layer of the hybrid attention decoder as an example, where X denotes features from the skip connection and G denotes features from the bottleneck.
At the end of the UNet, the input images have been transformed into high-level features with the same dimensions but a different number of channels. To refine and enhance these features, a convolution layer with a kernel size of 1, called the bridge layer, connects the UNet to the Transformer head.
Transformer head
The self-attention mechanism in Transformers conventionally processes images into 1D sequences, thereby overlooking crucial 2D spatial information [17]. To mitigate this limitation, we introduce a novel Transformer head that seamlessly integrates the adaptivity and long-range dependencies of self-attention while harnessing the benefits of convolutional operations for local fine-grained feature extraction. The architecture of the fully convolutional Transformer is depicted in Fig. 4.

The architecture of the Transformer head. The DSKA and MSA are composed of convolutional attention. The DSKA obtains global attention through deformable large kernel attention, while the MSA applies atrous convolutions with a linearly increasing receptive field to the output of the DSKA.
The proposed Transformer initially integrates features from the bridge layer. Following this, these concatenated features are transformed into a sequential representation at the Embedding layer, where positional encoding is omitted to reduce model complexity. Subsequently, instead of employing traditional multi-head self-attention, we leverage the proposed Deformable Separable Kernel Attention (DSKA) to capture global contextual information.
To further enrich the output features with spatial contextual information, we employ a Multi-scale Spatial Aggregation (MSA) module. The process of acquiring both global and spatial features is iterated three times, ensuring a comprehensive representation of the input image. Finally, the extracted rich features undergo upsampling to restore the original image size, and prediction results are generated through feedforward layers.
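To illustrate how atrous convolutions can give a linearly increasing receptive field, the sketch below uses the standard relation between kernel size, dilation rate, and the spatial extent a dilated kernel covers. The 3×3 kernel and the dilation rates 1, 2, 3 are hypothetical values chosen for illustration; the rates actually used in the MSA are not stated in this text.

```python
def effective_kernel_size(kernel_size, dilation):
    """Spatial extent covered by one dilated (atrous) convolution kernel.

    A dilation rate d inserts d - 1 gaps between neighbouring kernel taps,
    so the extent grows while the number of weights stays the same.
    """
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def extents_for_rates(kernel_size, dilations):
    """Effective extent for each dilation rate in a multi-scale module."""
    return [effective_kernel_size(kernel_size, d) for d in dilations]


# Hypothetical configuration: 3x3 kernels with dilation rates 1, 2, 3.
toy_extents = extents_for_rates(3, [1, 2, 3])
```

With these assumed values the extents are 3, 5, and 7 pixels, so each step in dilation rate widens the covered region by the same amount without adding parameters.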
F denotes the embedding features and Flatten(·) denotes the tensor flattening operation. F_b denotes the features of the bridge layer, and Conv(C_b, C_b) denotes the convolution operation on the C_b channels output from the bridge layer.
Datasets and preprocessing
The Kvasir-SEG and CVC-ClinicDB data sets are obtained from https://datasets.simula.r-seg/ and https://polyp.grand-challenge.org/CVCClinicDB/, respectively. The CVC-ClinicDB data set comprises 612 video frames derived from 31 high-definition endoscopic examinations [19], and the Kvasir-SEG data set contains 1000 polyp images and their corresponding ground truth [20] (Fig. 5).

Input image and ground truth in two colonoscopy datasets.
Before training, we resized images to 256×256 to address size inconsistencies in the original datasets. We applied image normalization, adjusting pixel values based on a given mean and standard deviation. The training and test sets were divided at a 9:1 ratio by uniform random sampling (Table 1). Moreover, we employed five-fold cross-validation on the training set. During training, we utilized data augmentation techniques (random horizontal flipping, scaling, and rotation) to enhance the model’s performance and generalizability. However, the validation set remained unaugmented during the training process.
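The splitting protocol can be sketched as follows. This is a minimal illustration that assumes images are identified by integer indices and that the split is seeded for reproducibility. The function names, the seed, and the interleaved fold assignment are assumptions, not the authors' implementation.

```python
import random


def split_train_test(items, test_fraction=0.1, seed=0):
    """Shuffle uniformly at random, then hold out a test fraction (9:1 by default)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def k_fold(train_items, k=5):
    """Yield (training, validation) lists; each item is validated exactly once."""
    folds = [train_items[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        training = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield training, validation


# Hypothetical example sized like Kvasir-SEG (1000 images).
toy_ids = list(range(1000))
toy_train, toy_test = split_train_test(toy_ids)
toy_folds = list(k_fold(toy_train, k=5))
```

Augmentation would then be applied only to the `training` part of each fold, which matches the statement that the validation set remains unaugmented.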
Number of training and testing sets
Polyp segmentation algorithms use colonoscopy images to discern the unique features and patterns of polyps. In this paper, in addition to the evaluation metrics, Dice coefficient and Jaccard, suggested by the Kvasir-SEG dataset, we further introduce Accuracy, Precision, and Recall, which are used as network performance metrics in the literature [21]. The Dice coefficient and Jaccard are commonly utilized for an intuitive evaluation of performance.
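For reference, the five metrics can be computed from the pixel-wise confusion counts of a predicted mask and its ground truth. The sketch below uses the standard definitions and assumes both masks are flat lists of 0/1 values. The small `eps` term that guards against division by zero is an assumption, not something specified in the paper.

```python
def confusion_counts(pred, truth):
    """Count TP, FP, FN, TN over two flat binary masks of equal length."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of pixels")
    tp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 1)
    tn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 0)
    return tp, fp, fn, tn


def segmentation_metrics(pred, truth, eps=1e-7):
    """Dice, Jaccard, Accuracy, Precision, and Recall for one mask pair."""
    tp, fp, fn, tn = confusion_counts(pred, truth)
    return {
        "dice": 2 * tp / (2 * tp + fp + fn + eps),
        "jaccard": tp / (tp + fp + fn + eps),
        "accuracy": (tp + tn) / (tp + fp + fn + tn + eps),
        "precision": tp / (tp + fp + eps),
        "recall": tp / (tp + fn + eps),
    }


# Hypothetical 6-pixel masks: 2 true positives, 1 false positive,
# 1 false negative, 2 true negatives.
toy_pred = [1, 1, 0, 0, 1, 0]
toy_truth = [1, 0, 0, 0, 1, 1]
toy_metrics = segmentation_metrics(toy_pred, toy_truth)
```

Dice and Jaccard are monotonically related (Dice = 2J / (1 + J)), so they rank methods identically on a single image. Accuracy also counts true-negative background pixels, which dominate most colonoscopy frames.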
All experiments in this paper were implemented in the open-source deep learning framework PyTorch 2.1.0 (CUDA version 11.7) on a host computer with an NVIDIA GeForce GTX 1080 Ti GPU, an i5-4570 CPU, and 16 GB of RAM. The batch size was set to 2 and the initial learning rate to 1e-3. The best model was selected after 120 rounds of iterations based on cross-validation results.
Result and conclusion
Performance comparison
In this subsection, we tested the proposed model on five indicators, Dice, Jaccard (Jac), Accuracy (Acc), Precision (Pre), and Recall (Rec), on the CVC-ClinicDB and Kvasir-SEG datasets.
As shown in Table 2, our method overall outperforms existing CNN and Transformer-based polyp segmentation methods on the CVC-ClinicDB dataset. Additionally, it surpasses previous state-of-the-art (SOTA) methods in accuracy (Acc), precision (Pre), and recall (Rec). As shown in Table 3, on the Kvasir-SEG dataset, ACU-TransNet achieved a 6.97% and 3.37% improvement in Dice scores when compared to CoInNet and Colonformer, respectively. Additionally, it achieved a 6.09.
Comparison results of the proposed method on the CVC-ClinicDB, bold indicates the best result
Comparison results of the proposed method on the Kvasir-SEG dataset, bold indicates the best result
In terms of visual evaluation, we provide Fig. 6, which visually demonstrates the segmentation results of the ten methods on the two datasets. It can be observed that ACU-TransNet (Ours) is more consistent with the ground truth of polyp images and achieves more accurate segmentation for irregular contours. Compared to other CNN or Transformer-based methods, our method benefits from the fully integrated advantages of the UNet-Transformer architecture, which leverages both the strengths of CNNs and Transformers. Furthermore, the introduction of deformable convolutions and attention mechanisms makes our network more suitable for capturing polyp characteristics.

Visual comparison of our method with the segmentation results of other advanced methods on the CVC-ClinicDB and Kvasir-SEG datasets. The obvious segmentation errors for each method are boxed in red. (a) images; (b) ground truth;
Five-fold cross-validation can offer a comprehensive evaluation of a model’s performance on unseen data. In our study involving the CVC-ClinicDB and Kvasir-SEG datasets, we conducted a comparative analysis of three CNN methods (UNet, PraNet, and CoInNet) with our proposed ACU-TransNet. This analysis demonstrates that ACU-TransNet achieves performance enhancement while maintaining the stability of model training.
As evident from Fig. 7, when the batch size was set to 2, all four methods displayed a degree of instability during training. However, ACU-TransNet exhibited relatively stable training loss and consistent performance on the validation data, indicating its robustness and better generalization on unseen data.

Training trend plots. ACU-TransNet converges better than UNet, PraNet, and CoInNet on the training and validation sets.
In this subsection, we verify and discuss the effectiveness of the stem block, hybrid attention decoder layer, and Transformer head modules added to the baseline. In Table 4, we validate the improvement that GN and the GELU activation function bring to UNet under a small batch size, and we use the resulting network as the baseline.
Comparative evaluation of U-Net and baseline on CVC-ClinicDB and Kvasir-SEG datasets
1 Convolution layer: Conv3×3 + BN + ReLU. 2 Convolution layer: Conv3×3 + GN + GELU.
In Tables 5 and 6, we investigate the performance of various model components for polyp segmentation in the CVC-ClinicDB and Kvasir-SEG datasets. The baseline model provides a foundation for evaluating the effectiveness of subsequent improvements. The introduction of the stem block results in an improvement in Dice coefficient and other metrics, indicating its ability to preserve low-level features effectively. The hybrid attention decoder layer further enhances the model’s capability to extract features from different polyp contours, as evidenced by the increase in the Dice coefficient. Combining both the stem block and the hybrid attention decoder layer shows a synergistic effect, leading to stable yet slightly improved performance. Finally, the inclusion of the Transformer head provides the model with global context awareness and the ability to capture long-range dependencies, which translates into a significant boost in the Dice coefficient and overall performance. In summary, the hybrid attention decoder layer and the stem block are emphasized for their significance in enhancing both local and global feature extraction, while the Transformer head notably strengthens the modeling of the global context. The combination of these components in the final model leads to SOTA performance in the task of polyp segmentation.
Ablation study of the ACU-TransNet architecture in CVC-ClinicDB, bold indicates the best result
Ablation study of the ACU-TransNet architecture in Kvasir-SEG, bold indicates the best result
In Fig. 8, we present selected results from our ablation experiments and visually highlight the disparities between the segmentation outcomes and the ground truth through color-coding. Notably, the addition of distinct modules has led to significant improvements in both under-segmentation and over-segmentation compared to the original baseline model. This observation underscores the efficacy and validity of our proposed module in enhancing the overall segmentation performance.

Visualization of ablation experiments. (a) baseline; (b) baseline + stem block; (c) baseline + hybrid attention decoder layer; (d) baseline + stem block + hybrid attention decoder layer; (e) baseline + Transformer head;
We introduce ACU-TransNet, an attention- and convolution-enhanced UNet-Transformer network designed for polyp segmentation. ACU-TransNet integrates the powerful local feature extraction capabilities of UNet with the global context awareness of Transformer. The UNet architecture is strengthened by the inclusion of a stem block and a hybrid attention decoder layer, which enhances the extraction of various polyp types. Channel attention is employed to emphasize specific feature categories, while spatial attention highlights disease-related features. In the Transformer head, we leverage DSKA and MSA to capture local-to-global contextual information and long-range dependencies. We validate the effectiveness and superiority of ACU-TransNet using two authoritative public polyp datasets, demonstrating that its evaluation metrics exceed those of current state-of-the-art (SOTA) methods for polyp segmentation. Our future work includes improving the model’s accuracy and efficiency and reducing its complexity. Additionally, we aim to integrate this model into colonoscopy devices to assist physicians in diagnostic evaluations through polyp segmentation.
Statements and declarations
Ethical approval
This paper complies with ethical standards. The research does not involve human participants and/or animals.
Conflict of interest
The authors declare no competing interests.
Funding
This work was financially supported in part by the National Natural Science Foundation of China (No. 62266011), and the Science and Technology Foundation of Guizhou Province (ZK[2022]119).
