A lightweight convolutional medical segmentation algorithm based on ConvNeXt to improve UNet

Abstract

In recent years, UNet and its derivative networks have become the dominant methods for medical image segmentation. However, such networks are often poorly suited to Point-of-Care (POC) healthcare applications because of their large parameter counts and high computational complexity. To address this, this paper introduces MCU-Net, an efficient medical image segmentation network that uses ConvNeXt to improve UNet. 1) Based on ConvNeXt, MCU-Net proposes the MCU Block, which combines large-kernel convolution, depth-wise separable convolution, and an inverted bottleneck design. To keep segmentation performance stable, the block also integrates global response normalization (GRN) layers and Gaussian Error Linear Unit (GELU) activation functions. 2) MCU-Net adds an enhanced Multi-Scale Convolution Attention (MSCA) module after the original UNet’s skip connections, emphasizing medical image features and capturing semantic information across multiple scales. 3) The downsampling process replaces pooling layers with convolutions, and both the upsampling and downsampling stages incorporate batch normalization (BN) layers to improve stability during training. Experimental results show that MCU-Net, with 2.19 million parameters and a computational complexity of 19.73 FLOPs, outperforms the other segmentation models compared, achieving a Dice score of 91.8% and an mIoU of 84.7% on the GlaS dataset. Compared with UNet on the BUSI dataset, MCU-Net improves Dice by 2% and mIoU by 2.9%.

Keywords

Convolutional neural network; deep learning; medical image processing; semantic segmentation

1 Introduction

In recent years, healthcare professionals have gained the capability to perform rapid bedside ultrasound examinations using handheld devices known as Point-of-Care Ultrasound (POCUS). These devices are more portable compared to traditional ultrasound equipment [1]. POCUS has demonstrated remarkable clinical efficacy in detecting various conditions, including pneumonia, pleural effusion, and dyspnea caused by pericardial effusion [2], as well as assessing pre- and post-resuscitation conditions following cardiac arrest [3]. POCUS’s application can also be found in home healthcare settings, enabling medical assessments through the linkage of a portable ultrasound device with a cell phone [4]. On the other hand, the field of MRI is also progressing toward POC applications [5]. With the ongoing advancement of POC medical devices, the prevailing trajectory of future research lies in the analysis of examination findings through artificial intelligence [6]. Using techniques like image segmentation and image classification within the domain of deep learning on such devices can significantly expedite the diagnostic process for both patients and doctors. However, the majority of POC devices face limitations in terms of computational performance and memory. Consequently, designing a model that strikes a balance between model complexity and computational efficiency remains a formidable challenge in this domain.

Fig. 1

POC application demonstration (shown in the image as POCUS).

At present, UNet [7] and its derivative networks are the prevalent solutions for medical image segmentation. These networks include UNet++ [8], AttentionUNet [9], V-Net [10], UNet3+ [11], and ResUNet [12], among others. More recently, the self-attention mechanism of Vision Transformers [13] has been integrated into the UNet framework, producing Transformer-based UNet derivatives such as TransUNet [14], TransBTS [15], MedT [16], and Swin UNet [17]. However, these studies tend to improve performance by adding more complex networks to UNet, and deploying segmentation networks with so many parameters on bedside devices is impractical, because such devices are constrained in computational performance and memory [18]. Transformer-based UNet derivatives in particular perform very well, owing to their modality fusion capabilities and a self-attention mechanism that effectively captures global information. That emphasis on performance comes at a cost: increased model complexity, more parameters, longer training times, and a need for advanced computational hardware such as GPUs.

Recently, the ConvNeXt network [19] demonstrated that a pure convolutional network can outperform Transformer-based networks, highlighting the continued effectiveness of convolutional neural networks. We therefore adopt a pure convolutional approach to address the problem described above. Following ConvNeXt’s improvements to the original ResNet, we propose the MCU Block as a replacement for the original convolution module in UNet. This modification enables our model to prioritize the target region by acquiring global information, while depth-wise (DW) convolution [20] reduces the parameter count and computational complexity. To enhance feature extraction across multiple scales, we introduce Multi-scale Convolutional Attention (MSCA) [21] after the skip connection. The convolutional channels within MSCA are adjusted, and its structure is optimized specifically for medical image segmentation. We replace the original pooling layer with convolution to prevent information loss during downsampling. Additionally, we add batch normalization (BN) [22] in both the upsampling and downsampling stages to stabilize training.

The main contributions of this paper are as follows:

We propose MCU-Net, an efficient network model tailored for medical image segmentation.

We introduce the MCU Block, which enhances the model’s capability to acquire global information in medical image segmentation. Additionally, we incorporate a convolutional attention module (MSCA) to improve multi-scale feature extraction. The structure of the upsampling and downsampling layers is redesigned for better performance. Both the MCU Block and MSCA use depth-wise separable convolution to reduce model complexity.

We conduct extensive experiments on GlaS and BUSI datasets, demonstrating the exceptional performance of our proposed network in terms of both model complexity and segmentation performance.

The remaining sections of this paper are structured as follows: In Section 2, we provide an overview of the current state of progress in related work. Section 3 details the methodology of MCU-Net and its constituent modules. Section 4 is dedicated to presenting the experimental details, and Section 5 showcases the experimental results. In Section 6, we discuss the limitations of the research presented in this paper and outline prospects. Finally, the whole paper is summarised in Section 7.

2 Related work

2.1 Previous UNet-based approach

FCN (Fully Convolutional Network) [23] pioneered image segmentation using fully convolutional feature extraction. UNet, now a standard medical image segmentation network, addressed information loss by introducing skip connections between the encoder and decoder of FCN. UNet achieved significant success in medical image segmentation, establishing itself as a benchmark model in the field. Zhou et al. [8] introduced UNet++, which redesigned the skip connections to aggregate features of different semantic scales in the decoder sub-network. This approach addresses the challenge of unknown network depth by integrating UNets of different depths. Building on UNet++, Huang et al. [11] proposed UNet3+, which further enhanced the model with full-scale skip connections, deep supervision, hybrid loss functions, and classification-guided modules. Jha et al. [24] introduced DoubleU-Net, a stacked design that adds a UNet at the base of the original UNet architecture and uses Atrous Spatial Pyramid Pooling (ASPP) to capture contextual information within the network. Separately, Anita et al. [12] enhanced UNet with a residual structure, resulting in ResUNet, in which Res-blocks replace the conventional convolutions to boost network performance. Oktay et al. [9] introduced Attention-UNet, which incorporates an attention module into the skip connection of the UNet framework. Its attention gates learn implicitly to suppress irrelevant regions in the input image while emphasizing the features that are crucial for the segmentation task.
While these various methods enrich and broaden the potential applications and research scope of UNet-type networks, they also introduce heightened complexity in network structures and algorithms, posing challenges for practical applications that should not be underestimated. In Table 1, we provide a comprehensive summary of the evaluation criteria, advantages, and disadvantages of the aforementioned networks.

Table 1
Advantages, disadvantages, and evaluation metrics of UNet-based networks

Networks	Advantages	Disadvantages	Evaluation metrics
UNet	Added upsampling, downsampling, and skip connections, a strong ability to fuse features from different levels.	Excessive downsampling can lead to more loss of spatial information.	IoU (Intersection over Union)
UNet++	Multi-scale features enhance context awareness and the cascading features retain detailed information.	It consumes excessive memory and is difficult to train.	IoU
UNet3+	It features nested dense connections, which enhance computational efficiency by reducing network parameters.	Insufficient utilization of multi-scale original features.	Dice (Dice Coefficient)
DoubleU-Net	Bottom nested UNet, strong ability of the network to focus on low-level features.	Excessive parameters lead to increased training time.	Dice, mIoU (Mean Intersection over Union), Recall, Precision
ResUNet	Better address gradient vanishing and semantic information loss issues.	Weak ability to extract features at the edges of the target.	Dice, JI (Jaccard Index), Recall
Attention-UNet	Emphasize local information features, resulting in higher model sensitivity.	Requires multiple layers of stacking to extract distant information; efficiency is relatively lower.	Dice, Precision, Recall

2.2 Attention mechanism

The attention mechanism is a neural network component that assigns distinct weights to different parts of the input data according to the demands of a specific task. It steers the network’s attention toward regions of interest, which yields a notable improvement in performance. Hu et al. [25] introduced SE-Net, which features the Squeeze-and-Excitation (SE) module. This module transforms spatial features into global features and thereby better captures inter-channel relationships. Woo et al. [26] proposed CBAM, an attention module for feed-forward convolutional neural networks. CBAM combines cross-channel and spatial information to extract informative features. Huang et al. [27] introduced CCNet, which features a criss-cross attention mechanism. This mechanism captures both horizontal and vertical information and weights the features of each target pixel through correlation analysis to obtain global contextual information. ECA-Net [28], in contrast, refines the excitation module of SE-Net and underscores the importance of avoiding dimensionality reduction in channel attention. It emphasizes that suitable cross-channel interaction can substantially reduce model complexity while preserving performance. SA-Net [29] integrates the Shuffle Attention (SA) module. Departing from conventional spatial and channel attention, SA-Net uses a permutation matrix to reorganize the feature map. This restructuring fosters the exchange of feature information and improves model performance. Notably, attention mechanisms improve performance while adding only a small share of model parameters and computation, which makes them promising for the problem we address.
However, the attention mechanisms described above may not excel at multi-scale feature extraction. Refer to Table 2 for a summary of the advantages and disadvantages of these attention mechanisms.

Table 2
Advantages and disadvantages of attention mechanism

Networks	Advantages	Disadvantages
SE-Net	Enhancing important channels, capturing global information, with low parameter count and computational cost.	Lacking local information, requiring extensive data training.
CBAM	Adapting to different input sizes, enriching channel and spatial feature information.	Excessive focus on local features may lead to overfitting issues during model training.
CCNet	Establishing distant dependencies enhances the semantic information of features.	The segmentation performance in edge regions is unsatisfactory, and it requires significant computational resources due to large module parameters.
ECA-Net	Capturing global information, non-reducing cross-channel interactions, adaptive one-dimensional convolution kernels.	Lacking long-range dependency, and performance bottleneck in complex images with existing limitations.
SA-Net	Establishing distant dependency relationships, enriching channel and spatial feature information.	Difficult training, information exchange, and recombination lead to information loss.

2.3 Further development of convolution

Recently, Convolutional Neural Networks (CNNs) have made a comeback: ConvNeXt [19], a pure convolutional network, achieved superior performance in segmentation tasks by drawing on the successful design choices of the Vision Transformer and redesigning the ResNet50/200 convolutional models. Building on ConvNeXt, Zhang et al. [30] introduced BCU-Net. This model redesigns ConvNeXt into an encoder-decoder form and feeds the input to the advanced ConvNeXt and UNet branches in parallel. A Multilabel Recall Loss (MRL) module then facilitates the deep integration of local and global pathology semantics between the two heterogeneous branches. Chen et al. [31] proposed an improved UNet for oral therapy, in which the DCN Block applies a dual ConvNeXt Block, together with an attention module that uses multi-filter and self-attention techniques. Although these networks are designed on the basis of ConvNeXt, none of them pursues a lightweight model, which makes it difficult for them to meet the medical needs of POC applications. Inspired by ConvNeXt, Han et al. [32] designed ConvUNeXt, which reconfigures the convolutional blocks within UNet and introduces a lightweight attention mechanism that homes in on the target region. UNeXt [33], in turn, is a streamlined segmentation model that simplifies the conventional UNet structure and introduces a Tokenized MLP layer. Refer to Table 3 for an overview of the evaluation criteria, advantages, and limitations of the networks mentioned.

Table 3
Advantages, disadvantages, and evaluation metrics of further evolved convolutional networks

Networks	Advantages	Limitations	Evaluation Criteria
ConvNeXt	Fully convolutional network and ConvNeXt-L’s performance surpasses that of Swin Transformer.	ConvNeXt, as a versatile network, has not been specifically optimized for medical image segmentation.	No image segmentation metrics.
BCU-Net	Its seamless plug-and-play architecture makes it easy to deploy, and the model demonstrates strong capabilities in capturing both global and local pathological semantic information.	A complex model that combines both networks and demands significant computational resources.	Accuracy, Sensitivity, Specificity, F1-score
An improved UNet with DCN blocks	The network exhibits strong feature expression capabilities, and the incorporation of multiple filter attention modules helps prevent the loss of information from the original image.	The network’s performance allocation ability is weak, which may lead to the waste of certain model performances.	Dice
ConvUNeXt	Proficient at capturing features from different levels while having a low parameter count.	The attention mechanism is too simplistic to capture local features.	MIoU, Dice
UNeXt	Achieving faster inference with fewer parameters and computational complexity.	Limited feature extraction capability.	F1, IoU

3 Methodology

3.1 MCU-net overall network architecture

The UNet architecture is divided into three key components: encoder, decoder, and skip connections. In the encoder, a sequence of convolutions and consecutive downsampling operations extracts semantic information at various stages of the network. The decoder upsamples the extracted features. During this process, the semantic information obtained from the different encoder stages is integrated, through skip connections, with the high-resolution semantic information produced by upsampling. This integration aims to alleviate the spatial information loss caused by downsampling.

MCU-Net adopts the U-shaped structure of UNet as its core architecture. Unlike UNet, MCU-Net replaces the conventional convolution blocks in the encoder and decoder with MCU Blocks. The MCU Block simplifies the UNet model and improves network performance, but it may limit the network’s ability to extract local features. MCU-Net therefore places the Multiscale Convolutional Attention (MSCA) module after the skip connection within the UNet structure. The primary purpose of the skip connection in UNet is to pass distinct levels of semantic information to the subsequent upsampling operations. MSCA processes the semantic information obtained through the skip connections at several scales, which allows the model to assign more weight to local features within the target region. Furthermore, the upsampling and downsampling layers of the encoder-decoder framework are adapted to stabilize training. The complete architecture of our proposed MCU-Net is shown in Fig. 2.

Fig. 2

Overall network architecture of MCU-Net based on UNet.

MCU-Net consists of Encoder, MSCA Block, Decoder, and Skip Connections. The Encoder in MCU-Net consists of five MCU Blocks and four downsampling layers. The outputs from the five encoder layers are denoted as E1, E2, E3, E4, and E5. The Decoder is constructed with four MCU Blocks and four upsampling layers. The outputs from the four decoder layers are labeled D1, D2, D3, and D4. The MSCA module connects the results from the skip connections with the upsampling outputs within the decoder. For reference, the inputs and outputs of MCU-Net are presented in Table 4.

Table 4

We show in detail the inputs and outputs of each module in MCU-Net. Econv denotes the encoder module and Dconv denotes the decoder module

Block	Layer	Input	Output
Encoder	Econv₁	480×480×3	480×480×32 = E1
	Down-Sampling₁	480×480×32	240×240×32
	Econv₂	240×240×32	240×240×64 = E2
	Down-Sampling₂	240×240×64	120×120×64
	Econv₃	120×120×64	120×120×128 = E3
	Down-Sampling₃	120×120×128	60×60×128
	Econv₄	60×60×128	60×60×256 = E4
	Down-Sampling₄	60×60×256	30×30×256
	Econv₅	30×30×256	30×30×512 = E5
Decoder	Up-Sampling₄	30×30×512	60×60×512
	Dconv₄	60×60×1024	60×60×256 = D4
	Up-Sampling₃	60×60×256	120×120×256
	Dconv₃	120×120×512	120×120×128 = D3
	Up-Sampling₂	120×120×128	240×240×128
	Dconv₂	240×240×256	240×240×64 = D2
	Up-Sampling₁	240×240×64	480×480×64
	Dconv₁	480×480×128	480×480×32 = D1
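
The encoder rows of Table 4 follow a simple rule: each encoder block changes only the channel count, and each downsampling layer halves the height and width. The following minimal Python sketch reproduces those rows under that reading. The function name `encoder_shapes` and its parameters are illustrative and are not taken from the authors’ code; the decoder is omitted because its input widths depend on how the MSCA path combines features, which the table alone does not specify.

```python
def encoder_shapes(size=480, in_ch=3, base_ch=32, blocks=5):
    """Return (layer, input_shape, output_shape) rows for the encoder.

    Assumption: block i outputs base_ch * 2**(i-1) channels and every
    downsampling layer halves height and width, as listed in Table 4.
    """
    rows = []
    ch = in_ch
    for i in range(1, blocks + 1):
        out_ch = base_ch * 2 ** (i - 1)
        # Encoder block: spatial size unchanged, channels widened.
        rows.append((f"Econv{i}", (size, size, ch), (size, size, out_ch)))
        ch = out_ch
        if i < blocks:
            # Downsampling: spatial size halved, channels unchanged.
            half = size // 2
            rows.append((f"Down-Sampling{i}", (size, size, ch), (half, half, ch)))
            size = half
    return rows


for name, shape_in, shape_out in encoder_shapes():
    print(f"{name:15s} {shape_in} -> {shape_out}")
```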

3.2 ConvNeXt + Block

ConvNeXt builds on the standard ResNet [34] and borrows design choices from the Vision Transformer, examining the components that contribute most to model performance. Its exploration covers macro-level design, ResNeXt, large kernel convolutions, inverted bottlenecks, and micro-level design, and the resulting network surpasses the Transformer’s performance. When we incorporated the ConvNeXt Block into the UNet architecture, we observed an improvement in UNet’s performance. Consequently, MCU-Net adopts the ConvNeXt Block as its main convolutional module. Furthermore, to optimize the ConvNeXt Block for medical image segmentation, we refined it to meet the task’s specific requirements, yielding the MCU Block.

Adopting an appropriate convolution kernel size. ConvNeXt uses a 7×7 convolution kernel on the rationale that a larger kernel yields a larger receptive field [35]. A larger receptive field allows deeper semantic information to be extracted and helps to determine the target’s location. Consequently, this approach outperforms other approaches in pixel-level segmentation tasks.

Depth-wise separable convolution. Employing large convolutional kernels contributes to heightened network complexity. To address this challenge, we maintain the use of DW convolution [20]. This strategy substantially reduces the FLOPs and counterbalances the augmentation in the model’s parameter count associated with employing a sizable 7×7 convolution kernel. Additionally, DW convolution extends the network’s width, effectively compensating for potential capacity loss.
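
A rough parameter count illustrates why the depth-wise design offsets the cost of a 7×7 kernel. The sketch below compares a standard convolution with a depth-wise convolution followed by a 1×1 point-wise convolution, ignoring biases. The channel width of 64 is an arbitrary illustrative value, and the helper names are hypothetical rather than part of MCU-Net.

```python
def standard_conv_params(c_in, c_out, k):
    """Weights of a standard k x k convolution (biases ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in, c_out, k):
    """Weights of a k x k depth-wise conv plus a 1 x 1 point-wise conv."""
    depthwise = k * k * c_in      # one k x k filter per input channel
    pointwise = c_in * c_out      # 1 x 1 conv that mixes channels
    return depthwise + pointwise


channels = 64  # illustrative width, not a value from the paper
std = standard_conv_params(channels, channels, 7)
dws = depthwise_separable_params(channels, channels, 7)
print(f"standard 7x7:            {std} weights")
print(f"depth-wise separable 7x7: {dws} weights")
print(f"reduction factor:         {std / dws:.1f}x")
```

The same ratio applies to multiply-accumulate operations per output position, since both counts scale with the feature-map area.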

Inverted bottleneck design. An inverted bottleneck structure, akin to that found in Transformers, is used in ConvNeXt, as shown in Fig. 3. In the Transformer, the hidden dimension of the MLP block is four times wider than the input dimension. ConvNeXt achieves this with an inverted structure of two consecutive 1×1 convolutions. The first 1×1 convolution raises the dimension to H×W×4C, and the second 1×1 convolution reduces it back to H×W×C. The MCU Block retains this inverted structure, positioning it after the initial DW convolution layer.
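
The shape flow and cost of the inverted bottleneck can be checked with a few lines of arithmetic. This is a minimal sketch under the description above (expand to 4C, then project back to C); the function name and the example size of 480×480×32 are chosen for illustration only.

```python
def inverted_bottleneck(h, w, c, expansion=4):
    """Return the tensor shapes and parameter count of two 1x1 convs.

    The first 1x1 conv expands C -> expansion*C, the second projects
    back to C. Spatial size (h, w) is untouched by 1x1 convolutions.
    """
    hidden = expansion * c
    shapes = [(h, w, c), (h, w, hidden), (h, w, c)]
    weights = c * hidden + hidden * c   # expand + project
    biases = hidden + c
    return shapes, weights + biases


shapes, params = inverted_bottleneck(480, 480, 32)
print(" -> ".join(str(s) for s in shapes))
print("parameters:", params)
```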

Fig. 3

Implementation details of convolutional modules for UNet, ResNet, ConvNeXt, and MCU-Net.

Activation function and normalization layer. We adopt GELU [36], a well-established choice in Transformer architectures, as the activation function. GELU is regarded as a smoother alternative to ReLU [37] and is placed between the two consecutive 1×1 convolutional layers. Within the MCU Block, a novel Global Response Normalization (GRN) [38] layer is also introduced. The GRN layer fosters competition among features across different channels, thereby increasing the network’s expressive capability. A residual connection is used to address vanishing and exploding gradients and to speed up convergence. In the MCU Block, the residual path and the convolution path are added directly, and a further GELU activation is applied to the sum to strengthen the module’s non-linear modeling capacity. The module’s implementation details are illustrated in Fig. 3, which shows the convolution block designs of UNet, ResNet, ConvNeXt, and MCU-Net side by side.
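
For readers unfamiliar with these two components, the sketch below implements the exact GELU and a simplified GRN in plain Python. It assumes the GRN formulation commonly published with ConvNeXt V2 (per-channel L2 norm over spatial positions, divided by the mean norm across channels); whether the MCU Block uses exactly this form is an assumption. The scalar `gamma` and `beta` stand in for what are normally learnable per-channel parameters.

```python
import math


def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def relu(x):
    return max(0.0, x)


def grn(x, gamma=1.0, beta=0.0, eps=1e-6):
    """Simplified Global Response Normalization on a [C][H][W] nested list.

    gamma and beta are scalars here for brevity (an assumption); in
    practice they are learnable per-channel parameters.
    """
    # Global feature aggregation: L2 norm of each channel over H and W.
    gx = [math.sqrt(sum(v * v for row in ch for v in row)) for ch in x]
    # Divisive normalization: each channel's norm relative to the mean norm.
    mean_gx = sum(gx) / len(gx)
    nx = [g / (mean_gx + eps) for g in gx]
    # Calibration with a residual term, so gamma = beta = 0 is the identity.
    return [[[gamma * v * n + beta + v for v in row] for row in ch]
            for ch, n in zip(x, nx)]


for v in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(f"x={v:+.1f}  relu={relu(v):.4f}  gelu={gelu(v):+.4f}")

features = [[[1.0, 1.0], [1.0, 1.0]], [[3.0, 3.0], [3.0, 3.0]]]
print(grn(features))
```

In the GRN example the channel with the larger response is amplified more than the weaker one, which is the cross-channel competition described above.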

3.3 Up-sampling and down-sampling module adjustment

In UNet, a max pooling layer serves as the downsampling method. However, max pooling is not adjustable and has no learnable parameters. In contrast, convolution operations have learnable parameters that can adapt to the data, so the downsampling can be tailored to actual requirements. Consequently, we replaced the original downsampling with a convolution that has a 2×2 kernel and a stride of 2. This change avoids the potential information loss caused by max pooling. We have not adjusted the upsampling module and still use bilinear interpolation for upsampling, but we have added Batch Normalization [22] before both the upsampling and downsampling modules. This reduces the network’s sensitivity to large parameter shifts during backpropagation and makes parameter updates more stable. It also stabilizes training and standardizes the data throughout the upsampling and downsampling procedures. The implementation details of upsampling and downsampling are shown in Fig. 4.
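
To make the contrast concrete, the following sketch applies 2×2 max pooling with stride 2 (assumed here as the usual UNet pooling) and a 2×2 strided convolution to the same single-channel input. The kernel values are hand-picked for illustration; in a trained network they would be learned. The helper names are hypothetical.

```python
def conv_out_size(n, kernel=2, stride=2, padding=0):
    """Standard output-size formula for convolution and pooling."""
    return (n + 2 * padding - kernel) // stride + 1


def max_pool_2x2(img):
    """Fixed operation: keeps only the maximum of each 2x2 window."""
    return [[max(img[r][c], img[r][c + 1], img[r + 1][c], img[r + 1][c + 1])
             for c in range(0, len(img[0]) - 1, 2)]
            for r in range(0, len(img) - 1, 2)]


def strided_conv_2x2(img, kernel):
    """2x2 convolution with stride 2: a weighted sum of each window."""
    return [[sum(kernel[i][j] * img[r + i][c + j]
                 for i in range(2) for j in range(2))
             for c in range(0, len(img[0]) - 1, 2)]
            for r in range(0, len(img) - 1, 2)]


image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12],
         [13, 14, 15, 16]]
averaging_kernel = [[0.25, 0.25], [0.25, 0.25]]  # illustrative weights

print(conv_out_size(480))                         # spatial size after one stage
print(max_pool_2x2(image))
print(strided_conv_2x2(image, averaging_kernel))
```

Both operations halve the spatial size, but only the convolution has weights that training can adjust to decide how much each pixel in the window contributes.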

Fig. 4

Up-sampling and down-sampling adjustment details.

3.4 Multi-scale convolutional attention (MSCA) module

After adding the MCU Block, we observed a reduction in overall model complexity. However, we also identified a shortfall in capturing local information during segmentation. To address this limitation, we integrated an attention mechanism. The Multi-scale Convolutional Attention (MSCA) [21] module is a recent attention mechanism built on inexpensive convolutional attention. Compared with previous attention mechanisms, MSCA uses depth-wise convolution to aggregate local information, which can significantly reduce the computational cost of the attention network. MSCA introduces a multi-branch depth-wise strip convolution scheme for capturing multi-scale context. This scheme uses three convolution kernel sizes: 7, 11, and 21. It further decomposes each conventional convolution into a pair of strip convolutions; for example, a pair of 7×1 and 1×7 convolutions replaces a standard 7×7 convolution [39]. The strip convolution design achieves two objectives: it extracts semantic information with minimal computational cost and parameters, and it complements grid convolution in extracting strip-like features, such as blood vessels and muscle cells, in the medical domain. At the same time, the three convolution scales allow MSCA to capture information across multiple scales, enriching the semantic information available to MCU-Net, including fine-grained features that were previously less perceptible to the network.
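
The saving from splitting a k×k depth-wise kernel into a k×1 and 1×k pair grows with k, as the following back-of-the-envelope count shows. This is an illustrative sketch that ignores biases; the channel count of 64 is arbitrary, and other kernel sizes can be passed in the same way.

```python
def square_dw_params(channels, k):
    """Weights of a depth-wise k x k convolution."""
    return channels * k * k


def strip_dw_params(channels, k):
    """Weights of a depth-wise k x 1 conv followed by a 1 x k conv."""
    return channels * (k + k)


channels = 64  # illustrative width, not a value from the paper
for k in (7, 11, 21):
    square = square_dw_params(channels, k)
    strip = strip_dw_params(channels, k)
    print(f"k={k:2d}: square={square:6d}  strip pair={strip:5d}  "
          f"ratio={square / strip:.1f}x")
```

The ratio is k/2, so the pair costs linear rather than quadratic weights in the kernel size.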

In this study, our objective was to enhance MCU-Net’s capacity for capturing intricate semantic information in medical image segmentation. To achieve this, the convolution kernel sizes were changed from 7, 11, and 21 to 5, 7, and 11. Additionally, a 1×1 convolution was added before the MSCA module. A 1×1 convolution corresponds to a fully connected computation across channels, increasing the network’s depth without expanding the receptive field. A GELU activation function was then added to increase the nonlinearity of the network. These changes enable the network to represent more intricate features and ultimately improve the performance of the final model. Mathematically, the MSCA can be expressed as Equation (1):

$\begin{aligned} F &= \mathrm{GELU}(\mathrm{Conv}(x)), \\ Att &= \mathrm{Conv}_{1 \times 1}\left(\sum_{i = 0}^{3} \mathrm{Scale}_i(\mathrm{DWConv}(F))\right), \\ Out &= \mathrm{Conv}_{1 \times 1}(Att \otimes F). \end{aligned}$ (1)

Here x denotes the input feature, Att and Out are the attention map and the output, respectively, and ⊗ is element-wise matrix multiplication. DWConv denotes depth-wise convolution, and Scale_i, i ∈ {0, 1, 2, 3}, denotes the i-th branch in Fig. 5, where Scale_0 is the identity connection. The structure of the MSCA attention module is shown in Fig. 5.

Fig. 5

Multi-scale convolutional attention module (MSCA).

4 Experiments

4.1 Dataset

We used two datasets in our experiments: the Gland Segmentation dataset (GlaS) [40] and the Breast Ultrasound Images dataset (BUSI) [41]. Figure 6 shows sample images from both datasets.

Fig. 6

Sample images from the experimental datasets. The first row shows the Gland Segmentation dataset (the green outline marks the segmentation area) and the second row shows the Breast Ultrasound Images dataset (the red outline marks the segmentation area).

Gland Segmentation dataset (GlaS): The GlaS dataset comprises a total of 165 H&E-stained tissue sections of stage T3 or T4 colorectal adenocarcinoma. It includes 85 images for training and 80 images for testing.

Breast Ultrasound Images (BUSI): The BUSI dataset comprises breast ultrasound images of normal, benign, and malignant cases, each accompanied by a corresponding segmentation map. In this paper, we use only the benign and malignant images, a total of 647 images, for both training and testing. We merged these two categories and split them into training and testing sets at an 8 : 2 ratio.

Furthermore, medical segmentation frequently grapples with limited data availability. In this paper, we address this challenge by augmenting the image data through various techniques, including random horizontal and vertical flipping, center cropping, random cropping, and random aspect ratio cropping. These augmentation strategies serve to bolster the model’s generalization capacity and enhance its overall robustness.
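For segmentation, each geometric augmentation has to be applied identically to an image and its mask, otherwise the labels no longer line up with the pixels. The following is a minimal sketch of that idea using plain nested lists. It covers only random flips and random cropping; the function names, the 0.5 flip probability, and the toy 6×6 image are assumptions for illustration, not the authors' pipeline.

```python
import random


def hflip(grid):
    """Mirror a 2-D list left to right."""
    return [row[::-1] for row in grid]


def vflip(grid):
    """Mirror a 2-D list top to bottom."""
    return [list(row) for row in grid[::-1]]


def crop(grid, top, left, size):
    """Cut a size x size window whose top-left corner is (top, left)."""
    return [row[left:left + size] for row in grid[top:top + size]]


def augment_pair(image, mask, crop_size, rng):
    """Apply the same random flips and the same random crop to both inputs."""
    if rng.random() < 0.5:  # assumed flip probability
        image, mask = hflip(image), hflip(mask)
    if rng.random() < 0.5:
        image, mask = vflip(image), vflip(mask)
    top = rng.randint(0, len(image) - crop_size)
    left = rng.randint(0, len(image[0]) - crop_size)
    return crop(image, top, left, crop_size), crop(mask, top, left, crop_size)


rng = random.Random(0)  # seeded only so the sketch is repeatable
image = [[r * 6 + c for c in range(6)] for r in range(6)]
mask = [[1 if value >= 18 else 0 for value in row] for row in image]

aug_image, aug_mask = augment_pair(image, mask, crop_size=4, rng=rng)
for img_row, mask_row in zip(aug_image, aug_mask):
    print(img_row, mask_row)
```

Because the toy mask is a threshold of the toy image, it is easy to verify after augmentation that every mask value still matches its pixel.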

4.2 Evaluation indicators

To assess the performance of the various methods, we employ the Dice coefficient (Dice) and mean Intersection over Union (mIoU) as our evaluation metrics.

The Dice coefficient, which is widely used in medical image segmentation tasks, measures the similarity between two samples and ranges between 0 and 1. A higher Dice coefficient indicates greater similarity between the predicted and true labels. Mathematically, the Dice coefficient is defined as follows: $Dice = \frac{2 TP}{2 TP + FN + FP}$ (2)

Furthermore, the mIoU metric averages the intersection over union (IoU) of the predicted and true regions over the k + 1 classes, background included. It provides a comprehensive evaluation of the overlap between the predicted and true segmentation. The mIoU is calculated as follows, where the subscript i denotes the counts for class i: $mIoU = \frac{1}{k + 1} \sum_{i = 0}^{k} \frac{{TP}_{i}}{{FN}_{i} + {FP}_{i} + {TP}_{i}}$ (3)

Here, TP (true positive) is the number of pixels predicted as the segmentation target that belong to the target in the ground truth, TN (true negative) is the number of pixels predicted as background that are background in the ground truth, FP (false positive) is the number of background pixels wrongly predicted as the target, and FN (false negative) is the number of target pixels wrongly predicted as background.
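As a worked illustration of Equations (2) and (3), the sketch below computes both metrics for a pair of small binary masks. It assumes a binary task in which foreground and background are the two classes (k = 1); the function names and the toy masks are hypothetical, and empty denominators are mapped to 1.0 as one possible convention.

```python
def confusion_counts(pred, target):
    """Return (TP, FP, FN, TN) for two binary masks given as 2-D lists."""
    tp = fp = fn = tn = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p == 1 and t == 1:
                tp += 1
            elif p == 1 and t == 0:
                fp += 1
            elif p == 0 and t == 1:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn


def dice(tp, fp, fn):
    """Equation (2): 2TP / (2TP + FN + FP)."""
    denom = 2 * tp + fn + fp
    return 2 * tp / denom if denom else 1.0


def iou(tp, fp, fn):
    """Per-class term of Equation (3): TP / (FN + FP + TP)."""
    denom = tp + fp + fn
    return tp / denom if denom else 1.0


def mean_iou(pred, target):
    """Average IoU over the foreground and background classes."""
    tp, fp, fn, tn = confusion_counts(pred, target)
    foreground = iou(tp, fp, fn)
    # For the background class the roles swap: its true positives are TN,
    # its false positives are FN, and its false negatives are FP.
    background = iou(tn, fn, fp)
    return (foreground + background) / 2


pred = [[1, 1, 0],
        [0, 1, 0]]
target = [[1, 0, 0],
          [0, 1, 1]]

tp, fp, fn, tn = confusion_counts(pred, target)
print("Dice:", dice(tp, fp, fn))
print("mIoU:", mean_iou(pred, target))
```

For these masks TP = 2, FP = 1, FN = 1, and TN = 2, so Dice is 4/6 and both class IoUs are 2/4.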

In evaluating deep learning models, two important complexity metrics are the number of parameters (Params) and the number of floating-point operations (FLOPs). Params is the total number of trainable parameters in the model and reflects its spatial (memory) complexity. FLOPs counts the floating-point operations required for one forward pass and reflects the computational (time) complexity; it should not be confused with FLOPS, the number of floating-point operations per second that a piece of hardware can perform. Considering these two metrics together gives a fuller picture of a model’s complexity and helps guide model optimization.
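To make these two metrics concrete, the sketch below counts parameters and multiply-accumulate operations (MACs) for a standard 3×3 convolution and for a depth-wise separable replacement (depth-wise 3×3 followed by point-wise 1×1). The layer width, feature-map size, and function names are assumptions for illustration. Profiling tools differ on whether one MAC is reported as one or two FLOPs, and the convention behind the paper's numbers is not stated here.

```python
def conv_params(c_in, c_out, k, groups=1, bias=True):
    """Trainable parameters of a k x k convolution."""
    weights = (c_in // groups) * c_out * k * k
    return weights + (c_out if bias else 0)


def conv_macs(c_in, c_out, k, h_out, w_out, groups=1):
    """Multiply-accumulate operations for one forward pass (bias ignored)."""
    return (c_in // groups) * c_out * k * k * h_out * w_out


c, k, h, w = 64, 3, 120, 120  # hypothetical layer width and feature-map size

standard_params = conv_params(c, c, k)
separable_params = conv_params(c, c, k, groups=c) + conv_params(c, c, 1)

standard_macs = conv_macs(c, c, k, h, w)
separable_macs = conv_macs(c, c, k, h, w, groups=c) + conv_macs(c, c, 1, h, w)

print(f"standard : {standard_params:6d} params, {standard_macs / 1e9:.3f} GMACs")
print(f"separable: {separable_params:6d} params, {separable_macs / 1e9:.3f} GMACs")
```

Setting `groups` equal to the channel count is what turns a convolution into a depth-wise one, which is why both counts drop sharply for the separable variant.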

4.3 Experiment details

We implemented the MCU-Net model in PyTorch and trained it on a single NVIDIA RTX 3090 GPU with 24 GB of memory. MCU-Net was trained from scratch, without any pretrained weights. The batch size was set to 4 for the GlaS dataset and to 8 for the BUSI dataset. The input size for both datasets was 480×480, and the weight decay was 5e-5. We trained MCU-Net with the AdamW optimizer [42] and an initial learning rate of 0.0015. Combining binary cross-entropy (BCE) and Dice loss, the loss function $L$ between the predicted output $\hat{y}$ and the target y is defined as follows: $L = 0.5 BCE (\hat{y}, y) + Dice (\hat{y}, y)$ (4)

It is important to note that the same training settings and loss functions were utilized for training all baseline models, ensuring a fair and consistent evaluation across all methods.
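The combined loss in Equation (4) can be sketched in plain Python as follows. This is an illustrative reading rather than the authors' implementation: it assumes the Dice term is the soft Dice loss (one minus a smoothed Dice coefficient), that predictions are probabilities in [0, 1] flattened into a list, and that the `eps` and `smooth` constants take the values shown.

```python
import math


def bce(probs, targets, eps=1e-7):
    """Mean binary cross-entropy over flattened pixels."""
    total = 0.0
    for p, y in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(probs)


def dice_loss(probs, targets, smooth=1.0):
    """Soft Dice loss: 1 - (2 * intersection + s) / (sum_p + sum_y + s)."""
    intersection = sum(p * y for p, y in zip(probs, targets))
    denom = sum(probs) + sum(targets)
    return 1.0 - (2.0 * intersection + smooth) / (denom + smooth)


def combined_loss(probs, targets):
    """Equation (4): 0.5 * BCE + Dice."""
    return 0.5 * bce(probs, targets) + dice_loss(probs, targets)


targets = [1, 0, 1, 0]
good = [0.9, 0.1, 0.8, 0.2]  # mostly correct prediction
bad = [0.2, 0.8, 0.3, 0.7]   # mostly wrong prediction

print("good prediction loss:", combined_loss(good, targets))
print("bad prediction loss :", combined_loss(bad, targets))
```

The BCE term penalizes each pixel independently, while the Dice term depends on the overlap of the whole region, which is a common motivation for combining the two when foreground and background areas are imbalanced.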

5 Results

5.1 Comparison with state-of-the-art

To validate the overall segmentation performance of the proposed MCU-Net, we compared it with other state-of-the-art methods on the GlaS and BUSI datasets. The methods considered are UNet, UNet++, AttUNet, ResUNet, and ConvUNeXt. Tables 5 and 6 report, for the GlaS and BUSI datasets respectively, the Dice and mIoU scores of these methods and the time required to train for 300 epochs; the model complexity metrics (Params and FLOPs) are listed in Table 5.

Table 5
Metrics of Different Methods on GlaS Dataset

Method	Params (M)	FLOPs (G)	Dice (%)	mIoU (%)	Training time, 300 epochs (min)
UNet	4.32	30.28	90.8	83.7	20
UNet++	9.16	94.6	90.8	83.8	44
AttUNet	8.72	101.9	91	84	40
ResUNet	3.68	28.92	90.4	83.2	23
ConvUNeXt	3.5	24.68	91.4	84.5	32
MCU-Net	2.19	19.73	91.8	84.7	19

Table 6

Metrics of different methods on BUSI dataset

Method	Dice (%)	mIoU (%)	Training time, 300 epochs (min)
UNet	67.4	65.7	103
UNet++	67.9	66.7	208
AttUNet	68.5	67.3	185
ResUNet	67.6	66.3	110
ConvUNeXt	68.9	68.3	136
MCU-Net	69.4	68.6	100

In terms of segmentation performance, our method achieves the best Dice and mIoU scores. On the GlaS dataset, its Dice is 1.0 percentage point higher than UNet and 0.4 higher than ConvUNeXt, and its mIoU is 1.0 higher than UNet and 0.2 higher than ConvUNeXt. On the BUSI dataset, its Dice is 2.0 higher than UNet and 0.5 higher than ConvUNeXt, while its mIoU is 2.9 higher than UNet and 0.3 higher than ConvUNeXt. UNet++ and AttUNet perform consistently on both datasets, while ResUNet performs slightly worse than the other models.

Furthermore, our approach not only excels in segmentation performance but also has the lowest model complexity among the compared methods. MCU-Net’s Params and FLOPs are 2.19 M and 19.73 G, respectively. This result stems primarily from the use of DW convolution throughout MCU-Net, in both the MCU Block and the MSCA, which reduces the computational complexity and parameter count of the model. In comparison, the nearest contender, ConvUNeXt, incorporates an attention mechanism that has not been optimized for Params and FLOPs, which places MCU-Net ahead in terms of model complexity. Figure 7 illustrates the relationship between mIoU and Params, as well as between mIoU and FLOPs. The mIoU used here corresponds to the GlaS dataset. It is clear from the graph that MCU-Net achieves the best trade-off between accuracy and model complexity.
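The size of these savings can be read directly from Table 5. The short sketch below computes the relative reduction in Params and FLOPs of MCU-Net with respect to each baseline, using only the values reported in the table; the variable and function names are hypothetical.

```python
# (Params in M, FLOPs in G) as reported in Table 5
baselines = {
    "UNet": (4.32, 30.28),
    "UNet++": (9.16, 94.6),
    "AttUNet": (8.72, 101.9),
    "ResUNet": (3.68, 28.92),
    "ConvUNeXt": (3.5, 24.68),
}
mcu_net = (2.19, 19.73)


def reduction(baseline, ours):
    """Relative reduction in percent: 100 * (baseline - ours) / baseline."""
    return 100.0 * (baseline - ours) / baseline


for name, (params, flops) in baselines.items():
    print(f"{name:10s} Params -{reduction(params, mcu_net[0]):.1f}%  "
          f"FLOPs -{reduction(flops, mcu_net[1]):.1f}%")
```

Relative to UNet, for example, this gives roughly 49% fewer parameters and 35% fewer FLOPs.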

Fig. 7

Comparison plot. The X-axis corresponds to the mIoU (higher is better). The Y-axis corresponds to the number of parameters, and GFLOPs (lower is better), respectively.

Finally, in terms of training time, UNet and ResUNet only took 20 minutes and 23 minutes on the GlaS dataset, and 103 minutes and 110 minutes on the BUSI dataset. This is mainly because UNet and ResUNet do not utilize attention mechanisms, while MCU-Net incorporates the MSCA attention mechanism. Benefiting from its fully convolutional design, MCU-Net achieved training times of 19 minutes on GlaS and 100 minutes on BUSI. On the other hand, both AttUNet and ConvUNeXt employ attention mechanisms, yet their training times surpass that of MCU-Net. This observation underscores the fact that the MSCA attention mechanism introduces lower model complexity. Moreover, due to the intricate nature of its skip connections, UNet++ exhibited a training time of up to 208 minutes on the BUSI dataset.

5.2 Visualization study

In Figs. 8 and 9, we show the visual segmentation results obtained from the various methods on the GlaS and BUSI datasets. In Fig. 8, the H&E-stained tissues of colorectal adenocarcinoma present complex semantic information: low contrast with the surrounding tissues, unclear boundaries, and numerous cluttered regions that must be segmented simultaneously. In certain cases, the other networks display varying degrees of under-segmentation and a lack of sensitivity to local semantic information. Notably, in the red box of Fig. 8, UNet and ConvUNeXt struggle to extract local semantic information effectively. Conversely, MCU-Net achieves higher segmentation accuracy, with results that closely approximate the ground truth. In Fig. 9, the visualization of the BUSI dataset highlights the differences among the segmentation results produced by the different methods on breast ultrasound images. Cancerous and normal tissues share morphological and color similarities, which makes accurate segmentation more difficult. However, MCU-Net’s focus on capturing local information enables precise delineation of lesion boundaries, yielding better results than the other algorithms.

Fig. 8

Visual detail results on the GlaS dataset.

Fig. 9

Visual detail results on the BUSI dataset.

5.3 Ablation study

To verify the effectiveness of the MCU Block, the improved upsampling and downsampling, and the improved MSCA in the proposed MCU-Net, we designed several network variants and compared them on the GlaS dataset. The results of these ablation experiments are shown in Table 7.

Table 7
Presentation of ablation study results

Number	Method	Params (M)	FLOPs (G)	Dice (%)	mIoU (%)	Training time, 300 epochs (min)
1	UNet	4.32	30.28	90.8	83.7	20.3
2	UNet + Improved upsampling and downsampling	3.68	28.93	89.7	83.1	20.9
3	UNet + Improved upsampling and downsampling + ConvNeXt block	1.99	15.96	89.2	81.3	17.8
4	UNet + Improved upsampling and downsampling + MCU block	1.99	15.98	90.8	83.7	18.5
5	UNet + Improved upsampling and downsampling + MCU block + MSCA	2.21	20.39	91.1	83.8	19.7
6	MCU-Net (UNet + Improved upsampling and downsampling + MCU block + Improved MSCA)	2.19	19.73	91.8	84.7	19.2

Ablation study for improved upsampling and downsampling: Our proposed method is based on the UNet architecture, so the UNet in Experiment 1 serves as the baseline model. In Experiment 2, we integrated the improved upsampling and downsampling layers into UNet. This reduced both model complexity and accuracy. However, the training process of this modified UNet became more stable. In Fig. 10, we present the loss curves for Experiment 1 and Experiment 2 during training. It is evident from the graph that the UNet with the improved upsampling and downsampling layers exhibited a lower overall loss and smaller fluctuations.

Fig. 10

Loss fluctuation graphs for Experiment 1 and Experiment 2.

Ablation study for MCU Block: In Experiment 3, the encoder-decoder part of UNet was entirely replaced with the ConvNeXt Block to reduce the number of parameters and the complexity. Compared to the original UNet, Params decreased from 4.32 M to 1.99 M, FLOPs decreased from 30.28 G to 15.96 G, and the training time shortened to 17.8 minutes. However, the model’s Dice and mIoU decreased by 1.6 and 2.4 percentage points, respectively. This is a clear instance of sacrificing model performance for a lighter model. In Experiment 4, when the ConvNeXt Block was replaced by the MCU Block, Params and FLOPs remained essentially unchanged (1.99 M and 15.98 G), while the Dice and mIoU scores returned to the level of the original UNet. This observation confirms that the UNet with the MCU Block not only maintains the segmentation accuracy of the original UNet but also achieves lower model complexity.

Ablation study for Improved MSCA: Furthermore, in Experiment 5, the unimproved MSCA was incorporated into the network, resulting in Params and FLOPs of 2.21 M and 20.39 G respectively. The training time also increased to 19.7 minutes. However, there was no significant change in segmentation accuracy. This led us to believe that the larger convolutional kernels in the original MSCA were not effectively capturing semantic information from medical images. To address this issue, adjustments were made to the convolutional kernels within the MSCA. By introducing a 1×1 convolutional layer and a GELU activation function, the modified MSCA achieved remarkable results. The network’s Dice and mIoU scores reached 91.8% and 84.7% respectively, with a decrease in model complexity. This encouraging outcome suggests that the improved MSCA is more sensitive to medical image segmentation tasks, leading to a significant enhancement in segmentation accuracy.

6 Discussion

This study introduces an efficient convolutional neural network named MCU-Net, which is based on ConvNeXt to improve UNet. MCU-Net employs DW convolution and large convolutional kernels in both the encoder-decoder structure and the attention mechanism, aiming to reduce the overall model complexity. To address the potential reduction in segmentation accuracy due to decreased model complexity, we modified the original ConvNeXt Block. In doing so, we incorporated a GRN normalization layer and the GELU activation function to maintain segmentation precision during the task. Simultaneously, in the process of upsampling and downsampling, we replaced pooling layers with convolutional layers and integrated BN layers. This approach mitigates the network’s sensitivity to significant parameter changes during backpropagation, ensuring the stability of parameter updates throughout training. Additionally, we introduced the Multiscale Convolutional Attention (MSCA) module, which captures local features across multiple scales, thereby enhancing the overall segmentation accuracy of the network.

MCU-Net was validated on two publicly available datasets, GlaS and BUSI. Comparative experiments showed that it outperformed existing state-of-the-art algorithms in the performance metrics Dice and mIoU, in the complexity metrics Params and FLOPs, and in training time. This demonstrates that MCU-Net combines excellent segmentation accuracy with lower model complexity. Notably, our ablation experiments revealed that the MCU Block, the MSCA attention mechanism, and the modified upsampling and downsampling modules each had a positive impact on the model’s performance.

Indeed, our approach has its limitations. In comparison to the classical UNet architecture, our method hasn’t significantly reduced training time. This can be attributed to the addition of the GRN normalization layer in the MCU Block to maintain segmentation accuracy. If a more concise convolutional module with high performance could be introduced, the lightweight nature of the network would be further enhanced. Furthermore, when compared to traditional desktops or servers, mobile devices operate within diverse operating systems and software environments. Adaptations and optimizations are necessary to ensure seamless execution of the model on mobile devices. Therefore, a crucial direction for future work is the ongoing refinement and enhancement of model performance, ensuring that segmentation algorithms can be successfully deployed and executed smoothly on embedded devices.

7 Conclusion

In POC applications, large image segmentation networks cannot be deployed efficiently on portable clinical devices with limited computing resources. To address this challenge, we propose MCU-Net, a simple deep learning network with a low parameter count and low computational complexity, built on a conventional convolutional neural network and an attention mechanism. The MCU Block is re-engineered from the ConvNeXt Block by incorporating a GRN normalization layer and a GELU activation function. It yields a substantial reduction in model parameters while maintaining segmentation performance. In the downsampling layers, MCU-Net replaces pooling with convolutions, and BN layers are integrated into both the upsampling and downsampling stages to enhance training stability. Additionally, we introduce the Multiscale Convolutional Attention (MSCA) module after the skip connections and refine its structure and convolution kernel sizes. The enhanced MSCA captures more local features across multiple scales, resulting in better performance. Comparative experiments and visualization studies on the GlaS and BUSI datasets demonstrate that MCU-Net achieves a more competitive balance between parameters, computational complexity, and performance. In the future, we aim to further optimize our model, reducing training time and parameters. We also plan to deploy the model on embedded devices.

Acknowledgments

The research was supported by the CAMS Innovation Fund for Medical Sciences (CIFMS) (2022-I2M-C&T-B-035) and the National High Level Hospital Clinical Research Funding (2022-PUMCH-A-121).

References

1. Smallwood and Dachsel, Point-of-care ultrasound (POCUS): unnecessary gadgetry or evidence-based medicine? Clinical Medicine 18(3) (2018), 219–224.

2. Zanobetti, Scorpiniti, Gigli, Nazerian, Vanni and Innocenti, Point-of-Care Ultrasonography for Evaluation of Acute Dyspnea in the ED, Chest 151(6) (2017), 1295–1301.

3. Gaspari, Weekes, Adhikari, Noble V.E., Nomura J.T., Theodoro and Woo, Emergency department point-of-care ultrasound in out-of-hospital and in-ED cardiac arrest, Resuscitation 109 (2016), 33–39.

4. Duggan N.M., Jowkar, Ma I.W.Y., Schulwolf, Selame L.A., Fischetti C.E., Kapur and Goldsmith A.J., Novice-performed point-of-care ultrasound for home-based imaging, Scientific Reports 12(1) (2022), 20461.

5. Kuoy, Glavis-Bloom, Hovis, Yep, Biswas, Masudathaya and Norrick L.A., Point-of-Care Brain MRI: Preliminary Results from a Single-Center Retrospective Study, Radiology 305(3) (2022), 666–671.

6. Lee and DeCara J.M., Point-of-Care Ultrasound, Current Cardiology Reports 22(11) (2020), 149.

7. Ronneberger, Fischer, Brox, U-Net: Convolutional Networks for Biomedical Image Segmentation, In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5–9, 2015, Proceedings, Part III 18, pp. 234–241. Springer International Publishing, (2015).

8. Zhou, Siddiquee M.M.R., Tajbakhsh and Liang, UNet++: A Nested U-Net Architecture for Medical Image Segmentation, IEEE Transactions on Medical Imaging 39(6) (2019), 1856–1867.

9. Oktay, Schlemper, Folgoc L.L., Lee, Heinrich, Misawa, Attention U-Net: Learning Where to Look for the Pancreas, arXiv preprint arXiv:1804.03999, (2018).

10. Milletari, Navab, Ahmadi S.A., V-Net: Fully Convolutional Neural Networks for Volumetric Medical Image Segmentation, In 2016 Fourth International Conference on 3D Vision (3DV), IEEE, (2016).

11. Huang, Lin, Tong, Hu, Zhang, Iwamoto, UNet 3+: A Full-Scale Connected UNet for Medical Image Segmentation, In ICASSP 2020–2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain: IEEE, (2020).

12. Khanna, Londhe N.D., Gupta and Semwal, A deep Residual U-Net convolutional neural network for automated lung segmentation in computed tomography images, Biocybernetics and Biomedical Engineering 40(3) (2020), 1314–1327.

13. Dosovitskiy, Beyer, Kolesnikov, Weissenborn, Zhai, Unterthiner, An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, arXiv preprint arXiv:2010.11929, (2020).

14. Chen, Lu, Yu, Luo, Adeli, Wang, TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation, arXiv preprint arXiv:2102.04306, (2021).

15. Wang, Chen, Ding, Li, Yu, Zha, TransBTS: Multimodal Brain Tumor Segmentation Using Transformer, In Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part I 24, pp. 109–119. Springer International Publishing, (2021).

16. Valanarasu J.M.J., Oza, Hacihaliloglu, Patel V.M., Medical Transformer: Gated Axial-Attention for Medical Image Segmentation, In Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part I 24, pp. 36–46. Springer International Publishing, (2021).

17. Cao, Wang, Chen, Jiang, Zhang, Tian, Swin-Unet: Unet-like Pure Transformer for Medical Image Segmentation, In European Conference on Computer Vision, pp. 205–218. Cham: Springer Nature Switzerland, (2022).

18. Han, Wang, Chen, Guo and Liu, A Survey on Visual Transformer, IEEE Transactions on Pattern Analysis and Machine Intelligence 45(1) (2023), 87–110.

19. Liu, Mao, Wu C.Y., Feichtenhofer, Darrell, Xie, A ConvNet for the 2020s, In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986, (2022).

20. Howard A.G., Zhu, Chen, Kalenichenko, Wang, Weyand, MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications, arXiv preprint arXiv:1704.04861, (2017).

21. Guo, Lu, Hou, Liu, Cheng and Hu, SegNeXt: Rethinking convolutional attention design for semantic segmentation, Advances in Neural Information Processing Systems 35 (2022), 1140–1156.

22. Ioffe, Szegedy, Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, In International Conference on Machine Learning, pp. 448–456. PMLR, (2015).

23. Long, Shelhamer, Darrell, Fully Convolutional Networks for Semantic Segmentation, In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431–3440, (2015).

24. Jha, Riegler M.A., Johansen, Halvorsen, Johansen H.D., DoubleU-Net: A Deep Convolutional Neural Network for Medical Image Segmentation, In 2020 IEEE 33rd International Symposium on Computer-Based Medical Systems (CBMS), pp. 558–564. IEEE, (2020).

25. Hu, Shen, Sun, Squeeze-and-Excitation Networks, In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7132–7141, (2018).

26. Woo, Park, Lee J.Y., Kweon I.S., CBAM: Convolutional Block Attention Module, In Proceedings of the European Conference on Computer Vision (ECCV), pp. 3–19, (2018).

27. Huang, Wang, Huang, Wei, Liu, CCNet: Criss-Cross Attention for Semantic Segmentation, In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 603–612, (2019).

28. Wang, Wu, Zhu, Li, Zuo, Hu, ECANet: Efficient Channel Attention for Deep Convolutional Neural Networks, In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11534–11542, (2020).

29. Zhang, Yang, SA-Net: Shuffle attention for deep convolutional neural networks, In ICASSP 2021–2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2235–2239. IEEE, (2021).

30. Zhang, Zhong Xi., Li, Liu, Ji, Li and Wu, BCU-Net: Bridging ConvNeXt and U-Net for medical image segmentation, Computers in Biology and Medicine 159 (2023), 106960.

31. Peng, Luan, Zhang, Segmentation of fundus vascular images based on a dual-attention mechanism, arXiv preprint arXiv:2305.03617, (2023).

32. Han, Jian and Wang G.G., ConvUNeXt: An efficient convolution neural network for medical image segmentation, Knowledge-Based Systems 253 (2022), 109512.

33. Valanarasu, Patel V.M., UNeXt: MLP-based rapid medical image segmentation network, In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 23–33. Cham: Springer Nature Switzerland, (2022).

34. He, Zhang, Ren, Sun, Deep residual learning for image recognition, In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, (2016).

35. Ding, Zhang, Han and Ding, Scaling up your kernels to 31x31: Revisiting large kernel design in CNNs, In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11963–11975, (2022).

36. Hendrycks, Gimpel, Gaussian error linear units (GELUs), arXiv preprint arXiv:1606.08415, (2016).

37. Glorot, Bordes, Bengio, Deep sparse rectifier neural networks, In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 315–323, JMLR Workshop and Conference Proceedings, (2011).

38. Woo, Debnath, Hu, Chen, Liu, Kweon I.S., Xie, ConvNeXt V2: Co-designing and scaling ConvNets with masked autoencoders, In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16133–16142, (2023).

39. Guo, Lu, Liu, Cheng, Hu, Visual attention network, Computational Visual Media (2023), 1–20.

40. Sirinukunwattana, Pluim J.P.W., Chen, Qi, Heng, Guo and Wang, Gland segmentation in colon histology images: The GlaS challenge contest, Medical Image Analysis 35 (2017), 489–502.

41. Al-Dhabyani, Gomaa, Khaled and Fahmy, Dataset of breast ultrasound images, Data in Brief 28 (2020), 104863.

42. Loshchilov, Hutter, Decoupled weight decay regularization, arXiv preprint arXiv:1711.05101, (2017).
