Abstract
Accurate segmentation of the abdominal organs in X-ray computed tomography (CT) images is fundamental for diagnosing abdominal diseases, planning cancer treatment, and formulating radiotherapy strategies. However, existing deep learning models for three-dimensional (3D) abdominal multi-organ CT segmentation face several challenges, including complex organ distribution, scarcity of labeled data, and diversity of organ structures, which lead to difficult training, slow convergence, and low segmentation accuracy. To address these issues, a novel segmentation approach based on multi-stage training and a deep supervision model is proposed. It primarily integrates multi-stage training, a pseudo-labeling technique, and a newly developed deep supervision model with an attention mechanism (DLAU-Net), specifically designed for 3D abdominal multi-organ segmentation. The DLAU-Net enhances segmentation performance and model adaptability through an improved network architecture. The multi-stage training strategy accelerates model convergence and enhances generalizability, effectively addressing the diversity of abdominal organ structures. The introduction of pseudo-labeling training alleviates the bottleneck of labeled data scarcity and further improves the model's generalization performance and training efficiency. Experiments were conducted on a large dataset provided by the FLARE 2023 Challenge, and comprehensive ablation studies and comparative experiments were carried out to validate the effectiveness of the proposed method. Our method achieves an average organ accuracy (AVG) of 90.5% and a Dice Similarity Coefficient (DSC) of 89.05%. It also exhibits exceptional performance in terms of training speed and handling data diversity, particularly in the segmentation of critical abdominal organs such as the liver, spleen, and kidneys, significantly outperforming existing comparative methods.
Keywords
Introduction
The distribution of human abdominal organs is complex. The main organs and structures within the abdominal cavity include the liver, gallbladder, stomach, spleen, inferior vena cava, and aorta. The normal liver contour is smooth, and its shape varies with the cross-sectional position. The porta hepatis region typically contains a significant amount of adipose tissue, presenting an irregular shape or resembling a low-density polygon. This area includes the hepatic artery, portal vein, and bile duct structures. The portal vein, located posteriorly, is relatively large, while the hepatic artery is situated anterior to it, and the bile duct (mainly the common hepatic duct) is located anteriorly and laterally. Post-contrast, the portal vein and aorta are easily identified, appearing as round high-density shadows positioned anterior to the inferior vena cava. Abdominal X-ray CT images often exhibit artifacts and noise due to the limitations of imaging equipment and organ peristalsis. Additionally, some organ tissues may appear partially blurred, and the edges of lesions can be unclear, presenting considerable challenges for organ segmentation. Accurate segmentation of abdominal organs from CT images is fundamental for diagnosing abdominal diseases, planning cancer treatment, and formulating radiotherapy strategies.
Early multi-organ segmentation relied on region-growing algorithms, which are semi-automated methods that achieve promising performance when the seed points are positioned correctly. Given the exceptional performance of convolutional neural networks (CNNs) in image processing and segmentation, they have gradually been applied to medical image segmentation. A classic example is the U-Net architecture proposed by Ronneberger et al. in 2015. 1 It incorporates skip connections between downsampling and upsampling layers, enabling the direct transfer of features extracted by downsampling layers to upsampling layers, thereby enhancing pixel-level localization and segmentation performance. However, the U-Net architecture struggles to capture inter-slice correlations, leading to the risk of under-segmentation. To address these limitations, research into the U-Net structure has deepened, resulting in the development of various encoder-decoder architectures, such as U-Net++ and U-Net 3+. These network structures continue to refine the U-Net framework, yet with the advancements in deep learning, alternative high-accuracy network models such as Vision Transformers (ViT) have emerged. To achieve accuracy comparable with these models, it is necessary to introduce new modules and innovations. A 3D deeply supervised network (3D-DSN) was developed to address liver segmentation, demonstrating commendable accuracy. Segmentation research is also evolving, transitioning from general segmentation to specialized models focused on specific regions such as the brain, chest, abdomen, and pelvis. In recent years, as the pace of development in multi-organ segmentation for the brain and chest regions has gradually slowed, an increasing number of researchers have begun to concentrate on abdominal multi-organ tasks. Despite the emergence of excellent new network structures such as 3D-DSN, studies by Geert et al. 2 indicate that segmentation tasks typically feature prolonged convergence times and excessive iteration counts. Moreover, to ensure accurate convergence of the models, a substantial amount of labeled data is required. However, in practical applications, labeled datasets are often severely limited. 3 Based on these considerations, segmentation tasks commonly face two issues: first, model convergence requires numerous iterations and is time-consuming; second, the labeled datasets are often insufficient to achieve optimal convergence in practical tasks.
Elnaz et al. noted in 2015 4 that multi-stage training can produce regularization effects and enhance generalization capabilities, thereby accelerating model convergence. In a subsequent study by Yuan et al. in 2018, 5 an explanation and justification were provided for the question “why does staged training accelerate the reduction of test error on SGD?” In recent years, increasing research has focused on integrating multi-stage training into multi-organ segmentation tasks. The results of these studies consistently demonstrated that multi-stage training significantly accelerates model convergence. The method of training neural networks using pseudo-labels in a semi-supervised manner (self-training with pseudo-labels) was first proposed by Lee et al. 6 This approach leverages high-confidence pseudo-labeling samples to enhance the model's fitting ability and improve robustness, particularly in scenarios with limited source domain datasets. Subsequent advancements include entropy minimization, 7 proxy labeling, 6 noise contrastive estimation, 8 meta-pseudo-labeling, 9 multi-view training,10,11 collaborative training, 12 and tri-training. 13 These methods represent successful attempts in the field of pseudo-labeling generation. Recently, the medical image segmentation field has faced challenges due to variations in the same organ's condition across different individuals and ages. This variability poses difficulties regarding the sufficiency and diversity of labeled data, while pseudo-labeling training offers an effective solution to address these challenges.
By reviewing the history and current status of abdominal organ segmentation, this paper traces the evolution from early region-growing algorithms to U-Net and its variants, and on to the latest deep learning models such as 3D-DSN and Vision Transformers. It also highlights the developmental trends in this field and the existing challenges, including long training times, reliance on extensive labeled data, and the diversity of organ structures. To address these challenges, we propose an innovative method that combines multi-stage training with pseudo-labeling techniques and a novel network model (DLAU-Net), specifically designed for 3D multi-organ segmentation in the abdomen. We employ multi-stage training to train the model at different granularities, effectively addressing the limitations of traditional models in feature learning. Additionally, through pseudo-labeling techniques, the model can achieve high-accuracy segmentation even with a limited amount of labeled data. The key contributions of this study are summarized as follows:
(1) The DLAU-Net network model is proposed to significantly enhance the performance of downstream segmentation tasks and advance the accuracy, efficiency, and adaptability of the algorithm. (2) Multi-stage training is applied to accelerate model convergence, improve generalization capabilities, and help address the segmentation challenges posed by the diversity of organ structures. (3) Pseudo-labeling is used to effectively address the scarcity of labeled data, enhance the model's generalization ability, and improve training efficiency.
Related work
Abdominal multi-organ segmentation tasks typically focus on a few larger organs and structures, such as the esophagus, aorta, liver, spleen, left kidney, right kidney, and pancreas.14,15 Models that perform exceptionally well in these tasks are often based on U-Net and its variants. Although the U-Net is a classic architecture, it struggles to capture inter-slice correlations effectively and is prone to under-segmentation. To address these limitations, various U-Net variants have been developed, including U-Net++, 16 V-Net, 17 nnU-Net, 18 ResU-Net, 19 and DenseU-Net. 20 In the work “Deep Layer Aggregation”, 21 Yu et al. proposed modifications to U-Net's long skip connections and the insertion of new convolutional layers and short skip connections within the encoder and decoder. These modifications enable the model to learn features at different depths, thus avoiding the manual process of determining the optimal model depth. The U-Net++ combines both short and long skip connections to improve feature acquisition and mitigate information loss. The nnU-Net adapts dynamically to various datasets, automatically tuning hyperparameters, thereby enhancing model generalizability and reducing training time and human effort. The V-Net, which employs a 3D convolutional network for prostate volume segmentation, also introduced the Dice loss function to address the imbalance between foreground and background voxels.
However, with increasing clinical demands and the growing complexity of segmentation tasks, relying solely on innovations in model architecture is insufficient to meet the requirements for high-precision segmentation, particularly for abdominal multi-organ segmentation, where data scarcity and annotation costs are pressing issues. Consequently, a series of efficient training strategies have been introduced. Deep supervision, also known as intermediate supervision, involves adding an auxiliary classifier as a branch at certain intermediate hidden layers of a deep neural network to supervise the main network. This technique, 22 introduced by Lee et al. in 2014, helps to ensure thorough training of the model and addresses issues related to vanishing gradients and slow convergence in deep neural networks. However, early applications of deep supervision showed suboptimal performance when the network structure was not sufficiently deep or when the auxiliary classifier was a traditional SVM model, particularly in very deep networks. In 2015, Wang et al. proposed an approach 23 to apply deep supervision in deeper network structures. This approach not only retains the feature representation capability of the main network, enabling more diverse expressions, but also promotes multi-level interactions between different branches through an optimization formula with a probabilistic prediction matching loss. This ensures a more robust optimization process and improved expressive capability. In recent years, studies such as those by Zhu et al. 24 have integrated U-Net with deep supervision for medical image segmentation tasks, achieving notable results. In abdominal multi-organ segmentation tasks, due to the high cost of data annotation and data scarcity, pseudo-labeling techniques are widely used to augment training datasets. Pseudo-labeling involves using the model's predictions as “pseudo” labels so that unlabeled data can be included in the training process.
Research has highlighted a method using pseudo-labeling techniques and noise injection in medical image translation tasks to enhance model generalizability and accuracy. Zhang et al. 25 introduced a pseudo-labeling-based semi-supervised learning method for medical image segmentation, demonstrating the effectiveness of pseudo-labeling techniques in improving model performance through experiments. Arazo et al. 26 explored the application of pseudo-labeling techniques in deep semi-supervised learning, discussing the potential biases introduced by pseudo-labeling techniques and proposing the corresponding solutions. These studies indicate that pseudo-labeling techniques can aid in expanding training datasets and enhancing model performance in abdominal multi-organ image segmentation tasks. In the context of abdominal multi-organ image segmentation, multi-stage training is an effective strategy for handling complex tasks, large-scale data, or multiple sub-tasks. Tao et al. proposed a multi-scale hierarchical training method 27 to address the challenges of high resolution and multi-organ complexity in medical image segmentation tasks. Through staged training, this method effectively improves segmentation accuracy and robustness. Research indicates that multi-stage training can help models better manage complexity and diversity in abdominal multi-organ image segmentation tasks, leading to improved segmentation outcomes.
Inspired by the aforementioned methods, we propose an enhanced multi-stage training approach that integrates variants of U-Net, deep supervision, and pseudo-labeling techniques to address challenges such as data scarcity and difficulties in model training, ultimately enhancing model performance and generalizability in abdominal multi-organ image segmentation tasks.
Methods
To achieve better overall segmentation performance for multiple abdominal organs of different sizes, we developed a lightweight VGG-13 network 28 as the encoder. This was implemented on the U-Net architecture by incorporating attention mechanisms 29 and deep supervision.30,31 Multi-stage training and pseudo-labeling techniques were introduced to enhance model performance. The multi-stage training method involves initially conducting coarse training using a large dataset, followed by fine-tuning with fully labeled data from the same dataset. The pseudo-labeling method generates the first batch of pseudo-labeled data from model predictions using the optimal weights obtained during fine-tuning. Subsequent pseudo-labels are derived from the model trained on the pseudo-labels of the previous generation. The optimal model obtained during the pseudo-labeling training process is used as the final model for the entire training process. 32
Proposed methodology
X-ray CT images are first preprocessed. Intensity normalization is performed by clipping each image to the intensity values within its [1, 99] percentile range. Z-score normalization is then applied based on the mean and standard deviation of intensity values across the entire training dataset. Additionally, 3D CT images are sliced into 2D images according to their channel count. The architecture of the DLAU-Net is shown in Figure 1. The core model consists of a U-Net architecture, integrated with deep supervision learning modules and attention mechanisms. Forward skip connections are maintained from the encoder stages to the corresponding decoder stages. In the encoder stage, we employ the VGG-13 encoder along with batch normalization layers. The fundamental layer module is composed of a 3 × 3 convolutional layer (conv) with a stride of 1 and padding of 1, followed by batch normalization (BN) and rectified linear unit (ReLU) activation. This pattern is repeated twice, followed by 2 × 2 max pooling (MP). The encoder includes a sequence of four such patterns ([conv, BN, ReLU] × 2 + MP), as illustrated in Figure 1. Compared to the original VGG-13, the top layers, including the fully connected layers and softmax, are omitted. The fifth [conv, BN, ReLU] × 2 pattern in the original VGG-13 is utilized as the central part connecting the contracting and expanding paths.
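The two normalization steps described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' implementation: the function names, the linear-interpolation percentile rule, and the flat list of voxel intensities are assumptions made for the example.

```python
import math


def percentile(sorted_vals, q):
    """Linear-interpolation percentile of an ascending list, q in [0, 100]."""
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo = math.floor(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)


def clip_to_percentiles(voxels, low_q=1.0, high_q=99.0):
    """Clip one image's intensities to its own [low_q, high_q] percentile range."""
    ordered = sorted(voxels)
    lo, hi = percentile(ordered, low_q), percentile(ordered, high_q)
    return [min(max(v, lo), hi) for v in voxels]


def zscore(voxels, dataset_mean, dataset_std):
    """Z-score normalization using statistics of the whole training set
    (passed in, not recomputed per image)."""
    return [(v - dataset_mean) / dataset_std for v in voxels]
```

In this sketch the clipping bounds are computed per image, while the mean and standard deviation are supplied by the caller so that they can be computed once over the training set, as the text describes.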

Figure 1. Architecture of the proposed DLAU-Net. In the diagram, up-conv refers to transposed convolution, attention is a combination of the ReLU and Sigmoid functions, and bilinear interpolation is performed.
To achieve a symmetric U-Net architecture, the decoder branch is expanded in a manner similar to the encoder, by adding batch normalization layers and additional feature channels, and incorporating attention modules after the concatenation operations. The added attention module can be expressed as:
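One common way to write an attention gate that matches the ReLU-then-Sigmoid description in the Figure 1 caption is the additive form used in Attention U-Net. The expression below is a sketch under that assumption: the symbols (feature map x, gating signal g, weights W_x, W_g, ψ, and biases b_g, b_ψ) are illustrative names that do not come from the text, and the exact inputs and placement of the module in DLAU-Net may differ.

```latex
\begin{aligned}
q      &= \psi^{\top}\,\mathrm{ReLU}\!\left(W_x x + W_g g + b_g\right) + b_{\psi},\\
\alpha &= \sigma(q),\\
\hat{x} &= \alpha \odot x .
\end{aligned}
```

Here σ is the Sigmoid function, α is a per-position attention coefficient in (0, 1), and the output is the input feature map rescaled element-wise by α.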
After pixel-wise multi-label segmentation, post-processing is conducted. During post-processing, the largest connected region of voxels is retained for each of the 13 organ labels, such as the liver, spleen, and pancreas. Additionally, the network processes axial slices independently to generate 2D segmentation masks, which are subsequently restored to a 3D volume at the end of the process.
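The largest-connected-region step can be illustrated with a breadth-first search over a label volume. The sketch below is an assumption-laden example, not the authors' code: it uses nested Python lists indexed as volume[z][y][x], treats 0 as background, and assumes 6-connectivity, none of which is specified in the text.

```python
from collections import deque

# Face neighbours in 3D (6-connectivity is an assumption of this sketch).
NEIGHBOURS = ((1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1))


def keep_largest_component(volume, label):
    """Return a copy of `volume` where only the largest connected
    component of `label` is kept; other voxels of that label become 0."""
    depth, height, width = len(volume), len(volume[0]), len(volume[0][0])
    seen = set()
    best = []
    for z in range(depth):
        for y in range(height):
            for x in range(width):
                if volume[z][y][x] != label or (z, y, x) in seen:
                    continue
                # Flood-fill one component starting from this voxel.
                component = []
                queue = deque([(z, y, x)])
                seen.add((z, y, x))
                while queue:
                    cz, cy, cx = queue.popleft()
                    component.append((cz, cy, cx))
                    for dz, dy, dx in NEIGHBOURS:
                        nz, ny, nx = cz + dz, cy + dy, cx + dx
                        inside = 0 <= nz < depth and 0 <= ny < height and 0 <= nx < width
                        if inside and volume[nz][ny][nx] == label and (nz, ny, nx) not in seen:
                            seen.add((nz, ny, nx))
                            queue.append((nz, ny, nx))
                if len(component) > len(best):
                    best = component
    keep = set(best)
    return [[[0 if (v == label and (z, y, x) not in keep) else v
              for x, v in enumerate(row)]
             for y, row in enumerate(plane)]
            for z, plane in enumerate(volume)]
```

Applying the function once per organ label (1 to 13) would reproduce the per-organ filtering described above; voxels belonging to other labels are left untouched.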
Model training process
The model training process consists of three main steps, as shown in Figure 2. (Step 1) Multi-stage Training: The training is first conducted in phases. The first phase involves coarse training, where the model learns large-scale features from the images. Fine training follows, enabling the model to grasp finer and more interconnected features. (Step 2) Pseudo-labeling Training: The first-generation pseudo-labeling training is conducted to further train the model on fine features that were not fully learned in the first step. Subsequent iterations of pseudo-labeling training continue to refine the pseudo-labels, progressively enhancing the model's accuracy in recognizing fine features. (Step 3) Optimal Model Selection: Finally, using the evaluation criteria defined in this study, the optimal model weights are identified from those obtained after pseudo-labeling training. This process yields the best-performing model.

Figure 2. Illustration of the model training process.
Multi-stage training
The phase-wise training is divided into two stages as depicted in Figure 3: coarse training and fine training. In the coarse training stage, a dataset comprising 1066 CT scans with fewer than 13 label categories and 250 CT scans with exactly 13 label categories is used as the coarse training dataset. This dataset is employed to train the original DL-attention-U-Net model, aiming to achieve the optimal model for coarse training. In the fine training stage, only the 250 CT scans containing exactly 13 label categories are utilized as the fine training dataset. This fine training dataset is used to further train the model based on the optimal model obtained from the coarse training stage, aiming to achieve the optimal model for fine training.

Figure 3. Coarse and fine training stages in the multi-stage training.
Pseudo-labeling training
Figure 4 shows the process of pseudo-labeling training. It is conducted iteratively and is divided into two stages: the first-generation pseudo-labeling training and the subsequent iterations. The method employs self-training iterations guided by the Dice coefficient. (1) First-Generation Pseudo-labeling Stage: The optimal model from fine training predicts labels for the 1066 CT images with fewer than 13 label categories and the 884 unlabeled CT images. This prediction yields 1950 pseudo-labeled CT masks containing exactly 13 label categories. (2) Subsequent Pseudo-labeling Training Stage: Using the first-generation pseudo-labeled masks and their original images, along with the 250 CT images already labeled with exactly 13 categories, the original DL-attention-U-Net model is trained to obtain the first-generation pseudo-labeling model. (3) Iterative Process: The first-generation pseudo-labeling model then predicts labels for the same 1066 CT images with fewer than 13 label categories and 884 unlabeled CT images, generating 1950 pseudo-labeled CT masks for the second generation. This process repeats iteratively (N times, where N is a hyperparameter) to train subsequent generations of pseudo-labeling models. (4) Final Optimal Pseudo-labeling Model: After N iterations of pseudo-labeling training, the optimal pseudo-labeling model is selected according to the Dice coefficient.
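The iterative procedure in steps (1) to (4) can be summarized as a short self-training loop. The sketch below is only an outline under stated assumptions: train_fn, predict_fn, and dice_fn are hypothetical callables standing in for model training, inference, and validation scoring, and the model is assumed to be retrained from its initial weights in each generation, as the text suggests.

```python
def self_train(labeled, unlabeled, fine_model, train_fn, predict_fn, dice_fn, generations):
    """Iterative pseudo-label self-training.

    labeled     : list of (image, mask) pairs with full annotations
    unlabeled   : list of images without usable full annotations
    fine_model  : model obtained from the fine training stage (first teacher)
    train_fn    : hypothetical callable, trains a fresh model on (image, mask) pairs
    predict_fn  : hypothetical callable, predict_fn(model, image) -> mask
    dice_fn     : hypothetical callable, dice_fn(model) -> validation Dice score
    generations : number of pseudo-labeling generations to run
    Returns the generation model with the highest Dice score and that score.
    """
    teacher = fine_model
    best_model, best_dice = None, float("-inf")
    for _ in range(generations):
        # The current teacher labels every image that lacks a full annotation.
        pseudo = [(image, predict_fn(teacher, image)) for image in unlabeled]
        # A new model is trained on real labels plus the fresh pseudo-labels.
        student = train_fn(labeled + pseudo)
        score = dice_fn(student)
        if score > best_dice:
            best_model, best_dice = student, score
        # The newly trained model becomes the teacher for the next generation.
        teacher = student
    return best_model, best_dice
```

Selecting the final model only among the pseudo-labeling generations follows the description of Step 3 in the training process; whether the fine-training model is also a candidate is not stated in the text.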

Figure 4. The first-generation and the subsequent pseudo-labeling training during the pseudo-labeling training phase.
Loss function
We use the cross-entropy loss function (LCE) in our network, and it is defined as follows:
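For reference, the standard multi-class cross-entropy over N voxels and C classes is usually written as shown below. The symbols are assumptions introduced for this illustration: y_{i,c} is 1 if voxel i belongs to class c and 0 otherwise, and p_{i,c} is the predicted probability of class c at voxel i. Any weighting across deep supervision outputs is not included in this expression.

```latex
L_{CE} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C} y_{i,c}\,\log p_{i,c}
```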
Experiments and results
Dataset
Experimental data are derived from the FLARE 2023 Challenge, which is an extension of FLARE 2021–2022,33,34 and aims at advancing foundational models for abdominal disease analysis. Segmentation labels include 13 organs and different abdominal pathologies. In this study, image data from 2200 CT scans are utilized, with mask data restricted to those covering abdominal organ labels, excluding masks containing only tumor labels. Case-specific classifications within the dataset are detailed in Figure 5. The training data are gathered under license from multiple sources, including TCIA, 35 LiTS, 36 MSD, 37 KiTS,38,39 autoPET,40,41 TotalSegmentator, 42 and AbdomenCT-1K. 43 The training set includes 2200 CT scans. The validation set consists of 50 CT scans officially released during the competition's online validation phase, with annotations for 13 distinct abdominal organs. Organ annotations are generated using ITK-SNAP, 44 nnU-Net, 45 and MedSAM. 46

Figure 5. Case-specific distributions of the FLARE 2023 dataset.
To further evaluate the model's generalizability, extended validation experiments are conducted on the publicly available Automatic Cardiac Diagnosis Challenge (ACDC) dataset (https://www.creatis.insa-lyon.fr/Challenge/acdc/). The ACDC dataset comprises 150 fully annotated clinical cases. In this study, we employ this dataset for a downstream multi-structure segmentation task focusing on three cardiac structures: left ventricle (LV), right ventricle (RV), and myocardium (Myo).
Evaluation metrics
Two widely used evaluation metrics, average accuracy (AVG) and Dice Similarity Coefficient (DSC), are used to assess the segmentation performance of the model. The formulas of these metrics are as follows.
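As an illustration of the DSC, the per-organ Dice score is conventionally computed as twice the overlap between prediction and ground truth divided by the sum of their sizes. The sketch below assumes flattened label lists and returns 1.0 when an organ is absent from both masks; that convention and the function names are assumptions, and the AVG metric is not reproduced here because its definition is not given in the surrounding text.

```python
def dice_coefficient(pred, truth, label):
    """Dice score for one label: 2 * |P ∩ T| / (|P| + |T|).

    pred, truth : equal-length flat sequences of integer labels
    label       : the organ label to score
    """
    pred_count = sum(1 for p in pred if p == label)
    truth_count = sum(1 for t in truth if t == label)
    overlap = sum(1 for p, t in zip(pred, truth) if p == label and t == label)
    if pred_count + truth_count == 0:
        return 1.0  # assumed convention: organ absent in both masks
    return 2.0 * overlap / (pred_count + truth_count)


def mean_dice(pred, truth, labels):
    """Average of the per-label Dice scores over the given labels."""
    return sum(dice_coefficient(pred, truth, lab) for lab in labels) / len(labels)
```

For example, with a prediction [1, 1, 0, 2] and ground truth [1, 0, 0, 2], label 1 has an overlap of one voxel out of three labeled voxels in total, giving a Dice score of 2/3, while label 2 matches exactly and scores 1.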
Experimental setup
The iteration settings for the three main training stages of the experiment are as follows. (1) During the multi-stage training, coarse training involves 100 iterations using 1316 CT images and their mask data. Fine training includes 50 iterations using 250 CT images and their mask data. (2) During the training phase with the attention module, the model is trained on a total of 1316 CT images and mask data for coarse training, and 250 CT images and mask data for fine training. This training process involves a total of 150 iterations. (3) For pseudo-labeling training, the first-generation pseudo-labeling training utilizes 1950 CT images, for which the fine training model generates the first-generation pseudo-labeling masks. These 1950 new pseudo-labeled data and the 250 labeled CT images serve as the training data for 200 iterations. Subsequent pseudo-labeling training involves predicting new pseudo-labels using the model trained with the previous pseudo-labeled data, followed by 50 iterations of training for each subsequent generation. This process is repeated n times (where n is a hyperparameter, set to 5 in this case). Specific hardware and software environment requirements for the experiment are detailed in Tables 1 and 2.
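The scan counts and iteration budgets quoted above can be collected into a small schedule for bookkeeping. The sketch below is one reading of the text rather than the authors' configuration: the dictionary layout and names are hypothetical, and it assumes that each of the n = 5 subsequent generations trains on the same 1950 pseudo-labeled plus 250 labeled scans.

```python
# Scan counts quoted in the text.
PARTIALLY_LABELED = 1066     # fewer than 13 label categories
FULLY_LABELED = 250          # exactly 13 label categories
UNLABELED = 884
SUBSEQUENT_GENERATIONS = 5   # "n" in the text


def build_schedule():
    """Return the training stages in order, with scan counts and iterations."""
    coarse_scans = PARTIALLY_LABELED + FULLY_LABELED
    pseudo_scans = PARTIALLY_LABELED + UNLABELED
    schedule = [
        {"stage": "coarse", "scans": coarse_scans, "iterations": 100},
        {"stage": "fine", "scans": FULLY_LABELED, "iterations": 50},
        {"stage": "pseudo_gen_1", "scans": pseudo_scans + FULLY_LABELED, "iterations": 200},
    ]
    # Assumed reading: n further generations of 50 iterations each.
    for generation in range(2, 2 + SUBSEQUENT_GENERATIONS):
        schedule.append({
            "stage": "pseudo_gen_%d" % generation,
            "scans": pseudo_scans + FULLY_LABELED,
            "iterations": 50,
        })
    return schedule
```

The counts are mutually consistent: 1066 + 250 gives the 1316 coarse-training scans, 1066 + 884 gives the 1950 pseudo-labeled scans, and adding the 250 fully labeled scans gives the 2200 scans of the training set.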
Table 1. Development environments and requirements.
Table 2. Training protocols.
Learning rate and loss weights sensitivity analysis
The learning rate is empirically set to Lr = 10^-5, with loss weights W′ = 1, W1 = 0.8, W2 = 0.7, W3 = 0.6, and W4 = 0.5 in our experiments. We conduct two sets of sensitivity analyses. For learning rate sensitivity, the loss weights are fixed, and the learning rate is increased tenfold (Lr = 10^-4) and decreased tenfold (Lr = 10^-6), respectively. For loss weight sensitivity, the learning rate is fixed, and each loss weight is shifted by ±0.1 (W′ = 1.1, W1 = 0.9, W2 = 0.8, W3 = 0.7, W4 = 0.6 and W′ = 0.9, W1 = 0.7, W2 = 0.6, W3 = 0.5, W4 = 0.4, respectively).
Figure 6 shows the variations of training and validation cross-entropy across epochs under fixed loss weights (1, 0.8, 0.7, 0.6, 0.5). When the learning rate is 10^-5 (orange line), the model demonstrates relatively stable convergence with minimal fluctuations, achieving the lowest cross-entropy values in both training and validation. When the learning rate is 10^-6 (blue line), the model exhibits pronounced fluctuations in both training and validation, poor stability, and high cross-entropy values. The performance of the green line, corresponding to the learning rate of 10^-4, lies between the former two cases.
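The way such loss weights typically enter a deeply supervised network can be sketched as a weighted sum of cross-entropy terms. The example below assumes that W′ weights the main output and W1 to W4 weight four auxiliary outputs, which the text does not state explicitly; it also assumes that every output has already been resampled to the label resolution. The function names are hypothetical.

```python
import math


def cross_entropy(probs, targets):
    """Mean negative log-likelihood.

    probs   : list where probs[i] holds the class probabilities for voxel i
    targets : list of integer class indices, one per voxel
    """
    return -sum(math.log(p[t]) for p, t in zip(probs, targets)) / len(targets)


def deep_supervision_loss(main_probs, aux_probs_list, targets,
                          main_weight=1.0, aux_weights=(0.8, 0.7, 0.6, 0.5)):
    """Weighted sum of the main loss and the auxiliary losses.

    The mapping of W' to the main output and W1..W4 to the auxiliary
    outputs is an assumption of this sketch.
    """
    total = main_weight * cross_entropy(main_probs, targets)
    for weight, aux_probs in zip(aux_weights, aux_probs_list):
        total += weight * cross_entropy(aux_probs, targets)
    return total
```

With these defaults, the two perturbed configurations in the sensitivity analysis correspond to calling the function with main_weight=1.1 and aux_weights=(0.9, 0.8, 0.7, 0.6), or with main_weight=0.9 and aux_weights=(0.7, 0.6, 0.5, 0.4).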

Figure 6. Learning rate sensitivity analysis. (a) The evolution of training cross-entropy with respect to epochs. (b) The evolution of validation cross-entropy with respect to epochs.
A similar trend is observed in Figure 7 when analyzing the loss weight sensitivity at a fixed learning rate (Lr = 10^-5). The loss weight configuration (1, 0.8, 0.7, 0.6, 0.5) shows relatively small fluctuations, with most cross-entropy values remaining at low levels, indicating better generalizability than the two alternative weight setups.

Figure 7. Loss weight sensitivity analysis. (a) The evolution of training cross-entropy with respect to epochs. (b) The evolution of validation cross-entropy with respect to epochs.
In summary, the configuration with Lr = 10^-5 and loss weights (1, 0.8, 0.7, 0.6, 0.5) achieves optimal stability and performance in both training and validation phases.
Ablation experiment
Ablation experiments are conducted on the FLARE 2023 dataset to evaluate the contributions of each component in our method. The baseline model is the classic U-Net architecture. In addition to the baseline, two training modules, multi-stage training and pseudo-labeling training, are incorporated, along with an attention mechanism improvement module. The experiments systematically compare the performance from the baseline model to the final model which includes baseline + multi-stage training + attention + pseudo-labeling training (details shown in Table 3).
Table 3. Experimental results of the ablation experiments.
The baseline model in this study has an organ average AVG value of 63.90% and an average DSC value of 62.40%. Our complete model (baseline + multi-stage training + pseudo-labeling training + attention) achieves the best average AVG value of 90.50% and the best average DSC value of 89.05%, showing significant improvements over the baseline model. Compared with the baseline + pseudo-labeling training + attention model, our final model obtains a remarkable increase of 17.95% in AVG and 17.71% in DSC. This highlights that multi-stage training plays a vital role in enabling the model to effectively learn the features of most organs and in improving model performance. Compared with the baseline + multi-stage training + attention model, our model improves by 7.95% in AVG and 8.94% in DSC, indicating the importance of pseudo-labeling training in enhancing segmentation accuracy by utilizing the unlabeled data. Our proposed method also gains 2.33% in AVG and 1.27% in DSC over the baseline + multi-stage training + pseudo-labeling training model. This demonstrates that the attention structures enhance the salient features for organ segmentation to a certain extent. These results show that by incorporating multi-stage training, pseudo-labeling training, and attention mechanisms, our proposed model effectively learns the large-scale features of the 13 organs, the edge information of organs, and the spatial relationships between organs. Ultimately, our strategy outperforms the other configurations, validating its effectiveness.
Comparative experiment
We compare our method with several classical or advanced models using Dice coefficient as the evaluation metric. The comparison includes U-Net, DenseU-Net, ResU-Net, and nnU-Net. Comparative experiments are conducted using the original model methods as described in their respective papers and the same dataset in our study. Models like U-Net and nnU-Net are classical and serve as benchmarks in the field of medical image multi-organ segmentation, whereas DenseU-Net and ResU-Net are considered more advanced with superior performance in their model methodologies.
Table 4 summarizes the DSC metric for different methods in abdominal multi-organ segmentation. The results show that our method outperforms the others overall. We achieve a DSC value of 89.05%, which is higher than that of the traditional U-Net by 6.29%, and exceeds DenseU-Net (which emphasizes fine-grained features) and ResU-Net (which focuses on model stability and depth) by 4.9% and 2.44%, respectively. Although our DSC is slightly lower than that of nnU-Net on some smaller and label-scarce organs, our model's overall DSC surpasses that of nnU-Net by 0.09%. This indicates that our approach is competitive with, and marginally better overall than, nnU-Net in the task of abdominal multi-organ segmentation.
Table 4. The experimental results of different methods in terms of DSC metric.
The qualitative comparison of the abdominal multi-organ segmentation results is shown in Figure 8, illustrating the performance of different models. In comparison to the U-Net, DenseU-Net, and ResU-Net models, our method achieves the best segmentation performance across the 13 organs. The classical U-Net model performs least effectively in segmenting small, coherent organs such as the pancreas, adrenal glands, gallbladder, and esophagus, almost failing to learn their distinctive features. While the DenseU-Net and ResU-Net models show some improvement in learning features of these organs, they still struggle with precise edge segmentation of these small organs. Compared to the nnU-Net model, our method performs better on the majority of organs, although it slightly lags behind in segmenting organs like the inferior vena cava, left and right adrenal glands, gallbladder, and stomach. The nnU-Net shows marginally better segmentation results for organs like the inferior vena cava and adrenal glands, but it exhibits noticeable redundancy and lacks clear learning of organ edge information. Overall, for large or structurally simple organs, such as the liver, spleen, inferior vena cava, aorta, and left kidney, our model achieves highly overlapping segmentation with ground-truth masks, along with sharper and more precise boundaries. In contrast, other models exhibit minor inaccuracies in regional segmentation and boundary blurring. For small or anatomically complex organs, including the pancreas, gallbladder, right adrenal gland, stomach, and duodenum, our model outperforms U-Net, DenseU-Net, and ResU-Net in capturing fine details and organ edges. While nnU-Net also demonstrates competitive performance for some small organs, its segmentation suffers from boundary redundancies and reduced edge clarity. Our approach overcomes these limitations, demonstrating its ability to learn spatial relationships and edge features among these small and coherent organs.

Figure 8. Representative results of our method and several state-of-the-art models for abdominal multi-organ segmentation.
Generalization experiment
Generalization experiments are conducted on the ACDC dataset, with comparative validation against TransUNet [47] and Swin-UNet [48]. The dataset is partitioned into three subsets: 100 cases for training, 30 for validation, and 20 for testing. The experimental results are reported in Table 5. TransUNet achieves the highest accuracy for left ventricle (LV) segmentation, our method obtains the best accuracy for right ventricle (RV) segmentation, and Swin-UNet yields the highest accuracy for myocardium (Myo) segmentation. Notably, our method achieves the best average DSC among the three approaches, along with more balanced performance across all three regions (LV, RV, and Myo). This is consistent with its robust performance on the FLARE2023 challenge dataset. Table 5 also reports the model complexity (parameter count, Params) of each model. Our model has the fewest parameters, which suggests an advantage in computational efficiency.
Table 5. DSC scores of different models on the ACDC dataset.
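For reference, the per-region DSC and its average can be illustrated with a minimal sketch. It is not the evaluation code used in this study: the function names, the flattened label maps, the label order (1 = RV, 2 = Myo, 3 = LV), and the convention for empty classes are all assumptions made for illustration.

```python
from typing import Dict, Sequence


def dice_per_class(pred: Sequence[int], truth: Sequence[int],
                   labels: Sequence[int]) -> Dict[int, float]:
    """Return DSC = 2|P & G| / (|P| + |G|) for each label.

    `pred` and `truth` are flattened label maps of equal length.
    """
    if len(pred) != len(truth):
        raise ValueError("pred and truth must have the same length")
    scores: Dict[int, float] = {}
    for label in labels:
        n_pred = sum(1 for v in pred if v == label)
        n_truth = sum(1 for v in truth if v == label)
        n_both = sum(1 for p, t in zip(pred, truth) if p == label and t == label)
        denom = n_pred + n_truth
        # Assumed convention: a class absent from both maps scores 1.0.
        scores[label] = 1.0 if denom == 0 else 2.0 * n_both / denom
    return scores


def mean_dice(scores: Dict[int, float]) -> float:
    """Unweighted average of the per-class DSC values."""
    return sum(scores.values()) / len(scores)


# Toy example with hypothetical labels: 0 = background, 1 = RV, 2 = Myo, 3 = LV.
toy_pred = [0, 1, 1, 2, 2, 3, 3, 0]
toy_truth = [0, 1, 1, 2, 3, 3, 3, 0]
toy_scores = dice_per_class(toy_pred, toy_truth, labels=[1, 2, 3])
print(toy_scores, mean_dice(toy_scores))
```

Averaging per-region scores without weighting treats small and large structures equally, which is why balanced performance across LV, RV, and Myo raises the average.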
Discussion
This study proposes a novel training framework and network model to improve the accuracy and training speed of abdominal multi-organ segmentation when labeled data and training time are limited. To this end, we design a segmentation training framework based on multi-stage training and pseudo-labeling, and introduce the DLAU-Net network model to address the challenges of abdominal multi-organ segmentation.
The DLAU-Net model consists of a U-Net structure, deep supervision, and attention modules. Compared with the traditional U-Net, the DLAU-Net model incorporates an attention mechanism that allows the model to focus dynamically on the more important regions while ignoring irrelevant background and noise. This improves the model's precision on organ boundaries and fine-grained targets. The deep supervision modules improve gradient propagation and increase the diversity of the learned features. By introducing auxiliary losses at different layers, deep supervision mitigates the vanishing gradient problem, ensuring that each layer receives a sufficient gradient signal to learn low-level image features. The network thus learns multi-level features across different layers, which supports a more comprehensive understanding of the input data and improves the model's overall representational capability.
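The idea of auxiliary losses at several decoder levels can be sketched as a weighted sum of per-level losses. This is an illustrative sketch only, not the DLAU-Net implementation: the soft Dice loss, the halving weights, and the function names are assumptions, and each output is represented as a flat list of foreground probabilities rather than a tensor.

```python
from typing import List, Optional, Sequence


def soft_dice_loss(probs: Sequence[float], target: Sequence[float],
                   eps: float = 1e-6) -> float:
    """1 - soft Dice between predicted probabilities and a binary target."""
    if len(probs) != len(target):
        raise ValueError("probs and target must have the same length")
    inter = sum(p * t for p, t in zip(probs, target))
    denom = sum(probs) + sum(target)
    return 1.0 - (2.0 * inter + eps) / (denom + eps)


def deep_supervision_loss(outputs: Sequence[Sequence[float]],
                          targets: Sequence[Sequence[float]],
                          weights: Optional[List[float]] = None) -> float:
    """Weighted sum of losses over decoder levels.

    outputs[0] is the final full-resolution prediction; later entries are
    auxiliary outputs, each paired with a target at the same resolution.
    """
    if len(outputs) != len(targets):
        raise ValueError("one target is required per output")
    if weights is None:
        # Assumed scheme: halve the weight at each deeper level, then normalize.
        raw = [0.5 ** i for i in range(len(outputs))]
        weights = [w / sum(raw) for w in raw]
    return sum(w * soft_dice_loss(o, t)
               for w, o, t in zip(weights, outputs, targets))


# Toy example: a perfect final output and a wrong auxiliary output.
final_out, final_gt = [1.0, 0.0, 1.0], [1.0, 0.0, 1.0]
aux_out, aux_gt = [0.0, 1.0], [1.0, 0.0]
print(deep_supervision_loss([final_out, aux_out], [final_gt, aux_gt]))
```

Because every level contributes its own term, the shallow decoder layers receive a gradient directly from their auxiliary loss instead of only through the final output.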
Because abdominal multi-organ segmentation involves a large variety of organ features, it is difficult to achieve high accuracy with a robust network model alone. Therefore, we combine the DLAU-Net model with a training framework that integrates multi-stage training and pseudo-labeling strategies. According to the ablation study, incorporating the multi-stage training strategy improves the DSC by 16.73%, and adding pseudo-labeling training increases the DSC by a further 9.92%. These improvements indicate that multi-stage training helps the model converge quickly by learning the general features of abdominal organs in the early stages of training, while pseudo-labeling allows the model to learn finer, more detailed features, further enhancing overall accuracy.
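One common way to turn model predictions on unlabeled scans into pseudo-labels is confidence thresholding, sketched below. It is not necessarily the rule used in this study: the threshold value, the ignore index, the function names, and the per-voxel probability layout are all assumptions made for illustration.

```python
from typing import List, Sequence

IGNORE = -1  # hypothetical index for voxels excluded from the loss


def make_pseudo_labels(probs: Sequence[Sequence[float]],
                       threshold: float = 0.9) -> List[int]:
    """Assign each voxel its most probable class if the model is confident.

    `probs[i]` holds the class probabilities predicted for voxel i by a
    model trained in an earlier stage. Low-confidence voxels get IGNORE.
    """
    labels: List[int] = []
    for voxel in probs:
        best = max(range(len(voxel)), key=voxel.__getitem__)
        labels.append(best if voxel[best] >= threshold else IGNORE)
    return labels


def confident_fraction(labels: Sequence[int]) -> float:
    """Share of voxels that received a pseudo-label (useful for case selection)."""
    if not labels:
        return 0.0
    return sum(1 for label in labels if label != IGNORE) / len(labels)


# Toy example: three voxels, two classes.
toy_probs = [[0.95, 0.05], [0.60, 0.40], [0.10, 0.90]]
toy_labels = make_pseudo_labels(toy_probs)
print(toy_labels, confident_fraction(toy_labels))
```

In a scheme of this kind, the pseudo-labeled cases are added to the training set for a later stage, and voxels marked IGNORE are left out of the loss so that uncertain predictions are not reinforced.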
Comparative experiments and the qualitative analysis further confirm the effectiveness of our method. These results indicate that our method is superior to the existing methods in quantitative metrics, and it also has the potential to provide more precise and reliable segmentation results in practical applications.
Despite the promising results achieved in abdominal multi-organ segmentation, the performance of our proposed method on small and highly deformable organs remains suboptimal. Future work could validate its generalization on more diverse datasets or develop targeted modules for segmenting these challenging organs.
Conclusion
In this paper, we develop a novel multi-organ segmentation method specifically designed for abdominal CT images. By combining the DLAU-Net network, pseudo-labeling training, and a multi-stage training strategy, the method converges rapidly and delivers accurate multi-organ segmentation on the FLARE2023 dataset. It provides a novel and effective approach for clinical scenarios such as abdominal disease diagnosis, cancer treatment, and radiotherapy planning. Despite these commendable segmentation results, the method has limitations in segmenting smaller and strongly adherent organs, such as the inferior vena cava and adrenal glands. Moving forward, our team will focus on targeted research for these challenging organs and explore advanced techniques such as domain adaptation, federated learning, and multi-model ensemble methods.
Footnotes
Funding
This research was funded in part by the National Natural Science Foundation of China under Grant Nos. 61902282 and 62071330.
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
