Abstract
Background:
Breast cancer diagnosis relies on accurate lesion segmentation in medical images. Automated computer-aided diagnosis reduces clinician workload and improves efficiency, but existing image segmentation methods face challenges in model performance and generalization.
Objective:
This study aims to develop a generative framework using a denoising diffusion model for efficient and accurate breast cancer lesion segmentation in medical images.
Methods:
We design a novel generative framework, PalScDiff, that leverages a denoising diffusion probabilistic model to reconstruct the label distribution for medical images, thereby enabling the sampling of diverse, plausible segmentation outcomes. Specifically, conditioned on the corresponding image, PalScDiff learns to estimate the mass-region probability through step-by-step denoising. Furthermore, we design a Progressive Augmentation Learning strategy to incrementally handle the segmentation challenges posed by irregular and blurred tumors. Moreover, multi-round sampling is employed to achieve robust breast mass segmentation.
Results:
Our experimental results show that PalScDiff outperforms established models such as U-Net and transformer-based alternatives, achieving an accuracy of 95.15%, precision of 79.74%, Dice coefficient of 77.61%, and Intersection over Union (IoU) of 81.51%.
Conclusion:
The proposed model demonstrates promising capabilities for accurate and efficient computer-aided segmentation of breast cancer.
Keywords
Introduction
Breast cancer is the most common cancer among women, with increasing global incidence rates. Malignant cells in the mammary glands can easily infiltrate neighboring tissues and metastasize through vascular and lymphatic pathways [40]. Several factors can increase breast cancer risk, including age, genetics, family history, diet, alcohol consumption, obesity, lifestyle, physical inactivity, and endocrine factors [13]. The complex pathogenesis of breast cancer makes effective prevention difficult. The five-year relative survival rate for breast cancer is 99% when detected early, while the five-year survival rate for late-stage disease is only 10%. Therefore, early detection and diagnosis of breast cancer is the key to improving patient survival [39]. Ultrasound scanning is a common breast cancer screening tool due to its non-invasive, radiation-free, and low-cost characteristics [18, 46]. It employs sound waves to show internal breast structure, revealing tumor location, size, and shape to assist in diagnosis. To analyze disease progression, doctors manually outline abnormal regions and inspect lesion features. They then design therapeutic plans based on these analyses. Segmenting lesions is therefore a vital step. However, manual annotation is extremely labor-intensive and time-consuming. Computer-aided diagnosis can automate analysis to reduce clinician workload and boost efficiency [42].

Breast ultrasound images with expert annotations from the BUSI dataset are displayed. The figure shows six benign samples in the left two columns and six malignant samples in the right two columns. Evident labeling biases can be observed in the experts’ manual annotations.
Several studies have designed automated systems for breast cancer diagnosis through image processing and machine learning [24, 54]. Traditionally, methods apply filters, morphological operations, thresholding, and edge detectors to roughly locate malignant regions based on pixel information [35]. However, these preliminary segmentation results then require extensive manual correction by experts, demanding substantial time and specialized knowledge. Improving upon prior methods, machine learning solutions employ feature engineering to extract texture, shape, density and other relevant patterns. These features are fed into classifiers that automatically label and outline tumor areas [7]. Compared to strict rules from image processing, models like support vector machines and random forests better handle variability across complex cases. Such systems can also be continually optimized to boost segmentation efficacy [31]. However, machine learning methods are highly dependent on hand-designed features, resulting in poor generalization.
The advent of deep learning has provided superior solutions for clinical decision support, especially in breast tumor segmentation. Deep learning methods leverage architectures such as convolutional neural networks (CNNs), taking raw pixels as input and predicting pixel-level segmentation masks [17]. They obviate manual feature engineering by automatically learning hierarchical features and patterns directly from the images. Typical networks like U-Net [37] and Mask R-CNN [14] can be trained end-to-end to locate and segment lesions given sufficient annotated datasets [57]. Compared to traditional and machine learning methods, deep learning can effectively handle complex conditions and continually optimize performance with new data. Current research focuses on designing efficient architectures and acquiring sizable high-quality annotated datasets to further boost segmentation efficacy.
However, most current algorithms employ discriminative models for breast lesion segmentation, exhibiting three key limitations. First, their high data dependency for learning features and patterns leads to poorer generalization on novel inputs. Second, relying solely on discriminative cues for deterministic predictions ignores the overall data distribution, reducing sample utilization efficiency while heightening sensitivity to subtle feature variations that introduce substantial errors. Finally, discriminative models have restricted uncertainty modeling capabilities, generating point estimates without effectively quantifying predictive confidence.
Denoising diffusion probabilistic models (DDPMs), as the latest representatives of generative models, have attracted considerable attention and achieved remarkable progress in computer vision [16, 25, 49]. Diffusion models gradually perturb data points with Gaussian noise, then learn a denoising procedure that reverses the diffusion process step by step from noise to pristine data. This enables the sampling of diverse, high-quality outputs. DDPMs have demonstrated superior generative capabilities for tasks like image synthesis and reconstruction [2, 50]. In particular, score-matching-based DDPMs have achieved state-of-the-art semantic segmentation by refining predictions over multiple iterations, unlike single-step discriminative models. Despite limited labels, diffusion models exhibit strong generalization. For medical image segmentation tasks such as tumor segmentation, DDPMs have shown pronounced advantages by evaluating instance-wise uncertainty through sampling [44]. This improves output reliability and reduces misdiagnosis risks, rendering them well-suited for clinical adoption.
Drawing on DDPM, we introduce a novel diffusion-based framework for breast tumor segmentation. This approach is specifically designed to tackle prevalent issues such as the irregular geometries and indistinct edge features of malignant tumors, as well as the conventional limitations in capturing the comprehensive data distribution of tumor visual traits. First, we construct a conditional diffusion model built upon DDPM, which can perform stochastic sampling walks in the image space to produce diverse reasonable segmentations. Furthermore, we devise an Inter-branch Prediction Alignment module to reduce the randomness in predictions. This is achieved by introducing semantic consistency constraints that encourage the model to generate more consistent distributions in the two branches. Moreover, some malignancies exhibit irregular shapes and blurred boundaries, making it difficult for the model to rapidly glean salient features from such hard cases. This impedes precise tumor delineation. Therefore, we devise a progressive data augmentation strategy that incrementally tutors the model from simple to challenging instances. Experiments demonstrate superior performance over U-Net and transformer-based architectures, corroborating the framework’s capacity for robust breast lesion characterization amid variability.
The contributions of this work can be summarised as follows:
We propose a diffusion-based framework for breast tumor segmentation. This generative perspective enables more effective learning and representation of intricate global image distributions, facilitating the sampling of diverse yet plausible segmentation outputs.
We introduce an inter-branch prediction alignment module with semantic consistency constraints to reduce output randomness and promote greater coherence in segmentations.
We devise a progressive data augmentation strategy to incrementally handle samples ranging from simple to complex cases, tackling challenges posed by irregular tumor shapes and blurred margins.
Breast mass segmentation
Several studies have focused on improving the accuracy of breast tumor segmentation. Shen et al. [38] introduced an improved U-net and trained it on the previous augmented dataset for breast mass segmentation. Liu et al. [24] combined edge-based features and morphologic feature information using a support vector machine to detect breast tumors in ultrasound images. Wilfrido et al. [10] evaluated four CNN-based semantic segmentation models, including Fully Convolutional Network (FCN) with AlexNet network, U-Net network, SegNet using VGG16 and VGG19 networks, and DeepLabV3+ using ResNet18, ResNet50, MobileNet-V2, and Xception networks. The segmentation models based on SegNet and DeepLabV3+ obtain the best results. Additionally, Or et al. [3] focus on using deep neural network architecture to address challenges in super-resolution ultrasound imaging of the microvasculature in breast lesions. Peng et al. [30] apply the semisupervised fuzzy clustering algorithm to lesion segmentation in mammography. Zhang et al. [55] proposed a boundary-oriented network (BO-Net) that enhances tumor segmentation by capturing weak boundaries and extracting multi-scale and efficient feature information. Radhi and Kamil [33] developed an augmented U-Net model that showed superior segmentation results for breast ultrasound images, with high accuracy and IoU values. These advancements in ultrasound imaging techniques contribute to the early and accurate diagnosis of breast cancer.
Advanced algorithms based on deep learning have become the dominant technique for breast mass segmentation. Li et al. [22] proposed the RGC-ASPP-Net model, which enhances the segmentation of breast X-ray images by considering contextual information related to breast masses, leading to improved accuracy in computer-aided diagnosis systems. Xiangqi et al. [27] incorporated weakly supervised constraint loss for training an MRI breast mass segmentation network using partial annotations, considering different sizes of masses while suppressing noise through an end-to-end differentiable noise suppression loss. Kangning et al. [23] performed coarse-grained localization to select regions of interest and subsequently conducted fine-grained segmentation in high-resolution images, for breast cancer diagnosis in screening mammography. Xiang et al. [53] devised a multi-threshold segmentation approach to delineate breast masses and extract connected components for mass segmentation and identification. Qi et al. [32] employ a two-stage end-to-end architecture with a trunk sub-network for multi-scale feature selection and a structurally optimized refinement sub-network for mitigating impairments such as noise and inter-subject variation via better feature exploration and fusion. Eduardo et al. [6] leveraged three symmetry regularization strategies to enhance the generalization capability of breast cancer screening models, addressing the challenge of sparse data. These latest works have advanced the field through novel deep network designs [32, 52], weakly and self-supervised paradigms [28], multi-task learning [21, 42, 52], and model optimization approaches [6, 27]. However, existing methods predominately employ discriminative learning methods that are limited in modeling the whole distribution. In contrast to discriminative models, our diffusion framework enables more effective learning and representation of intricate global image distributions.
Diffusion models
Diffusion models have garnered substantial interest owing to their capacity to capture intricate data distributions through iterative diffusion and denoising, thereby enabling high-fidelity image synthesis and latent space modeling [9, 36]. Compared to GANs, diffusion models boast enhanced training stability and superior mode coverage. They have been shown to achieve image sample quality superior to the current state-of-the-art generative models. By comprehending the Markovian dynamics involved in the reverse diffusion process from Gaussian noise to data [16], diffusion models have achieved remarkable success on various computer vision tasks including super-resolution, inpainting, and colorization [43]. This stems from their ability to glean structural information latent in incomplete observations from full data. Additionally, diffusion models employ continuous latent variables, circumventing mode collapse, a common training deficiency of GANs. Current research is focused on extending diffusion models beyond generative modeling to discriminative tasks such as semantic/instance segmentation. These innovations provide new opportunities for diffusion models in computer vision.
Diffusion models based on denoising score matching have achieved excellent performance for medical image segmentation. DDPM is trained to denoise noisy data by learning a score function that matches gradient directions for clean and noisy images. At test time, stochastic gradient descent on the score can invert the diffusion process to generate high-quality samples. For breast tumor segmentation, DDPM can model the distribution of realistic lesions and sample varied segmentation masks [44]. By leveraging the generative modeling capabilities, DDPM can avoid some limitations of discriminative models and provide useful uncertainty information. Current research on diffusion models for medical imaging is focused on improving sampling efficiency and segmentation accuracy [19]. Recent studies have witnessed promising progress in leveraging diffusion models for discriminative tasks such as image segmentation [12].
Diffusion models for image segmentation
Numerous works have explored the application of conditioned diffusion models to semantic segmentation tasks and have made remarkable progress. Baranchuk et al. [4] conducted research revealing that during the Markov chain steps of the reverse diffusion process, the intermediate activation in neural networks contains rich semantic information. This finding underscores the utility of diffusion models for semantic segmentation. Wolleb et al. [45] proposed a semantic segmentation method based on diffusion models, achieving lesion segmentation in medical images. Their approach utilizes image priors to generate image-specific segmentation and leverages random sampling to compute pixel-level segmentation uncertainty maps. Gu et al. [11] introduced the DiffusionInst framework, which represents instances as instance-aware filters, expanding the applicability of diffusion models to instance segmentation tasks. Wu et al. further improved diffusion segmentation models with MedSegDiff and MedSegDiff-V2 [47, 48], achieving state-of-the-art performance in medical image segmentation. Rahman et al. [34] explored the application of diffusion models in segmenting ambiguous images, harnessing the power of expert groups to generate multiple credible outputs by learning the distribution of collective insights. Xing et al. [51] introduced the Diff-UNet model, which applies diffusion models to 3D volume segmentation, including tasks like multi-modal brain tumor segmentation in MRI, liver tumor segmentation, and multi-organ CT volume segmentation. Ma et al. [26] proposed DiffusionSeg, which employs diffusion models for unsupervised object discovery, saliency segmentation, and object localization. Their framework consists of a synthesis-exploitation approach, addressing data insufficiency by synthesizing abundant images and using AttentionCut for mask generation. 
In the exploitation stage, inversion techniques map images back to diffusion features, which downstream architectures can directly utilize. Kim et al. [20] introduced the DARL model, a diffusion adversarial representation learning approach. It leverages a denoising diffusion probabilistic model with adversarial learning for self-supervised vessel segmentation, effectively capturing vessel-related semantic information. Tang et al. [41] proposed the MGCC framework, which utilizes latent diffusion models to generate synthetic medical images for semi-supervised learning, reducing the burden of manual annotation. The framework employs a two-stage approach, with the first stage generating synthetic medical images and the second stage enhancing representation by introducing global context noise perturbations. In summary, the applications of diffusion models in various image analysis tasks, ranging from semantic segmentation to object discovery and vessel segmentation, have demonstrated their effectiveness in improving image segmentation.
Methods
Conditional diffusion model
DDPMs constitute a burgeoning class of deep generative architectures, attaining state-of-the-art performance in manifold image processing tasks. The diffusion model includes two primary phases: forward diffusion and reverse denoising. In the forward pass, the data distribution gets gradually diffused via the imposition of Gaussian noise to disturb the salient structures. By learning to reverse this diffusion process, the model implicitly captures likelihoods while denoising noisy inputs back to segmentation outputs.
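The forward phase described above can be illustrated with a minimal sketch. It assumes the standard DDPM closed form, in which a noisy mask at step t is a weighted sum of the clean mask and Gaussian noise, and a linear variance schedule. The schedule endpoints (1e-4 and 0.02) and the function names `linear_beta_schedule`, `cumulative_alphas`, and `q_sample` are illustrative assumptions, not details taken from the paper.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances (endpoints are assumed, not from the paper)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """Return alpha_bar_t, the running product of (1 - beta_s) for s <= t."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def q_sample(x0, t, alpha_bars, rng):
    """Draw a noisy mask x_t from q(x_t | x_0); x0 is a flat list, t is 0-indexed.

    Returns both x_t and the noise that was added, since the noise is the
    regression target of the denoising network.
    """
    signal = math.sqrt(alpha_bars[t])
    sigma = math.sqrt(1.0 - alpha_bars[t])
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [signal * v + sigma * e for v, e in zip(x0, noise)]
    return xt, noise
```

With 1000 steps, as used later in the experiments, the cumulative product at the final step is close to zero, so the fully diffused mask is nearly indistinguishable from pure Gaussian noise, which is what allows sampling to start from noise at test time.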

An illustration of the proposed method. (a) Forward diffusion: by gradually adding noise to disturb the mask to Gaussian distribution; Reverse diffusion: Progressively denoising to infer the mask distribution. (b) The overall architecture of PalScDiff. The conditional encoder extracts salient visual features from the ultrasound image to provide context information. The reverse diffusion adopts the U-Net framework (as shown in the green box) that fuses condition features to denoise noisy mask from step t + 1 to t. A condition decoder is used to estimate the segment objective.
Specifically, PalScDiff adds a sequence of Gaussian noises to the input mask in the forward pass and, in the reverse pass, learns to predict that noise in order to reconstruct the breast tumor mask. Given a training mask $x_0$ with corresponding condition image
The reverse process transforms the latent variable distribution $p_\theta(x_T)$ to the data distribution $p_\theta(x_0)$. It trains a predictor $\epsilon_\theta$ to predict the noise $\epsilon$ by the following objective:
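For reference, the two phases can be written in the standard DDPM form. This is a sketch under the assumption that PalScDiff follows the usual formulation. The variance schedule $\beta_t$, its cumulative product $\bar{\alpha}_t$, and the symbol $c$ for the condition image are notation introduced here for illustration rather than taken from the paper.

```latex
% Forward diffusion (standard DDPM form, assumed)
q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t \mathbf{I}\right),
\qquad
x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,
\quad \epsilon \sim \mathcal{N}(0, \mathbf{I}),
\quad \bar{\alpha}_t = \prod_{s=1}^{t} (1-\beta_s)

% Simplified noise-prediction objective, conditioned on the image c (assumed symbol)
\mathcal{L}_{\text{noise}} = \mathbb{E}_{x_0,\,\epsilon,\,t}
\left[ \left\lVert \epsilon - \epsilon_\theta(x_t, c, t) \right\rVert^2 \right]
```

Under this form, the predictor only needs the noisy mask, the condition image, and the step index as inputs, which matches the description of the reverse process above.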

The process of the progressive augmentation learning strategy. (a) Stage 1: easily segmented samples with clear lesion boundaries are used to pre-train the initial model. (b) Stage 2: difficult samples are split into three groups and augmented with techniques such as noise addition and deformation. The augmented data then incrementally trains the pre-trained model to handle hard cases. This progressive curriculum facilitates precise segmentation even for irregular and ambiguous tumors.
In diffusion-based semantic segmentation, each pixel is treated as a discrete unit and categorized exclusively. However, tumors often comprise multiple contiguous pixels. Independent per-pixel prediction overlooks the correlations and coherence within the tumor region, resulting in segmentation outputs with high randomness and discontinuity. To address the issue of randomness and inconsistency stemming from isolated pixel-wise prediction, inspired by consistency regularization in weakly supervised learning, we introduce the Inter-branch prediction alignment (IBPA) module that aims to alleviate the high variance in predictions caused by poor annotation quality.
Specifically, the IBPA employs a multi-branched network architecture with two parallel predictors that share the encoder and decoder features of the diffusion model but not the prediction layer parameters. These predictors are composed of a 1 × 1 convolutional layer with groups, batch normalization, and a non-linear activation function. We obtain the segmentation outputs from each branch separately and incorporate a consistency constraint between them in the loss function that penalizes divergent predictions for the same input, encouraging the two predictors to converge to more unified outputs. We adopt KL divergence and Binary Cross-Entropy loss to achieve consistency constraint, which can be defined as:
The IBPA module offers a lightweight solution to enhance the coherence of segmentation networks. By incorporating consistency constraints between multiple decoder branches, Inter-Branch Prediction Alignment can effectively reduce fragmentation in segmentation outputs, enabling higher continuity and uniformity within tumor regions and other targets of interest. This helps improve overall segmentation quality, especially for medical imaging tasks where continuity and consistency are crucial for model performance. The proposed technique aids in overcoming challenges diffusion models may face when operating on discrete pixel data, to better capture geometric structure and contiguity information of targets. IBPA imposes regional consistency that is faithful to semantic layouts. The multi-branch regularization provides a simple yet effective approach to make diffusion model predictions more cohesive without drastic architectural changes. The seamless integration equips the model with the capacity to produce segmentations with improved consistency and integrity.
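The consistency constraint between the two branches can be sketched as follows. The text states only that KL divergence and Binary Cross-Entropy are combined, so the exact form here is an assumption: a symmetric per-pixel Bernoulli KL term plus a symmetric BCE term in which each branch serves as a soft target for the other. The function names and the two weights are hypothetical.

```python
import math

EPS = 1e-7  # numerical guard so that log() never receives 0


def _clip(p):
    """Keep a probability strictly inside (0, 1)."""
    return min(max(p, EPS), 1.0 - EPS)


def bernoulli_kl(p, q):
    """KL(Bernoulli(p) || Bernoulli(q)) for a single pixel."""
    p, q = _clip(p), _clip(q)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))


def bce(target, pred):
    """Binary cross-entropy of `pred` against a (possibly soft) target."""
    pred = _clip(pred)
    return -(target * math.log(pred) + (1.0 - target) * math.log(1.0 - pred))


def alignment_loss(branch_a, branch_b, kl_weight=1.0, bce_weight=1.0):
    """Mean per-pixel disagreement between two branch probability maps.

    Assumed form (not confirmed by the paper): symmetric KL plus symmetric
    BCE, each averaged over pixels and over the two directions.
    """
    n = len(branch_a)
    pairs = list(zip(branch_a, branch_b))
    kl = sum(bernoulli_kl(a, b) + bernoulli_kl(b, a) for a, b in pairs) / (2 * n)
    ce = sum(bce(a, b) + bce(b, a) for a, b in pairs) / (2 * n)
    return kl_weight * kl + bce_weight * ce
```

In this sketch the KL term vanishes when the two branches agree exactly, whereas the BCE term reduces to the prediction entropy, so it additionally pushes both branches toward confident outputs. Whether the original loss has this second effect depends on how its BCE term is defined.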
Due to the irregular shapes and blurred boundaries of malignant tumors in some images, it remains challenging for models to swiftly capture the effective knowledge from these hard samples, resulting in inaccurate tumor segmentation. The main difficulty stems from the ambiguous boundaries of malignant tumors, which may be attributed to various factors including size, shape, location of tumors and image quality. To address this issue, we propose a progressive augmentation learning strategy that guides the model to learn the problematic samples in a dataset incrementally from easy to hard. This strategy equips our diffusion model with the capability to accurately predict the mask distribution for blurred images.
Small target size: Fine-grained features cannot be sufficiently learned from tiny target regions.
Complex shapes: Highly variable and irregular shapes are hard to model.
Poor image quality: Factors like noise, blur and low contrast interfere with segmentation.
Ambiguous boundaries: Unclear edges between targets and background lead to segmentation errors.
Inaccurate annotations: Wrong ground truth labels result in learning bias.
Grounded on these perspectives, we consider sample difficulty in the dataset and collect challenging cases with blurred edges, complex morphology and labeling errors to build a hard sample pool. Specifically, at the initial training stage, we pre-train the diffusion model on easy samples to acquire basic segmentation capabilities. We then progressively introduce simple instances from the difficult pool, which have small target areas and slightly blurred edges. The model is fine-tuned on these with a lower learning rate to gradually adapt. As training continues, the ratio of difficult samples is steadily increased, incorporating those with more obscured boundaries, annotation errors, and complex morphologies. Data augmentation via flipping and scaling is utilized to enrich sample diversity and enhance model generalization. Furthermore, noise injection and boundary erosion are applied to synthesize blurred data. The curriculum schedule modulates difficulty by controlling the blurring degree.
By emphasizing these hard examples during training, the model learns robust features and policies to handle tough cases frequently encountered in real-world applications. This incremental learning approach enables robust feature learning and precise segmentation of malignant tumors.
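The easy-to-hard schedule described above can be sketched as a small curriculum helper. This is an illustrative assumption rather than the authors' implementation: the linear ramp, the cap of 0.5 on the hard-sample ratio, and the names `curriculum_state` and `sample_batch` are all hypothetical.

```python
import random


def curriculum_state(stage, num_stages, max_hard_ratio=0.5, max_blur=1.0):
    """Return (hard_ratio, blur_degree) for a 0-indexed training stage.

    Stage 0 corresponds to pre-training on easy samples only; both the share
    of hard samples and the synthetic blurring degree then grow linearly.
    """
    if num_stages < 2:
        raise ValueError("need at least two stages")
    progress = stage / (num_stages - 1)
    return max_hard_ratio * progress, max_blur * progress


def sample_batch(easy_pool, hard_pool, batch_size, hard_ratio, rng):
    """Mix easy and hard samples according to the current hard-sample ratio."""
    n_hard = min(round(batch_size * hard_ratio), len(hard_pool))
    n_easy = batch_size - n_hard
    # Hard samples are drawn without replacement, easy ones with replacement.
    return rng.sample(hard_pool, n_hard) + rng.choices(easy_pool, k=n_easy)
```

For example, with one pre-training stage followed by three hard-sample groups (four stages in total), `curriculum_state(0, 4)` gives an easy-only batch and `curriculum_state(3, 4)` gives the maximum share of hard samples and the strongest blurring.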
The model is trained with two loss terms: the diffusion loss $L_{\text{diff}}$ and the alignment loss $L_{\text{Align}}$. The diffusion loss $L_{\text{diff}}$ ensures reliable sample generation, containing noise prediction loss, Mean Square Error (MSE) and Binary Cross Entropy (BCE) loss. It can be formulated as:
The alignment loss $L_{\text{Align}}$ provides regularization by minimizing the disagreement between dual-branch predictions, which encourages coherent outputs. The joint optimization of diffusion and alignment losses allows for generating high-quality samples while enhancing segmentation performance. The overall loss function for the breast mass segmentation task is defined as follows:
Dataset
In this paper, we evaluate the proposed method on the breast ultrasound dataset BUSI [1], which collects breast ultrasound images from 600 patients between 25 and 75 years old, comprising 780 images with an average resolution of 500 × 500 pixels. According to the relevant domain knowledge, the BUSI dataset is divided into three categories: normal, benign and malignant. Furthermore, the dataset also provides expert manual annotation of tumor regions in benign and malignant images. All images and ground truth are saved in PNG format. During the experiment, we use 70% of the dataset as the training set, 15% as the validation set, and the rest as the test set. In this work, normal samples are also used to train the model but do not participate in validation and testing. Additionally, we design a specific data augmentation and model training strategy in section 3.3. Table 1 shows the number of samples in the original BUSI dataset and after augmentation.
The number of samples of raw and augmented data in training, validation, and test sets
We adopt the same specification as the MedSegDiff baseline, including a learning rate of 1e-4 and AdamW optimizer. Training leverages two NVIDIA A100 GPUs with a batch size of 8 across 130000 steps. We save model parameters every 10000 iterations. The checkpoint that achieves the best validation performance is selected for final testing. The image and mask are normalized to [0, 1] and converted into tensors fed into the model. We utilize 1000 diffusion steps for both training and testing, in alignment with recent practices [47].
Evaluation Metrics
We evaluate the breast ultrasound tumor segmentation performance using pixel-wise accuracy (ACC), precision (P), Dice Similarity Coefficient (DSC), and Intersection over Union (IoU). Given the binary ground truth mask X and predicted mask
True Positives (TP): Pixels correctly predicted as positive class.
False Positives (FP): Pixels incorrectly predicted as positive class.
True Negatives (TN): Pixels correctly predicted as negative class.
False Negatives (FN): Pixels incorrectly predicted as negative class.
Here, ∩ indicates intersection and ∪ denotes union. DSC measures the ratio of overlap between the ground truth and prediction over their combined area, ranging from 0 to 1, with 1 denoting complete overlap. IoU calculates the proportion of the intersection over the union of the true and predicted regions. Higher values for both metrics indicate better alignment.
Compared to pixel accuracy, DSC and IoU better account for shape consistency between ground truth and prediction. Therefore, we adopt DSC and IoU as the primary evaluation metrics, alongside accuracy and precision, to quantify segmentation quality in this study of breast tumor segmentation from ultrasound images.
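The four metrics can be computed from the pixel-level confusion counts, as in the following minimal sketch. It uses the standard definitions for a single pair of binary masks given as flat lists. How the paper averages across images or classes is not stated, so that step is left out, and the function names are illustrative.

```python
def confusion_counts(truth, pred):
    """Count TP, FP, TN, FN over two equally long binary masks."""
    tp = fp = tn = fn = 0
    for t, p in zip(truth, pred):
        if p and t:
            tp += 1
        elif p and not t:
            fp += 1
        elif not p and not t:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def segmentation_metrics(truth, pred):
    """Return ACC, P, DSC and IoU for one ground-truth/prediction pair.

    ACC = (TP + TN) / (TP + FP + TN + FN)
    P   = TP / (TP + FP)
    DSC = 2 TP / (2 TP + FP + FN)
    IoU = TP / (TP + FP + FN)
    """
    tp, fp, tn, fn = confusion_counts(truth, pred)
    total = tp + fp + tn + fn
    return {
        "ACC": (tp + tn) / total,
        "P": tp / (tp + fp) if tp + fp else 0.0,
        # Two empty masks overlap perfectly by convention.
        "DSC": 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 1.0,
        "IoU": tp / (tp + fp + fn) if tp + fp + fn else 1.0,
    }
```

Because TN appears only in the accuracy formula, a prediction that labels a large background correctly can score high accuracy while overlapping poorly with a small lesion, which is the motivation given above for emphasizing DSC and IoU.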
Ablation analysis
Table 2 presents an ablation study quantifying the individual contributions of each model component. The * denotes the standard data augmentation method, which is compared against the proposed PAL strategy. Using the baseline conditional MedSegDiff model alone yields 92.11% accuracy, 74.04% precision, 73.76% DSC and 78.14% IoU. Integrating the proposed IPA module improves the performance across all metrics to 95.04% accuracy, 77.09% precision, 76.58% DSC and 79.72% IoU by imposing consistency constraints to reduce output randomness. Applying the PAL strategy also enhances model performance beyond standard augmentation. Ultimately, the complete framework with all components delivers a further improvement across all metrics, outperforming both the baseline and the individual ablation models. Specifically, the complete model increases DSC by 3.85 percentage points (73.76% to 77.61%) and IoU by 3.37 percentage points (78.14% to 81.51%) compared to the baseline, verifying that each component contributes meaningfully to superior overall performance. The ablation study thus quantitatively demonstrates the necessity of the proposed modules in achieving better segmentation accuracy.
Ablation experiments. Baseline represents the information bottleneck, IPA means Inter-branch Prediction Alignment module, and PAL is the progressive Augmentation Learning strategy
We compare the proposed algorithm with several classical and advanced medical image segmentation models. The U-Net [37] is a seminal convolutional neural network composed of a contracting encoder pathway to capture context and a symmetric expanding decoder pathway for precise localization. Att-Unet [29] augments the U-Net architecture with attention gates to focus on target structures. Nested UNet [58] further extends U-Net by introducing nested dense connections between the encoder and decoder for improved gradient flow. More recently, vision transformers have shown promising performance by modeling long-range dependencies. TransU-Net [8] incorporates Transformer encoders into convolutional decoders for global context modeling. Similarly, Swin-Unet [5] adapts hierarchical Transformers as backbones to enhance feature representations. Each model makes progress by improving upon previous limitations, either integrating attention mechanisms or augmented pathways. For a fair comparison, none of the models used pre-trained parameters of ImageNet. The recent breast tumor segmentation models like HiFormer-L [15], MDF-Net [32], and FBAT [56] are also compared in this study.
All results are shown in Table 3. Overall segmentation accuracy, as measured by pixel-level accuracy (ACC), is highest for TransU-Net at 96.47%, with our proposed method achieving a competitive accuracy of 95.15%. However, our method achieves the best precision and IoU while remaining competitive on the Dice similarity coefficient. In terms of precision, our method attains the leading score of 79.74%, indicating its strength in minimizing false positive errors. This suggests our consistency regularization helps enhance discrimination capability. For segmentation similarity, our approach attains a Dice score of 77.61%. While its Dice is slightly lower than that of MDF-Net, our approach significantly outperforms it on IoU. This implies our PAL strategy captures inter-class representations well to improve boundary delineation. Finally, our framework realizes the top IoU performance of 81.51%, showing its aptitude for handling class imbalance and rare categories. The DSC gap between the best and second-best model TransU-Net is also significant at 5.78%. In summary, while TransU-Net achieves the highest raw accuracy, our proposed technique strikes a balance between precision and generalization to deliver leading precision and IoU together with a competitive Dice score.
Performance comparison of breast mass segmentation
Figure 4 presents segmentation results on three benign cases (first three rows) and three malignant cases (last three rows) for several models. The qualitative results show accurate lesion localization and sharp boundary delineation by the proposed model, likely attributable to the progressive data augmentation strategy, which enables precise segmentation even near lesion borders. In the first benign example, U-Net and Nested UNet produce inaccurate partitions because abundant interfering context resembles the target. Unlike benign tumors, malignant masses often exhibit blurred boundaries, which poses challenges for segmentation, as evidenced by the uncertain edges from TransU-Net in the fourth case (fourth row). Unlike discriminative models, the proposed model approximates the underlying joint probability distribution of the data and restores indistinguishable diseased regions instead of shrinking the segments. Consequently, it delineates lesions to the greatest extent possible despite boundary ambiguity. Moving forward, adversarial training and ensemble approaches could further enhance segmentation fidelity across lesion types. Additional patient data covering diverse scenarios would also strengthen model generalizability to real-world clinical applications.

Segmentation results. From left to right: Original breast ultrasound images; Ground truth masks; Segmentation results using U-Net, Attention U-Net, Nest-UNet, TransUNet, Swin-Unet, and the proposed method.
Table 4 shows the segmentation performance over five sampling runs of the proposed diffusion framework; "ensemble" denotes the integration of the five sampling results by majority voting. The single-inference metrics fluctuate in accuracy (93.31% to 94.62%), precision (78.62% to 79.97%), Dice similarity (74.78% to 76.47%), and IoU (78.44% to 80%). The second inference achieves the highest individual scores overall. Aggregating the predictions with the 5-run majority-voting ensemble raises accuracy to 95.15%, the Dice score to 77.61%, and IoU to 81.51%, with a competitive precision of 79.74%. The spread between the minimum and maximum individual metrics is 1.31 (accuracy), 1.35 (precision), 1.69 (Dice), and 1.56 (IoU) percentage points. Meanwhile, the ensemble surpasses the best single pass by 0.53 (accuracy), 1.14 (Dice), and 1.51 (IoU) percentage points, while its precision is 0.23 percentage points below the best single-pass value (79.74% versus 79.97%). This quantifies the variation stemming from stochastic sampling while validating that ensembling multiple samples improves robustness. In conclusion, the diffusion segmentation model exhibits run-to-run fluctuations, but aggregating multiple outputs improves accuracy and region similarity over single sampling while keeping precision competitive.
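The majority-voting fusion described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `majority_vote` and the toy 2x2 masks are hypothetical, and each sampled mask is assumed to have already been thresholded to binary values. It is not the authors' implementation.

```python
import numpy as np

def majority_vote(masks):
    """Fuse binary masks pixel-wise: a pixel is lesion if more than half of the samples say so."""
    stack = np.stack([np.asarray(m, dtype=np.uint8) for m in masks], axis=0)
    votes = stack.sum(axis=0)                       # lesion votes per pixel
    return (2 * votes > stack.shape[0]).astype(np.uint8)

# Five hypothetical sampling rounds for the same 2x2 image.
samples = [
    [[1, 1], [0, 0]],
    [[1, 0], [0, 0]],
    [[1, 1], [1, 0]],
    [[1, 0], [0, 0]],
    [[1, 1], [0, 1]],
]

fused = majority_vote(samples)
print(fused.tolist())
# [[1, 1], [0, 0]]
```

An odd number of samples, such as the five rounds used here, avoids ties. With an even number, the strict inequality in this sketch resolves ties to background, which is a design choice rather than something stated in the text.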
Quantitative evaluation of the effect of the number of experts on model performance on the RIGA dataset
Figure 5 shows repeated sampling results from rounds 1 to 5. The sampling results differ from round to round, especially near lesion boundaries and irregular contours. Such variability introduces predictive uncertainty, yet it also allows multiple plausible segmentations to be explored. The stochasticity and diversity of the diffusion model facilitate a more complete characterization of the posterior and mitigate representation bias. In addition, the results show that the ensemble method with majority voting helps reduce ambiguity and produce coherent outputs, improving Dice and IoU by about 2%. When the shape or margin is unclear, this approach can alleviate wrong predictions. Quantitatively, combining the possible outputs improves on the metrics of any single round, because the aggregated prediction provides more robust tumor features.
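One simple way to make the round-to-round variability visible is a per-pixel agreement map over the sampled masks. The sketch below is an illustration under assumptions, not a procedure described in the paper: the function name `vote_uncertainty`, the toy masks, and the use of binary entropy as the uncertainty score are all hypothetical choices.

```python
import numpy as np

def vote_uncertainty(masks):
    """Return the per-pixel lesion vote fraction and its binary entropy in bits."""
    stack = np.stack([np.asarray(m, dtype=np.float64) for m in masks], axis=0)
    p = stack.mean(axis=0)                          # fraction of samples voting lesion

    # Clip only to keep log2 finite; pixels with full agreement are set to 0 below.
    q = np.clip(p, 1e-12, 1.0 - 1e-12)
    entropy = -(q * np.log2(q) + (1.0 - q) * np.log2(1.0 - q))
    entropy[(p == 0.0) | (p == 1.0)] = 0.0          # unanimous pixels carry no uncertainty
    return p, entropy

# Five hypothetical sampling rounds for the same 2x2 image.
samples = [
    [[1, 1], [0, 0]],
    [[1, 0], [0, 0]],
    [[1, 1], [1, 0]],
    [[1, 0], [0, 0]],
    [[1, 1], [0, 1]],
]

p, entropy = vote_uncertainty(samples)
print(p.tolist())
# [[1.0, 0.6], [0.2, 0.2]]
print(np.round(entropy, 3).tolist())
# [[0.0, 0.971], [0.722, 0.722]]
```

Pixels with high entropy are those on which the sampling rounds disagree most, which in practice would be expected near blurred or irregular lesion boundaries.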

Segmentation results obtained by multi-round sampling and sample ensembling. From left to right: original breast ultrasound images; ground truth masks; outputs of sampling rounds 1 to 5; final aggregated result using majority voting.
Accurate breast tumor segmentation from medical images provides crucial information for cancer screening, diagnosis, and treatment planning. However, complex morphology and boundary ambiguity hinder segmentation performance, and popular discriminative methods have limited generalization and uncertainty modeling capabilities. In this paper, we propose PalScDiff, a generative framework for breast lesion segmentation based on a conditional denoising diffusion probabilistic model. Stochastic sampling enables diverse yet plausible outputs, while an Inter-branch Prediction Align module with consistency constraints reduces randomness. A Progressive Augmentation Learning strategy handles segmentation challenges even for irregular and blurred cases. Comprehensive experiments demonstrate improved performance over previous segmentation models, including U-Net, TransU-Net, Swin-Unet, and recent breast tumor segmentation models. The method reaches an accuracy of 95.15% and a Dice score of 77.61%. Moreover, multi-round sampling and integration improve segmentation performance and robustness. Although the approach improves on existing models, limitations remain in handling complex tumor morphology and in incorporating multi-modal prior knowledge. Future work will explore multi-modal data integration and further enhancements to the model's robustness and generalization capabilities.
