Abstract
Background:
Deep learning has transformed medical imaging by enabling earlier and more accurate disease diagnosis. Lesion and tumor segmentation, essential for analyzing and tracking morphological changes, is commonly done with U-Net variants, though these often lack cross-domain generalization and do not fully leverage foundation models like the Segment Anything Model (SAM), which still requires manual intervention to define the region of interest (ROI).
Objectives:
To enhance generalization and reduce manual intervention by combining the automatic optimization of nnU-Net with the precision of SAM.
Design:
Experimental evaluation of a hybrid segmentation framework for lung nodule analysis.
Methods:
We propose a novel approach integrating the automatic optimization capabilities of nnU-Net for lesion detection with the high-precision segmentation of SAM, eliminating the need for manual intervention by the clinician. The method was evaluated on the LIDC-IDRI dataset, a widely recognized benchmark for lung nodule segmentation.
Results:
Our approach produces more anatomically coherent segmentations than nnU-Net alone. In many cases, the resulting boundaries more closely reflect true nodule morphology than individual expert annotations, despite high inter-expert variability.
Conclusion:
The proposed integration of nnU-Net with SAM enables fully automated lesion segmentation without manual intervention. The method improves generalization and accuracy across medical imaging domains, achieving expert-level performance in pulmonary nodule segmentation.
Introduction
Image segmentation involves dividing an image into regions with the goal of extracting the areas of interest. It is considered a classic problem in computer vision. To extract regions of interest, characteristics such as grayscale levels, color, and geometric shapes are considered. Segmentation has been applied in various fields, including satellite imagery, autonomous driving, and medical image analysis. Segmentation is of particular importance in medical imaging because it helps locate regions most affected by a disease and serves as a support tool for medical professionals. In the construction of a computer-aided diagnosis (CAD) system, segmentation is one of the initial stages of image processing and acts as a precursor to the disease classification model. Semantic segmentation is commonly applied in medical image analysis to label each pixel in the image, providing precise information about the shape and size of the lesion.
Traditional segmentation methods include thresholding, region-based, and edge-region-based approaches. These methods use image characteristics and mathematical concepts to achieve fast segmentation. However, traditional techniques generally offer less precision than deep learning methods because deep learning models can automatically extract features from images. Deep learning methods include Fully Convolutional Network (FCN)-based methods, such as U-Net, 1 an autoencoder architecture for medical image segmentation. More recent advancements have been made using Vision Transformers,2-4 which employ attention mechanisms to focus on the most important features in the image.
One of the main drawbacks of many proposed deep learning architectures is their lack of generality when applied to new datasets with different medical imaging modalities. In the literature, we found specific U-Net modifications that perform well for particular problems; however, they lack versatility. Foundation models address this issue by training transformer-based architectures on large-scale datasets with diverse image types. In recent years, foundation models have revolutionized the field of Artificial Intelligence (AI). These models can effectively and efficiently solve several tasks across different domains, excelling in Natural Language Processing (NLP) and Computer Vision (CV) problems. The most prominent foundation models are large language models like GPT-4 5 and multimodal large language models like Gemini, 6 which accept images and audio. In contrast, diffusion models, such as Stable Diffusion, 7 can generate all types of images from a prompt.
The emergence of these foundation models has enabled the development of new general-purpose segmentation models that allow the segmentation of various objects across any domain. These models often incorporate a prompt, such as a user-provided click or bounding box, to precisely indicate the segmented area. This has led to the development of a new field within segmentation known as interactive segmentation.
A prominent example of this approach is the SAM model, which was released in 2023. For medical imaging, MedSAM 8 was introduced in 2024, a model based on SAM and trained on a large dataset of > 1.5 million medical images. Although these models achieve high segmentation accuracy, they require manual user intervention through a prompt. This prompt can be a mouse click, bounding box, or text input. Another advantage of SAM lies in its zero-shot approach, where it is not necessary to train a model with data from a specific domain to perform segmentation. This approach has been applied to various imaging modalities, 9 such as CT, 10 Magnetic Resonance Imaging (MRI), 11 and pathological images. 12 Although SAM results approach state-of-the-art performance, segmentation becomes challenging in more specific problems, such as weak boundaries, very small lesions, and complex, irregular shapes. This finding is particularly evident in pulmonary nodule segmentation. Therefore, it is necessary to train or fine-tune the SAM model with such images to achieve optimal results.
The training pipeline is another important aspect to highlight. In most segmentation problems, it is necessary to manually optimize and configure the entire process, including preprocessing, normalization, hyperparameters, and architecture configuration. This process must be repeated for each dataset. To address these issues, the nnU-Net framework 13 was published in 2021 by Isensee et al. With nnU-Net, it is not necessary to manually configure and adjust the entire pipeline. Different components of the pipeline, such as hyperparameters and architecture configuration, adapt according to the properties of the dataset and image modality.
The nnU-Net framework automates configuring and training CNNs for medical image segmentation via a systematic approach. It begins with dataset feature extraction, known as data fingerprint, where the image modality and intensity value distribution are analyzed. Based on this data fingerprint, rule-based parameters are established, such as normalization tailored to each image modality. In Computed Tomography (CT) scans, global z-score normalization is applied. The framework then automatically determines the network topology and batch size based on the available GPU memory. The loss function and key hyperparameters, such as the learning rate, number of epochs, and optimizer, are fixed. Finally, the network is trained using the previous configuration.
In this paper, we propose a novel approach that combines optimization of the deep learning pipeline within the nnU-Net framework with the SAM model’s precise segmentation capabilities. Using nnU-Net, we automatically and efficiently adapt the deep learning pipeline to detect regions of interest in the image, where lesions are located. These relevant areas detected by nnU-Net are then used as input for the SAM model, eliminating the need for manual intervention and saving time. Moreover, with our proposal to extract regions of interest from the images locally, we avoid the previously mentioned problem of dealing with weak boundaries and small and complex lesions without the need for specific training or fine-tuning. This new approach was applied for lung nodule segmentation.
The remainder of the paper is organized as follows: Section 2 reviews the current state-of-the-art techniques for medical image and pulmonary nodule segmentation. Section 3 details the methodology used in this study, including the dataset, preprocessing techniques, and architectures employed. Section 4 presents the obtained results and compares them with other state-of-the-art methods. Finally, Section 5 offers conclusions and outlines future research directions.
Lung Cancer Overview
Lung cancer is a significant health problem and the leading cause of cancer-related death worldwide. According to the latest GLOBOCAN 2022 estimates 14 of cancer incidence and mortality, produced by the International Agency for Research on Cancer (IARC), lung cancer accounts for more than 1.8 million deaths (18.7% of all cancer deaths). In the United States, an estimated 611 720 people died of cancer in 2024, of whom 125 070 (20.44%) died of lung cancer. 15 The situation is no better in Europe. 16 Lung cancer remains the leading cause of cancer-related deaths among men, with 153 032 predicted deaths. For women, the predicted mortality is 84 402 compared with the 76 041 deaths observed in 2018. The prediction for 2050 indicates that the number of cases diagnosed will increase. Therefore, early diagnostic methods are crucial to improve disease prognosis and patients’ quality of life.
Lung cancer is diagnosed through physical examination, biopsy, or imaging using magnetic resonance imaging (MRI) and computed tomography (CT). 17 A pulmonary nodule is an abnormal lung area. Pulmonary nodules are common findings, detected in approximately 30% of chest CT scans and 1.6 million patients annually in the United States. They are categorized as small solid (<8 mm), larger solid (≥8 mm), and subsolid, including ground-glass and part-solid nodules. 18 At least 95% of all pulmonary nodules identified are benign, but the risk of malignancy increases with nodule size, from <1% for nodules <6 mm to 64%–82% for nodules >20 mm.18,19 Nodules >10 mm are considered large nodules, while nodules <3 mm are micronodules. 20 Other risk factors include patient age, smoking history, and nodule characteristics such as irregular borders and growth rate.21,22
Lung nodule segmentation solutions can be divided into 2 categories: traditional segmentation and deep learning methods. Morphological operations, active contours, and region-growing are common traditional segmentation methods.23-25 Although traditional methods are resource-efficient, Deep Learning (DL) techniques provide superior results. Deep learning techniques extract relevant features that allow pixel-by-pixel classification of an image, which yields more precise segmentation. These techniques can be applied to different types of images, such as CT and histopathological images.26,27 In recent years, autoencoder architectures, such as U-Net, have been used to solve this problem, specifically in medical imaging. Using these segmentation models, a radiologist can detect nodules that might otherwise go unnoticed, obtain support for the final diagnosis, and even study changes in nodule size over time. Lung nodule segmentation in CT images remains a challenging task due to the variety of nodule shapes, sizes, and densities, as well as their similarity to surrounding structures. First, large databases containing nodules with different characteristics are required. Due to the sensitive nature of such images, public databases are scarce. The most commonly used public databases are LIDC-IDRI 28 and LUNA16. 29 Second, the segmentation model requires precise annotations from experts, which are time-consuming to produce. These segmentation masks must capture the nodule’s exact shape so that the model can generalize during learning. Recent innovations include multi-crop CNNs, 30 dual-branch networks,31,32 and region-based fast marching methods. 33 Although these techniques have promising results, challenges remain in achieving high sensitivity with low false-positive rates, managing different types of nodules, and developing robust models applicable to diverse patient databases. 34
Related Works
Lung nodule segmentation remains challenging due to its complexity. There are 2 clear divisions when solving this problem: traditional segmentation methods and deep learning-based methods. Among the traditional segmentation methods, we found morphological operations, active contours, and region-growing.23-25 Although traditional methods are efficient and require fewer computing resources, their results are considerably inferior to those of deep learning methods, which benefit from automatic extraction of image features. Deep learning methods are divided into 2D and 3D architectures. Most proposals in the literature work with 3D architectures to extract spatial information from voxels. Although these architectures perform better than 2D architectures in many cases, the difference in results is often small. Additionally, 3D architectures involve much more design complexity and require more computational resources.
Among the most used architectures, U-Net 1 stands out, achieving good results in medical image segmentation problems thanks to the use of skip connections. However, the base U-Net model fails to extract enough features for precise segmentation. To address this issue, several modifications to the architecture and the use of different preprocessing techniques have been proposed. Chaudhry et al 35 proposed a 2D base U-Net model and used transfer learning through pretraining on the LIDC-IDRI dataset 28 to segment nodules in the UniToChest dataset, achieving good results. We also observe the use of residual blocks in the literature to prevent relevant information loss and the use of Atrous convolution36,37 to obtain multiscale features. In this way, nodules of different sizes can be detected.38,39 Using these 2 techniques, Halder and Dey 38 achieved one of the highest segmentation results in the state of the art, with a Dice score of 97.15% on cropped images from the LIDC-IDRI dataset. Zhou et al 40 introduced U-Net++, an extension of U-Net that incorporates additional depth levels, inter-level skip connections, and deep supervision to enhance multi-scale learning. Isensee et al 13 proposed the nnU-Net framework, which focuses on optimizing preprocessing and hyperparameters instead of modifying the architecture. To this end, they propose several 2D and 3D U-Net base architectures for segmentation problems across all kinds of medical images. Building upon this framework, our statistical analysis using the UniToChest dataset concluded that the inclusion of residual connections and the application of windowing as a preprocessing technique are significant factors in enhancing nnU-Net segmentation performance for pulmonary nodules. 41
A hybrid approach is also common when addressing this problem. Nguyen et al 42 propose a new backbone based on U-Net that allows nodule detection and segmentation, thanks to a multi-branch attention auxiliary learning mechanism. On the other hand, Jiang et al 31 fused features from original images and images with features into a dual-branch network and a multidimensional fusion module. Wang et al 43 manage to extract features simultaneously in 2D and 3D CT images. Some studies combine traditional techniques with deep learning methods, as Yang et al 44 did. In this proposal, they first train a 3D U-Net architecture for segmentation. Then, the segmentation is refined using active contours, achieving a Dice score of 90% on the LIDC-IDRI dataset.
New techniques based on transformer networks have been proposed in recent years. The main idea of this approach is to use attention mechanisms to focus on relevant information in the image, improving the segmentation model’s accuracy. Hou et al 4 propose SMR-UNet, which integrates self-attention mechanisms that extract multi-scale features and replaces U-Net’s convolutional blocks with residuals to avoid gradient explosion. Wang et al 45 use a swin transformer block to extract global information from the image by leveraging self-attention. Similarly, Cao et al 46 proposed Swin-Unet, a pure transformer-based U-shaped architecture that replaces convolutional blocks with shifted windows to capture long-range semantic information.
Preprocessing is necessary when working with CT scans before using a model. In many works, cropping is performed to extract the ROI of the nodule as preprocessing.4,31,37,43,45 By extracting the nodule’s ROI, better results are often obtained than using the complete CT image, as unnecessary information is eliminated. CT scans use Hounsfield units (HU), and many authors normalize these values by removing irrelevant information from the lung area.3,35 Another relevant aspect of preprocessing is the prior segmentation of the lung parenchyma, the area where the nodules are located.3,25,39 To segment the lung parenchyma, there are traditional algorithms like watershed 47 and accurate deep learning models like U-Net R231 in the lungmask package 48 with over 90% Dice score.
Several approaches leverage SAM for medical image processing. Fine-tuned architectures, such as MedSAM, 8 extend the foundation model by incorporating diverse medical image datasets. To address the domain shift between natural and medical images, adapter-based methods have gained popularity; for instance, Wu et al 49 proposed a medical SAM adapter to integrate domain-specific knowledge without fine-tuning the entire heavy backbone. Similarly, recent works like DeSAM 50 focus on decoupling prompt learning to enhance generalization across different modalities. Other approaches eliminate manual user intervention by modifying the prompt encoder. Among these architectures is AutoSAM, 51 which removes the prompt encoder of SAM and replaces it with a custom encoder trained on the same input image, thereby avoiding the need for specific fine-tuning. However, none of the abovementioned methods optimizes the training pipeline for medical image segmentation as effectively as nnU-Net does.
Among the approaches that integrate nnU-Net with SAM, nnSAM 52 stands out. This model combines both architectures by concatenating the embedding of the input image in SAM with the latent representation learned by the nnU-Net encoder. This integration leverages the knowledge encoded in the foundation model and the optimization capabilities of nnU-Net, allowing the combination of both models to exploit their respective advantages.
Few datasets are publicly available due to the sensitive nature of the data. LIDC-IDRI 28 is the most used database in the literature for segmentation, followed by LUNA16. 29 In 2022, new databases were launched such as UniToChest, 35 which contains more images than previous databases.
Methodology
Dataset
In this study, we used the database “The Lung Image Database Consortium (LIDC) and Image Database Resource Initiative (IDRI)”. 28 This dataset is one of the most widely used in the literature and is a reference database for lung nodule segmentation and classification. The LIDC-IDRI dataset contains 1018 CT scans obtained from 7 different academic institutions. Images were taken from various CT scanner models from different companies.
The dataset contains the following 3 main categories of nodules and annotations:
Nodules ≥ 3 mm: Lesions considered as nodules with the greatest in-plane dimension between 3 and 30 mm.
Nodules < 3 mm: Lesions considered as nodules with the greatest in-plane dimension less than 3 mm.
Non-nodule ≥ 3 mm: Other pulmonary lesions with the greatest in-plane dimension ≥ 3 mm that were not considered nodules.
To annotate the images, 12 thoracic radiologists from 5 LIDC-IDRI institutions (Weill Cornell Medical College, University of California, Los Angeles, University of Chicago, University of Iowa, and University of Michigan) participated in the project. Each scan was independently reviewed by 4 of them in 2 phases. Each CT scan contains 1 or more nodules. Each nodule may be annotated by 1, 2, 3, or 4 radiologists, as not all of them may agree on whether it should be considered a nodule. The first phase was a masked review, and the second was an unmasked review, in which each radiologist reviewed their own mask along with the masks of the other radiologists. In this final phase, the radiologist could modify the mask and make a final decision.
The dataset contains 7371 lesions marked as nodules by at least 1 radiologist. Of these lesions, 2669 were nodules larger than 3 mm. Finally, 928 of the nodules larger than 3 mm were marked by all 4 radiologists.
Preprocessing
For image processing, we used the pylidc library. 53 This library facilitates access to CT images, sized
Regarding expert annotations, a consensus mask was generated, representing 50% agreement among all annotations (see Figure 1). In this way, we consider the opinions of all experts for model training. Additionally, since we work with 2D images and DICOM images are 3D, we extracted the volume of the annotation and kept only the central slice to avoid redundancy of nodules during training. In addition, the DICOM pixel data are raw values that depend on the scanner manufacturer. These values must be converted to HU for a correct representation of lung tissue. On this scale, lung parenchyma values range from −700 to −600, while water is 0. To convert the scanner values to HU, we used the following equation:
$$\mathrm{HU} = \mathrm{PixelData} \times \mathrm{RescaleSlope} + \mathrm{RescaleIntercept}$$
where PixelData represents the pixel values from the scanner, and RescaleSlope and RescaleIntercept are scaling and offset values required to convert the raw values to HU. In all images in the dataset, the RescaleSlope value is 1, while the RescaleIntercept value is 0, −1000, or −1024, depending on the scanner manufacturer. These values are stored in the DICOM image metadata. The HU values were then normalized to the range [0, 1] to facilitate training.
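The conversion and rescaling described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, and the clipping window (`hu_min`, `hu_max`) used before scaling to [0, 1] is an assumption, since the text does not state which HU range was used.

```python
import numpy as np

def to_hounsfield(pixel_data, rescale_slope=1.0, rescale_intercept=-1024.0):
    """Apply the linear DICOM rescale: HU = PixelData * RescaleSlope + RescaleIntercept."""
    return pixel_data.astype(np.float32) * rescale_slope + rescale_intercept

def normalize_hu(hu, hu_min=-1000.0, hu_max=400.0):
    """Clip HU to an assumed window and scale linearly to [0, 1]."""
    clipped = np.clip(hu, hu_min, hu_max)
    return (clipped - hu_min) / (hu_max - hu_min)

# Toy stored values; in practice the slope and intercept come from the DICOM metadata.
raw = np.array([[0, 1024], [324, 1424]], dtype=np.int16)
hu = to_hounsfield(raw, rescale_slope=1.0, rescale_intercept=-1024.0)
scaled = normalize_hu(hu)
print(hu)
print(scaled)
```

With a slope of 1 and an intercept of −1024, a stored value of 1024 maps to 0 HU (water) and 324 maps to −700 HU, which falls in the lung parenchyma range quoted in the text.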

Figure 1. Consensus annotation masks representing the agreement among radiologists for the region of the nodule.
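A 50% consensus mask and the selection of the central slice can be illustrated with plain NumPy. This is a sketch under stated assumptions: the helper names are hypothetical, each annotation is taken to be a boolean volume of identical shape with the slice index on the last axis, and the "central slice" is taken to be the middle one among the slices that contain the consensus nodule.

```python
import numpy as np

def consensus_mask(masks, level=0.5):
    """Keep voxels marked by at least `level` of the annotators."""
    stacked = np.stack([m.astype(bool) for m in masks], axis=0)
    return stacked.mean(axis=0) >= level

def central_slice(volume_mask):
    """Return (index, 2D mask) of the middle slice among those containing the nodule."""
    slice_ids = np.flatnonzero(volume_mask.any(axis=(0, 1)))
    mid = int(slice_ids[len(slice_ids) // 2])
    return mid, volume_mask[:, :, mid]

# Four toy annotations of the same nodule (rows, cols, slices).
a = np.zeros((4, 4, 3), dtype=bool); a[1:3, 1:3, :] = True
b = np.zeros((4, 4, 3), dtype=bool); b[1:3, 1:3, 0:2] = True
c = np.zeros((4, 4, 3), dtype=bool); c[1:3, 1:2, 1] = True
d = np.zeros((4, 4, 3), dtype=bool)  # this reader did not mark the lesion

cons = consensus_mask([a, b, c, d], level=0.5)
idx, mask2d = central_slice(cons)
print(idx, int(mask2d.sum()))
```

A reader who did not mark the lesion contributes an all-zero mask here; whether the denominator should count only the radiologists who annotated the nodule is a design choice the text does not settle.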
Because the LIDC-IDRI dataset does not include predefined training, validation, and test splits, a random and stratified split was performed according to each nodule’s class. To ensure a representative evaluation, we used a fixed stratified split rather than k-fold cross-validation. This strategy retains the underlying class distribution, ensuring that evaluation metrics are not artificially inflated by frequent or simple cases. Therefore, the stratified test set serves as a reliable benchmark for assessing the model’s performance across diverse anatomical structures. The dataset of 2642 nodules was divided into 1902 for training, 265 for validation, and 475 for testing. The database contains pathological information for a limited number of patients. However, the DICOM metadata include information on the nodule’s malignancy, with 5 classes (from least to most malignant):
1: Highly unlikely
2: Moderately unlikely
3: Indeterminate
4: Moderately suspicious
5: Highly suspicious
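The fixed stratified split described above can be sketched as a per-class shuffle followed by proportional slicing. The function name, the seed, and the fractions (roughly 72%/10%/18%, approximated from the reported 1902/265/475 counts) are assumptions for illustration, and the labels below are synthetic.

```python
import numpy as np

def stratified_split(labels, fractions=(0.72, 0.10, 0.18), seed=0):
    """Split indices into train/val/test while preserving the class proportions."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    parts = ([], [], [])
    for cls in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == cls))
        n_train = int(round(fractions[0] * len(idx)))
        n_val = int(round(fractions[1] * len(idx)))
        parts[0].extend(idx[:n_train].tolist())
        parts[1].extend(idx[n_train:n_train + n_val].tolist())
        parts[2].extend(idx[n_train + n_val:].tolist())  # remainder goes to test
    return tuple(np.array(sorted(p), dtype=int) for p in parts)

# Synthetic malignancy labels (classes 1 to 5), not the LIDC-IDRI distribution.
labels = np.repeat([1, 2, 3, 4, 5], [100, 200, 400, 200, 100])
train, val, test = stratified_split(labels)
print(len(train), len(val), len(test))
```

Fixing the seed keeps the split reproducible, which matters when a single held-out test set replaces k-fold cross-validation.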
Each expert labels each nodule with the corresponding malignancy grade. To study the reliability of inter-rater agreement, we calculated Cohen’s kappa for 2 radiologists and Fleiss’ kappa for 3 and 4 radiologists. The results were 0.1757 for 2 radiologists, 0.1721 for 3 radiologists, and 0.1940 for 4 radiologists, indicating very low agreement. Low concordance is common in challenging interpretation tasks, such as nodule classification, where ambiguity is high. In our experiments, the majority vote determines the final class of nodules. In some cases, a significant difference in opinion was observed, with the same nodule being marked as both class 1 and class 5. To address this issue, while creating the classification model for validation, a new class 0 was created to account for such cases.
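For reference, Fleiss' kappa can be computed from a table of per-nodule category counts. The sketch below is a generic implementation of the standard formula, not the authors' code; the function name is hypothetical and the ratings are toy values rather than LIDC-IDRI data. It assumes every item is rated by the same number of raters.

```python
import numpy as np

def fleiss_kappa(ratings, categories):
    """Fleiss' kappa for an (n_items, n_raters) array of category labels."""
    ratings = np.asarray(ratings)
    n_items, n_raters = ratings.shape
    # counts[i, j] = number of raters assigning item i to category j
    counts = np.stack([(ratings == c).sum(axis=1) for c in categories], axis=1)
    p_cat = counts.sum(axis=0) / (n_items * n_raters)
    p_item = ((counts ** 2).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
    p_bar = p_item.mean()          # observed agreement
    p_exp = (p_cat ** 2).sum()     # agreement expected by chance
    return float((p_bar - p_exp) / (1.0 - p_exp))

# Toy malignancy grades from 4 raters on 5 nodules.
toy = np.array([
    [1, 1, 2, 1],
    [3, 3, 3, 4],
    [5, 4, 5, 5],
    [1, 5, 3, 2],
    [2, 2, 2, 2],
])
print(round(fleiss_kappa(toy, categories=(1, 2, 3, 4, 5)), 4))
```

Values close to 0, such as those reported in the text, indicate agreement only slightly better than chance.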
Method Overview
Figure 2 describes the design of the proposed method, which is divided into 3 phases. Figure 3 illustrates the complete pipeline using an example. In the first phase, the proposed method takes raw medical images as input. These images are fed into nnU-Net, which extracts the dataset characteristics, such as image modality and intensity distribution, to automatically configure the training pipeline, automating data normalization, U-Net architecture configuration, and some hyperparameters. The trained nnU-Net localizes the medical anomalies and generates a mask.

Figure 2. Proposed method for lung nodule segmentation integrating nnU-Net and SAM. Given a CT lung scan, in the first phase, the lung nodules’ localization is obtained by training a nnU-Net. Next, in the second phase, the ROI of the nodules is extracted and the original image is cropped. Bounding box coordinates are extracted from the mask using an expansion factor. In the third phase, the bounding box coordinates and the cropped image are passed as input to the prompt and image encoders of the SAM, respectively. The SAM mask decoder returns a refined mask, which is reconstructed into a new mask to match the original image. The architecture design is based on Isensee et al 13 and Kirillov et al. 54

Figure 3. Complete end-to-end pipeline. Workflow of the proposed method illustrated with a concrete nodule example.
Next, the mask accuracy was improved using the foundation SAM model, which was trained on a large number of images from different domains. This model requires manual user intervention: the user provides a point, a bounding box, or a description of what is to be segmented. However, when the regions to segment are very small compared with the total image, as is the case with lung nodules, precise segmentation is not achieved. To solve this problem, a local enhancement algorithm is proposed in the second phase of the proposed method. First, we extracted the region of interest (ROI) of the lesion, centering it in the image with a margin. The margin adds context from the lesion’s surroundings, allowing SAM to detect edge and shape differences more easily.
Next, the binary mask generated by nnU-Net serves as a region proposal generator to automate the prompting process. Specifically, the OpenCV contour detection algorithm is employed to identify the lesion boundaries within the mask. From these contours, the spatial bounding box coordinates
Finally, in the third phase, the SAM model receives the ROI image as input. The image embedding contains a representation of the image’s most relevant features. This embedding serves as the input to the SAM mask decoder, which generates a refined and precise mask. The refined mask is smaller than the original mask; thus, the mask is reconstructed by expanding its dimensions to match the original CT scan resolution (
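The second and third phases (box extraction with an expansion factor, cropping, and reconstruction of the refined mask at the original resolution) can be sketched as follows. This is an illustration under assumptions: the text uses OpenCV contour detection, whereas this sketch derives the box directly from the nonzero mask pixels with NumPy; it handles a single lesion per mask; and the helper names, `expansion=1.5`, and `min_size=16` are hypothetical values, not the authors' settings.

```python
import numpy as np

def expanded_bbox(mask, expansion=1.5, min_size=16):
    """Bounding box (x0, y0, x1, y1) around the mask, enlarged around its center."""
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return None  # nothing detected in the first phase
    cy = (ys.min() + ys.max()) / 2.0
    cx = (xs.min() + xs.max()) / 2.0
    h = max((ys.max() - ys.min() + 1) * expansion, min_size)
    w = max((xs.max() - xs.min() + 1) * expansion, min_size)
    y0 = max(int(np.floor(cy - h / 2)), 0)
    x0 = max(int(np.floor(cx - w / 2)), 0)
    y1 = min(int(np.ceil(cy + h / 2)), mask.shape[0])
    x1 = min(int(np.ceil(cx + w / 2)), mask.shape[1])
    return x0, y0, x1, y1

def crop(image, box):
    """Cut the ROI out of the full image."""
    x0, y0, x1, y1 = box
    return image[y0:y1, x0:x1]

def paste_back(refined_crop_mask, box, full_shape):
    """Place the refined ROI mask into an empty mask of the original size."""
    x0, y0, x1, y1 = box
    full = np.zeros(full_shape, dtype=bool)
    full[y0:y1, x0:x1] = refined_crop_mask
    return full

coarse = np.zeros((64, 64), dtype=bool)
coarse[30:34, 40:46] = True            # stand-in for the nnU-Net mask
box = expanded_bbox(coarse)
roi_mask = crop(coarse, box)           # a refined mask would replace this in practice
restored = paste_back(roi_mask, box, coarse.shape)
print(box, roi_mask.shape, bool(np.array_equal(restored, coarse)))
```

In the full pipeline, the cropped image and the box (expressed in the crop's coordinates) would be passed to SAM, and its output would take the place of `roi_mask` before pasting back.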
Architectures
The nnU-Net Architecture
This method, specifically developed for semantic segmentation of medical images, stands out for its automatic configuration of the entire model-building process. Preprocessing, architecture configuration, training hyperparameters, and postprocessing are all determined automatically based on dataset features and heuristic rules. The main idea of this method is that the pipeline configuration has a greater impact on performance than modifications to the architecture.
Figure 2 shows the design of nnU-Net applied to our proposal. First, the dataset features (data fingerprint) are extracted. During this process, the image modality and intensity value distribution are checked. Next, rule-based parameters are established, such as the normalization applied to each image. When working with CT images, a global z-score normalization is performed across the entire dataset, using the mean and standard deviation of the foreground intensities. The batch size, patch size, and network topology are also determined based on the available GPU memory. The hyperparameters, such as the learning rate, optimizer, loss function, and data augmentation techniques, are fixed.
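The global CT normalization can be sketched as follows. This is a simplified illustration, not the nnU-Net source: the function names are hypothetical, and the clipping to the 0.5 and 99.5 foreground percentiles before the z-score is included as an assumption about the framework's CT scheme that the text above does not mention.

```python
import numpy as np

def ct_fingerprint(images, masks):
    """Summarize foreground intensities pooled over the whole dataset."""
    fg = np.concatenate([img[m > 0].ravel() for img, m in zip(images, masks)])
    return {
        "mean": float(fg.mean()),
        "std": float(fg.std()),
        "p_low": float(np.percentile(fg, 0.5)),
        "p_high": float(np.percentile(fg, 99.5)),
    }

def normalize_ct(image, fp):
    """Clip to the foreground percentiles, then apply a global z-score."""
    clipped = np.clip(image, fp["p_low"], fp["p_high"])
    return (clipped - fp["mean"]) / max(fp["std"], 1e-8)

# Toy dataset: one 2x2 "scan" whose pixels are all foreground.
images = [np.array([[0.0, 10.0], [20.0, 30.0]])]
masks = [np.ones((2, 2), dtype=np.uint8)]
fp = ct_fingerprint(images, masks)
print(fp)
print(normalize_ct(images[0], fp))
```

Because the statistics are pooled over the dataset rather than computed per image, the same HU value maps to the same normalized value in every scan.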
The nnU-Net framework uses 3 architectures depending on the image type. All 3 architectures are based on the original U-Net configuration. 1 First, training is performed on a U-Net architecture for 2D images, followed by a 3D architecture with full image resolution. Finally, training is conducted with a cascaded 3D architecture for high-resolution images, performing segmentations at low resolution and refining them with the previous full-resolution model. The results of the 3 models are combined into an ensemble, and the final prediction is made after the postprocessing stage. In this study, we used only the 2D configuration.
SAM
SAM is a state-of-the-art foundation model for segmenting different shapes and objects. Its main disadvantage is that it requires manual user intervention in the form of bounding boxes, mouse clicks, or text to indicate the region to segment. SAM is trained on an enormous database comprising 11 million images from various domains and 1 billion masks. In addition to its accuracy, its main advantage is that it enables zero-shot learning. This allows the model to accurately segment images that it has not seen before. In the medical imaging context, the proposed approach can therefore be generalized to different types of medical images.
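As an illustration of box prompting, the sketch below uses the publicly released `segment_anything` package (`sam_model_registry`, `SamPredictor`). It is not the authors' code: the helper names and the checkpoint path are placeholders, the package and a downloaded ViT-B checkpoint are assumed to be available, and the conversion of a [0, 1] grayscale crop to a 3-channel 8-bit image is an assumption about how the CT crop is fed to SAM.

```python
import numpy as np

def to_rgb_uint8(gray01):
    """Convert a [0, 1] grayscale crop to the HxWx3 uint8 layout SamPredictor expects."""
    gray = np.clip(gray01, 0.0, 1.0)
    u8 = np.round(gray * 255.0).astype(np.uint8)
    return np.repeat(u8[:, :, None], 3, axis=2)

def refine_with_sam(gray01, box_xyxy, checkpoint_path, device="cpu"):
    """Run SAM (ViT-B) with one box prompt; returns a boolean mask and its score."""
    # Imported lazily so the helper above stays usable without the package.
    from segment_anything import SamPredictor, sam_model_registry

    sam = sam_model_registry["vit_b"](checkpoint=checkpoint_path)
    sam.to(device)
    predictor = SamPredictor(sam)
    predictor.set_image(to_rgb_uint8(gray01))  # computes the image embedding
    masks, scores, _ = predictor.predict(
        box=np.asarray(box_xyxy, dtype=np.float32),  # (x0, y0, x1, y1) in crop coordinates
        multimask_output=False,
    )
    return masks[0].astype(bool), float(scores[0])

crop01 = np.array([[0.0, 0.2], [1.0, 2.0]])
print(to_rgb_uint8(crop01)[:, :, 0])
```

A call such as `refine_with_sam(crop01, (x0, y0, x1, y1), "path/to/vit_b_checkpoint.pth")` would return the refined mask for the crop; the path shown is a placeholder.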
Training Procedure
The experiments were conducted on a system equipped with an NVIDIA GeForce RTX 4070 GPU with 12 GB of memory, 128 GB of RAM, and a 12th Gen Intel® Core™ i9-12900K CPU. The images were preprocessed as explained in the Dataset and Preprocessing subsections. For the loss function and hyperparameters, we kept the nnU-Net default values (see Table 1), except for the number of epochs, which we set to 200. The loss function is a combination of Dice and cross-entropy. The optimizer is Stochastic Gradient Descent (SGD) with Nesterov momentum of 0.9. For SAM, we used the ViT-B (Vision Transformer Base) model checkpoint. This variant was selected to balance segmentation accuracy and computational efficiency.
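A combined Dice and cross-entropy loss for a binary mask can be written as below. This is a NumPy sketch of the general idea, not the nnU-Net implementation: the function name is hypothetical, and the equal weighting of the two terms and the `1 - soft Dice` form are assumptions made for illustration.

```python
import numpy as np

def dice_ce_loss(prob, target, eps=1e-6):
    """Binary cross-entropy plus (1 - soft Dice) on foreground probabilities."""
    prob = np.clip(prob.astype(np.float64), eps, 1.0 - eps)
    target = target.astype(np.float64)
    ce = -np.mean(target * np.log(prob) + (1.0 - target) * np.log(1.0 - prob))
    intersection = np.sum(prob * target)
    soft_dice = (2.0 * intersection + eps) / (np.sum(prob) + np.sum(target) + eps)
    return float(ce + (1.0 - soft_dice))

target = np.array([[1, 0], [0, 1]])
good = np.array([[0.9, 0.1], [0.1, 0.9]])
bad = 1.0 - good
print(dice_ce_loss(good, target), dice_ce_loss(bad, target))
```

The cross-entropy term penalizes each pixel independently, while the Dice term depends on region overlap, which is less sensitive to the strong foreground/background imbalance typical of small nodules.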
Table 1. nnU-Net Configuration in the Experiments.
Results and Discussion
To evaluate the performance of the proposed method, we used standard segmentation metrics, such as the Dice Similarity Coefficient (DSC) and Intersection over Union (IoU). To validate the contribution of SAM to segmentation accuracy, we conducted a comparative study with nnU-Net by removing the SAM refinement component from our proposal.
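Both metrics follow directly from the overlap between a predicted and a reference mask. The sketch below is a generic implementation with a hypothetical function name; treating two empty masks as a perfect match is an assumption made to avoid division by zero.

```python
import numpy as np

def dice_iou(pred, gt):
    """Return (DSC, IoU) for two binary masks of the same shape."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    total = pred.sum() + gt.sum()
    dsc = 2.0 * inter / total if total else 1.0
    iou = inter / union if union else 1.0
    return float(dsc), float(iou)

gt = np.zeros((8, 8), dtype=bool)
gt[2:5, 2:5] = True
pred = np.zeros((8, 8), dtype=bool)
pred[3:6, 2:5] = True                  # same size, shifted down by one row
print(dice_iou(pred, gt))
```

In this toy case a one-pixel shift of a 3×3 region already lowers the DSC to about 0.67 and the IoU to 0.5, which shows how sensitive both metrics are to boundary differences on small nodules.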
In Table 2, a comparison is made between the proposed method and the nnU-Net and YOLO11 instance segmentation models. In our evaluation, the nnU-Net model and the proposed approach achieved comparable results, with a difference of approximately 5% in segmentation metrics. Segmentation metrics DSC and IoU of our model were compared to nnU-Net using the Wilcoxon signed-rank test for paired samples, with each image as an observation. Results show that nnU-Net significantly outperforms our model (P < .01), indicating greater agreement with expert reference masks. Although nnU-Net demonstrated numerically higher metric scores, this can be attributed to its closer alignment with expert annotations, which may inherently contain errors or inconsistencies. In contrast, our proposed model prioritizes the refinement of nodule shapes, particularly at the edges, which better preserves nodule morphology. This focus on shape integrity ensures that the segmentation results are not only precise but also more representative of the true nodule morphology, even if it occasionally results in lower metric scores when compared to imperfect ground truth (GT) annotations. As shown in Table 2, although our method yields lower DSC scores due to boundary expansion, it achieves a higher Average Recall (0.6903) compared to nnU-Net (0.6841). This indicates that the proposed pipeline prioritizes minimizing False Negatives, ensuring that the entire tumor volume, including peripheral irregularities, is captured, which is a clinical priority for oncological diagnosis. We conducted a qualitative analysis by visualizing various nodules to further illustrate this advantage, as shown in Figures 4 and 5. These visual comparisons highlight how our approach produces smoother and more consistent boundaries, which are crucial for clinical applications where accurate shape representation is paramount. 
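The paired comparison can be illustrated with a small Wilcoxon signed-rank implementation. This is a sketch, not the authors' analysis code: it uses the normal approximation without tie or continuity corrections, the function names are hypothetical, and the per-image scores are toy values. In practice a statistics library routine would normally be preferred.

```python
import math
import numpy as np

def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(1, len(values) + 1)
    _, inverse, counts = np.unique(values, return_inverse=True, return_counts=True)
    rank_sums = np.bincount(inverse, weights=ranks)
    return rank_sums[inverse] / counts[inverse]

def wilcoxon_signed_rank(a, b):
    """Two-sided paired test; returns (statistic, p_value)."""
    d = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    d = d[d != 0]                      # zero differences carry no sign
    n = len(d)
    if n == 0:
        return 0.0, 1.0
    ranks = average_ranks(np.abs(d))
    w_plus = ranks[d > 0].sum()
    w_minus = ranks[d < 0].sum()
    statistic = min(w_plus, w_minus)
    mean = n * (n + 1) / 4.0
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24.0)
    z = (statistic - mean) / sd
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return float(statistic), float(p_value)

# Toy per-image DSC values for two models evaluated on the same five images.
model_a = [0.81, 0.84, 0.88, 0.94, 0.83]
model_b = [0.80, 0.82, 0.85, 0.90, 0.78]
print(wilcoxon_signed_rank(model_a, model_b))
```

Each image contributes one paired difference, matching the text's choice of the image as the unit of observation.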
Compared to YOLO11, we evaluated the segmentation performance across various variants, with YOLO11m-seg yielding the best results. However, our proposal and nnU-Net significantly outperform YOLO11 in terms of segmentation metrics. This is primarily because YOLO failed to detect 260 of test images, substantially lowering its metrics. YOLO struggles to detect small or complexly shaped nodules, highlighting the advantage of our 2-stage method, which first segments and then classifies, particularly excelling with small or less visible nodules.
Results of Segmentation Model.

A visual comparison between models and GT. Lowest DSC and IoU values.

A visual comparison between models and GT. The highest DSC and IoU values.
Figures 4 and 5 show nodules of varying sizes and shapes. For a more detailed pixel-level analysis, Figures 6 and 7 compare the segmentation outputs of our proposed method, the GT masks, and nnU-Net. Different colors are used to highlight specific differences: red pixels indicate regions present exclusively in the GT mask, green pixels represent areas identified by nnU-Net, and yellow pixels correspond to regions detected by our proposed method that are not captured by either the GT or nnU-Net masks. The comparison covers the cases with the lowest and highest metric values. The segmentation masks generated by nnU-Net and our model are similar. However, the proposed model adapts better to nodule morphology, particularly in labeling border pixels that nnU-Net fails to detect. Compared with the ground truth, our model occasionally improves on the expert annotations by completing pixels that belong to the nodule. These discrepancies with expert annotations lead to lower metric scores because the model's output does not fully align with the manually annotated ground truth.
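The color coding described above can be sketched as follows. This is an illustrative reconstruction rather than the code used to produce the figures: the function name `difference_overlay` is hypothetical, and the sketch assumes that every nnU-Net pixel is drawn in green, since the text does not say whether the green category is exclusive.

```python
import numpy as np

RED, GREEN, YELLOW = (255, 0, 0), (0, 255, 0), (255, 255, 0)


def difference_overlay(gt, nnunet, proposed):
    """Build an RGB image that color-codes where three binary masks differ."""
    gt, nnunet, proposed = (
        np.asarray(m).astype(bool) for m in (gt, nnunet, proposed)
    )
    overlay = np.zeros(gt.shape + (3,), dtype=np.uint8)
    gt_only = gt & ~nnunet & ~proposed        # annotated but missed by both models
    proposed_only = proposed & ~gt & ~nnunet  # found only by the proposed method
    overlay[nnunet] = GREEN  # assumption: all nnU-Net pixels are shown in green
    overlay[gt_only] = RED
    overlay[proposed_only] = YELLOW
    return overlay


# Toy 1x4 example covering each category once.
overlay = difference_overlay([[1, 1, 0, 0]], [[0, 1, 1, 0]], [[0, 1, 1, 1]])
```

The red and yellow regions are disjoint from the nnU-Net mask by construction, so the order of the three assignments does not affect the result.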

Pixel-level comparison of the proposed method, GT annotations, and nnU-Net. Lowest DSC and IoU values.

Pixel-level comparison of the proposed method, GT annotations, and nnU-Net. The highest DSC and IoU values.
Because segmentation is often a prerequisite for classification in CAD systems, we evaluated the impact of our segmentation model on the classification of nodules by malignancy and compared its performance with nnU-Net and YOLO. To this end, we applied the masks produced by the SAM model to the corresponding cropped CT images and evaluated the resulting images with several classification models. The number of classes was reduced to 3: class 3 (indeterminate) was merged with class 0 (created for nodules without a consensus classification), nodules previously categorized as classes 1 and 2 were reclassified as class 1 (benign), and those in classes 4 and 5 were reclassified as class 2 (malignant) due to their high malignancy probability. Table 3 presents the results of the different models for expert-annotated masks, the nnU-Net model without SAM, and our proposed approach.
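The class merging described above amounts to a fixed lookup table. The sketch below assumes the original labels are integers from 0 to 5 as described in the text; the names `MERGED_CLASS`, `CLASS_NAMES`, and `merge_malignancy_label` are hypothetical.

```python
# Original label -> merged label, following the grouping described in the text.
MERGED_CLASS = {0: 0, 3: 0, 1: 1, 2: 1, 4: 2, 5: 2}
# Illustrative names for the merged classes (an assumption for readability).
CLASS_NAMES = {0: "indeterminate or no consensus", 1: "benign", 2: "malignant"}


def merge_malignancy_label(score: int) -> int:
    """Map an original malignancy label (0-5) to the 3-class scheme."""
    if score not in MERGED_CLASS:
        raise ValueError(f"unexpected malignancy label: {score!r}")
    return MERGED_CLASS[score]


merged = [merge_malignancy_label(s) for s in range(6)]
```

Rejecting unknown labels explicitly, rather than defaulting them to class 0, keeps annotation errors from being silently absorbed into the indeterminate class.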
Classification Model Results with Applied Segmentation Masks.
As shown in Table 3, while the overall Accuracy and F1-scores are comparable between the proposed method and the nnU-Net baseline, a consistent trend is observed in the Area Under the Curve (AUC) metric. Our method achieves higher AUC scores across all tested classification backbones, with EfficientNet B0 reaching the highest AUC of 0.8223 compared to 0.8105 for nnU-Net. This consistent improvement suggests that the segmentation masks generated by our pipeline preserve subtle boundary features that are diagnostically relevant. Although these features may penalize the geometric DSC and IoU scores due to mismatch with smoothed ground truth, they appear to provide valuable morphological cues that facilitate the distinction between benign and malignant nodules. Although Cohen’s and Fleiss’ kappa coefficients were low, the classification model achieved strong performance, indicating robustness to inconsistencies in expert annotations. Finally, Table 4 shows that YOLO11m-seg achieved an AUC of 0.7210, lower than the proposed pipeline. This performance difference is largely driven by the missed detections and segmentation challenges observed in the single-stage model for this specific dataset. In this context, the proposed hybrid approach demonstrated greater robustness for small lesion analysis, ensuring that difficult nodules were correctly segmented and classified where the general-purpose model struggled.
Classification Results on Detected Regions Using YOLO11m-seg.
The proposed methodology is general and can be readily adapted to other imaging modalities and tasks, such as skin lesion segmentation or organ segmentation in MRI. Adapting the approach would only require retraining nnU-Net with task-specific annotated data to learn the characteristics of the new imaging modality. This flexibility suggests that our framework could serve as a valuable tool across clinical automatic segmentation applications. Regarding the comparison with state-of-the-art models, unlike standard MedSAM workflows that rely on manual prompts, our proposal automates the process while retaining another clinical advantage: radiologists can intervene and manually adjust the bounding box if the automatic detection by nnU-Net fails. We maintained nnU-Net as the primary baseline due to its automation standards. While Transformer-based models such as Swin-Unet are valid alternatives, our hybrid pipeline is architecture-agnostic and allows for their potential integration as backbones in future work.
Finally, we acknowledge certain limitations in this study. A significant constraint in this research domain is the general scarcity of public datasets available for lung nodule segmentation. Furthermore, datasets that provide both pixel-level segmentation masks and malignancy classification attributes are exceptionally rare, which restricted our experimental validation to the LIDC-IDRI dataset. Additionally, the pipeline relies on the initial detector, meaning prompt variability can impact refinement. Future work will aim to evaluate the proposed framework on emerging datasets as they become available.
Conclusions
In this paper, we have introduced a novel approach for integrating the nnU-Net architecture with the SAM model. This proposal enables the automatic segmentation of lesions in medical images without requiring manual prompts from radiologists and can also be applied across medical imaging domains. Furthermore, because nnU-Net configures itself automatically, there is no need to manually configure the architecture or training pipeline, which generalizes the approach and saves time.
The proposed approach was applied to pulmonary nodule segmentation. Pulmonary nodules are particularly significant because their early detection can improve the prognosis of the disease. In the proposed architecture, nodules of varying sizes can be detected for subsequent analysis.
The proposed model offers several design advantages. First, it is independent of the base architecture used to generate the initial segmentation mask. This means that if a modified version of nnU-Net or another segmentation architecture is employed, it will still work with SAM because bounding boxes for the region of interest are automatically extracted. Thus, training an encoder and integrating it into the architecture would not be necessary, thereby avoiding dependency. Second, the proposal focuses on the local-level region of interest. Instead of extracting bounding boxes for small lesions from images with a large amount of information, bounding boxes are extracted from the cropped image of the ROI. This allows the SAM model to achieve a more precise segmentation of the affected area.
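The automatic extraction of a bounding box from an initial segmentation mask can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in this work: the function name `mask_to_box` and the `margin` parameter are hypothetical, the mask is assumed to contain a single lesion, and the box is returned in (x_min, y_min, x_max, y_max) order on the assumption that it would be passed to SAM as a box prompt.

```python
import numpy as np


def mask_to_box(mask, margin=0):
    """Return (x_min, y_min, x_max, y_max) around a binary mask, or None if empty."""
    mask = np.asarray(mask).astype(bool)
    if not mask.any():
        return None  # nothing was detected, so no prompt can be built
    rows = np.flatnonzero(mask.any(axis=1))  # row indices containing foreground
    cols = np.flatnonzero(mask.any(axis=0))  # column indices containing foreground
    height, width = mask.shape
    # Expand by the optional margin and clip to the image bounds.
    x_min = max(int(cols[0]) - margin, 0)
    y_min = max(int(rows[0]) - margin, 0)
    x_max = min(int(cols[-1]) + margin, width - 1)
    y_max = min(int(rows[-1]) + margin, height - 1)
    return x_min, y_min, x_max, y_max


# Toy example: a 2x3 foreground block inside a 6x8 image.
toy_mask = np.zeros((6, 8), dtype=np.uint8)
toy_mask[2:4, 3:6] = 1
tight_box = mask_to_box(toy_mask)
padded_box = mask_to_box(toy_mask, margin=1)
```

Because the box depends only on the binary mask, any upstream segmentation model could supply it, which is consistent with the architecture independence described above. A mask containing several lesions would first need to be split into connected components, which this sketch does not do.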
The proposed segmentation model is particularly useful for the development of a complete CAD system for segmentation and classification using the generated masks. The results obtained from the various classification models indicate that the proposed model achieves performance comparable to that of an expert in classifying the disease.
Finally, in future work, we will continue to explore the application of this approach to different types of medical images, considering their various sizes and shapes. Other architectures will also be used to evaluate segmentation performance.
Footnotes
Author Contributions
Alejandro Jerónimo contributed to the investigation, conceptualization, methodology, data curation, visualization, software development, original draft writing, and review and editing of the manuscript. Ignacio Rojas was involved in conceptualization, formal analysis, funding acquisition, project administration, provision of resources, validation, supervision, investigation, and manuscript review and editing. Olga Valenzuela contributed to conceptualization, formal analysis, funding acquisition, project administration, resources, validation, supervision, and investigation.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: Grants PID2024-160318OB-I00 and PID2021-128317OB-I00 funded by MICIU/AEI/10.13039/501100011033 and by ERDF, EU, and PCI2023-146016-2 funded by MICIU/AEI/10.13039/501100011033 and by the European Union.
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Data Availability Statement
This paper uses the LIDC-IDRI dataset [28].
