Abstract
Introduction:
Endoscopic tumor ablation of upper tract urothelial carcinoma (UTUC) allows for tumor control with the benefit of renal preservation but is impacted by intraoperative visibility. We sought to develop a computer vision model for real-time, automated segmentation of UTUC tumors to augment visualization during treatment.
Materials and Methods:
We collected 20 videos of endoscopic treatment of UTUC from two institutions. Frames from each video (N = 3387) were extracted and manually annotated to identify tumors and areas of ablated tumor. Three established computer vision models (U-Net, U-Net++, and UNeXt) were trained using these annotated frames and compared. Eighty percent of the data was used to train the models, while 10% each was reserved for validation and testing. We evaluated the highest performing model for tumor and ablated tissue segmentation using a pixel-based analysis. The model and a video overlay depicting tumor segmentation were further evaluated intraoperatively.
Results:
All 20 videos (mean 36 ± 58 seconds) demonstrated tumor identification and 12 depicted areas of ablated tumor. The U-Net model demonstrated the best performance for segmentation of both tumors (area under the receiver operating characteristic curve [AUC-ROC] of 0.96) and areas of ablated tumor (AUC-ROC of 0.90). In addition, we implemented a working system to process real-time video feeds and overlay model predictions intraoperatively. The model was able to annotate new videos at 15 frames per second.
Conclusions:
Computer vision models demonstrate excellent real-time performance for automated upper tract urothelial tumor segmentation during ureteroscopy.
Introduction
The gold standard for treating upper tract urothelial carcinoma (UTUC) has historically been nephroureterectomy. 1 However, modern endoscope and laser technologies enable a minimally invasive means of diagnosis and treatment of UTUC. 2 Specifically, advancements in endoscopic visualization and flexible scopes facilitate retrograde intrarenal surgery and laser ablation of tumors. Completely treating tumors, however, can be challenging due to the limited field and depth of view (10 and 6 mm on average, respectively) of current flexible endoscopes. 3 Furthermore, bleeding and debris during treatment can obscure tumors and prevent complete ablation. 4,5 Further improvements in tumor localization and visualization during ablation could improve operative efficiency and tumor treatment.
Computer vision, a subfield of artificial intelligence that focuses on analysis of visual data, may improve endoscopic visualization of upper tract urothelial tumors during endoscopic surgery. Using a labeled set of endoscopic video frames, deep learning-based computer vision models can be trained to automatically identify and track UTUC tumors during endoscopic surgery. Similar tools have shown feasibility in identifying bladder tumors during surveillance cystoscopy. 6,7 Moreover, we have previously demonstrated the feasibility of such techniques during localization and treatment of kidney stones. 8 In a similar manner, computer vision methods could enhance tumor identification and localization during endoscopic treatment.
In this study, we developed and compared computer vision-based models for automated segmentation of UTUC tumors during endoscopic treatment. We additionally evaluated the highest performing model's ability to segment areas of ablated tumor. The highest performing model was also implemented in the operating room, and we assessed real-time performance speed of an automated video overlay of UTUC tumor.
Materials and Methods
Cohort
After approval from the Institutional Review Board, we recorded endoscopic surgical video from 20 patients during UTUC treatment from August 2022 to May 2023 from two institutions. All patients were 18 years of age or older, and either had suspicion for UTUC tumor based on preoperative CT imaging or had been referred for endoscopic management after initial diagnosis. A flexible ureteroscope (Karl Storz FLEX-XC or FLEX-X2S) was used for each case. Tumor position and size information based on preoperative imaging was manually recorded.
Dataset and model training
We recorded raw ureteroscopy videos of UTUC tumor treatment at standard resolution (1920 × 1080 pixels) in 22 patients. All videos were edited to specifically highlight both initial tumor identification and inspection of areas of ablated tumor. The portion of the video that depicted tumor identification included complete endoscopic evaluation of the localized tumor. The portion of the video that illustrated areas of ablated tumor included evaluation of the area immediately after completion of laser treatment. All videos were reviewed for quality, extracted at 5 frames per second (FPS), cropped, and then resized to 512 × 512 pixels. Two videos were excluded for video quality issues. The frames (n = 3387) were manually annotated to identify both UTUC tumors and areas of ablated tumor by two endourologists (Cohen's kappa agreement of 0.87 based on pixel annotation).
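The preprocessing and agreement steps described above can be illustrated with a minimal sketch. It assumes a 30 FPS source recording and a centered square crop, neither of which is stated in this paper, and the function names are hypothetical. Cohen's kappa is computed over flattened binary pixel labels from two annotators.

```python
def sampled_frame_indices(n_frames, source_fps, target_fps=5.0):
    """Return indices of source frames to keep for roughly target_fps output."""
    if source_fps <= 0 or target_fps <= 0:
        raise ValueError("frame rates must be positive")
    # Never step by less than one frame (skipping cannot up-sample a video).
    step = max(1.0, source_fps / target_fps)
    indices = []
    k = 0
    while int(k * step) < n_frames:
        indices.append(int(k * step))
        k += 1
    return indices


def center_crop_box(width, height):
    """Return (left, top, right, bottom) of the largest centered square.

    Assumption: the crop is a centered square; the paper only says 'cropped'.
    """
    side = min(width, height)
    left = (width - side) // 2
    top = (height - side) // 2
    return (left, top, left + side, top + side)


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of binary (0/1) labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equal length")
    n = len(labels_a)
    observed = sum(1 for a, b in zip(labels_a, labels_b) if a == b) / n
    p_a = sum(labels_a) / n  # fraction labeled positive by annotator A
    p_b = sum(labels_b) / n  # fraction labeled positive by annotator B
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)  # chance agreement
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    print(sampled_frame_indices(60, source_fps=30.0))  # every 6th frame
    print(center_crop_box(1920, 1080))
    print(cohens_kappa([1, 1, 0, 0], [1, 0, 0, 0]))
```

The cropped square would then be resized to 512 × 512 pixels with any standard image library before annotation.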
The annotated frames comprised a “ground truth” data set for model training and evaluation. We used MakeSense.ai (San Francisco, CA) for analysis of the annotated frames. 9 We trained two independent models for tumor and ablation segmentation, respectively. We randomly selected 80% of the frames (n = 2969) for supervised model training and reserved 10% for validation and 10% for testing (n = 209 frames each). The validation set was randomly selected to monitor model training performance. Training was a standard iterative pipeline with forward propagation and loss computation, followed by back-propagation with frequent validation checks. Models were independently evaluated for segmentation of both UTUC tumor and areas of ablated tumor.
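A random frame-level split of the kind described above might be sketched as follows. The function name, the seed, and the 1000 placeholder frame IDs are illustrative assumptions, not the study's code or data. With a split by frame, frames from the same video can land in different subsets.

```python
import random


def split_frames(frame_ids, train_frac=0.8, val_frac=0.1, seed=0):
    """Shuffle frame IDs and split them into train/validation/test lists.

    The test set receives whatever remains after the train and validation
    subsets have been taken, so the three subsets always cover all frames.
    """
    ids = list(frame_ids)
    random.Random(seed).shuffle(ids)  # seeded for a reproducible split
    n_train = int(round(train_frac * len(ids)))
    n_val = int(round(val_frac * len(ids)))
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]
    return train, val, test


if __name__ == "__main__":
    # Hypothetical frame IDs, used only to demonstrate the split.
    train, val, test = split_frames(range(1000))
    print(len(train), len(val), len(test))
```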
To evaluate and optimize a computer vision model for segmentation, we compared three state-of-the-art deep neural network architectures: U-Net, U-Net++, and UNeXt. 10–12 U-Net and U-Net++ were implemented in PyTorch using pretrained segmentation models with ResNet backbones. The UNeXt model was constructed based on standard implementations. 13 U-Net is a fast and precise encoder-decoder architecture for high-resolution image segmentation. 10 U-Net++ expanded upon this architecture by replacing U-Net's plain skip connections with nested, dense skip connections for potentially enhanced feature extraction. 11 UNeXt aimed to improve upon U-Net by introducing multilayer perceptron blocks in place of some convolutional blocks to improve performance and drastically increase computational speed. 12 Specifically, we set the hyperparameters as follows: learning rate of 0.001, weight decay of 0.0001 (with the AdamW optimizer), minimum learning rate of 0.00001 (with the CosineAnnealingLR scheduler), binary cross-entropy loss, batch size of 16, and 100 training epochs. Figure 1 shows the workflow for model training and testing.
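To make the learning-rate schedule concrete, the sketch below evaluates the closed-form cosine annealing rule that PyTorch documents for CosineAnnealingLR, using the learning rates reported above. The annealing period (`t_max`) is assumed to equal the 100 training epochs, which the paper does not state, and the function name is hypothetical. The sketch is plain Python and does not train a model.

```python
import math


def cosine_annealing_lr(epoch, base_lr=1e-3, min_lr=1e-5, t_max=100):
    """Learning rate at a given epoch under cosine annealing.

    lr(t) = min_lr + 0.5 * (base_lr - min_lr) * (1 + cos(pi * t / t_max))
    Assumption: t_max is set to the total number of epochs (100).
    """
    cosine = math.cos(math.pi * epoch / t_max)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + cosine)


if __name__ == "__main__":
    for epoch in (0, 25, 50, 75, 100):
        print(epoch, cosine_annealing_lr(epoch))
```

Under these assumptions the rate starts at 0.001, passes through the midpoint of the two rates at epoch 50, and reaches the 0.00001 floor at epoch 100.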

Figure 1. Workflow for model training and testing.
Evaluation metrics
We evaluated the model performance for segmentation of UTUC tumors and areas of ablated tumor. Each model was trained using binary cross-entropy (BCE) loss, which measures the dissimilarity between the automatic and manual segmentations, and compared via the mean Dice similarity coefficient (DSC, a spatial overlap index). 14,15 We evaluated the average area under the receiver operating characteristic curve (AUC-ROC) of the best performing model for both UTUC and ablated tumor segmentation across all annotated frames. We further computed the overall pixel accuracy, sensitivity, and specificity for the best models. All analyses were done using comet.ml (New York, NY).
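The pixel-based metrics named above can be written out in a few lines. This is a minimal plain-Python sketch for a single frame, assuming flattened lists of ground-truth labels (0/1) and predicted probabilities. The function names and the 0.5 binarization threshold are illustrative assumptions rather than the study's evaluation code, and the pairwise AUC computation is quadratic and intended only for small examples.

```python
import math


def bce_loss(probs, truth, eps=1e-7):
    """Mean binary cross-entropy between predicted probabilities and labels."""
    total = 0.0
    for p, t in zip(probs, truth):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(truth)


def dice_coefficient(pred, truth):
    """DSC = 2 * |pred AND truth| / (|pred| + |truth|) for binary masks."""
    overlap = sum(1 for p, t in zip(pred, truth) if p and t)
    denom = sum(pred) + sum(truth)
    return 1.0 if denom == 0 else 2.0 * overlap / denom


def confusion_metrics(pred, truth):
    """Return (pixel accuracy, sensitivity, specificity) for binary masks."""
    tp = sum(1 for p, t in zip(pred, truth) if p and t)
    tn = sum(1 for p, t in zip(pred, truth) if not p and not t)
    fp = sum(1 for p, t in zip(pred, truth) if p and not t)
    fn = sum(1 for p, t in zip(pred, truth) if not p and t)
    accuracy = (tp + tn) / len(truth)
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return accuracy, sensitivity, specificity


def auc_roc(probs, truth):
    """AUC as the chance that a positive pixel outscores a negative one."""
    pos = [p for p, t in zip(probs, truth) if t]
    neg = [p for p, t in zip(probs, truth) if not t]
    if not pos or not neg:
        return float("nan")  # undefined when only one class is present
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0
               for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    truth = [1, 1, 0, 0]
    probs = [0.9, 0.4, 0.6, 0.1]
    pred = [1 if p >= 0.5 else 0 for p in probs]  # assumed 0.5 threshold
    print(bce_loss(probs, truth), dice_coefficient(pred, truth))
    print(confusion_metrics(pred, truth), auc_roc(probs, truth))
```

Per-frame values computed this way would then be averaged across frames to give the mean ± standard deviation figures.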
We further evaluated the model's performance in real time in the operating room. Predicted annotations were displayed using a heatmap showing pixelwise probabilities of segmented tumors and areas of ablation. We assessed the real-time performance speed of an automated video overlay of UTUC tumor in the operating room. We report our findings following the Standardized Reporting of Machine Learning Applications in Urology framework to enhance reproducibility, comparability, and interpretability of our results. 16 In addition, we have made our models available for public use and validation. 17
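As an illustration of how a pixelwise probability map can be turned into an outline for a video overlay, the sketch below thresholds the probabilities and keeps only mask pixels that touch the background or the image border. The 0.5 threshold, the 4-neighbor rule, and the function name are assumptions made for this example. They are not taken from the released models.

```python
def contour_from_probabilities(prob_map, threshold=0.5):
    """Return (mask, contour) as 2D lists of booleans.

    mask[y][x] is True where the probability meets the threshold.
    contour[y][x] is True for mask pixels with at least one 4-neighbor that
    is outside the mask or outside the image.
    """
    height = len(prob_map)
    width = len(prob_map[0]) if height else 0
    mask = [[p >= threshold for p in row] for row in prob_map]
    contour = [[False] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if not mask[y][x]:
                continue
            for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                ny, nx = y + dy, x + dx
                inside = 0 <= ny < height and 0 <= nx < width
                if not inside or not mask[ny][nx]:
                    contour[y][x] = True
                    break
    return mask, contour


if __name__ == "__main__":
    # Toy 5 x 5 probability map with a 3 x 3 high-probability region.
    probs = [[0.9 if 1 <= y <= 3 and 1 <= x <= 3 else 0.1 for x in range(5)]
             for y in range(5)]
    mask, contour = contour_from_probabilities(probs)
    for row in contour:
        print("".join("#" if c else "." for c in row))
```

Drawing only the contour pixels on each frame leaves the interior of the segmented region visible to the surgeon.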
Results
Postoperative biopsy results revealed a mix of low-grade (47%) and high-grade (53%) tumors used during training and analysis. Tumor characteristics are reported in Table 1. Digital and fiberoptic ureteroscopes (Karl Storz FLEX-XC or FLEX-X2S) were used in 17 and 3 cases, respectively. Twelve out of the 20 cases underwent treatment with laser ablation. The average surgical video time of tumor identification was 36 seconds (standard deviation ±58 seconds). The average video time depicting the area of tumor ablation was 24 seconds (standard deviation ±16 seconds). The U-Net model achieved the highest DSC and lowest BCE loss (mean ± standard deviation of 0.78 ± 0.08 and 0.23 ± 0.07, respectively) compared to the U-Net++ (0.73 ± 0.10 and 0.15 ± 0.07) and UNeXt models (0.55 ± 0.09 and 0.40 ± 0.26). Thus, the U-Net model was selected for further testing.
Table 1. Tumor Characteristics from Each Video
The U-Net model demonstrated excellent performance for UTUC tumor segmentation with a mean AUC-ROC of 0.96 ± 0.06 and an overall pixel accuracy, sensitivity, and specificity of 0.98, 0.76, and 0.99, respectively. The model also showed good performance for segmentation of the area of ablated tumor with AUC-ROC of 0.90 ± 0.03 and DSC of 0.50 ± 0.05, and an overall pixel accuracy, sensitivity, and specificity of 0.87, 0.50, and 0.95, respectively (Fig. 2). Example frames from videos containing the raw input, prediction, and corresponding heatmaps are shown in Figure 3. Subanalysis showed decreased model performance for fiberoptic ureteroscope video (AUC-ROC of 0.60 ± 0.09, and DSC of 0.55 ± 0.11).

Figure 2. Summary statistics of the U-Net model during tumor and ablated tissue segmentation. Values include pixel-based evaluation of the model compared with the manually annotated frames. Area under the receiver operating characteristic curve and Dice similarity coefficient values are reported as the average of all annotated frames with standard deviation. Each light blue curve represents model performance for a single frame. Overall model accuracy, sensitivity, and specificity values are reported across annotated frames.

Figure 3. Qualitative results of automated segmentation during endoscopic UTUC surgery demonstrating (left to right) the original endoscopic video image, manual segmentation performed by the surgeons (blue overlay), the contour of the automated segmentation (green overlay), and a heatmap prediction demonstrating the raw probability output per pixel. Blue and red pixels represent low and high probability of correct segmentation, respectively.
During intraoperative testing, the model annotated new videos at 15 FPS and maintained segmentation performance while displaying a video overlay outlining the tumor (Fig. 4).

Discussion
We show that computer vision models can automatically segment UTUC tumors during endoscopic surgery. Specifically, our model demonstrated good segmentation performance for tumors and areas of ablated tumor. We further implemented these models intraoperatively and observed maintained performance while displaying an automated overlay at 15 FPS. Note that we visualize the contour of the automatically segmented tumor to avoid obstructing the video feed during active treatment. Taken together, these findings demonstrate the potential for improved endoscopic visualization during UTUC treatment.
Despite advances in endoscopic technology, UTUC recurrence rates are high. 1,18 Current guidelines recommend endoscopic surgery as the initial management for low-risk disease and for select patients with high-risk disease who cannot undergo radical nephroureterectomy. 1 However, even large and more complex tumors can be effectively managed endoscopically. 19–21 For example, Scotland et al. evaluated the effectiveness of ureteroscopy with laser ablation for UTUC lesions that were larger than 2 cm (∼20% of which were high grade) and found a 65% progression-free survival and 75% overall survival at 5 years. 19 Despite this, endoscopic management of UTUC remains underutilized compared to nephroureterectomy (index case rates of 11% vs 56%), even though it carries a lower risk of complications, renal insufficiency, and dialysis dependence. 20–24
Several challenges complicate endoscopic tumor ablation. Effective tumor ablation depends on endoscopic visibility and localization of tumors, which in the current paradigm are performed manually by the surgeon. Factors such as debris and blood clots limit visibility and impact tumor treatment. In addition, most upper tract tumors are multifocal, and tracking multiple tumors in different parts of the branched renal collecting system is difficult and affected by clinician experience. The limited field and depth of view of current ureteroscopes further restrict visibility. Often, to achieve complete tumor ablation in complex cases, multiple surgeries are required to adequately locate and visualize multiple tumors (or the full extent of larger tumors). 24 Tools that could enhance visualization and localization of tumors and areas of prior ablation, therefore, could improve outcomes after endoscopic UTUC surgery. Moreover, they could be used to improve endoscopic surgical education and assist less experienced surgeons.
We have previously evaluated computer vision models for kidney stone disease and demonstrated good performance both during stone localization and fragmentation. 8,25 In addition, we have previously identified automated performance metrics from computer vision models and associated them with surgeon experience during endoscopic stone surgery. 26 Moreover, similar computer vision models have been applied to endoscopic evaluation and treatment of bladder cancer. 27,28 Shkolyar et al. demonstrated a computer vision-based deep learning algorithm that could accurately detect bladder cancer during white light cystoscopy. 28 The algorithm has the potential to aid in training and clinical decision-making during cystoscopy and transurethral resection of bladder tumor by leveraging widely available software, rather than specialized equipment. Similarly, our models show the potential for improving decision-making during endoscopic UTUC treatment. Furthermore, in the future, this technology could be integrated with endoscopic robotic platforms and inform automated processes for tumor detection and treatment.
There are several limitations in this study. First, our model is trained on a limited dataset. The training set may not robustly represent the variety of endoscopic findings that could be seen during UTUC surgery. However, the model performed well despite the limited number of videos. More videos could be integrated in the workflow for model training to improve performance in the future. In addition, most videos used for training were from digital ureteroscopes. This likely explains the decrease in model performance during segmentation of fiberoptic videos. Newer ureteroscope technologies have been developed with an emphasis on digital video (i.e., disposable scopes) due to improved video quality. Likewise, we often utilize digital scopes for endoscopic UTUC treatment to improve visualization. Still, future models could incorporate more fiberoptic videos for generalizability.
In this study, we used tumors identified by expert surgeons as the ground truth to train our model. We thus do not know whether our model can detect tumors that are difficult or ambiguous to detect with the human eye, and further testing would be required to evaluate the model's ability to identify these lesions. Moreover, the model currently segments surgical video at 15 FPS in the operating room using conventional graphics cards, which can lead to a slight lag in overlaying the outline intraoperatively. However, the model can run on more advanced graphics cards (e.g., NVIDIA RTX A6000 Ada GPU) at 30 FPS or greater, which would improve the speed and make accurate real-time segmentation feasible. Finally, the current model was trained from single segmented frames without temporal context.
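The link between segmentation rate and overlay lag can be illustrated with simple arithmetic. The sketch below converts a frame rate into a per-frame time budget and, assuming a hypothetical 30 FPS display feed (the feed rate is not reported in this study), estimates how many displayed frames share one segmentation result.

```python
def frame_budget_ms(fps):
    """Time available to process one frame, in milliseconds."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps


def display_frames_per_prediction(inference_fps, display_fps=30.0):
    """Displayed frames shown per model prediction.

    Assumption: display_fps is a hypothetical feed rate, not a study value.
    """
    return display_fps / inference_fps


if __name__ == "__main__":
    for fps in (15, 30):
        print(fps, "FPS ->", round(frame_budget_ms(fps), 1), "ms per frame;",
              display_frames_per_prediction(fps), "displayed frame(s) per prediction")
```

Under that assumption, a model running at 15 FPS has about 66.7 ms per frame and refreshes the outline on every second displayed frame, whereas 30 FPS corresponds to about 33.3 ms per frame and one prediction per displayed frame.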
In future work, we will further improve the models by incorporating temporal context from consecutive frames, which can improve performance in frames with partial or full occlusions. Despite these limitations, our study demonstrates the potential of computer vision models for segmentation of UTUC tumors during endoscopic surgery. Larger and more varied datasets will allow for more clinical applications of the models.
Conclusion
We have developed and evaluated a computer vision model that can segment tumors and areas of ablated tissue during endoscopic surgery for UTUC. The model was implemented intraoperatively and showed good performance. Moreover, our intraoperative demonstration of a video overlay shows the potential for enhancing endoscopic visualization of UTUC, which could improve outcomes and reduce tumor recurrence.
Authors' Contributions
D.L.: methodology, software, formal analysis, and writing-original draft; A.R.: data curation and writing-original draft; N.P.: data curation and writing-original draft; A.N.L.: data curation, conceptualization, and investigation; M.P.: data curation, writing-review and editing, and project administration; N.S.: conceptualization, data curation, and visualization; I.O.: conceptualization, methodology, writing-review and editing, supervision, and funding acquisition; N.K.: conceptualization, data curation, methodology, writing-review and editing, supervision, and funding acquisition.
Author Disclosure Statement
D.L., N.K., and I.O. have funding through the National Institutes of Health. N.S. has funding through the Department of Defense.
Funding Information
The work was funded by Training Program for Innovative Engineering Research in Surgery and Intervention Project No. 3T32EB021937-08S1 (I.O. and D.L.), NIDDK R21DK133742: A Navigational System for Endoscopic Kidney Stone Surgery (N.K. and I.O.), and Paracelsus Medical University Research and Innovation Fund (2022-FIRE-004, M.P.).
