Abstract
Background:
Early intravesical recurrence after transurethral resection of bladder tumors (TURBT) is often caused by overlooking of tumors during TURBT. Although narrow-band imaging and photodynamic diagnosis were developed to detect more tumors than conventional white-light imaging, the accuracy of these systems has been subjective, along with poor reproducibility due to their dependence on the physician's experience and skills. To create an objective and reproducible diagnosing system, we aimed at assessing the utility of artificial intelligence (AI) with Dilated U-Net to reduce the risk of overlooked bladder tumors when compared with the conventional AI system, termed U-Net.
Materials and Methods:
We retrospectively obtained cystoscopic images by converting videos obtained from 120 patients who underwent TURBT into 1790 cystoscopic images. The Dilated U-Net, which is an extension of the conventional U-Net, analyzed these image datasets. The diagnostic accuracies of the Dilated U-Net and conventional U-Net were compared by using the following four measurements: pixel-wise sensitivity (PWSe); pixel-wise specificity (PWSp); pixel-wise positive predictive value (PWPPV), representing the AI diagnostic accuracy per pixel; and dice similarity coefficient (DSC), representing the overlap area between the bladder tumors in the ground truth images and segmentation maps.
Results:
The cystoscopic images were divided as follows, according to the pathological T-stage: 944, Ta; 412, T1; 329, T2; and 116, carcinoma in situ. The PWSe, PWSp, PWPPV, and DSC of the Dilated U-Net were 84.9%, 88.5%, 86.7%, and 83.0%, respectively, which were higher than those of the conventional U-Net by 1.7%, 1.3%, 2.1%, and 2.3%, respectively. The DSC values were high for elevated lesions and low for flat lesions for both Dilated and conventional U-Net.
Conclusions:
Dilated U-Net, with higher DSC values than conventional U-Net, might reduce the risk of overlooking bladder tumors during cystoscopy and TURBT.
Introduction
Bladder tumors have a high intravesical recurrence rate of 31% to 78% after transurethral resection of bladder tumor (TURBT). 1 Even for low-grade and low T stage tumors, follow-up cystoscopy is recommended for early detection of recurrent tumors after surgery. 2,3 Overlooking bladder tumors during follow-up cystoscopy after TURBT may lead to tumor upstaging and upgrading and is strongly related to a worsened prognosis. 4 An overlooked tumor necessitates additional TURBT, radical cystectomy, and chemotherapy. Preventing overlooking of tumors is important to mitigate the patient's burden and contributes to reducing medical costs, given that bladder tumors have higher lifetime treatment costs than other tumor types. 5
Early recurrence of a tumor after TURBT relates to inadequate observation and incomplete resection during TURBT. 6 Although conventional white-light imaging (WLI) cystoscopy is the standard method to detect bladder tumors, its diagnostic sensitivity and specificity range from 60% to 70%, 7 and WLI cystoscopy occasionally overlooks lesions in 10% to 20% of patients. 8 Narrow-band imaging and photodynamic diagnosis (PDD), developed to improve bladder tumor detection and reduce the recurrence after TURBT, are also less objective and less reproducible.
Before PDD surgery, patients are required to take an oral dose of photoactive porphyrin precursors that occasionally causes adverse effects, including hepatic dysfunction and nausea. 9
An artificial intelligence (AI) diagnostic system for bladder tumors with highly accurate and reproducible results is desirable to prevent overlooking lesions when performing cystoscopy or TURBT. AI diagnostic systems have been reported to be effective for pathological diagnoses and fundus examinations, with a diagnostic accuracy comparable to that of an expert. 10–13 Segmentation is a commonly used AI method to detect and present the location and shape of the object in the image by distinguishing every pixel as true or false.
Among the AI-based segmentation methods, U-Net is a well-known AI segmentation network that achieves a good performance in biomedical image segmentation. 14 Further, the combination of dilated convolution layers with conventional deep neural networks improves segmentation accuracy. 15 This technique may be advantageous for distinguishing bladder tumors of various sizes, numbers, and morphologies in an image.
However, it is unknown whether U-Net with dilated convolutions, termed Dilated U-Net, improves the diagnostic accuracy for bladder tumors in cystoscopic images.
We aimed at assessing the accuracy of the dilated convolution segmentation method to reduce the number of overlooked tumors in cystoscopic images and at comparing its accuracy with that of the conventional U-Net.
Materials and Methods
Image set preparation
Among the patients who underwent TURBT at Kyushu University Hospital from April 2014 to December 2019, we included those with TURBT videos. We excluded those for whom the video was unavailable due to incorrect storage, or if tumor observations were disturbed due to hemorrhage, halation, or overlapping with the endoscopic instrument during TURBT. TURBT was performed by using a resectoscope (OES Pro; Olympus Medical System, Co., Ltd., Tokyo, Japan).
To obtain bladder tumor and normal mucosa images, we converted the TURBT videos into frame images. The converted images were automatically extracted every 5 to 10 frames by using conversion software that we created using the Python programming language. All extracted images were cropped and resized to 512 × 512 pixels. After resizing, we chose appropriate images for building the segmentation system by using the following inclusion criteria.
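The frame extraction and resizing step can be sketched as follows. This is a minimal illustration only: the authors' conversion software is not described in detail, so the function names, the centered square crop, and the nearest-neighbor interpolation are all assumptions made for the example.

```python
def sample_frame_indices(total_frames, step):
    """Return the indices kept when every `step`-th frame is extracted."""
    if step < 1:
        raise ValueError("step must be >= 1")
    return list(range(0, total_frames, step))


def center_square_box(width, height):
    """Return (left, top, right, bottom) of the largest centered square.

    Assumption: the crop is a centered square; the text does not say how
    the frames were cropped.
    """
    side = min(width, height)
    left = (width - side) // 2
    top = (height - side) // 2
    return left, top, left + side, top + side


def resize_nearest(image, size):
    """Nearest-neighbor resize of a 2D list-of-lists to size x size.

    Assumption: nearest-neighbor is used only to keep the sketch short.
    """
    src_h, src_w = len(image), len(image[0])
    return [
        [image[r * src_h // size][c * src_w // size] for c in range(size)]
        for r in range(size)
    ]
```

With a step between 5 and 10, `sample_frame_indices` gives the frames to keep, and each kept frame would then be cropped and resized with `size=512`.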
Clear images were those in which the tumor and its surface were clearly observed and recognized. Blurred images were those in which the tumor or normal mucosa surface was unfocused but could be recognized. We excluded poor-quality images in which it was difficult to recognize the tumor surface or normal mucosa due to lack of focus. Multiple images were occasionally obtained from one tumor because the location, angle of tumor observation, and tumor size varied depending on how the inspectors observed it. Therefore, all these images were considered different. The images that met the inclusion criteria were defined as original images.
We created ground truth images from the original images by delineating the tumor's margins using GNU Image Manipulation Program software version 2.10. 16 All tumors were categorized into four classes based on pathological diagnosis made by pathologists in our hospital: Ta, T1, T2 or more, and carcinoma in situ. All processes were manually and continuously conducted by one urologist with TURBT certification (Fig. 1).

TURBTs and image acquisition processes. TURBT consists of five main steps. We have converted videos of the first and second steps into images before the tumor is resected (red rectangle). We have included clear and some blurred images (red circles) for AI training and testing. White represents the tumor area, and black the non-tumor area. AI = artificial intelligence; TURBT = transurethral resection of bladder tumor.
We performed data augmentation by using horizontal flip, vertical flip, and random rotation to increase image variation and number.
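For segmentation, each geometric transform has to be applied identically to the image and to its ground truth mask. The sketch below illustrates this with flips and rotations on small nested lists. The function names are hypothetical, and restricting the random rotation to multiples of 90 degrees is an assumption made to keep the example dependency-free; the rotation angles actually used are not stated.

```python
import random


def hflip(img):
    """Mirror each row (horizontal flip)."""
    return [row[::-1] for row in img]


def vflip(img):
    """Reverse the row order (vertical flip)."""
    return [row[:] for row in img[::-1]]


def rot90(img):
    """Rotate a 2D list-of-lists by 90 degrees clockwise."""
    return [list(col) for col in zip(*img[::-1])]


def augment_pair(image, mask, rng):
    """Apply the same random flips and rotation to an image and its mask."""
    if rng.random() < 0.5:
        image, mask = hflip(image), hflip(mask)
    if rng.random() < 0.5:
        image, mask = vflip(image), vflip(mask)
    # Assumption: rotation by a random multiple of 90 degrees.
    for _ in range(rng.randrange(4)):
        image, mask = rot90(image), rot90(mask)
    return image, mask
```

Passing a seeded `random.Random` instance as `rng` makes the augmentation reproducible.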
AI segmentation system architecture
Conventional U-Net includes four consecutive down-sample blocks followed by four consecutive up-sample blocks, termed an encoder–decoder network architecture (Fig. 2a). The down-sample blocks are typically convolution networks and include a max pooling operation to extract useful features from the original images, 17 which were resized to 128 × 128 pixels. We implemented a 3 × 3 kernel size with a one-pixel stride moving window across the network, with a rectified linear unit (ReLU) activation function in the convolutional layer.

The up-sample block receives this extracted information and acquires spatial information by connecting with the higher-resolution feature maps in the decoder parts. The segmentation results are then displayed. As activation functions, we implemented Leaky ReLU in the up-sample blocks and sigmoid functions in the final convolution layer.
The settings of the hyperparameters have an impact on the accuracy of U-Net. 18 In the present study, we constructed the Dilated U-Net architecture by incorporating dilated convolution layers into the U-Net (Fig. 2a), and we compared its accuracy with that of hyperparameter-tuned U-Net. All network parameters of Dilated U-Net were fine-tuned by using the Adam solver 19 with the learning rate set to 1e-3, batch size to 32, and the maximum epoch count to 200. The kernel used in the dilated convolution layers is formed by inserting a zero between adjacent elements of the original 5 × 5 kernel (Fig. 2b). These dilated kernels capture features over a wider area of the original images without resolution loss. 15
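The zero-insertion idea behind a dilated kernel can be illustrated with a short sketch. The helper names are hypothetical, and the 3 × 3 example kernel is an assumption chosen for brevity; kernel size and dilation rate are left as parameters. A kernel of side k dilated at rate r covers a side of k + (k − 1)(r − 1) pixels while keeping the same number of nonzero weights.

```python
def dilate_kernel(kernel, rate):
    """Insert (rate - 1) zeros between neighboring weights of a square kernel."""
    k = len(kernel)
    size = k + (k - 1) * (rate - 1)
    out = [[0.0] * size for _ in range(size)]
    for i in range(k):
        for j in range(k):
            out[i * rate][j * rate] = kernel[i][j]
    return out


def conv2d_valid(image, kernel):
    """Plain 'valid' cross-correlation of a 2D image with a 2D kernel."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(
                image[r + i][c + j] * kernel[i][j]
                for i in range(kh)
                for j in range(kw)
            )
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


# Example: a 3 x 3 kernel of ones dilated at rate 2 spans 5 x 5 pixels.
ones3 = [[1.0] * 3 for _ in range(3)]
dilated = dilate_kernel(ones3, 2)
```

Applying `conv2d_valid` with `dilated` gives the same result as a dilated convolution with the original kernel, which is why the receptive field widens without any pooling or loss of resolution.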
Training and validation
Overfitting, in which a model shows good accuracy on training data but poor accuracy on test data, is a problem encountered during AI training. 20 Increasing the number of epochs can cause overfitting, because the model continues learning after some epochs; the validation error increases whereas the training error continues to decrease. To prevent overfitting, either the early stopping method, which stops training when accuracy on the test data begins to deteriorate, or the dropout method, which randomly drops units and their connections from the neural network, is often used. 20 Here, we used the early stopping method to prevent overfitting.
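A minimal sketch of an early stopping rule is shown below. The stopping criterion used in the study is not specified, so the function name and the `patience` parameter (the number of consecutive epochs without improvement that is tolerated) are assumptions for illustration.

```python
def run_with_early_stopping(val_losses, patience=3):
    """Return (best_epoch, stopped_epoch) for a sequence of validation losses.

    Training is considered stopped once the validation loss has failed to
    improve for `patience` consecutive epochs (hypothetical criterion).
    """
    best_loss = float("inf")
    best_epoch = -1
    stale = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, stale = loss, epoch, 0
        else:
            stale += 1
            if stale >= patience:
                return best_epoch, epoch
    # No early stop was triggered; training ran through every epoch.
    return best_epoch, len(val_losses) - 1
```

In practice the model weights saved at `best_epoch` would be the ones kept for evaluation.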
Further, overfitting may occur when the dataset is small, because the learning accuracy depends on how the dataset is divided into training and test data. To evaluate the accuracy of U-Net and Dilated U-Net, we used the cross-validation method, a useful data resampling method to assess the generalization ability of predictive models and prevent overfitting. 21 We divided our image set into four subsets such that each subset was allocated an identical ratio of T stages. Three subsets were used as training data, and the fourth subset was used as validation data.
Training and validation were repeated four times, with each of the four subsets used once as validation data. The accuracy results were presented as the average of the fourfold cross-validation method.
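The fourfold split with matched T-stage ratios can be sketched as a stratified partition. The function names and the round-robin assignment are assumptions for illustration, and the sketch splits at the image level because the text does not state whether images were grouped by patient.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds=4, seed=0):
    """Partition sample indices into folds with near-equal class ratios."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(n_folds)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal each class out round-robin so every fold gets its share.
        for pos, idx in enumerate(indices):
            folds[pos % n_folds].append(idx)
    return folds


def cross_validation_splits(folds):
    """Yield (train_indices, validation_indices), one pair per fold."""
    for k, validation in enumerate(folds):
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, validation
```

Each metric would be computed on the validation fold of every split and then averaged over the four splits.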
Evaluation metrics
We assessed the segmentation accuracy by using a dice similarity coefficient (DSC) that measured the overlap area between bladder tumors in the original images and segmentation maps. 22 We calculated the following: pixel-wise sensitivity (PWSe), the proportion of tumor pixels correctly diagnosed using AI from the original images; pixel-wise specificity (PWSp), the proportion of normal pixels that the AI correctly diagnosed from the original images; and pixel-wise positive predictive value (PWPPV), the proportion of pixels diagnosed using AI as tumor pixels that were truly tumor pixels. The formula used for the calculation is given next:
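\[
\mathrm{PWSe} = \frac{TP}{TP + FN}, \qquad
\mathrm{PWSp} = \frac{TN}{TN + FP}, \qquad
\mathrm{PWPPV} = \frac{TP}{TP + FP}, \qquad
\mathrm{DSC} = \frac{2\,TP}{2\,TP + FP + FN}
\]
(TP, true positive; TN, true negative; FP, false positive; FN, false negative).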
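The four metrics defined above can be computed from a binary ground truth mask and a binary predicted mask, as in the following sketch. The function names are hypothetical, masks are assumed to be nested lists of 0 and 1, and the DSC is written in its standard pixel-count form, 2TP / (2TP + FP + FN).

```python
def confusion_counts(truth, pred):
    """Count TP, TN, FP, and FN pixels for two binary masks of equal shape."""
    tp = tn = fp = fn = 0
    for t_row, p_row in zip(truth, pred):
        for t, p in zip(t_row, p_row):
            if t and p:
                tp += 1
            elif not t and not p:
                tn += 1
            elif p:
                fp += 1
            else:
                fn += 1
    return tp, tn, fp, fn


def _ratio(numerator, denominator):
    """Divide, returning NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def segmentation_metrics(truth, pred):
    """Return PWSe, PWSp, PWPPV, and DSC as fractions between 0 and 1."""
    tp, tn, fp, fn = confusion_counts(truth, pred)
    return {
        "PWSe": _ratio(tp, tp + fn),
        "PWSp": _ratio(tn, tn + fp),
        "PWPPV": _ratio(tp, tp + fp),
        "DSC": _ratio(2 * tp, 2 * tp + fp + fn),
    }
```

Returning NaN for an empty denominator is a design choice for the sketch, for example when an image contains no tumor pixels and PWSe is undefined.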
Statistical analysis
We compared the accuracy between Dilated U-Net and the conventional U-Net by using the Wilcoxon signed rank test. All tests were two-sided, and a p-value <0.05 was considered statistically significant. All statistical analyses were performed by using JMP Pro15 software for Macintosh (version 15.1.0; SAS Institute, Inc., Cary, NC).
Ethics
The study was performed in line with the principles of the Declaration of Helsinki, as revised in 2013. Approval was granted by the institutional review board of Kyushu University Hospital (no. 30–580). The requirement for obtaining informed consent from the patients was waived, and an opt-out approach was used because this was a retrospective study.
Results
Table 1 shows the characteristics of the patients and tumors. We obtained 1790 histologically confirmed bladder tumor cystoscopic images from 120 patients; 1464 images (81.8%) contained non-muscle invasive bladder tumor of stages pTis, pTa, and pT1, and 329 images (18.4%) contained muscle invasive bladder tumor of stage pT2. Elevated lesions were obtained from 1573 images (87.9%) and flat lesions from 161 images (9.0%). Further, 56 images (3.1%) had coexistence of elevated and flat lesions.
Table 1. Patient and Tumor Characteristics in the Image Set
Table notes: There is some overlapping. The World Health Organization 2004/2016 classification systems.
The PWSe, PWSp, PWPPV, and DSC of the Dilated U-Net were 84.9%, 88.5%, 86.7%, and 83.0%, respectively. These values were higher by 1.7%, 1.3%, 2.1%, and 2.3%, respectively, than those with conventional U-Net, and the differences between the two methods were significant (Fig. 3a).

A comparison of the DSC by tumor morphology according to U-Net and the Dilated U-Net is shown in Figure 3b. The DSC of elevated tumors, flat tumors, and mixed tumors with the Dilated U-Net was 85.7%, 57.3%, and 78.9%, respectively. The DSC with Dilated U-Net had improved by 2.6% and 2.9% for elevated and mixed lesions, respectively, when compared with conventional U-Net. In contrast, the DSC for flat tumors with U-Net was higher by 2.2% compared with that with Dilated U-Net. The DSC for flat lesions was lower than that for elevated lesions with both methods.
Representative cases of bladder tumor detection using U-Net and Dilated U-Net are shown in Figure 4.

Representative bladder tumor detection by U-Net or Dilated U-Net. The blue, yellow, and gray regions represent the true positive, false negative, and false positive diagnoses performed by the two methods, respectively. Dilated U-Net detected tumors more correctly than U-Net (
Discussion
Our Dilated U-Net method demonstrated a higher tumor segmentation accuracy than the conventional U-Net method for diagnosing bladder tumors in cystoscopic images. The combination of U-Net with dilated convolution was beneficial and distinguished more tumors in cystoscopic images than the conventional U-Net. The Dilated U-Net might help reduce the incidence of early recurrence of bladder tumors after TURBT.
U-Net and Dilated U-Net are segmentation methods that use deep learning to mimic the structure of brain neurons and represent them digitally. They can learn from large amounts of input data and extract features from the data independently by using multiple network layers. Increasing the amount of training data generally improves the AI accuracy by extracting more features. 21 Thus, accumulating additional cystoscopic images can improve the accuracy of our segmentation results; however, obtaining many images from a single institution and creating the corresponding ground truth images are arduous tasks. Alternatively, improving the AI architecture can enhance segmentation accuracy. 15 We improved the U-Net architecture by inserting dilated convolution.
Dilated convolution can help improve segmentation accuracy. Sun and colleagues reported a segmentation method to detect colorectal polyps in colonoscopic images by using dilated convolution 23 and have shown an improvement in DSC of 8.24% compared with that with U-Net. Gridach and Voiculescu reported a dilated convolution method that combines various dilation rate filters with dilation rates of 1 to 4. 24
We used filters with a dilation rate of 2; thus, our Dilated U-Net architecture could be improved by combining various dilation filters without increasing computation costs. This technology can contribute to improved segmentation accuracy and be applied for bladder tumor segmentation.
A few studies have reported the feasibility of dilated convolution or segmentation by using AI for cystoscopic images. Negassi and colleagues reported a U-Net segmentation system to detect bladder tumors in cystoscopic images with a DSC of 0.67. 25 Although a direct comparison with our results was difficult due to differences in the datasets and segmentation targets, the accuracy of our segmentation system was satisfactory.
Gosnell and colleagues developed a triage system that performs automated analysis of cystoscopy images classified as normal or diseased based on a specialized color segmentation system. 26 Their system achieved a zero false-negative rate and identified the image that included the tumor, whereas their system's false-positive rate was 50% due to inflammation and scarring in their benign cystoscopic images that resembled malignant tumors. They identified inconsistent brightness in cystoscopic images as a segmentation-related problem.
The light source's illumination intensity is affected by the distance and angle of the probe in relation to the object. Nonuniformity of illumination should be minimized to obtain good segmentation. 26 They overcame this issue by performing image processing. Brightness values differed in our images as well; thus, performing image processing might improve the accuracy of our method.
Although Dilated U-Net and conventional U-Net detected the flat lesion location in the images, correctly displaying the tumor margin was more difficult in flat lesions than in elevated lesions (Fig. 4w–x), which might have been caused by data bias. 27 Our dataset included a limited number of suitable flat lesion images for AI training, because our video-to-image conversion method resulted in images that were unfocused or blurred, especially in the case of flat lesions.
Moreover, the small number of flat lesion images in the training data resulted in image feature congruency-related problems in the test data. Therefore, the small dataset of flat lesions might worsen the segmentation accuracy using cystoscopic images. 27 A larger number of high-quality flat lesion images might help correctly distinguish the tumor margins. To numerically compensate for the small number of flat lesion images, other data augmentation techniques including rotation, tilt and skew, 28 or an AI method termed Generative Adversarial Network, in which the AI learns features from the training data and creates an artificial image that resembles the training data, should be considered. 29
Our study had several limitations. First, we retrospectively collected the cystoscopic images from a single institution, and the sample size was small. Prospective clinical trials that include multiple institutions and compare the accuracy of AI methods with that of an expert in live TURBT surgery are needed. Second, one urologist conducted image selection and data annotation, possibly causing an imbalance of T-classification among the dataset and selection bias.
Finally, our image set included blurred images in which it was difficult to recognize the tumor surface clearly. Accuracy can be improved by preparing a dataset that includes only clear images 30 ; however, consistently obtaining clear images of the inside of the bladder is impossible in a clinical setting. Thus, we also included blurred images. In addition, sequential images generated from the video could be similar to each other. These similar sequential images were regarded as different if the location of the tumor was slightly altered in each image.
Consequently, the segmentation accuracy may not be correlated with the total number of images because the same features are extracted from similar images. Thus, our results should be interpreted with caution, and more investigations regarding AI diagnosis for patients with bladder tumor will be necessary.
Conclusion
The highly accurate detection of bladder tumors achieved by combining conventional U-Net with dilated convolution, termed Dilated U-Net, might contribute to reducing the recurrence of bladder tumors by preventing instances of overlooked tumors during cystoscopy.
Footnotes
Authors’ Contributions
All authors contributed to project development. J.M. performed data and image collection and created the ground truth images. A.U., S.M., K.M., and R.K. created the source code. Y.O. and F.K. conducted the pathological diagnoses. All authors analyzed the data. J.M. wrote the first draft of the article. The article was edited by K.M., A.U., S.K., J.I., and M.E. All authors read and approved the final article. All authors share collective responsibility and accountability for the results.
Acknowledgments
The authors would like to thank Editage (
Author Disclosure Statement
No competing financial interests exist.
Funding Information
This work was supported by the Japanese Foundation for Research and Promotion of Endoscopy. The funder had no role in the study design, data collection, data analysis, interpretation, or writing of the article.
