Abstract
Introduction:
Previous systematic reviews related to machine learning (ML) in urology often overlooked the literature related to endourology. Therefore, we aim to conduct a more focused systematic review examining the use of ML algorithms for the management of benign prostatic hyperplasia (BPH) or urolithiasis. In addition, we are the first group to evaluate these articles using the Standardized Reporting of Machine Learning Applications in Urology (STREAM-URO) framework.
Methods:
Searches of MEDLINE, Embase, and the Cochrane CENTRAL databases were conducted from inception through July 12, 2021. Keywords included those related to ML, endourology, urolithiasis, and BPH. Two reviewers screened the citations that were eligible for title, abstract, and full-text screening, with conflicts resolved by a third reviewer. Two reviewers extracted information from the studies, with discrepancies resolved by a third reviewer. The data collected were then qualitatively synthesized by consensus. Two reviewers evaluated each article according to the STREAM-URO checklist with discrepancies resolved by a third reviewer.
Results:
After identifying 459 unique citations, 63 articles were retained for data extraction. Most articles consisted of tabular (n = 32) and computer vision (n = 23) tasks. The two most common problem types were classification (n = 40) and regression (n = 12). In general, most studies utilized neural networks as their ML algorithm (n = 36). Among the 63 studies retrieved, 58 were related to urolithiasis and 5 focused on BPH. The urolithiasis studies were designed for outcome prediction (n = 20), stone classification (n = 18), diagnostics (n = 17), and therapeutics (n = 3). The BPH studies were designed for outcome prediction (n = 2), diagnostics (n = 2), and therapeutics (n = 1). On average, the urolithiasis and BPH articles met 13.8 (standard deviation 2.6) and 13.4 (4.1) of the 26 STREAM-URO framework criteria, respectively.
Conclusions:
The majority of the retrieved studies effectively helped with outcome prediction, diagnostics, and therapeutics for both urolithiasis and BPH. While ML shows great promise in improving patient care, it is important to adhere to the recently developed STREAM-URO framework to ensure the development of high-quality ML studies.
Introduction
Artificial intelligence (AI) involves testing and training computerized algorithms that aim to simulate human cognitive functions such as problem solving and learning. The applications of AI within the medical field include but are not limited to the diagnosis, management, and outcome prediction of health conditions. Machine learning (ML) is a subtype of AI that utilizes dynamic algorithms to analyze complex patterns and problems to then generate useful predictive outputs. ML can be categorized into supervised, unsupervised, and reinforcement learning approaches. 1 A supervised algorithm refers to one that is trained on a prelabeled dataset and is designed to solve classification or regression problems. 2
In contrast, an unsupervised algorithm does not rely on the labeling of data when generating predictions, as it learns to recognize patterns from the input data on its own. 2 Unsupervised algorithms are often used for clustering problems, which relate to the grouping of data based on their similarities and differences. Finally, reinforcement learning operates by trial and error to fine-tune an algorithm's parameters so that it can achieve its designated goal. A more detailed explanation of these learning approaches can be found in the Supplementary Data S1.
ML has been widely adopted within the field of urology to help with the diagnosis, outcome prediction, and management of urologic conditions. 3,4 The number of studies utilizing ML to advance the field of urology is increasing. Given this recent surge, it is especially important for clinicians to better understand the fundamentals of the technology and learn how it can be applied to their clinical practices. Previous systematic reviews related to the application of AI in urology have been conducted. 5 –7 However, these reviews were either nonexhaustive or overlooked the literature related to endourology. Therefore, we sought to conduct a more focused systematic review examining the use of ML algorithms specifically for patients with benign prostatic hyperplasia (BPH) and urolithiasis. In addition, we are the first group to evaluate the quality of these articles using the newly developed Standardized Reporting of Machine Learning Applications in Urology (STREAM-URO) framework. 8
Methods
We conducted a systematic review according to a prespecified protocol, with reporting according to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) statement.
Search strategy
The search strategy was developed with the help of an experienced librarian (Lucy Kiester) and reviewed by a clinical expert (N.B.). References were identified through searches of MEDLINE, Embase, and the Cochrane CENTRAL databases from inception through July 12, 2021. Keywords and Medical Subject Heading terms searched included those related to ML, endourology, urolithiasis, and BPH. Additional information related to the search strategy used can be found in the Supplementary Data S1.
Study selection
Original research articles published in peer-reviewed journals discussing the application of ML for urolithiasis or BPH were included without any language restrictions. Studies were excluded if they (1) did not address the application of ML for urolithiasis or BPH; (2) were related to prostate cancer; (3) were reviews, case reports, commentaries, or conference proceedings; or (4) were not published in the English language. Studies related to prostate cancer were specifically excluded as there is often diagnostic confusion with BPH, and these were beyond the scope of this review. Two reviewers independently screened the citations that were considered eligible for title, abstract, and full-text screening, with conflicts resolved by a third reviewer.
Data extraction
Data related to general study and population characteristics were collected. In addition, data related to the ML algorithms used and their applications were retrieved. The complete list of information extracted can be found in the Supplementary Data S1. Two independent reviewers were responsible for extracting information from each included study. Any discrepancies in data extraction were resolved with the help of a third reviewer. In addition, two reviewers well-versed in ML (X.H.L. and W.X.L.) reviewed and assessed the data extracted from the included articles. These reviewers have experience with developing ML algorithms and have published ML studies in the past. 9 The data collected were then qualitatively synthesized by consensus.
STREAM-URO assessment
Each retained article was evaluated according to the STREAM-URO framework, which is a 26-item checklist designed to promote and ensure the development of standardized and high-quality studies within the urologic community. 8 Two independent reviewers were responsible for grading each included study. Discrepancies in grading were resolved with a third reviewer.
Results
Study selection
The initial search identified a total of 615 references. After removing all duplicate references, 459 unique citations remained. From this list, 93 articles remained following the initial title/abstract screening. Following the full-text review, 63 articles were retained for data extraction (Fig. 1).

Fig. 1. PRISMA flow diagram of study selection. PRISMA = Preferred Reporting Items for Systematic Reviews and Meta-Analyses.
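The screening counts above imply how many records were dropped at each stage. The short sketch below only restates that arithmetic; the variable names are illustrative and are not taken from the review protocol.

```python
# Counts reported in the study-selection paragraph.
identified = 615            # references returned by the initial search
unique = 459                # citations left after removing duplicates
after_title_abstract = 93   # articles kept after title/abstract screening
included = 63               # articles retained after full-text review

# Records dropped at each stage are the differences between consecutive counts.
duplicates_removed = identified - unique
excluded_title_abstract = unique - after_title_abstract
excluded_full_text = after_title_abstract - included

print(duplicates_removed, excluded_title_abstract, excluded_full_text)
# prints: 156 366 30
```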
Study characteristics
Among the 63 articles retrieved, most studies consisted of tabular (n = 32) and computer vision (n = 23) tasks. Other tasks included signal processing (n = 5), natural language processing (NLP; n = 2), and time series modeling (n = 1). The two most common problem types were classification (n = 40) and regression (n = 12). Other problem types encountered were segmentation (n = 5), object detection (n = 4), and entity recognition (n = 2). In general, most studies utilized neural networks (NN) as their ML algorithm (n = 36). Alternative algorithms included support vector machines (SVMs; n = 6), linear models (n = 2), nearest neighbors (n = 1), ensemble learning (n = 1), boosting (n = 1), and decision trees (n = 1), among many others (n = 15).
Moreover, among the 63 studies retrieved, 58 were related to urolithiasis, and 5 to BPH. With regard to the clinical applications of the studies related to urolithiasis, 20 were designed for outcome prediction (Table 1), 18 were related to stone classification (Table 2), 17 studies aided with diagnostics (Table 3), and 3 focused on therapeutics (Table 4). Among the studies related to BPH, two aimed to help with outcome prediction, two with diagnostics, and one with treatment (Table 5).
Table 1. Applications of Machine Learning for Outcome Prediction of Kidney Stone Disease
3D = three-dimensional; ANN = artificial neural network; AUC = area under the curve; BMI = body mass index; COC = coefficient of correlation; DL = deep learning; HRQoL = health-related quality of life; kNN = k-nearest neighbors; LightGBM = light gradient boosting machine; LR = logistic regression; ML = machine learning; MLP = multilayer perceptron; MVRA = multivariate regression analysis; NN = neural networks; PCNL = percutaneous nephrolithotomy; RF = random forest; RS = reference standard; SFS = stone-free status; SMOreg = sequential minimal optimization regression; SVM = support vector machine; SWL = shockwave lithotripsy; TA = texture analysis; TUL = transurethral lithotripsy; XGBoost = extreme gradient boosting trees.
Table 2. Applications of Machine Learning for the Classification of Kidney Stones
COD = calcium-oxalate dihydrate; COM = calcium-oxalate monohydrate; EHR = electronic health records; NB Tree = Naive-Bayes Tree; NLP = natural language processing; PCR = principal component regression; PLS = partial least squares; PPV = positive predictive value; RBF = radial basis function; RMSE = root mean square error; SEP = standard error of prediction; UA = uric acid.
Table 3. Applications of Machine Learning for the Diagnosis of Kidney Stones
GA = genetic algorithm; GU = genitourinary; US = ultrasound.
Table 4. Applications of Machine Learning for the Treatment of Kidney Stones
NPV = negative predictive value.
Table 5. Applications of Machine Learning Related to Benign Prostatic Hyperplasia
BOO = bladder outlet obstruction; BPH = benign prostatic hyperplasia; CI = confidence interval; DE = diagnostic efficiency of the decision rule.
Urolithiasis
Outcome prediction
Among the 58 urolithiasis studies, the largest group (n = 20) aimed to predict outcomes. These outcomes included stone-free status (SFS), infection, spontaneous stone passage, kidney stone fragmentation, and stone patients' health-related quality of life (HRQoL). A detailed description of these studies can be found in Table 1.
Eight studies helped predict the SFS of patients following shockwave lithotripsy (SWL). 10 –17 These studies built their algorithms using parameters including patient age, stone location, stone volume, stone length, and Hounsfield units. Among these studies, only one group incorporated stone texture analyses in their algorithm to help determine SFS. 11,17 The accuracy of the models used was as high as 99% in predicting SFS. 14 The study that achieved this accuracy level developed an artificial neural network (ANN) algorithm using data extracted from 203 patients who presented for SWL. In addition to these eight studies, two other studies aimed to predict SFS following percutaneous nephrolithotomy (PCNL). Both of these studies were conducted by Aminsharifi and colleagues. 18,19
In their initial study, they designed an ML algorithm to predict SFS and other postoperative complications following PCNL. The algorithm was found to predict SFS with an accuracy of 82.8%. Following these promising results, the group then validated the accuracy of their algorithm and compared its performance with two widely used nomograms for the prediction of SFS post-PCNL: the Guy's Stone Score and the Clinical Research Office of the Endourological Society nomogram. The authors concluded that the predictive performance of their ML-based algorithm was better than that of both nomograms. Finally, one study used an ANN to predict SFS and other postoperative outcomes following SWL, ureteroscopy, or PCNL. 20 The results were compared with those of traditional statistical methods and showed that ANNs were superior.
One study applied ML models to help with the prediction of infection. The authors used an extreme gradient boosting algorithm to identify patients with obstructed hydronephrosis at high risk of developing pyonephrosis. This model achieved this task with an accuracy, sensitivity, and specificity of 99%, 96%, and 100%, respectively. 21
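Accuracy, sensitivity, and specificity, as well as the predictive values reported elsewhere in this review, all derive from the four cells of a confusion matrix. The following is a minimal sketch of those definitions; the function name and the counts are hypothetical and are not taken from any of the reviewed studies, and the sketch assumes every denominator is nonzero.

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Return common diagnostic metrics from confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,      # share of all cases classified correctly
        "sensitivity": tp / (tp + fn),      # share of true cases that were detected
        "specificity": tn / (tn + fp),      # share of non-cases correctly ruled out
        "ppv": tp / (tp + fp),              # positive predictive value
        "npv": tn / (tn + fn),              # negative predictive value
    }

# Hypothetical counts for illustration only (not data from a reviewed study).
metrics = classification_metrics(tp=48, fp=0, tn=50, fn=2)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Because accuracy depends on how common the condition is in the test sample, two models with the same accuracy can differ substantially in sensitivity and specificity, which is one reason the individual metrics are usually reported together.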
The prediction of spontaneous stone passage using ML methods was attempted by four different groups. 22 –25 These studies utilized either ANNs or SVMs to help with this task. Of note, the most recent algorithm displayed an accuracy of 99% in estimating spontaneous stone passage rate. 22 The authors highlighted stone size, body weight, pain score, serum C-reactive protein levels, and erythrocyte sedimentation rate as criteria that were superior to others in predicting spontaneous stone passage.
Another outcome that researchers often attempted to predict was stone fragmentation following SWL. Three studies attempted this task using an ANN. 26 –28 The first group conducted a pilot study in 2003 and showed that their algorithm effectively identified patients who were unlikely to benefit from SWL. 28 However, they emphasized that their limited sample size prevented them from making final recommendations. A more recent study verified this outcome on a larger sample and showed that their ANN could predict whether a stone is fragmentable using noncontrast CT imaging. 27 They reported that the chance of misclassifying a nonfragmentable stone was 2.6%.
However, their algorithm was not tested on the fragmentation of multiple stones or of stones larger than 2 cm. Finally, the study conducted by Goyal and colleagues compared the use of an ANN and multivariate regression analysis to predict renal stone fragmentation. 26 Using a coefficient of correlation, they showed that the ANN was superior to multivariate regression analysis in predicting renal stone fragmentation by SWL.
There were two studies that developed a decision support system to help patients and clinicians select an appropriate surgical treatment for the management of their kidney stones. Both studies focused on predicting surgical outcomes following PCNL. One of these systems was specifically designed for the use of PCNL to treat large stones. 29 Using multiple ML models, this system predicted SFS, surgical complications, and the need for ancillary surgical procedures. The algorithms used provided an accuracy of 95% in predicting the need for a blood transfusion or retreatment, and an accuracy of 85% in determining the need for stent placement post-PCNL. The other decision support system focused on predicting SFS following PCNL of staghorn stones to help urologists counsel their patients appropriately when selecting a treatment option. Within this system, the algorithm with the best accuracy in predicting SFS (81%) was the random forest (RF) classifier. 30
Additional studies predicted unique outcomes with the help of ML. For instance, a study conducted by Nguyen and colleagues estimated the HRQoL of kidney stone patients using clinical data retrieved from the Wisconsin Stone Quality-of-Life (WISQOL) questionnaire. 9 Their model effectively predicted stone patients' HRQoL using clinical information, with an area under the curve (AUC) of 0.79 and 0.83 for patients within the lowest and highest quintiles of their HRQoL stratification, respectively.
Classification of stone type
A common application of ML for urolithiasis was in helping with the classification of stones. The studies retrieved (n = 18) showed that there are multiple ML algorithms that can help with stone classification, including NN, NLP, SVM, and computer vision, among others. A detailed list of the studies using ML for the classification of stones can be found in Table 2.
Three studies were designed to predict kidney stone type. One study used a combination of ML algorithms (ensemble learning) and another did so with an ANN. The study evaluating the use of ensemble learning achieved this task with an accuracy of 97.1%. 31 The study utilizing ANNs also evaluated the use of discriminant and logistic regression analysis. This study focused on predicting the risk of developing calcium oxalate stones specifically. Their results showed that ANNs were not superior to the classical statistical analyses used. 32 The remaining study, conducted by Zheng and coworkers, extracted over 1000 radiomic features from CT scans to develop and validate a nomogram designed to identify infectious kidney stones. This nomogram achieved this task with an AUC of 0.842. 33
Three studies focused on distinguishing between specific stone classes. For instance, a study carried out by Zhang and coworkers differentiated uric acid from nonuric acid stones. 34 This study did so by evaluating the texture features of the stones using CT texture analysis. The data collected were then used to train an SVM classifier, which led to a diagnostic accuracy of 88% and 92% for uric acid and nonuric acid stones, respectively. Another study focused on identifying and distinguishing between pure and mixed kidney stones using a deep convolutional neural network (CNN) that was trained on intraoperative endoscopic images. 35
The authors used the unique physical features of the two stone types to train the NN. The algorithm's accuracy was higher than 87% for both the pure and mixed stones. The third study aimed to quantitatively differentiate stones composed of whewellite, weddellite, and carbonate apatite using an ANN. 36 The authors of this study achieved this goal by measuring the infrared spectra of the different stone types.
Other classification studies aimed to distinguish between multiple classes of kidney stones. 37 One study attempted this task using a deep learning computer vision algorithm trained on digital camera images of stones. The overall weighted sensitivity in predicting stone type was 85%. However, this varied for each stone type; the highest sensitivity was for uric acid stones (94%), and the lowest was for brushite stones (71%). 38 While the previous study used digital camera images of stones, a study led by Fitri and colleagues used a CNN trained on microCT images to classify the different classes of kidney stones. This group achieved this task with an accuracy of 99% and a classification error of 1.2%. However, the classification groups were broader and only consisted of uric acid, calcium, and mixture stones. 39
Two studies used dual-energy CT as an imaging modality to train their models. The first study, conducted by Grose Hokamp and coworkers, used an NN to predict the main component of pure and mixed kidney stones with an overall accuracy of 91%. 40 The second study applied multiple algorithms, including SVMs, RandomTree, ANN, and naive Bayes tree, to distinguish uric acid from nonuric acid stones with an accuracy of 100%. Once distinguished, the algorithm subclassified the nonuric acid stones with an accuracy of 88%. 41 Other imaging modalities used in the automated classification of stones included dual-energy kidney, ureter, and bladder (DEKUB) X-ray imaging. In that study, a mean accuracy of 96% was achieved in classifying stones using linear discriminant analysis. 42
Interestingly, only one article applied an NLP algorithm to extract stone composition information from electronic health records. This algorithm provided a positive predictive value (PPV) >87.5% for all the possible stone compositions. The authors explain that most of the false positives were due to the mislabeling of urinary uric acid mentions as uric acid stones. 43 Other methods such as Raman spectroscopy, hyperspectral imaging, infrared spectroscopy, and using the microwave dielectric properties of stones were also used in conjunction with ML tools to classify the different types of kidney stones. 44 –48
Diagnostics
The identification and diagnosis of kidney stones was another common theme among the studies retrieved. These studies varied with respect to the imaging modality and ML algorithm used. A detailed list of the studies related to the applications of ML for the diagnosis of kidney stones can be found in Table 3.
Selvarani and Rajendran, as well as Divya Krishna and colleagues, used data retrieved from ultrasound imaging systems to train and test their algorithms. The former used a metaheuristic SVM classifier to enhance the quality of ultrasound images when detecting kidney stones. 49 The methodology used led to an accuracy of 98.8% in detecting kidney stones. The study led by Divya Krishna and colleagues also used SVMs. However, the goal of the study was not specific to stones, as it aimed to identify any kidney abnormality on ultrasound, such as cysts or stones. 50 Nevertheless, this algorithm achieved its goal with a similar accuracy of 98.14%.
Two studies developed an algorithm specific to CT imaging. The model used by Langkvist and coworkers is notable as it was the first CNN algorithm developed for the detection of stones using three-dimensional data. In addition, it specifically focused on the detection of ureteral stones, which are more challenging to detect than kidney stones. This algorithm allowed for the detection of ureteral stones on CT scans with a sensitivity of 100% and a false positive rate of 2.7 per CT scan. 51
Cui and coworkers developed an algorithm to both detect and grade the severity of kidney stones using the S.T.O.N.E. (stone size, tract length, obstruction, number of involved calices, and essence/stone density) scoring system. The algorithm combined CNNs with thresholding methods and achieved stone detection with a sensitivity of 95.9% and a PPV of 98.7%. 52 The last study that focused on the detection of kidney stones was unique, as it discussed the development of a deep learning algorithm that identified stone precursors such as plaque and plugs on video endoscopic data. 53
An issue that is often encountered when diagnosing kidney stones on imaging is the misdiagnosis of ureteral stones as pelvic phleboliths. In this review, three studies focused on the differentiation of kidney stones from phleboliths. Lee and associates carried out this assessment on CT images using an ANN. This was achieved with an AUC of 0.85 for the shape and 0.88 for the texture parameters. 54 In comparison, the study conducted by De Perrot and colleagues differentiated phleboliths from kidney stones with an accuracy of 85.1% and AUC of 0.90 using a combination of radiomics and ML. 55 The third study, led by Jendeberg and coworkers, focused specifically on distinguishing distal ureteral stones from phleboliths using a CNN. The evaluation metrics of this study were superior to those of the previous two, with an accuracy of 92% and AUC of 0.95. 56
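The AUC values compared above can be read as the probability that a model assigns a higher score to a randomly chosen positive case (here, a stone) than to a randomly chosen negative case (a phlebolith). The sketch below computes the AUC directly from that pairwise definition; the function name and scores are hypothetical, and the reviewed studies may well have used other estimators or software.

```python
def pairwise_auc(scores_pos, scores_neg):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count half)."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

# Hypothetical model outputs for illustration only (not data from a reviewed study).
stone_scores = [0.9, 0.8, 0.6, 0.4]        # cases that truly are stones
phlebolith_scores = [0.7, 0.3, 0.2, 0.1]   # cases that truly are phleboliths

auc_value = pairwise_auc(stone_scores, phlebolith_scores)
print(auc_value)  # 14 of the 16 pairs are ranked correctly -> 0.875
```

An AUC of 0.5 corresponds to ranking no better than chance, and 1.0 to perfect separation, independent of any single decision threshold.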
In addition, this review included studies that predicted stone development. Chen and coworkers compared the ability of multivariate logistic regression and statistical ML methods to predict the development of large kidney stones using laboratory testing results and detailed patient demographics. Their logistic regression model was superior to all other ML models, with a sensitivity of 83% and a specificity of 56%. 57 Eken and associates compared ANN, genetic algorithms, and logistic regression analysis using data such as patients' relevant medical history and clinical signs related to urolithiasis. Their ANN model was found to be the best model in predicting urolithiasis, with a sensitivity and specificity of 94.9% and 78.4%, respectively. 58
Another group predicted the development of urolithiasis by developing a multidimensional algorithm based on statistical and ML models. The finalized algorithm included data related to patient demographics and known clinical diagnoses. After testing multiple models, this group found that their stepwise-selected model performed best, with an AUC, sensitivity, and specificity as high as 0.90, 90%, and 82%, respectively. When applying ML techniques, the authors did not observe a significant increase in performance. 59 Jungmann and coworkers proposed a unique method to identify suspected cases of urolithiasis using NLP. The authors used this algorithm to identify keywords related to stone disease in radiology free-text reports. 60 The last group predicted the incidence of stone disease using both discriminant analysis and ANN. 61 This study showed that the ML approach was superior to discriminant analysis in classifying participants known to have stone disease.
Additional studies helped in predicting the location and risk of stone recurrence. One study predicted the presence of upper urinary tract stones using an ANN with an accuracy of 100% on a testing sample of 68 records. 62 This model included data related to patients' history of kidney stone development, the presence of nephrocalcinosis on imaging, and biochemical data from urine cultures and 24-hour urine assays for citrate. Finally, regarding the risk of stone recurrence, one study predicted the 5-year recurrence rate of kidney stones using an ANN model. 63 The input data for this model consisted of different serum and urine electrolyte levels. The algorithm accurately predicted the recurrence of stone disease 89% of the time.
Therapeutics
ML-based algorithms can also be used for therapeutic purposes in the treatment of kidney stones. There were three studies identified within this review that aimed to improve the treatment of urolithiasis with the help of ML. 64 –66 Two of these studies discussed SWL and one was related to PCNL. The detailed characteristics related to these studies can be found in Table 4.
One of the studies related to SWL developed a deep learning model designed to automate SWL treatment plans according to baseline patient characteristics. 64 The authors showed that their model was on par with physician planning of SWL. The other study related to SWL built a CNN to improve shocking accuracy. 66 Given that this was a pilot study with a small sample size, the results presented require further validation. Nevertheless, the authors showed that their algorithm improved the operator hit rate from 55.2% to 75.3%. Finally, the study related to PCNL predicted the optimal kidney stone localizing method using an ANN. 65 The authors showed that B-mode ultrasonography with X-ray was recommended for the localization of small renal stones, whereas the localization of simple and large stones only required one of the two methods, with X-ray being the ideal method.
Benign prostatic hyperplasia
This review found five studies examining the use of ML in the management of BPH. The detailed characteristics related to these studies can be found in Table 5.
Among the five studies retrieved, two aimed to help with the diagnosis of BPH. The first study used a computer vision-based system to diagnose BPH on histopathologic specimens with an accuracy of 93%. 67 The second study examined the use of an SVM algorithm to accurately measure prostatic volume on magnetic resonance imaging. 68 This algorithm predicted prostate volumes that were in accordance with planimetry. Two other studies aimed to help with outcome prediction related to BPH and its management. Shatalova and coworkers utilized an NN to predict the risk of complications secondary to BPH surgery using the electrical resistance of biologically active points. This was achieved with a diagnostic sensitivity of 84% and a specificity of 93%. 69
Djavan and associates designed an NN to help predict the risk of symptomatic progression in patients with bladder outlet obstruction and to determine the factors that put patients at highest risk of disease progression. 70 The group highlighted that prostate-specific antigen, transition zone volume, and obstructive symptom score were associated with a high risk of disease progression. The remaining study evaluated the medical management of BPH. This study detected subtle histologic effects attributed to dutasteride treatment using a computer vision approach. The authors then retrieved features associated with patients' degree of responsiveness and developed a histologic score to determine whether a patient would respond well to dutasteride. 71 Overall, this model was able to distinguish nontreated from dutasteride-treated prostate tissue with an accuracy of 76%.
STREAM-URO assessment
Overall, the articles met 13.7 (standard deviation [SD] 2.7) of the 26 items included in the STREAM-URO assessment. On average, the urolithiasis and BPH articles met 13.8 (SD 2.6) and 13.4 (4.1) of the 26 STREAM-URO framework criteria, respectively. 8 Of the 63 articles, 62 (95%) met the background, objective, and label criteria of the framework. Only 3 of the 63 articles (4.8%) included an assessment of bias. In addition, only 23 articles (37%) compared their ML models to a reference standard. A detailed view of the STREAM-URO assessment can be found in Table 6.
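The overall mean and SD can be approximately recovered from the two subgroup summaries. The sketch below pools group-level (n, mean, SD) values using the standard within-group plus between-group sum-of-squares decomposition; the helper name is illustrative, and because the subgroup inputs are rounded to one decimal place, the result should agree with the reported 13.7 (SD 2.7) only to within rounding.

```python
import math

def pool_summaries(groups):
    """Combine (n, mean, sd) summaries into an overall mean and sample SD."""
    n_total = sum(n for n, _, _ in groups)
    mean_total = sum(n * m for n, m, _ in groups) / n_total
    # Total sum of squares = within-group part + between-group part.
    sum_sq = sum((n - 1) * sd ** 2 + n * (m - mean_total) ** 2
                 for n, m, sd in groups)
    return mean_total, math.sqrt(sum_sq / (n_total - 1))

# Subgroup summaries as reported: (number of articles, mean items met, SD).
subgroups = [
    (58, 13.8, 2.6),  # urolithiasis articles
    (5, 13.4, 4.1),   # BPH articles
]

overall_mean, overall_sd = pool_summaries(subgroups)
print(f"{overall_mean:.2f} (SD {overall_sd:.2f}) of 26 items")
```

With these rounded inputs the pooled values come out at roughly 13.8 and 2.7, which is consistent with the reported overall figures given one-decimal rounding of the subgroup means.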
Table 6. Standardized Reporting of Machine Learning Applications in Urology Grading of All Included Articles
STREAM-URO = Standardized Reporting of Machine Learning Applications in Urology.
Discussion
Within the field of endourology (specifically urolithiasis and BPH), ML is applied to help with the prediction of many outcomes. The studies retrieved in this review demonstrated excellent evaluation metrics that often-outperformed clinicians, traditional statistical analyses, or other validated nomograms. 18 –20,29,30 In general, most of these studies were limited by their single-institution data pool and retrospective study design. In addition, only one of the included studies were validated on an external dataset. 25 External validation on a diverse dataset is important to ensure that a newly developed model is free of bias that may be inherently found within the dataset used to develop the model.
After having designed a NN, Parekattil and colleagues validated their model on an external dataset from six different institutions. The testing performed on the initial design institution dataset was found to have a prediction accuracy of 86% for stone passage and 87% for stone passage duration. 72 In comparison, the external validation of the algorithm led to an accuracy of 88% and 80% in predicting stone passage and stone passage duration, respectively.
This review highlighted the different ways that ML can be applied to help with the classification, diagnosis, predicting the risk of recurrence, and treatment of kidney stones. Overall, for the classification of kidney stones, the developed ML algorithms included used stones' inherent properties (texture, morphology, infrared spectra, and microwave dielectric properties) to develop accurate systems that have the potential to be faster and more affordable than traditional stone analysis. 35,44 –48 With regard to the diagnosis of kidney stones, this review showed that ML can be applied to different imaging modalities and effectively help diagnose kidney stones.
While ML was effective in diagnosing kidney stones, studies reported that there was an underestimation of individual stone measurements in comparison to manual assessments. 52,56 This is a factor that should be considered when building new ML algorithms. In this review, there were also studies that helped in predicting the development and risk of recurrence of stones. Key variables used for these predictions included hypertension, older age, calcium oxalate supersaturation, log-transformed protein percentage, a history of stones, the presence of nephrocalcinosis, and urine culture results. These algorithms have the potential to reduce the number of unnecessary radiographic testing for kidney stones in the acute care setting. Only three studies examined the application of ML for the treatment of kidney stones, two of which were published in the last year. 64 –66 These studies highlighted the potential that ML applications can have in providing personalized treatment plans. 64 –66
The application of ML for the management of BPH is also a novel and understudied field. In this review, only five studies were retrieved, of which two aimed to help with outcome prediction of BPH, two targeted the diagnosis of BPH, and one helped with the treatment of BPH. 67–71 Neither of the two studies aiming to aid with the diagnosis of BPH used ultrasound, which is the only imaging modality recommended in certain society guidelines for patients undergoing surgical therapy for BPH. 73
The STREAM-URO framework aims to help urologists develop a better understanding of how to appropriately conduct standardized ML studies. In this review, the kidney stone and BPH articles failed to meet, on average, roughly half of the 26 STREAM-URO criteria. Only three of the studies included within this article were published after the release of the STREAM-URO criteria, so few studies would have been able to consult the STREAM-URO framework. 13,30,33 This finding emphasizes the importance of adhering to the recently developed STREAM-URO framework to promote the development of high-quality ML studies. In this review, almost all articles omitted a bias assessment. Only three articles evaluated and compared the algorithm's metrics when stratified by factors such as age, gender, ethnicity, or socioeconomic status. This is especially important as ML algorithms may lack generalizability across diverse populations. 5
For instance, studies have shown that the performance of ML algorithms may vary according to race. 74 Therefore, it is especially important to evaluate ML algorithms in race subgroups to prevent the creation of disparities in care when implementing ML tools. Another frequently missed item was a reference standard. While demonstrating the feasibility of ML models has its purpose, it is important for investigators to compare these models with a reference standard to evaluate whether ML models are truly superior to the current standard of care. 8 Reference standards can be in the form of existing models, nomograms, or traditional regression models that use similar features. Ultimately, these head-to-head comparisons can allow investigators to advance the field of ML.
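As an illustration of the two checks discussed above (metrics stratified by subgroup, and a head-to-head comparison with a reference standard), a minimal sketch is given below. It is not drawn from any of the reviewed studies: the function name `stratified_accuracy`, the subgroup labels, and the toy predictions are all hypothetical, and accuracy is used only because it is the simplest metric to show.

```python
from collections import defaultdict


def stratified_accuracy(y_true, y_pred, groups):
    """Return {subgroup: accuracy} for predictions split by subgroup label."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        total[group] += 1
        correct[group] += int(truth == pred)
    return {g: correct[g] / total[g] for g in total}


# Hypothetical toy data: observed outcomes, predictions from an ML model,
# predictions from a reference standard (e.g., an existing nomogram),
# and a subgroup label for each patient.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
ml_pred = [1, 0, 1, 0, 0, 1, 1, 1]
ref_pred = [1, 0, 0, 0, 0, 1, 0, 1]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]

ml_by_group = stratified_accuracy(y_true, ml_pred, groups)
ref_by_group = stratified_accuracy(y_true, ref_pred, groups)

# Report both models side by side within each subgroup.
for g in sorted(ml_by_group):
    print(f"subgroup {g}: ML={ml_by_group[g]:.2f}, reference={ref_by_group[g]:.2f}")
```

In this toy example the ML model outperforms the reference in subgroup A but not in subgroup B, a pattern that a single pooled accuracy figure would hide.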
Finally, other poorly represented items related to the technical aspects of ML. Most studies did not describe the methods used to develop the final dataset and did not present the final ML model with the list of features and hyperparameters used. However, as urologists become accustomed to the STREAM-URO framework, they may develop a better understanding of how to appropriately conduct standardized ML studies and address the limitations identified in this study.
Limitations
This review helped identify common limitations in the literature on the applications of ML for both BPH and urolithiasis. First, only one study was validated on an external dataset. Therefore, one should be cautious when interpreting the data presented in these studies, as the lack of external validation hinders the generalizability of their results. External validation of ML-based models is limited within health care, as it is difficult to ensure uniform data collection since electronic medical records and physician documentation vary across institutions. 75
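External validation, as discussed above, amounts to freezing a model and scoring it on data from institutions that played no part in its development. The sketch below shows only that comparison. Everything in it is a hypothetical assumption for illustration: `toy_model` is a fixed threshold rule standing in for a trained model, and the two small datasets are invented rather than taken from any reviewed study.

```python
def accuracy(model, dataset):
    """Fraction of (feature, outcome) pairs the model predicts correctly."""
    hits = sum(1 for feature, outcome in dataset if model(feature) == outcome)
    return hits / len(dataset)


def toy_model(feature_value):
    # Hypothetical frozen rule standing in for a trained model:
    # predict the positive class when the feature is below a fixed cutoff.
    return 1 if feature_value < 5.0 else 0


# Hypothetical held-out data from the development institution.
internal_test = [(3.0, 1), (4.5, 1), (6.0, 0), (7.2, 0), (4.0, 0)]
# Hypothetical data collected at other institutions.
external_test = [(2.5, 1), (5.5, 1), (6.5, 0), (4.8, 0), (8.0, 0)]

internal_acc = accuracy(toy_model, internal_test)
external_acc = accuracy(toy_model, external_test)
gap = internal_acc - external_acc

print(f"internal={internal_acc:.2f}, external={external_acc:.2f}, gap={gap:.2f}")
```

A large gap between the internal and external figures would suggest that the model has not generalized beyond the data on which it was developed.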
In addition, within urology, there is poor insight into ML “black-box” models in comparison to statistical approaches. Therefore, this lack of insight has the potential to perpetuate biases if the ML models are left unchecked. Finally, the lack of standardized study design and reporting of results found within the retrieved studies prevented quantitative analyses from being carried out in this review. Fortunately, a framework designed specifically for ML studies within urology was recently published. 8 This framework provides guidelines for investigators within the urologic community to help promote the development of high-quality ML studies and address the limitations revealed in this review.
Conclusions
This systematic review highlighted the important role that ML can have within the field of endourology. Studies retrieved within this review effectively helped with outcome prediction, disease classification, diagnostics, and therapeutics for both urolithiasis and BPH. While ML shows great promise in improving patient care within the field of endourology, it is important for investigators to adhere to the recently developed STREAM-URO framework designed to promote the development of high-quality ML studies.
Footnotes
Acknowledgments
We thank Lucy Kiester for helping with the development of this project's search strategy. We acknowledge that the abstract related to this project was presented at the 2022 Northeastern Section of American Urological Association annual meeting and published in the Canadian Urological Association Journal (doi:
Authors' Contributions
Study concept and design: D.B., X.H.L., W.X.L., D.-D.N., and N.B. Acquisition of data: D.B., X.H.L., W.X.L., A.A., C.D., A.G., D.-D.N., and J.C.C.K. Analysis and interpretation of data: D.B., X.H.L., W.X.L., A.A., C.D., A.G., D.-D.N., and J.C.C.K. Drafting of the article: D.B. Critical revision of the article for important intellectual content: D.B., X.H.L., W.X.L., A.A., C.D., A.G., D.-D.N., J.C.C.K., B.C., D.S.E., K.C.Z., Q.-D.T., and N.B. Obtaining funding: None. Supervision: B.C., D.S.E., K.C.Z., Q.-D.T., and N.B. Other: None.
Author Disclosure Statement
All other authors report no relevant conflicts of interest.
Funding Information
No funding was received for this article.
Supplementary Material
Supplementary Data S1
Abbreviations Used
References
