Abstract
Objectives:
To systematically review the performance of artificial intelligence (AI) in detecting urinary stones.
Methods:
A PROSPERO-registered (CRD473152) systematic search of Scopus, Web of Science, Embase, and PubMed databases was performed to identify original research articles pertaining to AI stone detection or measurement, using search terms (“automatic” OR “machine learning” OR “convolutional neural network” OR “artificial intelligence” OR “detection” AND “stone volume”). Risk-of-bias (RoB) assessment was performed according to the Cochrane RoB tool, the Joanna Briggs Institute Checklist for nonrandomized studies, and the Checklist for Artificial Intelligence in Medical Imaging (CLAIM).
Results:
Twelve studies were selected for the final review, including three multicenter and nine single-center retrospective studies. Eleven studies completed at least 50% of the CLAIM checkpoints and only one presented a high RoB. All included studies aimed to detect kidney (5/12, 42%), ureter (2/12, 16%), or urinary (5/12, 42%) stones on noncontrast computed tomography (NCCT), and 42% also intended to automate measurement. Two studies addressed stone distinction from vascular calcifications. All studies used AI machine learning network training and internal validation, but only one provided an external validation. Trained networks achieved stone detection, with sensitivity, specificity, and accuracy rates ranging from 58.7% to 100%, 68.5% to 100%, and 63% to 99.95%, respectively. Detection Dice score ranged from 83% to 97%. A high correlation between manual and automated stone volume (r = 0.95) was noted. Differentiating distal ureteral stones from phleboliths seemed feasible.
Conclusions:
AI processes can achieve automated urinary stone detection from NCCT. Further studies should provide urinary stone detection coupled with phlebolith distinction and an external validation, and include anatomical abnormalities and urologic foreign bodies (ureteral stent and nephrostomy tubes) cases.
Introduction
Kidney stone disease (KSD) is a frequent urologic condition affecting 10% of the population in developed countries. 1 KSD prevalence is now estimated to reach a 30% rate in 2050 in the U.S. warm areas according to a climate change-based predictive model. Moreover, KSD has a significant economic impact on health care systems, because of interventional management (25%) and (single [50%] or multiple [10%]) recurrence rates. 1,2 Acute renal colic (ARC) is the most common urologic emergency, causing 120,000 visits per year (1% of total emergency visits).
According to the National Institute for Health and Care Excellence (NICE) guidelines, a noncontrast computed tomography (NCCT) or an ultrasound (US) has to be performed within 24 hours after the initial visit. 3 This 24-hour delay is mainly explained by the lack of human resources to analyze NCCT or US in radiologic emergency departments. Therefore, an automatic stone detection could help patients, radiologists, and urologists to obtain an etiologic diagnosis at the initial stage of ARC, and to improve the ARC quality of care. 4
Artificial intelligence (AI) was born in the early 1950s with Turing's machine and his question “can machines think?” 5 AI is a scientific field that includes machine learning (ML), which aims to train a machine for specific tasks. Once training is completed, the machine autonomously executes the learned task. Particularly efficient at learning from and detecting text or images, deep learning (DL) is an ML subcategory that has recently been spreading among medical applications. Researchers can now easily access DL networks for computer vision of medical images. Four learning methods are frequently used for algorithm or network training: supervised, unsupervised, self-supervised, and reinforcement learning. 6 Unsupervised learning learns from data without human supervision: unsupervised ML models are given unlabeled data and allowed to discover patterns without any explicit instruction.
In self-supervised ML methods, the model generates its own supervision signal from auxiliary tasks such as predicting missing parts of the input data. Reinforcement learning learns to make decisions by interacting with the environment. The model receives feedback in the form of rewards or penalties based on its actions and adjusts its policy to optimize the reward over time. Finally, supervised learning, the preferred method for DL, provides the network with both input and output data. During training, the network defines the rules for the path between them. In the field of urinary stones, DL has been proposed for surgical outcome prediction and for urinary stone detection and characterization, from which ARC and nonacute interventional management of KSD could benefit. Indeed, the surgical planning of endourologic procedures should include stone location and size, according to the national and international guidelines. 3,7
Among other characteristics, urologists could make better decisions by having an accurate stone burden estimation without human intervention.
This systematic study aimed to review AI performances to detect urinary stones.
Methods
This study was conducted in accordance with the Preferred Reporting Items for Systematic Review and Meta-Analysis (PRISMA) checklist and PROSPERO registered (CRD473152). 8
Search strategy
A systematic review of literature was carried out on September 4, 2023, using the Scopus, Web of Science, Embase, and PubMed databases. No time period limited the literature research and every publication was considered for screening without restriction.
The search terms (“automatic” OR “machine learning” OR “convolutional neural network” OR “artificial intelligence” OR “detection” AND “stone volume”) were used. Reference lists of selected articles were checked manually for eligible additional articles.
Inclusion and exclusion criteria
Inclusion criteria were: (1) stone volume (SV), stone detection, or distinction from other anatomical structures such as vascular or nonvascular calcifications as the main topic of the article; (2) automation or AI process (supervised learning, ML, convolutional neural network) as the main method; (3) full text available for screening and analysis of the methodology or data; and (4) English-written publications only. Original studies as well as conference abstracts on in vivo studies were considered. Systematic reviews, editorials, and letters were not considered. Exclusion criteria were: (1) manual measurement of SV or manual stone detection only; (2) linear measurement of stones only; (3) ambiguous report of results such as the absence of accuracy on stone detection or SV; and (4) publications in any language other than English.
Data extraction
Two authors (F.P. and D.S.) extracted data independently using a standardized-item form. Conflicts were resolved by selective analysis and consensus. Included studies were assessed for study characteristics and relevant outcomes. Primary outcomes of interest were automation or AI processes and results in either in vitro or in vivo studies. Imaging modalities were also extracted when available, as well as patients' demographics.
Quality assessment: risk-of-bias
A risk-of-bias (RoB) assessment for nonrandomized studies was undertaken independently by two authors (F.P. and D.S.) using the validated Joanna Briggs Institute Critical Appraisal Checklist for nonrandomized studies. 9 The ROBVIS tool was used for a graphical representation, assuming the randomization bias was not applicable in all studies. 10 Finally, each included study underwent the Checklist for Artificial Intelligence in Medical Imaging (CLAIM). 11 Conflicts were resolved by consensus. Comprehensive description of the quality of bias assessment can be found in Supplementary Figure S1.
Statistical analysis
Considering the heterogeneity of the study outcomes and the lack of comparative trials, a meta-analysis was not performed. The structure of the article was decided based upon the consensus of authors.
Results
Literature search
The search identified 94 studies through the database searches and 1 additional study from reference lists of which 12 studies were selected for the final review. 12 –23 All studies were retrospective, including three multicenter 15,19,20 and nine single-center 12 –14,16 –18,21 –23 studies (Table 1). The selection process is detailed in Figure 1.

Figure 1. PRISMA flow diagram. PRISMA = Preferred Reporting Items for Systematic Review and Meta-Analysis.
Table 1. Demographic and Imaging Characteristics
NA = nonavailable; NCCT = noncontrast computed tomography.
Quality assessment: RoB
RoB assessment was applicable to all included studies. Only one study had a high RoB, 21 with all other studies having moderate RoB (Supplementary Fig. S1a, b). According to the CLAIM checklist, 6 and 11 studies completed at least 80% and 50% of the 42 checkpoints, respectively (Supplementary Fig. S1c). The high RoB study was the only study not reaching 50% of CLAIM checkpoints. 21
Study design, participants
Among the included studies, all were retrospective and three were multicenter (25%) (Table 1). Studies focused on various stone locations: kidney (5/12, 42%), ureter (2/12, 16%), or both (5/12, 42%). The main objective was always described clearly but differed among studies: stone detection (12/12, 100%), stone characterization (maximum diameter, volume, density, complexity) (5/12, 42%), and stone distinction from a differential diagnosis (2/12, 16%). Regarding the last two articles, one intended to automate distal ureteral stone distinction from phleboliths, 13 whereas the other aimed to distinguish urinary stones from vascular calcifications. 12 The description of included patients and NCCT also varied significantly among the publications. Regarding the population description, a control group (without urinary stones) was reported in four studies (33%).
Patients were selected from a stone former cohort in 10/12 (83%), from “another purpose” NCCT cohort in 1/12 (8%) studies 20 : six studies included adult 12,14,15,17,18,22 stone formers, while a single one included pediatric stone formers. 19 Two studies included ARC patients. 13,16 Mukherjee et al. used a multiorgan segmentation data set that included stone disease but was not designed as a dedicated stone former cohort, 23 while Elton et al.'s study used colonography computed tomography with incidental stones. 20 In the last one (Park et al.), we did not find details on the cohort constitution. 21
NCCT protocol
The imaging modalities were heterogeneously described among included studies. Authors reported at least the number of NCCT stations in eight publications. 12 –14,16,18,20,22,23 No details of the NCCT protocols were found for two studies, 15,21 whereas Babajide et al. reported only the slice thickness (Table 1). 19 Among NCCT characteristics, slice thickness was the most frequently reported (nine studies), followed by tube tension (seven studies). The remaining characteristics were the modulation/intensity +/− rotation duration and interval reconstruction rate. The radiation dose was not reported in any study.
AI method
The Methods section of each included study provided a detailed presentation of the AI process used for network training (Table 2). Network names were systematically cited, along with whether the networks were pretrained, newly designed, or reshaped. The data set splitting used “validation” or “test” titles for internal validation in all studies that did not include an external validation. Data set splitting was described in 10 studies. Elton et al. conducted an external validation, whereas the other included publications reported an internal validation. 20 Supervised training was the chosen method for DL network training in all studies. An automated NCCT annotation was performed in two studies, in which an automated kidney segmentation was followed by a threshold-based segmentation. 20,23 One study aimed to define a composite method to differentiate vascular calcifications from stones (shape feature and texture criteria). 12
Table 2. Artificial Intelligence Methods and Outcomes
ANN = artificial neural network; AUC = area under the curve; CI = confidence interval; CNN = convolutional neural network; DHV = difference histogram variation; GLCM = gray-level co-occurrence matrix moment; ML = machine learning; SGD = stochastic gradient descent; SV = stone volume.
The other studies ensured the ground truth definition by a manual annotation, that is, stone segmentation, of each NCCT certified by at least one radiologist or urologist. The network compilation and training characteristics were described in five studies, as well as methods for handling overfitting or data augmentation. 13,14,17,18,20,22
Outcomes and statistics
All included studies presented results obtained with a described statistical method (Table 2). However, statistical contents were heterogeneous, with various calculations or criteria: Dice score, F1 score, sensitivity, specificity, false positive, accuracy, positive predictive value, negative predictive value, concordance, or correlation. For studies using a first step of kidney segmentation followed by threshold-based segmentation, excellent performances were reported: 0.968 Dice score with a manual-automated SV concordance of 0.995 [0.993–0.996] in Mukherjee et al.'s work, and sensitivity/specificity of 88% and 91%, respectively, with 0.95 correlation between manual and automated segmentation in the external validation cohort for Elton et al. 20,23 Finally in Cui et al.'s study, the kidney segmentation Dice was 0.97 with an excellent concordance. 15 One-step stone segmentation achieved stone detection and SV Dice scores of 0.79 and 0.66, respectively. 17,19
For stone detection, trained networks reported sensitivity and specificity rates ranging from 58.7% to 100% 13 –23 and 68.5% to 100%, 14,16 –21 respectively. According to the clinical context, sensitivity varied from 66.1% to 97.5%, 88%, 94% to 100%, and 87.5% to 88% in studies that included adult stone formers, pediatric stone formers, suspected ARC patients, and nonstone formers, respectively. In Park et al.'s study, a 90% sensitivity rate was reported but no clinical context was given. Studies that focused on suspected acute colic patients also included control cases without stones. 13,16 In three studies, eligibility criteria considered stone size limits. 18,20,23
Using the two-step segmentation methods, Mukherjee et al. limited the stone size to 3 to 250 mm3, while Elton et al. considered only stones greater than 3 mm3, providing similar stone detection sensitivities (87%–88%). Caglayan and colleagues divided cases according to the axial diameter (0–10, 10–20, and >20 mm), showing higher sensitivities in larger stones. 18
Considering accuracy, six studies found rates of 63% to 99.95% for stone detection. 14,16 –18,21,22 Overall Dice score ranged from 0.83 to 0.97. 15,17 Focusing on the quantitative comparison between AI-automated and manual SV measurements, five studies included such comparisons. 19,20,22 –24 Overall, the correlation between network-generated and manual SV (ground truth) ranged from 88.44% to 99.5%. 20,22,23 Moreover, a 0.31 ± 0.92 mm3 SV difference was reported in Elton et al.'s study, with higher SV by manual measurements according to Babajide et al. 19,20
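The agreement statistics mentioned above (correlation and mean difference between manual and automated SV) can be sketched as follows. This is a minimal illustration that assumes Pearson correlation is the statistic of interest; the volumes are hypothetical values chosen for the example and are not data from any included study.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical stone volumes in mm3 (illustrative only, not study data).
manual_sv = [12.0, 85.0, 240.0, 31.0, 150.0]
automated_sv = [11.0, 88.0, 231.0, 35.0, 146.0]

r = pearson_r(manual_sv, automated_sv)
# Mean signed difference (manual minus automated), in mm3.
mean_diff = sum(m - a for m, a in zip(manual_sv, automated_sv)) / len(manual_sv)
```

A high correlation does not by itself rule out a systematic offset between the two measurements, which is why the mean difference is computed alongside it.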
Furthermore, a distinction has to be made according to the NCCT annotation method used in these studies (one-step stone segmentation or two-step kidney segmentation followed by 130 HU-threshold segmentation). The latter was associated with better AI-manual SV concordance/correlation than a one-step stone segmentation, even if we can legitimately consider 88.4% as a clinically impactful correlation for SV estimation. However, the two-step annotation method was only feasible on kidney stones and consequently could not include ureteral stones.
Kidney and ureteral stone detections were associated with similar outcomes, but distinct methods were available to achieve the ground truth definition (NCCT annotation). For kidney stones, an automated kidney segmentation followed by threshold-based segmentation was achievable, which could not be transposed to ureteral stones because of the absence of contrast agent injection in NCCT. A single study aimed to differentiate distal ureteral stones from phleboliths. 16 Jendeberg and colleagues' DL method was associated with a sensitivity, specificity, and accuracy of 94%, 90%, and 92%, respectively, and a higher accuracy than the mean radiologist accuracy (92% vs 86%, p = 0.03).
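The second step of the two-step annotation can be sketched as below. This is a simplified illustration, not the implementation of any included study: the function name, the flat per-voxel lists, and the example values are assumptions, and the kidney mask is taken as given from a prior segmentation step.

```python
def stone_volume_mm3(hu_values, kidney_mask, voxel_volume_mm3, threshold_hu=130):
    """Threshold voxels inside a kidney mask and return the stone volume.

    hu_values and kidney_mask are flat, equal-length sequences with one
    entry per voxel. The mask is assumed to come from a prior kidney
    segmentation step (hypothetical input here).
    """
    if len(hu_values) != len(kidney_mask):
        raise ValueError("hu_values and kidney_mask must have the same length")
    stone_voxels = sum(
        1
        for hu, in_kidney in zip(hu_values, kidney_mask)
        if in_kidney and hu >= threshold_hu
    )
    return stone_voxels * voxel_volume_mm3

# Hypothetical voxels: the 400 HU voxel lies outside the kidney mask.
hu = [30, 45, 900, 1100, 400, 35]
mask = [1, 1, 1, 1, 0, 1]
volume = stone_volume_mm3(hu, mask, voxel_volume_mm3=0.5 * 0.5 * 1.0)
```

The dense voxel outside the mask is ignored, which illustrates why a kidney-restricted threshold cannot capture ureteral stones.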
Discussion
NCCT and stone diagnosis
NCCT technologic characteristics
NCCT is a routine imaging modality for both initial diagnosis and follow-up of ARC. NCCT has surpassed US as the first-line imaging in suspected ARC, except for young people, children, and pregnant women, and a low-dose NCCT has to be offered urgently (within 24 hours of presentation), according to NICE guidelines. 3 Although scrolling through NCCT images is a common task for urologists, knowledge of NCCT protocols is rarely widespread within the urology community. Overall, NCCT protocols include multiple parameters such as irradiation dose, intensity and tube tension, and slice thickness.
As the most clinically relevant parameter, NCCT irradiation dose is governed by the ALARA principle (“As Low As Reasonably Achievable”), which consists in obtaining the best possible information with the safest parameters and the lowest radiation exposure. 25 In other words, ALARA means avoiding exposure to radiation that does not have a direct benefit to the clinical purpose, even if the dose is small. Therefore, detailing irradiation dose (standard, low, or ultralow dose) in the NCCT protocol seems mandatory for any clinical or preclinical study.
To better understand intensity and tube tension, a historical point of view has to be developed. Using a rotating X-ray source on one side and a detector on the other side of the patient, NCCT is based on tissue attenuation to visualize organs in three dimensions. Initially reported by Wilhelm Conrad Röntgen in 1895, X-rays are created from electric current and acceleration between cathode and anode in a tube, with two main characteristics: intensity (of the current, mA) and tension (between cathode and anode, kV). 26
The last parameter is the slice thickness, which can refer to two distinct entities. The first is the detector slice thickness, which describes the size of the individual components of the detector array and corresponds to the thickness of the thin slice series. For example, if NCCT was acquired using a detector slice thickness of 2 mm, an image with voxel size less than 2 mm along the z-axis cannot be generated. In contrast, the reconstruction slice thickness determines the voxel depth of the multiplanar reconstructions, that is, how much data are included in a single slice. Commonly, slice thickness refers to the reconstruction slice thickness. Slice thickness ultimately determines the trade-off in image quality between spatial resolution (how clearly small changes in the image can be differentiated) and image noise (the standard deviation of the image). Thus, increasing slice thickness will decrease both spatial resolution and image noise.
Overall, our research identified two publications that did not report details about the NCCT protocol. Among the others, the described characteristics varied, the slice thickness being reported in 75% of cases. Tension and intensity were more scarcely reported.
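The effect of slice thickness on image noise can be illustrated with a short simulation. This is a sketch under simplifying assumptions that do not come from the reviewed studies: a thick slice is modeled as the average of adjacent thin slices, and the noise in each thin slice is independent and Gaussian with a hypothetical standard deviation of 20 HU.

```python
import random
import statistics

random.seed(0)  # fixed seed so the simulation is repeatable

def simulate_noise_sd(n_thin_per_thick, n_samples=20000, thin_sd=20.0):
    """Noise SD of a voxel after averaging n thin slices into one thick slice."""
    thick = [
        sum(random.gauss(0.0, thin_sd) for _ in range(n_thin_per_thick))
        / n_thin_per_thick
        for _ in range(n_samples)
    ]
    return statistics.pstdev(thick)

sd_thin = simulate_noise_sd(1)   # e.g., one thin slice
sd_thick = simulate_noise_sd(4)  # e.g., four thin slices averaged together
# Under these assumptions the ratio is expected to be close to sqrt(4) = 2.
noise_ratio = sd_thin / sd_thick
```

Under this model, quadrupling the slice thickness roughly halves the noise, at the cost of averaging a small stone with its surrounding tissue along the z-axis.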
Impact of NCCT protocols on KSD
Low-dose NCCT has been proposed and has achieved good performances in stone detection and measurements (sensitivity and specificity of 99% and 94%, respectively), by reducing the intensity and the intensity-time product (mA and mAs, respectively). 27 An automated current modulation, as in Lee et al.'s or more recently Mukherjee et al.'s studies, adapts the current to the tissue attenuation to avoid information loss in low-dose NCCT. 12,23 Brisbane et al. acknowledged low-dose NCCT for stone diagnosis or follow-up except in case of body mass index >30 kg/m2 because of tissue attenuation. 28 Moreover, image quality and accuracy tend to decrease with reduced tube current but in reasonable proportions for stone detection. For example, NCCT using 140, 100, 60, 30, 15, and 7.5 mAs settings resulted in 98%, 97%, 97%, 96%, 98%, and 97% sensitivity, and 83%, 83%, 83%, 86%, 80%, and 84% specificity for small stone detection (3–7 mm) in cadaveric ureters, respectively. 29
These results are consistent with a recent animal study, confirming that mA (intensity or dose) reduction is feasible for stones without losing information. 30 For its part, decreasing tube tension results in contrast enhancement (vascular/calcification/bones) and higher global attenuation, but a 100 to 120 kV tension setting is efficient for calcium-based structure analysis in abdominal NCCT. 31
Lastly, low-dose NCCT frequently includes greater slice thickness, but even ultralow-dose NCCT can avoid slice thickness increase and consequently small stone misdiagnosis with an adequate protocol. 23
In summary, although describing the NCCT protocol for research purposes is required for qualitative analysis, low kVp and low-intensity settings seem acceptable for stone detection and quantification. The only parameter that could be considered mandatory to report is the slice thickness, which should not exceed 2 mm to avoid small stone misdetection. Moreover, a network trained with various NCCT protocols could present a better external validity.
AI efficiency in stone detection
Turing introduced AI in the well-known essay “The Imitation Game” in 1950. 5 Replacing the question “Can a machine think?” by the different steps of task learning as close as an anime mind could do, Turing acknowledged for the first time the concepts of “rules,” “stores,” and “control,” still used in current AI experiments. Thus, with almost infinite possible combinations, AI can automate tasks for which a model or network is trained with input and output material. Training consists in defining the rules to find a path from given native to annotated data. That being said, AI can achieve one action in several ways, as shown in the present review. In the field of stone detection, various methods and networks have been described for data preprocessing (annotation), with two predominating segmentation methods: kidney segmentation followed by threshold-based segmentation or direct stone segmentation.
These methods presented similar outcomes, but only the second one can achieve an efficient ureteral stone segmentation in NCCT. Indeed, in Park et al., although the Fast R-CNN was able to segment the urinary tract, a second method (“watershed”) was needed to reach a detection rate of only 84%. 21 Furthermore, segmentation seems to outperform other detection methods, with a simple quantification process. A segment is a three-dimensional region of interest made of voxels, whose volume can also be recorded in cubic millimeters (mm3). Although SV is not currently the gold-standard measurement in international guidelines (which remains the stone maximum diameter [SMD]), SV seems more accurate to estimate the stone burden, especially for irregular or complex stone shapes. 3,7,32 Moreover, obtaining the maximum linear dimension, that is, SMD, and the maximum density (HU) from a segment is feasible with several free user-friendly imaging software packages in daily practice. 33 –35
Our review found 2D and 3D networks with distinct architectures and reported better outcomes with 3D ones for stone detection. 22 When comparing several 3D networks, Li et al. reported the Res U-Net as the most efficient for kidney stones. Recently, Elton et al. first reported a trained network for kidney stone detection with a proper external validation, that is, data from another center that have never been shown to the network during training. 20 Four studies used cross-validation to improve the internal validation and increase the training data set. 15,17,21,22
The cross-validation method consists in splitting the data set multiple times (k times, i.e., k-fold cross-validation) and running multiple training phases, with the validation fold occupying a different position in the data set each time. At the end of each cycle (training and validation), performance metrics are calculated to assess how the network is performing. Cross-validation aims to demonstrate that the network performance is not due to the random split of data and that the network would perform similarly in real-world conditions, but it is less robust than an external validation. 36
Transfer learning represents another technical aspect of AI processes. In addition to input and output data, a network and its architecture are required to achieve an efficient automated task. 36 Researchers are given two options: build their own network or reuse a previous network that has been trained for a similar task. The first option is more challenging than the second one because it requires high coding skills. On the contrary, the second one is easier to implement, as ML has spread in the research community with a large offer of pretrained networks for segmentation or image classification. Our review recorded 10 studies that chose to use pretrained networks. 12,14,15,17 –23 The remaining two publications either focused on a new task (phlebolith-distal ureteral stone distinction, 2021) or were the first to attempt urinary stone detection (2018). 13,16
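How SV, SMD, and maximum density can be derived from a segment is sketched below. This is an illustrative approximation rather than the method of any cited software: the function name and input layout are assumptions, and the SMD is taken as the largest distance between voxel centers, which underestimates the true extent by roughly one voxel.

```python
from itertools import combinations
from math import dist

def segment_measurements(voxels, spacing_mm):
    """Return (volume_mm3, smd_mm, max_hu) for a stone segment.

    voxels: list of ((i, j, k), hu) entries, one per voxel in the segment.
    spacing_mm: (dx, dy, dz) voxel size in millimeters.
    """
    dx, dy, dz = spacing_mm
    volume_mm3 = len(voxels) * dx * dy * dz
    max_hu = max(hu for _, hu in voxels)
    centers = [(i * dx, j * dy, k * dz) for (i, j, k), _ in voxels]
    # Largest center-to-center distance; a single-voxel segment gives 0.0.
    smd_mm = max((dist(a, b) for a, b in combinations(centers, 2)), default=0.0)
    return volume_mm3, smd_mm, max_hu

# Hypothetical four-voxel segment with 0.5 x 0.5 x 1.0 mm voxels.
segment = [((0, 0, 0), 650), ((1, 0, 0), 1200), ((2, 0, 0), 980), ((2, 1, 0), 700)]
sv, smd, hu_max = segment_measurements(segment, (0.5, 0.5, 1.0))
```

The pairwise search is quadratic in the number of voxels, which is acceptable for a sketch but would need a convex-hull or surface-only approach for large segments.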
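The fold rotation described above can be sketched in a few lines. This is a minimal illustration, not the procedure of any included study: `train_and_score` is a hypothetical callback standing in for network training and evaluation, and samples are split in order without shuffling.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if f < n_samples % k else 0) for f in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

def cross_validate(n_samples, k, train_and_score):
    """Run one training/validation cycle per fold and average the scores.

    train_and_score is a hypothetical callback taking (train, validation)
    index lists and returning a performance metric for that fold.
    """
    scores = [train_and_score(tr, va) for tr, va in k_fold_splits(n_samples, k)]
    return sum(scores) / len(scores), scores

splits = list(k_fold_splits(10, 5))  # five folds of two samples each
```

In imaging studies the split would usually be made per patient rather than per image, so that scans from the same patient never appear in both the training and validation folds.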
To date, no algorithm has been validated for urinary (kidney and ureter) stone detection and distinction between distal ureteral stones and phleboliths. Regarding this last task, a single study showed promising results (sensitivity, specificity, and accuracy of 94%, 90%, and 92% and a higher accuracy than the mean radiologist accuracy [92% vs 86%, p = 0.03]) but without external validation. 16 Furthermore, most studies excluded complex cases such as the presence of foreign bodies (nephrostomy tubes, ureteral stents, hip prosthesis) or phleboliths. Indeed, solitary kidneys, atrophic kidneys, renal anomalies, calcified renal masses, renovascular calcifications, regional lymph node calcifications, metallic implants, pigtail ureteral catheters, percutaneous nephrostomy catheters, and artifacts were excluded from Caglayan et al.'s study. 18 Therefore, integrating these trained networks in the clinical decision-making process does not seem feasible at the moment.
Moreover, the studies' primary objectives differed regarding the location of the stone targeted for detection, quantification, and distinction from phleboliths or vascular calcifications, as said before. Furthermore, a single study conducted a proper external validation. Thus, to advance the integration of trained neural network models in daily practice, further studies should focus on both ureteral and kidney stone detection, coupled with phlebolith distinction. Using a pretrained 3D network with both internal and external validations seems adequate for this purpose.
Design, outcomes, and statistics
As part of the heterogeneity in the reviewed data, participants, outcomes, and statistics varied among studies. First, the NCCT selection and database screening involved nonkidney stone formers in some studies, as shown in Elton et al.'s study with colonography CTs, for example. 20 The presence of a control group without urinary stones was reported in only six studies. 14,16 –18,20,21 In case of ureteral stones, the great variability of the stone location can justify the absence of a control group: all images showing an “empty” ureter without stones can be considered control images. On the contrary, kidney stones' location varies slightly. Therefore, a control group without stone appears mandatory, but was described only for studies that used a direct stone segmentation. 17,18,20 As shown previously, the two remaining studies conducted a two-stage kidney stone segmentation method, explaining why the authors judged it unnecessary to include a control group. 15,23
Our research also found high variability in the provided clinical data: no data, 12,13,21 clinical data (including age, gender, or urinary abnormality or variation), 14 –16,18,19,22,23 and stone characteristics (SMD, density [HU], or SV). 14 –20,22,23 Clinical and stone data are essential in AI studies as well as in clinical trials for both outcomes analysis and generalization. Therefore, providing demographic and stone characteristics appears mandatory for further studies. Similarly, NCCT details were partially lacking in a non-negligible number of included studies, and NCCT protocols showed a high degree of heterogeneity. Nevertheless, AI-network performances differed within an acceptable range, although most studies conducted only an internal validation, which lowers the reliability of their findings for clinical practice.
Furthermore, the data annotation method varied among studies, without any detail on how the bone window was defined (manually or by presetting) for stone segmentation. While a consensus has been reached on using the bone window for stone measurements on NCCT to avoid overestimation, it has also been shown that a manual bone window is reliable and accurate. 37 –39
Heterogeneity also lay in the outcomes and statistics. Among the 12 included studies, we recorded 11 different statistical criteria. On one hand, standard probability measures such as sensitivity, specificity, and positive and negative predictive values were frequently reported. They are common measures to assess the performance of diagnostic tests or classification models, such as ML networks. On the other hand, the Dice score, also known as the Sorensen-Dice coefficient or F1 score, is a statistical measure used to assess the similarity or overlap between two sets or groups, ranging from 0 (no overlap) to 1 (total similarity). It is commonly used in various fields, including image segmentation, natural language processing, and information retrieval, to evaluate the agreement or similarity between two sets of data. 40
Accuracy differs from the Dice score in that it counts true negatives among the correct classifications (percentage of correct detections), whereas the Dice score ignores true negatives and is lowered by both false positives and false negatives. 41 The Dice score could represent the best criterion to analyze AI network performances in the field of image segmentation and was recorded in four studies. 15,17,19,23 Table 3 summarizes the encountered metrics and provides an ML-oriented definition for each.
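The overlap definition of the Dice score, and its equivalence with the F1 score, can be checked on a toy example. The voxel index sets below are hypothetical, and returning 1.0 for two empty segments is a convention chosen for this sketch.

```python
def dice_score(predicted, reference):
    """Sorensen-Dice coefficient between two collections of voxel indices."""
    predicted, reference = set(predicted), set(reference)
    if not predicted and not reference:
        return 1.0  # convention chosen here: two empty segments agree
    overlap = len(predicted & reference)
    return 2 * overlap / (len(predicted) + len(reference))

# Hypothetical voxel indices of a manual and an automated stone segment.
manual = {1, 2, 3, 4, 5, 6}
automated = {4, 5, 6, 7}
dice_example = dice_score(automated, manual)  # 2 * 3 / (4 + 6)

# The same value expressed with true positives, false positives, false negatives.
tp = len(automated & manual)
fp = len(automated - manual)
fn = len(manual - automated)
f1_example = 2 * tp / (2 * tp + fp + fn)
```

Both expressions give 0.6 here, because the sum of the two segment sizes equals 2TP + FP + FN.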
Table 3. Statistics in Artificial Intelligence Segmentation Models
Acc = accuracy; NPV = negative predictive value; PPV = positive predictive value; Se = sensitivity; Sp = specificity.
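As an illustration of how the metrics named in this section relate to one another, the sketch below computes them from confusion-matrix counts. The counts are hypothetical and chosen to mimic a segmentation task in which stone voxels are a small fraction of the image; the function assumes every denominator is nonzero.

```python
def detection_metrics(tp, fp, tn, fn):
    """Standard metrics from confusion-matrix counts (denominators assumed > 0)."""
    return {
        "Se": tp / (tp + fn),                    # sensitivity
        "Sp": tn / (tn + fp),                    # specificity
        "PPV": tp / (tp + fp),                   # positive predictive value
        "NPV": tn / (tn + fn),                   # negative predictive value
        "Acc": (tp + tn) / (tp + fp + tn + fn),  # accuracy
        "Dice": 2 * tp / (2 * tp + fp + fn),     # Dice score (F1)
    }

# Hypothetical voxel counts: 80 true stone voxels among 10,000 voxels.
metrics = detection_metrics(tp=60, fp=20, tn=9900, fn=20)
```

With these counts the accuracy is 0.996 while the Dice score is 0.75, because the large number of true negatives inflates accuracy but does not enter the Dice score.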
Clinical implications and surgical planning
Our literature review intended to report the current evidence on urinary stone automated detection on NCCT. This radiologic classification task, among other oncologic and nononcologic imaging interpretations, can benefit from AI and supervised learning, with a direct improvement for patients. Involving AI processes in the radiology field does not aim to and will not overcome radiologists, but radiologists with AI will provide better interpretations compared with radiologists without AI. 42,43 In the initial stage of ARC, an automated stone detection would help to improve patients' path from emergency departments to urology clinics in several ways. First, radiologists will have more time dedicated to find what is usually missed on NCCT because of focusing on finding the stone (the “unreported data”), such as urologic (anatomical variation, clots, small renal mass, vascular abnormalities) and non-urologic findings.
Moreover, the decision of contrast agent injection for nonstone-related ARC will be facilitated by the gained time on standard cases. Therefore, an etiologic diagnosis will be given to a higher proportion of patients, even in case of nonstone-related ARC. Then, a greater amount of NCCT can be interpreted in the same amount of time, which would increase NCCT access in emergency for patients, currently limited because of human resources. 3 Regarding this last aspect, we acknowledge that the new potential human limit for NCCT access in case of ARC would be the reasonable number of NCCTs a radiographer can perform. A recent clinical audit conducted by the British Association of Urological Surgeons (BAUS) of ureteral stone care pathways reported female patients to have lower access to NCCT performed within 24 hours of ARC presentation (13% vs 7.3% for men [chi-squared p = 0.01]). 4 We can reasonably think AI-aided imaging to solve discrepancies in access. Finally, an AI-detected stone could trigger an automated clinic apportionment with the urology department, improving without any doubt the follow-up. Thus, recent publications have demonstrated a threefold reduction of the duration between referral and treatment by the creation of a dedicated ARC clinic. 44 Consequently, having an automated method for stone detection could reduce even more this waiting time, but will reach the available human resource limit and the reasonable delay for spontaneous passage or medical expulsive therapy (2–4 weeks). 7
A further step in integrating AI into clinical practice could be the use of large language models to generate automated radiology reports for radiologists. A recent experiment using ChatGPT showed promising results in generating accurate reports. 45 However, the authors highlighted some incorrect statements, missed relevant medical information, and potentially harmful passages. Integrating AI alongside practitioners, rather than replacing them, would address the responsibility issue inherent in AI misdiagnosis or unreported data, but special attention must be given to the ethical principles governing the application of AI to health care and to urology. 46
Finally, automated stone detection will facilitate the planning of endourologic procedures through automated stone burden estimation. From NCCT segmentation, SV is easily accessible in daily practice, and lithotripsy duration for flexible ureteroscopy can be accurately calculated from it. 32,35,47,48 After being applied to stone diagnosis and quantification, AI networks could be extended to surgical planning and continue to improve patient care.
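As an illustration of the SV-based estimation described above, the following minimal sketch computes SV from a binary segmentation mask as the number of stone voxels multiplied by the voxel volume, then derives a lasing time by dividing SV by an ablation rate. It is not taken from any of the included studies: the function names, the toy mask, the voxel spacing, and the constant ablation rate are all assumptions made for illustration, and the ablation rate in particular is a hypothetical user-supplied value.

```python
def stone_volume_mm3(mask, spacing_mm):
    """Stone volume from a binary 3D mask: stone voxel count x voxel volume.

    mask is a nested list indexed [slice][row][column]; spacing_mm is the
    (slice, row, column) voxel spacing in millimetres. Both are assumed inputs.
    """
    dz, dy, dx = spacing_mm
    voxel_count = sum(1 for sl in mask for row in sl for v in row if v)
    return voxel_count * dz * dy * dx


def lithotripsy_minutes(volume_mm3, ablation_rate_mm3_per_min):
    """Lasing time assuming a constant ablation rate (hypothetical input)."""
    if ablation_rate_mm3_per_min <= 0:
        raise ValueError("ablation rate must be positive")
    return volume_mm3 / ablation_rate_mm3_per_min


# Toy 2 x 2 x 2 mask containing 5 stone voxels of 1.0 x 0.5 x 0.5 mm each.
mask = [[[1, 1], [0, 1]], [[0, 0], [1, 1]]]
volume = stone_volume_mm3(mask, (1.0, 0.5, 0.5))

# The ablation rate below is a placeholder, not a clinically validated value.
minutes = lithotripsy_minutes(volume, ablation_rate_mm3_per_min=0.5)
```

In practice the mask would come from the detection network's segmentation output and the spacing from the image metadata; a constant ablation rate is a simplification, since the rate depends on laser settings and stone composition.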
Conclusion
AI and DL processes can detect and measure urinary stones on NCCT. Currently, trained networks do not meet all the requirements for stone detection on NCCT: detection of both ureteral and kidney stones, and distinction between phleboliths and distal ureteral stones. Further studies are needed that provide external validation to generalize the presented results and that include complex cases with anatomical abnormalities and common urologic foreign bodies (ureteral stents and nephrostomy tubes). Automated stone detection on NCCT represents the future of management in the early emergency and urologic planning stages of urolithiasis.
Availability of Data and Materials
The data sets used and analyzed during this study are available from the corresponding author upon reasonable request.
Research Involving Human Participants or Animals
This article does not contain any studies with human participants or animals performed by any of the authors.
Footnotes
Authors' Contributions
F.P., D.S.: Conceptualization, methodology, data collection and analysis, writing—original draft, and writing—review and editing. H.C.-S., Y.P., C.A., V.A., S.C., S.A.: Writing—review and editing.
Author Disclosure Statement
F.P. has declared consultancy for Dornier MedTech. D.S. has declared educational work with Olympus, Storz, and Cook. S.A. has declared educational work with Storz. The remaining authors declare that they have no conflict of interest.
Funding Information
EUSP Scholarship of the European Association of Urology (FPT) (grant number: 2023-002). French Association of Urology Research Grant (FPT).
Supplementary Material
Supplementary Figure S1
Abbreviations Used
References
