Abstract
Introduction:
This study aims to systematically review the literature reporting tools for scoring stone complexity and the stratification of outcomes by stone complexity. In doing so, we aim to determine whether the evidence favors uniform adoption of any one scoring system.
Methods:
PubMed and Embase databases were systematically searched for relevant studies from 2004 to 2014. Reports selected according to predetermined inclusion and exclusion criteria were appraised in terms of methodologic quality and their findings summarized in structured tables.
Results:
After review, 15 studies were considered suitable for inclusion. Four distinct scoring systems were identified, along with a further five studies that aimed to validate aspects of those scoring systems. Six studies reported the stratification of outcomes by stone complexity, without specifically defining a scoring system. All studies reported some correlation between stone complexity and stone clearance. Correlation with complications was less clearly established, where investigated.
Conclusions:
This review does not allow us to firmly recommend one scoring system over the others. However, the quality of evidence supporting validation of the Guy's Stone Score is marginally superior, according to the criteria applied in this study. Further evaluation of the interobserver reliability of this scoring system is required.
Introduction
A standardized measure of stone complexity and burden would have advantages over subjective assessment. For example, it would allow more robust risk adjustment, which is important both for research purposes, to facilitate meaningful comparison between outcomes in different studies, and in establishing standards or attempting comparison between different providers in the context of quality assurance and improvement. 1
Furthermore, a validated instrument for assessing stone complexity could facilitate accurate predictions regarding surgical outcomes and would represent a valuable asset in planning surgery and counseling patients.
We undertook a systematic literature review to identify instruments that have been developed to categorize renal stone complexity, with a specific focus on PCNL. This review aims to appraise the evidence in support of the validation of the various scoring systems and to assess whether this evidence favors the uniform adoption of one of these.
Methods
Search strategy
We searched Medline and Embase databases for articles published between January 2004 and May 2014. No language restrictions were imposed.
The search terms used were as follows: “PCNL stone complexity” OR “Guy's Stone Score” OR “nephrolithometry” OR “PCNL Stone Burden” OR “PCNL Stone Score.” The reference lists of included articles were reviewed to identify additional studies for consideration.
Inclusion and exclusion criteria
We included all studies evaluating stone complexity with specific regard to PCNL surgical outcomes, published since 2004, to reflect contemporary practice and outcomes (Fig. 1).

Fig. 1. Summary of study selection process.
We included studies reporting the derivation and validation of rating systems for stone complexity as well as those less formally suggesting pre- and perioperative determinants of stone clearance and complications, for example, through multivariate logistic regression analyses or more simply stratifying or comparing outcomes between groups defined by complexity.
Pediatric PCNL, being considered a distinct area of practice, was excluded. We excluded any studies not solely looking at PCNL and its outcomes, review articles, comments, editorials, case reports, and in-vitro or animal experiments.
Due to the heterogeneity of study design, treatment methodology, and patient population of included studies, quantitative meta-analysis was not attempted.
Data extraction and assessment of methodologic quality
Titles and abstracts identified by the search were independently assessed for inclusion by two reviewers (J.W. and J.A.). All articles that were potentially relevant were obtained and their suitability for inclusion was assessed again after reading the complete articles.
Where only conference abstracts could be found for studies, the authors were contacted to request further detailed information.
Data were then independently extracted by the same two reviewers (J.W. and J.A.) and presented in structured tables. These data included information on study design, population characteristics, and outcomes. Any disagreements were resolved by discussion and consultation with a third reviewer (O.W.). Missing and additional information was sought from the authors.
The quality of identified studies was appraised using an abbreviated and adapted tool (Table 1), incorporating aspects of both QUADAS2 and STARD, and using published frameworks for the assessment of methodologic quality in diagnostic studies. 4,5
Table 1 footnotes: Study focused on interobserver reliability. Mini-PCNL. Confounding by indication for CT.
Table 1 abbreviations: N = no; N/A = not applicable; P = partial; PCNL = percutaneous nephrolithotomy; U = unknown/unclear; Y = yes.
Data extracted from studies
The following information was abstracted:
• Methods for developing measures of stone complexity
• Reporting tools and variables used to describe stone complexity
• Imaging modalities used to derive scores and define complexity
• Specific outcomes used to report success rate (including the imaging modality used to assess stone status and the definition of “stone-free status”)
• Correlation of stone clearance with stone complexity
• Inclusion of complication rates with outcomes and stratification of these according to stone complexity
• Attempts to assess interobserver reliability
Results
Four distinct scoring systems were identified, which are summarized in Table 2. 1,7,8,10 The characteristics of all 15 included studies are reported in Tables 3 and 4, including the methodological assessment score, which is itemized for each study in Table 1.
Footnotes and abbreviations for Tables 2 to 4: S-ReSC = Seoul renal stone complexity score. Study focused on interobserver reliability. ANOVA = analysis of variance; AUC = area under the curve; CROES = Clinical Research Office of the Endourological Society; ROC = receiver operating characteristic.
Six studies stratified outcomes by stone complexity without explicitly describing scoring systems. Three of these describe the results of exploratory regression analyses designed to identify which factors influenced outcomes. 12–17
A further five reports aimed to externally validate previously described scoring systems. 2,3,6,9,11
All but one study reported data from a single institution. 7 Study sample sizes ranged from 58 to 12,482 patients.
Assessment of methodologic quality
The methodologic quality of included studies was evaluated, using a scoring system, incorporating aspects of the STARD and QUADAS2 scores (Table 1). 4,5 This evaluation aimed to reflect the internal and external validity of the studies, including assessments of interobserver reliability.
All but one of the studies defined the primary outcomes clearly. 16 Five studies failed to clearly define the patient inclusion and exclusion criteria. 8,13,15–17
Five studies failed to demonstrate that their cohort of patients was widely representative. 6,10,14,16,17 One study, for example, excluded patients with comorbidity, severely limiting its external validity. 6
Only two studies reported that rating was performed preoperatively, thereby preventing bias in rating stone complexity caused by raters already knowing outcomes. 1,6 No other study reported blinding of raters to outcomes by any other method.
Correlation between stone complexity and stone clearance
All four studies pertaining to the Guy's Stone Score report significant correlation with the stone-free rate following PCNL. 1,2,3,6
The nephrolithometric nomogram was found to have predictive accuracy with regard to stone clearance, based on receiver operating characteristic (ROC) curve analysis, with an area under the curve (AUC) of 0.76. 7 On univariate regression, stone burden was identified as the most reliable predictor of stone clearance.
Analysis of the influence of individual components of the S.T.O.N.E. score on stone-free rate indicated that only stone size and number of involved calices (not density, degree of obstruction, or tract length) predicted the stone-free rate. 10
Seoul Renal Stone Complexity score (S-ReSC) was also found to accurately predict the stone-free rate after PCNL. This was reported as AUCs of between 0.853 (95% confidence interval [CI] 0.787, 0.919) and 0.860 (95% CI 0.793, 0.927). 8,9
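As an aid to interpreting the AUC values quoted above, the following is a minimal sketch of how an AUC can be computed from a complexity score and a binary outcome. It uses the pairwise (Mann-Whitney) interpretation of the AUC and is not taken from any of the reviewed studies. The function name `roc_auc` and the eight-patient data set are hypothetical and chosen only for illustration.

```python
def roc_auc(scores, outcomes):
    """Return the AUC as the probability that a randomly chosen case with the
    outcome (1) has a higher score than a randomly chosen case without it (0).
    Tied scores count as half a concordant pair."""
    pos = [s for s, y in zip(scores, outcomes) if y == 1]
    neg = [s for s, y in zip(scores, outcomes) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one case in each outcome group")
    concordant = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                concordant += 1.0
            elif p == n:
                concordant += 0.5
    return concordant / (len(pos) * len(neg))


# Hypothetical data: complexity grade (1 to 4) for eight patients and whether
# residual stone was present after PCNL (1 = residual stone, 0 = stone free).
grades = [1, 1, 2, 2, 3, 3, 4, 4]
residual = [0, 0, 0, 1, 0, 1, 1, 1]

auc = roc_auc(grades, residual)
print(f"AUC = {auc:.3f}")
```

An AUC of 0.5 corresponds to a score that discriminates no better than chance, and 1.0 to perfect discrimination, which is the scale on which the reported values of 0.76, 0.853, and 0.860 should be read.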
Each of the studies that stratified outcomes without outlining a scoring system reported that stone clearance rates were correlated with stone complexity, which was variously defined. 12–17
Correlation between stone complexity and complications
Two studies reported a correlation between Guy's Stone Score and complication rates, 3,6 whereas two reported no such correlation. 1,2
The Nephrolithometric nomogram study did not investigate interactions between stone complexity and complication rates. 7
Analysis of the influence of individual components of the S.T.O.N.E. score on complications indicated that only stone size influenced complication rates, although this effect was not significant. In addition, the authors reported a statistically significant correlation between the S.T.O.N.E. score and estimated blood loss, operative time, and length of hospital stay. 10
The Seoul stone score was associated with a statistically nonsignificant difference between three groups defined by stone complexity, in terms of overall complication rates. 8
Four of the studies that stratified outcomes without outlining a scoring system report investigation into correlations between complication rates and stone complexity. 13,14,17 Two studies demonstrated a significant effect of stone size on length of stay, 12,14 with a further study demonstrating a significant correlation between stone area, but not stone configuration, and complication rates. 13
Interobserver reliability
Interobserver reliability was not robustly investigated in any of the included studies and was not reported at all in two studies pertaining to scoring systems. 6,7 Although several of the reports mentioned interobserver reliability, and the number of items reviewed was generally adequate for its assessment, the number of raters was arguably insufficient in every study in which this characteristic was reported. Threshold numbers of items and raters for studies aiming to assess interobserver reliability have been published (Table 5). 18 These estimated thresholds imply that none of the reviewed studies estimated interobserver reliability with a variation coefficient (reflecting imprecision) less than 30%.
Several of the studies used junior fellows or even medical students in their attempts to demonstrate inter-rater reliability. 2,11 One study used two urology residents, whose specific levels of experience were not recorded. 2 The specific areas of disagreement seem to suggest that more experienced adjudication would have been beneficial. Similarly, in assessing the interobserver reliability of S.T.O.N.E. nephrolithometry, a better level of reproducibility was demonstrated for the score as a whole and for each individual component (S.T.O.N.E.) when the medical student raters were excluded. 11
The interobserver reliability assessment of the S.T.O.N.E. score involved review of 58 sets of noncontrast CT images, the most of any of the reviewed studies. 11
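To make the notion of interobserver reliability concrete, the sketch below computes unweighted Cohen's kappa for two raters grading the same set of cases. This is an assumption for illustration only: the reviewed studies may have used other agreement statistics, and the function name `cohens_kappa` and both rating lists are hypothetical.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa: agreement between two raters, corrected for
    the agreement expected by chance from each rater's grade frequencies."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must grade the same non-empty set of cases")
    n = len(rater_a)
    # Proportion of cases on which the two raters gave the same grade.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal proportions per grade.
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    expected = sum(freq_a[g] * freq_b[g] for g in freq_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (observed - expected) / (1.0 - expected)


# Hypothetical complexity grades (1 to 4) assigned by two raters to ten scans.
rater_a = [1, 1, 2, 2, 3, 3, 4, 4, 2, 3]
rater_b = [1, 1, 2, 3, 3, 3, 4, 4, 2, 2]

kappa = cohens_kappa(rater_a, rater_b)
print(f"kappa = {kappa:.2f}")
```

Unweighted kappa treats every disagreement as equally serious. For ordinal grades, a weighted kappa or an intraclass correlation coefficient may be preferable, and with only two raters the estimate says little about how a wider population of raters would agree, which is the limitation raised above.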
Discussion
This is the first systematic review of studies stratifying the success of PCNL against measures of stone complexity. We identified four distinct scoring systems, and the quality of evidence underpinning their use in clinical practice is variable. The best quality evidence appears to be from studies assessing the Guy's Stone Score, which was shown to accurately predict outcomes, with supporting studies scoring higher than others in the methodological quality assessment tool.
Further validation work, including more robust assessment of interobserver reliability, is required and is currently being undertaken.
Study heterogeneity
There is considerable heterogeneity in the included studies, particularly with respect to the outcomes used for validation, including the precise definitions and imaging strategies used to check stone clearance. 19 Furthermore, studies aiming to validate previously reported scoring systems have used different criteria for stone clearance than the original study, and even definitions of predictive accuracy vary or are incompletely defined. 1–3 Such heterogeneity renders comparisons of predictive accuracy problematic.
Direct comparison between different scoring systems using a single cohort and consistent outcome criteria may offer the best way to determine which scoring systems offer the best predictive accuracy.
The Guy's Stone Score was directly compared with the nephrolithometric nomogram by the Clinical Research Office of the Endourological Society (CROES) study group and found to have a ROC AUC of 0.69, indicating inferior predictive accuracy compared with the nephrolithometric nomogram (0.76, p < 0.001). 7
A recent study compared the Guy's Stone Score, the S.T.O.N.E. nephrolithometry score, and the CROES study nephrolithometric nomogram, each calculated for patients in a single 3-year cohort. 1,7,10,20 Regression analyses were used to compare associations between each score, stone-free status, and complication rates. All three scoring systems were equally predictive of stone-free status, while the Guy's Stone Score and S.T.O.N.E. nephrolithometry also predicted complications. 20
Clinical implications
A reliable definition of stone complexity could have far-reaching implications for clinical practice, including training, workforce planning, centralization, referral pathways, and revalidation.
For instance, competence to perform PCNL on a less complex stone may be used as a measurable benchmark within a modular training program, whereas an ability to safely operate on a more complex stone might represent the objective of a subspecialist fellowship. In turn, planning for the provision of such fellowships could be guided by an understanding of the epidemiology of stone complexity currently encountered by PCNL surgeons, as reflected by registry data, for example. 21
For consultant urologists performing PCNL, an ability to accurately and reliably represent the complexity of cases and to use this to report risk-adjusted outcomes could be utilized for revalidation.
Comparisons between different providers would also benefit from a uniformly adopted stone complexity score, for the purpose of risk adjustment. Such comparison can have important health policy implications, for example, with volume outcome analysis informing debates concerning the centralization of endourologic services. It is conceivable, for example, that future analyses could provide evidence to support the centralization of PCNL for more complex stones, with lower volume centers continuing to offer PCNL for other stones. In fact, it seems likely that this situation reflects current UK referral practice.
For the comparison of outcomes between surgical care providers to be meaningful, robust risk adjustment must be established and uniform adoption of a single grading system for stone complexity would facilitate risk adjustment for PCNL.
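The role of a complexity score in risk adjustment can be illustrated with a simple observed-to-expected (O/E) ratio, a form of indirect standardization. The sketch below is one possible approach under stated assumptions, not a method described in the reviewed studies. The function name `observed_to_expected`, the reference stone-free rates per grade, and both provider case lists are hypothetical.

```python
def observed_to_expected(cases, reference_rates):
    """Ratio of observed stone-free cases to the number expected if each case
    had the reference stone-free probability for its complexity grade.

    cases: list of (grade, stone_free) tuples, stone_free being True or False.
    reference_rates: dict mapping grade to a reference stone-free probability.
    """
    observed = sum(1 for _, stone_free in cases if stone_free)
    expected = sum(reference_rates[grade] for grade, _ in cases)
    if expected == 0:
        raise ValueError("expected number of stone-free cases is zero")
    return observed / expected


# Hypothetical reference stone-free probabilities by complexity grade.
reference_rates = {1: 0.9, 2: 0.8, 3: 0.6, 4: 0.4}

# Hypothetical provider A treats only grade 1 stones: 9 of 10 stone free.
provider_a = [(1, True)] * 9 + [(1, False)] * 1
# Hypothetical provider B treats only grade 4 stones: 5 of 10 stone free.
provider_b = [(4, True)] * 5 + [(4, False)] * 5

oe_a = observed_to_expected(provider_a, reference_rates)
oe_b = observed_to_expected(provider_b, reference_rates)
print(f"Provider A: crude 0.90, O/E = {oe_a:.2f}")
print(f"Provider B: crude 0.50, O/E = {oe_b:.2f}")
```

In this constructed example, provider B has the lower crude stone-free rate but the higher O/E ratio once case mix is taken into account, which is the kind of reversal that unadjusted comparisons between providers would miss.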
Importantly, a preoperative understanding of the probability of success and complication from PCNL allows meaningful and accurate information to be afforded to patients undergoing this surgery. The scoring systems reviewed represent significant progress in this respect.
Research implications
It is hoped that this systematic review will stimulate debate between eminent practitioners and researchers in the field of PCNL, toward a future consensus about scoring stone complexity, which would facilitate the interpretation and comparison of results between centers.
PCNL is continuously evolving, both in terms of technique and in how services are structured. A universally applied scoring system remains highly desirable, particularly for risk adjustment in multicenter studies comparing outcomes between providers, for evaluating newly introduced techniques, and for measuring variability in outcomes.
Conclusion
It is not possible on the basis of this review to advocate adoption of one of the existing scoring systems, in either clinical practice or research. However, the quality of evidence supporting validation of the Guy's Stone Score is marginally superior, according to the criteria applied in this study, although this evidence has its own limitations. Further evaluation of the interobserver reliability of this scoring system is required.
Footnotes
Author Disclosure Statement
No competing financial interests exist.
