Abstract
Background and Purpose:
Several scoring systems have recently emerged to predict stone-free rate (SFR) and complications after percutaneous nephrolithotomy (PCNL). We aimed to compare the most commonly used scoring systems (Guy's stone score, S.T.O.N.E. nephrolithometry, and CROES nomogram), assess their predictive accuracy for SFR and other postoperative variables, and develop a risk group stratification based on these scoring systems.
Materials and Methods:
We performed a retrospective review of patients who underwent PCNL at four academic institutions between 2006 and 2013. The primary outcome was SFR within 3 weeks of surgery; secondary outcomes were operative time (OT), complications, and length of stay (LOS). We performed chi-squared tests, t-tests, and logistic, linear, and Poisson regressions, as well as receiver operating characteristic curve analysis with area under the curve (AUC) calculation.
Results:
We identified 586 patients eligible for analysis. Of these, 67.4% were stone free. Guy's, S.T.O.N.E., and CROES scores were predictive of SFR on multivariable logistic regression (odds ratio [OR]: 1.398, 95% confidence interval [CI]: 1.056, 1.852, p = 0.019; OR: 1.417, 95% CI: 1.231, 1.631, p < 0.001; OR: 0.993, 95% CI: 0.988, 0.998, p = 0.004) and had similar predictive accuracy with AUCs of 0.629, 0.671, and 0.646, respectively. On multivariable linear regression, only S.T.O.N.E. was an independent predictor of longer OT (β = 14.556, 95% CI: 12.453, 16.660, p < 0.001). None of the scores were independent predictors of postoperative complications or a longer LOS. Poisson regression allowed for risk group stratification and showed the S.T.O.N.E. score and CROES nomogram to have the most distinct risk groups.
Conclusions:
The three evaluated scoring systems have similar predictive accuracy of SFR. S.T.O.N.E. has additional value in predicting OT. Risk group stratification can be used for patient counseling. Further research is needed to identify whether or not any is superior to the others with regard to clinical usefulness and predictive accuracy.
Introduction
Despite years of research on preoperative variables as predictors of outcomes after PCNL, there remains a lack of standardized reporting of preoperative patient and stone-related data. 8–10
Until recently, a useful scoring system for prognostic evaluation of success and complication rates of PCNL surgeries was unavailable. Since 2011, Guy's stone score (GSS), S.T.O.N.E. nephrolithometry, and the Clinical Research Office of the Endourological Society (CROES) nomogram have been proposed as means of preoperative assessment using patient and stone characteristics. 11–13 These scoring systems are based on data that are easily obtainable before the surgery from preoperative imaging and patient history and could become a means of standardized reporting of preoperative cohort data. The different variables used in each of the scoring systems are listed and compared in Table 1. To date, there have been a few studies validating these scoring systems on single institutional patient cohorts. Applicability and generalizability of these scoring systems remain to be elucidated.
PCNL = percutaneous nephrolithotomy; SWL = extracorporeal shockwave lithotripsy; URS = ureteroscopy; CROES = Clinical Research Office of the Endourological Society.
To the best of our knowledge, we present the largest multicenter cohort study evaluating and comparing these three scoring systems for their accuracy in predicting postoperative outcomes and clinical applicability.
Materials and Methods
After obtaining internal Institutional Review Board approval at each participating institution, we performed a retrospective chart review of patients who underwent a PCNL between 2006 and 2013 at four academic institutions.
Selection criteria
Patients who were 18 years or older at the time of surgery and had an available preoperative CT were included in the study. All surgeries included were performed as primary PCNL surgeries. Secondary surgeries for treatment of residual fragments after the initial surgery were excluded from this analysis. All surgeries were performed in academic referral centers for stone disease by experienced fellowship-trained endourologists.
Measurements
A single observer from each center reviewed the images and reported on all the variables obtainable from CT, necessary for calculation of the GSS, S.T.O.N.E. score, and CROES nomogram, as described by Thomas et al., 11 Okhunov et al., 12 and Smith et al. 13 We defined a partial staghorn as a stone extending into one or two calices, and stones extending in more than two calices were categorized as staghorn calculi.
Demographic and perioperative data
Patient demographics collected were age, gender, body mass index (BMI), American Society of Anesthesiology (ASA) score, and previous surgical and medical history. Perioperative data collected included operative time (OT), length of stay (LOS), number and location of percutaneous tracts, and postoperative complications, according to the Clavien scores assigned to each complication post PCNL as described by de la Rosette et al. 14
Outcomes
Primary outcome of our analysis was stone-free rate (SFR) on postoperative day 1 or within 3 weeks of the surgery as assessed by noncontrast CT or kidney, ureter, and bladder radiograph with renal ultrasound. We used a cutoff size of 2 mm for clinically insignificant residual fragments. 15 As a secondary analysis, we aimed to identify the predictive value and accuracy of the scoring systems for OT, LOS, and complications according to the adjusted Clavien classification. 14 The final goal of our analysis was to create risk groups based on the scoring systems.
Statistical analysis
For comparison of the variables between patients who are stone free and not stone free, we used chi-square test for categorical variables and Student's t-test for continuous variables. For multiple group comparisons, between-group comparison was performed with an adjusted p-value according to Holm. 16 To assess predictability of SFR, we performed univariable and multivariable logistic regression analysis. Univariable and multivariable linear regression were used to assess predictability of OT, postoperative complications, and LOS. The S.T.O.N.E. score was categorized in four groups and CROES nomogram was divided in quartiles for risk group stratification. We used a modified Poisson regression model with a robust variance estimator to estimate relative risk for residual stone after a single PCNL surgery. To identify the predictive accuracy of each of the stone scoring systems, we generated receiver operating characteristic (ROC) curves with area under the curve (AUC) analysis. AUCs were compared according to Hanley and McNeil. 17 Significance was established with a p-value <0.05. Statistical analysis was performed using SPSS version 22 (IBM Corp., Armonk, NY).
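The Holm adjustment mentioned above can be illustrated with a minimal sketch. The raw p-values below are hypothetical and are not taken from the study; the function name `holm_adjust` is likewise an illustrative choice, and the authors' actual computation was done in SPSS.

```python
def holm_adjust(p_values):
    """Return Holm step-down adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the raw p-values, smallest first.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Adjusted values are forced to be non-decreasing along the ordering.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Hypothetical raw p-values from four pairwise comparisons.
raw = [0.004, 0.030, 0.020, 0.45]
adjusted = holm_adjust(raw)
significant = [p < 0.05 for p in adjusted]
```

A comparison that is nominally significant (raw p < 0.05) can lose significance after adjustment, which is the situation described in the Results for the lower pole and renal pelvis comparisons.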
Results
Out of a total cohort of 1696 patients treated, 586 patients had all data necessary for analysis available and were included in the study. Patient demographics and clinical characteristics, as well as perioperative variables of patients who are stone free and not stone free, are available in Table 2. After the surgery, 67.4% of patients were considered stone free. The average stone size was significantly larger in the cohort that was not stone free than in the stone-free cohort (1045 vs 557 mm2, respectively, p-value <0.001). Patients with stones in multiple locations had significantly lower success rates compared to patients who had a stone in a single location (p-value 0.002) on post hoc analysis. SFRs for lower pole and renal pelvis stones were higher than for stones in other or multiple locations (p = 0.041 and 0.013), but these differences did not reach statistical significance after adjustment of p-values for multiple comparisons. The median scores of the three scoring systems were significantly different for patients who are stone free vs patients who are not stone free: Guy's 2 vs 3, p < 0.001; S.T.O.N.E. 7 vs 9, p < 0.001; and CROES 220 vs 183, p < 0.001.
Significant results are bolded.
ASA = American Society of Anesthesiology; BMI = body mass index; IQR = interquartile range.
On multivariable logistic regression, accounting for stone size, tract length, stone location, mean stone density, hydronephrosis, staghorn morphology, age, BMI and ASA, the GSS, S.T.O.N.E. score, and CROES nomogram score were all independent predictors of residual fragments after PCNL (odds ratio [OR]: 1.398, 95% confidence interval [CI]: 1.056, 1.852, p = 0.019; OR: 1.417, 95% CI: 1.231, 1.631, p < 0.001; OR: 0.993, 95% CI: 0.988, 0.998, p = 0.004, respectively).
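Because these odds ratios are expressed per one-point change in each score, their magnitudes are not directly comparable across scales. The sketch below rescales them to the gaps between the reported median scores, assuming the log-odds are linear in the score (the usual logistic regression assumption). The helper names and the use of a Wald-type interval with z = 1.96 are assumptions made for illustration.

```python
import math


def odds_ratio_for_change(or_per_point, points):
    """Odds ratio implied by a change of `points`, assuming linear log-odds."""
    return math.exp(points * math.log(or_per_point))


def se_from_ci(lower, upper, z=1.96):
    """Standard error of log(OR) recovered from a Wald-type interval."""
    return (math.log(upper) - math.log(lower)) / (2 * z)


# Reported per-point odds ratios for residual fragments.
stone_or = 1.417  # S.T.O.N.E. score
croes_or = 0.993  # CROES nomogram score

# Two-point gap between the reported median S.T.O.N.E. scores (7 vs 9).
stone_two_points = odds_ratio_for_change(stone_or, 2)
# 37-point gap between the reported median CROES scores (220 vs 183).
croes_37_points = odds_ratio_for_change(croes_or, 37)

# A Wald interval is symmetric around log(OR) on the log scale.
stone_se = se_from_ci(1.231, 1.631)
log_midpoint = (math.log(1.231) + math.log(1.631)) / 2
```

Under these assumptions, a CROES odds ratio close to 1 per point still corresponds to a sizable effect over the range of the nomogram, and the midpoint of the S.T.O.N.E. interval on the log scale sits at log(1.417), as expected for an interval of this type.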
The AUCs of the ROCs of the three stone scoring systems were 0.629, 0.671, and 0.646 for Guy's, S.T.O.N.E., and CROES score, while the stone size in mm2 has an AUC of 0.652 (Table 3 and Fig. 1). When comparing the AUCs of the different scoring systems and stone size, there were no statistical differences.
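As a rough illustration of how an AUC and its standard error can be obtained, the sketch below computes the AUC as a rank statistic and applies the Hanley and McNeil (1982) standard-error approximation. The group sizes (191 with residual stone, 395 stone free) are approximations derived from 67.4% of 586, the toy scores are hypothetical, and the z statistic ignores the correlation between scores measured on the same patients, so this is not a reproduction of the authors' comparison.

```python
import math


def auc_mann_whitney(scores_pos, scores_neg):
    """AUC as the probability that a positive case outscores a negative one.

    Ties contribute 0.5.
    """
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


def hanley_mcneil_se(auc, n_pos, n_neg):
    """Standard error of an AUC (Hanley and McNeil 1982 approximation)."""
    q1 = auc / (2.0 - auc)
    q2 = 2.0 * auc ** 2 / (1.0 + auc)
    variance = (
        auc * (1.0 - auc)
        + (n_pos - 1) * (q1 - auc ** 2)
        + (n_neg - 1) * (q2 - auc ** 2)
    ) / (n_pos * n_neg)
    return math.sqrt(variance)


def z_uncorrelated(auc_a, se_a, auc_b, se_b):
    """z statistic for the difference of two AUCs, ignoring correlation."""
    return (auc_a - auc_b) / math.sqrt(se_a ** 2 + se_b ** 2)


# Toy example: hypothetical scores for residual-stone (positive) and
# stone-free (negative) patients.
toy_auc = auc_mann_whitney([9, 8, 10, 7], [7, 6, 8, 5])

# Reported AUCs with approximate group sizes (assumption: 191 vs 395).
n_residual, n_stone_free = 191, 395
se_stone = hanley_mcneil_se(0.671, n_residual, n_stone_free)
se_guys = hanley_mcneil_se(0.629, n_residual, n_stone_free)
z = z_uncorrelated(0.671, se_stone, 0.629, se_guys)
```

Treating the two AUCs as uncorrelated is a simplification: scores derived from the same patients are positively correlated, and a paired comparison would account for that.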

Fig. 1. Receiver operating characteristic curves for the three scoring systems in predicting stone-free status.
AUC = area under the curve; CI = confidence interval.
After stratifying both the S.T.O.N.E. and CROES scores in four groups, we calculated the relative risk of residual stone after one PCNL surgery. The relative risks reported in Table 4 demonstrate the increased risk of not achieving stone-free status compared to the low risk group (e.g., patients who have a preoperative Guy's Grade 3 have a twofold higher risk of not becoming stone free with one surgery compared to a patient with a Grade 1).
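To make the notion of relative risk concrete, the sketch below computes it from a two-group table. With a single categorical predictor, a Poisson-type model for a binary outcome yields a point estimate equal to the ratio of the observed risks; the confidence interval here uses the standard log-scale (Katz) method as a simple stand-in for the robust variance estimator. All counts are hypothetical and are not taken from Table 4.

```python
import math


def relative_risk(events_group, n_group, events_ref, n_ref, z=1.96):
    """Relative risk versus a reference group, with a log-scale Wald interval."""
    risk_group = events_group / n_group
    risk_ref = events_ref / n_ref
    rr = risk_group / risk_ref
    # Standard error of log(RR) for two independent binomial samples.
    se_log_rr = math.sqrt(
        1 / events_group - 1 / n_group + 1 / events_ref - 1 / n_ref
    )
    lower = math.exp(math.log(rr) - z * se_log_rr)
    upper = math.exp(math.log(rr) + z * se_log_rr)
    return rr, lower, upper


# Hypothetical counts: residual stone in 60 of 150 higher-risk patients
# versus 30 of 150 patients in the reference (lowest-risk) group.
rr, ci_lower, ci_upper = relative_risk(60, 150, 30, 150)
```

In this hypothetical table the higher-risk group has twice the risk of residual stone, the same kind of statement as the twofold risk quoted above for a Guy's Grade 3 relative to a Grade 1.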
We had a total complication rate of 29.2% with only 3.4% complications of Clavien grade 3 or higher. None of the scoring systems were strong predictors of complications. GSS was the only predictor on univariable analysis of a longer hospital stay (β = 0.221 days, 95% CI: 0.032, 0.409, p = 0.022) with an increase of ∼5.3 hours per increase in Guy's Grade. On multivariable analysis, however, controlling for ASA, urinary tract abnormality, age, total stone burden, mean stone density, and complication, the only independent predictors of hospital stay in the equation were ASA, urinary tract abnormality, and presence of complication. Multivariable linear regression analysis identified only the S.T.O.N.E. score as an independent predictor of OT (β = 14.556, 95% CI: 12.453, 16.660, p < 0.001).
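The conversion from the LOS regression coefficient in days to the quoted figure in hours is simple unit arithmetic, sketched below with the reported values. The two-point S.T.O.N.E. example for OT is an illustrative extrapolation that assumes a linear effect and that the coefficient is expressed in minutes per point.

```python
HOURS_PER_DAY = 24

# Reported univariable coefficient for GSS on length of stay, in days.
beta_los_days = 0.221
ci_los_days = (0.032, 0.409)

beta_los_hours = beta_los_days * HOURS_PER_DAY
ci_los_hours = tuple(bound * HOURS_PER_DAY for bound in ci_los_days)

# Reported multivariable coefficient for S.T.O.N.E. on operative time
# (assumed to be minutes per one-point increase in score).
beta_ot_minutes = 14.556
extra_minutes_two_points = 2 * beta_ot_minutes
```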
Discussion
In recent years, the importance of systematic and standardized reporting of outcomes after various endourologic surgeries, including PCNL, has been emphasized. 8,9,14 Although Hyams et al. had previously highlighted a vast heterogeneity in reporting of preoperative variables in surgical management of kidney stones, consensus recommendations on standardized reporting of preoperative data have not yet been proposed. 10 Preoperative prognostic tools can be useful not only to stratify patients in different risk groups but also as a means of standardized reporting of preoperative cohort data.
Currently, GSS, S.T.O.N.E. score, and the CROES nomogram score represent the three most commonly used prognostic tools for PCNL. 11–13 Although these systems were constructed independently and through different methodologies, they are all proposed to aid the surgeon in assessing case complexity to predict SFR and complication risks, while assisting in preoperative surgical planning.
Several important differences between the scoring systems analyzed here deserve mention. Although the S.T.O.N.E. score is entirely based on data obtainable from the preoperative CT, both the GSS and CROES score include patient variables. The GSS, however, does not include stone size, a strong predictor of success. All the scoring systems include a measure of stone complexity. Staghorn or partial staghorn stone formation is most often used to indicate the complexity of a stone and is included as a variable in the GSS and CROES nomogram. 11,13 The lack of consensus on definitions for these terms renders these scoring systems susceptible to score variations due to subjective interpretation of partial and full staghorn stone. This was identified by both Thomas and Ingimarsson. 11,18 The S.T.O.N.E. scoring system tried to eliminate this ill-defined feature by using the number of calices involved by stone as a surrogate for stone complexity. 12 This more objective measure of stone complexity could reduce scoring variations. In comparison to the GSS and CROES nomogram, however, the S.T.O.N.E. score does not include the number of stones or stone location, which have been shown to influence treatment success. 13,19,20 The CROES nomogram includes most of the variables that differed significantly between the stone-free cohort and the cohort with residual stone in our population and may therefore be a more complete scoring system; however, its large continuous scale and complexity in use limit its application in everyday practice.
External validation of these prediction models before widespread use in clinical practice is as important as the development of the scoring systems themselves. 21,22 The GSS has been validated on multiple occasions. 18,23–25 Ingimarsson et al. have shown a good interrater concordance for the GSS and interestingly pointed out that 56% of the discordant results were due to unclear definitions of abnormal renal anatomy and partial staghorn stone. 18 Assessment of interobserver reliability for the S.T.O.N.E. score showed good concordance between urology residents, fellows, and staff. 26 Stone size and number of calices involved seemed to be the most challenging variables to measure, with a slightly lower concordance. To date, the CROES nomogram has been externally validated on two occasions, showing fair predictive accuracy for SFR after PCNL. 25,27 Vergouwe et al. have demonstrated that a large enough sample size, including at least 100 events (in this case patients with residual stone), is required to adequately perform external validation. 28 This should be taken into account when interpreting results of validation studies performed on smaller patient cohorts. Our analysis represents the external validation with the largest sample to date.
The staghorn stone morphometry classification by Mishra et al. and the Seoul National University Renal Stone Complexity (S-ReSC) score by Jeong et al. were not included in this comparison. 29,30 The staghorn morphometry score is a model that aims to predict the number of tracts and stages needed to clear the stone burden in patients with staghorn renal calculi. A contrast-enhanced CT with urography phase, which entails a higher radiation dose for the patient, and specific CT scan volumetric assessment software are necessary to classify a stone as type 1 (single tract single stage), type 2 (single stage multiple tract or multiple tract single stage), or type 3 (multiple tract multiple stage). 29 The S-ReSC score is based on only one parameter, the number of sites in the collecting system involved by stone and appears to have a good predictive accuracy. 30 We could, however, not evaluate this score as these data were not collected as outlined in the article.
In our current analysis, we establish that all the evaluated scoring systems are equally accurate in predicting SFR after single PCNL surgery. This finding corroborates previously reported similarities in predictive accuracy of scoring systems in smaller cohorts. 25,31,32 Interestingly, none of the systems have significant added predictive accuracy over stone size alone as a predictor of SFR. When assessing each of the stone scores against a set of variables on multivariable analysis (controlling for stone size, mean HU, tract length, hydronephrosis, number of calices, stone location, age, and ASA), it is notable that stone location is retained in the model with S.T.O.N.E., the number of calices involved with CROES, and hydronephrosis with GSS. Although none of the scoring systems is more accurate in predicting SFR than stone size alone, stone size is not retained in a multivariable regression model containing any of the scoring systems, most likely due to collinearity.
By stratifying all patients into four groups within the Guy's, S.T.O.N.E., or CROES score, we could include a risk stratification with calculated relative risks for residual stone after the surgery compared to the reference, that is, the lowest risk group (Table 4). Although the relative risk differences between the risk groups of both S.T.O.N.E. and CROES score are quite similar, those differences seem to be smaller for the GSS. The risk of residual stone is not significantly different between a Guy's Grade 1 and Grade 2 and there appears to be only a small increase in risk of having residual stones with a Guy's Grade 4 compared to a Grade 3. This may indicate that there is a clearer distinction between risk groups when using the S.T.O.N.E. score or CROES nomogram than with the GSS.
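The quartile-based grouping used for the CROES nomogram can be sketched as follows. The scores are hypothetical, the cut points are whatever the sample quartiles happen to be (the study's actual cutoffs are not reproduced here), and the mapping of the lowest quartile to the highest-risk label reflects the direction of the reported odds ratio, where a higher CROES score corresponds to lower odds of residual stone.

```python
import statistics
from bisect import bisect_right

# Lowest CROES quartile first: a lower score implies a higher risk.
RISK_LABELS = ["very high risk", "high risk", "intermediate risk", "low risk"]


def quartile_groups(scores):
    """Return the three quartile cut points and a group index (0-3) per score."""
    cuts = statistics.quantiles(scores, n=4)
    # bisect_right counts how many cut points lie at or below each score.
    groups = [bisect_right(cuts, s) for s in scores]
    return cuts, groups


# Hypothetical CROES nomogram scores for twelve patients.
croes_scores = [120, 150, 165, 183, 190, 205, 220, 240, 260, 280, 300, 310]
cuts, groups = quartile_groups(croes_scores)
labels = [RISK_LABELS[g] for g in groups]
```

Each group can then be compared with the low-risk reference group to obtain a relative risk of residual stone.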
Although a higher S.T.O.N.E. score predicts a longer OT, none of the scoring systems can be considered a strong predictor for postoperative complications. Goyal et al. similarly reported that the GSS is not an independent predictor of postoperative complications in a pediatric population. 33 In the initial articles describing the GSS and S.T.O.N.E. score, the scores were not strong predictors of complications. 11,12 The CROES nomogram was initially not assessed for ability to predict postoperative complications. 13 In contrast to our findings, others did identify a correlation between GSS and complications in a prospective validation cohort. 23,24,34
In a comparison of the GSS and S.T.O.N.E. score, Noureldin et al. demonstrated both scoring systems to be independent predictors of longer OT. 31 Although the GSS was indeed a predictor of longer OT in our patient cohort on univariable analysis, adding ∼8 minutes per increase in score, it was not retained as an independent predictor in a multivariable linear regression model. Bozkurt et al., comparing the GSS to the CROES nomogram, reported that both scoring systems were predictive of estimated blood loss, OT, and overall complications. 32
In contrast to an earlier evaluation and comparison of the scoring systems on a smaller cohort, we could not identify any independent relation between the scoring systems and LOS in the current analysis. 25
We were unable to identify any score as being superior compared to the others with regard to predicting a stone-free outcome after PCNL. Although the S.T.O.N.E. score has no statistical benefit over the other scoring systems with regard to predicting SFR, it seems to be the easiest to obtain, relying purely on imaging characteristics without addition of other patient-related data and provides distinct risk stratification. This is, however, a subjective finding and should be more objectively substantiated, for instance by surveying a large group of practicing urologists about the use of the scoring systems. With the risk stratification as suggested in this article, patients can be classified as low risk, intermediate risk, high risk, and very high risk of residual stone disease after single PCNL for renal stone disease and counseled as such.
The main limitations of this study are its retrospective character and the data assessment by a single observer in each of the participating centers. We tried reducing these limitations of a multicenter retrospective study by communicating standardized outcome definitions and data collection methodology. In addition, interrater reliability has been shown to be fairly good for both the GSS and the S.T.O.N.E. score with κ = 0.72 and 0.75, respectively. 18,26 Although outcome definitions were standardized, postoperative imaging assessment of stone-free status was not. This is a limitation of the study and is due to its retrospective character. When comparing the patients who had a CT postoperatively to the patients who did not, the three models still perform similarly and not significantly different from each other within and between the cohorts (AUC for GSS, S.T.O.N.E. score, and CROES nomogram were 0.637, 0.643, and 0.649, respectively, for the CT cohort and 0.614, 0.688, and 0.653 for the non-CT cohort). Although the accuracy of plain radiography may be lower than CT for residual fragments after PCNL, 35 it was not within the scope of this project to identify and compare diagnostic differences between different imaging modalities for postoperative SFR assessment. In a prospective study, this as well would need to be standardized in all participating centers. One could argue that with AUCs below 0.7, not one of the scoring systems performs well in predicting stone-free status after a single PCNL surgery for renal stone disease. They are for now, at least, a step in the right direction toward a more uniform way of reporting preoperative variables in PCNL-related research. However, as the systematic use of any of the three scoring systems is dependent on surgeon's preference, this only partially solves the problem. Further research is needed to identify if one is superior to the others with regard to clinical usefulness and predictive accuracy.
Conclusions
The three evaluated scoring systems for predicting outcomes after PCNL are all equally predictive of stone-free status after a single PCNL surgery. Patients can be stratified in a low, intermediate, high, or very high-risk group with their associated relative risk for residual stone. We have outlined the differences between the scoring systems and would argue that the S.T.O.N.E. score would be the easiest to obtain, while the CROES nomogram may be a more complete assessment of patient and stone complexity. All scores seem to be clinically useful. Further research is needed to identify whether or not any is superior to the others with regard to clinical usefulness and predictive accuracy.
Footnotes
Author Disclosure Statement
No competing financial interests exist.
