Interobserver Reliability and Reproducibility of S.T.O.N.E. Nephrolithometry for Renal Calculi

Abstract

Purpose:

To assess the reliability of the S.T.O.N.E. (stone size [S], tract length [T], obstruction [O], number of involved calices [N], and essence or stone density [E]) nephrolithometry scoring system by testing its reproducibility between different observers.

Patients and Methods:

Preoperative images of 58 patients who underwent percutaneous nephrolithotomy (PCNL) were reviewed. Medical students, urology residents, one fellow, and a urology attending independently reviewed all images and scored the renal stones. Interobserver reliabilities of the total score for all categories and each component were evaluated by the intraclass correlation (ICC) and a κ coefficient.

Results:

With the two medical students excluded, the ICC showed high agreement for all components and the total score (ICC for S, T, O, N, E, and total score: 0.80, 0.97, 0.89, 0.84, 0.91, and 0.87, respectively). κ values between the two medical students were 0.36, 1, 0.31, 0.45, 0.33, and 0.30 for the S, T, O, N, E components and total score, respectively. κ values between the two urology residents were 0.71, 1, 0.92, 0.79, 0.93, and 0.67 for the S, T, O, N, E components and total score, respectively. κ values between the urology fellow and the attending physician were 0.95, 1, 0.88, 0.94, 0.89, and 0.87 for the S, T, O, N, E components and total score, respectively. P values for all the scoring components were <0.05, indicating that the estimated κ was not a result of chance.

Conclusions:

The S.T.O.N.E. nephrolithometry has excellent interobserver reliability. Quantifying the S and N metrics was the most challenging and least reliable. Standardized protocols to measure these components should be considered to improve accuracy and reproducibility of the scoring system.

Introduction

Percutaneous nephrolithotomy (PCNL) remains the treatment of choice for patients with large and complex kidney stones.¹ Stone burden and distribution, caliceal complexity, and degree of hydronephrosis have been demonstrated to play an important role in the outcomes of PCNL.^2–4 These variables can be obtained accurately from preoperative CT imaging and effectively used for diagnosis, patient counseling, and surgical planning.^5,6

The S.T.O.N.E. (stone size [S], tract length [T], obstruction [O], number of involved calices [N], and essence or stone density [E]) nephrolithometry scoring system was developed recently to help standardize academic reporting and to predict outcomes of PCNL.⁷ The scoring system integrates five components measured from preoperative CT images and quantitatively characterizes stone status to provide an overall picture of the complexity of the surgical procedure.

Most measurements obtained from CT imaging involve some degree of measurement error and variability between observers with different levels of expertise and knowledge. Because measurement errors and variations can substantially affect the interpretation of a scoring system and statistical analysis, it is important to quantitate such errors by evaluating a reliability index. As such, we assess the reliability of the S.T.O.N.E. nephrolithometry scoring system by testing its reproducibility between different observers in a cohort of patients who underwent PCNL.

Patients and Methods

Patient identification

After Institutional Review Board approval, we reviewed a retrospective kidney stone database of patients who underwent PCNL. Patients with available preoperative noncontrast CT imaging were randomly selected and images were reviewed. For the purpose of this study, patients with multiple and/or bilateral stones, kidney anomalies, and/or a history of surgical intervention on the ipsilateral side were excluded.

Image review

All images were independently reviewed and S.T.O.N.E. nephrolithometry scores were determined by six observers. Observers included two medical students (MS3), two urology residents (PGY3), a urology fellow, and a urology attending. All participants were given a standardized 10- to 15-minute instruction session on the scoring system before CT measurements.

Interobserver agreement measurement

Interobserver agreement was assessed using the intraclass correlation (ICC),⁸ both among all six raters and among the four raters excluding the medical students. The κ statistic was then used to assess agreement separately within each pair of observers with a similar level of training: the two medical students, the two residents, and the fellow and urology attending. Interobserver variation was assessed for each component of the scoring system as well as for the total score. The cutoffs for both ICC and κ used in this study were: 0.0–0.2, slight agreement; 0.21–0.4, fair agreement; 0.41–0.6, moderate agreement; 0.61–0.8, substantial agreement; 0.81–1.00, almost perfect agreement. The κ and ICC statistics were calculated along with their 95% confidence intervals and P values testing whether each value differed from 0 (ie, whether agreement differed from that expected by chance alone). A P value <0.05 was considered to indicate agreement beyond that expected by chance.
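As an illustration of the pairwise analysis described above, the following minimal Python sketch computes a κ for two raters and maps the result to the agreement categories listed in this section. It assumes an unweighted Cohen's κ (the text does not state whether a weighted form was used), and the function names and example scores are hypothetical, not study data.

```python
from collections import Counter

def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters scoring the same cases."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty set of cases")
    n = len(rater_a)
    # Observed agreement: fraction of cases given identical scores.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal proportions, summed over categories.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_chance = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if p_chance == 1.0:
        # Both raters used a single identical category; kappa is formally
        # undefined here, and returning 1.0 is a convention chosen for this sketch.
        return 1.0
    return (p_observed - p_chance) / (1 - p_chance)

def agreement_category(value):
    """Map a kappa or ICC value to the labels listed in the Methods.

    Values below 0 are grouped with "slight" (an assumption; the listed
    cutoffs start at 0.0).
    """
    if value <= 0.2:
        return "slight"
    if value <= 0.4:
        return "fair"
    if value <= 0.6:
        return "moderate"
    if value <= 0.8:
        return "substantial"
    return "almost perfect"

# Hypothetical two-level scores for 10 cases from two raters.
scores_a = [1, 1, 1, 1, 1, 1, 2, 2, 2, 2]
scores_b = [1, 1, 1, 1, 1, 2, 2, 2, 2, 1]
kappa = cohen_kappa(scores_a, scores_b)
print(f"kappa = {kappa:.2f} ({agreement_category(kappa)} agreement)")
```

In this example the raters agree on 8 of 10 cases, but half of that agreement is expected by chance given their marginal score frequencies, which is why κ is lower than the raw agreement rate.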

Results

Noncontrast CT images of a total of 70 patients who underwent PCNL were reviewed. Of these, 58 met the inclusion criteria.

Table 1 shows the κ coefficient of concordance between pairs of raters with similar levels of training for all five components of the scoring system and the total score. κ values between the two medical students were relatively poor: 0.36, 1, 0.31, 0.45, 0.33, and 0.30 for the S, T, O, N, E components and total score, respectively. All individual components had poor agreement except the T component.

Table 1.

κ Test Results for Pairs of Raters with Similar Level of Training

	Medical students	Residents	Fellow/attending	P values
S	0.36	0.71	0.95	<0.001
T	1.0	1.0	1.0	<0.001
O	0.31	0.92	0.88	<0.001
N	0.45	0.79	0.94	<0.001
E	0.33	0.93	0.89	<0.001
Total score	0.30	0.67	0.87	<0.001

S=stone size; T=tract length; O=obstruction; N=number of involved calices; E=essence or stone density.

κ values between the two urology residents were 0.71, 1, 0.92, 0.79, 0.93, and 0.67 for S, T, O, N, E components and total score, respectively. κ correlations again were highest for the T component followed by E, O, N, S and total score.

κ values between the urology fellow and an attending physician were 0.95, 1, 0.88, 0.94, 0.89, and 0.87 for S, T, O, N, E components and total score, respectively. This demonstrated very strong concordance for all the scoring system components.

The ICCs among all six raters for S, T, O, N, E and total score were relatively low: 0.63, 0.94, 0.48, 0.64, 0.49, and 0.75, respectively. When the raters with the least expertise (the two medical students) were excluded from analysis, however, the ICCs were notably higher, with values of 0.81, 0.98, 0.90, 0.84, 0.91, and 0.87 for the five components and the total score, respectively. Furthermore, the lower 95% confidence limits for the ICCs were 0.73, 0.97, 0.85, 0.78, 0.87, and 0.82, respectively, showing strong agreement between observers. P values for the κ values and ICCs for all the scoring components were <0.001, indicating that agreement was not the result of chance (Table 2). Overall, grading of the tract length (T) component of the scoring system had the highest coefficient, followed by the essence (E), degree of obstruction (O), number of calices (N), and stone size (S) components. Components N and S were the most challenging measurements, with the greatest interobserver variability.

Table 2.

Intraclass Correlation as Measure of Agreement Between All Six Raters and Four Raters Excluding Medical Students

	All raters	P values	Raters excluding MS	P values
S	0.63 (0.53–0.73)^*	<0.001	0.80 (0.73–0.83)^*	<0.001
T	0.94 (0.92–0.96)	<0.001	0.97 (0.96–0.99)	<0.001
O	0.48 (0.37–0.60)	<0.001	0.89 (0.84–0.94)	<0.001
N	0.64 (0.53–0.74)	<0.001	0.84 (0.77–0.86)	<0.001
E	0.49 (0.34–0.63)	<0.001	0.91 (0.87–0.94)	<0.001
Total score	0.75 (0.66–0.82)	<0.001	0.87 (0.81–0.91)	<0.001

*Values in parentheses are 95% confidence intervals.

MS=medical students; S=stone size; T=tract length; O=obstruction; N=number of involved calices; E=essence or stone density.
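For readers who wish to reproduce an analysis of this kind, the sketch below computes a single-rater intraclass correlation from a complete patients-by-raters table. It assumes the two-way random-effects form, ICC(2,1) in the notation of Shrout and Fleiss;⁸ the text does not state which ICC form was used, so this choice, the function name, and the example ratings are assumptions for illustration only.

```python
def icc_two_way_random(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    `ratings` is a list of rows, one row per patient and one column per rater.
    """
    n = len(ratings)        # number of patients
    k = len(ratings[0])     # number of raters
    if n < 2 or k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("need a complete n x k table with n >= 2 and k >= 2")
    grand_mean = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]
    # Partition the total sum of squares into patient, rater, and residual parts.
    ss_total = sum((x - grand_mean) ** 2 for row in ratings for x in row)
    ss_rows = k * sum((m - grand_mean) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand_mean) ** 2 for m in col_means)
    ss_error = ss_total - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)                    # between-patient mean square
    ms_cols = ss_cols / (k - 1)                    # between-rater mean square
    ms_error = ss_error / ((n - 1) * (k - 1))      # residual mean square
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )

# Illustrative ratings (6 subjects x 4 raters); not data from this study.
example = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
print(f"ICC(2,1) = {icc_two_way_random(example):.2f}")
```

Because this form penalizes systematic differences between raters (the between-rater mean square appears in the denominator), a rater who consistently scores higher or lower than the others reduces the coefficient even when the ranking of patients is preserved.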

Discussion

Preoperative imaging is a critical step in establishing accurate diagnoses, as well as determining the optimal treatment modality and planning of the surgical intervention.⁹ CT has become the gold standard imaging modality for urolithiasis allowing for efficient, reproducible imaging with a high spatial resolution.^10,11 Stone characteristics obtained from CT can be integrated into a grading system to provide a picture of the stone's complexity and, as such, the potential procedure difficulty. This grading system would also allow for more efficient workup of patients with urinary tract stones. In addition, S.T.O.N.E. nephrolithometry will allow for quantitative and standardized assessment of PCNL outcomes across different series.⁷ In this study, we evaluated the interobserver reliability of this scoring system.

The stone size (S) component was measured as an area determined by the two longest orthogonal dimensions on the axial CT images. For a smooth ovoid or rounded stone, these dimensions could be reliably determined without substantial discordance between users. Many stones, however, do not retain such regular shapes as they increase in size. Larger stones, including staghorn calculi and those that are targets for lithotripsy, tend to conform to the renal collecting system and develop protrusions or excrescences, which result in irregular, less geometric configurations. This irregularity resulted in the greatest variation and lowest concordance rates between all raters and between pairs of observers with similar levels of expertise. Furthermore, the shape and distribution of the stone often vary across consecutive axial CT images, such that a stone may lie within a calix with relative sparing of the pelvis on the most superior image and progress to lie entirely within the pelvis with sparing of the calices on the most inferior image.

This irregularity in stone configuration leads to two major variations in measurement between users as they decide where best to measure the stone size. First, the rater must decide which of several axial CT images to use for assessing stone dimensions. Second, the rater must determine which planes best approximate the largest dimensions on axial imaging, whether these are true anteroposterior and transverse planes or two obliquely oriented orthogonal planes. In addition, a rater may decide to use one plane of measurement on a particular axial image and measure the orthogonal plane on a completely different axial image. These sources of variation need to be minimized, and strict measurement protocols must be used, to improve the reproducibility of stone size (S) as a component of the scoring system. This is particularly important because multiplying the two dimensions to obtain an area amplifies small measurement variations.
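The amplification described above can be illustrated with a short Python sketch. It assumes only that the area is the product of the two orthogonal dimensions, as stated in the text; the function names and the example stone dimensions are hypothetical.

```python
def stone_area_mm2(length_mm, width_mm):
    """Area estimate from two orthogonal axial dimensions (in mm)."""
    if length_mm <= 0 or width_mm <= 0:
        raise ValueError("dimensions must be positive")
    return length_mm * width_mm

def relative_area_change(length_mm, width_mm, delta_mm):
    """Fractional change in area when both dimensions are read delta_mm larger."""
    base = stone_area_mm2(length_mm, width_mm)
    shifted = stone_area_mm2(length_mm + delta_mm, width_mm + delta_mm)
    return (shifted - base) / base

# Hypothetical 40 mm x 20 mm stone; a second observer reads each dimension 2 mm larger.
change = relative_area_change(40.0, 20.0, 2.0)
print(f"area differs by {change:.1%}")  # about 15.5%
```

Here the two readings differ by 5% and 10% in the individual dimensions, yet the computed areas differ by roughly the sum of the two, plus a small cross term.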

Because of its simplicity, the tract length (T) measurement had the highest agreement among all observers across all comparisons. Tract length was measured between the skin surface and the stone targeted for treatment. Because there could be multiple stones with varying distributions within the kidney, we placed the first point for measurement at the most distal or central stone farthest from the skin, which would potentially be the most difficult target for percutaneous lithotripsy. Various body morphologies and the numerous points on the skin surface where observers could initiate percutaneous access and measure the tract length made a standardized access path necessary. We established a 45-degree angle from horizontal on axial images as the standard for extending the line to the skin surface, which served as the second point for measuring the tract length. The tract length component was subdivided into only two scores (≤100 mm and >100 mm), which further reduced the potential variation in scoring between observers. As such, the highest interobserver reliability was achieved in this component by standardizing the protocol for tract length measurement and limiting the potential scores assigned.

Recognizing the appearance of hydronephrosis on CT imaging was the key to scoring the third component, obstruction, and relied heavily on the user's knowledge of renal cross-sectional anatomy on CT images. We aimed to minimize the subjectivity of this measurement by allowing for only two scores in assessing obstruction: none or mild hydronephrosis vs moderate or severe hydronephrosis. This still leaves potential for variation in scoring, however, in cases of localized obstruction. A stone that is obstructing upper pole calices may result in severe hydronephrosis in the superior part of the kidney with the remaining calices and renal pelvis left intact. One user may score this localized obstruction as severe, while another may determine that there is only mild overall obstruction because the majority of the kidney is unaffected.

Determining the number of calices involved is also highly dependent on the user's understanding of renal cross-sectional anatomy.⁶ This component resulted in the second highest variability between the raters. This may be because of the inherent limitations of noncontrast CT imaging, including diminished delineation between adjacent structures of similar densities, as occurs at the medullary-caliceal interface. Scoring can be made even more challenging by the presence of compound calices and numerous variations in caliceal configuration between kidneys and across the upper, mid, and lower zones of the same kidney.^2,12 As such, users with different levels of expertise and background knowledge in urinary system CT imaging may score involvement of two compound calices (each composed of two smaller calices) as either a “1” (1–2 calices, assuming each compound calix is a solitary calix) or higher (if each smaller calix is assumed to be a solitary calix). Depending on whether or not the caliceal involvement is adjacent and contiguous, this may lead to further upscoring of stones as staghorn calculi. In addition, users may rely to different extents on axial and coronal imaging, which can lead to missed caliceal involvement if adjacent calices are incorrectly stacked and perceived as one when scrolling between consecutive images. Standardizing the definition of a calix and the imaging plane used to enumerate caliceal involvement can help improve interobserver reliability in the measurement of this component.

We scored stone essence (E) by determining the density on axial CT imaging, measured as the average Hounsfield unit (HU) value within a circular region of interest (ROI) drawn to surround as much of the stone as possible while minimizing the inclusion of adjacent soft tissue and kidney elements. As with stone size, this measurement becomes challenging when the stone configuration is irregular and does not conform to a circular ROI, which is more often a problem with larger and staghorn calculi. Soft tissue or fluid density around the stone may be included in the measurement, leading to underestimation of stone density.

Many of these larger and staghorn calculi are lamellated, which leads to varying densities from the center to the periphery of the stone. Selecting an ROI that only included the calcified stone and eliminated external noncalcified densities could lead to users measuring different parts of a lamellated stone, which would result in large density variations depending on the size, shape, and position of the ROI. By standardizing the ROI as described above, we aimed to eliminate this significant variation while accepting the less substantial underestimation of stone density inherent in using a fixed circular ROI that surrounded the stone while including some adjacent noncalcified densities. This did not eliminate the subjectivity between users, however, in regard to selection of appropriate axial CT images and stone slice for density measurements. By using only two subdivisions for scoring (≤950 and >950 HU), we aimed to minimize the effect of this subjectivity between users on the overall scoring reproducibility, because theoretically, only stones with densities falling immediately around the 950 HU threshold would be potentially upscored or downscored as a result of interobserver variation.
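The effect of a two-level cutoff on reproducibility can be sketched as follows. The thresholds (100 mm for tract length and 950 HU for essence) are taken from the text; the assignment of 1 point at or below the threshold and 2 points above it, the function names, and the example readings are assumptions made for illustration.

```python
TRACT_THRESHOLD_MM = 100.0    # from the text: <=100 mm vs >100 mm
ESSENCE_THRESHOLD_HU = 950.0  # from the text: <=950 HU vs >950 HU

def binary_score(measurement, threshold):
    """Return 1 at or below the threshold, 2 above it (point values assumed)."""
    return 1 if measurement <= threshold else 2

def scores_differ(reading_a, reading_b, threshold):
    """True when two observers' readings fall on opposite sides of the threshold."""
    return binary_score(reading_a, threshold) != binary_score(reading_b, threshold)

# Hypothetical density readings (HU) of the same stone by two observers.
print(scores_differ(1180.0, 1320.0, ESSENCE_THRESHOLD_HU))  # both above 950 HU
print(scores_differ(930.0, 975.0, ESSENCE_THRESHOLD_HU))    # readings straddle 950 HU
```

A 140 HU difference between observers leaves the score unchanged when both readings lie well above the cutoff, whereas a 45 HU difference changes the score when the readings straddle it.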

The preoperative studies performed in our series were obtained without the use of intravenous contrast, as previously mentioned, which inherently limits the imaging contrast between structures of similar density. While noncontrast images may not pose a significant hindrance to the detection of high density calcified stones on a background of soft tissue and water densities in the remainder of the kidney, they can make it difficult to resolve more subtle anatomic distinctions such as medullary-caliceal interfaces. Furthermore, high density elements such as calcified stones often exhibit a “blooming” artifact, which can obscure the structure's borders, leading to an overestimation of its size and blurring of adjacent tissues. For the observer with a higher level of expertise and knowledge of renal and cross-sectional imaging anatomy (eg, urology attendings and fellows), these issues are not likely to result in significant errors of measurement. For the less experienced observer (eg, junior urology residents or medical students), however, the diminished imaging contrast and the artifact produced by high density stones can lead to more substantial errors in measurement and in application of the S.T.O.N.E. scoring system. The reproducibility of measurements between users of differing experience levels in each of the five components is affected by the global factors discussed above in addition to more specific limiting factors within each scoring category.

With increasing attention to radiation dose, low-dose CT imaging is being actively integrated in the diagnosis and management of urolithiasis.¹³ With the growing application of reduced radiation dose techniques in noncontrast urologic CT imaging, it would be beneficial to see how this scoring system holds up in the face of diminished imaging resolution that accompanies reduced radiation dose protocols.

Conclusions

We demonstrated excellent interobserver reliability of the S.T.O.N.E. nephrolithometry scoring system, particularly in the components of tract length (T), obstruction (O), and stone essence (E), followed by stone size (S). Degree of training and level of expertise with CT imaging of the genitourinary system play an important role in the accurate grading and assessment of stone complexity. Standardized protocols for measurement of stone size and number of calices involved, by the methods discussed in this article, will lead to further improvement in the reproducibility of this scoring system.

Disclosure Statement

No competing financial interests exist.

References

1. Preminger, Assimos, Lingeman, et al. Chapter 1: AUA guideline on management of staghorn calculi: Diagnosis and treatment recommendations. J Urol, 2005; 173:1991–2000.

2. Binbay, Akman, Ozgor, et al. Does pelvicaliceal system anatomy affect success of percutaneous nephrolithotomy? Urology, 2011; 78:733–737.

3. de la Rosette, Assimos, Desai, et al. The Clinical Research Office of the Endourological Society Percutaneous Nephrolithotomy Global Study: Indications, complications, and outcomes in 5803 patients. J Endourol, 2011; 25:11–17.

4. Bagrodia, Gupta, Raman, et al. Predictors of cost and clinical outcomes of percutaneous nephrostolithotomy. J Urol, 2009; 182:586–590.

5. Patel, Walkden, Ghani, Anson. Three-dimensional CT pyelography for planning of percutaneous nephrostolithotomy: Accuracy of stone measurement, stone depiction and pelvicalyceal reconstruction. Eur Radiol, 2009; 19:1280–1288.

6. Thiruchelvam, Mostafid, Ubhayakar. Planning percutaneous nephrolithotomy using multidetector computed tomography urography, multiplanar reconstruction and three-dimensional reformatting. BJU Int, 2005; 95:1280–1284.

7. Okhunov, Friedlander, George, et al. S.T.O.N.E. Nephrolithometry: Novel surgical classification system for kidney calculi. Urology, 2013; Epub ahead of print.

8. Shrout, Fleiss. Intraclass correlations: Uses in assessing rater reliability. Psychol Bull, 1979; 86:420–428.

9. Magrill, Patel, Anson. Impact of imaging in urolithiasis treatment planning. Curr Opin Urol, 2013; 23:158–163.

10. Fulgham, Assimos, Pearle, Preminger. Clinical effectiveness protocols for imaging in the management of ureteral calculous disease: AUA technology assessment. J Urol, 2013; 189:1203–1213.

11. Lipkin, Preminger. Imaging techniques for stone disease and methods for reducing radiation exposure. Urol Clin North Am, 2013; 40:47–57.

12. Stunell, McNeill, Browne, et al. The imaging appearances of calyceal diverticula complicated by uroliathasis. Br J Radiol, 2010; 83:888–894.

13. Kulkarni, Uppot, Eisner, Sahani. Radiation dose reduction at multidetector CT with adaptive statistical iterative reconstruction for evaluation of urolithiasis: how low can we go? Radiology, 2012; 265:158–166.