Crowdsourcing Evaluation of Ureteroscopic Videos Using the Post-Ureteroscopic Lesion Scale to Assess Ureteral Injury

Abstract

Introduction and Objectives:

We hypothesized that crowdsourcing assessments could be applied to the Postureteroscopic Lesion Scale (PULS) for ureteral injury.

Methods:

At a single institution, we prospectively recorded 14 ureters digitally at the end of standard ureteroscopic procedures. Each recording was reviewed by 10 global experts to determine a mean PULS score. Following training, the Crowd-Sourced Assessment of Technical Skills, C-SATS® (C-SATS, Inc., Seattle, WA) platform was used to obtain crowd-based reviews. The mean crowd PULS scores were determined using a linear mixed-effects (LME) model. The intraclass correlation coefficient (ICC) was calculated to measure the agreement among experts. Spearman's rank correlation (rho) was used to quantify the strength of the relationship between the crowd LME mean and the experts.

Results:

Ten experts' reviews and 2100 lay reviews were obtained in 21 days and 49 hours, respectively. The ICC for the 10 experts was 0.68 (95% confidence interval 0.49, 0.86). When the expert mean PULS was <1, the crowd scored those recordings at 1 or greater. The highest scored recording by the experts was a 3.2, which the crowd scored at 2.25. The correlation between the crowd LME means and expert means across all videos was 0.70 (p = 0.0056), indicating moderately strong agreement.

Conclusion:

In this initial application of crowd-sourced evaluation of ureteral injury, there was a moderately strong correlation between crowd and expert ratings. Refinement of the training, through exposure to the nuances of ureteral injuries, in particular for PULS <1 or ≥3, may lead to better crowd/expert correlation. Compared to expert review, crowd data can be collected with much greater efficiency.

Introduction

As medical technology advances, the methods of judging and assessing surgical skill have evolved too. In recent years, the novel idea of crowd-sourced reviewers has been applied to a broad range of applications in clinical medicine and research. Crowdsourcing platforms have been used to offer assessment of dry-lab surgical skill tasks and simulations,¹⁻⁶ robot-assisted procedures,⁷ and even analysis of bladder cancer via optical biopsy.⁸ Crowd-Sourced Assessment of Technical Skills, C-SATS® (C-SATS, Inc., Seattle, WA), is one such crowdsourcing platform that securely allows surgical recordings to be uploaded and rapidly reviewed by experts and/or an international group of prequalified lay reviewers.

There are many benefits to using crowd technology to assess surgical performance, the most notable of which is the crowd's ability to produce a large number of reviews in a short period of time. Currently, expert surgeons are used to rank and classify surgical skill sets; however, this effort is remarkably time-consuming and expensive, given the value of an hour of a faculty surgeon's time. To date, to the best of our knowledge, this approach has not been applied to the clinical assessment of ureteroscopic procedures.

Flexible ureteroscopy (URS) is the procedure recommended by both the American Urological Association and the Endourological Society guidelines for the treatment of larger ureteral and smaller lower pole kidney stones.⁹ To facilitate ureteroscopic stone extraction, a ureteral access sheath (UAS) may be placed. Careful selection and placement of a properly sized UAS is of utmost importance to avoid complications. Improper size selection and excessive force applied during the deployment of the UAS increase the risk of injury to the ureteral mucosa and/or deeper muscle layers. In one prospective study, 46.5% of patients who underwent URS with UAS for kidney stones developed iatrogenic ureteral injury.¹⁰ To classify injuries caused by the UAS during deployment, the Postureteroscopic Lesion Scale (PULS) was developed by a panel of expert ureteroscopists.¹¹ PULS was then used as a grading scale for the condition of the ureter after URS¹¹ (Appendix 1). Although the PULS grading scale is used to grade injury, it is similar in design to other grading scales used to rank and classify surgical skill in dry-lab testing environments. The Global Evaluative Assessment of Robotic Skills (GEARS)¹² is an example of a current grading scale used in dry-lab, robotic assessment tasks and has been shown to be a viable tool for crowd-workers to assess surgical skill.¹,³,⁷

With increasing evidence that C-SATS can be a valid alternative to expert evaluation, we hypothesized that crowdsourcing assessment could also be applied to URS using the PULS to train the lay public to assess ureteral injury in a manner similar to expert ureteroscopists.

Methods

After obtaining Institutional Review Board approval, eligible patients undergoing routine urolithiasis procedures at the University of California, Irvine (UCI) Medical Center were screened and consented for the study. The eligibility criteria for this study included nonpregnant adult patients undergoing URS alone or as part of an endoscopic-guided percutaneous nephrolithotomy (PCNL) who had a documented preoperative sterile urine culture. In addition, the following were obtained: history and physical examination, review of medications, noncontrast enhanced CT scan of the abdomen and pelvis, complete blood count, and a comprehensive metabolic panel.

At the end of the surgical procedure (URS or PCNL), the ureter was recorded via the flexible ureteroscope as the UAS was being removed. In our hospital, this is standard of care for our urologists to assess for ureteral wall injuries, bleeding, and/or ureteral stones. Of note, as per routine procedure at our institution, most patients were placed on tamsulosin (0.4 mg/day) 1 week before surgery, to possibly facilitate deployment of the larger 16F UAS.¹³

Among the 14 recorded ureteroscopies, there was a wide variety of conditions (Table 1): ureters with a prior stent vs ureters with a stent at the time of the procedure, and UAS deployed vs no UAS. In addition, there was no pretreatment with tamsulosin in patient 14, and smaller UAS sizes were used in patients 7 and 8 (11F rather than 16F). These 14 recordings, totaling 15 minutes of viewing time, were then distributed to 10 expert endourologists via a YouTube™ link with a reference to the grading system and images of corresponding grade, who then proceeded to rate them using the PULS scoring system. The experts were provided with no clinical information. The authors noted clear injuries in patients 2, 4, and 13 equivalent to a Grade 3 on the PULS grading system.

Table 1.

Relevant Clinical Information During Ureteral Access Sheath Deployment and Terminal Ureteroscopic Recording

Patient/Recording No.	Preprocedure stent (F)	Postprocedure stent (F)	1 Week preoperative tamsulosin (Yes/No)	UAS size (F)	UAS length (cm)	Injury (Yes/No)	Location of injury in ureter	Type of ureteroscope (Digital/Fiber Optic)
1	7/14 Endopyelotomy	None	Yes	None	None	No		Fiber optic
2	None	7/14 Endopyelotomy	Yes	16	35	Yes	Mid	Fiber optic
3	None	6	Yes	16	55	No		Digital
4	None	7/10 Endopyelotomy	Yes	16	55	Yes	Proximal	Fiber optic
5	None	6	Yes	16	35	No		Fiber optic
6	6	6	Yes	16	55	No		Digital
7	None	7/14 Endopyelotomy	Yes	11	55	No		Fiber optic
8	None	6	Yes	11	35	No		Digital
9	None	6	Yes	16	35	No		Fiber optic
10	None	6	Yes	16	45	No		Fiber optic
11	None	6	Yes	16	55	No		Digital
12	None	6	Yes	16	35	No		Fiber optic
13	None	6	Yes	16	35	Yes	Proximal	Digital
14	None	6	No	16	35	No		Digital

Patients 2, 4, and 13 had visible injuries on recordings.

UAS = ureteral access sheath.

The same recordings were then securely sent to “the crowd” via the C-SATS secure network in compliance with applicable standards and requirements of the Health Insurance Portability and Accountability Act (HIPAA) and the Health Information Technology for Economic and Clinical Health Act (HITECH). Crowd-sourced reviewers underwent training to familiarize themselves with ureteroscopic procedures (indications and possible injuries) and the PULS scoring system (Appendix 2). All reviewers were asked to assess the grade of ureteral wall injury following URS using the PULS grading system for all 14 recordings. They were blinded to both patient information and expert score for each ureter. Scores were collated and analyzed to assess how closely the crowd-sourced group matched the expert urologists in producing accurate PULS scores. The crowd mean PULS scores were determined using a linear mixed-effects (LME) model. The intraclass correlation coefficient (ICC) was also calculated as a measure of the agreement among experts. Spearman's rank correlation (rho) was used to quantify the strength of the relationship between the crowd LME mean and the experts: very strong agreement (>0.8), moderately strong agreement (0.6–0.8), fair agreement (0.3–0.5), and poor or no agreement (<0.3).

Results

Table 1 shows the relevant preoperative and intraoperative clinical information for each individual patient. The experts provided their reviews over a 21-day period. In all, 10 experts gave PULS scores for each recording, resulting in an ICC of 0.68 (95% confidence interval 0.49, 0.86). Each expert reviewer's individual scores for each recording are shown in Table 2. The largest range was seen for recordings 2 and 9. Recording 4 had a documented proximal ureteral injury that was not reflected in the expert rating.

Table 2.

Postureteroscopic Lesion Scale Scores for Expert Reviewers and the Crowd with Mean Scores, Ranges, and Confidence Intervals

	Expert reviewer
Video	A	B	C	D	E	F	G	H	I	J	Expert mean score	Expert 95% confidence interval	Median score	Range (Min)	Range (Max)	Crowd mean score	Crowd 95% confidence interval
1	2	1	1	0	2	0	1	1	2	1	1.10	0.57, 1.63	1	0	2	1.62	1.36, 1.88
2	3	4	3	3	5	3	3	2	3	3	3.20	2.64, 3.76	3	2	5	2.30	2.05, 2.55
3	1	1	1	1	1	1	1	0	1	0	0.80	0.50, 1.10	1	0	1	2.03	1.78, 2.28
4	0	1	0	0	1	0	1	1	0	1	0.50	0.12, 0.88	0.5	0	1	1.68	1.43, 1.93
5	1	1	1	1	2	1	1	2	2	1	1.30	0.95, 1.65	1	1	2	1.90	1.64, 2.16
6	0	0	0	0	0	0	0	0	1	0	0.10	0.00, 0.33	0	0	1	1.02	0.77, 1.26
7	1	1	1	0	2	1	1	1	1	1	1.00	0.66, 1.34	1	0	2	2.56	2.31, 2.82
8	0	1	0	0	1	1	0	1	1	1	0.60	0.23, 0.97	1	0	1	1.69	1.44, 1.93
9	1	1	1	1	1	1	0	1	3	2	1.20	0.64, 1.76	1	0	3	1.81	1.56, 2.06
10	0	1	0	1	2	0	1	1	1	1	0.80	0.35, 1.25	1	0	2	1.60	1.36, 1.85
11	0	0	0	0	1	0	0	0	0	0	0.10	0.00, 0.33	0	0	1	1.08	0.83, 1.32
12	1	2	1	0	2	1	0	1	2	1	1.10	0.57, 1.63	1	0	2	1.53	1.28, 1.77
13	3	2	2	2	3	2	2	2	3	3	2.40	2.03, 2.77	2	2	3	2.36	2.12, 2.60
14	0	1	0	0	1	0	0	0	1	0	0.30	0.00, 0.65	0	0	1	1.01	0.76, 1.26
Mean response	0.9	1.2	0.8	0.6	1.7	0.8	0.8	0.9	1.5	1.1	1.04					1.73
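The text does not state which form of the ICC was used. As an illustration only, the sketch below assumes a two-way random-effects, absolute-agreement, single-rater ICC, often written ICC(2,1), and applies it to the expert scores transcribed from Table 2. The function name `icc_2_1` and the choice of ICC form are assumptions, not the authors' stated method.

```python
# Expert PULS scores from Table 2: one row per video (1-14),
# one column per expert reviewer (A-J).
SCORES = [
    [2, 1, 1, 0, 2, 0, 1, 1, 2, 1],
    [3, 4, 3, 3, 5, 3, 3, 2, 3, 3],
    [1, 1, 1, 1, 1, 1, 1, 0, 1, 0],
    [0, 1, 0, 0, 1, 0, 1, 1, 0, 1],
    [1, 1, 1, 1, 2, 1, 1, 2, 2, 1],
    [0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
    [1, 1, 1, 0, 2, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1, 1, 0, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 0, 1, 3, 2],
    [0, 1, 0, 1, 2, 0, 1, 1, 1, 1],
    [0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
    [1, 2, 1, 0, 2, 1, 0, 1, 2, 1],
    [3, 2, 2, 2, 3, 2, 2, 2, 3, 3],
    [0, 1, 0, 0, 1, 0, 0, 0, 1, 0],
]


def icc_2_1(rows):
    """ICC(2,1) from a complete videos-by-raters table (assumed ICC form)."""
    n = len(rows)        # number of videos
    k = len(rows[0])     # number of raters
    grand = sum(sum(r) for r in rows) / (n * k)
    row_means = [sum(r) / k for r in rows]
    col_means = [sum(r[j] for r in rows) / n for j in range(k)]

    # Two-way ANOVA sums of squares without replication.
    ss_total = sum((x - grand) ** 2 for r in rows for x in r)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_err = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)               # between-video mean square
    msc = ss_cols / (k - 1)               # between-rater mean square
    mse = ss_err / ((n - 1) * (k - 1))    # residual mean square
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


icc = icc_2_1(SCORES)
print(f"ICC(2,1) = {icc:.2f}")
```

Under this assumed form, the value computed from Table 2 rounds to the reported 0.68. The confidence interval is not reproduced here.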

The C-SATS platform trained and received feedback from a total of 2128 crowd reviewers in 49 hours (28 reviewers were eliminated due to incompletion of the task). The total cost of compensation to the crowd was $1191.68, the mean of which was $0.56 per lay reviewer (Appendix 2).
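The per-reviewer cost can be checked with simple arithmetic. The sketch below also derives the working time implied by the $6 hourly rate quoted in Appendix 2. That implied time rests on the assumption that pay was strictly proportional to time worked, which the text does not confirm.

```python
TOTAL_PAID_USD = 1191.68   # total crowd compensation reported in Results
REVIEWERS = 2128           # trained crowd reviewers
HOURLY_RATE_USD = 6.0      # rate quoted in Appendix 2

# Mean compensation per reviewer.
mean_pay = TOTAL_PAID_USD / REVIEWERS

# Implied mean working time, ASSUMING pay is strictly proportional to time.
implied_minutes = mean_pay / HOURLY_RATE_USD * 60

print(f"mean pay per reviewer: ${mean_pay:.2f}")
print(f"implied time per reviewer: {implied_minutes:.1f} min")
```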

Figure 1 compares the experts' mean grade for each recording with the crowd-sourced grade determined by the LME model. When experts graded a recording <1 on the PULS, the crowd scored it at 1.01 to 2.03. The highest scored recording was a 3.2 (recording 2) by the experts; the crowd scored the same recording at 2.30 despite the presence of a clear injury (Fig. 2). In contrast, the highest crowd score was for recording 7 at 2.56, in which no injury was noted; the experts scored the same recording at 1.0. Interestingly, recording 4 with a documented ureteral injury (Fig. 3) was scored higher by the crowd (1.68) than by the experts (0.50). The third patient with a documented ureteral injury (recording 13) (Fig. 4) had a notably high level of agreement between the crowd and experts, scoring 2.36 and 2.40, respectively, on the PULS. On average, the crowd LME means were higher than the expert means (1.73 vs 1.04). The 95% confidence interval was narrower for the crowd in all recordings except recordings 6 and 11 (Table 2). Spearman's rank correlation was calculated to determine the strength of the relationship between the two groups. The correlation between the expert means and the crowd LME means was 0.70 (p = 0.0056), which is considered moderately strong agreement (Fig. 5).
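As a rough check on the reported correlation, the following sketch computes Spearman's rho from the expert and crowd mean scores in Table 2 and maps it onto the agreement categories given in the Methods. It is a plain-Python illustration, not the authors' analysis code. The helper names are hypothetical, tied scores are given average ranks, and the p-value is not computed.

```python
# Mean PULS scores per video (1-14), transcribed from Table 2.
EXPERT = [1.10, 3.20, 0.80, 0.50, 1.30, 0.10, 1.00,
          0.60, 1.20, 0.80, 0.10, 1.10, 2.40, 0.30]
CROWD = [1.62, 2.30, 2.03, 1.68, 1.90, 1.02, 2.56,
         1.69, 1.81, 1.60, 1.08, 1.53, 2.36, 1.01]


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for idx in order[i:j + 1]:
            ranks[idx] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def agreement_label(rho):
    """Agreement categories as listed in the Methods section."""
    if rho > 0.8:
        return "very strong"
    if 0.6 <= rho <= 0.8:
        return "moderately strong"
    if 0.3 <= rho <= 0.5:
        return "fair"
    if rho < 0.3:
        return "poor or none"
    return "not defined in Methods"  # values between 0.5 and 0.6


rho = spearman_rho(EXPERT, CROWD)
label = agreement_label(rho)
print(f"rho = {rho:.2f} ({label})")
```

With the Table 2 means, rho rounds to 0.70 and falls in the moderately strong band.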

FIG. 1.

Comparison of expert and crowd mean scores for each video, arranged by expert PULS score with the highest (Video 2) on the right. PULS = Postureteroscopic Lesion Scale.

FIG. 2.

(A) Normal proximal ureter with guidewire in patient 2. (B) Mid ureter showing grade 3 PULS injury with periureteral fat exposed in same patient.

FIG. 3.

(A) Normal distal ureter with guidewire in patient 4. (B) Proximal ureteral injury (white arrow) in same patient.

FIG. 4.

(A) PULS 3 acute injury (white arrow) in a patient after passing an access sheath. (B) Repeat ureteroscopy in same patient 6 weeks later showing complete healing of the injured site.

FIG. 5.

Spearman's rank correlation with crowd LME mean and expert PULS scores showing a moderately strong correlation. LME = linear mixed-effects.

Discussion

Previous studies used C-SATS successfully in a dry-lab training environment for novice surgeons, and subsequently in vivo in a pig to analyze the surgeon's technical skill, based on criteria such as robotic control, bimanual dexterity, and others.¹⁴ In one such study, 3938 crowd assessments of resident dry-lab task videos were received within 3.5 hours of submission of the videos to the C-SATS database, while it took eight experts an average of 22 days (6–34 days) to produce 150 reviews.¹ Crowdsourcing was also shown to be a relatively inexpensive analytical method, as each crowd-worker was paid an average of $0.44 per task video reviewed; the total cost of the crowd assessment was $85/resident or $2125 for the 25 resident applicants.¹ However, to our knowledge no study has tested C-SATS in a clinical setting where the crowd was asked to analyze the performance based on the appearance of the tissue at the termination of the procedure, in this case URS.

In this study, the crowd was able to generally provide the correct assessment compared to expert-graded mean scores when the injury was of a minor nature (i.e., PULS 1–3); however, there was a larger gap between the expert and crowd mean score for expert-graded scores under 1 and over 3. These errors may be due to suboptimal crowd training before assessment, quality of the recordings, or a lack of diligence on the part of our expert reviewers given the discrepancy for recording 4 where indeed there was an injury (Fig. 3). This injury could be seen at the end of the recording and thus would have been missed unless an individual viewed the recording in its entirety.

Our study showed that there was a moderately strong correlation between the crowd-sourced and expert reviewers. When looking at lower grade injuries scored by experts, the crowd tended to score them higher. This can be observed in patient recordings 6, 11, and 14 (Fig. 1). Similarly, higher grade injuries scored by experts were scored lower by the crowd, which can be seen in patient recordings 2 and 13. Interestingly, the experts themselves were not in agreement with each other, as the recording with the most obvious injury (recording 2) had a range between 2 and 5 (mean 3.20). The authors judged this recording as an example of a PULS 3 injury. In contrast, recording 4 had a proximal ureteral injury that was scored 1.68 by the crowd but 0.50 by the experts. We are at a loss to explain this lapse but hypothesize that the injury was too subtle or partly obscured by the guidewire, or that the recording was viewed too quickly to assess the injury, which appeared at the very end of the recording (Fig. 3). Also, there was a large spike in the crowd-sourced review of recording 7, roughly 1.5 grades higher on average than the experts' reviews (2.56 vs 1.00). In this recording, there were several blood clots floating in the field of view despite absence of injury to the ureteral wall; the crowd, viewing the blood clots, may have erroneously upgraded the injury. In addition, this recording was taken with a fiberoptic ureteroscope (as was recording 4) rather than the digital ureteroscope used for six of the recordings (Table 1); the relatively poorer image quality could have led to a less accurate review by a novice reviewer. With this possibility in mind, we performed a supplemental analysis excluding recording 7, which increased Spearman's rho from 0.70 to 0.79 (p = 0.0014), an improvement in the agreement between the two data sets suggesting that recording 7 was likely an outlier. This finding has made it clear to the authors that further training of the crowd to ignore the presence of blood clots in the absence of ureteral wall injury is necessary. It also points out the importance of providing recordings of superior quality, which are better obtained with the newer digital ureteroscope that eliminates the haziness of the fiberoptic “screen door” image.
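The supplemental analysis can be illustrated by recomputing Spearman's rho from the Table 2 mean scores with and without recording 7. This is a minimal sketch with hypothetical helper names, using average ranks for ties; it does not reproduce the reported p-values.

```python
# Mean PULS scores per video (1-14), transcribed from Table 2.
EXPERT = [1.10, 3.20, 0.80, 0.50, 1.30, 0.10, 1.00,
          0.60, 1.20, 0.80, 0.10, 1.10, 2.40, 0.30]
CROWD = [1.62, 2.30, 2.03, 1.68, 1.90, 1.02, 2.56,
         1.69, 1.81, 1.60, 1.08, 1.53, 2.36, 1.01]


def average_ranks(values):
    """Rank from 1..n; a tied value gets the mean of the ranks it spans."""
    ranks = []
    for v in values:
        below = sum(1 for w in values if w < v)
        equal = sum(1 for w in values if w == v)
        ranks.append(below + (equal + 1) / 2)
    return ranks


def spearman_rho(x, y):
    """Pearson correlation of the average ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    return sxy / (sxx * syy) ** 0.5


def rho_excluding(video_number):
    """Spearman's rho with one video (1-based number) left out."""
    keep = [i for i in range(len(EXPERT)) if i != video_number - 1]
    return spearman_rho([EXPERT[i] for i in keep], [CROWD[i] for i in keep])


rho_all = spearman_rho(EXPERT, CROWD)
rho_without_7 = rho_excluding(7)
print(f"all 14 recordings:     rho = {rho_all:.2f}")
print(f"excluding recording 7: rho = {rho_without_7:.2f}")
```

The same `rho_excluding` helper could be looped over all 14 recordings to see how sensitive the correlation is to any single video.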

Crowdsourcing to date has been most commonly used among teaching medical institutions for the assessment of robotic surgical skills among residents and practicing surgeons. One such bench study conducted in 2016 by Polin and colleagues used a robotic assessment checklist known as the robotic-objective structured assessments of technical skills (R-OSATS) and videos from 60 robotic bench-top laboratory drills.⁴ In all, 448 crowd reviewers produced 2517 R-OSATS assessments within 16 hours.⁴ According to this research, it was determined that crowdsourcing is a rapid and viable alternative to expert surgeon evaluation for robotic surgery; this was corroborated by other institutions who posted similar results using C-SATS.⁴ It is worth noting that crowd reviewers are able to make specific comments on each recording just like an expert would in the same situation, which further extends the usefulness and practicality of C-SATS as a collaborative teaching, instructional, and auditing tool for novice surgeons.⁴

There are many benefits to using C-SATS over traditional expert assessments, the most important of which is the time saved in obtaining results. In the present study, 2100 valid crowd reviewers were recruited and completed the questionnaire within 49 hours, whereas the expert reviews took 21 days to complete. Other studies using C-SATS have yielded similar results. Holst and colleagues were able to receive 50 crowd-worker assessments for each of 5 recordings (250 total reviews) within an average of 2 hours and 50 minutes vs 3 expert surgeon assessments on the same 5 recordings, which took more than 10 times as long.⁵ That study also concluded that C-SATS was sufficient to replace expert analysis, as an R² value of 0.93 was achieved on a linear regression plot comparing crowd vs surgeon assessments.⁵ Our study demonstrated that C-SATS may be used for identification and classification of ureteral injury; this could be applied during residency training to compare a resident's perceived level of ureteral injury to that of the crowd and an in-house expert. Moreover, the PULS grading system may be used to develop a registry of ureteral injury and be employed to potentially determine whether a stent needs to be placed after a ureteroscopic procedure. The latter, however, needs further study and corroboration.

The main weakness of the current study rests in the less than optimal training of the crowd reviewers, which led to some wide variations in ratings. Also, there was a wide variation among the experts that could well mean that the selected recordings were suboptimal; certainly, the fiberoptic recording should have been eliminated given the broad scores it engendered. For these reasons and possibly others that we have failed to discern, we did not achieve a strong level of agreement (>0.8 Spearman's rho value) between the crowd and the expert reviews. It is apparent to us that retraining with more videos of injured vs normal ureters is needed to better prepare the lay reviewers before their analysis. Certainly, before acceptance of crowd-sourced assessments of URS, a correlation with expert reviewers in excess of 0.80 and preferably 0.90 needs to be obtained.

Conclusion

Using the PULS grading system, C-SATS showed a moderately strong correlation between crowd and expert reviewers viewing 14 ureteroscopic recordings of varying levels of ureteral wall injury. Refinement of the crowd's training protocol may improve accuracy, specifically, for PULS grades <1 or ≥3. Compared to expert review, crowd-sourced data can be collected with much greater efficiency.

Footnotes

Acknowledgments

The authors would like to thank the expert endourologists who dedicated their time to make this study possible: Dr. Demetrius Bagley, Dr. Duane Baldwin, Dr. Ben Chew, Dr. John Denstedt, Dr. Michael Grasso, Dr. Brian Matlaga, Dr. Manoj Monga, Dr. Margaret Pearle, Dr. Roger Sur, and Dr. Olivier Traxer.

Author Disclosure Statement

The study was self-funded through the research accounts of the Department of Urology at UC Irvine.

Appendix 1.

The Postureteroscopic Lesion Scale as Described by Schoenthaler and Colleagues ¹¹

Grade	Ureteral injury description	Overall classification of surgery
0	No lesion	Uncomplicated URS (No grading according to the Dindo-modified Clavien classification of surgical complications)
1	Superficial mucosal lesion and/or significant mucosal edema/hematoma
2	Submucosal lesion
3	Perforation with less than 50% partial transection	Complicated URS (Grade 3a or b according to the Dindo-modified Clavien classification of surgical complications)
4	More than 50% partial transection
5	Complete transection

URS = ureteroscopy.

Appendix 2. The Crowd: Training and Additional Information

Each member of the crowd was trained before reviewing the videos with a brief 4-minute video that was uploaded to YouTube™ by the staff at C-SATS®. The reviewers were provided with a link through the C-SATS assessment tool that educated them on the various aspects of the anatomy and surgery before proceeding with the grading of ureteral injury. The training video consisted of a prerecorded voiceover that narrated a slideshow of relevant information. The topics covered included a summarized explanation of the anatomy of the urinary tract, kidney stones, ureteroscopy, and the Postureteroscopic Lesion Scale (PULS) grading system. Each grade of PULS from 0 to 5 was described according to the initial Schoenthaler and colleagues' definition with correlating patient injury illustrations. Upon completing the training video, the reviewers were provided with another private YouTube link to the 14 ureteroscopic recordings to be graded. Using the aforementioned assessment tool, the crowd selected the most appropriate grade of the ureter from 0 to 5. The mean length of ureteral recordings was 56 seconds (range: 23–111).

The lay reviewer cohort consisted of 2128 trained individuals who returned their assessments within 49 hours. Each reviewer was compensated at a rate of $6 for every hour spent working on the assignment. By the end of the study, an average of $0.56 was compensated to each reviewer resulting in a total cost of $1191.68. The only demographic data collected by the C-SATS team was the location of the reviewers as denoted in the following table.

References

1. Vernez, Huynh, Osann, et al. C-SATS: Assessing surgical skills among urology residency applicants. J Endourol, 2017; 31:S95–S100.

2. Deal, Lendvay, Haque, et al. Crowd-Sourced Assessment of Technical Skills: An opportunity for improvement in the assessment of laparoscopic surgical skills. Am J Surg, 2016; 211:398–404.

3. White, Kowalewski, Dockter, et al. Crowd-Sourced Assessment of Technical Skill: A valid method for discriminating basic robotic surgery skills. J Endourol, 2015; 29:1295–1301.

4. Polin, Siddiqui, Comstock, et al. Crowdsourcing: A valid alternative to expert evaluation of robotic surgery skills. Am J Obstet Gynecol, 2016; 215:644.e1–644.e7.

5. Holst, Kowalewski, White, et al. Crowd-Sourced Assessment of Technical Skills: An adjunct to urology resident surgical simulation training. J Endourol, 2014; 29:604–609.

6. Aghdasi, Bly, White, et al. Crowd-sourced assessment of surgical skills in cricothyrotomy procedure. J Surg Res, 2015; 196:302–306.

7. Ghani, Miller, Linsell, et al. Measuring to improve: Peer and crowd-sourced assessments of technical skill with robot-assisted radical prostatectomy. Eur Urol, 2016; 69:547–550.

8. Chen, Kirsch, Zlatev, et al. Optical biopsy of bladder cancer using crowd-sourced assessment. JAMA Surg, 2016; 151:90–93.

9. Assimos, Krambeck, Miller, et al. Surgical management of stones: American Urological Association/Endourological Society Guideline, PART II. J Urol, 2016; 196:1161–1169.

10. Traxer, Thomas. Prospective evaluation and classification of ureteral wall injuries resulting from insertion of a ureteral access sheath during retrograde intrarenal surgery. J Urol, 2013; 189:580–584.

11. Schoenthaler, Buchholz, Farin, et al. The Post-Ureteroscopic Lesion Scale (PULS): A multicenter video-based evaluation of inter-rater reliability. World J Urol, 2014; 32:1033–1040.

12. Goh, Goldfarb, Sander, et al. Global Evaluative Assessment of Robotic Skills: Validation of a clinical assessment tool to measure robotic surgical skills. J Urol, 2012; 187:247–252.

13. Kaler, Safiullah, Patel, et al. MP75-14 The impact of one week of pre-operative tamsulosin on deployment of 16-French ureteral access sheaths. J Urol, 2017; 197:e1008–e1009.

14. Holst, Kowalewski, White, et al. Crowd-Sourced Assessment of Technical Skills: Differentiating animate surgical skill through the wisdom of crowds. J Endourol, 2015; 29:1183–1188.