Abstract
Background:
Prior studies on technical skills use small collections of videos for assessment. However, there is likely heterogeneity of performance among surgeons and likely improvement after training. If technical skill explains these differences, then it should vary among practicing surgeons and improve over time.
Materials and Methods:
Sleeve gastrectomy cases (n = 162) between July 2018 and January 2021 at one health system were included. Global evaluative assessment of robotic skills (GEARS) scores were assigned by crowdsourced evaluators. Videos were manually annotated. Analysis of variance was used to compare continuous variables between surgeons. Tamhane's post hoc test was used to define differences between surgeons with the eta-squared value for effect size. Linear regression was used for temporal changes. A P value <.05 was considered significant.
Results:
Variations in operative time discriminated between individuals (e.g., between 2 surgeons, means were 91 and 112 minutes, Tamhane's = 0.001). Overall, GEARS scores did not vary significantly (e.g., between those 2 surgeons, means were 20.32 and 20.6, Tamhane's = 0.151). Operative time and total GEARS score did not change over time (R2 = 0.0001–0.096). Subcomponent scores showed idiosyncratic temporal changes, although force sensitivity increased among all (R2 = 0.172–0.243). For a novice surgeon, phase-adjusted operative time (R2 = 0.24), but not overall GEARS scores (R2 = 0.04), improved over time.
Conclusions:
GEARS scores showed less variability and did not improve with time for a novice surgeon. Improved technical skill does not explain the learning curve of a novice surgeon or variation among surgeons. More work could define valid surrogate metrics for performance analysis.
Introduction
An overarching goal of both surgical education and analysis of surgeons in practice is to define operative quality, which can be simplified as follows:
As in medicine generally, the goal of operative performance analysis is to identify factors that optimize patient outcomes. To define the surgical learning curve, ideally, we would calibrate surgical technical and nontechnical parameters against that which optimizes patient outcomes. However, long-term outcomes are not available immediately and therefore surrogate metrics are desirable. For this purpose, a commonly utilized surrogate metric is operative time.1 The degree to which operative time provides a valid evaluation of operative proficiency is largely unknown.
However, it stands to reason that accomplishing surgical steps and tasks in less time suggests improved efficiency, fewer errors, and possibly improved quality. Therefore, operative time has been widely adopted in prior attempts at operative performance analysis.1 More direct attempts at measuring the technical and nontechnical components of operative proficiency have also been introduced. Of particular interest is the assessment of technical skills, as recent work has shown a direct connection between surgical technical skill and desirable patient outcomes.2–6
No current studies robustly connect robotic surgical skill evaluation with patient outcomes, and this stands as a challenge. For robotic skills assessment, a commonly used metric is the global evaluative assessment of robotic skills (GEARS). GEARS is a Likert scale-based assessment tool, which has shown construct validity in differentiating expert- and novice-level performance.7 To reduce manual labor from experts, crowdsourced GEARS scores have been validated to compare favorably with expert-rated GEARS scores.8
Our group has interest in the use of video-based assessment (VBA) for defining surgical quality. For this purpose, we employ crowdsourced GEARS scores. In several major studies that compare surgical skill with outcomes, only a handful of videos are included for analysis of a surgeon's technique. However, many cases are included for analysis of the complication rate.2–6 One might expect some variability in surgical skill evaluations. In examining video logs, it is also possible to identify a reflection of the surgeons' learning curve. In this study, we directly examine individual surgeons' learning curves with metrics based on VBA.
Materials and Methods
A prospectively maintained database of surgical procedures logged throughout a large health system in New York was queried for robot-assisted laparoscopic sleeve gastrectomies performed between January 2019 and January 2021. Surgeon variables (i.e., gender, age, and fellowship status) were obtained from either the health system website or their page on Doximity (Doximity, Inc., San Francisco, CA). These data are excluded for purposes of privacy.
This study was deemed exempt by the institutional review board and consent was not obtained. The detailed methodology for VBA used by our group was described previously and involves use of crowdsourced GEARS scored by the Crowd-Sourced Assessment of Technical Skills (C-SATS) group (Seattle, WA).9,10
A subgroup analysis for 2 of the surgeons, 1 very experienced (>15 years) and the other only 4 months into independent practice, was performed on videos collected for detailed analysis of operative times. The videos were segmented using start and stop times for each phase of the operation: gastrocolic ligament dissection (first cautery to ligament to last cautery to ligament before stapling), stapling (first placement of stapler after advancing bougie to last firing of stapler), and oversewing (first suture entry to last cut of suture used). This phase segmentation was done by a single reviewer (D.P.B.). Videos were obtained from May 2018 to September 2021.
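As an illustration of how annotated start and stop times could be turned into phase durations, the following minimal sketch sums the three annotated phases into a single "phase-adjusted" time. The `Phase` class, the function name, the timestamps, and the assumption that phase-adjusted time is simply the sum of the three active phases are all hypothetical and are not taken from the study's annotation tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Phase:
    """One annotated phase of a case (hypothetical structure)."""
    name: str
    start_s: float  # seconds from the start of the video
    end_s: float    # seconds from the start of the video

    @property
    def duration_min(self) -> float:
        if self.end_s < self.start_s:
            raise ValueError(f"{self.name}: end precedes start")
        return (self.end_s - self.start_s) / 60.0


def phase_adjusted_time(phases):
    """Sum of active-phase durations in minutes (assumed definition)."""
    return sum(p.duration_min for p in phases)


# Hypothetical annotations for a single video; values are illustrative only.
case = [
    Phase("gastrocolic_ligament_dissection", 300, 1800),
    Phase("stapling", 2100, 3000),
    Phase("oversewing", 3120, 4020),
]

total_min = phase_adjusted_time(case)
for p in case:
    print(f"{p.name}: {p.duration_min:.1f} min")
print(f"phase-adjusted total: {total_min:.1f} min")
```

Summing only the annotated phases excludes idle time between phases (e.g., instrument exchanges), which is one plausible reason an adjusted time could behave differently from total operative time.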
Descriptive statistics comparing surgeons are presented as mean ± standard deviation for continuous data. The mean operative time and GEARS scores for each surgeon were compared with one-way analysis of variance (ANOVA), with two-tailed P value <.05 considered significant. Post hoc analysis of between-surgeon differences was performed with Tamhane's T2 test, and a value <.05 was considered significant.
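For readers unfamiliar with Tamhane's T2, the sketch below shows its two building blocks as they are commonly described: a Welch-type t statistic that does not assume equal variances, and a Šidák adjustment for the number of pairwise comparisons. This is not the SPSS implementation. The means and standard deviations are the operative times reported in the Results for two of the surgeons, whereas the group sizes (40 each) are hypothetical. The P value is omitted because it requires a t distribution that the Python standard library does not provide.

```python
import math
from itertools import combinations


def welch_t(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Welch t statistic and Welch-Satterthwaite degrees of freedom."""
    va = sd_a ** 2 / n_a  # squared standard error, group A
    vb = sd_b ** 2 / n_b  # squared standard error, group B
    t = (mean_a - mean_b) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (n_a - 1) + vb ** 2 / (n_b - 1))
    return t, df


def sidak_alpha(alpha, n_comparisons):
    """Per-comparison significance level under the Sidak adjustment."""
    return 1.0 - (1.0 - alpha) ** (1.0 / n_comparisons)


n_surgeons = 4
n_pairs = len(list(combinations(range(n_surgeons), 2)))  # 6 pairwise tests
alpha_pair = sidak_alpha(0.05, n_pairs)

# Reported means and SDs (minutes); n = 40 per surgeon is an assumption.
t_stat, df = welch_t(91.00, 19.18, 40, 112.25, 33.03, 40)

print(f"pairs={n_pairs}, per-pair alpha={alpha_pair:.4f}")
print(f"t={t_stat:.2f}, df={df:.1f}")
```

With 4 surgeons there are 6 pairwise comparisons, so each individual comparison is held to a stricter threshold than the nominal .05.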
For assessment of effect size, eta-squared values are shown. For the temporal analysis, linear regression was used; the squared Pearson correlation coefficient (R2) is reported, with the direction of change indicated by whether the correlation was positive or negative. A P value <.05 was considered significant. Groups were independently observed with normal distributions. All analyses were performed with SPSS 26.0 (IBM, Armonk, NY).
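Eta-squared follows directly from the ANOVA sum-of-squares decomposition: it is the between-group sum of squares divided by the total sum of squares. The following minimal sketch computes the F statistic and eta-squared for a small set of made-up scores. The function name and the data are hypothetical, and the P value is omitted because it requires an F distribution.

```python
from statistics import fmean


def one_way_anova(groups):
    """Return (F statistic, eta-squared) for a list of groups of values."""
    all_values = [x for g in groups for x in g]
    grand_mean = fmean(all_values)
    group_means = [fmean(g) for g in groups]

    # Variation of group means around the grand mean, weighted by group size.
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    # Variation of individual values around their own group mean.
    ss_within = sum(
        (x - m) ** 2 for g, m in zip(groups, group_means) for x in g
    )

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    eta_sq = ss_between / (ss_between + ss_within)
    return f_stat, eta_sq


# Made-up total scores for three hypothetical raters' subjects.
scores = [
    [19.0, 20.0, 21.0],
    [20.0, 21.0, 22.0],
    [21.0, 22.0, 23.0],
]
f_stat, eta_sq = one_way_anova(scores)
print(f"F={f_stat:.2f}, eta-squared={eta_sq:.2f}")
```

Because eta-squared is a proportion of total variance, a small value can coexist with a significant ANOVA result when the number of cases is large.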
Results
Four surgeons performed the majority of sleeve gastrectomies recorded during the study period at this health system. Each surgeon completed more than 10 cases, defined as necessary for inclusion in the temporal analysis. All surgeons were board certified with fellowship training in minimally invasive surgery (<1–27 years of experience). Total GEARS scores were similar between individual surgeons. However, the ANOVA showed a significant difference in their means overall: 19.95 ± 0.73, 19.99 ± 0.69, 20.32 ± 0.76, and 20.60 ± 0.54 (P = .001, eta-squared 0.106) (Table 1).
One-Way Analysis of Variance Results for Global Evaluative Assessment of Robotic Skills: Total and Subcomponent Scores as Well as Operative Time
Small differences in significance are observed among surgeons for most metrics. The operative time shows the most striking effect size (eta-squared). Bold values indicate statistical significance (P < .05).
ANOVA, analysis of variance; GEARS, global evaluative assessment of robotic skills; SD, standard deviation.
Small differences in mean GEARS subcomponent scores were also observed (Table 1). The only GEARS subcomponent score that did not show a significant difference between surgeons was bimanual dexterity (P = .131). The effect sizes were small across the board for GEARS scores: total score, eta-squared, 0.106; bimanual dexterity, 0.036; depth perception, 0.084; efficiency, 0.131; force sensitivity, 0.084; and robotic control, 0.103.
Mean operative time showed a larger magnitude of differences between surgeons across the study period: 91.00 ± 19.18, 112.25 ± 33.03, 126.19 ± 36.92, and 149.29 ± 32.56 minutes (P < .001), as reflected in the larger eta-squared value (0.244). Individual between-surgeon differences reflected by the Tamhane T2 statistic are visualized in Figures 1–7. The results generally show no differences between similarly rated surgeons, but do show differences between surgeons at the extremes of the distribution for a given overall or subcomponent score. The ANOVA data and effect sizes are given in Table 1.

Box and whisker plots of mean total GEARS scores for the 4 surgeons. Significance of the ANOVA result for the 4 surgeons is shown in the upper right and the post hoc Tamhane T2 statistic is shown with lines between surgeons being compared. ANOVA, analysis of variance; GEARS, global evaluative assessment of robotic skills.

Box and whisker plots of mean GEARS subcomponent scores of bimanual dexterity for the 4 surgeons. Significance of the ANOVA result for the 4 surgeons is shown in the upper right and the post hoc Tamhane T2 statistic is shown with lines between surgeons being compared. ANOVA, analysis of variance; GEARS, global evaluative assessment of robotic skills.

Box and whisker plots of mean GEARS subcomponent scores of depth perception for the 4 surgeons. Significance of the ANOVA result for the 4 surgeons is shown in the upper right and the post hoc Tamhane T2 statistic is shown with lines between surgeons being compared. ANOVA, analysis of variance; GEARS, global evaluative assessment of robotic skills.

Box and whisker plots of mean GEARS subcomponent scores of efficiency for the 4 surgeons. Significance of the ANOVA result for the 4 surgeons is shown in the upper right and the post hoc Tamhane T2 statistic is shown with lines between surgeons being compared. ANOVA, analysis of variance; GEARS, global evaluative assessment of robotic skills.

Box and whisker plots of mean GEARS subcomponent scores of force sensitivity for the 4 surgeons. Significance of the ANOVA result for the 4 surgeons is shown in the upper right and the post hoc Tamhane T2 statistic is shown with lines between surgeons being compared. ANOVA, analysis of variance; GEARS, global evaluative assessment of robotic skills.

Box and whisker plots of mean GEARS subcomponent scores of robotic control for the 4 surgeons. Significance of the ANOVA result for the 4 surgeons is shown in the upper right and the post hoc Tamhane T2 statistic is shown with lines between surgeons being compared. ANOVA, analysis of variance; GEARS, global evaluative assessment of robotic skills.

Box and whisker plots of mean operative time for the 4 surgeons. Significance of the ANOVA result for the 4 surgeons is shown in the upper right and the post hoc Tamhane T2 statistic is shown with lines between surgeons being compared. ANOVA, analysis of variance.
The temporal analysis of GEARS scores and operative times showed that only one experienced surgeon (Surgeon B) showed a positive change in overall GEARS scores over time (Table 2). For subcomponent scores, sporadic changes were observed. Bimanual dexterity showed no significant changes over time for any surgeon. Depth perception and robotic control showed positive changes with time for 2 experienced surgeons (Surgeons B and C). The subcomponent score of efficiency showed a decrease over time for Surgeon C.
Linear Regression Comparisons of Surgeons Using Global Evaluative Assessment of Robotic Skills: Total and Subcomponent Scores as Well as Operative Time Over the Study Period
The only universally seen change with time is an improvement in force sensitivity. Bold values indicate statistical significance.
GEARS, global evaluative assessment of robotic skills; NEG, negative; POS, positive.
Most interestingly, the subcomponent score of force sensitivity showed positive changes over time for all surgeons. None of the surgeons improved their unadjusted operative time over the course of the study period, including the novice surgeon (Surgeon A). In the subgroup analysis of Surgeon A and Surgeon C with phase-segmented operative times, the novice surgeon (Surgeon A) showed improvement over time in gastrocolic ligament dissection time, stapling time, and overall time (Table 3). The more experienced surgeon (Surgeon C) showed slight improvement over time in stapling, but not in other phases.
Linear Regression Comparisons of Surgeons as Well as Operative Time and Phase Times for 2 Surgeons in the Study: Surgeon A, Novice, and Surgeon C, Experienced
Bold values indicate statistical significance
NEG, negative.
Overall, pairwise comparison of the operative time of active surgical phases shows improvement in operative time for the novice surgeon (Surgeon A), but not the more experienced surgeon (Surgeon C) (Fig. 8).

Scatter plots of operative times (phase adjusted) for the novice surgeon, Surgeon A (dark squares), and the more experienced surgeon, Surgeon C (light dots), show improvement for the novice surgeon, likely reflecting the surgical learning curve in operative performance. The linear regression output of operative time changes over time is shown on the right for both surgeons.
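The kind of temporal regression summarized in the tables and in Figure 8 can be sketched as follows: an ordinary least-squares line of operative time against case date, reported as R2 together with a POS or NEG direction. The function name and the data points are hypothetical and are not study data.

```python
from statistics import fmean


def linear_trend(x, y):
    """Least-squares slope, intercept, R-squared, and direction label."""
    mx, my = fmean(x), fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))

    slope = sxy / sxx
    intercept = my - slope * mx
    r = sxy / (sxx * syy) ** 0.5  # Pearson correlation coefficient
    direction = "POS" if r > 0 else "NEG" if r < 0 else "NONE"
    return slope, intercept, r * r, direction


# Hypothetical cases: days since first recorded case, phase-adjusted minutes.
days = [0, 30, 60, 90, 120]
minutes = [70.0, 66.0, 65.0, 61.0, 58.0]

slope, intercept, r2, direction = linear_trend(days, minutes)
print(f"slope={slope:.3f} min/day, R2={r2:.3f}, direction={direction}")
```

For operative time, a NEG direction corresponds to improvement (shorter cases over time), whereas for GEARS scores a POS direction corresponds to improvement.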
Discussion
Overall summary
In the longest study to date comparing changes in metrics of technical skill over time, we have corroborated results of prior studies using operative time as a surrogate metric for the surgical learning curve for a novice surgeon.1 More generally, although statistically significant differences were found in both total and subcomponent GEARS scores between surgeons completing robotic sleeve gastrectomy, the magnitude of these differences is outweighed by the variability in any given surgeon's performance throughout the study period.
The largest effect size in the ANOVAs between surgeons was seen for operative time. Few generalizable trends in GEARS scores over time were observed. Exceptionally, the subcomponent score of force sensitivity increased over time for all surgeons in the study. None of the surgeons showed improvement in bimanual dexterity or unadjusted operative time. On adjusted analysis of operative times using phases of sleeve gastrectomy, a novice surgeon showed improvement in overall operative time, gastrocolic ligament dissection, and stapling time. Conversely, a more experienced operator showed improvement only in stapling time.
To summarize, two pieces of evidence corroborate the use of operative time as a surrogate metric for operative proficiency: (1) the large effect size in differentiating operative times between surgeons and (2) the improvement in adjusted operative time with time by the novice surgeon in this study. The nonuniversal changes in GEARS scores and subcomponent scores over time (excepting force sensitivity) suggest that these metrics are not useful for tracking technical skill changes over time scales on the order of 2 years.
Furthermore, the larger effect size seen for operative time among the surgeons in this study compared with the smaller effect size seen for GEARS total and subcomponent scores suggests that operative performance, as measured with the surrogate of operative time, cannot be wholly explained by technical skill variation.
Finally, the improvement over time in adjusted operative times by the novice surgeon in the subgroup analysis supports the use of operative time as a reasonable, although imprecise, surrogate for operative proficiency in measuring the surgical learning curve.
Evaluation of prior studies
This study raises some questions about studies correlating technical skills with favorable patient outcomes. These studies generally use a few videos for a given surgeon to assess the skill level. For example, in an index study by Birkmeyer et al, surgeons submitted a single video for VBA.2 Several more recent studies have also employed a model in which only a small sample of videos is used to assign a technical skill score to a given surgeon.3,4
However, if the variability of technical skill measurement for a given surgeon is comparable with the difference between surgeons used to discriminate between levels of skills, then the claim that differences in such skill measurement explain the differences in outcomes between surgeons is tenuous. That said, this argument hinges on the external applicability of skill evaluation in robotic and nonrobotic settings.
The variability seen in this study could in principle be misleading. Nevertheless, going forward, in robotic surgical skill evaluation, a challenge stands to include multiple videos, at a minimum, in assessing surgical technical skill.
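To make the concern about small video samples concrete, the sketch below estimates how often the average score from n videos would rank two surgeons in the wrong order. It uses the lowest and highest mean total GEARS scores and their standard deviations reported in the Results (19.95 ± 0.73 and 20.60 ± 0.54). The assumptions that per-video scores are normally distributed and independent, and the function name itself, are ours for illustration only.

```python
from math import sqrt
from statistics import NormalDist


def p_misordered(mean_lo, sd_lo, mean_hi, sd_hi, n_videos):
    """Probability that the n-video average of the lower-mean surgeon
    exceeds that of the higher-mean surgeon (normal, independent scores)."""
    diff_mean = mean_hi - mean_lo
    # Standard deviation of the difference between two n-video averages.
    diff_sd = sqrt((sd_lo ** 2 + sd_hi ** 2) / n_videos)
    return NormalDist(mu=diff_mean, sigma=diff_sd).cdf(0.0)


for n in (1, 5, 20):
    p = p_misordered(19.95, 0.73, 20.60, 0.54, n)
    print(f"{n:>2} video(s) per surgeon: P(wrong order) = {p:.3f}")
```

Under these assumptions, the chance of misordering falls as more videos are averaged, which is the quantitative form of the argument for assessing multiple videos per surgeon.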
Surgical learning curve
Which metric should be used to approximate operative proficiency is an open question. That operative time is used as such in the literature is likely a matter of convenience.1 A robust demonstration of a surgical learning curve would define proficiency as the nadir of complication rates (or conversely as the plateau of some positive outcome).
However, as noted above, operative time is a reasonable surrogate given its availability and plausible connection to operative quality. In this study, we do not find evidence supporting technical skill as the primary driver of differences in operative proficiency when comparing surgeons.
Still, more robust models in which operative complication rates and beneficial outcomes are shown to correlate with simpler surrogate metrics would be useful to provide surgical trainees and new surgeons with indicators of operative quality in manageable time.
Limitations
Although this is a relatively long study of the variability of the metrics of technical skill evaluation in robotic surgery, it remains somewhat limited in size. Only 4 surgeons were included and only one surgeon was a novice (<1 year of experience). Therefore, the data supporting the claim that this study corroborates the use of operative time in defining the surgical learning curve are limited.
Additionally, the subgroup analysis comparing Surgeons A and C incorporates more data than the GEARS and operative time comparisons of all 4 surgeons. Therefore, the claim that GEARS scores for the novice surgeon (Surgeon A) cannot account for the temporal improvement in phase-adjusted operative time assumes that the early absence of a trend in GEARS scores extends across the longer subgroup period.
With regard to the setting, although these procedures were completed by the attending surgeon, these cases were performed at academic centers, so it is unknown to what degree trainees participated in the case, possibly confounding conclusions. Additionally, inclusion of cases in this study is dependent on surgeons opting to upload the cases at the beginning of the case to C-SATS; there likely exists some degree of selection bias, perhaps especially so at the beginning of the trial period when surgeons were less comfortable with the idea of recording cases.
Some of the results of this study are not easy to interpret. For example, it is not immediately obvious why the force sensitivity subcomponent score should have improved over time for all surgeons in the study. It does not seem justifiable to grant this as evidence of the learning curve because most surgeons involved had many years of experience. It could also represent a change in assessment on the part of the crowdsourced reviewers with time. That is, it could be that reviewers were more likely to grant higher scores later in the study period in this domain for a reason that is not apparent.
Finally, there is a question about whether this study is useful given the diversity of attitudes about recording cases. There is understandable hesitation to be acknowledged when anonymized logs of video for VBA are considered; in a recent survey of SAGES members, only ∼39% of videos of operations were reportedly recorded, even by this interested population.11
Elsewhere, we have outlined the sensitivity needed when considering the maintenance of video logs, including medicolegal concerns and patient confidentiality and privacy.12 However, the use of video logs allows assessment of both the variability in a given surgeon's practice and their learning curve.
Conclusions
Surgeons who completed robotic sleeve gastrectomy show small differences in technical skills, as measured with GEARS scores. Overall, operative proficiency, as reflected by operative time, cannot be completely explained by variations in technical skills, as measured with GEARS scores. More studies are needed to provide surgeons with a more reliable readily available indicator of operative quality for performance analysis.
Footnotes
Authors' Contributions
D.P.B. was involved in conceptualization, investigation, methodology, statistics, and manuscript preparation. S.K. and P.A. were involved in investigation and manuscript preparation. K.C. was involved in data collection and manuscript preparation, reviewing, and editing. S.P.D. and A.A. were involved in manuscript preparation, reviewing, and editing. D.M. and P.J.C. were involved in conceptualization and manuscript reviewing and editing. E.Y. was involved in manuscript reviewing and editing. F.F. was involved in supervision and manuscript preparation, reviewing, and editing.
Disclosure Statement
F.F. has consulting affiliations with Activ Surgical and Boston Scientific. D.P.B. and P.A. have consulting affiliations with Deep Surgical. P.J.C., E.Y., D.M., K.C., A.A., S.K., and S.P.D. have nothing to disclose.
Funding Information
Intraoperative Performance Analytics Laboratory (IPAL) is supported by a 2020 SAGES Robotic Surgery Grant.
