Abstract
Background:
This study compares surgical performance during analogous vesico-urethral anastomosis (VUA) tasks in two robotic training environments, virtual reality (VR) and dry laboratory (DL), to investigate transferability of skill assessment across the two platforms. Using computer-generated performance metrics and pupillary data, we evaluated how well each environment distinguishes surgical expertise and whether performance in the VR simulation correlates with performance on the live robot in the DL.
Materials and Methods:
Experts (≥300 cases) and trainees (<300 cases) performed analogous VUAs during VR and DL sessions on a da Vinci robotic console following an Institutional Review Board (IRB) approved protocol (HS-16-00318). Twenty-two metrics were generated in each environment (kinematic metrics, tissue metrics, and biometrics). The DL included 18 previously validated automated performance metrics (APMs) (kinematics and event metrics) captured by an Intuitive system data recorder. In both settings, Tobii Pro Glasses 2 recorded the task-evoked pupillary response (reported as Index of Cognitive Activity [ICA]) to indicate cognitive workload, analyzed by EyeTracking cognitive workload software. Pearson correlation, Mann–Whitney, and independent t-tests were used for the comparative analyses.
Results:
Our study included six experts (median caseload 1300 [interquartile range 400–3000]) and 11 trainees (25 [0–250]). A total of 8/9 metrics directly comparable between VR and DL showed significant positive correlation (r ≥ 0.554, p ≤ 0.032); 5/22 VR metrics distinguished expertise, including task time (p = 0.031), clutch usage (p = 0.040), unnecessary needle piercing (p = 0.026), and suspected injury to the endopelvic fascia (p = 0.040). This contrasts with 14/22 APMs in DL (p ≤ 0.038), including linear velocities of all three instruments (p ≤ 0.038) and dominant-hand instrument wrist articulation (p = 0.013). Trainees experienced higher cognitive workload (ICA) in both environments when compared with experts (p < 0.036).
Conclusions:
Most comparable performance metrics correlated moderately to strongly between VR and DL, indicating transferability of skills across the platforms. Between the two training environments, APMs captured during DL tasks distinguished expertise better than VR-generated metrics.
Introduction
Simulation training is used during the initial phase of the surgical learning curve to improve psychomotor skills without risking patient safety. 1 This is particularly true for robot-assisted surgeries since they require not only complex psychomotor skills but also the ability to fluently operate a robotic surgical system. Several different training modalities have been specifically developed for robotic surgery, including virtual reality (VR), dry laboratory (DL) (synthetic), and wet laboratory (human/animal tissue) models. 2
Both VR and DL simulation environments have been shown to be effective for minimally invasive surgery training. 3 Because of the complex multijoint angulation of robotic instruments, DL simulation of robot-assisted surgery cannot be accomplished with a simple training kit or training box. Instead, it requires a setup of the full robotic system in the operating room, making this training modality less accessible to trainees. For this reason, various VR simulators for robot-assisted surgery (Mimic dV-Trainer, da Vinci® SimNow®, and Simbionix RobotiX Mentor) have been developed. VR simulations are no longer limited to basic robotic surgical skills. They offer training modules for specific robotic procedures, both in part and in their entirety. These new simulation modules offer a realistic surgical setting (anatomy, tissue structure, and interaction) and require a combination of robotic skill elements to complete each task. The question remains whether skills gained using virtual simulator training ultimately transfer to the operating room. 4,5
In our previous work, we developed and validated objective measurements of efficiency derived from robot-assisted surgeries to assess surgical performance. These objective, computer-generated motion-tracking measurements, introduced as automated performance metrics (APMs), have been linked to perioperative and long-term patient outcomes. 6–8 In particular, APMs during the vesico-urethral anastomosis (VUA), the key reconstructive step of a robot-assisted radical prostatectomy, have been highlighted as important features for prediction of urinary continence recovery. 8 We have also demonstrated that surgeon cognitive workload while operating, measured by pupillary response, can effectively differentiate surgical expertise. 9
In this study, we evaluate performance during VUA training tasks using analogous VR and DL training models. We utilize computer-generated performance metrics provided by feedback in the VR environment and APMs generated in the DL. We also track pupillary response to measure the cognitive workload of the participant during the exercises. Using these metrics, we investigate whether a surgeon's performance in the VR setting correlates with their performance on the full da Vinci robot in the DL setting, as well as compare which set of metrics can better distinguish between experts and trainees. Ultimately, we aim to show that surgical skills acquired with the VR simulator transfer to live robotic surgery.
Materials and Methods
Study design
After obtaining institutional review board approval, participants of varying surgical experience at our institution performed a VR VUA task on the da Vinci SimNow (Intuitive Surgical, Sunnyvale, CA, USA) (Fig. 1A) and a DL VUA exercise on the da Vinci Xi System (Intuitive Surgical) (Fig. 1B). Participants were faculty surgeons, urology fellows, and urology residents stratified into two groups based on experience level: trainees (fellows and residents, <300 robotic console cases) and experts (attending surgeons, ≥300 robotic console cases). A case was defined as a robotic surgery in which the participant performed a significant portion of the operation as the primary surgeon. We divided groups by participant training level based on our previous work distinguishing surgeon performance, and for this study, 300 cases provided that cutoff. 9 Participants were randomized to perform either the VR simulation or DL first.

VR VUA exercise
A guided VR simulation of VUA (3D Systems, Rock Hill, SC, USA) was performed on the da Vinci SimNow (Fig. 1A). The VR system guided participants through a total of 12 stitches to complete the entire VUA. Participants received standardized instructions to ensure a baseline understanding of how to use the da Vinci SimNow. A total of 22 metrics were collected during the VR simulation, including 20 derived from an automated evaluation provided by the simulator. Metrics were categorized into five metric families: kinematic metrics (e.g., distance and path length), event metrics (e.g., console events such as clutch usage), tissue metrics (e.g., tissue handling technique, including unnecessary needle piercing and wound separation), duration (i.e., total task time), and one biometric (cognitive mental workload).
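As an illustration of what a kinematic metric such as path length represents, the following minimal sketch sums the straight-line distances between consecutive instrument-tip positions. The source does not describe how the simulator or the data recorder computes its metrics, so the function name `path_length` and the input format (a list of x, y, z samples) are assumptions made only for this example.

```python
import math


def path_length(positions):
    """Total distance traveled along a sequence of (x, y, z) samples.

    Hypothetical helper: the actual simulator computation is not
    described in the source. Units follow the input coordinates.
    """
    # Sum the Euclidean distance between each consecutive pair of samples.
    return sum(math.dist(p, q) for p, q in zip(positions, positions[1:]))


if __name__ == "__main__":
    # Made-up instrument-tip samples, not study data.
    samples = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (3.0, 4.0, 12.0)]
    print(path_length(samples))
```

A shorter path length for the same completed task is generally read as more economical instrument movement, which is the sense in which traveling distance is compared between groups in this study.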
DL VUA exercise
Participants conducted the DL task on a live da Vinci Xi System using a VUA kit produced by 3-DMed® (

Cognitive mental workload—biometric
Cognitive mental workload was a biometric assessed in both the DL and VR environments using task-evoked pupillary response (TEPR). Participants wore the Tobii Pro Glasses 2 wearable eye-tracking system (Tobii Technology, Inc.), which recorded TEPR by measuring pupil dilation at a sampling rate of 100 Hz, while performing both the VR simulation and the DL task (Fig. 2B). These eye-tracking recordings were anonymized and sent to EyeTracking, Inc., for data processing through their EyeWorks™ software. The software's algorithms produced the Index of Cognitive Activity (ICA), a scaled metric from 0 to 1 that reflects TEPR and real-time cognitive workload, with greater values indicating higher cognitive workload. Cognitive mental workload provides an objective measure of how difficult each task is for the surgeon and gives insight into the physiologic state of the surgeon in real time during each task. 9
Statistical analysis
All data, including VR simulation computer-generated metrics, DL metrics (APMs), and biometrics (TEPR), were compared between experts and trainees using an independent t-test for normally distributed variables and a Mann–Whitney test otherwise. We selected eight directly comparable metrics from the VR simulation and DL along with ICA and conducted a Pearson correlation analysis. Statistical analysis was performed using IBM® SPSS® 24, with p < 0.05 considered statistically significant. The median and range were used to report performance metrics.
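The two nonparametric and correlation analyses named above can be sketched in a few lines of standard-library Python. This is an illustrative sketch only: the study used SPSS, and the function names, the normal approximation for the Mann–Whitney p-value (without tie or continuity correction), and the example values below are assumptions made for this example rather than the authors' procedure.

```python
import math
from itertools import chain


def pearson_r(x, y):
    """Pearson correlation coefficient for paired samples x and y."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must be paired and contain at least 2 values")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def mann_whitney_u(group_a, group_b):
    """Two-sided Mann-Whitney U test with a normal approximation.

    Returns (U, p). Tied values share their average rank. No tie or
    continuity correction is applied (a simplifying assumption), so
    p-values can differ slightly from statistical packages.
    """
    pooled = sorted(chain(group_a, group_b))
    ranks = {}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        ranks[pooled[i]] = (i + 1 + j) / 2  # mean of 1-based ranks i+1..j
        i = j
    n_a, n_b = len(group_a), len(group_b)
    u_a = sum(ranks[v] for v in group_a) - n_a * (n_a + 1) / 2
    u = min(u_a, n_a * n_b - u_a)
    mean_u = n_a * n_b / 2
    sd_u = math.sqrt(n_a * n_b * (n_a + n_b + 1) / 12)
    z = (u - mean_u) / sd_u
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return u, p


if __name__ == "__main__":
    # Made-up task times in minutes, not study data.
    vr_time = [15.0, 18.5, 22.0, 25.5, 30.0]
    dl_time = [9.5, 11.0, 14.0, 15.5, 19.0]
    print("r =", pearson_r(vr_time, dl_time))
    print("U, p =", mann_whitney_u([10.0, 11.5, 9.0], [15.0, 16.5, 14.0, 18.0]))
```

With groups as small as those in this study (six experts and eleven trainees), an exact Mann–Whitney p-value is often preferred over the normal approximation used in this sketch.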
Results
Seventeen participants were enrolled in this study: six experts and eleven trainees. The median robotic console case experience was 1300 (interquartile range [IQR] 475–2625) for experts and 25 (IQR 0–113) for trainees. Two trainees and one expert were left-handed, and the remaining participants were right-handed.
Translatability of metrics between training environments
We selected nine corresponding metrics from both the VR simulation and the DL session. Eight of the nine metrics generated from VR simulation had statistically significant associations with those captured from the DL session, including all three kinematic metrics, three of the four event metrics, duration, and the biometric (ICA). The total task time, dominant and nondominant instrument traveling distance, camera traveling distance, number of unnecessary needle piercings, number of times the clutch was used, and ICA showed significant correlation between the simulation and DL (Table 1). Cognitive mental workload (ICA) had strong to very strong associations (r = 0.86–0.93, p < 0.001). Task duration and the kinematic metrics (dominant and nondominant instrument traveling distance and camera traveling distance) showed moderate to strong correlations (r = 0.65–0.77, p < 0.008). Clutch use and the tissue handling metric (unnecessary needle piercing) showed moderate associations (r = 0.55, p = 0.032 and r = 0.65, p = 0.004, respectively).
Table 1. Correlation Between Comparable Virtual Reality Simulation-Generated Metrics and Automated Performance Metrics
DH = dominant hand; DL = dry laboratory; ICA = Index of Cognitive Activity; NDH = nondominant hand; SIM = simulation.
Distinguishability between expert and trainee surgeon performance
Virtual reality
The VR simulation computer-generated metrics could be broadly categorized into five metric families: kinematic metrics, event metrics, tissue metrics, duration, and the biometric ICA. A total of 5/22 metrics collected during the VR VUA simulation could distinguish surgeon expertise: none of the four kinematic metrics, one event metric, two tissue metrics, the duration metric, and the biometric (ICA). During VR VUA simulation, experts were more efficient than trainees, showing shorter task completion time (15.8 minutes vs 22.4 minutes, p = 0.031) (Table 2). For console operating skills, trainees used the clutch more than experts (17 vs 5, p = 0.040). For tissue handling technique, experts caused less injury to the endopelvic fascia/urethral sphincter (0.5 vs 2, p = 0.040) and made fewer unnecessary needle piercings (35.5 vs 52, p = 0.026). For cognitive mental workload, experts had a lower ICA, indicating less mental stress (0.29 vs 0.53, p = 0.036).
Table 2. Comparing Virtual Reality Simulator-Generated Metrics Between Experts and Trainees
DH = dominant hand; ICA = Index of Cognitive Activity; NDH = nondominant hand.
Dry laboratory
Similarly, APMs generated during the DL could be broadly categorized into four groups: kinematic metrics, event metrics, duration, and the biometric ICA; tissue handling metrics cannot currently be assessed during the DL. A total of 14/22 metrics collected during the DL VUA task could distinguish surgeon expertise: ten of the 14 kinematic metrics, none of the four event metrics, all three duration metrics, and the biometric (ICA). Experts consistently demonstrated more efficient movement: shorter task completion time (10 minutes vs 15 minutes, p = 0.005); less distance traveled by the nondominant instrument (7.55 m vs 9.67 m, p = 0.013); greater movement velocity of both instruments (dominant and nondominant); and greater movement velocity of the camera (p ≤ 0.038) (Table 3). Experts also had less EndoWrist® instrument wrist articulation in the dominant instrument while performing the VUA (693 radians vs 863 radians, p = 0.013). Cognitive mental workload measurements again showed that expert participants had lower ICA (0.29 vs 0.43, p = 0.024).
Table 3. Comparing Dry Laboratory Automated Performance Metrics Between Experts and Trainees
DH = dominant hand; ICA = Index of Cognitive Activity; NDH = nondominant hand.
Discussion
This comparative study sought to evaluate the surgical performance of experts and trainees during analogous VR and DL VUA tasks and, primarily, to investigate whether surgical performance in a simulated training environment correlates with surgical performance on the da Vinci robot. APMs have previously been validated to assess performance on the surgical robot during live surgery but had not yet been used to assess performance in simulated training environments, including VR environments, which provide their own set of metrics. This is the first time that APMs have been recorded in a training environment. We found that APMs not only correlate highly with metrics from VR but, in this study, also distinguish expertise better than VR metrics during analogous exercises. This highlights the value of APMs in the training environment and lays the foundation for further studies relating DL APMs to live-surgery APMs, which have been predictive of patient outcomes. Proving that surgical skills gained in training environments transfer to live surgical procedures is instrumental to the future of training programs. In this study, we were able to correlate performance across two training environments with different methods of collecting performance metrics. Ideally, we could measure the same APMs in the DL setting and the VR simulation environment. Showing correlation of performance across training environments helps confirm that surgical skill improvement in the VR simulation corresponds to improvement in live robotic surgery. The ability to seamlessly track performance and progression of skills during training, regardless of training medium, is vital to measuring progress. Establishing this correlation for VR simulation matters because of its usefulness as a training tool: it is less expensive than a full robot and thus more available to training programs.
Our results indicate that some VR-generated metrics could distinguish the expertise of the participating surgeons. A total of 5/22 metrics collected were significantly different between the two groups, including total task completion time, clutch use, injury to the endopelvic fascia/urethral sphincter, and unnecessary needle piercing. When assessing tissue handling metrics, which were unique to the VR platform, we did not see statistically significant differences in tissue injuries (other than the endopelvic fascia/urethral sphincter), tissue approximation, or stitches within optimal depth. The ability to assess instrument interaction with tissue is a potential advantage of VR over DL environments, but the value of these metrics may currently be limited by the technology's ability to truly mimic these interactions.
On the other hand, surgical performance measured by APMs during the DL session showed significant differences between the experts and trainees more often than the metrics on the VR simulator. This suggests that APMs may provide more value for assessing instrument movement efficiency. As APMs are time-based metrics generated using data from the robotic instruments and camera, they provide more granular kinematic data and thus more robustly distinguish motion differences between the two groups, especially during complex techniques such as suturing.
While the metrics provided in the DL and VR simulator are extrinsic factors that affect surgeon performance, the cognitive mental workload measures an intrinsic factor, giving a real-time measurement of surgical stress levels. TEPR was measured in both training environments to assess and compare participant cognitive mental workloads through ICA values. During both VR and DL exercises, cognitive workload was able to distinguish between experts and trainees. Expert surgeons consistently demonstrated lower ICA values and therefore less mental stress. Between the two environments, all participants exhibited higher cognitive workloads during VR than during DL tasks. Our previous study showed that under high mental workload conditions, experts and trainees display inverse relationships with kinematic metrics. 9 In particular, that study illustrated that experts with high ICA values show a decrease in instrument velocity, while trainees display an increase. The perceived difficulty of the VR task and the possibly related increase in cognitive workload may have contributed to the inability to distinguish experts and trainees based on kinematic measures alone.
Our study has a few limitations. The sample size was relatively small and drawn from a single institution. Future studies should validate these findings at other centers. The metrics generated by the VR simulation and the APMs captured during the DL session were not completely identical, which limited direct comparison. We utilized only one VR simulation model of the VUA. While there are other models, at present, the authors felt that the 3D Systems model was the most developed VR VUA simulation. We could not assess tissue handling metrics in the DL setting because of current technical limitations. Future studies with a larger sample size, homogeneous participants with comparable robotic surgical experience, the ability to measure identical APMs in the simulated environment and thus directly compare with surgical data from the operating room, and the use of DL models made with material of measurable deformation may improve the ability to correlate VR simulation performance with live robot performance.
Future studies confirming predictive ability (how performance on VR simulation anticipates future performance in DL settings or even in live surgery) would provide further evidence in favor of robotic surgical training in a simulated environment. The current form of this VR simulator for VUA is limited to development of robotic control skills and instrument movement skills. Currently, it is not as well suited for assessing tissue handling. Further development of this VR platform with more realistic tissue deformation may augment its usefulness for advanced training.
Conclusions
Our study indicates a strong correlation of surgeon performance, as measured by computer-generated metrics, during training exercises in VR and DL environments, highlighting the transferability of skills between the two domains. However, when comparing VR-generated metrics and APMs in isolation relative to their domain, DL metrics were more capable of distinguishing expertise. Cognitive workload and surgeon surveys revealed that VR tasks are more difficult and less realistic than DL tasks.
Footnotes
Authors' Contributions
A.C. was involved in acquisition of data and drafting of the manuscript; J.C. was involved in drafting of the manuscript; S.M. was involved in acquisition of data; S.S.R. performed the critical revision of the manuscript; R.M. performed the analysis and interpretation of data and statistical analysis; S.M. performed the analysis and interpretation of data; J.H.N. was involved in conception and design, acquisition of data, and critical revision of the manuscript; and A.J.H. was involved in conception and design, critical revision of the manuscript, supervision, and obtaining funding.
Acknowledgment
The authors would like to acknowledge Anthony Jarc (Intuitive Surgical, Inc., Clinical Research, Norcross, Georgia, USA) for processing of automated performance metrics.
Author Disclosure Statement
A.J.H. has financial disclosures with Quantgene, Inc. (consultant), Mimic Technologies, Inc. (consultant), and Johnson & Johnson (consultant). All other authors have no conflicts to disclose.
Funding Information
This study was supported, in part, by an Intuitive Surgical Clinical Research Grant.
