Abstract
Purpose:
Surgical performance is critical for clinical outcomes. We present a novel machine learning (ML) method of processing automated performance metrics (APMs) to evaluate surgical performance and predict clinical outcomes after robot-assisted radical prostatectomy (RARP).
Materials and Methods:
We trained three ML algorithms utilizing APMs directly from robot system data (training material) and hospital length of stay (LOS; training label) (≤2 days and >2 days) from 78 RARP cases, and selected the algorithm with the best performance. The selected algorithm categorized the cases as either “Predicted as expected LOS (pExp-LOS)” or “Predicted as extended LOS (pExt-LOS).” We compared postoperative outcomes of the two groups (Kruskal–Wallis/Fisher's exact tests). The algorithm then predicted individual clinical outcomes, which we compared with actual outcomes (Spearman's correlation/Fisher's exact tests). Finally, we identified the five most relevant APMs used by the algorithm for prediction.
Results:
The “Random Forest-50” (RF-50) algorithm had the best performance, reaching 87.2% accuracy in predicting LOS (73 cases as “pExp-LOS” and 5 cases as “pExt-LOS”). The “pExp-LOS” cases outperformed the “pExt-LOS” cases in surgery time (3.7 hours vs 4.6 hours, p = 0.007), LOS (2 days vs 4 days, p = 0.02), and Foley duration (9 days vs 14 days, p = 0.02). Patient outcomes predicted by the algorithm were significantly associated with the “ground truth” in surgery time (p < 0.001, r = 0.73), LOS (p = 0.05, r = 0.52), and Foley duration (p < 0.001, r = 0.45). The five most relevant APMs used by the RF-50 algorithm for prediction were largely related to camera manipulation.
Conclusion:
To our knowledge, ours is the first study to show that APMs and ML algorithms may help assess surgical RARP performance and predict clinical outcomes. With further accrual of clinical data (oncologic and functional data), this process will become increasingly relevant and valuable in surgical assessment and training.
Introduction
Efforts have been devoted to achieving effective yet practical surgical performance assessment. The number of cases previously performed by a surgeon (caseload) is a commonly utilized marker of surgeon experience. Learning curve studies of radical prostatectomy show that patient outcome improves with greater caseload. 3 However, evaluating surgical performance by caseload alone may be inaccurate or inconsistent as it is often self-reported by the surgeon. 3 To overcome this limitation, several surgical assessment tools have been developed and validated to measure surgical performance. 4–6 With the widely validated Global Evaluative Assessment of Robotic Surgery (GEARS), evaluation of the vesicourethral anastomosis in robot-assisted radical prostatectomy (RARP) demonstrated association with select patient outcomes. 7 A recent study showed that GEARS scores from select steps of RARP were predictive of early continence after surgery. 8 Prostatectomy Assessment and Competency Evaluation 5 and Robotic Anastomosis Competency Evaluation 6 are procedure-specific and step-specific evaluative tools and provide deconstructed task evaluation with cognitive surgical skill feedback. Although designed with standardized, objective evaluation criteria, these tools nonetheless require manual review of surgical videos, introducing inherent variance by expert or crowd-sourced evaluators. 9
For the last two decades, artificial intelligence (AI) has been broadly applied in healthcare to process big data to deliver more efficient care. 10,11 Machine learning (ML), a form of AI, automates analytical algorithm building. Using algorithms that iteratively learn from data, ML allows computers to find hidden insights and detect underlying patterns of a given data set without explicit instruction. Thus, this approach avoids preassumptions regarding model types and variable interactions and may offer additional knowledge that has eluded detection by standard statistical methods. 11
With a novel data-recording device, the “dVLogger” (Intuitive Surgical), automated performance metrics (APMs) (instrument motion tracking metrics and system events recorded in Cartesian coordinates) and synchronized surgical footage can be captured directly from the da Vinci robotic system in real time during live surgical procedures. In our original pilot study, we identified and validated certain APMs during RARP that can distinguish surgeon expertise. 12 In the current study, we assess the utility of ML algorithms for evaluating surgical performance during whole-procedure RARP and predicting patient outcomes. We also evaluate the relative importance of individual APMs in learning and predicting outcomes.
Materials and Methods
APM collection
Under an institutional review board-approved protocol for collection and analysis of surgical performance and clinical patient data, patients with localized prostatic adenocarcinoma who underwent primary RARP from August 2016 to March 2017 at our institution were included in this study. To control for surgical complexity, patients with prior treatment for prostate cancer (i.e., radiotherapy and cryotherapy) were excluded from the study. Cases requiring lysis of adhesions from prior abdominal or pelvic surgeries were also excluded from analyses. Synchronized surgical endoscopic video footage and 25 APMs from the studied cases were recorded by the “dVLogger” (Intuitive Surgical) in a process previously described. 12 The “dVLogger” captures synchronized video of the endoscope view at 30 frames per second. The APMs, derived from robot system data (kinematics and events), are recorded at 50 Hz. Kinematic data included characteristics of movement such as instrument travel time, path length, and velocity. System events included frequency of master controller clutch, camera movements, third arm swap, and energy usage (Table 1).
APM = automated performance metric; m/s = meter per second; No./s = numbers/second.
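As a rough illustration of how kinematic metrics such as path length and velocity could be derived from Cartesian instrument positions sampled at 50 Hz, consider the minimal sketch below. The function names, the assumption that positions are expressed in meters, and the sample coordinates are hypothetical and are not taken from the dVLogger software.

```python
import math

SAMPLE_RATE_HZ = 50  # kinematic sampling rate reported in the text


def path_length(positions):
    """Sum of straight-line distances between consecutive (x, y, z) samples."""
    return sum(math.dist(a, b) for a, b in zip(positions, positions[1:]))


def mean_speed(positions, sample_rate_hz=SAMPLE_RATE_HZ):
    """Path length divided by elapsed time, in position units per second."""
    duration_s = (len(positions) - 1) / sample_rate_hz
    if duration_s <= 0:
        return 0.0
    return path_length(positions) / duration_s


# Made-up instrument tip positions (meters), for illustration only.
tip_positions = [
    (0.000, 0.000, 0.000),
    (0.003, 0.004, 0.000),
    (0.003, 0.004, 0.005),
]

total_m = path_length(tip_positions)   # two 5 mm moves
speed_m_per_s = mean_speed(tip_positions)
```

Event-type metrics (for example, camera movements per second) would instead be counts divided by elapsed time.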
Patient outcome data collection
Patient demographics; surgical data, such as surgery time and estimated blood loss (EBL); and early postoperative outcomes, such as length of stay (LOS), pelvic drainage volume, drainage tube duration, Foley catheter duration, and readmission, were prospectively collected under the same institutional review board-approved protocol. At our institution, patients undergoing RARP are cared for by the same midlevel care team, following a standardized enhanced recovery after surgery protocol and unified discharge criteria. The surgical drainage tube is removed if the daily drainage output is less than 200 mL. Patients are followed up in the clinic 1 week after surgery. If no anastomotic leakage is detected, the Foley catheter is removed at that visit; if leakage is present, the catheter is left in place for a longer time.
ML algorithm selection
We trained three distinct ML algorithms: “Random Forest-50” (RF-50), Support Vector Machine-Radial Basis Function (SVM-RBF), and L2 Regularization (LR-L2), using Python® with scikit-learn 0.19.0 package (
ML case categorization
All 78 RARPs were first categorized into two groups based on postoperative LOS. The routine LOS of RARP at our institution is 1 to 2 days. 13 Thus, in this study, we defined cases with LOS ≤2 days as “Expected LOS (Exp-LOS)” and those with LOS >2 days as “Extended LOS (Ext-LOS).” We chose LOS as the label in this study because in our initial analysis, this perioperative data point was a sensitive and stable differentiator between cases of experienced and less experienced attending surgeons. We used APMs as “training input” and LOS as the “label” to train ML algorithms (Fig. 1). We also trained the algorithms with additional patient demographic data (age, body mass index [BMI], American Society of Anesthesiologists [ASA] scores) to determine whether the prediction accuracy improved.

Fig. 1. The process of ML model training and prediction.
We used “Stratified k-fold (k = 10)” cross-validation to predict all cases and estimate the model's performance. For example, the “training input” (APMs, with or without patient demographics) and “label” (LOS) of 70 cases were utilized to train the ML algorithms. We then fed the trained algorithms with the “training input” from the remaining eight separate cases to evaluate and predict surgical performance of those cases. This was the first round of cross-validation. For each round of validation, by using “Stratification,” we made sure that the predicting set contained approximately the same percentage of each label as the training set. During the next round, another combination of 70 cases was used to train, and the outcome of another 8 cases was predicted. After 10 rounds of training and prediction (the last two rounds each contained 71 training cases and 7 predicting cases, so that each of the 78 cases was predicted exactly once), all surgeries were classified by the algorithms into “Predicted as expected LOS (pExp-LOS)” and “Predicted as extended LOS (pExt-LOS)” groups based on the APMs of each case (Fig. 1). We then compared the clinical outcomes (continuous variables) between the two predicted groups using the Kruskal–Wallis test. Readmission was compared using Fisher's exact test.
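The cross-validation scheme described above can be sketched with the current scikit-learn API (the study itself used version 0.19.0). This is a minimal sketch under several assumptions that are not stated in the text: “RF-50” is read as a random forest with 50 trees, “LR-L2” as logistic regression with an L2 penalty, features are standardized before the SVM and logistic regression, and all hyperparameters are library defaults. The feature matrix is synthetic random data standing in for the 25 APMs, so the resulting accuracies say nothing about the study's results.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_cases, n_apms = 78, 25

# Synthetic stand-in data: 11 "Ext-LOS" (label 1) and 67 "Exp-LOS" (label 0).
y = np.array([1] * 11 + [0] * 67)
X = rng.normal(size=(n_cases, n_apms))
X[y == 1, :3] += 1.0  # inject an artificial signal into three columns

# Assumed interpretations of the three algorithms named in the text.
models = {
    "RF-50": RandomForestClassifier(n_estimators=50, random_state=0),
    "SVM-RBF": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
    # LogisticRegression applies an L2 penalty by default.
    "LR-L2": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
}

# Stratified 10-fold CV: every case lands in exactly one held-out fold.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
fold_sizes = [len(test_idx) for _, test_idx in cv.split(X, y)]

# Out-of-fold prediction for each case, then overall accuracy per model.
predicted = {name: cross_val_predict(m, X, y, cv=cv) for name, m in models.items()}
accuracy = {name: float(np.mean(p == y)) for name, p in predicted.items()}
```

Because `cross_val_predict` returns one out-of-fold prediction per case, the predicted labels can be used directly to form the two predicted groups.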
ML model predicting outcomes
We further utilized the same “training input” and distinct postoperative outcomes as “labels” to separately train the selected ML algorithm. For continuous outcome variables (surgery time, EBL, LOS, pelvic drainage volume, drainage tube duration, Foley duration), we trained the algorithm to perform regression analysis. For the categorical variable (readmission), we trained the algorithm to perform classification. These distinctly trained algorithms (specific to each clinical outcome variable) were then applied to directly predict the corresponding clinical outcome of each case. We then determined the accuracy of ML prediction by evaluating the association between the predicted results and the actual (“ground truth”) patient outcomes using Spearman's correlation test (continuous variables) and Fisher's exact test (categorical variables).
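To make the comparison between predicted and actual continuous outcomes concrete, the sketch below computes Spearman's rank correlation from scratch: both series are converted to ranks (ties receive the average rank) and the Pearson correlation of the ranks is returned. The helper names and the five surgery times are made up for illustration; in practice a statistics library would normally be used, and it would also supply the p-value.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values share the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_r(x, y):
    """Pearson correlation of the ranks (undefined if either input is constant)."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


# Hypothetical actual vs. predicted surgery times (hours), not study data.
actual_hours = [3.1, 3.6, 3.9, 4.4, 5.0]
predicted_hours = [3.5, 3.3, 4.1, 4.0, 4.6]
rho = spearman_r(actual_hours, predicted_hours)  # 0.8 for these values
```

Because the statistic depends only on ranks, it rewards predictions that order the cases correctly even when the predicted magnitudes are off.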
Metric ranking
A total of 25 APMs were utilized as “input” to train the ML algorithms (Table 1). After the algorithms were trained to categorize cases and predict patient outcome, the importance of each metric (instrument motion tracking [kinematic] metrics or system events [frequency] metrics) was calculated using the algorithms' learned parameters. For the RF-50 algorithm, the importance score for each metric was calculated as the normalized total reduction of Gini impurity associated with splits on that metric in each tree, averaged over all the trees. For the SVM-RBF algorithm, importance scores cannot be calculated straightforwardly because the nonlinear kernel does not assign a separate weight to each input metric. For the LR-L2 algorithm, the importance score for each metric was taken from the multiplicative weight associated with that metric. A higher importance score for a metric indicated a greater contribution of that metric to the algorithm's predictions. We then ranked these APMs and determined the top five metrics most predictive of patient outcome.
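The Gini-based importance described above can be illustrated for a single split. The sketch below computes the Gini impurity of a node, the impurity reduction achieved by one split, and the normalization of per-metric totals into scores that sum to 1. The class counts in the example split and the metric names are hypothetical; a full random forest would accumulate such reductions over every split in every tree, weighting each by the share of cases that reach the node.

```python
def gini(counts):
    """Gini impurity of a node given its per-class case counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts)


def impurity_decrease(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = sum(parent)
    weighted_children = (sum(left) / n) * gini(left) + (sum(right) / n) * gini(right)
    return gini(parent) - weighted_children


def normalize(totals):
    """Scale per-metric impurity reductions so the scores sum to 1."""
    grand_total = sum(totals.values())
    return {name: value / grand_total for name, value in totals.items()}


# Root node with 67 Exp-LOS and 11 Ext-LOS cases; the split is made up.
root = [67, 11]
left_child, right_child = [60, 2], [7, 9]
decrease = impurity_decrease(root, left_child, right_child)

# Hypothetical accumulated reductions for three illustrative metric names.
totals = {"camera_metric": decrease, "clutch_metric": 0.02, "energy_metric": 0.01}
scores = normalize(totals)
ranking = sorted(scores, key=scores.get, reverse=True)
```

Ranking by the normalized score and keeping the first five entries mirrors the "top five" selection used in the study.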
Results
Algorithm selection
Seventy-eight RARP cases by nine faculty surgeons (median 500 console cases, range 100–2000) were included in this study. Of these, 67 cases had LOS of 1 to 2 days and 11 cases had Ext-LOS (>2 days). From this same cohort of patients, the RF-50 algorithm predicted 73 cases as “pExp-LOS” and 5 as “pExt-LOS.” The predictive accuracy was 87.2%. The SVM-RBF algorithm predicted 66 cases as “pExp-LOS” and 12 as “pExt-LOS,” demonstrating an accuracy of 83.3%. The LR-L2 algorithm predicted 71 cases as “pExp-LOS” and 7 as “pExt-LOS,” demonstrating an accuracy of 82.1%. We chose RF-50 for further analysis. When trained with both APMs and patient demographics, the predictive accuracy of the RF-50 algorithm improved to 88.5%. To focus on how surgeon performance influences patient outcome, we performed all subsequent analyses using only APMs as “training material.”
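The reported accuracies can be cross-checked against the cohort size. The sketch below finds the number of correctly classified cases out of 78 that rounds to each reported percentage. These counts are not reported in the text; they are inferred here on the assumption that accuracy is the fraction of the 78 cases classified correctly, rounded to one decimal place. The function name is hypothetical.

```python
N_CASES = 78  # cohort size reported in the text


def correct_counts(reported_pct, n_cases=N_CASES):
    """All counts k for which 100 * k / n_cases rounds to reported_pct."""
    return [k for k in range(n_cases + 1)
            if round(100 * k / n_cases, 1) == reported_pct]


reported = {
    "RF-50": 87.2,
    "SVM-RBF": 83.3,
    "LR-L2": 82.1,
    "RF-50 + demographics": 88.5,
}
inferred = {name: correct_counts(pct) for name, pct in reported.items()}
# inferred == {"RF-50": [68], "SVM-RBF": [65], "LR-L2": [64],
#              "RF-50 + demographics": [69]}
```

Under that assumption, each percentage corresponds to a single whole number of cases, and adding demographics to RF-50 corresponds to one additional correctly classified case.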
Case categorization
No significant difference was identified in patient age, BMI, ASA scores, prostate-specific antigen, Gleason score, prior transurethral resection of prostate, or presence of median lobe between the two groups predicted and categorized by RF-50 algorithm (p > 0.05) (Table 2). Surgeries in the “pExp-LOS” group outperformed the “pExt-LOS” group in surgery time (3.7 hours vs 4.6 hours, p = 0.007), LOS (2 days vs 4 days, p = 0.02), and Foley catheter duration (9 days vs 14 days, p = 0.02).
In the “pExp-LOS” group, patients were readmitted for small bowel obstruction (one patient) and vesicourethral anastomosis leakage (one patient).
ASA = American Society of Anesthesiologists; BMI = body mass index; EBL = estimated blood loss; IQR = interquartile range; LOS = length of stay; pExp-LOS = predicted as expected length of stay; pExt-LOS = predicted as extended length of stay; POD1 = postoperative day 1; PSA = prostate-specific antigen; TURP = transurethral resection of prostate.
Bold type means p ≤ 0.05.
Outcome prediction
Patient outcomes predicted by the RF-50 algorithm were significantly associated with the actual (“ground truth”) values for surgery time (r = 0.73, p < 0.001), LOS (r = 0.52, p = 0.05), and Foley catheter duration (r = 0.45, p < 0.001) (Table 3).
RF-50 = Random Forest-50.
Bold type means p ≤ 0.05.
Metric ranking
We calculated the APMs' “importance score” from the RF-50 algorithm and identified the top five metrics used by the RF-50 algorithm to predict surgery time, LOS, and Foley duration (Table 4). Metrics related to camera manipulation were top predictors for each of our selected perioperative outcomes.
m/s = meter/second; No./s = numbers/second.
Discussion
In this study, we trained an RF-50 ML algorithm to evaluate APMs during full-length RARP and to predict select postoperative outcomes. ML algorithms, while now commonplace outside of medicine (e.g., weather forecasting, DNA sequence classification), have not previously been harnessed to process surgeon APMs taken directly from da Vinci robot systems data.
Common wisdom indicates that one can predict surgical performance by knowing that particular surgeon's prior case volumes, that is, a more experienced surgeon is likely to perform better than someone less experienced. Assessment of surgical footage by peer evaluators also grades a surgeon's expertise. Yet, by having the ability to perform detailed analysis of every single movement of the robot across an entire procedure, we may be able to have a fuller picture of surgical performance and may be able to better predict patient outcomes. In the present study, APMs processed by an ML model showed the ability to effectively categorize surgical performance by LOS and identify significant differences with other patient outcomes by this categorization. Furthermore, select outcome data (surgery time, LOS, Foley catheter duration) predicted by the ML algorithms were also significantly associated with the ground truth. This pilot study showed the significant potential of APMs and ML in improving surgery evaluation, augmenting surgeon training, and predicting patient safety and wellness.
The “training inputs” used in this study were the APMs, captured directly from the robotic system by the “dVLogger” in real time during the actual performance of RARP. The “label” was the hospital LOS. Although there were other patient outcome variables that could evaluate surgical performance, we chose LOS as the “label” in this initial pilot study due to its ability to differentiate cases by surgeon experience (data not presented). All RARPs performed at our institution are under the same postoperative pathway, and the discharge-to-home criteria for each patient are nearly uniform. Patients are discharged after meeting a constellation of criteria. 13 Thus, LOS is a comprehensive and standardized short-term postoperative outcome measurement. Using LOS as a “label” of clinical outcome, the algorithm was able to evaluate and classify surgeries that were not only significantly different in LOS but also in surgery time and Foley catheter duration. We believe that after effective training, the ML algorithm learns through the APMs how to differentiate surgical performance with several optimal patient outcomes.
ML data processing is different from the conventional statistical analysis. For the conventional statistical approach, researchers need to choose a predesigned model that is most suitable for the data. One major limitation of the conventional statistical model is that only theoretically relevant variables based on previous studies and experience, or significant variables identified by the univariate analysis, are to be tested. The number of variables to put in the model is limited, for irrelevant variables will compromise the power of the model. This greatly constricts the ability to explore the influencing factors.
In contrast, ML is not built on a prestructured model; rather, the data create the model by detecting the underlying patterns. In general, the more informative variables (input) used to train the algorithm, the more accurate the algorithm tends to become. Also, this approach avoids preassumptions regarding model types and variable interactions. For these reasons, ML may unveil hidden knowledge that may remain undetected by traditional statistical methods. Furthermore, by using dVLogger data, subtle differences in robotic manipulations that are difficult to detect by human review can be recorded and examined. When looking across the entire procedure, these accumulated technical differences may dramatically affect patient outcome. Increasing evidence has suggested that ML can be more accurate than conventional logistic regression across a wide variety of subject areas. 14,15 Thus, when the input data volume is high, it is important to consider techniques beyond standard regression to optimize accuracy. 14
In our previously published robotic APM validation study, experts showed more efficient camera manipulation in select steps of RARP. 12 Another study on robotic virtual reality simulation exercises also suggested that expert surgeons had significantly less total camera moving time but higher frequency of camera movement than new robotic surgeons. 16 In this study, the metric importance ranking also showed that camera manipulation-related metrics were important during ML learning and prediction. Thus, we believe that skillful (brief, efficient, and frequent) camera manipulation is emerging as an important indicator of robotic surgical expertise. In reality, this discovery perhaps represents a sensitive measure of surgeon performance as opposed to a specific technical skill to be trained. Further work in this area will associate APMs with surgeon intent and behavior in relation to the surgical context (e.g., fluctuation of metrics during lower vs higher complexity tasks).
There are limitations to the present study. This is a preliminary study of applying APMs and ML to surgical evaluation and patient outcome prediction. We are accruing oncologic and functional data, which will be presented in future work. We have studied only one procedure, the RARP. However, the lessons learned in this effort may be applied to other urologic and nonurologic procedures. Given that the study took place at a major teaching institution, several cases had resident or fellow participation, although under careful faculty surgeon supervision. It is therefore noteworthy that the aggregate movements and events that occurred in each full-length RARP may have stemmed from more than one operator. In the whole cohort analysis, most statistically significant associations between ground truth and predicted clinical outcomes were not strong; it is possible that with a larger cohort of cases to train the algorithm, more accurate predictions can be made, resulting in stronger associations with actual outcomes. Finally, while the performance metrics analyzed reflect movement and efficiency, they do not yet describe a skills domain that is deficient or provide step-specific feedback. Further work that associates APMs with surgeon cognition and intent may allow for meaningful surgeon feedback.
Conclusion
The marriage of APMs derived from the surgical robot and ML algorithms may help objectively assess surgical performance on a large scale and predict clinical outcomes. With further accrual of clinical data (oncologic and functional data), this process will become more relevant and valuable in surgical assessment and training.
Footnotes
Acknowledgment
The authors thank Omid Mohareri for dVLogger support.
Author Disclosure Statement
A.J.H. is a consultant for Ethicon, Inc. A.J. is an employee of Intuitive Surgical.
