Abstract
The use of simulation as an assessment tool is much more controversial than is its utility as an educational tool. However, without valid simulation-based assessment tools, the ability to objectively assess technical skill competencies in a competency-based medical education framework will remain challenging. The current literature in urologic simulation-based training and assessment uses a definition and framework of validity that is now outdated. This is probably due to the absence of awareness rather than an absence of comprehension. The following review article provides the urologic community an updated taxonomy on validity theory as it relates to simulation-based training and assessments and translates our simulation literature to date into this framework. While the old taxonomy considered validity as distinct subcategories and focused on the simulator itself, the modern taxonomy, for which we translate the literature evidence, considers validity as a unitary construct with a focus on interpretation of simulator data/scores.
Introduction
Akin to conceptual frameworks in curriculum development and skill learning, central to competency-based medical education (CBME) is the need for iterative assessments of learner abilities and competences. In surgical disciplines like urology, this includes a significant emphasis not only on cognitive and attitudinal abilities but also on skill competencies.
The time spent in the operating room by modern surgical trainees in the CBME era no longer provides sufficient exposure to train and assess them adequately. To address this need, simulation-based training and assessment methods have become important adjunctive tools utilized by surgical training programs. This is clearly illustrated by the significant increase in simulation-based training research published over the past decade. 2 The literature supporting the role of simulation-based training methods is now quite robust, with much of the surgical literature focusing on simulators and the validation of simulators.
Validity theory has changed significantly since the urologic community adopted the notion that simulation-based training methods were effective complements to surgical training. As a result, much of the current literature in urologic simulation-based training and assessment uses a framework and definition of validity that is now outdated—likely from absence of awareness rather than an absence of comprehension. Moreover, the literature on simulation-based assessment tools has been sparse over this same time period. The use of simulation as an assessment tool is much more controversial than is its utility as an educational tool, but without valid simulation-based assessment tools, the ability to objectively assess technical skill competencies in a CBME framework will remain challenging.
The following review article focuses on providing the urologic community an updated taxonomy on validity theory as it relates to simulation-based training and assessments and translates our simulation literature to date into this framework. In addition, we provide a review of the literature at various stages of the competence continuum (medical student, resident, and practicing physician), as simulation-based curricula and assessment tools will have an important role in improving our ability to accurately assess technical skill in the context of a CBME training model.
Validity—the Old and the New
Over the past 60 years, several documents from three different organizations have been published and disseminated to guide the development and use of assessment tests. In 1954, the first guide, “Technical Recommendations for Psychological Tests and Diagnostic Techniques,” was introduced by a committee from the American Psychological Association (APA). Through collaborative efforts with the American Educational Research Association (AERA) and the National Council on Measurement in Education (NCME), several iterations of this “consensus standards” guide have since been published under the updated title “Standards for Educational and Psychological Testing.”
In 1974, the consensus standards included the framework of validity most often used in current simulation-based training research published by the urologic community: “valid instruments” with distinct “types of validity.” The process for validating a given test or simulator involved subjective approaches such as face and content validity and objective approaches such as criterion and construct validity. 3 Face validity was defined as whether or not a simulator represented what it was intended to represent, as judged by learners. Content validity was defined as whether or not a simulator realistically taught what it was supposed to teach, as assessed by educational content experts. Criterion validity referred to the concept of performance scores on a simulator correlating with another gold standard definition of skill, while construct validity referred to the concept of the simulator being able to distinguish expert performances from novice performances (Table 1). While this taxonomy may still have some relevance with respect to simulators and simulation-based training, it is no longer considered acceptable by behavioral scientists when it comes to testing or assessment. Accordingly, this approach has been removed from subsequent revisions of the 1985 consensus standards. These newer iterations of the consensus standards have included several significant changes to the concept of validity, most of which have not been reflected in the current urologic education literature. The most recent guide was published in 2014. 4
N.B. OSATS = objective structured assessment of technical skills.
The core concept, based on Messick, that remains in the 2014 framework of validity for testing no longer refers to types of validity but rather focuses on a unitary concept of validity where all validity is construct validity. 5 This concept was brought to the attention of, and translated for, the surgical community in 2010 by Sweet and colleagues and Korndorffer and colleagues. 6,7 Validity is defined as “the degree to which evidence and theory support the interpretation of assessment scores for proposed uses of tests.” 4 This new definition considers validity to be a “hypothesis” for which evidence should be collected to either accept or refute it. This evidence should come from multiple sources, including the test content, response processes, internal structure, relationships to other variables, and consequences of testing.
Validity evidence is also required for each use of the test or simulator. That is, validity evidence is unique to the defined population in which it was evaluated—validity evidence for a simulator using medical student assessments does not necessarily mean it will be a valid assessment tool for maintenance of certification of practicing physicians—and validity evidence must be considered in the context of the planned use of those results—strength of validity evidence for low-stake formative assessments differs from that of high-stake summative assessments.
Finally, validity evidence is not for the simulator or the test itself; rather, it applies to the interpretation of the simulator/test performance scores. The question is whether we can validly interpret the performance scores obtained using a specific simulator to make a judgment of competence or skill.
The validation process is considered a responsibility of both test/simulator developers and users. The responsibility of developers is to provide relevant rationale and evidence that support any test/simulator score interpretations for particular uses intended by the developers. The responsibility of users, in turn, is to evaluate the evidence in the particular setting in which the test is to be used.
Current validity taxonomy—the five sources of validity evidence
Rather than delineating types or categories of validity, the modern validity taxonomy, first introduced in the 1999 iteration of the consensus standards, 8 describes a framework with five distinct sources of validity evidence that are used to help accept or refute the interpretation of an assessment (Table 1). Not all five sources of evidence are required for all assessments, and depending on the type of assessment being made, more emphasis on one or more sources may be necessary. 9
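As a purely illustrative aid, the unitary framework can be sketched as a small record that ties evidence to one score interpretation in one context, then reports which of the five sources have not yet been addressed. The class name `ValidityArgument`, its fields, and the example entries are hypothetical and are not drawn from any cited study; the five source names follow the list given in the definition of validity above.

```python
from dataclasses import dataclass, field

# The five sources of validity evidence named in the consensus standards.
SOURCES = (
    "test content",
    "response processes",
    "internal structure",
    "relationships to other variables",
    "consequences of testing",
)


@dataclass
class ValidityArgument:
    """Evidence gathered for ONE score interpretation in ONE context.

    Hypothetical structure: evidence attaches to the interpretation,
    population, and stakes, not to the simulator itself.
    """
    interpretation: str  # what the scores are taken to mean
    population: str      # who was assessed
    stakes: str          # e.g. "formative" or "summative"
    evidence: dict = field(default_factory=dict)  # source -> list of notes

    def add(self, source, note):
        """Record one piece of evidence under a recognized source."""
        if source not in SOURCES:
            raise ValueError(f"unknown evidence source: {source}")
        self.evidence.setdefault(source, []).append(note)

    def missing_sources(self):
        """Return the sources with no recorded evidence, in standard order."""
        return [s for s in SOURCES if s not in self.evidence]


# Hypothetical example entries, for illustration only.
argument = ValidityArgument(
    interpretation="simulator score reflects basic laparoscopic skill",
    population="urology residents",
    stakes="formative",
)
argument.add("relationships to other variables", "experts outscored novices")
argument.add("test content", "tasks reviewed by content experts")
print(argument.missing_sources())
```

Because not all five sources are required for every assessment, the list of missing sources is a prompt for judgment rather than a pass/fail check; a high-stake use would usually warrant closing more of the gaps than a formative one.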
Simulation-based assessment tools in a CBME continuum
The CBME model involves a continuum of learning not limited to residency training alone. It is a paradigm that focuses on the acquisition, maintenance, and enhancement of skills at various stages of competence, including undergraduate medical education, residency training, and continuing professional development. As part of a CBME framework, simulation-based training curricula and assessment tools can play a large role in developing and confirming competence within each stage of learning. Simulation-based training methods can reduce the educational footprint of training on patient outcomes while simultaneously allowing trainees and practicing physicians to acquire and maintain various skills. 10 With the demonstration of robust validity evidence, simulation-based methods can also provide educators with the ability to accurately assess competence.
One of the most well-known and widely adopted simulation-based training curricula in surgery is the fundamentals of laparoscopic surgery (FLS) curriculum. 11,12 It not only serves as a comprehensive basic laparoscopic skill training module but has also been shown to have validity evidence for use as an assessment tool for trainees and practicing surgeons alike. In fact, the FLS curriculum is now a mandatory requirement for all general surgery trainees in the United States for certification by the American Board of Surgery. 13 In the urologic community, there is growing evidence for the use of comparable simulation-based basic laparoscopic skill training curricula. Both the American Urological Association Basic Laparoscopic Urologic Skills (AUA BLUS) curriculum 14,15 and the European Association of Urology European Basic Laparoscopic Urologic Skills (E-BLUS) program have been described for the training and assessment of residents and practicing urologists. 16
While the urologic surgical education and simulation literature has grown significantly over the past decade, much of the literature still focuses on an outdated concept of validity. In addition, there is a relative paucity of data supporting the use of various simulation-based assessment tools for the purpose of objective assessment of surgical skill. In particular, there are no data supporting the use of these simulation-based assessment tools for high-stake assessments, a problem not only in the urologic literature but also in the surgical literature in general. Furthermore, studies demonstrating translation of performance to the operating room as a result of a specific training curriculum (previously referred to as “predictive validity”) are rare, and attempts to translate the impact of these interventions to improved patient outcomes are even rarer and methodologically flawed.
Crowd-sourced Assessment of Technical Skills (C-SATS) is a relatively recent tool for obtaining assessments through an online community of lay persons or crowds using expert-developed and validated evaluation tools. 17 Assessments from C-SATS are comparable to those obtained from experts for both basic and advanced robotic skills. 18,19 A recent report from the Michigan Urologic Surgery Improvement Collaborative (MUSIC) showed a strong correlation between the reviews of crowds and peer surgeons for assessment of robot-assisted radical prostatectomy (RARP) skills of 12 robotic surgeons using the Global Evaluation Assessment of Robotic Skills (GEARS) and the Robotic Anastomosis and Competency Evaluation (RACE) (r = 0.78 and r = 0.74, respectively). 19 In a later study, the same group reported a significant correlation between the crowd reviews of the urethrovesical anastomosis and the postoperative outcomes of RARP in terms of the urethral catheter replacement rate and the readmission rate. 20 Importantly, C-SATS assessments are also rapid and cost-effective.
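To make concrete what a correlation between crowd and expert reviews measures, the following minimal sketch computes a Pearson correlation coefficient for two sets of ratings of the same videos. It assumes Pearson's r is the statistic of interest (the cited reports give r values without the method being restated here), and the scores are invented for illustration; they are not data from MUSIC or any other cited study.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations from each mean.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / sqrt(var_x * var_y)


# Hypothetical total scores for the same six surgical videos,
# one list from crowd reviewers and one from expert reviewers.
crowd_scores = [18.2, 20.1, 15.4, 22.3, 19.0, 16.8]
expert_scores = [17.5, 21.0, 14.9, 23.1, 18.2, 17.9]

r = pearson_r(crowd_scores, expert_scores)
print(f"crowd vs. expert correlation: r = {r:.2f}")
```

Under the contemporary taxonomy, a coefficient of this kind contributes evidence from relationships to other variables only; it says nothing about response processes or the consequences of using the scores.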
While not exhaustive, the following tables list studies published in the urologic literature that aimed to evaluate simulators or simulation-based assessment tools; the associated validity evidence presented within each study is also provided (Tables 2–5). The last table shows that most of the current studies on assessment provided evidence of relations with other variables, either as a comparison of performance between experts and novices or as a correlation of performance with previous experience, and neglected other sources of validity evidence, especially response processes and consequences (Table 6).
OR = Operating Room
C-SATS = Crowd-sourced Assessment of Technical Skills; GEARS = Global Evaluation Assessment of Robotic Skills; RACE = Robotic Anastomosis and Competency Evaluation.
Future Directions
The majority of the published urologic literature on the validation of simulators or other simulation-based assessment tools seems to use the old framework of validity evidence. As a result, almost all studies reported validity evidence from relations to other variables in one form or another. Many studies also included, at least on a cursory level, evidence based on test content and internal structure. Unfortunately, most studies have not addressed validity evidence from response processes or from the consequences of testing, and this represents a significant gap in the validity literature.
As the surgical training community moves toward a CBME paradigm enriched with simulation-based training and assessment methodology, it is important that the urologic community embraces the need for robust validity evidence to support the judgments of competency made using these simulation-based assessment tools, particularly if we are going to be making high-stake judgments of competence. As such, it is imperative that the urologic community moves away from the outdated and limited concept of “types” of validity and adopts the contemporary taxonomy of validity evidence, which espouses a unitary concept of validity where all validity is construct validity and in which validity evidence comes from various sources. We must understand that we are not looking to validate a simulator or test itself, but rather looking for validity evidence to support the judgments we make using the “scores” that result from the simulator or test within a specified context.
Ethics Statement
This study was conducted according to the Declaration of Helsinki 2013 and its amendments.
Footnotes
Author Disclosure Statement
No competing financial interests exist.
