Abstract
Mastery learning assessments have been described in simulation-based educational interventions; however, studies applying mastery learning to multiple-choice tests (MCTs) are lacking. This study investigates an approach to item generation and standard setting for mastery learning MCTs and evaluates the consistency of learner performance across sequential tests. Item models, variables for question stems, and mastery standards were established using a consensus process. Two test forms were created using item models. Tests were administered at two training programs. The primary outcome, the test–retest consistency of pass–fail decisions across versions of the test, was 94% (κ = .54). Decision-consistency classification was .85. Item-level consistency was 90% (κ = .77, SE = .03). These findings support the use of automatic item generation to create mastery MCTs which produce consistent pass–fail decisions. This technique broadens the range of assessment methods available to educators that require serial MCT testing, including mastery learning curricula.
An increasing focus on outcome-based education and test-enhanced learning has drawn attention to mastery learning as a promising model for learner assessment (Agrawal et al., 2012; Cook et al., 2013; Frank et al., 2010; Holmboe et al., 2010; Larsen et al., 2008, 2009; McGaghie, 2015). In mastery learning, participants who meet the criteria for mastery advance to the next phase of training; however, participants who do not must practice and retest until they are able to meet the criteria for mastery (Hodges, 2010). While this model has been described in simulation interventions, studies applying mastery learning to multiple-choice tests (MCTs) are lacking (Bandaranayake, 2008; Cook et al., 2013; McGaghie et al., 2014; Yudkowsky et al., 2014).
Four challenges complicate the development and implementation of MCTs for mastery learning: (1) standard setting involves nuanced differences from traditional techniques (Yudkowsky et al., 2015); (2) content is selected for its relevance to the mastery standard, rather than for variable difficulty as in traditional MCTs (Yudkowsky et al., 2015); (3) tests face different threats to validity, including memorization and changes in variance with retesting (Lineberry et al., 2015); and (4) the potential for repeated learner retesting warrants the development of large question banks (Holmboe et al., 2010; Lineberry et al., 2015).
Without practical solutions to these challenges, the potential of mastery learning MCTs will remain unrealized. Regarding the challenges of mastery standard setting and content selection, a traditional consensus process can still be utilized; however, the goals that guide this process must be adjusted (Yudkowsky et al., 2015). Standard setting in mastery learning focuses on the goal of assessing the learner’s preparedness for advancement as defined by a high likelihood of success at the next training level. A consensus process may be used to identify and appropriately weigh content based on relevance to this mastery standard.
Time and expertise requirements make the costs of simply “scaling up” traditional MCT development strategies for mastery MCTs prohibitive (Gierl et al., 2012; Haladyna & Rodriguez, 2013). Automatic item generation (AIG) offers a solution to this problem by using a systematic approach to efficiently create large banks of relevant questions (Gierl & Lai, 2013; Gierl et al., 2012; Haladyna & Rodriguez, 2013; Pugh et al., 2016). The mainstay of this process is the item model, which is a scaffold describing a scenario that can be manipulated using variables to create large quantities of similar questions. Incidental changes to variables within item models (e.g., changing a patient’s age from 46 to 48 years) result in isomorphic test items, whereas variables that change the context or difficulty of items are defined as radicals (Haladyna & Rodriguez, 2013).
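As an illustration of how an item model yields isomorphic items, the following minimal Python sketch combines one value per incidental variable into a question stem. The stem text, variable names, and variable banks are hypothetical placeholders and are not items from this study; answer choices are assumed to stay fixed across iterations.

```python
from itertools import product

# Hypothetical item model: the stem and the variable banks below are
# placeholders for illustration only, not content from the study.
STEM = (
    "A {age}-year-old {sex} with a history of {risk_factor} presents "
    "with chest pain. What is the most appropriate next step?"
)

INCIDENTAL_VARIABLES = {
    "age": [46, 48],
    "sex": ["man", "woman"],
    "risk_factor": ["hypertension", "diabetes", "tobacco use"],
}


def generate_isomorphic_items(stem, variables):
    """Return every stem produced by choosing one value per variable."""
    names = list(variables)
    items = []
    for combo in product(*(variables[name] for name in names)):
        items.append(stem.format(**dict(zip(names, combo))))
    return items


items = generate_isomorphic_items(STEM, INCIDENTAL_VARIABLES)
print(len(items))  # 2 * 2 * 3 = 12 stems from a single item model
```

Because the number of iterations is the product of the bank sizes, adding even one more incidental variable multiplies the number of available items.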
The efficiency of AIG in allowing rapid development of unique items is clear, and there is preliminary evidence of variable psychometric properties for radical test items (Gierl et al., 2016). However, the adaptation of this approach to mastery MCTs and the psychometric properties of isomorphic test items have yet to be described.
In this study, we evaluate a process for developing mastery learning MCTs using two innovations: (1) a mastery standard setting process modified from the procedural mastery learning literature (Yudkowsky et al., 2014) and (2) a process of AIG modified to meet the needs of mastery learning assessments. Results were evaluated in the form of psychometric data from two versions of the test administered using a test–retest approach. The consistency of pass–fail decisions across versions of the test was the primary outcome.
Method
Setting and Participants
The test was administered to 47 residents spanning all years of training at two 3-year emergency medicine programs (University of Chicago: n = 34; University of Illinois at Peoria: n = 13). Participation was voluntary and there were no consequences for nonparticipation. This study was approved by both institutional review boards.
Test Item Development
Item models were developed by two clinician educators (E.S., G.P.) and reviewed by two medical educators (A.T., Y.S.P.). Item models were mapped to a blueprint from an existing curriculum. Authors identified and agreed upon incidental variables within item models; for each area of incidental content, banks of variables (e.g., patient age, gender, disease risk factors) were created. No variables were included in answer choices. A total of 20 item models were developed.
A modified Delphi consensus process was utilized to evaluate expert agreement on the following parameters: (1) alignment of each item model with the test blueprint, (2) appropriateness of answer choices for the question stem, and (3) whether interchanging variables resulted in isomorphic test items. The panel consisted of five emergency physicians with active teaching roles and a mean of 14 years in clinical practice. The consensus process continued until there was 100% agreement on these parameters. Consensus was achieved in two rounds with the alteration of one answer choice and one visual stimulus.
Mastery Standard Setting
The consensus process for mastery standard setting utilized the same modified Delphi technique and panel. The framework used for standard setting was a modification of the patient safety approach described by Yudkowsky et al. (2014) where essential steps have noncompensatory standards. In the modified framework, panelists were asked to rate each test item using one of three categories (essential, important, and good). Anchors were provided for each category. For essential items, “The knowledge tested in these questions is required for advancement beyond the intern level. All second-year emergency medicine trainees should possess this knowledge.” For important items, “Interns are expected to know most of the material tested in these questions to be ready for advancement beyond the intern level.” For good items, “The knowledge tested in these questions is not required for advancement beyond the intern level. Possession of this knowledge may show promise for the learner; however, this level of knowledge is not necessary for emergency medicine interns to be considered ready for promotion to their second year of training.”
There was unanimous agreement in final cut scores of 100% for essential items, 80% for important items, and 0% for good items. Strong consensus (defined as ≥80% agreement) was achieved on 18 of the 20 item models; fair consensus (60% agreement) was achieved on 1 item model, and 1 item model had poor consensus (<60% agreement). Item models with strong and fair consensus were assigned cut scores at the consensus level. The item with poor consensus had responses averaged and was assigned the cut score closest to the average response (important, cut score 80%).
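One way to read these cut scores is as a noncompensatory rule applied category by category: a learner passes only if the cut score is met within every category. The Python sketch below works under that assumption; the text does not spell out the exact scoring rule, and the function name, item identifiers, and category assignments are hypothetical.

```python
# Cut scores (percent correct required) per category, as reported in the text.
CUT_PERCENT = {"essential": 100, "important": 80, "good": 0}


def meets_mastery_standard(item_category, item_correct):
    """Return True if the cut score is met in every category.

    item_category: dict mapping item id -> "essential" | "important" | "good"
    item_correct:  dict mapping item id -> True if answered correctly
    Assumes a noncompensatory rule: strength in one category cannot
    offset a shortfall in another.
    """
    for category, cut in CUT_PERCENT.items():
        ids = [i for i, c in item_category.items() if c == category]
        if not ids:
            continue
        n_correct = sum(1 for i in ids if item_correct[i])
        # Integer comparison avoids floating-point rounding at the cut score.
        if n_correct * 100 < cut * len(ids):
            return False
    return True


# Hypothetical 10-item form: 3 essential, 5 important, 2 good items.
categories = {f"E{i}": "essential" for i in range(3)}
categories.update({f"I{i}": "important" for i in range(5)})
categories.update({f"G{i}": "good" for i in range(2)})

# All essential items correct, 4 of 5 important items correct, no good items.
responses = {item: item.startswith("E") for item in categories}
responses.update({"I0": True, "I1": True, "I2": True, "I3": True})
print(meets_mastery_standard(categories, responses))
```

Under this rule a single missed essential item produces a fail regardless of performance elsewhere, which matches the noncompensatory intent of the patient safety approach.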
Assessment of Consistency: Pass–Fail Decisions
The primary outcome of this study was the consistency of pass–fail decisions, as it has been described as the sole determinant of reliability for mastery learning assessments (Lineberry et al., 2015). Decision consistency was evaluated using (1) the test–retest κ statistic for the consistency of the retest pass–fail decision and (2) decision-consistency classification indices, which incorporate β-binomial procedures (Livingston & Lewis, 1995). Together, these approaches support inferences about the reproducibility of decisions both across test–retest administrations and within a single assessment, controlling for chance agreement. To ensure no new knowledge was gained between tests, both versions were completed in one session where learners received the second test packet as soon as the first was completed.
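For readers less familiar with the κ statistic, the Python sketch below computes raw percent agreement and Cohen's κ for paired pass–fail decisions. It is a generic illustration of the statistic and not the Stata procedure used in this study; the function name and the example data are hypothetical.

```python
def agreement_and_kappa(first, second):
    """Return (observed agreement, Cohen's kappa) for paired binary decisions.

    first, second: equal-length sequences of booleans (True = pass).
    """
    if len(first) != len(second) or not first:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(first)
    observed = sum(1 for a, b in zip(first, second) if a == b) / n
    p_first = sum(1 for a in first if a) / n
    p_second = sum(1 for b in second if b) / n
    # Agreement expected by chance from the marginal pass rates alone.
    expected = p_first * p_second + (1 - p_first) * (1 - p_second)
    if expected == 1.0:
        return observed, float("nan")  # kappa is undefined without variation
    return observed, (observed - expected) / (1 - expected)


# Hypothetical decisions for 10 learners (not study data):
# 7 pass/pass, 1 pass/fail, 1 fail/pass, 1 fail/fail.
test_1 = [True] * 8 + [False] * 2
test_2 = [True] * 7 + [False] + [True] + [False]
observed, kappa = agreement_and_kappa(test_1, test_2)
print(observed, kappa)
```

The example shows why κ can look modest even when raw agreement is high: when most learners fall in one category, chance agreement is already large, so little room remains above it.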
Assessment of Consistency: Test Item Performance
Each version of the test was based on the same 20 item models; however, the two versions of the test featured different incidental variables. Variables for the first version of the test were selected randomly from the lists derived from the consensus process using an online random number generator (“Random.org,” 2017). Variables for the second version of the test were also assigned randomly with the exclusion of previously used variables. Consistency of learner performance across different iterations of test items was analyzed as a secondary outcome.
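The assignment procedure can be sketched in Python as follows. This is an illustration only: it substitutes Python's `random` module for Random.org, assumes that "previously used" means the value drawn for the same variable in the first version, and uses a hypothetical variable bank.

```python
import random


def assign_two_versions(variable_bank, rng):
    """Draw incidental variables for two test versions.

    variable_bank: dict mapping variable name -> list of candidate values
                   (each list needs at least two distinct values).
    rng:           a random.Random instance, so draws can be reproduced.
    The second draw excludes whatever value the first version used.
    """
    first, second = {}, {}
    for name, values in variable_bank.items():
        first[name] = rng.choice(values)
        remaining = [v for v in values if v != first[name]]
        second[name] = rng.choice(remaining)
    return first, second


# Hypothetical bank for one item model (not the study's variable lists).
bank = {
    "age": [44, 46, 48, 52],
    "sex": ["man", "woman"],
    "risk_factor": ["hypertension", "diabetes", "tobacco use"],
}
version_1, version_2 = assign_two_versions(bank, random.Random(2017))
print(version_1)
print(version_2)
```

Excluding the first draw guarantees that a learner never sees an identical stem on the retest, while both versions still come from the same item model.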
Pilot and Deployment
The test was piloted by two emergency medicine residents. Response process validity was addressed by reviewing feedback from the pilot and did not result in changes to the test. Tests were deployed sequentially and in the same order in April 2017. Data were analyzed using Stata Version 14 (StataCorp, College Station, TX). Decision-consistency statistics (Livingston & Lewis, 1995) were estimated using BB-CLASS (University of Iowa, Iowa City, IA).
Results
A total of 912 unique questions were created from the 20 item models and variable lists. The number of unique iterations per item model ranged from a minimum of 24 to a maximum of 128. These iterations can be combined to create millions of unique 20-question tests.
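The number of possible test forms can be bounded with a short calculation. The sketch below assumes that each 20-question test draws exactly one iteration from each of the 20 item models and that question order is ignored; because only the minimum number of iterations per model (24) is used, the result is a conservative lower bound.

```python
import math


def count_unique_tests(iterations_per_model):
    """Number of distinct tests when one iteration is drawn from each model."""
    return math.prod(iterations_per_model)


# Lower bound: treat every one of the 20 item models as if it had only
# the reported minimum of 24 iterations.
lower_bound = count_unique_tests([24] * 20)
print(f"{lower_bound:.2e}")
```

Even this conservative bound far exceeds the "millions" stated above, so exhausting the pool through repeated retesting is not a practical concern under these assumptions.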
Mean scores differed significantly by training level and increased with each additional year of training (Table 1). There were no significant differences in mean learner performance across versions of the test. There were no significant differences in the mean item difficulty between test forms (mean ± standard deviation: .69 ± .28 and .68 ± .26, respectively). Similarly, there were no significant differences in the mean item discrimination between test forms. Item-level differences in difficulty and discrimination were not assessed. Internal consistency reliability (Cronbach’s α) for the two tests was .61 and .65, respectively.
Table 1. Descriptive Statistics: Mean Scores by Training Level.
Note. PGY = postgraduate year; SD = standard deviation.
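The item difficulty and internal consistency statistics reported above follow standard definitions, sketched here in Python. This is a generic illustration and not the Stata code used for the analysis; the function names and the small response matrix are hypothetical, and sample (n − 1) variances are assumed throughout.

```python
from statistics import variance


def item_difficulty(scores):
    """Proportion of learners answering each item correctly.

    scores: list of rows, one per learner, each a list of 0/1 item scores.
    """
    n_learners = len(scores)
    n_items = len(scores[0])
    return [sum(row[j] for row in scores) / n_learners for j in range(n_items)]


def cronbach_alpha(scores):
    """Cronbach's alpha: k/(k-1) * (1 - sum of item variances / total variance).

    Requires at least two items, at least two learners, and nonzero
    variance in the total scores.
    """
    k = len(scores[0])
    item_vars = [variance([row[j] for row in scores]) for j in range(k)]
    total_var = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Hypothetical responses: 4 learners x 3 items, perfectly consistent items.
demo_scores = [
    [1, 1, 1],
    [0, 0, 0],
    [1, 1, 1],
    [0, 0, 0],
]
print(item_difficulty(demo_scores))
print(cronbach_alpha(demo_scores))
```

In this contrived example every item ranks learners identically, so α reaches its maximum; real mastery tests, where most learners answer most items correctly, tend to show restricted variance and lower α.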
The mastery standard had 94% test–retest agreement in pass–fail decisions (κ = .54). In one case, a participant passed the first test and then failed the second; in another case, a participant failed the first test and then passed the second. Decision consistency (proportion of consistent decisions) based on the Livingston and Lewis (1995) statistic was .85 for both test forms. Consistency in performance across versions of test items was also high (90.11% agreement, κ = .77). The consistency of learner performance across different versions of test items was positively correlated with overall learner performance as measured by raw score.
Discussion
This technique successfully addressed four major challenges of mastery learning MCTs described previously. In addition, the psychometric characteristics of the items were preserved across test versions. The primary outcome of this study—the consistency of test–retest pass–fail decisions—showed strong agreement across versions of the test (94% consistent). The test–retest κ statistic for this standard (.54) would suggest “moderate” agreement by traditional standards (Landis & Koch, 1977). Using a more refined statistic for examining the probability of consistent decisions based on Livingston and Lewis (1995) yielded reliability estimates of .85, indicating high reproducibility of pass–fail classification decisions.
The use of variables in question stems serves two main purposes: to increase the number of unique questions available for mastery learning tests and to decrease the likelihood that learners will recognize the question stem upon retesting. In this study, higher performing learners were more consistent in their answers, while lower performing learners showed more variability. These findings support the use of this method for repeated testing: they suggest that low-performing learners do not recognize that isomorphic questions test the same construct (even in back-to-back testing), whereas high-performing learners do. Limitations of this study include a small sample size, short test length, and low stakes. Although this was a pilot study of emergency medicine residents, it is reasonable to anticipate that these findings will be consistent across other specialties and health professions.
Conclusion
AIG and purposeful standard setting can be applied to create mastery MCTs. This methodological approach facilitates efficient development of large question banks with justifiable mastery standards and consistent pass–fail decisions. This technique broadens the range of assessment methods available to educators implementing mastery learning curricula.
Footnotes
Acknowledgments
The authors wish to acknowledge Drs. Christine Babcock, Seth Truger, Ira Blumen, and Daniel Robinson for their support of this project.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
