Abstract
Reliable and objective assessment of clinical decision-making skills has been a long-standing goal in occupational testing in the allied health professions. With this goal in mind, the key features problem (KFP) format was developed to elicit targeted decisions about key features of clinical scenarios. To build on a small body of empirical evidence evaluating their efficacy, this study evaluates whether KFPs successfully assess higher order cognitive processes. Analysis of objective data (item length and difficulty, item performance, and response times) and subjective data (expert ratings of cognitive complexity) supported the proposition that KFPs tend to be more cognitively complex than conventional multiple-choice questions. Not only were they rated as more complex, but this complexity accounted for some of the increase in time spent responding to these items. Results support the use of KFPs in standardized assessments for measuring higher order cognitive processes such as clinical decision making.
Decision making can be viewed as a stage in the problem-solving process, where the problem solver must recognize and understand the nature of the problem, generate possible solutions, evaluate the likely benefits, risks, and consequences of those alternative solutions, and make one or more decisions about the best course/courses of action. Clinical decision making involves problem solving in clinical cases, and clinical decision-making skills are important in the allied health field because patients are put at potential risk if health care personnel cannot apply their knowledge and training to real-world problems in an integrated and appropriate manner. As such, developers of licensure and certification examinations for allied health professions have expended considerable efforts and resources toward measuring clinical decision-making skills.
Many allied health and medical organizations have focused their efforts on “authentic” assessments of competencies with clinical scenarios. The National Board of Medical Examiners has conducted continual research with computer-based clinical scenarios to assess clinical reasoning, now known as the computer-based case simulation (CCS) section of the United States Medical Licensing Examination (USMLE, Step 3). In veterinary medicine, the National Board Examination Committee for Veterinary Medicine used a series of patient management problems (the clinical competency test) that required test takers to choose among a variety of alternative actions for a clinical case scenario, some of which may be appropriate and others either inappropriate or even contraindicated. Other medical and some allied health organizations have attempted to assess clinical reasoning through the use of oral examinations. All of these complex item and scoring formats required considerable expenditures of staff and subject matter expert (SME) time, as well as considerable financial resources. The decision to make such an investment demonstrates the importance of assessing clinical decision making in a licensing/certification context.
Patient management problems and oral examinations have been largely abandoned in certification and licensure testing. In the case of the patient management problem format, psychometric opinion has been sharply divided as to its reliability and validity. Many have expressed the opinion that the expense and time required to produce such items are not repaid in terms of measurement reliability or validity. Oral examinations were discontinued due to measurement issues, time constraints, and concerns with bias and candidate anonymity.
Allied health and medical organizations have continued their efforts to assess clinical decision-making skills, primarily through the use of the multiple-choice item format. Efforts are made to improve fidelity, sometimes through the use of clinical photographs, video recordings, or other means, but the multiple-choice question (MCQ) is still the format of choice. Sometimes a “standardized patient” format will utilize interactions with a patient/actor to assess patient communications or even diagnostic skills, but this is a complicated and expensive approach. The same can be said of the multistation, objective structured clinical examination (OSCE) format, generally used to assess various clinical skills. In short, the allied health certification and licensure testing landscape is still in need of a better measurement strategy to assess clinical decision making. What is needed is an item format that is easier and less expensive to develop and maintain, more straightforward to administer and score, and stable in its measurement attributes, all while assessing clinical decision making at a higher cognitive level.
The key features problem (KFP) is a promising alternative measurement strategy designed to improve upon the use of scenario-based assessments in this context (Bordage & Page, 1987; Page & Bordage, 1995; Page, Bordage, & Allen, 1995). KFPs present clinical scenarios and can elicit short-answer constructed responses or can be objectively scored from a menu of response options where test takers choose multiple responses and are scored against a correct response profile derived by experts. A small body of research evidence supports the proposition that KFPs measure decision making which results from higher order cognitive processes and problem-solving strategies (Schuwirth et al., 1999; Schuwirth, Verheggen, van der Vleuten, Boshuizen, & Dinant, 2001). Haladyna (2004) suggests that the KFP strategy is a “strong contender among other approaches to modeling higher level thinking that is sought in testing competence in every profession” (pp. 169–170) but goes on to say that more research is needed to evaluate the degree to which KFPs accomplish the “elusive goal” (p. 170) of assessing clinical problem solving. The current study addresses this call for research by further evaluating the proposition that KFPs assess higher order cognitive processes consistent with clinical decision making.
The idea underlying KFPs is the notion of a “key feature,” or critical step in the resolution of a clinical problem (Bordage & Page, 1987; Farmer & Page, 2005). For example, the key feature of a solution may be a step where a practitioner is most likely to make an error, or a decision point where the next course of action could be harmful to the patient if the wrong decision is made. KFPs can address common errors made by practitioners as well as rare but critical conditions such as infrequent emergency procedures (Kim, Amin, & Ng, 2007). The KFP format presents a description of the clinical case scenario followed by questions that focus solely on the identified key features of the case. The questions typically provide several different response options from which multiple critical elements can be selected. Scoring can be dichotomous (i.e., credit for choosing all correct options and no incorrect options, versus no credit for partial, incorrect, or too many selections) or can allow partial credit to be assigned. In addition, “automatic zero” options may be included, for example in a situation where one particular decision kills the patient. Farmer and Page (2005) and Haladyna (2004) provide brief guides to the development of KFPs.
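The scoring options described above can be sketched in a few lines of Python. This is only an illustration, not a scoring procedure taken from the cited guides: the function name `score_kfp`, its parameters, and in particular the partial-credit rule (proportion of keyed options selected, with no credit for over-selection) are assumptions made for the example.

```python
def score_kfp(selected, key, automatic_zero=(), partial_credit=False):
    """Score one KFP response (hypothetical helper, for illustration only).

    selected       -- options chosen by the test taker
    key            -- options the experts keyed as correct
    automatic_zero -- options that void the item if chosen
    partial_credit -- if False, use the dichotomous rule described in the text
    """
    selected, key = set(selected), set(key)

    # An "automatic zero" option overrides everything else.
    if selected & set(automatic_zero):
        return 0.0

    if not partial_credit:
        # Dichotomous: all correct options and no incorrect ones.
        return 1.0 if selected == key else 0.0

    # Assumed partial-credit rule: no credit for too many selections,
    # otherwise the proportion of keyed options that were selected.
    if len(selected) > len(key):
        return 0.0
    return len(selected & key) / len(key)


print(score_kfp({"A", "C", "E"}, {"A", "C", "E"}))
print(score_kfp({"A", "C"}, {"A", "C", "E"}, partial_credit=True))
print(score_kfp({"A", "C", "F"}, {"A", "C", "E"}, automatic_zero={"F"}))
```

Other partial-credit schemes (for example, penalizing incorrect selections) are equally plausible; the choice is a test design decision rather than a property of the KFP format.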
KFPs are not intended to measure factual knowledge; they are intended to measure application and synthesis of knowledge and perhaps experience, coupled with reasoning and judgment, in the selection of appropriate courses of action for scenarios that are actually encountered on the job. Research has indicated that KFPs correlate only moderately with measures of factual knowledge (Fischer, Kopp, Holzer, Ruderich, & Junger, 2005; Hatala & Norman, 2002), and that the thought processes underlying experts’ responses to KFPs are more elaborate and qualitatively different when compared to those underlying more factual-based multiple-choice items (Schuwirth et al., 2001). For instance, Schuwirth et al. (2001) found that experts used less information and responded more quickly to KFPs than did novices, whereas no such expert–novice differences emerged on the more factual MCQs. In addition, experts engaged in more “nonsequential” processing of information in KFPs, while novices accessed the information in a more sequential manner. These findings support the notion that KFPs elicit different reasoning and decision-making strategies than more factual items. In another study, Schuwirth et al. (1999) found that KFPs significantly differentiated students from a medical school that used an exclusive problem-based learning curriculum from a similar school that used primarily a lecture-based curriculum, further supporting the notion that KFP performance relies on problem-based clinical decision-making skills.
While the existing research on KFPs is promising, more evidence is needed to support the notion that these relatively elaborate items actually do go beyond more conventional, brief, single-answer MCQs and tap into the higher cognitive processes they purport to measure. Of the various objectively scored multiple-choice item formats available to test developers for designing their assessments (see Haladyna, 2004, for different options), the question is “Do KFPs stand apart from more conventional, brief single-answer MCQs in terms of the cognitive processes they elicit?” In other words, to what degree do KFPs, which to date have appeared superior to other attempts at measuring clinical decision making, actually accomplish the “elusive goal” of measuring higher level cognitive processes? The current study evaluated this question using a combination of SME ratings on the cognitive complexity of KFP and MCQ items and the response time and performance of actual test takers.
We proposed three research questions for this study. First, do experts view KFPs as more cognitively complex than MCQs? Answering this question provides new information on whether high-level cognitive processes are involved in answering KFPs to a greater extent than in more conventional MCQs. Second, does this complexity explain differences in response times between KFPs and MCQs, after controlling for differences in item length and difficulty? KFPs tend to be longer than MCQs, which adds reading time, and KFPs also tend to be more difficult than MCQs, which can affect response times. The question here is not only whether more time is spent on KFPs relative to MCQs over and above these factors, but also the degree to which this additional time can be accounted for by the KFPs eliciting higher level cognitive processes. Answering this question helps build the pattern of evidence regarding whether KFPs are tapping into clinical decision making.
Third, do these residual differences in response times (after controlling for item length and difficulty) have a different pattern of relationship to item performance for KFPs versus MCQs? To a certain degree, it can be argued that experts either know the answers or they do not, so that delayed responses may indicate that the test taker simply does not know the answer. However, if KFPs require higher level cognitive processes, then we would expect at least some of the extra time to translate into better item performance, whereas delayed responses on the simpler MCQs may be a stronger indication of a lack of knowledge. In other words, if KFPs involve higher level processes then answering correctly should take a bit longer as test takers work through their decision processes.
Method
Setting
This study was conducted in the context of examination development for two dietetic specialty areas (gerontology and oncology). A total of six different forms were used. These examinations contain a combination of conventional, relatively brief, four-option MCQs, and the more elaborate KFPs.
Item Development
The procedures for developing MCQs and KFPs were similar. Participants were provided a formal orientation in the principles of good item construction, opportunities to familiarize themselves with the content specifications, and opportunities to work with fellow participants to create the items. For each item, considerable emphasis was placed on specifying the linkage of item content to the content specifications and providing a citation from an authoritative reference source. Therefore, each item was linked to a specific task and knowledge from the test specifications and to a page or section of an authoritative reference source. There were numerous opportunities for individual assistance with item development as well as opportunities for review by other participants.
The primary distinction between the development process for MCQs and KFPs was the degree of focus on the key features of the problems for clinical decisions. MCQs were relatively brief and had four response options with one correct answer and focused on either applied questions regarding facts, concepts, and procedures or scenario-based questions regarding commonly encountered situations that require analysis, interpretations, and decisions. KFPs differed from scenario-based multiple-choice items in the degree of complexity of the scenarios and the degree to which they emphasized the candidate’s ability to triage among multiple correct answers and multiple distractors, many of which could be correct, but were not the actions or interventions that addressed the most urgent needs of the patient in the scenario. The intent of the KFP was to assess the candidates’ abilities to implement more than one step in their decision-making process, discriminate between relevant and irrelevant information presented in the scenario, and prioritize correct responses in terms of urgency of the situation. Test takers were instructed on how many responses to choose when responding to each item (e.g., “Choose three”), and they were scored in an all-or-none fashion. SMEs determined the minimum number of these correct options needed to receive a point for each item and those selecting fewer than the minimum received no credit for that item.
Examination Administration
Examinations were delivered to certification candidates as computer-based tests at testing centers where proctors provided standardized instructions. The candidates were actual test takers who were completing the examinations for certification and not for the purposes of this study. For this study, the data were extracted from database archives without any identifying information associated with the test takers.
Cognitive Complexity Surveys
Multiple survey forms were constructed using subsets of the KFPs from each of the six operational examination forms used in this study interspersed in a counterbalanced fashion with six corresponding subsets of MCQs drawn randomly from the same forms. The correct answers were not marked in order to force raters through the thought process of answering the items. Raters were instructed to read each item, think about the correct answer/answers, and decide on the level of cognitive processing required to answer correctly. Ratings were given on a 4-point scale derived from the language of common cognitive level taxonomies: 1 = Factual, 2 = Application, 3 = Interpretation, and 4 = Synthesis. Table 1 provides the definitions that were developed using the terminology of the focal professions to make them most meaningful to expert raters from these professions. The MCQs and KFPs were counterbalanced on the survey and were not labeled in any way for the raters as being one type of item or another.
Table 1. Definitions of Cognitive Complexity Levels for the Rating Task
Note. a The details shown in this example after the “e.g.” are for the oncology examinations.
Procedure and Participants
SMEs were contacted via e-mail and asked to serve as raters in a quality-control research study designed to assess the cognitive processes elicited by certification candidates when they take the examinations. All invited SMEs were certified and in good standing in their respective specialty areas. Most were drawn at random from a database of certificate holders, while SMEs who had previously served as item writers were also invited to participate. The surveys were put on a password-protected Internet site, and when SMEs logged in they were randomly assigned one of the alternate forms in their specialty area (gerontology vs. oncology). They were allowed 3 weeks to log in and complete their ratings, and on completion they were sent an honorarium of $25. Each participant was assured that survey ratings would be combined with those of other participants to determine group trends. Individual ratings were kept confidential and had no real or potential impact on any aspect of their employment or status as a registered dietitian. All participants and their data were treated in accordance with the American Psychological Association (APA) ethical guidelines.
Variables and Data Collected
Comparisons of KFPs and MCQs were first made in terms of SME ratings of cognitive complexity on the survey forms. Test taker response times (measured in seconds) and item performance (correct/incorrect) were then extracted from data archives for each of the items that were included in the SME surveys. In addition, each test taker’s total examination score (i.e., their sum correct across all items, on which certification decisions were made in practice) was collected, and these scores were converted into within-form standard scores. These four variables (cognitive complexity, response times, item performance as both binary correct/incorrect values and aggregated percentage correct p values, and overall examination performance) were the primary focal variables of the study.
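The within-form standardization mentioned above can be illustrated with a short sketch. The function name and the use of the sample standard deviation are assumptions made for this example; the article does not state which standard deviation formula was used.

```python
from collections import defaultdict
from statistics import mean, stdev


def within_form_z(records):
    """Convert (form_id, total_score) pairs to z-scores computed within each form.

    Hypothetical helper: each form needs at least two test takers and some
    score variability, because the sample standard deviation is the divisor.
    """
    scores_by_form = defaultdict(list)
    for form_id, score in records:
        scores_by_form[form_id].append(score)

    # Mean and sample SD per form.
    stats = {f: (mean(s), stdev(s)) for f, s in scores_by_form.items()}

    # Standardize each score against its own form's mean and SD.
    return [(score - stats[f][0]) / stats[f][1] for f, score in records]


# Made-up scores on two made-up forms.
example = [("A", 10), ("A", 20), ("A", 30), ("B", 50), ("B", 70)]
print(within_form_z(example))
```

Standardizing within form puts total scores from forms of different length or difficulty on a common scale before they are used as a proficiency control.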
Because MCQs and KFPs typically differ in length, which may covary with the primary variables (especially response times), two separate length variables were coded to incorporate statistical controls. First, we coded item stem length as a simple count of words within the “prompt” or stem of the item. This accounted for each word the test taker would need to read from the start of the item text up to the point where they begin reading and deciding on the response options. Second, we coded the total word count of the block of response options.
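As a minimal sketch of the two length variables, the following counts whitespace-separated words in the stem and across the response options. The tokenization rule and the example item are assumptions for illustration; the article does not specify how words were delimited.

```python
def word_count(text):
    """Count whitespace-separated tokens (assumed definition of a 'word')."""
    return len(text.split())


def item_lengths(stem, options):
    """Return (stem word count, total word count across all response options)."""
    return word_count(stem), sum(word_count(option) for option in options)


# Made-up item, not drawn from the examinations.
stem = "Which nutrient is most limited?"
options = ["Protein", "Vitamin D", "Iron", "Dietary fiber"]
print(item_lengths(stem, options))
```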
Data Analysis
Survey response rates were tallied, chi-square tests were used to evaluate whether response rates for the different forms varied across demographic variables, and rater reliability was evaluated using generalizability theory. Ratings of cognitive complexity were compared between item types using a graphical display and a two-way mixed analysis of variance (ANOVA) with item type as the repeated variable and survey form as the independent-group variable. Item stem and option length, item difficulties, average cognitive complexity ratings of items, and an item type dummy variable (MCQ = 0, KFP = 1) were then used as predictors of response time in a hierarchical multiple regression analysis. Finally, standardized examination scores, item length, item type, item response time, and an interaction term between item type and item response time were used as predictors of item performance (correct/incorrect) in a multiple logistic regression analysis.
Results
Response Rate, Respondent Characteristics, and Reliability of Ratings
Participation requests were sent to 183 SMEs, and 101 (55.2%) agreed to participate and provided their ratings of items within their specialties. All participants were college educated, with 46 (45.5%) possessing bachelor’s degrees, 52 (51.5%) possessing master’s degrees, and 3 (3.0%) possessing doctoral degrees. Ninety-nine (98.0%) were employed in dietetic practice at the time of the study and likewise 99 (98.0%) were actively providing oncology nutrition services, with 81 (80.2%) providing these services for more than 20 hours per week and 62 (61.0%) for more than 30 hours per week. All 101 were credentialed as registered dietitians with the Commission on Dietetic Registration for at least 3 years, while 64 (63.4%) had been for over 10 years.
Random assignment to groups resulted in 12–22 raters per form. Affirming the success of the random assignment, chi-square tests of independence indicated that rater group membership was unrelated (ps > .05) to all demographic variable categories summarized above, supporting the equivalence of the groups of raters along these relevant demographic characteristics. Rater-by-item generalizability theory analyses for each form resulted in generalizability coefficients of .76 to .95 across examination forms, and .89 when pooled across forms (for both relative and absolute error), indicating strong interrater reliability for the ratings.
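For readers unfamiliar with rater-by-item generalizability analyses, the sketch below estimates variance components for a fully crossed design with one rating per cell and returns relative and absolute coefficients for the mean rating across raters. It assumes items are the object of measurement and that no ratings are missing; the article does not give the computational details, so this should be read as one standard formulation rather than the authors' exact procedure.

```python
def g_coefficients(ratings):
    """ratings[i][r] is the rating of item i by rater r (fully crossed, no gaps).

    Returns (relative, absolute) generalizability coefficients for the
    mean rating over the n_r raters. Illustrative sketch only.
    """
    n_i, n_r = len(ratings), len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n_i * n_r)
    item_means = [sum(row) / n_r for row in ratings]
    rater_means = [sum(row[r] for row in ratings) / n_i for r in range(n_r)]

    # Mean squares from the two-way layout without replication.
    ms_item = n_r * sum((m - grand) ** 2 for m in item_means) / (n_i - 1)
    ms_rater = n_i * sum((m - grand) ** 2 for m in rater_means) / (n_r - 1)
    ss_resid = sum(
        (ratings[i][r] - item_means[i] - rater_means[r] + grand) ** 2
        for i in range(n_i)
        for r in range(n_r)
    )
    ms_resid = ss_resid / ((n_i - 1) * (n_r - 1))

    # Variance components; negative estimates are truncated at zero.
    var_resid = ms_resid
    var_item = max((ms_item - ms_resid) / n_r, 0.0)
    var_rater = max((ms_rater - ms_resid) / n_i, 0.0)

    relative = var_item / (var_item + var_resid / n_r)
    absolute = var_item / (var_item + (var_rater + var_resid) / n_r)
    return relative, absolute


# Made-up ratings: the second rater is uniformly one point more lenient.
print(g_coefficients([[1, 2], [2, 3], [3, 4]]))
```

In the made-up example, a constant leniency difference leaves the relative coefficient untouched but lowers the absolute coefficient. Relative and absolute coefficients that coincide, as reported above, would be expected when the rater main-effect variance is close to zero.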
Do SMEs View KFPs as Requiring More Complex Cognitive Processes than MCQs?
Figure 1 provides a simple tally of cognitive level ratings, split by item type for visual comparison. A clear pattern is indicated where MCQs were more often seen as factual and progressively less often seen as measuring the application, interpretation, and synthesis levels. KFPs were least often seen as factual and more often seen as measuring the application, interpretation, and synthesis levels. Because this figure tallies ratings across items that were provided by the same SMEs, data were aggregated to means across raters and item types (MCQs vs. KFPs), and the mean ratings were used as the unit of analysis for tests of statistical significance.

Figure 1. Frequency counts of cognitive level ratings by item type summed across items and raters.
Figure 2 displays a plot of the average ratings for each item type across the six test forms. A two-way mixed ANOVA revealed a statistically significant and quite strong interaction between Item Type and Test Forms on cognitive complexity ratings, F(5, 95) = 18.00, p < .001, partial η² = .49. The main effect of item type was also significant with a stronger effect size, F(1, 95) = 295.53, p < .001, partial η² = .76, and the main effect of forms was also significant and moderately strong, F(5, 95) = 5.35, p < .001, partial η² = .22. The main effect of item type, which was the strongest of the three effects, is of particular interest, with Figure 2 showing clearly that KFPs (M = 2.73; SD = .31) were rated higher, on average, than the MCQs (M = 1.93; SD = .44). While the significant interaction suggests this effect is moderated by form, simple effects with a Bonferroni-corrected α of .05/6 = .008 suggest that the difference is nevertheless significant for all six forms. As seen in Figure 2, the mean ratings for KFPs were always higher than those for MCQs, and the nature of the interaction is such that the effect was simply stronger for the gerontology test forms (η²s = .58 and .68) than for the oncology test forms (η²s = .08–.23).

Figure 2. Means of cognitive level ratings for MCQs and KFPs across test forms. MCQs = multiple-choice questions; KFPs = key features problems.
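The effect sizes reported for the ANOVA can be cross-checked from the F ratios and their degrees of freedom, since partial η² = (F × df_effect) / (F × df_effect + df_error). The short sketch below applies this identity to the three reported effects and also computes the Bonferroni-corrected α; the function name is an assumption for the example.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Recover partial eta squared from an F ratio and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# F ratios and degrees of freedom as reported in the text.
reported = {
    "item type x form": (18.00, 5, 95),
    "item type": (295.53, 1, 95),
    "form": (5.35, 5, 95),
}
for effect, (f_value, df1, df2) in reported.items():
    print(effect, round(partial_eta_squared(f_value, df1, df2), 2))

# Bonferroni correction across the six simple-effects tests.
print(round(0.05 / 6, 3))
```

The recovered values round to the partial η² values given in the text, which is a quick consistency check on the reported statistics.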
Does Cognitive Complexity Help to Explain Differences Between KFPs and MCQs in Test Taker Response Times?
The mean response times computed from actual test takers for each of the 166 items (Ns = 24 to 79 per item, depending on the test form) were then regressed onto the dummy variable representing item type (0 = MCQ, 1 = KFP) as well as the mean cognitive complexity rating across SMEs, after controlling for response time differences due to variability in item length and difficulty. Table 2 provides the descriptive statistics for each variable in this analysis broken down by MCQ versus KFP. For the regression analysis, item difficulties were converted into logits and then multiplied by −1 so that higher numbers represent more difficulty. All control variables were mean centered before entering them into the regression analysis to facilitate interpretation of the multiple regression results.
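The difficulty transformation and mean centering described above can be sketched as follows. The sketch assumes that the item difficulties were classical p values (proportion correct) and that "logits" refers to their log-odds; the article does not state this explicitly, and the p values below are made up.

```python
import math


def difficulty_logit(p_value):
    """Log-odds of a proportion-correct value, multiplied by -1
    so that higher numbers represent more difficult items."""
    return -math.log(p_value / (1.0 - p_value))


def mean_center(values):
    """Subtract the mean so that 0 represents an average item."""
    average = sum(values) / len(values)
    return [v - average for v in values]


# Made-up p values for four items.
p_values = [0.90, 0.75, 0.50, 0.25]
difficulties = [difficulty_logit(p) for p in p_values]
print(mean_center(difficulties))
```

With centered predictors, the regression constant can be read as the expected response time for an item of average length and difficulty.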
Table 2. Descriptive Statistics for Control Variables (Item Length and Difficulty), Mean Cognitive Level Ratings, and Response Times
Notes. MCQs = multiple-choice questions; KFPs = key features problems.
Table 3 presents the results. Model 1 accounted for the effects of item length and included the item stem word count and options word count variables, which both significantly increased response times and together explained 57% of the variance (R² = .57, p < .001). Model 2 added item difficulty, which was found to account for further increases in response times and produced an increment of 13% to the explained variance. The total variance explained by these control variables was approximately 71%, and the β weights indicated that differences in stem word counts contributed the most to differences in overall response times. Due to mean centering, the raw equation for Model 2 reveals that for items of average length and difficulty test takers spent approximately 85.26 seconds responding; all else equal, each additional word in the stem adds approximately 0.45 seconds, each additional word in the response options adds approximately 0.86 seconds, and each 1-logit increase in item difficulty adds approximately 11.23 seconds of response time.
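The raw Model 2 equation described above can be written out as a small worked example. The coefficients are the ones stated in the text; the function name and the example deviations are hypothetical, and inputs are deviations from the mean because the predictors were mean centered.

```python
# Unstandardized Model 2 weights as stated in the text.
CONSTANT = 85.26      # seconds for an item of average length and difficulty
B_STEM = 0.45         # seconds per additional stem word
B_OPTIONS = 0.86      # seconds per additional word in the response options
B_DIFFICULTY = 11.23  # seconds per 1-logit increase in difficulty


def predicted_seconds(stem_dev=0.0, options_dev=0.0, difficulty_dev=0.0):
    """Predicted mean response time; arguments are deviations from the item means."""
    return (CONSTANT
            + B_STEM * stem_dev
            + B_OPTIONS * options_dev
            + B_DIFFICULTY * difficulty_dev)


# Hypothetical item: 20 stem words and 10 option words above average,
# and half a logit harder than average.
print(round(predicted_seconds(20, 10, 0.5), 2))
```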
Table 3. Results of Multiple Regression Analysis Predicting Mean Response Time Per Item From the Item’s Cognitive Complexity, Controlling for Item Length and Difficulty
Notes. N = 166 items, which was the unit of the analysis. To facilitate interpretation of the constant and b weights, the Stem Word Count, Options Word Count, and Item Difficulty variables were mean centered prior to the regression analysis, while the Mean Cognitive Level Rating variable was recoded such that the factual level equals 0. MCQs = multiple-choice questions; KFPs = key features problems.
a Model 1 R² = .573 (adjusted R² = .568), F(2, 159) = 106.75, p < .001. b Model 2 R² = .707 (adjusted R² = .701), ΔR² = .13, F(1, 158) = 72.01, p < .001. c Model 3 R² = .714 (adjusted R² = .707), ΔR² = .01, F(1, 157) = 4.09, p = .045. d Model 4 R² = .728 (adjusted R² = .719), ΔR² = .01, F(1, 156) = 7.916, p = .006.
In Model 3, the item type dummy variable (0 = MCQ, 1 = KFP) was entered and found to explain an additional 1% of variance in response time, which was statistically significant, ΔR² = .01, p = .045. The unstandardized weight shows that of the 68.88-second raw difference in response times between MCQs and KFPs (derived from Table 2), 15.46 seconds still remain after accounting for differences in item length and difficulty. However, when mean cognitive level ratings were added in Model 4, they were found to be significant as well (ΔR² = .01, p = .006) and rendered the coefficient for item type nonsignificant (p = .122). This pattern, where item type was significant in the model excluding cognitive levels but became nonsignificant when cognitive levels were added, is consistent with a mediation effect where differences in response times between MCQs and KFPs are largely explained by differences in the cognitive levels they measure.
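The incremental tests in a hierarchical regression follow the standard F-change formula, F = (ΔR² / df_added) / ((1 − R²_full) / df_residual). The sketch below applies it to the rounded R² values from the table notes; because those inputs are rounded to three decimals, the results only approximate the reported F values. The function name is an assumption for the example.

```python
def f_change(r2_reduced, r2_full, df_added, df_residual):
    """F statistic for the increase in R-squared between nested regression models."""
    delta_r2 = r2_full - r2_reduced
    return (delta_r2 / df_added) / ((1.0 - r2_full) / df_residual)


# Rounded R-squared values and residual degrees of freedom from the table notes.
steps = {
    "Model 1 -> 2 (adds item difficulty)": (0.573, 0.707, 1, 158),
    "Model 2 -> 3 (adds item type)": (0.707, 0.714, 1, 157),
    "Model 3 -> 4 (adds cognitive level)": (0.714, 0.728, 1, 156),
}
for label, args in steps.items():
    print(label, round(f_change(*args), 2))
```

Small ΔR² values are especially sensitive to rounding, so the Model 3 step in particular will not match the reported F(1, 157) = 4.09 exactly.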
Does Time Spent Responding Relate Differently to Item Performance for KFPs versus MCQs?
Table 4 presents the results of a logistic regression analysis focusing on item performance (incorrect = 0 and correct = 1) as the dependent variable. Each test taker’s standardized total score on the examination form was entered as a control variable in Model 1 to partial out differences in item performances due to levels of proficiency. Item length (stem and option word count) and item type were also entered in Model 1 to partial out the differences in item difficulty between MCQs and KFPs. Model 1 was found statistically significant, χ²(4) = 189.42, p < .001, with a Nagelkerke R² of .07. Model 2, which added item response time, was significantly stronger than Model 1, Δχ²(1) = 131.31, p < .001, and incremented the Nagelkerke R² by .04 up to a total of .11. The nature of the effect was such that after accounting for differences in test taker abilities and item lengths and difficulties, more time spent on the items was associated with a lower likelihood of answering correctly.
Table 4. Results of Logistic Regression Analysis Predicting Item Performance From Time Spent on KFPs Versus MCQs, Controlling for Test Taker Proficiency Levels
Note. MCQ = multiple-choice question; KFP = key features problem.
a Model 1 Nagelkerke R² = .07, χ²(4) = 189.42, p < .001. b Model 2 Nagelkerke R² = .11, Δχ²(1) = 131.31, p < .001. c Model 3 Nagelkerke R² = .12, Δχ²(1) = 20.81, p < .001.
In Model 3, however, time spent was found to significantly interact with item type, Δχ²(1) = 20.81, p < .001, and incremented the Nagelkerke R² by .01 up to a total of .12. The nature of the interaction effect is plotted in Figure 3, which shows the predicted probabilities of success for MCQs and KFPs as a function of the number of seconds spent on the item. Note that the control variables (test taker proficiency, item stem word count, and options word count) were held constant at their average values when plotting these probabilities, and the time elapsed across the horizontal axis runs approximately through the 99th percentile (351 seconds or 5.85 minutes) of time spent in the observed data. Figure 3 shows that the rate of decline in predicted probabilities of success over time for KFPs was not as steep as it was for MCQs.

Figure 3. Interaction plot of the effects of time spent on item performance for MCQs versus KFPs. MCQs = multiple-choice questions; KFPs = key features problems.
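To show how an interaction plot of this kind is generated from a logistic model, the sketch below computes predicted probabilities for each item type as a function of seconds spent. The coefficients are hypothetical placeholders chosen only to reproduce the qualitative pattern (a shallower decline for KFPs); they are not the estimates from Table 4, and the control variables are assumed to be absorbed into the intercept at their mean values.

```python
import math

# HYPOTHETICAL coefficients for illustration only; not the study's estimates.
COEF = {
    "intercept": 1.5,          # log-odds for an MCQ at 0 seconds, controls at mean
    "kfp": -0.5,               # item type dummy (0 = MCQ, 1 = KFP)
    "seconds": -0.006,         # time slope for MCQs
    "kfp_x_seconds": 0.003,    # interaction: change in time slope for KFPs
}


def p_correct(seconds, is_kfp, coef=COEF):
    """Predicted probability of a correct response under the assumed model."""
    logit = (coef["intercept"]
             + coef["kfp"] * is_kfp
             + coef["seconds"] * seconds
             + coef["kfp_x_seconds"] * is_kfp * seconds)
    return 1.0 / (1.0 + math.exp(-logit))


# Tabulate the two curves at a few time points.
for seconds in (0, 100, 200, 300):
    print(seconds, round(p_correct(seconds, 0), 3), round(p_correct(seconds, 1), 3))
```

A positive interaction coefficient makes the KFP time slope less negative than the MCQ slope, which is what a flatter KFP curve in such a plot reflects.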
Discussion
The primary purpose of this study was to evaluate the degree to which KFPs tend to measure cognitive processes that are more complex than conventional MCQs, in order to expand the existing body of evidence (e.g., Schuwirth et al., 1999, 2001) regarding their validity for assessing clinical decision making. SME ratings of the cognitive complexity of KFP and MCQ items were found to be reliable and, addressing our first research question, were found to consistently reveal that they believed the KFPs measured higher cognitive levels than the MCQs. Thus, in the judgment of SMEs, the KFPs used in this study tended to achieve the goal of assessing relatively more complex cognitive processes than MCQs in the decision-making process required for selecting the correct answers to the items.
With respect to our second research question, and consistent with the evidence that KFPs require more complex cognitive processes, our multiple regression analysis of response times revealed that test takers still spent more time responding to the KFPs even after controlling for the fact that the KFPs tended to be longer and more difficult than MCQs. On average, test takers spent approximately 69 more seconds on KFPs than MCQs, and our results suggested that approximately 15 of these seconds could not be accounted for by differences in item length and difficulty alone. Moreover, when cognitive level ratings were added to the regression model, the differences in response times between KFPs and MCQs were reduced to nonsignificance. This finding is consistent with a mediation effect where the differences in response times between item types appear to be largely attributable to the higher level processes elicited by the KFPs. Thus, in response to our second research question, it appears that cognitive complexity does in fact help to explain longer response times on KFPs relative to MCQs. In other words, it is not only reading time and item difficulty that produce longer response times to KFPs but also more complex problem-solving and decision-making processes. This again supports the validity of KFPs as measures of higher level cognitive processes.
While cognitive complexity helps explain some of the extra time test takers spend responding to KFPs, the third research question went a little further and asked whether the extra time actually translates into greater gains in performance on KFPs versus MCQs. If so, this is again consistent with the proposition that the extra time tends to be spent engaging in a more complex decision-making process required to successfully answer the item. When evaluating this question, the logistic regression results suggested that, after taking into account differences in test taker proficiency and differences in item length and difficulties, the more time a test taker spends on an item the less likely they are to answer correctly. However, the interaction model suggested that this pattern holds more strongly for MCQs than for KFPs. This is consistent with the SME ratings, which identified many MCQs as factual in nature, where a short factual question is likely to result in a quick response for test takers who know the answer. On the other hand, SMEs felt that more of the KFPs required complex cognitive processes, and the regression results described earlier suggested that some of the extra time spent on KFPs was likely due to those complex processes. This may explain why quickness of responding did not relate as strongly to performance for KFPs as it did for MCQs. In other words, to a certain degree at least, it appears that it legitimately takes more time on KFPs to work through a more complex problem-solving process and provide a judgment and decision regarding the correct answer/answers.
To summarize, the findings of this study provide evidence that KFPs tap into higher level cognitive processes consistent with the intended problem-solving and clinical decision-making process. As with any test development program, some of the items in this study appeared to achieve this goal better than others. In other words, as seen in Figure 1, some KFPs were in fact rated at the factual level. On the other hand, some of the MCQs were rated at higher cognitive levels. On balance, however, the tendency was for MCQs to be rated at lower levels and KFPs to be rated at higher levels, and these differences had substantive relations with item response times and performance.
Through correlations between MCQ and KFP sections of tests (Fischer et al., 2005; Hatala & Norman, 2002), comparisons of schools with different degrees of problem-based learning programs (Schuwirth et al., 1999), and a think-aloud test-taking session (Schuwirth et al., 2001), past research has supported the notion that KFPs measure different, and higher order, cognitive processes compared to more conventional MCQs. Because of the relatively limited empirical research base in the published literature, Haladyna (2004) suggested that additional types of validity evidence should be sought to further evaluate KFPs. Following the principle of methodological triangulation, the current study adds breadth to this knowledge base supporting KFPs by introducing results based on SME judgments of the thought processes that underlie KFPs, and the degree to which those processes affect response times and ultimately item performance in theoretically meaningful ways. Therefore, this study contributes valuable new validity evidence in support of the notion that KFPs tap into clinical decision-making skills as intended.
Additional research, replicating the methodologies used to date or extending into other research strategies, is warranted to further expand the research base on KFPs. For example, research into construct validity issues is warranted to better understand the nature of the skills KFPs are attempting to measure. Schuwirth (2009) provided a thoughtful commentary on issues in disentangling the complex notion of clinical reasoning. In addition, there is a lack of extant research on criterion-related validity evidence demonstrating whether KFPs help predict future performance on the job. A substantial body of criterion-related validity evidence exists for the conceptually similar situational judgment tests (Weekley & Ployhart, 2006; Whetzel & McDaniel, 2009), which might be tapped in search of implications for KFPs. Psychological research into more generic problem solving (e.g., Davidson & Sternberg, 2003) might also be explored in an attempt to better understand the processes involved in clinical reasoning and the correlates of KFP performance.
Certification and licensure organizations continue to administer “high stakes” examinations to graduates of professional degree programs to determine who is allowed to practice as a health care professional. In addressing the issue of public protection, these examinations attempt to assess not only factual knowledge but also the application of knowledge and clinical decision making. In an effort to assess decision making at a higher cognitive level, the KFP item format has been utilized; it offers an easily administered and objectively scored assessment process that can draw on a variety of clinical cases and cover a broad spectrum of content domains. Though this item format is not as difficult to construct as other formats, such as the patient management problem format, it still requires more work and resources to produce than more standard multiple-choice item types. Accordingly, it is appropriate to attempt to verify that the effort is “paying off” in the context of assessing skills at a higher cognitive level. The current study tends to support that assertion.
As measures of clinical decision making, KFPs play a key role in helping to ensure that licensed and certified allied health practitioners have the skills they need to make important decisions on the job and improve their practice and quality of patient care. If evidence continues to favor the KFP format in the allied health education and assessment field, it may be a format worth considering for examinations in other fields where decision-making skills are relevant to the quality of work performance.
Footnotes
Acknowledgments
This research was carried out in conjunction with test development work at Comira. The authors would like to thank the Commission on Dietetic Registration for allowing us to obtain data from certified specialists in gerontological and oncology nutrition.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
