Abstract
This study evaluated the extent to which medical students with limited English-language experience are differentially impacted by the additional reading load of test items consisting of long clinical vignettes. Participants included 25,012 examinees who completed Step 2 of the U.S. Medical Licensing Examination®. Test items were categorized into five levels based on the number of words per item, and examinee scores at each level were evaluated as a function of English-language experience (English as a second language [ESL] status and scores on a test of English-speaking proficiency). The longest items were more difficult than the shortest items across all examinee groups, and examinees with more English-language experience scored higher than those with less experience across all five levels of word count. The effect of primary interest—the interaction of word count with English-language experience—was statistically significant, indicating that score declines for longer items were larger for examinees with less English-language experience; however, the magnitude of this interaction effect was barely detectable (η2 = .0004, p < .001). Additional analyses supported the conclusion that the differential effect for examinees with less English-language experience was small but worthy of continued monitoring.
The past 25 years have witnessed an increased interest in the use of multiple-choice questions (MCQs) consisting of clinical vignettes that describe realistic medical cases. A clinical vignette typically opens with a paragraph-length description of a patient’s social and demographic background, the presenting complaint and medical history, physical examination results, and other diagnostic information (e.g., laboratory tests, medical images). Examinees are required to identify which findings are important, differentiate relevant from irrelevant findings, interpret that information in the context of what they already know, and then determine the correct diagnosis or establish a treatment plan. Clinical vignettes are consistent with integrated curricula and problem-based instruction, and are presumed to provide a more authentic assessment of clinical reasoning than traditional MCQs (Case, Swanson, & Becker, 1996; Farmer & Page, 2005; Schuwirth, Verheggen, van der Vleuten, Boshuizen, & Dinant, 2001).
Because clinical vignettes present more information than standard MCQs, they also contain more words per item. Not surprisingly, clinical vignettes also require more time for examinees to read and, as a whole, are slightly more difficult than standard MCQs (Case et al., 1996). It is conceivable that the additional reading load of clinical vignettes is particularly challenging for examinees who speak English as a second language (ESL). Although research in higher education and credentialing indicates that ESL examinees generally obtain lower scores than native English speakers on traditional MCQs (Abedi, 2016; Bosher & Bowles, 2008; Lakin, Elliott, & Lui, 2012; Latch, 2009; Mann, Canny, Lindley, & Rajan, 2010; Morrison, Swanson, Dillon, Kiepert, & Sample, 2008), studies have not addressed whether the additional length of clinical vignettes interacts with English-language experience in a manner that specifically disadvantages ESL examinees. Such an interaction would be important to detect because it could signal the presence of language-related construct-irrelevant variance (Abedi, 2016; Messick, 1989). The present investigation was undertaken to identify potential sources of construct-irrelevant variance in scores for the U.S. Medical Licensing Examination (USMLE®) program. Its specific purpose was to determine whether examinees with less English-language experience are differentially impacted by the additional reading load required by lengthy clinical vignettes. Findings of differential impact for ESL examinees could have implications for the development and use of clinical vignettes.
Method
Data
Data were obtained from operational administrations of USMLE® Step 2 Clinical Knowledge (CK), a multiple-choice test consisting of approximately 325 items. The study included only text-based, stand-alone MCQs while excluding multiple-item sets and items with graphics. The data consisted of responses from 25,012 examinees to a total of 5,206 test questions distributed across numerous test forms, given in a single year, with subsets of examinees responding to different subsets of items. Test items were classified into one of the five levels of word count: less than 109, 109–129, 130–148, 149–175, and greater than 175 words. Each level of word count consisted of approximately 1,040 test items; any single examinee responded to an average of 43 items at each level of word count. Thus, each examinee was scored on the basis of his or her performance on five “subtests,” where each subtest consisted of about 43 items with the range of word counts specified above. The distribution of different content areas across levels of word count was approximately random (e.g., each level of word count contained similar numbers of items addressing cardiovascular problems, musculoskeletal problems, etc.).
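The five word-count levels described above can be expressed as a simple lookup. The following is a minimal sketch rather than the study's actual procedure: the function names are hypothetical, and counting words by splitting on whitespace is an assumption, because the article does not state how words were counted.

```python
def count_words(item_text: str) -> int:
    """Count whitespace-delimited tokens (an assumed counting rule)."""
    return len(item_text.split())


def word_count_level(n_words: int) -> int:
    """Map a word count to levels 1-5 using the cut points reported in the text:
    less than 109, 109-129, 130-148, 149-175, and greater than 175 words."""
    if n_words < 109:
        return 1
    if n_words <= 129:
        return 2
    if n_words <= 148:
        return 3
    if n_words <= 175:
        return 4
    return 5


# Hypothetical usage: a 12-word stem falls in the shortest level.
stem = "A 45-year-old man presents with chest pain radiating to the left arm."
print(word_count_level(count_words(stem)))
```

Each examinee's five "subtest" scores would then be the proportion correct on the items falling in each level.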
English-language experience was defined in two ways. First, examinees were categorized as ESL or native English speakers on the basis of their self-reported language status on their application for examination. While ESL status is commonly used in research, it does have its limitations; for example, an ESL examinee may be completely fluent in English. Twenty-seven percent of the 25,012 examinees self-reported as ESL. The second measure of English-language experience was the spoken English proficiency (SEP) score obtained from each examinee’s participation in a separate 6-hour performance test during which the examinee receives language proficiency ratings from 11 different raters. SEP scores are highly reliable, with generalizability coefficients exceeding .90 (Raymond, Swygert, & Kahraman, 2012). The use of spoken English as a proxy for general fluency is supported by research showing that ESL skills for writing, reading, speaking, and listening form a single factor with spoken English loading .78 on that factor (Sawaki, Stricker, & Oranje, 2009). The Pearson and biserial correlations between ESL status and SEP scores were .58 and .78, respectively.
Analyses
Data were entered into a general linear model that included five levels of word count as a within-subjects factor, ESL status as a between-subjects factor, and SEP scores as a covariate. The dependent variable was mean item difficulty expressed as a percent-correct score. Of particular interest were the interaction effects for word count by ESL status and word count by SEP scores; the presence of such interactions would indicate that the impact of word count depends on language experience. Two additional analyses were conducted. The first determined whether the number of test items left incomplete varied by word count and language experience, as suggested by previous studies (Morrison et al., 2008). In addition, all items from a sample of four test forms were subjected to differential item functioning (DIF) analyses using ESL status as the break variable. This included a total of 1,120 items; due to item overlap across forms, the number of unique items was 828. The purpose of DIF is to determine whether individual test items exhibit bias toward, or behave differently for, members of a particular group while controlling for item difficulty and examinee proficiency. The expectation for an unbiased test item is that examinees at the same level of proficiency have the same probability of answering that particular test item correctly regardless of their language experience. The Mantel–Haenszel procedure was used, which is one of the more common DIF methods (Holland & Wainer, 1993; Zumbo, 2007).
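To make the Mantel–Haenszel idea concrete, the sketch below computes the common odds ratio for a single item across proficiency strata, along with the delta-scale transformation (-2.35 times the natural log of the odds ratio) that is often reported with it. This illustrates the general technique only. The article does not describe its matching variable, stratification, or flagging criteria, so the record layout, the group labels "ref" and "focal", and the function names are all assumptions.

```python
import math
from collections import defaultdict


def mantel_haenszel_odds_ratio(records):
    """Common odds ratio for one item.

    records: iterable of (stratum, group, correct), where stratum is a matched
    proficiency level (e.g., a total-score band), group is "ref" or "focal",
    and correct is 1 or 0. These field names are illustrative assumptions.
    """
    # Per-stratum 2x2 table: [ref correct, ref incorrect, focal correct, focal incorrect]
    tables = defaultdict(lambda: [0, 0, 0, 0])
    for stratum, group, correct in records:
        index = (0 if group == "ref" else 2) + (0 if correct else 1)
        tables[stratum][index] += 1

    numerator = 0.0
    denominator = 0.0
    for a, b, c, d in tables.values():
        n = a + b + c + d
        numerator += a * d / n
        denominator += b * c / n
    if denominator == 0:
        raise ValueError("odds ratio undefined: no stratum has both a ref-incorrect and a focal-correct response")
    return numerator / denominator


def mh_delta(odds_ratio: float) -> float:
    """Delta-scale index; negative values mean the item favors the reference group."""
    return -2.35 * math.log(odds_ratio)


# Hypothetical data: two strata with identical 2x2 tables.
records = []
for stratum in ("low", "high"):
    records += [(stratum, "ref", 1)] * 30 + [(stratum, "ref", 0)] * 10
    records += [(stratum, "focal", 1)] * 20 + [(stratum, "focal", 0)] * 20

alpha = mantel_haenszel_odds_ratio(records)
print(alpha, mh_delta(alpha))
```

An odds ratio near 1 indicates that matched examinees in both groups have similar odds of answering the item correctly, which is the expectation for an unbiased item stated above.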
Results and Discussion
The main effects for word count, ESL status, and SEP scores were statistically significant (p < .001), as were the two interaction effects of interest (i.e., word count by ESL status and word count by SEP scores). However, none of the effects was large, and the levels of statistical significance can be attributed to the large sample sizes. Word count accounted for about 4% of the variance, SEP scores 3% of the variance, and all other effects together accounted for less than 1% of the variance. The main effect for word count indicates that longer items are generally more difficult for all examinees. The main effect for ESL status and SEP scores indicates that language experience is related to test performance across all items regardless of word count. Both of these effects have been documented elsewhere (e.g., Case et al., 1996; Morrison et al., 2008). The small but significant interactions (η2 = .0004, .0003) indicate that the effect of word count on test scores depends to some extent on ESL status and SEP scores.
Given that SEP exhibited the larger effect of the two language experience variables, those results were graphed to assist with interpretation. Figure 1 illustrates the effects after categorizing examinees into seven groups on the basis of SEP scores; each score group spanned a one-half-point interval (e.g., 6.5–7.0). The word count by SEP interaction effect is demonstrated by the slightly steeper declines for the lowest SEP examinees at the higher levels of word count. The figure indicates that even small effects can translate into notable score differences. For example, scores for the most proficient SEP group declined from .791 to .758 from the shortest to longest items, for a change of .033. Meanwhile, scores for the least proficient SEP group dropped from .728 to .678, for a change of .050. The larger decline for less proficient English speakers (.050 vs. .033) has not been demonstrated in previous studies and could be an important consideration for examinees near the pass-fail cut score.

Figure 1. Percent-correct test score as a function of word count and spoken English proficiency.
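The score declines quoted for the most and least proficient SEP groups can be reproduced with a few lines of arithmetic. The sketch below uses only the values reported in the text. Converting the extra decline into a number of items is an illustrative assumption that applies the decline to a single subtest of about 43 items; it is not a result reported by the study.

```python
def score_decline(shortest: float, longest: float) -> float:
    """Drop in proportion correct from the shortest to the longest items."""
    return round(shortest - longest, 3)


most_proficient = score_decline(0.791, 0.758)   # reported as .033
least_proficient = score_decline(0.728, 0.678)  # reported as .050
extra_decline = round(least_proficient - most_proficient, 3)

ITEMS_PER_LEVEL = 43  # approximate items per examinee at each word-count level
extra_items_missed = extra_decline * ITEMS_PER_LEVEL

print(most_proficient, least_proficient, extra_decline)
print(round(extra_items_missed, 2))
```

Under these assumptions the additional decline for the least proficient group is about .017, which corresponds to less than one item on a 43-item subtest.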
While it is possible that the sheer number of words interferes with comprehension for ESLs, it also is possible that lower scores on longer items reflect a test-taking strategy whereby some examinees essentially skip the wordiest items. We tabulated the number of examinees who spent less than 20 s on an item and counted the responses as incompletes. Results indicated that examinees with less language experience were more likely to leave items incomplete, but only in the two highest word count groups. As one example, at the highest level of word count, the percentage of incompletes was 1.2% for ESLs and 0.6% for native English speakers (F = 439.7; p < .001). Finally, the DIF analyses indicated that 53 (6.4%) of the items exhibited DIF, with 32 items favoring native English speakers and 21 items favoring ESLs. Although there was a tendency for DIF items favoring native English speakers to be longer, this trend was not statistically significant (χ2 = 3.98, p = .41). For example, 53% of the DIF items favoring native English speakers were at the two highest levels of word count, while only 38% of the DIF items favoring ESLs were at the two highest levels (the expected percentages are 40%). This trend, while nonsignificant, is consistent with the general linear model results and the analysis of incomplete responses.
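The incomplete-response tabulation described above amounts to flagging any response with less than 20 s of time on the item and computing the flagged proportion within each language group and word-count level. The following is a minimal sketch under that reading. The record layout and function name are hypothetical, and the example data are invented for illustration rather than taken from the study.

```python
from collections import defaultdict

INCOMPLETE_THRESHOLD_S = 20  # responses under 20 s are counted as incomplete (from the text)


def incomplete_rates(responses):
    """Proportion of incomplete responses per (group, word-count level).

    responses: iterable of (group, level, seconds_spent); this layout is assumed.
    """
    totals = defaultdict(int)
    flagged = defaultdict(int)
    for group, level, seconds in responses:
        key = (group, level)
        totals[key] += 1
        if seconds < INCOMPLETE_THRESHOLD_S:
            flagged[key] += 1
    return {key: flagged[key] / totals[key] for key in totals}


# Invented example data: four responses per group at the highest level.
example = [
    ("ESL", 5, 12.0), ("ESL", 5, 95.0), ("ESL", 5, 80.0), ("ESL", 5, 20.0),
    ("native", 5, 70.0), ("native", 5, 64.0), ("native", 5, 88.0), ("native", 5, 51.0),
]
print(incomplete_rates(example))
```

A response of exactly 20 s is not flagged here, following the "less than 20 s" wording.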
In conclusion, these analyses confirmed previously reported findings that longer items tend to be more difficult than shorter ones and that English-language experience is positively related to test scores. The unique contribution of the present investigation is in detecting differential impact: Less fluent examinees experienced a slightly sharper decline in performance on the longest test items, suggesting the presence of language-related construct-irrelevant variance. On one hand, the results are not too alarming because differential impact was very small in terms of η2. On the other hand, the larger score declines on the percent-correct metric suggest that the impact on less fluent English speakers should not be completely dismissed and that continued monitoring is warranted. While the DIF analyses flagged more items that favored native English speakers, there was not strong evidence that word count was solely responsible for the DIF. Additional work incorporating response time, response accuracy, and additional linguistic features into a single analysis may help identify factors that affect test performance for ESL examinees. In the meantime, vignettes should be carefully edited to remove all nonessential information, and examinees should be given sufficient time to respond to them. A strategy currently being investigated at the National Board of Medical Examiners is to present the clinical information in tabular rather than narrative format, so that it resembles a patient chart or an electronic health record. While preliminary research shows a decrease in word count and response time for items presented in chart format, additional research is needed.
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported solely by the NBME; the authors received no additional financial support for the research, authorship, or publication of this article.
