Abstract
This paper presents an approach to standard setting that combines the prototype group method (PGM; Eckes, 2012) with a receiver operating characteristic (ROC) analysis. The combined PGM–ROC approach is applied to setting cut scores on a placement test of English as a foreign language (EFL). To implement the PGM, experts first named learners whom they considered to be typical of each of five levels of language proficiency as specified by the Common European Framework of Reference for Languages (CEFR; Council of Europe, 2001). Out of a total of 3,310 examinees taking different trial versions of the placement test, 470 learner prototypes were identified. For this set of prototypes, Rasch model estimates of EFL proficiency served as input to a series of ROC analyses, one for each pair of adjacent proficiency levels. Cut scores were derived using the Youden index that maximizes the overall rate of correct classification and minimizes the overall rate of misclassification. Findings confirmed that this method allows for the setting of cut scores that show a high level of classification accuracy in terms of the correspondence with expert categorizations of examinee prototypes. In addition, the ROC-based cut scores were associated with higher classification accuracy than cut scores derived from a logistic regression analysis of the same data. Potential further uses and implications of the PGM–ROC approach in the context of language testing and assessment are discussed.
Cut scores are used to classify examinees into two or more categories based on their performance on a test or an assessment. These categories typically represent distinct levels of knowledge, skills, or abilities in a specified domain. The process of determining cut scores is referred to as standard setting (Cizek & Bunch, 2007; Hambleton, Pitoniak, & Copella, 2012). Depending on the context, cut scores can have important consequences for examinees, institutions, and society at large (Cizek, 2012). Within the context of placement tests typically employed in language programs, cut scores help ensure that examinees are matched to the most suitable course. Yet the use of cut scores can also result in misclassification, as when examinees are placed into courses that are too challenging or too easy (Fulcher, 1997; Green, 2012; Shin & Lidster, 2016).
As pointed out in the Standards for Educational and Psychological Testing (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education, 2014), “cut scores embody value judgments as well as technical and empirical considerations” (p. 101). In fact, no matter which method of standard setting is used in an empirical study, the final cut scores to a large extent rest on judgments provided by subject matter experts (also called judges or panelists). Therefore, the Standards demand that “the process must be such that well-qualified participants can apply their knowledge and experience to reach meaningful and relevant judgments that accurately reflect their understandings and intentions” (p. 101).
A key difference between methods of standard setting refers to the way they attempt to meet the general requirements of meaningfulness, relevance, and accuracy of the expert judgments. To specifically take into account the high cognitive demands usually placed on experts when dealing with the judgmental task, the present study makes use of the prototype group method (PGM; Eckes, 2010a, 2010b, 2012). The PGM borrows heavily from research and theorizing on human categorization and decision making. Building on a large-scale assessment context where an online placement test of English as a foreign language (EFL) serves to classify examinees into one of five performance categories, this application of the PGM adopts a receiver operating characteristic (ROC) approach to deriving cut scores. Additionally, findings from the ROC analysis are compared to results obtained by a logistic regression procedure more commonly used in standard setting (Kingston & Tiemann, 2012; Morgan & Michaelides, 2005).
The critical role of judgments in standard setting
It is generally agreed that judgments play an essential role in standard setting (Cizek, 2012; Hambleton & Pitoniak, 2006; Kaftandjieva, 2010). Reflecting this role, Jaeger (1989) proposed a two-category scheme for classifying standard-setting methods that provides a useful basis for the present discussion. He distinguished between test-centered and examinee-centered methods. Within a test-centered approach, experts are asked to provide judgments of individual test items, assessment materials, or tasks. When using an examinee-centered approach, experts are asked to provide judgments of individual examinees (for a more detailed and comprehensive classification scheme, see Hambleton, Jaeger, Plake, & Mills, 2000; Hambleton et al., 2012).
One of the most widely used test-centered methods is the Angoff method (Angoff, 1971; Plake & Cizek, 2012). According to this method, judges are instructed to imagine a hypothetical borderline examinee, or minimally competent examinee, for example, an examinee on the borderline between two adjacent performance levels, and to indicate, for each item in a test, the probability (between 0 and 1) that the examinee will answer the item correctly. Other methods belonging to this category similarly refer to the notion of borderline examinees and require probability judgments of some kind in order to derive cut scores (e.g., the bookmark method; Lewis, Mitzel, Mercado, & Schulz, 2012).
Examinee-centered methods adopt a fundamentally different approach: they involve direct judgments of real examinees well known to judges (e.g., teachers judging individual students in their language course). For example, in the borderline group method (Livingston & Zieky, 1982), judges are asked to sort examinees into three groups, one group of examinees they believe to be clearly at (or above) the performance level in question (masters), one group of examinees they believe to fall clearly below that level (non-masters), and one group of examinees they believe to fall in between (borderline examinees). Then, the test score distribution for the group of borderline examinees is formed, and the median of this distribution is computed to yield the desired cut score.
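The final computational step of the borderline group method can be sketched in a few lines of Python. The sketch is illustrative only: the function name, the group labels, and the scores are hypothetical and are not taken from any of the studies cited here.

```python
from statistics import median

def borderline_cut_score(scores, groups):
    """Return the median test score of the examinees judged 'borderline'."""
    borderline = [s for s, g in zip(scores, groups) if g == "borderline"]
    if not borderline:
        raise ValueError("no borderline examinees were nominated")
    return median(borderline)

# Hypothetical test scores and judge-assigned groups (not data from the study).
scores = [12, 15, 18, 19, 21, 22, 25, 30]
groups = ["non-master", "non-master", "borderline", "borderline",
          "borderline", "master", "master", "master"]

cut = borderline_cut_score(scores, groups)
```

Because the cut score rests entirely on the borderline group, its stability depends on how many examinees judges place in that group and on how consistently they do so.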
A large body of research on human judgment, decision making, and categorization suggests that the aforementioned and related methods may suffer from a number of shortcomings (for research reviews, see Gilovich, Griffin, & Kahneman, 2002; Newell, Lagnado, & Shanks, 2007; Taylor, 2009). First and foremost, subjective probability judgments, and judgments of percentages or frequencies alike, are prone to various kinds of heuristics, errors, and biases; these judgmental tendencies pose severe threats to the informative value of the judgments, and thus, diminish the reliability and validity of the resulting cut scores (Bar-Hillel, 2001; Baron, 2014; Kahneman, 2011). Second, the abstract notion of a borderline examinee answering an item correctly with a given probability (e.g., .50, .67, or .80) is difficult to convey to judges in a clear and consistent manner; this is likely to exacerbate the differences between judges in the mental images of borderline examinees brought to bear upon the judgmental task (Skaggs & Hein, 2011; Skorupski, 2012). Third, compared to category members that are highly typical of their category, those located near the category boundary (as are, by definition, borderline examinees) tend to be rated or evaluated with lower within-judge consistency and lower between-judge agreement, are less easily recognized, and are more slowly and less accurately accessed and retrieved from memory (Hampton, 2006; Murphy, 2004). Fourth, the amount of unwanted variability of judgments is increased further by between-judge differences in the severity or leniency exhibited when applying performance standards or criteria (Engelhard, 2009, 2011; Longford, 1996; Van Nijlen & Janssen, 2008). 
Finally, when aligning tests to a framework that specifies sets of performance categories or proficiency levels, such as the Common European Framework of Reference for Languages (CEFR; Council of Europe, 2001), judges tend to use widely different judgmental strategies and are influenced by factors irrelevant to the judgment task (Harsch & Hartig, 2015; Papageorgiou, 2010; see also Deunk, van Kuijk, & Bosker, 2014).
The prototype group method (PGM) of standard setting has been developed to overcome these difficulties (Eckes, 2010a, 2012). Specifically, the PGM rests on an examinee-centered approach, in which experts provide judgments of best or prototypic examples of two or more performance categories. These prototypic category members form the basis of determining cut scores along the latent proficiency continuum. The PGM does not require experts to provide probability judgments, nor does it invoke the notoriously vague concept of borderline examinees. Rather, the PGM reconsiders standard setting as a natural categorization task, thus doing without hypothetical judgments involving high levels of ambiguity and uncertainty. Note that the kind of judgmental task presented to experts in a PGM study bears some resemblance to holistic methods of standard setting, in particular the body of work method (Kingston, Kahl, Sweeney, & Bay, 2001; Kingston & Tiemann, 2012), in which complete student (examinee) work samples or performances are assigned to a number of performance categories.
When experts are asked to identify examinees who are typical of a given proficiency level, their basic task is to provide similarity judgments, that is, judgments of the similarity between real, personally known EFL learners and the category prototype defined as the most representative or “ideal” category member. Similarity judgments have been shown to involve fundamental, readily available, and often automatic processes that demand only minimal cognitive effort and yet help to make classification decisions more quickly and possibly more accurately than more complex or deliberate decision-making strategies (Gigerenzer & Gaissmaier, 2011; Read & Grushka-Cockayne, 2011).
In the present study, categorical judgments were provided by language teachers or course instructors identifying EFL learners they considered typical of performance categories defined by CEFR proficiency levels A1 through C1 (Council of Europe, 2001). The CEFR proficiency descriptions specified the kinds of communicative skills reasonably expected of an EFL learner at a particular level. That is, the CEFR level descriptions were intended to function in a similar way as performance level descriptors (PLDs) that are used extensively in other standard-setting contexts (Egan, Schneider, & Ferrara, 2012; Papageorgiou & Tannenbaum, 2016; Tannenbaum & Cho, 2014).
The EFL learners took trial versions of a newly developed online test, the onSET-English, mainly serving the purposes of placement and screening with a focus on institutions of higher education in Germany (onSET = Online-Spracheinstufungstest; http://www.onset.de). The onSET-English builds on the C-test principle (Grotjahn, Klein-Braley, & Raatz, 2002; Klein-Braley, 1997). As a rule, C-tests consist of four to eight short authentic texts in which parts of words are missing; examinees have to insert the missing parts, that is, to restore the original words in each text (for more details on the C-test format used here, see the “Method” section below).
C-tests are easy to develop and to score, yield highly precise measures of general language proficiency, and correlate substantially with receptive and productive language skills measured in separate sections (Eckes, 2014; Read, 2015; for a discussion of contrasting views on what C-tests measure, see Eckes & Grotjahn, 2006). Recently, Harsch and Hartig (2016) have shown that C-tests outperformed Yes/No vocabulary tests as predictors of receptive language skills. The authors concluded that “the C-test, being a reliable, economical, and robust measure, appears to be an ideal candidate for placement and screening purposes” (Harsch & Hartig, 2016, p. 555).
Receiver operating characteristic (ROC) analysis
Rationale
Following data collection within a PGM standard-setting context, for each examinee prototype two independent data points are available: (a) category membership information (e.g., A2 or B1) based on human judgment; and (b) a Rasch estimate of examinee proficiency (in logits) based on test performance. In earlier research (Eckes, 2010a, 2012), binary logistic regression (BLR) was used to determine cut scores between adjacent proficiency levels; that is, category membership of examinee prototypes was predicted on the basis of prototype proficiency measures, and cut scores were defined as those measures that best discriminated between two adjacent categories.
A drawback, however, of the BLR procedure is that its results depend on the relative proportion of examinees in the performance categories. BLR generally utilizes parametric estimation methods, which implies that the regression coefficients, and thus the cut scores, may be unduly affected by changes in relative category proportion or category prevalence. For example, if the proportion of B1 learners in one data set were twice as high as in another data set, the cut scores estimated for each of these data sets will be different. More precisely, if the prevalence of a given category changes from one data set to the next, with all other variables held constant, the regression analysis will not suggest the same cut scores.
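The prevalence dependence can be made explicit with a short worked derivation. It rests on two assumptions that are not stated in this section: that the BLR cut score is defined as the proficiency measure at which the fitted probability of belonging to the higher level equals .50, and that the logistic model holds with the within-category distributions of proficiency unchanged across data sets. The symbols β0, β1, and k are introduced here for illustration only.

```latex
\begin{align*}
% Assumed logistic model for two adjacent levels (A2 vs. B1)
\operatorname{logit} P(\mathrm{B1} \mid \theta) &= \beta_0 + \beta_1 \theta,
  \qquad c = -\frac{\beta_0}{\beta_1} \\
% Multiplying the sample odds of B1 membership by k shifts only the intercept
\beta_0' &= \beta_0 + \ln k,
  \qquad c' = -\frac{\beta_0 + \ln k}{\beta_1} = c - \frac{\ln k}{\beta_1} \\
% Illustrative values: k = 2, \beta_1 = 2
c' - c &= -\frac{\ln 2}{2} \approx -0.35 \text{ logits}
\end{align*}
```

Under these assumptions, doubling the odds of B1 membership lowers the estimated A2/B1 cut score by about a third of a logit, even though the test and the examinees' proficiency distributions within each level are unchanged.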
The ROC approach adopted in this study ensured that the basic statistics used for determining cut scores remained unaffected by changes in relative category proportions or prevalence (Fawcett, 2006; Zhou, Obuchowski, & McClish, 2011). As a practically relevant result, cut scores estimated from the present sample of examinees were applicable to other samples with different prevalence rates. 1
Originating from signal detection theory in technical sciences, ROC methodology has been used increasingly over the last two or three decades to examine the accuracy and utility of diagnostic decisions (Macmillan & Creelman, 2005; Swets, Dawes, & Monahan, 2000). More specifically, ROC techniques have proved to be a valuable tool for visualizing, analyzing, and evaluating the performance of diagnostic systems in terms of the trade-off between hit rates and false alarm rates, in particular in the wide-ranging field of medical diagnosis and decision making (Zhou et al., 2011). 2 Within the context of standard setting, Kaftandjieva (2010) was among the first to recognize the great potential of adopting a ROC approach (see also Hintze & Silberglitt, 2005; Sumbling, Viladrich, Doval, & Riera, 2014; Tavakol & Dennick, 2012).
Sensitivity, specificity, and the ROC curve
Two basic indicators of a test’s classification accuracy, also called its discriminatory power, are sensitivity and specificity. These indicators describe the rates of persons belonging to one of two categories (e.g., diseased vs. healthy) that are classified correctly based on their scores on the test. Sensitivity is the true positive rate that gives the fraction of persons belonging to category A (e.g., diseased, positive) that are classified as A. Specificity is the true negative rate that gives the fraction of persons belonging to category B (e.g., healthy, negative) that are classified as B. As mentioned before, sensitivity and specificity are not affected by changes in the proportions of positive to negative cases in a sample; for example, they are not affected by changes in the prevalence of diseased persons (Fawcett, 2006; Zhou et al., 2011). To highlight this statistical property, Zhou et al. characterized sensitivity and specificity as measures of a test’s intrinsic accuracy.
Within the present context, the categories or classes to which persons are assigned refer to the language proficiency levels ranging from A1 (low proficiency) to C1 (high proficiency). Table 1 introduces the basic terminology building on the fourfold classification for proficiency levels A2 and B1. Because this study aimed to set cut scores on a placement test, the classes that were provided by expert judgment are called “criterion classes” (learner prototype A2 vs. learner prototype B1) and those formed by applying a cut score to the test score distribution are called “predictor” or “test-based classes.” When an examinee’s test score, or proficiency estimate based on a Rasch analysis of item responses, is smaller than cut score c (also called “threshold value”), the examinee is classified as an A2 learner; that is, the examinee has not yet reached level B1. Otherwise, the examinee is classified as a B1 learner.
Table 1. Fourfold classification table for learner prototypes at levels A2 vs. B1.
Criterion class (expert judgment)    Test-based class A2 (−)    Test-based class B1 (+)
A2 prototype (−)                     TN                         FP
B1 prototype (+)                     FN                         TP
Note: A minus sign (in parentheses) indicates “has not reached level B1.” A plus sign (in parentheses) indicates “has reached level B1.” TN = true negative. FP = false positive. FN = false negative. TP = true positive.
As shown in Table 1, given a particular examinee and a proficiency estimate, there are four possible outcomes. If the examinee is (judged to be) a learner prototype at level A2, that is, if he or she has not yet reached the higher level B1, and is classified as an A2 prototype based on the estimate, this examinee is counted as a true negative (TN); if the examinee is classified as a B1 prototype, he or she is counted as a false positive (FP). On the other hand, if the examinee is (judged to be) at level B1 and is classified as an A2 prototype, the examinee is counted as a false negative (FN); if the examinee is classified as a B1 prototype, he or she is counted as a true positive (TP).
With the use of the terminology introduced above, the sensitivity q(c) associated with a given cut score c is given by
$$q(c) = \frac{TP}{TP + FN}. \tag{1}$$
That is, q(c) gives the true positive rate (TPR, also called hit rate) when applying cut score c. Similarly, the specificity p(c) associated with cut score c is given by
$$p(c) = \frac{TN}{TN + FP}. \tag{2}$$
That is, p(c) gives the true negative rate, and (1 – p(c)) gives the false positive rate (FPR, also called false alarm rate) when applying cut score c.
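The fourfold classification and the two rates can be illustrated with a minimal Python sketch. The function name and the proficiency measures below are hypothetical; the classification rule (a measure smaller than the cut score means A2, otherwise B1) follows the description given above.

```python
def confusion_counts(levels, measures, cut):
    """Count TN, FP, FN, TP for A2 (negative) vs. B1 (positive) prototypes.

    An examinee is classified as B1 when the measure is >= cut, else as A2.
    """
    tn = fp = fn = tp = 0
    for level, theta in zip(levels, measures):
        predicted_b1 = theta >= cut
        if level == "B1" and predicted_b1:
            tp += 1
        elif level == "B1":
            fn += 1
        elif predicted_b1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp

# Hypothetical expert judgments and Rasch measures in logits (not study data).
levels = ["A2"] * 4 + ["B1"] * 5
measures = [-1.4, -0.9, -0.6, 0.3, -0.2, 0.1, 0.6, 1.0, 1.5]

tn, fp, fn, tp = confusion_counts(levels, measures, cut=0.0)
sensitivity = tp / (tp + fn)  # q(c): share of B1 prototypes classified as B1
specificity = tn / (tn + fp)  # p(c): share of A2 prototypes classified as A2
```

With these illustrative values and a cut score of 0.0 logits, four of the five B1 prototypes and three of the four A2 prototypes are classified correctly.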
Equations 1 and 2 demonstrate that there are two different probabilities associated with any cut score c: (a) the probability of observing a true positive case (sensitivity); and (b) the probability of observing a true negative case (specificity). The plot of a test’s sensitivity (on the vertical axis) versus its false positive rate (FPR), or (1 – specificity), on the horizontal axis for an entire range of cut scores is called a ROC curve. Thus, a ROC curve depicts the trade-off between benefits (true positives) and costs (false positives) across all possible cut scores.
ROC curves have a number of attractive properties (Zhou et al., 2011): They (a) visually represent the measures of a test’s classification accuracy, (b) include all possible cut scores, (c) do not depend on category prevalence, (d) do not depend on the test’s measurement scale, and (e) provide direct visual comparisons of two or more tests.
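How a ROC curve is traced, and why it does not depend on category prevalence, can be sketched as follows. The code is a simplified illustration with hypothetical data; it uses the observed measures themselves as candidate cut scores, whereas statistical software may place thresholds differently (for example, midway between adjacent observed values).

```python
def roc_points(neg, pos):
    """Return (FPR, TPR) pairs for all candidate cut scores.

    Cut scores run from the strictest (nobody classified as positive)
    to the most lenient (everybody classified as positive).
    """
    cuts = [float("inf")] + sorted(set(neg) | set(pos), reverse=True)
    points = []
    for c in cuts:
        tpr = sum(x >= c for x in pos) / len(pos)  # sensitivity
        fpr = sum(x >= c for x in neg) / len(neg)  # 1 - specificity
        points.append((fpr, tpr))
    return points

# Hypothetical Rasch measures in logits (not study data).
a2 = [-1.4, -0.9, -0.6, 0.3]       # negative class
b1 = [-0.2, 0.1, 0.6, 1.0, 1.5]    # positive class

curve = roc_points(a2, b1)

# Tripling the number of B1 prototypes changes prevalence but not the curve,
# because TPR and FPR are each computed within a single class.
same_curve = roc_points(a2, b1 * 3) == curve
```

The curve starts at (0, 0) and ends at (1, 1); every intermediate point corresponds to one cut score.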
ROC curve summary statistics
To evaluate the discriminatory power of a test, the information provided by a ROC curve is commonly summarized by a statistic called the area under the ROC curve (AUC). For a test that classifies at least as well as chance, this statistic ranges from .50 to 1.0. In the ideal case, where q(c) = p(c) = 1, that is, when the test differentiates perfectly between the two classes under study, the AUC statistic assumes its maximum value (i.e., AUC = 1.0). Conversely, when the test has no discriminatory power at all, performing at a level no better than chance, AUC = .50; in the two-dimensional ROC graph, this corresponds to the case where the ROC curve coincides with the diagonal (called the chance diagonal) that runs from lower left to upper right. Hence, the closer the AUC value is to its maximum, the higher the discriminatory power of the test.
As a general guideline for interpreting AUC values, Hosmer, Lemeshow, and Sturdivant (2013, p. 177) made the following suggestions: AUC = .50, no discrimination (the same as flipping a coin); .50 < AUC < .70, poor discrimination; .70 ⩽ AUC < .80, acceptable discrimination; .80 ⩽ AUC < .90, excellent discrimination; AUC ⩾ .90, outstanding discrimination.
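One way to compute the AUC without drawing the curve uses its well-known interpretation as the probability that a randomly chosen positive case receives a higher measure than a randomly chosen negative case, with ties counted as one half. The sketch below applies this to hypothetical data and attaches the verbal labels listed above; the function names are illustrative.

```python
def auc(neg, pos):
    """Proportion of (positive, negative) pairs ranked correctly; ties count 0.5."""
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def discrimination_label(value):
    """Verbal label following the guideline of Hosmer et al. (2013, p. 177)."""
    if value >= 0.90:
        return "outstanding"
    if value >= 0.80:
        return "excellent"
    if value >= 0.70:
        return "acceptable"
    if value > 0.50:
        return "poor"
    return "none"  # .50 is chance level; lower values are not covered by the guideline

# Hypothetical Rasch measures in logits (not study data).
a2 = [-1.4, -0.9, -0.6, 0.3]
b1 = [-0.2, 0.1, 0.6, 1.0, 1.5]

area = auc(a2, b1)            # 18 of the 20 pairs are ordered correctly
label = discrimination_label(area)
```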
Another frequently used summary statistic is the Youden index (Youden, 1950). This index, denoted by J, is a simple function of q(c) and p(c); it is defined as follows (Fluss et al., 2005; Perkins & Schisterman, 2006; Zhou et al., 2011):
$$J = \max_{c}\,\{q(c) + p(c) - 1\} \tag{3}$$
over all cut scores c, –∞ < c < +∞.
The J index ranges from 0 to 1, where J = 1 indicates complete separation of the test score distributions for the two categories under study, and J = 0 indicates complete overlap. Unlike the AUC statistic, the Youden index also provides a criterion for choosing the “optimal” cut score, that is, the cut score for which q(c) + p(c) – 1 is maximized (Fluss et al., 2005). This cut score is denoted by cJ. In terms of the ROC graph, J is the maximum vertical distance between the ROC curve and the chance diagonal.
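The search for the cut score that maximizes q(c) + p(c) – 1 can be sketched as a simple scan over candidate cut scores. The data are hypothetical, and restricting the candidates to the observed measures is a simplification adopted here for illustration.

```python
def youden(neg, pos):
    """Return (J, cut score) maximizing sensitivity + specificity - 1."""
    best_j, best_cut = float("-inf"), None
    for c in sorted(set(neg) | set(pos)):
        q = sum(x >= c for x in pos) / len(pos)  # sensitivity at cut c
        p = sum(x < c for x in neg) / len(neg)   # specificity at cut c
        j = q + p - 1.0
        if j > best_j:  # on ties, the lowest cut score is kept
            best_j, best_cut = j, c
    return best_j, best_cut

# Hypothetical Rasch measures in logits (not study data).
a2 = [-1.4, -0.9, -0.6, 0.3]
b1 = [-0.2, 0.1, 0.6, 1.0, 1.5]

j_index, cut_j = youden(a2, b1)
```

For these values the maximum is reached at a cut score of –0.2 logits, where all B1 prototypes and three of the four A2 prototypes are classified correctly.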
An alternative definition of a cut score is provided by the closest-topleft index (“closest-to-(0,1) criterion;” Perkins & Schisterman, 2006). This index has an intuitive geometric interpretation: the “optimal” cut score corresponds to the point on the ROC curve closest to the point (0,1), that is, closest to the point representing the ideal classification performance where both sensitivity and specificity are maximal. Formally, the closest-topleft index D01 can be defined as the minimum Euclidean distance of points on the ROC curve from point (0,1). It is given by (Perkins & Schisterman, 2006; Zhou et al., 2011):
$$D_{01} = \min_{c}\,\sqrt{(1 - q(c))^{2} + (1 - p(c))^{2}} \tag{4}$$
over all cut scores c, –∞ < c < +∞.
Lower values of D01 indicate higher classification accuracy; when accuracy is highest, D01 assumes its minimum value (i.e., D01 = 0). The cut score that corresponds to D01 is denoted by c*.
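The closest-topleft criterion can be sketched in the same way as a scan over candidate cut scores, here minimizing the Euclidean distance to the point (0,1). As before, the data are hypothetical and the candidate cut scores are limited to the observed measures for simplicity.

```python
import math

def closest_topleft(neg, pos):
    """Return (D01, cut score) minimizing the distance of the ROC point to (0, 1)."""
    best_d, best_cut = float("inf"), None
    for c in sorted(set(neg) | set(pos)):
        q = sum(x >= c for x in pos) / len(pos)  # sensitivity at cut c
        p = sum(x < c for x in neg) / len(neg)   # specificity at cut c
        d = math.hypot(1.0 - q, 1.0 - p)
        if d < best_d:  # on ties, the lowest cut score is kept
            best_d, best_cut = d, c
    return best_d, best_cut

# Hypothetical Rasch measures in logits (not study data).
a2 = [-1.4, -0.9, -0.6, 0.3]
b1 = [-0.2, 0.1, 0.6, 1.0, 1.5]

d01, cut_star = closest_topleft(a2, b1)
```

In this small example the criterion selects the same cut score (–0.2 logits) as the Youden index; with other score distributions the two criteria can point to different cut scores.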
The Youden index and the closest-topleft index may disagree on what cut score is “optimal”. In case of disagreement, that is, when cJ ≠ c*, Perkins and Schisterman (2006) have shown that cJ is preferable because it maximizes the overall rate of correct classification and minimizes the overall rate of misclassification. In the ROC analyses reported below both indices were computed to allow comparisons between their corresponding cut score estimates.
A simple summary statistic often reported in classification studies is the proportion of true positive (TP) and true negative (TN) cases in the entire sample. This statistic is usually referred to as a test’s overall classification accuracy (or predictive accuracy). Zhou et al. (2011) pointed out that it is more precisely called the probability of a correct test result, hereafter abbreviated to PCTR. The statistic PCTR(c) associated with a particular cut score c is given by:
$$\mathrm{PCTR}(c) = \frac{TP + TN}{n}, \tag{5}$$
where n is the total number of examinees in the sample. It should be noted that PCTR depends not only on the sensitivity and specificity of the test, but also on the prevalence of the person category in question; hence, this statistic is not a measure of the test’s intrinsic accuracy (Zhou et al., 2011).
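The dependence of PCTR on prevalence follows directly from rewriting the statistic in terms of sensitivity and specificity. In the worked example below, π denotes the proportion of positive cases in the sample, a symbol introduced here for illustration; the numerical values are hypothetical.

```latex
\begin{align*}
\mathrm{PCTR}(c) &= \frac{TP + TN}{n} = \pi\, q(c) + (1 - \pi)\, p(c),
  \qquad \pi = \frac{TP + FN}{n} \\
% Same sensitivity and specificity, two different prevalence rates
q(c) = .90,\; p(c) = .70,\; \pi = .50 &: \quad \mathrm{PCTR}(c) = .45 + .35 = .80 \\
q(c) = .90,\; p(c) = .70,\; \pi = .20 &: \quad \mathrm{PCTR}(c) = .18 + .56 = .74
\end{align*}
```

With sensitivity and specificity held constant, PCTR moves toward whichever of the two rates belongs to the more frequent category.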
Research questions
The basic question motivating this research was as follows: To what extent is the prototype group method (PGM) suited for setting cut scores on an EFL placement test when combined with a receiver operating characteristic (ROC) analysis? As advanced here, the PGM–ROC approach to standard setting builds on an examinee-centered method characterized by three sources of input: (a) expert judgments of examinees most typical of a given performance category or proficiency level, (b) Rasch measures of examinee proficiency, and (c) ROC statistics to determine cut scores that maximize the overall rate of classification accuracy. More specifically, the research questions were as follows:
1. Do experts provide judgments of examinee prototypes that are sufficiently differentiating with respect to the intended CEFR proficiency levels A1 to C1 (Council of Europe, 2001)?
2. Is the measurement precision of the EFL placement test (onSET-English) high enough to separate examinees into a sufficiently large number of statistically distinct classes?
3. What is the overall classification accuracy of the onSET-English as indicated by ROC summary statistics? How accurately do the resulting cut scores differentiate between adjacent CEFR proficiency levels?
4. What cut scores does the Youden index computed from the ROC analysis suggest?
5. How does the ROC analysis compare to an analysis of the same data building on logistic regression?
Method
Overview
This research was part of an ongoing process of developing and extending a calibrated item bank for use with the online placement test onSET-English. In the present study, 20 independent samples of examinees provided responses to C-test trial versions comprising different sets of ten or six texts each. Item difficulty and examinee proficiency measures were estimated across all trial sets. Additionally, experts provided judgments of examinees that were typical of each of five proficiency levels defined by the CEFR scale (i.e., A1, A2, B1, B2, and C1). The proficiency estimates of the examinee prototypes were used to predict membership in adjacent levels of language proficiency building on a ROC approach.
Examinees
A total of 3,310 examinees participated in the trial sessions. Of these, 2,012 (60.8%) were female and 1,271 (38.4%) male; 27 (0.8%) participants did not indicate their gender. Of the total sample, 75.2% of participants were between 18 and 28 years of age, 7.5% were younger than 18 years, and 17.3% were older than 28 years (M = 24.57, SD = 9.67). All examinees participated on a voluntary basis.
At the time of testing, the majority of participants were attending English language courses as part of a preparatory study program in Germany or planning to take English-speaking courses at German universities while still in their home country. Participants came from 82 different countries around the world. In terms of the number of participants, the following 10 national groups ranked highest (percentage in parentheses): Germany (36.2%), Russia (8.7%), Ukraine (6.7%), Armenia (6.2%), People’s Republic of China (5.8%), Italy (4.0%), Denmark (3.5%), Vietnam (2.4%), Brazil (2.3%), Bulgaria (1.7%).
Following data analysis, each participant received feedback on his or her performance on the C-test trial version. To provide the feedback in a timely manner, the information consisted of the overall test score earned in a given trial set and the percentile rank achieved in the sample to which the participant belonged (across-sample statistics were available only after the data from all samples had been processed).
Test material
In the first 12 samples, Samples S01 through S12, participants worked on 10 texts with 20 gaps each. All texts were constructed according to the classic deletion rule (the “rule of two”); that is, words were mutilated by deleting the second half of every second word, beginning with the second word of the second sentence. If a word had an odd number of letters, the larger part was deleted (Grotjahn et al., 2002). Throughout the texts, the missing part of each word was indicated by a single underline of constant length. In Samples S13 through S20, participants worked on shorter test versions consisting of six texts with 20 gaps each. Shorter test versions were used in later samples to facilitate the administration of a given trial set during regular language course sessions and, thus, to raise the institutions’ willingness to participate in the trialing process.
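The deletion rule described above can be illustrated with a short Python sketch. It is a simplification and not the procedure used to build the onSET-English texts: the function name is hypothetical, sentences are split naively at end punctuation, only alphabetic words are counted, and one-letter words are left intact (an assumption made here).

```python
import re

def make_c_test(text, max_gaps=20, blank="____"):
    """Apply the 'rule of two' starting with the second word of the second sentence.

    Returns the mutilated text and the list of original words (the answer key).
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    solutions = []
    word_index = 0

    def mutilate(match):
        nonlocal word_index
        word = match.group(0)
        word_index += 1
        is_target = word_index % 2 == 0 and len(word) > 1
        if is_target and len(solutions) < max_gaps:
            keep = len(word) // 2  # odd length: the larger part is deleted
            solutions.append(word)
            return word[:keep] + blank  # blank of constant length
        return word

    # The first sentence stays intact; counting starts in the second sentence.
    mutilated = sentences[:1] + [
        re.sub(r"[A-Za-z]+", mutilate, s) for s in sentences[1:]
    ]
    return " ".join(mutilated), solutions

# Hypothetical two-sentence text (not taken from the test).
text = "Tea is popular. The green leaves are dried quickly."
gapped, key = make_c_test(text)
```

For this input, the second, fourth, and sixth words of the second sentence are mutilated, and the first sentence is left unchanged to provide context.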
Across all trial sets, two texts were the same. These common texts served to provide the link between the different sets, following a common-item non-equivalent groups design (e.g., Kolen & Brennan, 2014). In each set containing 10 texts, the common texts appeared at the third and eighth position, respectively; in each shorter test version, the common texts appeared at the second and fifth position, respectively.
Along with the test booklets, test administrators (i.e., language teachers with an excellent command of the English language) received a brief questionnaire to provide the information required for implementing the PGM. In this questionnaire, administrators listed the names of maximally three learners they knew very well and considered to be the best examples of each of the relevant CEFR proficiency levels A1 through C1 (if present in the course), taking into account both reception (listening, reading) and production (writing, speaking). 3 They were to base their judgments strictly on course observations of learners’ language behavior across a period of several months.
For ease of review, Table 1 of the original CEFR publication (Council of Europe, 2001, p. 24) was reprinted on a separate page of the booklet. In this table, each reference level was described briefly with respect to characteristic language skills using can-do statements. For example, at level A2 (Waystage), the scale stated that learners “can understand sentences and frequently used expressions related to areas of most immediate relevance (e.g. very basic personal and family information, shopping, local geography, employment);” at level B2 (Vantage), learners “can understand the main ideas of complex text on both concrete and abstract topics, including technical discussions in his/her field of specialisation.” Level A1 was included because it was needed to determine the lower cut score for level A2; level C2 was not included, because it fell outside the range of levels for which cut scores were to be determined.
Across samples, administrators identified a total of 470 examinees as best examples of levels A1 through C1; these examinees formed the set of learner prototypes used for determining cut scores.
Procedure and scoring
Trial sets were administered at language centers of German institutions of higher education, at private language schools, at adult education centers (“Volkshochschulen”), or at so-called Lektorships (“Lektorate”) of the German Academic Exchange Service (DAAD). Administrators were TestDaF examination officers, DAAD-Lektors, or language teachers highly experienced in EFL teaching at a range of proficiency levels. They had been trained in the use of the CEFR scale at TestDaF workshops and/or had become familiar with this scale in their professional work.
The test booklet contained instructions on the first page and the set of gap-filling texts on the following pages, with each text printed on a separate page. The instruction read: “Complete the gaps in the following texts in a meaningful way. You have five minutes for each text.” The time allowed was also printed above each text. Administrators strictly controlled adherence to this time limit. Texts had to be worked on in the order and within the time limits given; paging up and down the booklet was not allowed. Completed test booklets and administrators’ learner prototype listings were sent back to the TestDaF Institute for scoring and data analysis.
Each correctly restored word, or each acceptable variant (e.g., use of a plural form instead of the singular, “those” instead of “these,” American or British English spelling), was scored one point. Each incorrectly restored word, including spelling errors, was scored zero points. Thus, the total score computed across all texts within a given set containing 10 texts (Samples S01 to S12) could range from 0 to 200 points; the total score computed across all texts within a given set containing 6 texts (Samples S13 to S20) could range from 0 to 120 points.
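The scoring rule can be expressed as a lookup of each response in a set of acceptable restorations per gap. The sketch below is hypothetical: the function name and the answer key are invented for illustration, and responses are compared as complete words after removing surrounding whitespace.

```python
def score_text(responses, answer_key):
    """One point per gap whose response is among the acceptable restorations."""
    score = 0
    for response, acceptable in zip(responses, answer_key):
        if response.strip() in acceptable:
            score += 1  # correct word or acceptable variant
        # anything else, including misspellings, earns zero points
    return score

# Hypothetical answer key: one set of acceptable words per gap.
answer_key = [{"green"}, {"are"}, {"these", "those"}, {"colour", "color"}]
responses = ["green", "ar", "those", "color"]

total = score_text(responses, answer_key)
```

Summing such text scores across the texts of a trial set yields the total score; with 20 gaps per text, each text contributes between 0 and 20 points.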
Data analysis
Recent research addressed the suitability of psychometric approaches for the analysis and evaluation of C-tests with a focus on the issue of local dependence within texts (Eckes, 2006, 2011; Eckes & Baghaei, 2015; Schroeders, Robitzsch, & Schipolowski, 2014). Local dependence violates a critical psychometric assumption, stating that an examinee’s response to an item (gap) does not affect the probability of the examinee’s response to another item (gap) within the same text. Violations of this assumption can have adverse consequences for reliability and parameter estimates. Building on a range of different approaches including polytomous Rasch models (rating scale, partial credit, and continuous rating scale models), testlet response models, and copula models, 4 research findings confirmed that (a) the degree of local dependence between gaps was generally small, and (b) the correspondence between the models’ estimates of examinee proficiency and item difficulty, as well as between the models’ estimates of test reliability (Cronbach’s alpha or similar indices), was high. Therefore, in keeping with previous analyses of C-tests for item calibration and banking purposes (Eckes, 2007, 2011), a polytomous Rasch modeling approach was adopted, building on Andrich’s (1978) rating scale model (RSM).
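For reference, the rating scale model can be written in one of its standard forms as shown below. The notation is introduced here for illustration and assumes that each text is treated as a single polytomous item whose score x is the number of correctly restored gaps, so that m = 20 for texts with 20 gaps: θ_n is the proficiency of examinee n, β_i the difficulty of text i, and τ_k the k-th threshold, which is common to all texts.

```latex
\begin{align*}
P(X_{ni} = x) &=
  \frac{\exp\!\Big( x(\theta_n - \beta_i) - \sum_{k=1}^{x} \tau_k \Big)}
       {\sum_{j=0}^{m} \exp\!\Big( j(\theta_n - \beta_i) - \sum_{k=1}^{j} \tau_k \Big)},
  \qquad x = 0, 1, \ldots, m \\
% Empty sums (x = 0 or j = 0) are defined as zero.
\ln \frac{P(X_{ni} = x)}{P(X_{ni} = x - 1)} &= \theta_n - \beta_i - \tau_x,
  \qquad x = 1, \ldots, m
\end{align*}
```

The second line shows that the log odds of scoring x rather than x – 1 depend on the examinee, the text, and the threshold in a purely additive way, which is what places proficiency and difficulty estimates on a common logit scale.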
The Rasch analysis was run in two stages. In the first stage, the data within each sample of examinees were analyzed separately based on the RSM. As judged by statistical indicators of model fit, texts that did not function properly were excluded from further consideration. In the second stage, all remaining texts were put on the same difficulty scale using a concurrent estimation procedure. The concurrent Rasch analysis estimated examinee and item parameters using the data from the aggregated sample of 3,310 examinees. Aggregation was accomplished by linking the 20 independent samples of examinees based on the two anchor texts that were common to all trial sets. Throughout the Rasch analyses, the computer program WINSTEPS (Version 3.91; Linacre, 2015) was used.
As discussed earlier, cut scores between adjacent proficiency levels were determined by using the ROC procedure. In the present analysis, the AUC index, together with an estimate of the 95% confidence interval, was used to judge the overall quality of the examinee classification (e.g., A2 vs. B1). To determine the optimal cut score for a particular level comparison, two statistical indices were considered: (a) the Youden index, and (b) the closest-topleft index. ROC analyses were performed using the R package pROC (Version 1.8; Robin et al., 2015). A brief introduction to this and related R packages can be found in Robin et al. (2011).
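To make the two optimality criteria concrete, the following minimal Python sketch computes sensitivity and specificity at each candidate cut score and then selects (a) the cut score that maximizes the Youden index, J = sensitivity + specificity − 1, and (b) the cut score closest to the top-left corner of the ROC plot. It is an illustration only, not the pROC routine used in the study. The function names and the toy data are hypothetical, the candidate thresholds are simply the observed measures (which may differ from the thresholds that pROC evaluates), and an examinee is assumed to be placed in the higher category when the measure is at or above the cut score.

```python
def roc_points(scores, labels):
    """Return (cut, sensitivity, specificity) for every observed score.

    labels: 1 = higher category (e.g., B1), 0 = lower category (e.g., A2).
    Assumption: a measure at or above the cut is classified as the higher category.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = []
    for cut in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= cut)
        tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < cut)
        points.append((cut, tp / n_pos, tn / n_neg))
    return points


def youden_cut(points):
    """Cut score maximizing J = sensitivity + specificity - 1."""
    return max(points, key=lambda p: p[1] + p[2] - 1)


def closest_topleft_cut(points):
    """Cut score minimizing the squared distance to the point (sens = 1, spec = 1)."""
    return min(points, key=lambda p: (1 - p[1]) ** 2 + (1 - p[2]) ** 2)


def auc(scores, labels):
    """AUC as the share of (higher, lower) pairs ranked correctly; ties count half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical logit measures for ten prototypes (not data from the study).
scores = [-1.2, -0.8, -0.5, -0.1, 0.0, 0.3, 0.6, 0.9, 1.1, 1.4]
labels = [0, 0, 0, 0, 1, 1, 0, 1, 1, 1]

points = roc_points(scores, labels)
best_j = youden_cut(points)
best_tl = closest_topleft_cut(points)
area = auc(scores, labels)
```

In this toy example both criteria select the same cut score; as the results reported below show, the two criteria need not agree on real data.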
For the purposes of comparison, cut scores were determined additionally by using binary logistic regression (BLR; Hosmer et al., 2013). In this procedure, the dependent variable was the category membership of the examinee prototypes (e.g., A2 vs. B1), and the independent variable was the Rasch-based prototype proficiency measure. The BLR analyses were performed using the IBM SPSS Statistics program (Version 20). For more details on the BLR approach to setting cut scores within the context of the PGM, see Eckes (2010b, 2012).
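As a rough illustration of how a cut score can be read off a fitted binary logistic regression, the sketch below solves the regression equation for the proficiency measure at which the predicted probability of belonging to the higher category equals a chosen value. The source does not state the probability criterion here, so the value of .50 is an assumption, and the coefficients are hypothetical placeholders rather than estimates from the study.

```python
import math


def blr_probability(x, b0, b1):
    """Predicted probability of the higher category at measure x (in logits)."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))


def blr_cut_score(b0, b1, p=0.5):
    """Measure x at which the predicted probability equals p.

    Solves b0 + b1 * x = ln(p / (1 - p)); for p = .50 this reduces to -b0 / b1.
    """
    return (math.log(p / (1.0 - p)) - b0) / b1


# Hypothetical regression constant and coefficient (not values from Table 6).
b0, b1 = -0.5, 2.0
x_c = blr_cut_score(b0, b1)
```

With p = .50 the cut score depends only on the ratio of the two coefficients, which is one reason why it can shift noticeably when outlying cases change the fitted slope.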
Results
Descriptive sample statistics
Table 2 displays descriptive and Rasch statistics for each of the 20 trial samples.
Descriptive and Rasch statistics for 20 independent trial samples of the EFL placement test.
Note: I = number of texts (with 20 gaps each). Mo = mean observed (raw) score. The maximum score possible in Samples S01 through S12 was 200; in Samples S13 through S20 the maximum score possible was 120. Rasch statistics Ml through R refer to person measures. Ml = mean logit. SE = mean standard error. H = number of person strata. R = test reliability of person separation.
Across samples, the number of participants ranged from 131 to 207; the average sample size was about 165. Building on the two anchor texts, these samples were interconnected to provide the input to the concurrent Rasch analysis over all 3,310 examinees.
Separate Rasch analyses
WINSTEPS produced summary Rasch statistics as proposed by Wright and Masters (1982): The person separation index, from which the number of person strata index H was computed, and the test reliability of person separation R. In the present Rasch analyses, H indicated the number of statistically distinct classes of examinee proficiency as measured by the test in a given sample of examinees (Fisher, 1992; Wright & Masters, 2002). Note that H generally reflects the degree of a test’s measurement precision; that is, the higher the precision, the larger the number of examinee classes that the test can reliably distinguish. Therefore, these statistically derived classes should not be confused with proficiency levels defined in terms of the CEFR scale. Finally, R is another indicator of a test’s measurement precision that can be interpreted in a way analogous to Cronbach’s alpha (Linacre, 2015).
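The relation between these statistics can be sketched in a few lines of Python. The sketch assumes the standard Rasch definitions, in which the number of strata is H = (4G + 1)/3 and the separation reliability is R = G²/(1 + G²), with G denoting the person separation index. The function names and the input value are illustrative; the source does not report G.

```python
import math


def strata(separation):
    """Number of statistically distinct person strata, H = (4G + 1) / 3."""
    return (4.0 * separation + 1.0) / 3.0


def reliability(separation):
    """Person separation reliability, R = G^2 / (1 + G^2)."""
    return separation ** 2 / (1.0 + separation ** 2)


def separation_from_reliability(r):
    """Invert R = G^2 / (1 + G^2) to recover the separation index G."""
    return math.sqrt(r / (1.0 - r))


# Illustrative separation value (an assumption, not a figure from the study).
g = 4.62
h = strata(g)
r = reliability(g)
```

Because H grows linearly in G while R approaches 1 asymptotically, H differentiates between highly precise tests more clearly than R does.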
As shown in the right part of Table 2, the H values ranged from 6.0 to 8.5 in most samples. For example, the trial set used in Sample S07 had such a high measurement precision that eight-and-a-half classes of examinees could reliably be distinguished. The overall high measurement precision was also indicated by the person separation reliability, ranging from .89 to .97. Not surprisingly, owing to the smaller number of texts, both separation statistics assumed somewhat lower values in Samples S13 to S20.
In order to examine the psychometric quality of each individual text within samples, two commonly used mean-square fit statistics (Wright & Masters, 1982) were considered: an unweighted fit statistic (outfit) and a weighted fit statistic (infit). Both outfit and infit have an expected value of 1 and can range from 0 to infinity. Linacre (2015) suggested 0.50 as a lower-control limit and 1.50 as an upper-control limit. Other researchers suggested using a narrower range defined by a lower-control limit of 0.70 (or 0.75) and an upper-control limit of 1.30 (e.g., Bond & Fox, 2015; Boone, Staver, & Yale, 2014; Wright & Linacre, 1994).
Table 3 presents the frequencies of infit and outfit statistics for three different fit intervals, computed at the level of individual texts within samples.
Frequency of infit and outfit statistics of texts from 20 trial sets using different fit intervals.
Note: Infit and outfit are mean-square fit statistics. Twelve sets contained 10 texts, eight sets contained 6 texts.
There was only one text (out of the total of 168 texts) in which the analysis yielded infit and outfit statistics exceeding the upper-control limit of 1.50. Concerning the narrower (0.70–1.30) interval, the percentage of misfitting texts increased to 6% (infit); the outfit statistic identified a subset of these texts (5%). Finally, when the very strict (0.90–1.10) interval was applied, the percentage of misfitting texts was 19% (infit and outfit statistics identified the same texts).
In order to select texts for inclusion in the concurrent Rasch analysis, two criteria were employed. The first criterion was based on the results of the fit analysis discussed above. Given that the language test would be associated with low to medium stakes, the 1.30 upper-control limit was applied for infit and outfit statistics. This led to the elimination of 10 texts. The second criterion made use of the results from an analysis of differential item functioning (DIF; Linacre, 2015) related to (a) examinee gender and (b) region of origin (European vs. non-European examinees). Five gender-related DIF texts were identified and excluded from further analysis. Three of these texts were significantly more difficult for females than for males; the remaining two texts were significantly more difficult for males than for females. Expert review of item content did not suggest any unintended factor that could be hypothesized to account for the observed group differences in item difficulty. None of the texts showed DIF related to region of origin.
Concurrent Rasch analysis
Figure 1 displays the result of the concurrent analysis in the form of an examinee–item map. Note that the concurrent analysis built on the set of 115 different items (texts); this set resulted from (a) using the same two texts across the 20 samples for linking purposes (leaving 168 − 38 = 130 different texts) and (b) excluding 10 texts owing to model misfit, as well as 5 texts owing to gender-related DIF. Through this analysis, all examinees and texts were put on a common measurement scale, the logit scale, shown on the left-hand side. For ease of presentation, the scale was truncated at ± 4.0 logits.
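The item count can be retraced step by step. The short Python sketch below merely reproduces the arithmetic from the figures given in the text; the variable names are illustrative.

```python
# Twelve trial sets with 10 texts each and eight trial sets with 6 texts each.
administered = 12 * 10 + 8 * 6            # 168 text administrations in total

# The same two anchor texts appear in all 20 sets, so each anchor text is
# counted 19 times too often when distinct texts are tallied.
n_samples, n_anchor_texts = 20, 2
repeats = n_anchor_texts * (n_samples - 1)  # 38 repeated administrations

distinct = administered - repeats           # 130 different texts
misfit_excluded, dif_excluded = 10, 5
retained = distinct - misfit_excluded - dif_excluded  # 115 texts in the concurrent analysis
```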

Examinee–item map showing the cut score locations on the measurement scale (horizontal lines).
Immediately to the right of the logit scale, the locations of the examinees are shown. These locations correspond to the estimates of the proficiency measures. Each “#” in the examinee column stands for 10 examinees, and each dot stands for fewer than 10 examinees. The horizontal lines inserted in the examinee column indicate the location of the cut scores. Each logit value defines the boundary between two adjacent performance categories in terms of CEFR levels. Precisely how these boundaries were determined is explained later.
On the right-hand side, the locations of the items, corresponding to the item (or text) difficulty, are shown. Each “X” in the item column stands for two items, and each dot stands for one item. Along the line in the middle, markers summarize the distribution of examinee and item measures, respectively. An “M” marker represents the location of the mean measure, “S” markers are placed one sample standard deviation away from the mean, and “T” markers are placed two sample standard deviations away from the mean.
The Rasch summary statistics H and R confirmed that the present item pool was well suited to differentiate between examinees in terms of general EFL proficiency. Considering the total sample of examinees, the results were H = 6.49 and R = .96. As indicated by the H statistic, about six-and-a-half classes of examinees were reliably distinguished by the set of texts studied here. Thus, the measurement precision of the test was higher than the minimal requirement of H = 5.0 derived from the test’s purpose to place examinees into one of five proficiency levels.
Determining cut scores
Prototype categorizations
Table 4 presents descriptive statistics for the proficiency distributions of examinee prototypes (in logits) at each performance category considered by the experts. The majority of judgments were provided at levels B1 and B2; far fewer were provided at level A2. Not surprisingly, the levels that were represented by relatively small numbers of examinee prototypes were A1 and C1. Overall, 470 examinees (14.2%) were judged to be the best examples of their respective performance category.
Descriptive statistics for examinee prototype logit distributions.
Note: n = Number of examinee prototypes per category. Percentage values in the third column refer to the total sample of examinees (N = 3,310). Logit means for categories A1 to C1 differ significantly from one another (p < .05).
The mean logits showed a strictly monotonic increase from level A1 to level C1; mean logit differences between levels stayed within a narrow range of 0.41 to 0.75 logits. The correlation between (subjective) prototype categorizations and (objective) prototype measures was statistically highly significant and substantial, r(470) = .66, p < .001 (Spearman rank order correlation was the same). An analysis of variance on the logit values yielded a highly significant effect of the performance category, F(4, 465) = 92.73, p < .001, η² = .44. Finally, pair comparisons (Tukey HSD test) revealed that all logit mean differences were statistically significant (all p’s < .05).
ROC analysis
Results are reported separately for two related analyses: first, an analysis of the complete data set, comprising all prototypes and the associated proficiency estimates; second, an analysis of the same data set with extreme or outlying data points (“outliers”) deleted, where outliers were defined as prototypes showing strong deviations between expert categorizations and proficiency estimates. The rationale behind this was to examine the extent to which the results of the ROC analysis and, in particular, the results of the additional logistic regression analysis were affected by severe CEFR category–proficiency estimate differences (for a discussion of dealing with outliers in a regression context, see Myers, Well, & Lorch, 2010). Moreover, it was expected that the analysis of the data with outliers deleted would yield more reliable estimates of the desired cut scores.
Specifically, though the prototype categorizations were in fine overall agreement with proficiency measures, in quite a number of cases individual expert judgments may have been subject to stronger errors or biases such that, for example, a highly proficient examinee was categorized as a typical B2 or even B1 learner instead of a typical C1 learner. Conversely, low-proficient examinees may have been categorized at a higher level than actually warranted. In addition, experts provided judgments of learner prototypes under local conditions that were inevitably less standardized than in typical standard-setting situations where all panelists are brought together at the same location for one or two days. To compensate for the possible impact of these factors, in each original prototype logit distribution of adjacent level categories, the upper and lower 10% of examinee prototypes were removed from the data set. Table 5 presents the results for both ROC analyses: the first one with outliers deleted; the second one with outliers included.
ROC results for determining cut scores on the onSET-English.
Note: n = Number of examinee prototypes in the two categories considered. AUC = Area under the ROC curve. CI = 95% Confidence interval. J = Youden index. cJ = Cut score (logits) corresponding to the Youden index. q(cJ) = Sensitivity (associated with cut score cJ). p(cJ) = Specificity (associated with cut score cJ).
c* = Cut score (logits) corresponding to the closest-topleft index. PCTR(cJ) = Probability of a correct test result (predictive accuracy) based on cut score cJ. a There was another, equally optimal cut score cJ = –.880, with q(cJ) = .863 and p(cJ) = .550.
When outliers were deleted (upper part of Table 5), the AUC assumed values that were “acceptable” for the first comparison (i.e., A1 vs. A2) and “excellent” for the remaining three comparisons, using the AUC classification system suggested by Hosmer et al. (2013). As judged by the 95% confidence intervals, the AUC values were significantly greater than the .50 chance level for all comparisons.
The cut scores cJ (in logits) that corresponded to the Youden index (J) showed a clear progression across the proficiency levels. For each cut score, the specificity and sensitivity values confirmed that the onSET-English was highly effective at discriminating between learner prototypes belonging to adjacent CEFR categories, maximizing the overall rate of correct classification (e.g., correctly classifying B1 learners as B1) and minimizing the overall rate of misclassification (e.g., falsely classifying A2 learners as B1). For two comparisons (i.e., A1 vs. A2, B1 vs. B2), the cut scores corresponding to the closest-topleft index (c*) were identical to those corresponding to the Youden index. Different cut scores were obtained for the remaining two comparisons; the differences were .335 logits for A2 vs. B1, and .435 logits for B2 vs. C1. Finally, the PCTR values (cut scores based on the Youden index) indicated that the overall classification accuracy was satisfactorily high, reaching its maximum value for comparison B2 vs. C1, where the proportion of correctly classified examinees (i.e., true B2 learners and true C1 learners) was 85.6%.
In the lower part of Table 5, results are shown for the ROC analysis of the complete data set. For two comparisons, that is, A1 vs. A2 and B1 vs. B2, the AUC was “poor”; in the remaining two cases the AUC was “acceptable” (Hosmer et al., 2013). Overall, the discriminatory power was clearly reduced when outliers were included. This also was reflected in lowered values for the Youden index, resulting from lowered specificity and sensitivity. Still, all of the AUC values in this analysis were statistically significant, given the associated 95% CIs. Compared across the two kinds of analyses (with outliers deleted or included), the cut scores corresponding to the Youden index were identical (as were those corresponding to the closest-topleft index), although the sensitivity and specificity values were different. The reason for this was that by deleting outlying prototypes at both sides of the logit distributions, false negatives and false positives were similarly eliminated, increasing sensitivity and specificity but leaving the prevalence rates for each category largely unaffected. For example, referring to A2 vs. B1, the prevalence rate for B1 was 103/183 = .563 when outliers were deleted, and 114/203 = .562 when outliers were included. Also, not surprisingly, the PCTR values were much lower when outliers were included.
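The prevalence figures quoted for A2 vs. B1, and their role in the PCTR statistic, can be illustrated with a brief Python sketch. It assumes the usual definition of the probability of a correct test result as the prevalence-weighted average of sensitivity and specificity, with the higher category treated as the target group. The function names and the sensitivity and specificity inputs are hypothetical; only the prototype counts come from the text.

```python
def prevalence(n_higher, n_total):
    """Share of prototypes in the higher of the two adjacent categories."""
    return n_higher / n_total


def pctr(prev, sensitivity, specificity):
    """Probability of a correct test result.

    Assumed definition: prev * sensitivity + (1 - prev) * specificity.
    """
    return prev * sensitivity + (1.0 - prev) * specificity


# Counts reported in the text for A2 vs. B1.
prev_deleted = prevalence(103, 183)   # outliers deleted
prev_included = prevalence(114, 203)  # outliers included

# Hypothetical sensitivity/specificity pairs showing that, at nearly constant
# prevalence, PCTR falls when both rates fall.
pctr_high = pctr(prev_deleted, 0.85, 0.75)
pctr_low = pctr(prev_included, 0.75, 0.65)
```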
Figures 2 to 5 display the ROC curves for each comparison and also show the points that represent the respective cut scores (in logits). In each figure, the chance diagonal is indicated by a straight line running from lower left to upper right. Note that the ROC analysis for the comparison A1 vs. A2 (Figure 2) yielded two numerically slightly different yet equally optimal cut scores; classification decisions rested on the larger of these two cut scores owing to its agreement with the cut score suggested by the closest-topleft index (Table 5).

ROC curve for proficiency levels A1 vs. A2 (AUC = .751). Also shown are two equally optimal cut scores cJ(1) and cJ(2) (with their coordinates; i.e., specificity and sensitivity) corresponding to the Youden index J = .413.

ROC curve for proficiency levels A2 vs. B1 (AUC = .873). Also shown is the cut score cJ (with its coordinates; i.e., specificity and sensitivity) corresponding to the Youden index J = .551.

ROC curve for proficiency levels B1 vs. B2 (AUC = .803). Also shown is the cut score cJ (with its coordinates; i.e., specificity and sensitivity) corresponding to the Youden index J = .463.

ROC curve for proficiency levels B2 vs. C1 (AUC = .829). Also shown is the cut score cJ (with its coordinates; i.e., specificity and sensitivity) corresponding to the Youden index J = .557.
Turning to the results for the logistic regression analysis (Table 6), it is evident that the cut score estimates (second-to-last column) differed strongly across the analyses, depending on whether outliers were deleted or included. When the analysis was run with outliers included (lower part of Table 6), the cut scores were located farther away from one another; that is, differentiation between examinees near the important middle area of the proficiency distribution was lowered. Though all regression coefficients proved to be statistically significant, goodness-of-fit of the regression model (as indicated by the Nagelkerke R² index, a measure of explained variation in logistic regression) was unacceptably low for three comparisons; only when level A2 was compared to level B1 could model fit be considered just about acceptable.
Logistic regression results for determining cut scores on the onSET-English.
Note: n = Number of examinee prototypes in the two categories considered. b0 = Regression constant. b1 = Regression coefficient. SE = Standard error (regression coefficient). Fit = Nagelkerke R² index. xc = Cut score (logits). PCTR(xc) = Probability of a correct test result (predictive accuracy) based on cut score xc.
*p < .01. **p < .001.
Comparing the cut scores xc from the BLR procedure (with outlier deletion; upper part of Table 6) to the cut scores cJ from the ROC analysis (Table 5), there was a close correspondence: the maximum difference between the cJ and xc values amounted to 0.195 logits (A2 vs. B1), which was far less than the logit difference commonly considered critical in Rasch measurement contexts (i.e., 0.5 logits; Linacre, 2015). However, in terms of the classification accuracy as indicated by the PCTR statistic shown in the rightmost columns of Tables 5 and 6, it is clear that the ROC procedure outperformed the logistic regression analysis. Using the cut scores from the ROC analysis, higher accuracy was achieved in three of the four comparisons, with identical PCTR values resulting for one comparison, no matter if outliers were deleted or included. In light of these findings, it was decided to set the final cut scores by following the Youden index estimated in the ROC analysis. The location of the cut scores cJ is shown graphically in Figure 1.
Summary and discussion
Data input to the prototype group method (PGM) consists of the following: (a) judgments of experts about which examinees are best or prototypic examples of specified performance categories; and (b) responses of the same examinees to items on a test or an assessment instrument for which cut scores are desired. In the present study, the PGM was used to set cut scores on an EFL placement test building on a receiver operating characteristic (ROC) analysis. The research questions (RQs), and the associated answers derived from this study, were as follows.
RQ1: Do experts provide judgments of examinee prototypes that are sufficiently differentiating with respect to the intended CEFR proficiency levels A1 to C1? The analysis of the expert judgment data revealed that the experts clearly differentiated between examinee prototype categories and that these categorizations were sufficiently in line with the measures of proficiency independently estimated on the basis of examinee performance on the test. The correlation between category assignments and Rasch estimates of examinee proficiency was .66, and the mean logits of the examinee prototype distributions increased in a strictly monotonic fashion from the lowest CEFR level (A1) to the highest level (C1). These findings support the conclusion that the experts were able to categorize the majority of learner prototypes according to the CEFR proficiency scale.
RQ2: Is the measurement precision of the EFL placement test (onSET-English) high enough to separate examinees in a sufficiently great number of statistically distinct classes? It was of critical importance for a successful application of the PGM that the test’s measurement precision was high enough to distinguish at least five classes of examinees. Generally speaking, the number of examinee classes that a measurement instrument can reliably distinguish should be at least as high as the number of intended proficiency levels. Since the present EFL placement test was to sort examinees into one of five levels of language proficiency (i.e., below A2, A2, B1, B2, and C1), the H index should take on values of at least 5.0. As indicated by the number of examinee strata index, this requirement was met. For the total sample (N = 3,310), the number of statistically distinct classes of examinees (H = 6.49) was much greater than minimally required. In line with this result, the reliability of examinee separation was highly satisfactory (R = .96). Hence, for the present purposes, the overall measurement precision of the onSET-English could be considered as entirely sufficient.
RQ3: What is the overall classification accuracy of the onSET-English as indicated by ROC summary statistics? This study was the first one to implement the PGM in combination with a ROC approach. Within the present standard-setting context, the ROC methodology was deemed particularly useful because relevant ROC statistics like the area under the ROC curve (AUC), sensitivity, and specificity are known to be unaffected by changes in prevalence rates (Fawcett, 2006; Zhou et al., 2011). To ensure comparability of findings across data sets and statistical methods, both ROC and BLR analyses were performed, once deleting and once including outliers. When outliers were deleted, the values of the AUC statistic provided strong evidence of high classification accuracy. For each comparison between adjacent performance categories, the AUC statistic assumed values that were significantly higher than the .50 chance level. In terms of the evaluative system proposed by Hosmer et al. (2013), the AUC was excellent for three of the four comparisons and acceptable for one comparison (A1 vs. A2). These findings were confirmed by the PCTR statistic, which gives the probability of a correct test result (Zhou et al., 2011). PCTR values ranged from .73 (B1 vs. B2) to .86 (B2 vs. C1).
RQ4: What cut scores does the Youden index computed from the ROC analysis suggest? The Youden index suggested cut scores that remained the same irrespective of whether outliers were deleted or included in the ROC analysis. This finding attested to the robustness of the ROC-based cut scores. The cut scores (in logits) showed a clear progression from −0.86 (A1 vs. A2) to 1.56 (B2 vs. C1), placing 348 examinees (10.5%) into the lowest level (below A2), 450 examinees (13.6%) into A2, 1,010 examinees (30.5%) into B1, 1,178 examinees (35.6%) into B2, and 324 examinees (9.8%) into the highest level (C1).
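Once the four cut scores are fixed, placing an examinee amounts to locating the examinee's logit measure among them. The following Python sketch shows one way to do this. Only the lowest (−0.86) and highest (1.56) cut scores are taken from the text; the two middle values are placeholders chosen for illustration, and the rule that a measure exactly at a cut score falls into the higher level is likewise an assumption. The reported level counts are used to recompute the percentages.

```python
from bisect import bisect_right

LEVELS = ["below A2", "A2", "B1", "B2", "C1"]

# First and last cut scores as reported; the two middle ones are
# hypothetical placeholders, not the values set in the study.
CUTS = [-0.86, 0.00, 0.80, 1.56]


def classify(measure, cuts=CUTS, levels=LEVELS):
    """Return the level whose logit interval contains the measure.

    Assumption: a measure equal to a cut score is placed in the higher level.
    """
    return levels[bisect_right(cuts, measure)]


# Level counts as reported in the text.
reported_counts = {"below A2": 348, "A2": 450, "B1": 1010, "B2": 1178, "C1": 324}
total = sum(reported_counts.values())
shares = {level: round(100 * n / total, 1) for level, n in reported_counts.items()}
```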
RQ5: How does the ROC analysis compare to an analysis of the same data building on logistic regression? The cut scores suggested by the logistic regression analysis strongly varied depending on the way outliers were treated (i.e., deleted vs. included). The BLR cut scores showed a moderately high agreement with the ROC cut scores only when outliers were deleted. Moreover, the ROC cut scores were clearly superior to the BLR cut scores in terms of classification accuracy as measured by the PCTR statistic.
In previous work, three issues had been raised concerning the requirements posed by the PGM approach (Eckes, 2012). First, judges have to be subject matter experts qualified to determine examinees’ level of proficiency in the relevant domain. The domain in this study was EFL proficiency, and the basic system for judging examinee proficiency was taken from the CEFR – a system with which judges were highly familiar. Hence, when judges have to consider a group of examinees well known to them from their professional context and to simply identify those examinees that are prototypic examples of commonly used performance categories, meeting this requirement does not seem to be too difficult. Second, two different kinds of data need to be collected, analyzed separately, and finally related to one another to determine cut scores: prototype judgments and item responses. Note that within each sample, only one round of collecting these data is required; that is, both data sets are provided in the same testing session. Therefore, the PGM seems to be more feasible than most other examinee- or test-centered methods, considering particularly the conditions of trialing different test versions around the world over an extended period of time. Third, owing to its parametric estimation procedure, the BLR analysis demands relatively large sample sizes. This may pose real challenges for applying the PGM in contexts where such large samples are difficult to obtain (for a similar point, see Harsch & Hartig, 2015).
Whereas the prototype judgment and data collection issues apply to the PGM more generally, the sample size issue is alleviated to a considerable extent in the present proposal. As demonstrated by the measures of the test’s discriminatory power or classification accuracy, the combined PGM–ROC approach to standard setting worked well with relatively small samples, ranging from 120 to 241 examinee prototypes. In fact, this approach would have worked even with much smaller sample sizes. To illustrate, by means of statistics available in the pROC package (Robin et al., 2015), one can calculate what sample size is actually required for a particular AUC value, with additional specifications needed for Type I (α) and Type II (β) error rates, and the classification ratio (e.g., number of A2 relative to B1 examinees). Using conventional error rates, that is, α = .05, β = .10 (i.e., statistical power 1 – β = .90), and setting the classification ratio to 1 (i.e., about the same number of examinees assumed to belong to both groups), the minimally required total sample sizes (rounded to the next higher integer; AUC values from Table 5, upper part) were as low as 50, 20, 33, and 27 examinees, respectively (for a detailed discussion of sample size calculations, see Zhou et al., 2011).
Finally, in light of the highly informative and easy-to-interpret statistics provided, the ROC methodology may be reasonably used instead of, or at least in addition to, a regression analysis even when the sample size is large. A further advantage of the ROC approach not specifically addressed in this research, concerns the possibility of comparing ROC curves estimated for two or more different tests. Presenting a number of test-specific ROC curves side by side in a single graph provides the opportunity to evaluate the classification accuracy of a given test in direct relation to the other tests (Robin et al., 2011; Zhou et al., 2011).
In addition, the assumed costs of different types of classification error may be taken into account using specifically designed weighting schemes. As implemented in the present case, the Youden index gave equal weight to false positives and false negatives. In other cases, reducing the risk of false positives may be more important than reducing the risk of false negatives. An example of such a trade-off between classification errors was provided by Xi (2007). She studied the use of the TOEFL iBT Speaking test as a screening instrument for international teaching assistants. Here, false positives were judged to have a more serious impact than false negatives, and the optimal cut score was chosen to minimize false positives while yielding reasonably high sensitivity. Similarly, Papageorgiou and Cho (2014) argued that false positive classifications should be minimized in an ESL placement context, because these might lead to academic failure; false negative classifications would not appear to be as serious, because they can be corrected more easily. In situations like these, the utility of classification decisions becomes more important than their accuracy (Cizek & Bunch, 2007; Swets et al., 2000).
One option to incorporate utility considerations is to use the generalized Youden index (Robin et al., 2015; Perkins & Schisterman, 2006; Zhou et al., 2011). This index allows weighting sensitivity and specificity differently. A wide range of definitions of optimality are conceivable, each one reflecting the demands of the particular field of application. The R package OptimalCutpoints (López-Ratón, Rodríguez-Álvarez, Cadarso-Suárez, & Gude-Sampedro, 2014) offers no fewer than 34 criteria that take cost–benefit or utility aspects of different classification decisions into account.
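How unequal weighting can move the optimal cut score may be illustrated with a simple weighted variant of the Youden index, J_w = 2[w × sensitivity + (1 − w) × specificity] − 1, which reduces to the ordinary index at w = .50. This particular formulation is an assumption made for illustration; the cited packages may parameterize the weighting differently (e.g., through relative costs and prevalence). The candidate cut scores and their sensitivity and specificity values are hypothetical.

```python
def weighted_youden(sensitivity, specificity, w=0.5):
    """Weighted Youden index: 2 * (w * sens + (1 - w) * spec) - 1.

    w > .5 emphasizes sensitivity (fewer false negatives);
    w < .5 emphasizes specificity (fewer false positives).
    """
    return 2.0 * (w * sensitivity + (1.0 - w) * specificity) - 1.0


def best_cut(points, w=0.5):
    """Pick the (cut, sensitivity, specificity) triple with the largest weighted index."""
    return max(points, key=lambda p: weighted_youden(p[1], p[2], w))


# Hypothetical candidate cut scores with their sensitivity and specificity.
candidates = [
    (0.0, 1.0, 0.8),  # catches every higher-level learner, some false positives
    (0.9, 0.6, 1.0),  # no false positives, but misses more higher-level learners
]

equal_weight = best_cut(candidates, w=0.5)
fewer_false_positives = best_cut(candidates, w=0.2)
```

With equal weights the lower cut score is selected; when specificity is weighted more heavily, as in the screening and placement scenarios described above, the higher cut score is preferred.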
In this research, use of the PGM–ROC approach was illustrated with the example of an EFL placement test building on the C-test format. It goes without saying that the present methodology is not confined to C-tests. The combination of examinee prototype listing, Rasch modeling of examinee proficiency, and determining cut scores by means of ROC techniques, in particular by computing the Youden index, may be applied to a wide range of tests and assessments in the field of language testing and beyond.
Conclusion
In discussing the ROC-curve method within the context of over 60 different approaches to standard setting, Kaftandjieva (2010, p. 80) noted that “a disadvantage of this method . . . is the necessity of using more complex statistical methods and corresponding software in setting the cut scores;” she went on to conclude that “the ROC-curve method is more appropriate as a secondary method – for scientific research and the validation of already set cut scores – than as main method for setting cut scores.” In the present study, the ROC methodology was successfully combined with the prototype group method (PGM), providing strong evidence that the PGM–ROC approach does indeed qualify as a “main method” for standard setting. Moreover, the statistical methods underlying the ROC component of this approach are no more complex than the statistics needed in many prominent standard-setting methods such as the bookmark method, which is heavily rooted in item response theory.
The intuitively plausible notions of sensitivity, specificity, and classification accuracy along with the unique feature of visualizing the discriminatory power of tests make ROC analyses readily accessible and practically useful. Generally available R packages like the pROC package (Robin et al., 2011, 2015) offer many helpful functions, including various graphical output options, sample size calculations, and different optimality criteria for selecting cut scores. Thus, applying ROC methodology within the context of standard setting has much to recommend it.
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
