The Response Vector for Mastery Method of Standard Setting

Abstract

A new method of standard setting, referred to as the response vector for mastery (RVM) method, is proposed. Under the RVM method, the task of panelists who participate in the standard setting process does not involve conceptualization of a borderline examinee and probability judgments, as is the case with the Angoff and bookmark methods. Also, the RVM-based computation of a cut-score is not based on a single item (e.g., marked in an ordered item booklet) but, instead, on a response vector (1/0 scores) on items and their parameters calibrated in item response theory or under the recently developed D-scoring method. Illustrations with hypothetical and real-data scenarios of standard setting are provided and methodological aspects of the RVM method are discussed.

Keywords

standard setting, assessment, cut-score

Introduction

Standard setting is a complex process of establishing cut-scores on assessment scales to classify examinees into two groups (e.g., mastery/nonmastery) or more than two groups (e.g., below basic, basic, proficient, and advanced). The derivation of cut-scores is based on judgments of content experts (panelists) guided by the methodology of the selected standard setting method. There is a variety of standard-setting approaches, with the most popular to date being Angoff’s (1971) method (e.g., Clauser et al., 2006; Hambleton, 2001; Hambleton & Plake, 1995; Plake & Cizek, 2012) and the bookmark method (Lewis et al., 1999; Mitzel et al., 2001; see also Cizek et al., 2004; Karantonis & Sireci, 2006; Lin, 2006). Other methods of standard setting are, for example, the mapmark method (Schulz & Mitzel, 2005, 2011), item-mapping method (Wang, 2003), body of work method (Cizek & Bunch, 2007; Hambleton & Pitoniak, 2006; Wyse et al., 2014), item-descriptor (ID) matching method (Ferrara et al., 2008; Ferrara & Lewis, 2012), benchmark method (Phillips, 2012), and others (e.g., Zwick et al., 2001).

The ongoing efforts to improve existing methods of standard setting and develop new ones are motivated, in general, by the need to address problems related to the consistency and accuracy of cut-scores produced by such methods and the validity of examinees’ classifications into targeted levels of performance. Furthermore, there is no single (“best”) approach to setting standards for a variety of assessment scenarios and policy guidelines. Without engaging in a comprehensive overview of existing methods of standard setting and related problems, provided next are a brief description of, and comments on, the widely used Angoff (1971) and bookmark methods, as well as the relatively new ID matching method, to highlight issues addressed with the proposed response vector for mastery (RVM) method of standard setting.

Angoff Method

Under the Angoff (1971) method and its modifications, content experts (panelists) are required to conceptualize the minimally proficient (borderline) examinee, that is, the examinee whose proficiency level is just high enough to justify a given classification (e.g., see Cizek & Bunch, 2007). Then the panelists are asked to review each test item and estimate the probability with which the borderline examinee would answer the item correctly. The sum of such probabilities over the test items is an estimate of the true score for the borderline examinee, which is then mapped on the test characteristic curve to obtain the cut-score (“theta”) on the item response theory (IRT) logit scale. The main problems with this method are (a) the panelists’ error-prone conceptualizations of a borderline examinee and (b) the probability judgments for a correct response on each test item by the borderline examinee. As noted by Berk (1986), “judges have the sense that they are pulling the probabilities from thin air” (p. 147) (see Chang, 1999; Ricker, 2006; van der Linden, 1982).
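The mapping from summed probability judgments to a cut-score can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the items are taken to follow a 2PL model in the logistic metric, and the item parameters, the ratings, and the function names (p_2pl, tcc, angoff_cut) are hypothetical, not taken from any study cited here.

```python
import math

def p_2pl(theta, a, b):
    """Probability of a correct response under an assumed 2PL item."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def tcc(theta, items):
    """Test characteristic curve: expected number-correct score at theta."""
    return sum(p_2pl(theta, a, b) for a, b in items)

def angoff_cut(ratings, items, lo=-6.0, hi=6.0, tol=1e-8):
    """Invert the TCC by bisection at the summed Angoff ratings."""
    target = sum(ratings)  # estimated true score of the borderline examinee
    if not tcc(lo, items) < target < tcc(hi, items):
        raise ValueError("target true score lies outside the TCC range on [lo, hi]")
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if tcc(mid, items) < target:
            lo = mid  # the TCC is increasing, so the solution lies above mid
        else:
            hi = mid
    return (lo + hi) / 2.0

# Hypothetical (a, b) parameters and panel-average ratings for five items.
items = [(1.0, -1.0), (0.8, -0.5), (1.2, 0.0), (0.9, 0.5), (1.1, 1.0)]
ratings = [0.80, 0.70, 0.55, 0.45, 0.30]
theta_cut = angoff_cut(ratings, items)
```

Bisection is used because the test characteristic curve is monotonically increasing when all discriminations are positive, so the summed rating corresponds to exactly one point on the logit scale.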

Bookmark Method

Under the bookmark method, the test items are sorted by increasing difficulty in an ordered item booklet (OIB), and the panelists are asked to place a bookmark at the point between the items in the booklet at which the probability of a correct response by the borderline examinee drops below a prespecified value referred to as the response probability (RP). Most frequently used is RP = .67 (denoted RP67), that is, a 67% (or 2/3) chance of a correct item response (e.g., Huynh, 1998, 2006), but other RPs (e.g., .50 and .80) have also been used (e.g., Beretvas, 2004; Wang, 2003). The cut-score is the point on the IRT scale that corresponds to the selected RP of a correct response for the item located just before the bookmark (in some cases, the cut-score is set equal to the midpoint between the bookmarked item and the previous item).
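As a worked illustration, assume that the items are calibrated under a 2PL model in the logistic metric with discrimination $a_{i}$ and location $b_{i}$ (an assumption made here for concreteness; the calibration model is not fixed by the bookmark method itself). Solving the item response function for the scale point at which item i is answered correctly with probability RP gives

\theta_{RP} = b_{i} + \frac{1}{a_{i}} \ln \frac{RP}{1 - RP} .

For RP = 1/2 the logarithm is zero and the location equals $b_{i}$; for RP = 2/3 the logarithm is ln 2 ≈ 0.693, so the location moves up by about 0.693/$a_{i}$ logits. Because this shift depends on $a_{i}$, items with different discriminations can change their order in the OIB when the RP changes.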

Although the bookmark method is considered a better alternative to Angoff’s (1971) method and its variants, there are serious doubts about the conceptualization of key concepts and understanding of the bookmark procedure by participating panelists (e.g., Baldwin, 2018; Davis-Becker et al., 2011; Ferrara & Lewis, 2012; Lewis et al., 1999; Lewis et al., 2012; Skaggs & Tessema, 2001; Williams & Schulz, 2005; Zieky, 2001). The main problems with the bookmark method relate to the conceptualization of the borderline examinee, the choice of an RP value, the probability judgment for placing the bookmark, item disordinality, and a restricted focus on item difficulty. Provided next are brief details in this regard.

Conceptualizing the Borderline Examinee

As with the Angoff method, a major validity hurdle with the bookmark method is the proper and consistent conceptualization of the borderline examinee by the panelists. There is no persuasive research evidence on the panelists’ ability to create a valid mental model of the borderline examinee at the training stage (or other rounds) of the bookmark procedure. The bookmark approach to this task is challenged by researchers seeking alternative solutions, such as the ID matching method (Ferrara et al., 2008; Ferrara & Lewis, 2012).

The RP Choice

Research on the bookmark method shows that the choice of RP values (e.g., .67, .50, or .80) systematically affects the resulting cut-score (e.g., Baldwin, 2018; Baldwin et al., 2019; Beretvas, 2004; Hauser et al., 2005; Lewis et al., 2012; Williams & Schulz, 2005; Wyse, 2011). For example, in a study using three RP values (1/2, 2/3, and 4/5), Beretvas (2004) found that the ordering of the bookmark difficulty locations changes depending on the RP used. In a different study, investigating the destabilizing role of the RP choice, Baldwin (2018) noted that “the implications of these findings are alarming—after all, if panelists are unable to adjust their judgments to reflect the choice of RP, what do their judgments actually mean?” (p. 483). Also, based on rigorous analytic derivations in that study, he demonstrated that “the often-repeated claim that the .67 [RP] value corresponds with the maximum information for a correct response, which is believed to be beneficial in some way, is mistaken” (Baldwin, 2018, p. 481).

Item Disordinality

The term item disordinality refers to the disagreement among bookmark panelists on the order of items within the OIB (e.g., Lewis & Green, 1997; Skaggs & Tessema, 2001). Such a disagreement typically occurs when the panelists differ in education curricula and/or judgments on item difficulty. Item disordinality is particularly problematic when it occurs near the cut-scores produced by individual panelists. As noted by Lewis and Green (1997), item disordinality issues arise in virtually all applications of the bookmark method. In another study, Davis-Becker et al. (2011) compared cut-score results of experts using OIBs with results of experts placing bookmarks in test forms where the items were randomly ordered by difficulty, and they found similar recommendations on cut-scores under both conditions.

Narrow Focus on Item Difficulty

The bookmark panelists focus on item difficulty to mark an item within the OIB. As noted by Zieky (2001), “this does not allow participants to distinguish purposefully among the items above the bookmark, or among the items below the bookmark on the basis of importance, curricular relevance, or necessity for performance on the job” (p. 35). This issue, not fully addressed in the extant research on the bookmark method, has a negative effect on the substantive meaning of the cut-score; that is, the cut-score does not adequately reflect the substantive structure of the dimension(s) measured by the test.

ID Matching Method

The ID matching method (Ferrara et al., 2008; Ferrara & Lewis, 2012; see also Cizek & Bunch, 2007) involves three key elements―OIB, item response demands (IRDs), and performance level descriptors (PLDs). Specifically, (a) the OIB contains all test items sorted from the easiest to the most difficult based on IRT scale location, just like under the bookmark method; (b) the IRDs of an item represent the content knowledge, skills, and cognitive processes required by the item; and (c) the PLDs describe the knowledge and skills that the examinees in a particular performance level are expected to be able to demonstrate (Ferrara et al., 2008; Lewis & Green, 1997; Perie, 2008).

Under the ID matching, the panelists match the IRDs of each item and the PLDs. This results in (a) one sequence of items that most closely match the PLDs in a given performance level, (b) another sequence of items that match the PLDs of the next (higher) performance level, and (c) a “threshold region” with items that do not match clearly either of the PLDs of the two adjacent performance levels. The cut-score is located between the scale values of the first and last items in the threshold region. Typically, the cut-score is obtained (a) by asking the panelists to identify the first item in the sequence of items in the threshold region whose IRDs match more closely the PLDs of the higher performance level and use the scale value of this item as a cut-score or (b) computing the cut-score as the midpoint between the scale locations of the first and last item in the threshold region (e.g., Ferrara & Lewis, 2012).

Unlike the bookmark method, the ID matching does not require panelists to conceptualize a borderline examinee and make probability judgments (e.g., using RP67). As stated by Ferrara et al. (2008),

This simplifies the cognitive complexity of the panelists’ judgmental task, relative to the Bookmark method. In ID Matching, panelists can focus on matching the knowledge and skill requirements of each item to the knowledge and skills articulated in performance level descriptors. (p. 2)

Response Vector for Mastery Method of Standard Setting

Motivation of the RVM Method

Along with the advantages of the ID matching over the bookmark, some problems remain under both methods. Specifically, the location of cut-scores on the IRT scale under both the bookmark and ID matching methods is affected by (a) the RP-based item ordering in the OIB, as described earlier and (b) their dependence on a single test item, whereas the estimation of ability scores in IRT is based on the likelihood of response vectors of binary (1 or 0) scores on all test items.

The proposed RVM method is designed to (a) free the panelists from psychometric conceptualizations and judgments, (b) avoid the use of an OIB and related problems, (c) produce cut-scores based on response vectors of item scores (1/0), instead of using a single OIB item, and (d) produce cut-scores on the IRT scale and (if needed) on the D-scale of a recently developed “delta-scoring” (D-scoring) method (DSM; Dimitrov, 2016, 2018, 2020; Dimitrov & Atanasov, 2021). The DSM is used, for example, with large-scale assessments by the National Center for Assessment (NCA) in Saudi Arabia. The proposed RVM method has already been used for deriving cut-scores on the D-scale for teacher licensure tests (Dimitrov & Alsadaawi, 2019) and multiple cognitive ability tests (Dimitrov et al., 2020) in Saudi Arabia.

The remainder of this article is organized as follows. First, provided is a brief description of the DSM and its D-scale. Second, the proposed RVM method of standard setting is described and illustrated with computations of cut-scores on the D-scale and the IRT logit scale. Third, summary comments, limitations, and recommendations for future research on the RVM method are provided in the discussion part.

D-Scoring Method

The DSM is developed in a classical framework (DSM-C; Dimitrov, 2016, 2018, 2020) and a latent framework (DSM-L; Dimitrov & Atanasov, 2021). The DSM-C and DSM-L share the same D-scale, which ranges from 0 to 1, with D = 0 when all test items are answered incorrectly and D = 1 when all items are answered correctly. Also, the DSM-C and DSM-L share the same analytic expression for item response functions (IRFs) on the D-scale, but differ in their approaches to the estimation of item and person parameters.

IRF on the D-Scale

A key feature of the DSM (classical and latent versions) is the two-parameter rational function model (RFM2) which is used to obtain IRFs on the D-scale (Dimitrov, 2020). The analytic form of RFM2 is

P = \frac{1}{1 + {[\frac{b (1 - D)}{D (1 - b)}]}^{s}},

(1)

where P is the probability of correct item response, D is the person’s test score on the D-scale (from 0 to 1), b is the item location on the D-scale (i.e., the location where the probability of correct item response is 0.5), and s is a fit parameter for shape. The item discrimination, a, is obtained “post hoc” as a function of the parameters b and s as follows (see Dimitrov, 2020):

a = \frac{s}{4 b (1 - b)} .

(2)

The RFM2 model can be reduced to a one-parameter model (RFM1) by fixing s = 1 in Equation 1 or extended to a three-parameter model (RFM3) by introducing a pseudo-guessing parameter, c, as is done in IRT (e.g., Hambleton et al., 1991).
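Equations 1 and 2 translate directly into code. The following Python sketch is offered only as an illustration; the function names (rfm2_prob, rfm2_discrimination) and the example values are hypothetical and are not part of any DSM software.

```python
def rfm2_prob(D, b, s):
    """Equation 1: probability of a correct response at D-score D (0 < D < 1)."""
    if not (0.0 < D < 1.0 and 0.0 < b < 1.0):
        raise ValueError("D and b must lie strictly between 0 and 1")
    ratio = (b * (1.0 - D)) / (D * (1.0 - b))
    return 1.0 / (1.0 + ratio ** s)

def rfm2_discrimination(b, s):
    """Equation 2: item discrimination a as a function of b and s."""
    return s / (4.0 * b * (1.0 - b))

# Illustrative values only: an item located at b = 0.50 with shape s = 1.22.
p_at_location = rfm2_prob(0.50, 0.50, 1.22)  # D equals b, so P is 0.5
a = rfm2_discrimination(0.50, 1.22)          # 1.22 / (4 * 0.5 * 0.5)
```

The check at D = b reflects the definition of the location parameter: the ratio in Equation 1 equals 1 there, so the probability of a correct response is 0.5 whatever the value of s.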

DSM-C

Under the DSM-C, Equation 1 is treated as a classical nonlinear regression where the person’s score D (computed a priori) is used as a predictor of the item score (1 or 0), whereas the item parameters b and s are estimated as regression coefficients. Specifically, the D-score is based on the examinee’s response vector of (1/0) item scores weighted by the expected item difficulties, $δ_{i}$ (“delta”; hence the name “delta-scoring” method, DSM). It should be noted that $δ_{i}$ = 1 − $π_{i}$ , where $π_{i}$ is the expected “easiness” of the item (i.e., the expected proportion of correct item responses for a targeted population of examinees). Once the $δ_{i}$ values are estimated for a test of n binary items (e.g., via bootstrapping; Efron, 1979), the D-score of person s on the test is computed as follows:

D_{s} = \frac{\sum_{i = 1}^{n} X_{si} δ_{i}}{\sum_{i = 1}^{n} δ_{i}},

(3)

where $X_{si}$ is the score (1/0) of person s on item i. Clearly, $0 \leq D_{s} \leq 1$, indicating what proportion of the ability needed for total success on the test is demonstrated by the examinee. For ease of interpretation in test reports, the D-scores are multiplied by 100 to range from 0 to 100.
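Equation 3 is a weighted proportion, which the following minimal Python sketch makes explicit. The function name d_score and the three δ values are hypothetical and serve only to show the arithmetic.

```python
def d_score(responses, deltas):
    """Equation 3: sum of deltas for correctly answered items over the sum of all deltas."""
    if len(responses) != len(deltas):
        raise ValueError("responses and deltas must have the same length")
    total = sum(deltas)
    if total <= 0:
        raise ValueError("the expected item difficulties must sum to a positive value")
    return sum(x * d for x, d in zip(responses, deltas)) / total

# Hypothetical expected difficulties for a three-item test.
deltas = [0.2, 0.5, 0.8]
d = d_score([1, 0, 1], deltas)  # (0.2 + 0.8) / (0.2 + 0.5 + 0.8)
```

Because each correct response is weighted by the expected difficulty of its item, two examinees with the same number of correct answers can receive different D-scores.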

DSM-L

Under the latent DSM-L, Equation 1 is treated as a latent (IRT-like) model where the D-scores are not known in advance but, instead, estimated along with the item parameters b and s using a maximum-likelihood method (e.g., Dimitrov & Atanasov, 2021) or other methods, such as the Markov Chain Monte Carlo estimation method (e.g., Sheng, 2008).

As a side note, as shown by Robitzsch (2021), the latent RFM2 model in Equation 1 is theoretically equivalent (at the population level) to the two-parameter logistic (2PL) model in IRT (e.g., Hambleton et al., 1991) bounded to the scale interval [0, 1]. At the sample level, however, the 2PL realization of the RFM2 is complicated by practically inconvenient restrictions.
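One way to see this correspondence, given here as a sketch and not as a reproduction of Robitzsch’s (2021) argument, is to rewrite the power term in Equation 1 as an exponential of logits:

P = \frac{1}{1 + \exp \{ - s [ \ln \frac{D}{1 - D} - \ln \frac{b}{1 - b} ] \}} .

This is the 2PL form with ability ln[D/(1 − D)], discrimination s, and difficulty ln[b/(1 − b)]; the logit transformation maps the open interval (0, 1) of the D-scale onto the whole real line.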

RVM Method of Standard Setting

The RVM method is aligned with the conception according to which, as stated by Ferrara and Lewis (2012), “panelists are expected to develop a shared understanding of the cognitive requirements associated with a given method’s cognitive judgmental task so that each panelist can make independent judgments with a common frame of reference” (p. 263).

Under the RVM method, the cognitive judgmental task for the panelists is not performed for ordered items in an OIB, as is done under the bookmark and ID matching methods, but, instead, for items grouped into response vector units (RVUs) on a substantive basis.

Response Vector Units

Along with avoiding issues with the OIB, described earlier, grouping the items into RVUs preserves the substantive structure of the test. This is important because the items of standardized tests are usually grouped into domains, subdomains, and so on. For example, the items of an operational test, used for licensure of teachers in Saudi Arabia, are associated with three content domains and further grouped by teaching standards within each domain. To illustrate, the items associated with the domain “professional practice” are grouped into three teaching standards―(a) planning and implementing teaching, (b) creating interactive and supportive learning environments for learners, and (c) assessment. In this case, 10 teaching standards in the content domains were used as RVUs in RVM-based computations of cut-scores for mastery by standards, domains, and the entire test (Dimitrov & Alsadaawi, 2019).

RVM Cognitive Judgmental Task

For each RVU, the panelists are asked to mark the items that they consider sufficient (if answered correctly) for mastering the respective unit. This task requires matching IRDs and PLDs, as in the ID matching method, but the matching is performed on substantively grouped items in an RVU instead of individual items in an OIB. Of course, the validity of an RVM produced by panelists depends on the clarity and completeness of the IRDs and PLDs, a critical condition for the quality of any standard setting method (e.g., see Egan et al., 2009; Ferrara et al., 2009; Mills & Jaeger, 1998; Skorupski & Hambleton, 2005). The panelists are also provided with the difficulty of each item as auxiliary information for their work on producing RVMs by unit.

As an example, the Arabic language test (ALT), developed and administered by the NCA in Saudi Arabia, consists of 80 binary items grouped into four content domains as follows: reading comprehension, rhetoric expression, structure, and writing accuracy. Each domain is divided into two subdomains, with the items in each subdomain grouped by subdomain elements. In a workshop with panelists for standard setting on the ALT, the subdomain elements were used as RVUs. This is illustrated in Table 1 for reading comprehension, with two subdomains (explicit and implicit) and five subdomain elements used as RVUs. The selected items (“circled” in Table 1) were identified by the panelists as sufficient to evidence “mastery” of the respective RVU, which resulted in an RVM of reading comprehension.

Table 1.

Specification Table for Panelists to Provide a Response Vector for Mastery (RVM) of the Reading Comprehension Domain of the ALT.

Subdomain	Subdomain element (RVU)	Item difficulty ($δ$)	RVM
Explicit	Word meaning	0.674	0
		0.511	1
	Interpret phrases and paragraphs	0.738	0
		0.526	1
		0.217	1
		0.812	0
		0.643	0
		0.386	1
	Interpret the whole text	0.405	0
		0.335	1
		0.579	1
		0.515	1
Implicit	Implicit ideas	0.367	1
		0.791	0
		0.254	1
		0.520	0
		0.249	1
	Determine consequences	0.641	0
		0.593	1
		0.431	1

Note. Working in “steps” by subdomain elements selected as RVUs, the panelists are required to “circle” the items that (if answered correctly) are sufficient to evidence “mastery” of the respective unit. The selected items are scored as 1 and unselected items as 0 in the response vector for “mastery” of the respective unit, which results in an RVM for the reading comprehension domain. ALT = Arabic language test; RVU = response vector unit.

This RVM, along with the RVMs for the other three domains produced in the same way, resulted in an RVM for the entire test (not shown here for space considerations; e.g., Dimitrov & Al-Shamrani, 2019).

RVM-Based Computation of Cut-Scores

The cut-score based on an RVM can be placed on (a) the D-scale of the DSM or (b) the IRT logit scale. In either case, the computation is based on the RVM and item parameters (known in advance). Illustrated next is the computation of cut-scores in three cases where the examinees’ test scores are obtained using (a) Equation 3 under the DSM-C, (b) the two-parameter IRF model (RFM2) under the DSM-L (see Equation 1), and (c) the 2PL model in IRT (e.g., Hambleton et al., 1991). In all three cases, item parameters are estimated for simulated data on a test of 20 binary items and 3,000 persons. The items are grouped into four RVUs (e.g., content domains) with a hypothetical response vector (RVM) for each unit and a resulting RVM for the entire test.

Case 1

Under the DSM-C, expected item difficulties, $δ_{i}$ (i = 1, . . ., 20), were estimated via bootstrapping using a computer program for DSM called DELTA (Atanasov & Dimitrov, 2019), but popular statistical packages, such as R, SPSS, and Stata, also provide bootstrap estimation. The $δ_{i}$ values and the hypothetical RVMs are provided in Table 2. Note that in a real standard setting scenario the panelists are supposed to “circle” the items that they believe are sufficient (if answered correctly) for mastery of the respective RVU. The resulting RVMs of the four test units (RVU1, . . ., RVU4) directly render the RVM for the entire test. Then, the cut-scores for each RVM (and eventually for the entire test) are computed via Equation 3 using the response vector (sequence of 1/0 item scores) and the expected item difficulties, $δ_{i}$ . The resulting cut-score on the D-scale for the entire test ( $D_{c}$ = 0.612) is used for a decision on mastery classification of examinees, whereas the cut-scores by RVU provide diagnostic feedback on their performance by content domain.

Table 2.

Computing Cut-Scores for “Mastery” on the D-Scale: A Hypothetical Test of 20 Items Grouped in Four Response Vector Units (RVUs).

RVU	Item	Response vector for mastery (RVM)	$δ$	Computation of the cut-score ($D_{c}$) corresponding to the RVM	D cut-score
RVU1	①	1	0.8333	$D_{c 1} = \frac{δ_{1} + δ_{3} + δ_{4} + δ_{6}}{δ_{1} + δ_{2} + δ_{3} + δ_{4} + δ_{5} + δ_{6}}$	0.652
	2	0	0.4633
	③	1	0.7300
	④	1	0.4000
	5	0	0.9733
	⑥	1	0.7233
RVU2	7	0	0.1633	$D_{c 2} = \frac{δ_{8} + δ_{9} + δ_{11}}{δ_{7} + δ_{8} + δ_{9} + δ_{10} + δ_{11} + δ_{12}}$	0.604
	⑧	1	0.4833
	⑨	1	0.5400
	10	0	0.2700
	⑪	1	0.6600
	12	0	0.6700
RVU3	⑬	1	0.2867	$D_{c 3} = \frac{δ_{13} + δ_{15} + δ_{17}}{δ_{13} + δ_{14} + δ_{15} + δ_{16} + δ_{17}}$	0.495
	14	0	0.5600
	⑮	1	0.5300
	16	0	0.6267
	⑰	1	0.3467
RVU4	⑱	1	0.6333	$D_{c 4} = \frac{δ_{18} + δ_{20}}{δ_{18} + δ_{19} + δ_{20}}$	0.677
	19	0	0.6433
	⑳	1	0.7167
Total test	$D_{c} = \frac{δ_{1} + δ_{3} + δ_{4} + δ_{6} + δ_{8} + δ_{9} + δ_{11} + δ_{13} + δ_{15} + δ_{17} + δ_{18} + δ_{20}}{δ_{1} + δ_{2} + \dots + δ_{19} + δ_{20}}$				0.612

Notes. δ = expected item difficulty (for a target population of examinees). For each RVU, it is assumed that panelists have come to agreement that answering correctly the “circled” items is sufficient for mastery of the RVU. Using Equation 3, cut-scores on the D-scale are computed for each RVU (e.g., for diagnostic feedback) and then on the entire test (for final “mastery” classification).
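The cut-scores in Table 2 can be checked by applying Equation 3 to the tabled values. The Python sketch below copies the δ values and RVM entries from Table 2; the variable and function names (deltas, rvm, units, d_cut, cuts) are illustrative and are not part of the DELTA program.

```python
# Expected item difficulties (delta) copied from Table 2, items 1 to 20.
deltas = [0.8333, 0.4633, 0.7300, 0.4000, 0.9733, 0.7233,  # RVU1: items 1-6
          0.1633, 0.4833, 0.5400, 0.2700, 0.6600, 0.6700,  # RVU2: items 7-12
          0.2867, 0.5600, 0.5300, 0.6267, 0.3467,          # RVU3: items 13-17
          0.6333, 0.6433, 0.7167]                          # RVU4: items 18-20

# Response vector for mastery: 1 = "circled" item, 0 = not selected.
rvm = [1, 0, 1, 1, 0, 1,
       0, 1, 1, 0, 1, 0,
       1, 0, 1, 0, 1,
       1, 0, 1]

# Zero-based index ranges of the items that belong to each unit.
units = {"RVU1": range(0, 6), "RVU2": range(6, 12),
         "RVU3": range(12, 17), "RVU4": range(17, 20)}

def d_cut(indices):
    """Equation 3 restricted to the given items, with the RVM as the response vector."""
    indices = list(indices)
    numerator = sum(rvm[i] * deltas[i] for i in indices)
    denominator = sum(deltas[i] for i in indices)
    return numerator / denominator

cuts = {name: d_cut(idx) for name, idx in units.items()}
cuts["Total"] = d_cut(range(len(deltas)))

for name, value in cuts.items():
    print(f"{name}: {value:.3f}")
```

Note that the total-test cut-score is not the average of the unit cut-scores: units with larger summed δ values carry more weight in the denominator of Equation 3.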

Case 2

Under the DSM-L, the item parameters for location, b, and shape, s, under the RFM2 model (see Equation 1) were estimated using a maximum-likelihood estimation (MLE) approach (e.g., Dimitrov & Atanasov, 2021) implemented in the computer program DELTA. The estimates of b and s, along with the hypothetical RVMs, are provided in Table 3. The cut-score on the D-scale ( $D_{c}$ = 0.566) was obtained via MLE using the RVM and the values of b and s shown in Table 3. The MLE solution (0.566) is graphically depicted in Figure 1.

Table 3.

Response Vector for Mastery (RVM) and Estimates of Item Parameters Under the Two-Parameter Models in the DSM-L (b and s) and IRT (a and b) for 20 Simulated Items.

Item	RVM	DSM-L (RFM2)		IRT (2PL)
		b	s	a	b
1	1	0.66	1.50	0.69	2.48
2	0	0.33	1.03	0.60	−0.18
3	1	0.59	1.25	0.53	2.03
4	1	0.25	0.92	0.56	−0.80
5	0	0.89	1.46	0.61	5.88
6	1	0.56	1.31	0.58	1.69
7	0	0.04	0.71	0.67	−2.75
8	1	0.34	1.11	0.66	−0.12
9	1	0.39	1.21	0.68	0.23
10	0	0.13	0.83	0.62	−1.78
11	1	0.50	1.22	0.59	1.11
12	0	0.52	1.35	0.66	1.21
13	1	0.15	0.86	0.63	−1.53
14	0	0.41	1.21	0.66	0.40
15	1	0.39	1.14	0.63	0.25
16	0	0.48	1.19	0.56	0.97
17	1	0.20	0.92	0.59	−1.22
18	1	0.47	1.26	0.61	0.88
19	0	0.48	1.42	0.76	0.88
20	1	0.57	1.21	0.50	1.87
Cut-score		D = 0.566		θ = 0.651

Note. The cut-score on the D-scale is D = 0.566 and that on the IRT logit scale is θ = 0.651. As a side note, the correlation between the estimates of the location (difficulty) parameter b, obtained under the DSM-L and IRT calibrations, is 0.987. DSM-L = D-scoring method latent; RFM2 = two-parameter rational function model; IRT = item response theory; 2PL = two-parameter logistic.

Figure 1.

The cut-score (0.566) on the D-scale estimated for the RVM and two item parameters (b and s) of 20 simulated items (see Table 3) via MLE under the RFM2 model of the DSM-L.
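The MLE step for the D-scale cut-score can be sketched as follows. The RVM is treated as an observed response vector, the log-likelihood is built from Equation 1 with the b and s values in Table 3, and the maximum is located by a golden-section search. The search routine and all names are illustrative assumptions; the DELTA program may use a different algorithm, and because Table 3 reports parameters rounded to two decimals, the result should be expected to agree with the reported 0.566 only approximately.

```python
import math

# (RVM entry, b, s) for items 1 to 20, copied from Table 3 (rounded values).
items = [(1, 0.66, 1.50), (0, 0.33, 1.03), (1, 0.59, 1.25), (1, 0.25, 0.92),
         (0, 0.89, 1.46), (1, 0.56, 1.31), (0, 0.04, 0.71), (1, 0.34, 1.11),
         (1, 0.39, 1.21), (0, 0.13, 0.83), (1, 0.50, 1.22), (0, 0.52, 1.35),
         (1, 0.15, 0.86), (0, 0.41, 1.21), (1, 0.39, 1.14), (0, 0.48, 1.19),
         (1, 0.20, 0.92), (1, 0.47, 1.26), (0, 0.48, 1.42), (1, 0.57, 1.21)]

def log_likelihood(D):
    """Log-likelihood of the RVM at D-score D under the RFM2 model (Equation 1)."""
    total = 0.0
    for x, b, s in items:
        ratio = (b * (1.0 - D)) / (D * (1.0 - b))
        p = 1.0 / (1.0 + ratio ** s)
        total += math.log(p) if x == 1 else math.log(1.0 - p)
    return total

def golden_section_max(f, lo=0.001, hi=0.999, tol=1e-7):
    """Locate the maximum of a unimodal function on [lo, hi]."""
    g = (math.sqrt(5.0) - 1.0) / 2.0
    while hi - lo > tol:
        c = hi - g * (hi - lo)
        d = lo + g * (hi - lo)
        if f(c) > f(d):
            hi = d  # the maximum lies in [lo, d]
        else:
            lo = c  # the maximum lies in [c, hi]
    return (lo + hi) / 2.0

D_c = golden_section_max(log_likelihood)
print(f"D-scale cut-score (MLE): {D_c:.3f}")
```

A one-dimensional search is sufficient here because the RVM contains both 1s and 0s and the log-likelihood is concave in the logit of D, so it has a single interior maximum on the D-scale.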

Case 3

In the IRT framework, the item parameters were estimated under the 2PL model using the computer program IRTPRO. The estimates of item parameters for discrimination, a, and location (model-based difficulty), b, are also provided in Table 3. The cut-score on the IRT logit scale ( $θ_{c}$ = 0.651) was obtained via MLE using the RVM and the values of a and b. The MLE solution (0.651) is graphically depicted in Figure 2.

Figure 2.

The cut-score (0.651) on the IRT logit scale estimated for the RVM and two item parameters (a and b) of 20 simulated items (see Table 3) via MLE under the 2PL model in IRT.

An Example of Using the RVM Method

The main purpose of this section is to describe procedures and results related to using the RVM method in a standard setting workshop aimed at deriving cut-scores on a test referred to as the multiple cognitive abilities assessment (MCAA). For comparison, cut-scores based on the bookmark method under the RP67 and RP50 criteria were also derived in the standard setting process involving the same panelists (experts on the MCAA content). The cut-scores are used for admission of students to education programs (specialized curricula, summer and after school programs, and competitions) under the Mawhiba Project for gifted and talented students in Saudi Arabia (http://www.mawhiba.org). Three versions of the MCAA are developed for three school levels, namely: Level 1 (Grades 3-5), Level 2 (Grades 6-8), and Level 3 (Grades 9-12) (e.g., see Mourgues et al., 2016). For consistency with the terminology used here, the term mastery is used for the category of students with MCAA performance above the cut-score and nonmastery for those who perform below the cut-score at the respective school level.

Given the illustrative purpose of this example, only results for Level 1 (Grades 3-5) are presented, but the RVM and bookmark procedures are identical for the panelists at all three school levels. The MCAA for Grades 3 to 5 consists of 52 dichotomously scored (1/0) items associated with four content domains: (a) mental flexibility, (b) verbal reasoning and reading comprehension, (c) mathematical and spatial reasoning, and (d) scientific and mechanical reasoning, with 13 items per domain. A brief description of these domains is presented next.

Verbal reasoning and reading comprehension (VR&RC). This subscale measures (a) linguistic reasoning (the ability to use language to reach conclusions by processing available information and facts according to specific logical rules and procedures) and (b) reading comprehension (the ability to apply grammar and use it in dealing with reading content).

Mathematical and spatial reasoning (MR&SR). This subscale measures (a) mathematical reasoning—the ability to use mathematical skills and logical thinking to obtain solutions or results through specific strategies and (b) spatial reasoning—the pictorial ability to find a logical relationship between forms, whether in terms of change, similarity, congruence, difference, folding or counting.

Scientific and mechanical reasoning (SR&MR). This subscale measures (a) scientific reasoning—the ability to use available data and facts in a science, experience and logic to obtain information from data and facts that have not been tried before, to build natural objects and to visualize their past and future, as well as the ability to process logical reasoning based on evidence-based extrapolation, and (b) mechanical reasoning—the ability to use principles and concepts in natural sciences to understand and solve different phenomena, such as light, sound, electricity, magnet, motion, diverse forces, pressure, heat, properties of materials, and gases.

Mental flexibility (MF). This subscale measures the ability to produce diverse ideas, direct thinking to visualize what is the opposite of what exists, directing and linking areas of use and changing and diversifying ways of dealing with things and situations according to their nature. This can be achieved through analyzing factors of the difficulties that can be identified and used in finding solutions.

Procedure

The work of participating panelists was organized according to the methodology of standard setting, including training, presentation of the RVM and bookmark methods, and rounds of implementation (e.g., Dimitrov et al., 2020; Ferrara et al., 2008; Lewis et al., 2012; Mitzel et al., 2001). The RVM rounds were conducted in 2 days, whereas the bookmark rounds took 6 days using the RP67 and RP50 criteria of response probability in two separate sessions. Some details are provided next.

RVM Rounds

The RVM method was applied in two rounds over 2 days, 5 hours per day. Twenty-eight panelists were divided into groups based on their expertise in MCAA domains. Specifically, four panelists were assigned to each of the first three domains described above and four panelists to each of the four subdomains (areas of science) of the domain SR&MR, namely: Biology, Physics, Chemistry, and Geology. The panelists were asked to identify RVMs for seven RVUs: (a) RVU1 for MF, (b) RVU2 for VR&RC, (c) RVU3 for MR&SR, (d) RVU4 for Biology, (e) RVU5 for Physics, (f) RVU6 for Chemistry, and (g) RVU7 for Geology.

During the first round, after receiving training on the RVM method, each panelist worked independently on the test unit (RVU) in their area of expertise to identify an RVM for that unit. The panelists were given a booklet describing the RVU items, their IRDs and difficulty, and PLDs for mastery of the RVU. During the second round, the panelists worked collectively in groups of four to come up with a final RVM for the respective RVU of the test. They were provided with feedback from Round 1, including their individual RVMs, the resulting cut-scores, and the percentage of examinees expected to pass those cut-scores. The identification of RVMs for each of the seven RVUs resulted “automatically” in an RVM for “mastery” on the entire test, MCAA (not shown here for space considerations). This procedure was repeated for each of the three MCAA tests by grade levels (L1, L2, and L3). The time schedule is shown in Table 4.

Table 4.

Time Schedule of the Panelists’ Work Under the RVM Method.

Time	Domains	Grade level^a	No. of items in the domain	Time (minutes) for individual work at each level	Time (minutes) for group work at each level
Day 1	VR&RC, MR&SR, MF	1	13	30	30
		2	20	45	45
		3	25	60	60
Day 2	SR&MR	1	13	30	30
		2	20	45	45
		3	25	60	60

Note. VR&RC = verbal reasoning and reading comprehension; MR&SR = mathematical and spatial reasoning; MF = mental flexibility; SR&MR = scientific and mechanical reasoning.

^a Grade levels: 1 = Grades 3 to 5; 2 = Grades 6 to 8; and 3 = Grades 9 to 12.

Bookmark Rounds

The bookmark method was conducted in 6 days, 4 hours per day, involving the same experts who worked under the RVM method. The test forms by grade levels were structured in OIBs under the bookmark method. After providing instructions to the experts, their work was conducted in two sessions.

During the first session, the experts worked individually under two scenarios for each domain of the MCAA tests by grade level. In the first scenario, the panelists used the OIB under the RP67 rule; that is, RP = 2/3 (a 67% chance of a correct item response by the borderline student). In the second scenario, they worked under the RP50 rule; that is, RP = 1/2 (a 50% chance of a correct item response by the borderline student). The experts worked individually and then in groups related to the four MCAA subscales (domains). On the first day, they used the MCAA tests for Grade Levels 1 and 2 under the RP = 2/3 rule. On the second day, they used the MCAA tests for grade levels and RP values as follows: (a) Level 3, RP = 2/3 and (b) Level 1, RP = 1/2. On the third day, they used the MCAA tests for Levels 2 and 3 under RP = 1/2.

During the second session, the experts worked only in four groups, by content domains of their expertise, using the OIBs for the entire MCAA test by grade level. The first day they used the MCAA tests for Grade Levels 1 and 2, under the RP = 2/3 rule. The second day they used the MCAA test for grade levels and RP values as follows: (a) Level 3, RP = 2/3 and (b) Level 1, RP = 1/2. The third day, they used the MCAA tests for Levels 2 and 3, under RP = 1/2. The number of booklets (OIBs) with marked items for “mastery” on the entire MCAA test is given in Table 5.

Table 5.

Number of Booklets (OIBs) Used by Experts Individually and in Groups by Grade Level and RP Values Under the Bookmark Method.

Type of work	Level 1 (Grades 3-5)		Level 2 (Grades 6-8)		Level 3 (Grades 9-10)
	RP = 2/3	RP = 1/2	RP = 2/3	RP = 1/2	RP = 2/3	RP = 1/2
Individual	16	16	16	16	16	16
Group	4	4	4	4	4	4

Note. RP = 2/3 (67% chance of a correct response); RP = 1/2 (50% chance of a correct response). RP = response probability (likelihood of a correct item response); OIB = ordered item booklet.

Computation of Cut-Scores

Of primary interest was the derivation of cut-scores on the scale of the entire MCAA test for decisions on the examinees’ acceptance to programs for gifted and talented students in Saudi Arabia. Cut-scores by MCAA content domains were also computed for diagnostic feedback to the education program.

RVM Cut-Scores

Under the RVM method, cut-scores were computed on the D-scale. As noted earlier, only results for the MCAA at Level 1 (Grades 3-5) are provided here due to space considerations. For illustration, the cut-score on the IRT logit scale, based on the RVM provided by the panelists for the entire test (Grades 3-5), was also computed and is provided here. Cut-scores were computed in three scenarios, using (a) Equation 3 under the DSM-C, (b) MLE under the RFM2 model in DSM-L, and (c) MLE under the 2PL model in IRT. Under the DSM-C, the cut-score on the D-scale (0-1) was found to be $D_{C}$ = 0.533. Under the RFM2 model in DSM-L, the MLE solution for the cut-score on the D-scale was $D_{L}$ = 0.502 (see Figure 3). Under the 2PL model in IRT, the MLE solution for the cut-score on the logit scale was $θ_{c}$ = 0.087 (see Figure 4). As one may notice, the solutions (classical and latent) on the D-scale are slightly above the scale mean (0.5), and the MLE solution on the IRT logit scale is also slightly above the mean of the scale (zero). Thus, in either case, examinees with a score slightly above the average level of ability measured by the MCAA are eligible for acceptance into the education program.

Figure 3.

The cut-score (0.502) on the D-scale estimated for the RVM and two item parameters (b and s) of the MCAA items via MLE under the RFM2 model of the DSM-L.

Validation

Regarding the validity of performance standards, Kane (1994, 2001) suggested three types of validity evidence: procedural, internal, and external. The procedural validation of the RVM application in the context of this example can be addressed from two perspectives. First, the RVM method simplifies and facilitates the panelists’ judgmental task by focusing on their expertise and entirely eliminating probability judgments about the performance of a “borderline” examinee on individual items. Second, the panelists were surveyed about their understanding of the RVM method and their confidence in the derived cut-scores for “mastery” of the domains of their expertise and of the entire test, MCAA. Their responses strongly supported the RVM method.

The internal validation of the RVM method was addressed by estimating the panelists’ consistency in the produced RVMs and resulting cut-scores by the domains of their expertise. For example, the panelists’ agreement on the domain “mental flexibility” (see Table 6) was estimated by comparing their RVMs, produced at the first round, with the final RVM produced at the second round of the procedure. Specifically, 3/4 (75%) of the panelists agreed that Item 5 should be selected (scored 1), whereas 4/4 (100%) agreed that Item 7 should not be selected (scored 0) in the RVM for “mental flexibility.” The average agreement over all 13 items in this domain is 79%. Furthermore, the cut-score on the D-scale, obtained with the final RVM for this domain via Equation 3, is $D_{c}$ = 0.52, whereas the cut-scores obtained with the RVMs of the four panelists are $D_{c 1}$ = 0.54, $D_{c 2}$ = 0.60, $D_{c 3}$ = 0.52, and $D_{c 4}$ = 0.70. The cut-score produced by the fourth panelist (0.70) can be seen as an “outlier” at the first round of the procedure. Nevertheless, the mean of the four cut-scores ( ${\bar{D}}_{c}$ = 0.59), based on the four panelists’ RVMs at Round 1, is relatively close to $D_{c}$ = 0.52, obtained with the final RVM for “MF” at the second round. Similar results were obtained with the RVMs of the panelists assigned to the other content domains of the MCAA.

Table 6.

Response Vectors for Mastery (RVM) of the “Mental Flexibility (MF)” Domain of the MCAA Produced by Four Panelists at Round 1 and Their Final Solution Produced at Round 2 of the Standard Setting Procedure.

Item	Delta, δ	RVM-final	RVM-P1	RVM-P2	RVM-P3	RVM-P4
5	0.5093	1	1	0	1	1
6	0.4167	0	0	0	0	1
7	0.4739	0	0	0	0	0
8	0.4384	0	0	0	0	1
9	0.2655	1	1	1	1	0
10	0.4652	1	1	1	0	1
11	0.4552	1	0	1	1	1
12	0.4950	0	0	0	1	1
13	0.4316	0	0	1	0	0
14	0.2587	1	1	1	1	1
25	0.5653	1	1	1	1	1
29	0.5342	1	1	1	1	1
40	0.5597	0	1	1	0	0

Note. For the RVM of each panelist, the shaded cell indicates consistency of the item score (1 or 0) with the item score in the final RVM for “mental flexibility.” For Item 5, for example, three panelists (P1, P3, and P4) are in agreement with the score for that item in the final RVM. The items are numbered according to their original location in the test, MCAA. MCAA = multiple cognitive abilities assessment.
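The agreement rates and classical cut-scores discussed for Table 6 can be retraced with a few lines of code. The Python sketch below is an illustration only: it assumes that Equation 3 has the form of a δ-weighted proportion, D = Σ δ_i x_i / Σ δ_i (the scalar product of the RVM and the δ vector, divided by the sum of the δ values), which is not restated in this section. The function names `d_cut` and `item_agreement` are hypothetical and do not come from any software used in the study.

```python
# Item difficulties (delta) and response vectors copied from Table 6
# (items 5-14, 25, 29, 40 of the "mental flexibility" domain).
delta = [0.5093, 0.4167, 0.4739, 0.4384, 0.2655, 0.4652, 0.4552,
         0.4950, 0.4316, 0.2587, 0.5653, 0.5342, 0.5597]
rvm_final = [1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0]
rvm_round1 = {
    "P1": [1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1],
    "P2": [0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1],
    "P3": [1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0],
    "P4": [1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0],
}


def d_cut(rvm, delta):
    """Assumed form of Equation 3: delta-weighted proportion of selected items."""
    return sum(x * d for x, d in zip(rvm, delta)) / sum(delta)


def item_agreement(final, panel):
    """Share of panelists whose Round 1 score matches the final RVM, per item."""
    vectors = list(panel.values())
    return [sum(v[i] == final[i] for v in vectors) / len(vectors)
            for i in range(len(final))]


agreement = item_agreement(rvm_final, rvm_round1)
mean_agreement = sum(agreement) / len(agreement)
cuts = {name: d_cut(v, delta) for name, v in rvm_round1.items()}

print(f"final RVM cut-score: {d_cut(rvm_final, delta):.2f}")
print(f"mean item agreement: {mean_agreement:.0%}")
for name, value in cuts.items():
    print(f"{name}: {value:.3f}")
```

Under this assumed form, the sketch gives 0.52 for the final RVM, 0.54 and 0.60 for Panelists 1 and 2, and an average agreement of 79%; the values for Panelists 3 and 4 agree with the reported ones to within about 0.01.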

One aspect of external validation of the RVM method in the context of this example was to compare the location of cut-scores on the D-scale [0-1], obtained via MLE under the DSM-L, and on the IRT scale, obtained via MLE under the 2PL model in IRT. As shown earlier for simulated data (see Figures 1 and 2), the cut-scores based on the RVM for all test items are (a) $D_{cut}$ = 0.566, slightly above the mean (0.5) of the D-scale, and (b) $θ_{cut}$ = 0.651, slightly above the mean of the IRT logit scale (zero). For the real-data study, the cut-scores for the entire MCAA test were slightly above the mean of the scale via the MLE under the DSM-L ( $D_{cut}$ = 0.502; see Figure 3) and under the 2PL in IRT ( $θ_{cut}$ = 0.087; see Figure 4). Also, the cut-scores on the content domain “mental flexibility” were (a) D = 0.498, slightly below the mean of the D-scale, and (b) θ = −0.131, slightly below the mean of the IRT logit scale (see Table 7). Thus, the RVM cut-scores are consistently located relative to the scale mean on both scales.

Table 7.

Response Vector for Mastery (RVM) and Estimates of the Item Parameters Under the 2PL Models in the DSM-L (b and s) and IRT (a and b) for the Items in the “Mental Flexibility” Domain of the MCAA.

Item	RVM	DSM-L (RFM2)		IRT (2PL)
		b	s	a	b
5	1	0.45	1.47	0.85	0.02
6	0	0.38	1.24	0.74	−0.49
7	0	0.43	1.59	0.94	−0.12
8	0	0.40	1.26	0.76	−0.35
9	1	0.24	1.18	0.75	−1.49
10	1	0.42	1.38	0.79	−0.21
11	1	0.40	1.07	0.59	−0.35
12	0	0.44	0.73	0.35	−0.07
13	0	0.37	1.02	0.54	−0.56
14	1	0.25	1.34	0.91	−1.32
25	1	0.50	1.70	0.97	0.34
29	1	0.47	1.78	1.04	0.15
40	0	0.62	0.34	0.03	7.53
Cut-score		D = 0.498		θ = −0.131

Note. DSM-L = D-scoring method latent; MCAA = multiple cognitive abilities assessment; IRT = item response theory; MLE = maximum-likelihood estimation; RFM2 = two-parameter rational function model.

Figure 4.

The cut-score (0.087) on the IRT logit scale estimated for the RVM and two item parameters (a and b) of the MCAA items via MLE under the 2PL model in IRT.

Bookmark Cut-Scores

The bookmark method was applied for derivation of cut-scores on the D-scale to control for the scale factor in the comparison of cut-scores obtained under RVM and bookmark methods. Therefore, instead of IRT-based item parameters, the OIB of the bookmark was based on DSM-L item parameters for location (model-based difficulty), b, and shape/discrimination, s, obtained under the latent RFM2 model given in Equation 1. Another adjustment to using the bookmark method under the DSM-L relates to the formula of the cut-score computation based on the item marked in the OIB by panelists. Details on this matter are provided in the appendix. Apart from these two minor technical adjustments, the bookmark method is the same for derivation of cut-scores on the D-scale and IRT logit scale. The resulting cut-scores on the D-scale for the entire test and by content domains for Grades 3 to 5, obtained under the bookmark rules RP67 and RP50, are reported in Table 8. For comparison, the corresponding cut-scores obtained under the RVM method, as described in the previous section, are also reported in Table 8.
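For orientation, the role of the RP value in a bookmark cut-score can be shown on the IRT logit scale, where the relation is standard: under the 2PL model (written without the 1.7 scaling constant), the ability at which an item is answered correctly with probability RP is θ = b + ln[RP/(1 − RP)]/a. The Python sketch below illustrates only this logit-scale version; the D-scale formula actually used in the study is the one given in the appendix and is not reproduced here. The function names and the example item parameters are hypothetical.

```python
import math


def p_2pl(theta, a, b):
    """Probability of a correct response under the 2PL model (no 1.7 constant)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def bookmark_theta(a, b, rp):
    """Ability at which the marked item has success probability rp."""
    return b + math.log(rp / (1.0 - rp)) / a


# Hypothetical marked item; the values are illustrative, not study estimates.
a_marked, b_marked = 0.9, 0.2
theta_rp50 = bookmark_theta(a_marked, b_marked, 1 / 2)
theta_rp67 = bookmark_theta(a_marked, b_marked, 2 / 3)

print(f"RP50 location: {theta_rp50:.3f}")
print(f"RP67 location: {theta_rp67:.3f}")
```

With RP = 1/2 the location equals the item difficulty b, whereas RP = 2/3 shifts it upward by ln(2)/a, so for the same marked item the RP67 rule places the cut-score higher than the RP50 rule.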

Table 8.

Cutting Scores for “Mastery” Under the RVM and Bookmark Method (BM) on the D-Scale (0-100) for Students at Grades 3 to 5.

Test/domain	RVM (% mastery)	BM: RP = 2/3 (% mastery)	BM: RP = 1/2 (% mastery)
Entire test	53 (30)	59 (17)	55 (25)
VR&RC	53 (52)	60 (33)	56 (59)
MR&SR	62 (20)	57 (28)	48 (52)
SR&MR	46 (20)	71 (0)	78 (0)
MF	51 (39)	54 (32)	44 (53)

Note. RVM = response vector for mastery; VR&RC = verbal reasoning and reading comprehension; MR&SR = mathematical and spatial reasoning; MF = mental flexibility; SR&MR = scientific and mechanical reasoning.

Comparison of RVM and Bookmark Results

The RVM and bookmark results are examined by comparing (a) values of cut-scores obtained under the two methods on the D-scale, (b) agreement among panelists on produced cut-scores, and (c) panelists’ opinions about the two methods. The D-scores are given here on a scale from 0 to 100 for ease of interpretation and consistency with their presentation in reports on test results. First, as shown in Table 8, the bookmark cut-scores under the RP67 and RP50 rules (59 and 55, respectively) are higher than the RVM cut-score on the entire test (D_cut = 53). This trend largely holds for the cut-scores by content domain; the exceptions in Table 8 are MR&SR under both rules and MF under the RP50 rule. From a practical perspective, related to targeted proportions of acceptance to Mawhiba education programs and diagnostic feedback on expected performance by content domains, the panelists and other stakeholders were unanimously in favor of the RVM cut-scores. Provided in Figure 5 is the distribution of D-scores for the study sample of examinees in school Grades 3 to 5 (N = 16,075), with the RVM cut-score (D_cut = 53) producing 30% acceptance to the Mawhiba education programs. Furthermore, the bookmark cut-scores on the content domain SR&MR are unrealistic, with a 0% passing rate under both rules, compared with 20% produced by the RVM cut-score.

Figure 5.

D-score distribution for students at Level 1 (Grades 3-5) with the response vector for mastery (RVM) cut-score (D_cut = 53) producing 30% acceptance to Mawhiba education programs.
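The link between a cut-score and the resulting acceptance rate is simple arithmetic: the rate is the share of examinees whose D-score reaches the cut. The Python sketch below illustrates this with a small synthetic list of D-scores (0-100 scale); the scores are invented for illustration and are not the study data, and the "at or above the cut" rule and the function name `passing_rate` are assumptions.

```python
def passing_rate(scores, cut):
    """Percentage of examinees with a score at or above the cut-score."""
    return 100 * sum(s >= cut for s in scores) / len(scores)


# Synthetic D-scores on the 0-100 scale (illustrative only).
scores = [35, 41, 47, 50, 52, 53, 55, 58, 61, 70]

for cut in (53, 55, 59):
    print(f"cut = {cut}: {passing_rate(scores, cut):.0f}% at or above the cut")
```

Raising the cut-score can only keep or lower the passing rate, which is why the higher bookmark cut-scores on the entire test in Table 8 go with smaller percentages of mastery than the RVM cut-score.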

Second, there was a higher level of agreement among the participating panelists when they worked under the RVM method than under the bookmark method. This can be seen, for example, in Table 9, where the standard deviation (SD) of cut-scores produced independently by the panelists is smaller under the RVM method (SD = 5.50) than under the bookmark method using the RP67 rule (SD = 6.76) or the RP50 rule (SD = 9.79).

Table 9.

Means and Standard Deviations of Cut-Scores Obtained Under the RVM and Bookmark Methods.

Method	M	SD
RVM	54.25	5.50
Bookmark (RP = 2/3)	60.53	6.76
Bookmark (RP = 1/2)	70.28	9.79

Note. Cut-scores on the D-scale (0-100). RVM = response vector for mastery; RP = response probability.

Third, at the end of the standard setting workshop, the participating panelists answered survey questions about their opinions on the RVM method compared with the bookmark method (with RP = 2/3 and RP = 1/2). Overall, the panelists were in favor of the RVM method as being “more comprehensive, sound, and taking into account all items (in an RVM unit).” Regarding the bookmark method, the panelists noted that “this method is difficult to practice and apply, and it needs more work and good training” as well as that the bookmark method “ignores the rest of the questions” (after the marked item in the OIB).

Discussion

The proposed method of standard setting, referred to as the RVM method, was designed to (a) overcome some problems related to Angoff’s (1971) method, the bookmark method (Lewis et al., 1999; Mitzel et al., 2001), and the relatively new and promising ID matching method (Ferrara et al., 2008; Ferrara & Lewis, 2012), (b) reflect more adequately the content structure of typical standardized assessment tools, and (c) align the computation of cut-scores with the computation of ability scores on the D-scale and the IRT logit scale. Some arguments in this regard are summarized briefly below.

First, similar to the ID matching method, the RVM method does not require panelists to conceptualize a borderline examinee and make probability judgments, as is done under the Angoff and bookmark methods. Second, the RVM method avoids problems related to the OIB (e.g., item disordinality), which is used in the bookmark and ID matching methods. Third, the RVUs reflect more adequately the structure of the test compared with a single-item focus under the other methods. Fourth, the estimation of cut-scores based on RVMs is aligned with the estimation of examinees’ ability levels using their response vectors (e.g., via MLE) in the framework of DSM or IRT. Such an alignment is not achieved with OIB-based estimation of cut-scores using a single item, which leaves out ability information coded in the other items (e.g., Zieky, 2001). Also, in the illustrative example provided in the previous section, the RVM method performed better than the bookmark method (using the RP67 and RP50 rules) in several respects: (a) the RVM cut-scores were associated with more realistic (targeted) passing rates, (b) the participating panelists showed a higher level of agreement on cut-scores produced under the RVM method, and (c) the panelists found the RVM method easier to apply, more comprehensive, and more adequately reflecting the content structure of the test.

Practical Considerations

The Choice of RVUs

The RVUs are obtained by partitioning the content structure of the test into substantively meaningful units that (a) are relatively short (e.g., 5-10 items) and (b) can be evaluated for “mastery” on relevant descriptors. For each RVU, the panelists should identify items that are sufficient to evidence mastery of the unit (if answered correctly). The identification of response vectors for mastery by units “automatically” results in a response vector for targeted “mastery” over the entire test. In the teacher licensure test example provided earlier, the test is structured by content domains and “teaching standards” related to each domain. In this case, the teaching standards were used as RVUs because they were relatively short and well described according to targeted performance. Similarly, the ALT is structured by content domains, with each domain further partitioned into substantively defined units that were selected as RVUs (see Table 1).

Computation of Cut-Scores

When the goal is to place cut-scores on the D-scale, their computation under the DSM-C (see Table 2) can be preferred for its transparency, simplicity, and dependability. Specifically, the transparency of a cut-score is provided by its explicit and direct computation via the scalar product of two vectors: the RVM (a sequence of 1/0 item scores) and the vector of expected item difficulties, δ (see Equation 3). The estimation of $δ_{i}$ values can be easily performed via bootstrapping, which is available in widely used statistical packages such as R, SPSS, and Stata. In fact, for very large samples of examinees (e.g., N > 3,000), a quick and quite accurate estimate of item difficulty is ${\hat{δ}}_{i}$ = 1 − $p_{i}$, where $p_{i}$ is the proportion of correct responses (“easiness”) of the item. The estimates of D-scores under the DSM-C and DSM-L approaches are quite close and highly correlated (typically, r > 0.99; e.g., Dimitrov & Atanasov, 2021).
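The two routes to item difficulty mentioned above can be sketched as follows. This Python illustration uses synthetic 1/0 responses and reads the bootstrap step in one simple way, as the mean of 1 − p_i over resamples of examinees drawn with replacement; the resampling scheme, the number of replicates, and the function names are assumptions rather than the procedure implemented in the author's software.

```python
import random


def item_difficulty(responses):
    """delta_hat_i = 1 - p_i for each item (column) of a 1/0 response matrix."""
    n = len(responses)
    n_items = len(responses[0])
    return [1.0 - sum(row[i] for row in responses) / n for i in range(n_items)]


def bootstrap_difficulty(responses, n_boot=200, seed=0):
    """Mean of delta_hat over bootstrap resamples of examinees (rows)."""
    rng = random.Random(seed)
    n = len(responses)
    totals = [0.0] * len(responses[0])
    for _ in range(n_boot):
        sample = [responses[rng.randrange(n)] for _ in range(n)]
        for i, d in enumerate(item_difficulty(sample)):
            totals[i] += d
    return [t / n_boot for t in totals]


# Synthetic data: 500 examinees, 3 items with assumed easiness 0.8, 0.5, 0.3.
gen = random.Random(1)
true_p = [0.8, 0.5, 0.3]
responses = [[1 if gen.random() < p else 0 for p in true_p] for _ in range(500)]

direct = item_difficulty(responses)
boot = bootstrap_difficulty(responses, n_boot=200, seed=2)
for i, (d, b) in enumerate(zip(direct, boot), start=1):
    print(f"item {i}: 1 - p = {d:.3f}, bootstrap mean = {b:.3f}")
```

In this sketch the two estimates are nearly identical, which illustrates why the direct estimate 1 − p_i is described as a quick alternative for large samples.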

Thus, when cut-scores are presented on the D-scale from 0 to 100 (i.e., multiplied by 100) and rounded to the nearest integer for reporting purposes, it would be appropriate to use the DSM-C approach to cut-score computation (see Table 2). When cut-scores are placed on the IRT scale, MLE-based computation using the RVM produced by panelists and estimates of item parameters (e.g., under the 2PL) is a viable option. Technically, this option is facilitated by the availability of MLE procedures for IRT estimation in R, MATLAB, and other software packages.¹
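To make the MLE option concrete, the Python sketch below treats an RVM as a response vector and finds the ability at which the 2PL likelihood is maximized, that is, the root of the score function Σ a_i [x_i − P_i(θ)] = 0. It is a minimal sketch under stated assumptions: the 2PL is written without the 1.7 scaling constant, the root is located by bisection on an assumed search interval of [−6, 6], and the item parameters and RVM are hypothetical values chosen for illustration, not estimates from the study.

```python
import math


def p_2pl(theta, a, b):
    """Probability of a correct response under the 2PL model."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def score_2pl(theta, rvm, a, b):
    """First derivative of the 2PL log-likelihood with respect to theta."""
    return sum(ai * (xi - p_2pl(theta, ai, bi))
               for xi, ai, bi in zip(rvm, a, b))


def mle_theta(rvm, a, b, lo=-6.0, hi=6.0, tol=1e-8):
    """Bisection on the score function, which decreases in theta when all a > 0."""
    if score_2pl(lo, rvm, a, b) <= 0 or score_2pl(hi, rvm, a, b) >= 0:
        raise ValueError("no finite MLE in the search interval "
                         "(e.g., an all-1 or all-0 response vector)")
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if score_2pl(mid, rvm, a, b) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


# Hypothetical item parameters and RVM (illustrative only).
a = [0.8, 1.0, 1.2, 0.9]
b = [-1.0, -0.3, 0.4, 1.1]
rvm = [1, 1, 1, 0]

theta_cut = mle_theta(rvm, a, b)
print(f"theta cut-score: {theta_cut:.3f}")
```

An RVM in which all items are scored 1 (or all 0) has no finite MLE, so the sketch raises an error in that case instead of returning a boundary value.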

Limitations and Future Research

The RVM method is appropriate when the content structure of the test allows for a substantively based identification of units, called RVUs, with availability of PLDs for each unit. Also, the length of each RVU must be relatively small (e.g., 5-10 items) to allow for manageable and efficient work by the panelists related to the identification of response vectors for “mastery” (RVMs) by units. If these conditions are not in place, the RVM method may not be appropriate―a scenario which is, however, highly unlikely to occur with tests designed for standard setting.

Some limitations of the present study (not necessarily of the RVM method per se) relate to issues that are not addressed here but need to be examined in future research. Additional studies are needed to further understand the suitability of the RVM method in different contexts of standard setting. For example, the RVM method is discussed here in the context of binary (e.g., mastery/nonmastery) classifications, so additional research can investigate its suitability for the classification of examinees into more than two performance categories (e.g., basic, proficient, advanced). Also, the validation of RVM-based cut-scores needs to be examined from a variety of perspectives, including modern approaches to estimating errors under controlled conditions of “mastery” classifications (e.g., Grabovsky & Wainer, 2017). Comparisons of the RVM method with other standard setting methods on methodology and practical efficiency are recommended.

Conclusion

With the understanding that there is no single (“best”) approach to setting standards for a variety of assessment scenarios and policy guidelines, the proposed RVM method is an efficient approach to standard setting with unique features of transparency, simplicity, dependability, and separability of panelists’ judgments and psychometric estimations of cut-scores on the IRT scale and/or the D-scale of the recently developed D-scoring method.

Appendix

Acknowledgements

The author would like to thank Dr. Abdullah Qataee and Dr. Abdullah Sadaawi, from the National Center for Assessment at the Education & Training Evaluation Commission (ETEC) in Saudi Arabia, for their support related to piloting of the RVM method at ETEC; Dr. Hanan Ghamdi and Dr. Maisaa Alahmadi for their help with the logistics of the standard setting study for the MCAA test, as well as participating panelists and Mawhiba representatives.

Declaration of Conflicting Interests

The author declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author received no financial support for the research, authorship, and/or publication of this article.

ORCID iD

Dimiter M. Dimitrov

References

Angoff, W. H. (1971). Scales, norms, and equivalent scores. In R. L. Thorndike (Ed.), Educational measurement (2nd ed., pp. 508-600). American Council on Education.

Atanasov, D. V., & Dimitrov, D. M. (2019). DELTA: A computer program for D-scoring and equating of test data (v.2.0). National Center for Assessment of the Education & Training Evaluation Commission.

Baldwin (2018). Some problems with the analytical argument in support of RP67 in the context of the bookmark standard setting method. Applied Psychological Measurement, 43(6), 481-492. https://doi.org/10.1177/0146621618800272

Baldwin, Margolis, Clauser, B. E., Mee, & Winward (2019). The choice of response probability in bookmark standard setting: An experimental study. Educational Measurement: Issues and Practice, 39(1), 37-44. https://doi.org/10.1111/emip.12230

Beretvas, S. N. (2004). Comparison of bookmark difficulty locations under different item response models. Applied Psychological Measurement, 28(1), 25-47. https://doi.org/10.1177/0146621603259903

Berk, R. A. (1986). A consumer’s guide to setting performance standards on criterion-referenced tests. Review of Educational Research, 56(1), 137-172. https://doi.org/10.3102/00346543056001137

Chang (1999). Judgmental item analysis of the Nedelsky and Angoff standard-setting methods. Applied Measurement in Education, 12(2), 151-165. https://doi.org/10.1207/s15324818ame1202_3

Cizek, G. J., & Bunch, M. B. (2007). Standard setting: A guide to establishing and evaluating performance standards on tests. Sage. https://doi.org/10.4135/9781412985918

Cizek, G. J., Bunch, M. B., & Koons (2004). Setting performance standards: Contemporary methods. Educational Measurement: Issues and Practice, 23(4), 31-50. https://doi.org/10.1111/j.1745-3992.2004.tb00166.x

Clauser, B. E., Margolis, M. J., & Case, S. M. (2006). Testing for licensure and certification in the professions. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 701-731). Praeger.

Davis-Becker, S. L., Buckendahl, & Gerrow (2011). Evaluating the bookmark standard setting method: The impact of random item ordering. International Journal of Testing, 11(1), 24-37. https://doi.org/10.1080/15305058.2010.501536

Dimitrov, D. M. (2016). An approach to scoring and equating tests with binary items: Piloting with large-scale assessments. Educational and Psychological Measurement, 76(7), 954-975. https://doi.org/10.1177/0013164416631100

Dimitrov, D. M. (2018). The delta scoring method of tests with binary items: A note on true score estimation and equating. Educational and Psychological Measurement, 78(5), 805-825. https://doi.org/10.1177/0013164417724187

Dimitrov, D. M. (2020). Modeling of item response functions under the D-scoring method. Educational and Psychological Measurement, 80(1), 126-144. https://doi.org/10.1177/0013164419854176

Dimitrov, D. M., & Alsadaawi (2019, September). A standard setting method using the D-scoring method: Procedures and application to assessment for teacher certification [Paper presentation]. 2019 Conference of the International Association for Educational Assessment, Baku, Azerbaijan.

Dimitrov, D. M., & Al-Shamrani (2019, March). A new approach to setting cutting scores for mastery in language testing [Paper presentation]. Second Applied Linguistics & Language Teaching International Conference and Exhibition, Dubai, UAE.

Dimitrov, D. M., & Atanasov, D. V. (2021). Latent D-scoring modeling: Estimation of item and person parameters. Educational and Psychological Measurement, 81(2), 388-404. https://doi.org/10.1177/0013164420941147

Dimitrov, D. M., Ghamdi, & Alahmadi (2020, December). Setting cutting scores for mastery on Mawhibah’s multiple cognitive ability tests (Technical Report [TR 12-2020]). Education & Training Evaluation Commission, Riyadh, Saudi Arabia.

Efron (1979). Bootstrap methods: Another look at the jackknife. Annals of Statistics, 7, 1-26. https://doi.org/10.1214/aos/1176344552

Egan, K. L., Ferrara, Schneider, M. C., & Barton, K. E. (2009). Writing performance level descriptors and setting performance standards for assessments of modified achievement standards: The role of innovation and importance of following conventional practice. Peabody Journal of Education, 84(4), 552-577. https://doi.org/10.1080/01619560903241028

Ferrara & Lewis, D. M. (2012). The item-descriptor (ID) matching method. In G. J. Cizek (Ed.), Setting performance standards: Foundations, methods, and innovations (2nd ed., pp. 255-282). Routledge.

Ferrara, Perie, & Johnson (2008). Matching the judgmental task with standard setting panelist expertise: The item-descriptor (ID) matching procedure. Journal of Applied Testing Technology, 9(1), 1-22.

Ferrara, Swaffield, & Mueller (2009). Conceptualizing and setting performance standards for alternate assessments. In W. D. Schafer & R. W. Lissitz (Eds.), Alternate assessments based on alternate achievement standards: Policy, practice, and potential (pp. 93-111). Brookes.

Grabovsky & Wainer (2017). The cut-score operating function: A new tool to aid in standard setting. Journal of Educational and Behavioral Statistics, 42(3), 251-263. https://doi.org/10.3102/1076998617696495

Hambleton, R. K. (2001). Setting performance standards on educational assessments and criteria for evaluating the process. In G. J. Cizek (Ed.), Setting performance standards (pp. 89-115). Lawrence Erlbaum.

Hambleton, R. K., & Pitoniak, M. J. (2006). Setting performance standards. In R. L. Brennan (Ed.), Educational measurement (4th ed.). American Council on Education/Praeger.

Hambleton, R. K., & Plake, B. S. (1995). Using an extended Angoff procedure to set standards on complex performance assessments. Applied Measurement in Education, 8(1), 41-55. https://doi.org/10.1207/s15324818ame0801_4

Hambleton, R. K., Swaminathan, & Rogers, H. J. (1991). Fundamentals of item response theory. Sage.

Hauser, R. M., Edley, C. F., Jr., Koenig, J. A., & Elliott, S. W. (2005). Measuring literacy: Performance levels for adults. National Academies Press.

Huynh (1998). On score locations of binary and partial credit items and their applications to item mapping and criterion-referenced interpretation. Journal of Educational and Behavioral Statistics, 23(1), 35-56. https://doi.org/10.3102/10769986023001035

Huynh (2006). A clarification on the response probability criterion RP67 for standard settings based on bookmark and item mapping. Educational Measurement: Issues and Practice, 25(2), 19-20. https://doi.org/10.1111/j.1745-3992.2006.00053.x

Kane, M. T. (1994). Validating the performance standards associated with passing scores. Review of Educational Research, 64(3), 425-461. https://doi.org/10.3102/00346543064003425

Kane, M. T. (2001). So much remains the same: Conception and status of validation in setting standards. In G. J. Cizek (Ed.), Setting performance standards (pp. 53-88). Lawrence Erlbaum.

Karantonis & Sireci, S. G. (2006). The bookmark standard-setting method: A literature review. Educational Measurement: Issues and Practice, 25(1), 4-12. https://doi.org/10.1111/j.1745-3992.2006.00047.x

Lewis, D. M., & Green (1997, June). The validity of PLDs [Paper presentation]. National Conference on Large Scale Assessment, Colorado Springs, CO, United States.

Lewis, D. M., Mitzel, H. C., Green, D. R., & Patz, R. J. (1999). The bookmark standard setting procedure. McGraw-Hill.

Lewis, D. M., Mitzel, H. C., Mercado, R. L., & Schulz, E. M. (2012). The bookmark standard setting procedure. In G. J. Cizek (Ed.), Setting performance standards: Foundations, methods, and innovations (2nd ed., pp. 225-254). Routledge.

Lin (2006). The bookmark procedure for setting cut-scores and finalizing performance standards: Strengths and weaknesses. Alberta Journal of Educational Research, 52(1), 36-52.

Mills, C. N., & Jaeger, R. M. (1998). Creating descriptions of desired student achievement when setting performance standards. In Hansche (Ed.), Handbook for the development of performance standards (pp. 73-85). U.S. Department of Education and Council of Chief State School Officers.

Mitzel, H. C., Lewis, D. M., Patz, R. J., & Green, D. R. (2001). The bookmark procedure: Psychological perspectives. In G. J. Cizek (Ed.), Setting performance standards: Concepts, methods, and perspectives (pp. 249-281). Lawrence Erlbaum.

Mourgues, C. V., Tan, Hein, Al-Harbi, Aljughaiman, & Grigorenko, E. L. (2016). The relationship between analytical and creative cognitive skills from middle childhood to adolescence: Testing for the threshold theory in the Kingdom of Saudi Arabia. Learning and Individual Differences, 52, 137-147. https://doi.org/10.1016/j.lindif.2015.05.005

Perie (2008). A guide to understanding and developing PLDs. Educational Measurement: Issues and Practice, 27(4), 15-29. https://doi.org/10.1111/j.1745-3992.2008.00135.x

Phillips, G. W. (2012). The benchmark method of standard setting. In G. J. Cizek (Ed.), Setting performance standards: Foundations, methods, and innovations (2nd ed., pp. 323-346). Routledge.

Plake, B. S., & Cizek, G. J. (2012). Variations on a theme: The modified Angoff, extended Angoff, and Yes/No standard setting methods. In G. J. Cizek (Ed.), Setting performance standards: Foundations, methods, and innovations (2nd ed., pp. 181-200). Routledge.

Ricker, K. L. (2006). Setting cut-scores: A critical review of the Angoff and modified Angoff methods. Alberta Journal of Educational Research, 52(1), 53-64.

Robitzsch (2021). On the equivalence of the latent D-scoring model and the two-parameter logistic item response model. https://doi.org/10.20944/preprints2021105.0699.vl

Schulz, E. M., & Mitzel (2005, April). The mapmark standard setting method [Paper presentation]. Annual meeting of the National Council on Measurement in Education, Montreal, Canada.

Schulz, E. M., & Mitzel, H. C. (2011). A mapmark method of standard setting as implemented for the National Assessment Governing Board. Journal of Applied Measurement, 12(2), 165-193.

Sheng (2008). Markov chain Monte Carlo estimation of normal ogive IRT models in MATLAB. Journal of Statistical Software, 25(8), 1-15. https://doi.org/10.18637/jss.v025.i08

Skaggs & Tessema (2001, April). Item disordinality with the bookmark standard setting procedure [Paper presentation]. Annual meeting of the National Council on Measurement in Education, Seattle.

Skorupski, W. P., & Hambleton, R. K. (2005). What are panelists thinking when they participate in standard-setting studies? Applied Measurement in Education, 18(3), 233-256. https://doi.org/10.1207/s15324818ame1803_3

van der Linden, W. J. (1982). A latent trait method for determining intrajudge inconsistency in the Angoff and Nedelsky techniques of standard-setting. Journal of Educational Measurement, 19, 295-308. https://doi.org/10.1111/j.1745-3984.1982.tb00135.x

Wang (2003). Use of the Rasch IRT model in standard setting: An item mapping method. Journal of Educational Measurement, 40, 231-252. https://doi.org/10.1111/j.1745-3984.2003.tb01106.x

Williams, N. J., & Schulz, E. M. (2005, April). An investigation of response probability (RP) values used in standard setting [Paper presentation]. Annual meeting of the National Council on Measurement in Education, Montreal, Quebec, Canada.

Wyse, A. E. (2011). The similarity of bookmark cut scores with different response probability values. Educational and Psychological Measurement, 71(6), 963-985. https://doi.org/10.1177/0013164410395577

Wyse, A. E., Bunch, Deville, & Viger, S. G. (2014). Body of work standard-setting method with construct maps. Educational and Psychological Measurement, 74(2), 236-262. https://doi.org/10.1177/0013164413502037

Zieky, M. J. (2001). So much has changed: How the setting of cutscores has evolved since the 1980s. In G. J. Cizek (Ed.), Setting performance standards: Concepts, methods, and perspectives (pp. 19-51). Lawrence Erlbaum.

Zwick, Senturk, Wang, & Loomis, S. C. (2001). An investigation of alternative methods for item mapping in the National Assessment of Educational Progress. Educational Measurement: Issues and Practice, 20(2), 15-25. https://doi.org/10.1111/j.1745-3992.2001.tb00059.x