Abstract
Generalizability theory (G theory) allows researchers to assess the many sources of variance inherent in complex standard setting procedures involving the determination of cut scores. The flexibility of G and D studies provides a way to conceptualize and quantify the results of different standard settings once the universe of admissible observations and the universe of generalization are defined. The current article applies a multivariate single-facet design for estimating standard errors of cut scores. For practical purposes, several multivariate D study designs are used to investigate what effect various panel sizes and test lengths have on the precision of the standard setting process. The current study demonstrates the advantages and usefulness of multivariate G theory in determining the accuracy of cut scores in practical applications of standard setting procedures.
Keywords
Introduction
Setting performance standards in elementary and secondary education has been an integral part of the development of statewide assessments mandated by the No Child Left Behind Act (NCLB; 2002). In addition, setting cut scores is essential in the development of licensure and certificate examinations. Earlier literature viewed standard setting as an outcome variable of the test development process, but a more recent, contemporary perspective (e.g., Cizek, 2012a, 2012b; Cizek & Bunch, 2007) advocates treating the standard setting method as a design parameter. The goals of the standard setting process, the standard setting method chosen, as well as the characteristics of the standard setting participants should be tightly aligned with the purpose and design of the test.
Despite the political, social, and economic complexity of any specific application of the standard setting process, it is important to gather evidence for evaluating the precision of the results of the standard setting procedure. To evaluate standard setting results (i.e., cut scores), it is critical to report both validity and reliability evidence for assuring the adequacy and appropriateness of the performance standards. Hambleton and Pitoniak (2006) indicated that the internal constituent of the standard setting evaluation elements emphasizes the consistency within method by inspecting the precision of the estimate of the cut scores. As will be demonstrated, multivariate generalizability theory (G theory) can provide such quantitative evidence.
Early literature by Brennan (1995), Brennan and Lockwood (1980), and Kane and Wilson (1984) clarified the application of G theory to standard setting. As outlined by Brennan (1995), the key question in evaluating standard setting results is “how variable would the cut scores be if the process were replicated?” (p. 271). To answer this question, later researchers focused on the standard errors (SEs) of cut scores from different standard setting procedures. For example, in some medical examinations, researchers applied G and D study designs, which considered rater and item facets in the Angoff-based procedures (e.g., Verhoeven et al., 1999; Verhoeven, Verwijnen, Muijtjens, Scherpbier, & van der Vleuten, 2002) and in the borderline regression method (e.g., Kramer et al., 2003). Others discussed more complex designs that went beyond rater and item facets (e.g., Chang, 1999; Chang & Hocevar, 2000).
Because the Angoff-based methods are classic standard setting procedures and are often implemented for large-scale statewide assessments as well as licensure and certificate examinations, researchers have focused on using G theory for estimating the SEs of cut scores (Arce-Ferrer & Yin, 2007; Yin & Sconing, 2006, 2008), the stability of cut scores across rounds (Clauser et al., 2009), or both (Tzou, Wu, & Lin, 2008; Wu & Tzou, 2010). The popularity of item response theory (IRT) has led to the development of other standard setting methods, such as the bookmark method. Lee and Lewis (2001, 2008) have investigated the application of G theory for estimating the SEs of cut scores in the bookmark standard setting.
Although past studies used G theory in standard setting analyses, none has considered multivariate G theory. Statewide or nationwide assessments contain different content categories, usually with a different set of items within each content category. In licensure and certificate examinations, tasks are diverse and the number of tasks may vary by skill. Taking into account these multivariate aspects of such assessments leads naturally to analysis by multivariate G theory. For each object of measurement, there are multiple universe scores, each of which is associated with a level of one or more fixed content categories, and the items nested within each fixed level are randomly parallel. That is, the design is mixed, with the random item factor nested within the fixed content category factor. Using multivariate G theory can appropriately identify all sources of error and accurately evaluate their contributions to the SEs of cut scores. In addition, each category or skill may have an unequal number of items or tasks, which creates an unbalanced design and increases the complexity of the analyses. Multivariate G study designs handle both balanced and unbalanced situations, and they are much simpler than univariate G study designs for unbalanced cases (e.g., Brennan, 2001a). The computer program mGENOVA (Brennan, 2001b) can be used to analyze a variety of multivariate G study designs.
Motivated by the above considerations, the purpose of the current study was to demonstrate a method of quantifying SEs of cut scores when standard setting data involve fixed and random effects. A multivariate G theory approach was applied to estimate the SEs of the cut scores resulting from the modified Angoff standard setting method. In the authors’ definition, the universe of admissible observations (UAOs) and the universe of generalization (UG) had the same structure. A single-facet multivariate G study design was used to estimate variance–covariance components, after which several D studies with various numbers of panelists and items were analyzed to understand how much the cut scores could vary if the modified Angoff procedures were replicated under different but randomly parallel conditions. The SEs of the cut scores were then considered. The formulas for the SEs were derived, and their applicability to the modified Angoff standard setting results was demonstrated.
Angoff-based methods require the panelists to provide item-level performance estimates for borderline groups, which presents the panelists with a substantial workload. To reduce this workload, an item-grouping approach, in which items were grouped in cells of similar difficulty, was incorporated, and its impact on the accuracy of the cut scores was examined. The D studies were also used to investigate whether a reduction in the number of items rated by each panelist would affect measurement precision.
In an effort to investigate the accuracy and stability of the cut scores as well as to illustrate the practicability of applying a multivariate G theory approach to standard setting, the current study used empirical standard setting data to answer the following questions:
It is expected that the application of multivariate G theory will provide practical information for interpreting results and for planning future standard settings.
Method
Standard Setting Process
Materials
The data were from a standard setting study for determining the cut scores for a national fourth-grade mathematics achievement test. Students’ performances were classified into four levels (i.e., below basic, basic, proficient, and advanced) by three cut scores. The test consisted of 104 operational multiple-choice items within five content categories: algebra, statistics, geometry, measurement, and number sense and computation. As shown in Table 1, the average item difficulties (i.e., conventional p values) of the five categories are .452, .600, .468, .536, and .515 based on 11, 10, 10, 31, and 42 items, respectively. The overall test difficulty is .518 based on a sample size of 8,120 students. When the math test was scored by a three-parameter logistic IRT (3PL IRT) model, three items had unusual characteristics (e.g., extreme b parameter estimates) and were therefore excluded from both the item pool and the standard setting. Thus, the numbers of items within the five content categories used for setting the cut scores were 11, 10, 10, 30, and 40, respectively.
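The reported category difficulties and item counts can be checked against the reported overall difficulty with a short calculation. The sketch below assumes that the overall difficulty is the item-count-weighted mean of the (rounded) category means; the variable names are illustrative and not taken from the study.

```python
# Item counts and mean conventional p values per content category,
# as reported in the text for the 104 operational items.
categories = [
    "algebra",
    "statistics",
    "geometry",
    "measurement",
    "number sense and computation",
]
n_items = [11, 10, 10, 31, 42]
mean_p = [0.452, 0.600, 0.468, 0.536, 0.515]

total_items = sum(n_items)

# Assumption: the overall difficulty is the item-count-weighted mean
# of the category means (category means are rounded to three decimals).
overall_p = sum(n * p for n, p in zip(n_items, mean_p)) / total_items

# Item counts actually rated in the standard setting, after three items
# were excluded on the basis of the 3PL calibration.
n_rated = [11, 10, 10, 30, 40]
total_rated = sum(n_rated)

print(total_items, round(overall_p, 3), total_rated)
```

Under this assumption the weighted mean agrees with the reported overall difficulty of .518 to three decimals, and the rated item counts sum to 101.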
Data Structures for Implementing the Modified Angoff Standard Setting Procedures.
The number of items scored is also the operational number of items.
These conventional p values are calculated from the number of items scored by content.
These conventional p values are calculated from the number of items rated by content.
Panelists
Twelve panelists participated in the standard setting. One was a new assistant professor in mathematics education and the others were elementary school math teachers. Their teaching experience in math was 8 years on average. The assistant professor, who had been an elementary school teacher for more than 10 years, did not behave differently from the other panelists. In addition, most had prior experience with the modified Angoff procedures.
Procedure
The agenda of the modified Angoff standard setting procedures was guided by Angoff (1971), the National Assessment of Educational Progress (Allen, Jenkins, Kulick, & Zelenak, 1997), Reckase (1998, 2000), Hambleton (2001), Hambleton and Pitoniak (2006), and Cizek and Bunch (2007). The panel participated in a standard setting that consisted of three rounds on three consecutive days. All items were rated by each panelist; that is, each panelist provided item-level performance estimates for the borderline groups of the basic, proficient, and advanced levels. To investigate the item-grouping effect (i.e., prior to being rated, items had been grouped with respect to their item difficulties), six panelists were randomly selected and given the grouped items. These six panelists, who were unaware of the effect, formed a group called the “item-preordered group” (IPG), whereas the other six panelists, who were not given the grouped items, constituted the “non-item-preordered group” (NIPG). Except for the item grouping, the two groups followed exactly the same standard setting procedures.
Figure 1 shows the modified Angoff standard setting procedures integrated with regular “normative and reality feedback” (Cizek & Bunch, 2007, pp. 54-56) as well as other strategies for improving interpanelist and intrapanelist consistency. To improve interpanelist consistency, normative feedback was given by providing panelists with information about their own ratings in comparison with other panelists’ ratings. The means and standard deviations of the panelists’ ratings on the score scale (i.e., the estimated minimal passing scores) and a correlation matrix that contained the pairwise correlations among panelists’ ratings were provided. The reality feedback offered the empirical item difficulty (i.e., conventional p value) so that panelists could see how their judgments on the items compared with actual examinee performance. To improve intrapanelist consistency at the final round, every panelist was given a customized chart in which his or her ratings at the second round were listed in the order of empirical item difficulties. Panelists then reviewed the items with similar p values, checked their ratings to detect discrepancies, and made adjustments if necessary. In short, the different types of feedback were intended to decrease the inconsistency within the panel and support the validity of the final cut scores (Hambleton & Pitoniak, 2006).
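The feedback summaries described above can be illustrated with a small computation. The following is a minimal sketch on simulated ratings, not the study's data or software: the ratings, the panel size, and all variable names are hypothetical, and cut scores are kept on the proportion-correct metric rather than the reporting score scale.

```python
import numpy as np

rng = np.random.default_rng(0)
n_panelists, n_items = 6, 20

# Hypothetical data: empirical p values and Angoff ratings
# (rows = panelists, columns = items), both on the 0-1 metric.
empirical_p = rng.uniform(0.2, 0.9, size=n_items)
noise = rng.normal(0.0, 0.1, size=(n_panelists, n_items))
ratings = np.clip(empirical_p + noise, 0.0, 1.0)

# Normative feedback: each panelist's implied cut score (mean rating),
# plus the panel mean and standard deviation of those cut scores.
panelist_cuts = ratings.mean(axis=1)
panel_mean = panelist_cuts.mean()
panel_sd = panelist_cuts.std(ddof=1)

# Pairwise correlations among panelists' item ratings
# (np.corrcoef treats each row as one variable).
panelist_corr = np.corrcoef(ratings)

# Intrapanelist chart for the first panelist: ratings listed in order
# of empirical item difficulty, so that items with similar p values
# sit next to each other and discrepant ratings stand out.
order = np.argsort(empirical_p)
chart_for_first = np.column_stack([empirical_p[order], ratings[0, order]])
```

Sorting by empirical difficulty places items with similar p values side by side, which is what makes inconsistent ratings easy to spot in the customized chart.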

The multistage standard setting procedure.
A Multivariate G Theory Approach
A multivariate G theory approach was applied to analyze panelists’ item ratings from the standard setting. Under the multivariate G theory framework, the SE of the final average rating (
Facet
Cut scores are determined after the final standard setting round. Thus, factors involved in the standard setting procedure are possible facets in a G theory model. First, the item facet contributes to error; it is a random facet because a different set of items can be involved in each replication. Second, the panelist facet is essential in assessing the adequacy of certain standard setting procedures like the Angoff-based method (Brennan, 1995). It contributes to the SEs of the cut scores; the panelist effect is also random. Third, the math test is composed of five content categories, and thus the content (category) facet is considered. The content category facet is fixed because every replication of the standard setting procedure involves the same categories. Finally, round is not regarded as a facet. In standard setting, rounds are not randomly parallel, especially when the standard setting incorporates different feedback procedures. In short, the multiple standard setting rounds are fixed procedures and they are not interchangeable. As Brennan (1995) emphasized, the entire standard setting procedure should be viewed as one replication, as the final cut scores depend on the final round only. The variability over items and panelists is of the most interest.
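Within a single content category, the panelist, item, and panelist-by-item variance components can be estimated from a crossed rating matrix. The sketch below uses the standard expected-mean-squares (ANOVA) estimators for a crossed p × i design with one rating per cell. It is an illustration only: the function name and the toy data are hypothetical, the study itself used mGENOVA, and the panelist covariances across categories that a multivariate analysis also requires are not computed here.

```python
import numpy as np


def variance_components_pxi(x):
    """ANOVA estimates for a crossed p x i design, one rating per cell.

    x: 2-D array, rows = panelists (p), columns = items (i).
    Returns estimates of sigma^2(p), sigma^2(i), and sigma^2(pi).
    Estimates can be negative in small samples.
    """
    x = np.asarray(x, dtype=float)
    n_p, n_i = x.shape
    grand = x.mean()
    p_means = x.mean(axis=1)
    i_means = x.mean(axis=0)

    # Mean squares for panelists, items, and the interaction/residual.
    ms_p = n_i * np.sum((p_means - grand) ** 2) / (n_p - 1)
    ms_i = n_p * np.sum((i_means - grand) ** 2) / (n_i - 1)
    resid = x - p_means[:, None] - i_means[None, :] + grand
    ms_pi = np.sum(resid ** 2) / ((n_p - 1) * (n_i - 1))

    # Solve the expected-mean-square equations for the components.
    var_pi = ms_pi
    var_p = (ms_p - ms_pi) / n_i
    var_i = (ms_i - ms_pi) / n_p
    return var_p, var_i, var_pi


# Toy check with purely additive ratings: the interaction is zero and
# the panelist and item components equal the sample variances of the
# panelist and item effects.
a = np.array([0.50, 0.55, 0.60])          # hypothetical panelist effects
b = np.array([-0.10, 0.00, 0.10, 0.20])   # hypothetical item effects
x = a[:, None] + b[None, :]
var_p, var_i, var_pi = variance_components_pxi(x)
```

Applying such a function separately to each content category handles unequal item counts across categories without any special treatment, which is one reason the multivariate formulation is convenient for unbalanced data.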
Multivariate G study design
To clearly identify SEs of measurement, the UAO is now specified. As stated, the content category facet is fixed. The items are random in the sense that each replication of the standard setting procedure could involve a different set of items within each level of the content category facet. Every item is rated by each panelist. In a univariate G study sense, this is a p × (i:h) design, where p, i, and h represent the panelist, item, and content facets, respectively. Note that the unequal numbers of items within each content category make the standard setting data unbalanced. The fixed content facet and the random-effects variance components design associated with each fixed content level yield a multivariate G study design,
The standard setting data have five content categories (i.e.,
In Equation 1,
Multivariate D study designs
To understand the impact of panel size and test length on the SEs of the cut scores, several multivariate D studies were conducted, considering different numbers of panelists and items by varying
Numbers of Items by Content for the D Studies.
The SE of the cut score under the
design
The SE of the cut score under the
The SE(
First, let
In Equation 2, the sum of the two pieces in the square bracket represents the weighted sum of all elements in the
Results and Discussion
G Study Results—Quantifying the Variability of the Cut Scores
The panel was divided into two groups, IPG and NIPG, for investigating the item-grouping effect. The performance standards provided by one group could not be generalized to those of the other because IPG and NIPG were not randomly parallel to one another. As this group effect was fixed, two multivariate single-facet G study designs were applied to IPG and NIPG, respectively. 1 The item-grouping effect could be examined by comparing the G study results from IPG to those from NIPG.
Table 3 presents the estimated variance–covariance matrices of panelist (
The Estimated Variance–Covariance Component Matrices for Basic (B), Proficient (P), and Advanced (A) Cut Scores From the Multivariate G Study
Note. Lower diagonal elements in the panelist matrices are covariances. IPG = item-preordered group; NIPG = non-item-preordered group.
Similar findings were obtained for NIPG. Comparing the estimated matrices between IPG and NIPG, the differences suggested that the item-grouping effect does exist. For the item effect, the variance estimates from IPG were uniformly smaller than the variance estimates from NIPG irrespective of content area. In other words, the item variability from IPG was less than the item variability from NIPG. This was found for each cut score, implying that the item-grouping effect was functioning when the IPG panelists provided their item ratings at each achievement level. The item–panelist interactions were quite similar in both groups, except for the proficient level. As to the panelist matrices, the variance–covariance estimates from IPG were larger than the variance–covariance estimates from NIPG for the proficient and advanced cut scores. The panelist effect seemed to be stronger in IPG, which implied a lack of consensus among the IPG panelists in their opinions about the borderline groups at the proficient and advanced levels. In contrast, the NIPG panelists reached greater agreement in their opinions on the two borderline groups. The panelist effect for the basic cut score told a very different story. Why the variance–covariance estimates of the panelist matrix from IPG were uniformly smaller than those from NIPG at the basic level needs further investigation.
In sum, the variability among items dominated the variability of the cut scores because the estimated variances of the item facet were larger than the variances of the panelist facet and the interactions. Also, the effect of grouping items in cells of similar difficulty was reflected by less variability among item ratings from IPG than NIPG.
Change of the SEs Across Different Standard Setting Rounds
As defined previously, the UAO and the UG had the same structure. Thus, the SEs were estimated from the same number of panelists and items in the G study designs (i.e.,

The SEs of basic, proficient, and advanced cut scores (on the metric of item proportion correct) at three standard setting rounds under the multivariate
First, the SE of the basic cut score from IPG in Round 1 was .0401, much higher than the other SEs. In the standard setting, panelists were asked to provide the item-by-item Angoff estimates for the borderline examinees of the basic level first, followed by the proficient and advanced levels. For IPG, setting the Round 1 cut score for the basic level was the first time they examined all the items. Rating so many items might have mentally burdened the panelists, which introduced instability to the panel. In the panelists’ discussion that followed Round 1, some panelists reported heavy loads in the beginning and found themselves adapting, especially when they provided the Angoff estimates for the proficient and advanced borderline groups. Although not reported, the
The SEs of the cut scores from Rounds 1 and 2 fluctuated at different achievement levels, but in most cases they reached their minimum in Round 3. The exceptions included the advanced cut score from IPG, where a suspiciously low SE = .0130 was obtained in Round 1, and the basic cut score from NIPG, where a slightly lower SE = .0169 was obtained in Round 2. In general, the panelists’ discussion and the normative and reality feedback that were used to improve the panel consistency worked as expected. In Round 3, smaller SEs were reported for the basic and advanced cut scores (i.e., .0132 and .0156) than for the proficient cut score (i.e., .0179) from IPG, and SEs of roughly the same size were reported from NIPG (i.e., .0172, .0170, and .0168 for the basic, proficient, and advanced levels). In sum, given its higher SE, the proficient cut score from IPG seemed less consistent than the basic and advanced cut scores. However, the three cut scores from NIPG reached almost the same precision.
D Study Results—Decisions on Panel Size and Test Length
In the D studies, the SEs of basic, proficient, and advanced cut scores were computed by varying D study sample sizes and using the estimated variance and covariance component matrices from the G study. Only the variance and covariance component matrices from Round 3 were used for reporting the D study results.
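The D study computation described above can be sketched as follows. The article's own SE equation is not reproduced in this text, so the function below assumes the standard form for the variance of a composite mean when panelists are crossed with content categories and items are nested within categories: the weighted sum of all elements of the panelist variance–covariance matrix divided by the number of panelists, plus weighted item and panelist-by-item variance terms. The function name, the default weights (item proportions), and the numeric inputs are all hypothetical.

```python
import numpy as np


def cut_score_se(sigma_p, var_i, var_pi, n_items, n_panelists, weights=None):
    """SE of the composite mean rating over panelists and items.

    sigma_p     : (k, k) panelist variance-covariance matrix across the
                  k content categories.
    var_i       : length-k item variance components.
    var_pi      : length-k panelist-by-item variance components.
    n_items     : length-k D study numbers of items per category.
    n_panelists : D study number of panelists.
    weights     : category weights; defaults to item proportions
                  (an assumption of this sketch).
    """
    sigma_p = np.asarray(sigma_p, dtype=float)
    var_i = np.asarray(var_i, dtype=float)
    var_pi = np.asarray(var_pi, dtype=float)
    n_items = np.asarray(n_items, dtype=float)
    if weights is None:
        weights = n_items / n_items.sum()
    w = np.asarray(weights, dtype=float)

    # Panelists rate every category, so variances and covariances
    # (the whole matrix) contribute to the error variance.
    panelist_part = w @ sigma_p @ w / n_panelists
    # Items differ across categories, so only variances contribute.
    item_part = np.sum(
        w ** 2 * (var_i / n_items + var_pi / (n_panelists * n_items))
    )
    return float(np.sqrt(panelist_part + item_part))


# Hypothetical single-category check:
# 0.004/5 + 0.02/10 + 0.01/(5*10) = 0.003, so the SE is sqrt(0.003).
se_single = cut_score_se([[0.004]], [0.02], [0.01], [10], 5)

# Hypothetical two-category example with a larger and a smaller panel.
sigma_p2 = [[0.0020, 0.0010], [0.0010, 0.0030]]
se_6 = cut_score_se(sigma_p2, [0.02, 0.03], [0.01, 0.01], [30, 40], 6)
se_12 = cut_score_se(sigma_p2, [0.02, 0.03], [0.01, 0.01], [30, 40], 12)
```

In this form the item variance term does not depend on the number of panelists, which is why enlarging the panel alone yields diminishing returns once the panelist terms are small.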
As expected, the SE became smaller as the panel size enlarged. For IPG, the SE lines became flat after
The D study designs also took test length into consideration. To maintain the relative importance of the five content domains, the D study test lengths varied proportionally. For example, the “half” test length had a total of 50 items, where
Just as increasing panel size reduced the SEs, varying the test length affected the SEs as well. As shown in Figures 3 and 4, longer tests led to smaller SEs than shorter tests, as expected. Regardless of panel size, halving the original test length increased the SEs by about .005 at the different achievement levels. However, doubling the test length reduced the SEs by as much as .005, especially when the panel size also increased. Although increasing the test length resulted in much more improvement in measurement precision than enlarging the panel size did, doubling the test length might not be realistic in practice. In addition, the fact that both the 80% and 90% SE lines did not deviate much from the SE line of the original test length indicated that allowing panelists to rate fewer items in standard setting is a possibility (assuming content representativeness of items is not an issue). Similar results were found for NIPG, except that the SE lines became level for all cut scores after

The SEs of basic, proficient, and advanced cut scores for different D study sample sizes for IPG.

The SEs of basic, proficient, and advanced cut scores for different D study sample sizes for NIPG.
The Desirable Number of Panelists Under Various Test Length Proportions for IPG and NIPG.
Note. Table values are the number of desired panelists. IPG = item-preordered group; NIPG = non-item-preordered group. B = basic; P = proficient; A = advanced.
To reduce SEs from .020 to .015 for IPG, one might need to at least double the number of panelists. The only exception was for the basic cut score with length proportions of 1.4 and more. Not surprisingly, when the test length proportion was small, a bigger panel was required to obtain more consistent cut scores (i.e., SE < .015). At the proficient level, in particular, a large panel containing more than 30 panelists was desired if SE < .015 and the full test (or fewer items) was rated. But the panel size required for an acceptable SE at the basic level was small, because the estimated SE of the basic cut score from the G study analysis was already quite small (i.e., .0132). In practice, however, a standard setting panel containing two experts is unrealistic because a representative sample of panelists usually requires a larger panel.
For NIPG, reducing SEs from .020 to .015 required two, three, or more times the panel size, especially for small test length proportions. For the full test or tests with fewer items, a large panel containing more than 30 panelists was needed to achieve SE < .015 at all levels. Furthermore, by inspecting the differences between the results of IPG and NIPG, it was found that when SE < .020, the desirable numbers of panelists for IPG and NIPG were usually 10 or fewer; in some cases, NIPG required more panelists under selected test length proportions for certain levels (e.g., the basic level under 80% of the test length), but sometimes it was IPG that required more (e.g., the proficient level under 80% of the test length). When SE < .015, NIPG tended to need more panelists than IPG except when long tests (e.g., 160% or more) were used at the proficient level. In conclusion, building a larger standard setting panel for NIPG than for IPG was generally necessary, regardless of how many items would be rated.
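The search for a desirable panel size can be sketched as a simple loop over candidate numbers of panelists. For brevity, the sketch below collapses the test to a single content category, so it is a simplification of the multivariate computation rather than the procedure used in the study; the function name and all variance components are hypothetical.

```python
import math


def min_panelists(var_p, var_i, var_pi, n_items, target_se,
                  max_panelists=200):
    """Smallest panel size whose SE falls below target_se.

    Single-category simplification:
        SE^2 = var_p / n_p + var_i / n_items + var_pi / (n_p * n_items)
    Returns None if no panel size up to max_panelists is sufficient.
    """
    for n_p in range(1, max_panelists + 1):
        error_variance = (
            var_p / n_p + var_i / n_items + var_pi / (n_p * n_items)
        )
        if math.sqrt(error_variance) < target_se:
            return n_p
    return None


# Hypothetical variance components (not the study's estimates).
var_p, var_i, var_pi = 0.002, 0.015, 0.012

# Full-length test of 100 items: a looser and a stricter SE target.
panel_for_020 = min_panelists(var_p, var_i, var_pi, 100, 0.020)
panel_for_015 = min_panelists(var_p, var_i, var_pi, 100, 0.015)

# Half-length test: the item term alone, var_i / 50, already exceeds
# 0.015 ** 2, so no panel size can meet the stricter target.
panel_half_015 = min_panelists(var_p, var_i, var_pi, 50, 0.015)
```

The sketch makes the pattern in the table easier to read: the term var_i / n_items sets a floor on the SE that no number of panelists can lower, so a tighter SE target combined with a shorter test makes the required panel size grow quickly or become unattainable.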
Conclusions and Limitations
The complexity of an empirical standard setting study sometimes makes it hard to quantify the results. The nature of G theory makes it a powerful tool for detecting different sources of error, though it matters how researchers conceptualize the UAOs (e.g., items). The flexibility of G studies and D studies allows researchers to design their future standard setting agenda.
In this study, a multivariate G study
The D study results assisted in practical decisions, provided the same UG was applicable. The current study found that the desirable number of panelists should be no fewer than 10. Brandon (2004) also indicated that the desirable panel size for the modified Angoff method was at least 10. Based on the findings, it was also concluded that recruiting more panelists may not result in significant improvement in measurement precision (i.e., SE < .015, a decrease of .005 only). Finally, if one were to replicate the modified Angoff standard setting procedures, 80% or 90% of the test length could be considered, because the SE lines of the original test did not differ much from the SE lines at 80% or 90% of the test length. With such an adjustment, the cognitive burden on panelists could be reduced. In addition, item grouping could help improve the precision of the cut scores, based on the finding that the SEs from the item-preordered group were smaller than the SEs from the non-item-preordered group. Other things being equal, the desirable panel sizes were smaller for the item-preordered group than for the non-item-preordered group under most test length conditions and achievement levels. Nevertheless, for practical considerations, the acceptable magnitude of SEs still needs some deliberation, because it should depend on the real context as well as the G or D study designs. For such an evaluation, it would also be of interest to study the SEs or confidence intervals for the estimated variance components in future work.
In the study, a multivariate G study crossed design was fully discussed and demonstrated. It is also possible to apply a nested design, such as
Under a multivariate G theory framework, the current study focuses on the modified Angoff method in standard setting. The Angoff family of procedures is one of the most popular and thoroughly researched of all currently used standard setting methods and will remain in widespread use in the future (Cizek & Bunch, 2007; Plake & Cizek, 2012). However, no method is free of criticisms and/or problems. Recently, the issues with the Angoff standard setting method with multiple-choice items have been well documented in Plake and Cizek (2012), to which interested readers could refer. Major criticisms include the inability of the Angoff standard setters to form and maintain the integrated conceptualizations of proficiency required for implementing item-based procedures, and their inability to make accurate item performance estimates (i.e., item ratings), especially for difficult or easy items. Notwithstanding, the Angoff family is still attractive in practice due to its long history, the support from a massive body of literature, and some favorable features that leading psychometricians with expertise in standard setting have defended (see, for example, Hambleton et al., 2000). Not surprisingly, the Angoff family will likely remain popular in the future.
In conclusion, generalizability theory allows researchers to assess the many sources of error in standard setting. Through the D study designs, the current study shows the impact of panel size and test length on measurement precision. For those who might replicate the modified Angoff procedures in a similar way, the study provides useful information for making decisions on desirable numbers of panelists and/or items such that an acceptable precision of cut scores is achieved. For those who might apply other item- or test-centered standard setting methods, the study demonstrates a way of applying multivariate G theory designs when fixed effects exist. By inspecting the SEs of cut scores, the designers of a standard setting agenda will be able to know whether the method delivered works properly. Finally, although the panelist facet is random, a convenience sample was used in collecting the standard setting data. Therefore, making clear arguments about generalization to a well-specified panelist population is relatively difficult, and such generalizations should be made with caution.
Footnotes
Appendix
Acknowledgements
The authors express their gratitude to the editor, Dr. Hua-Hua Chang, and three anonymous reviewers for their comments, which greatly improved the clarity and focus of the manuscript. They thank Dr. David Woodruff and Dr. Kate Hancock for editing the manuscript. They also thank Ms. Isabel Zheng and Mr. Edison Choe for their help in the process of publication. The first author is grateful to Dr. Stephen Dunbar, the Director of the Iowa Testing Programs, for his encouragement for this paper.
Authors’ Note
An earlier version of this article was presented at the 2010 Annual Meeting of the American Educational Research Association.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The conduction of the standard setting was part of the Taiwan Assessment of Student Achievement in Mathematics, funded by the National Academy for Educational Research, Taiwan.
