Abstract
When implementing standard setting procedures, there are two major concerns: variance between panelists and efficiency in conducting multiple rounds of judgments. With regard to the former, there is concern over the consistency of the cutoff scores made by different panelists. If the cut scores show an inordinately wide range then further rounds of group discussion are required to reach consensus, which in turn leads to the latter concern. The Yes/No Angoff procedure is typically implemented across several rounds. Panelists revise their original decisions for each item based on discussion with co-panelists between each round. The purpose of this paper is to demonstrate a framework for evaluating the judgments in the standard setting process. The Multifaceted Rasch model was applied as a tool to evaluate the quality of standard setting in a context of language assessment. The results indicate that the Multifaceted Rasch model offers a promising approach to examination of the variability in the standard setting procedures. In addition, this model can identify aberrant decision making for each panelist, which can be used as feedback for both standard setting designers and panelists.
The purpose of standard setting is to establish cut points which correspond to an institution’s desired performance standards and which allow classification of individuals into different performance levels based on the test results (Hambleton & Pitoniak, 2006). As such, the outcome of the standard setting will affect the lives of individuals to certain degrees, depending on the purpose for testing. In some assessment situations, especially in high-stakes assessments, an inappropriate cut point derived from the standard setting process may have serious consequences. For example, an unreasonably high cut point for a college entrance exam may deprive qualified students of the right to admission to college. However, an unduly low cut point may allow a large number of unqualified students entry into the school. The consequences of standard setting impact all stakeholders (teachers, administrators, and students); therefore, it deserves greater attention and effort (Pellegrino, Jones, & Mitchell, 1999; Shepard, Glaser, Linn, & Bohrnstedt, 1993).
The process of setting standards for student achievement can be both challenging and time-consuming, particularly if the test is required to identify multiple levels of ability (Cizek & Bunch, 2007). In the case of teacher certification exams, one may want to establish a minimum pass score to identify qualified teachers; whereas achievement tests would require multiple standards to allocate students to categories such as basic, proficient or advanced. These standards must be agreed upon and clearly delineated in performance-level descriptors (PLDs), full-sentence text descriptions which panelists use to set cut scores. Basically, standard setting involves judgments about the ideal performance standard and the test score that reflects this standard. It seems quite simple and straightforward; yet standard setting typically causes some controversies, uncertainties, and ambiguities since the decisions are laden with subjective judgments based on the values and beliefs of a select group of people (Glass, 1978; Popham, 1978).
The selection of panelists for the standard setting introduces another facet where reliability and validity of the cut scores may be compromised. Panelists are selected based on their expertise in the academic subject matter being tested or for their experience in education in general. They may be students’ parents, school administrators, university faculty, or teachers, but regardless of professional background, the panelists should include representatives of all groups that have a legitimate stake in the outcome of an assessment and the decisions that derive from its use (Cizek & Bunch, 2007). For example, the panels that are used to set performance standards for National Assessment of Educational Progress (NAEP) are composed of 70% classroom teachers and educators, and 30% non-educators who are selected from such populations as the business community, the military, and parents’ groups. With the various backgrounds represented by the panelist groups, it becomes necessary to ascertain the degree of consensus among the panelists with regard to cutoff scores and the standards such scores are intended to operationalize.
In the standard setting meeting, the panelists have to compromise on their own decisions to varying degrees or, alternatively, persuade others to subscribe to their point of view by exchanging judgments over several rounds until consensus is achieved (Kane, 1994). The entire standard setting process is time-consuming and can take several days or weeks. Institutions often operate under restrictive budgets or narrow time frames, so it is imperative to provide efficient feedback to the panelists to facilitate their work of examining and revising the judgments (Reckase, 2001). Such feedback may consist of a summary of their own judgments, a summary of their internal consistency, an indication of how their judgments compare to those of other panelists, an indication of variability in participants’ ratings, and the likely impact of the judgments on the examinee population.
According to Cizek & Bunch (2007), this feedback can be categorized into three kinds. The first kind is normative data, which involves providing the overall cutoff scores for all panelists, the extreme cutoff scores, and the standard deviation, mean and/or median of the cutoff scores. The overall cutoff score could be the mean, the median, the trimmed mean, or the trimmed median (the latter two being calculated after deletion of outliers). The second kind is impact data, which could show the raw score distributions of the examinees or the percentages of test takers in each category. The third kind is reality information, which involves providing the item difficulty or a provisional standard to panelists, and which is usually provided in practice. As described in the manual for relating examinations to the Common European Framework of Reference (CEFR; Council of Europe, 2009), the cut score can be calculated after the first round of standard setting. Examinees whose scores in the real test data fall in the vicinity of this cut score can be regarded as borderline students. The proportion of correct answers to each item for these borderline students is calculated, and this reality information can be provided as feedback to the panelists before proceeding to the next standard setting round. The Reckase chart, which lists the probability of a correct response at each cut score for multiple-choice items and the estimated raw score for open-ended questions, is a typical example of the third kind of feedback.
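To make the first and third kinds of feedback concrete, the following minimal Python sketch computes normative summaries of panelists' cut scores and the proportion correct per item among borderline examinees. The function names, the one-value trimming rule, the width of the borderline band, and the example data are all illustrative assumptions and are not taken from any operational procedure.

```python
from statistics import mean, median, stdev


def trim(values, k=1):
    """Drop the k lowest and k highest values (a simple, assumed outlier rule)."""
    ordered = sorted(values)
    if len(ordered) <= 2 * k:
        raise ValueError("too few values to trim")
    return ordered[k:len(ordered) - k]


def normative_feedback(cut_scores, k=1):
    """Summarise panelists' cut scores as normative feedback."""
    core = trim(cut_scores, k)
    return {
        "mean": mean(cut_scores),
        "median": median(cut_scores),
        "sd": stdev(cut_scores),
        "min": min(cut_scores),
        "max": max(cut_scores),
        "trimmed_mean": mean(core),
        "trimmed_median": median(core),
    }


def borderline_p_values(responses, cut, band=1):
    """Proportion correct per item among examinees scoring within `band` of `cut`.

    `responses` holds one list of 0/1 item scores per examinee.
    Returns None when no examinee falls inside the band.
    """
    borderline = [r for r in responses if abs(sum(r) - cut) <= band]
    if not borderline:
        return None
    n_items = len(responses[0])
    return [sum(r[i] for r in borderline) / len(borderline) for i in range(n_items)]


# Hypothetical cut scores (in raw-score points) from eight panelists.
example_cuts = [52, 55, 57, 58, 60, 61, 63, 90]
feedback = normative_feedback(example_cuts)

# Hypothetical scored responses from four examinees on four items.
example_responses = [[1, 1, 0, 0], [1, 0, 1, 0], [1, 1, 1, 1], [0, 0, 0, 0]]
reality = borderline_p_values(example_responses, cut=2, band=0)
```

In this example the single extreme cut score of 90 pulls the untrimmed mean above the trimmed mean, which is the kind of contrast normative feedback is meant to make visible to panelists.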
Because all standard setting methods are based on subjective judgments which could be suspect with respect to reliability and validity, it is important to systematically evaluate the quality of judgments obtained from panelists. Hambleton and Pitoniak (2006) proposed a systematic, three-pronged framework for evaluating standard setting methods utilizing procedural, internal, and external criteria. Procedural criteria, focusing on implementation issues and documentation, are typically found in qualitative case studies such as Papageorgiou (2010a). External criteria are used to corroborate the validity of cut scores and standards via comparison with other standard setting methods and/or placement success rates. An example of this type of study is Buckendahl, Smith, Impara, and Plake (2002), who compared the cut scores derived from the Angoff and Bookmark methods for a Grade 7 mathematics assessment in a midwestern school district. Finally, internal criteria address decision consistency by examining both intra-participant and inter-participant reliability. Examining this kind of criterion is complex, as previous studies have noted. For example, Engelhard (2009) examined the internal criteria within the framework of Rasch measurement theory; five categories were evaluated: rater severity or leniency, halo effect, central tendency, restriction of range, and interrater reliability or agreement. In addition, during the Michigan English Test standard setting process, both intrajudge and interjudge consistency were examined, using indices such as the standard error of judgment, agreement coefficients, and Kappa indices, to support internal validation (Papageorgiou, 2010b).
The present study also focuses on evaluating the quality of judgments from the perspective of internal criteria. However, a new approach based on the Multifaceted Rasch model (MFRM; Linacre, 1994) is used to identify both the variability between panelists and the internal inconsistencies within the judgments of each panelist. This article first reviews the literature on standard setting in language assessment, the Yes/No Angoff method, and the theoretical background of the Multifaceted Rasch model. The review is followed by the research purpose, the implementation of the study, the data analysis, and a discussion of the results.
Literature review
Standard setting in language assessment
Many methods are available for setting performance standards. Regardless of the methods, performance standards must be defensible and valid; therefore, the process used in the standard setting needs to be reasonable, systematic, and thoughtful (Hambleton & Pitoniak, 2006). The standard setting process involves several steps, including selection of a standard setting method, choosing a panel and design, preparing descriptions of performance categories (the PLDs), training panelists to use the method, collecting item ratings, providing feedback and facilitating discussions, compiling panelist ratings and obtaining performance standards, conducting panelist evaluation and compiling validity evidence, and preparing technical documentation. Some guidelines of standard setting for these steps are provided in the Standards for Educational and Psychological Testing (AERA, APA, & NCME, 1999).
The Common European Framework of Reference (CEFR), a reference publication for learning, teaching and assessment, also has played a crucial role in raising awareness of standard setting in language testing (Council of Europe, 2001). The purpose of the CEFR is to provide a descriptive system of language activities to illustrate the levels of proficiency which can be used by educators, learners and test designers. By doing so, the CEFR provides proficiency level descriptions at each stage of language learning, which can be used for standard setting. Several scales are used in the CEFR to describe the language levels. The most general one is the global scale of common references. This global scale can be used to differentiate basic users (Level A), independent users (Level B), and proficient users (Level C). Each of these levels is further divided into two levels: A1, A2, B1, B2, and C1, C2. To help test developers link their assessments to the CEFR, the Council of Europe has developed a manual as well as reference supplements which include suggested standard setting procedures (Council of Europe, 2009; Figueras, North, Takala, Verhelst, & Van Avermaet, 2005; Takala, 2004).
There have been several studies in the context of relating language test scores to the CEFR. For example, Bechger, Kuijper, and Maris (2009) report on two studies that link the state examination of Dutch to the CEFR for languages. In the first study, key persons from institutions for higher education were asked to decide the minimally required language level of beginning students. In the second study, the contrast group method was used to determine whether students who passed the state test had the required language proficiency. The raters were asked to infer whether examinees would be able to perform certain language acts in real life. However, the results show that the raters found this task difficult and agreement between the panelists was low.
Additional issues plaguing standard setting discussions were recently described by Papageorgiou (2010a), who analyzed participants’ group discussions for a CEFR standard setting research project. Twelve native speakers of English with degrees in TESOL or Applied Linguistics were employed as panelists in the project. Qualitative data were collected by recording the group discussions for the entire standard setting process. Papageorgiou’s findings indicated that decision making might be affected by factors that are irrelevant to the judgment task and that setting cut scores was not without problems for the panelists. Even though qualitative research from an operational meeting can be quite informative, it would be worthwhile to deepen the understanding of panelists’ judgments through quantitative analysis.
The aforementioned CEFR studies raise awareness of the importance of setting defensible cut scores in the field of language assessment. Even beyond the CEFR, the implications inherent in establishing and implementing cut scores for high-stakes assessments have motivated numerous quantitative studies to focus on the application of standard setting methods (Bechger, Kuijper, & Maris, 2009; Harsch & Rupp, 2011; Kaftandjieva, 2004, 2010; Martyniuk, 2010; O’Neill, Buckendahl, Plake, & Taylor, 2007).
Yes/No Angoff standard setting method
The Angoff method, first introduced in the ‘Scales, Norms, and Equivalent Scores’ chapter of the second edition of Educational Measurement (Angoff, 1971), is currently one of the most widely used standard setting methods, as it is well suited to tests composed of multiple-choice items, a popular format for certification exams and tests related to the medical professions. In this method, panelists review multiple-choice items and estimate the probability that a minimally competent examinee would answer each item correctly. The average of these probabilities then serves as the minimal cutoff score. However, a major criticism of this method is that judges’ estimates of item difficulty for the prescribed borderline students are likely to be inaccurate and inconsistent. According to Bejar (1983), estimating item difficulties is not easy for judges, even after extended training. Bejar found that judges were merely able to rank order the items by difficulty rather than estimate the actual levels of examinee performance. Impara and Plake (1998) further examined the extent to which teachers could estimate students’ performance on a district-wide science test. Their study showed that teachers tended to underestimate how low-performing students would perform on the test. In addition, the teachers’ estimates of the average proportion correct for the borderline students were quite inaccurate.
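The averaging step in the original Angoff method can be illustrated with a short Python sketch. It assumes one common way of aggregating, in which each panelist's probabilities are summed to give that panelist's expected raw score for a minimally competent examinee and the panel cut score is the mean of those sums; the function name and the ratings are hypothetical.

```python
from statistics import mean


def angoff_cut_score(ratings):
    """Return (panel cut score, per-panelist cut scores).

    `ratings[p][i]` is panelist p's estimated probability that a minimally
    competent examinee answers item i correctly.
    """
    panelist_cuts = [sum(row) for row in ratings]
    return mean(panelist_cuts), panelist_cuts


# Hypothetical probability estimates from two panelists on four items.
example_ratings = [
    [0.5, 0.75, 0.25, 1.0],
    [0.5, 0.5, 0.5, 0.5],
]
panel_cut, per_panelist = angoff_cut_score(example_ratings)

# The same result expressed as a proportion of the four items.
panel_cut_proportion = panel_cut / len(example_ratings[0])
```

Dividing the panel cut score by the number of items gives the average of all probability estimates, so the raw-score and proportion forms carry the same information.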
Impara and Plake (1997) simplified Angoff’s original approach by asking panelists to decide if a borderline examinee could answer each item on a test correctly. Panelists rated as ‘Yes’ if the borderline student could answer the item correctly, or ‘No’ if otherwise; this approach was named the Yes/No Angoff method.
The implementation of the Yes/No Angoff method is quite similar to other standard setting methods. After training and discussion of the characteristics of the borderline groups for each performance level based on descriptions of proficiency in the PLDs, the panelists rate each item in the test booklet to complete the first round judgment. The panelists are provided with feedback on their first round ratings, usually the normative information or scatter plots of cut scores by panelists. Based on the feedback or group discussion, panelists generate the second round of yes/no judgments for each item. At the end of the second round, panelists receive additional feedback, such as item difficulty, percentages of examinees predicted to pass/fail based on their judgments, to further revise their decisions. The same process is repeated across several rounds as necessary, but the calculation of the cutoff score is based on the judgments made in the final round.
The Yes/No Angoff method can be used to set more than one cut score. For example, where four performance categories (i.e. three cutoff scores) are required, the Yes/No Angoff method could require the panelists to generate three sets of ratings per round, one for each of the boundaries between limited/basic, basic/proficient, and proficient/advanced. In such a situation, training would be implemented to help panelists form three key conceptualizations of borderline performance for the respective levels. Panelists would then generate three sets of ratings simultaneously based on these conceptualizations. An alternative technique is to rate all items for one boundary (i.e. one hypothetical examinee) first and then rate all items again for the second and third boundaries respectively.
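As a minimal sketch of how three sets of yes/no ratings translate into three cut scores, the Python example below counts each panelist's 'Yes' judgments per boundary and averages the counts across panelists. The use of the mean as the aggregate, the data layout, and the example judgments are assumptions made for illustration.

```python
from statistics import mean

# Boundary labels follow the example in the text.
BOUNDARIES = ("limited/basic", "basic/proficient", "proficient/advanced")


def yes_no_cut_scores(judgments):
    """Compute one cut score per boundary from yes/no judgments.

    `judgments[boundary][p][i]` is 1 if panelist p judged that a borderline
    examinee at that boundary would answer item i correctly, else 0.
    """
    return {
        boundary: mean(sum(row) for row in judgments[boundary])
        for boundary in BOUNDARIES
    }


# Hypothetical judgments from two panelists on five items.
example_judgments = {
    "limited/basic": [[1, 0, 0, 0, 0], [1, 1, 0, 0, 0]],
    "basic/proficient": [[1, 1, 1, 0, 0], [1, 1, 0, 0, 0]],
    "proficient/advanced": [[1, 1, 1, 1, 0], [1, 1, 1, 1, 1]],
}
cuts = yes_no_cut_scores(example_judgments)
```

A useful sanity check on such data is that the three cut scores increase from the lowest boundary to the highest; a violation would suggest that panelists' conceptualizations of the borderline groups overlap.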
Although the Yes/No Angoff method was developed with a view towards enhancement of the original Angoff method, it still possesses an inherent drawback. The report of NAEP indicated that evaluating the difficulty of test items for borderline students is too difficult for panelists to accomplish in an accurate manner (Pellegrino et al., 1999; Shepard et al., 1993). However, this criticism has not received much empirical attention for the reason that these observations are incomplete evaluations based largely on dated and secondhand evidence (Hambleton et al., 2000). Overall, as indicated in Cizek & Bunch (2007), Angoff methods will continue to see widespread use in the near future.
Multifaceted Rasch model
The Multifaceted Rasch model (MFRM), extended from earlier Rasch models, is implemented to address the measurement needs of testing contexts where judges are involved (Andrich, 1978a, 1978b; Linacre, 1994; Masters, 1982; Rasch, 1980). MFRM can be regarded as one of the various Item Response Theory (IRT) models, which is useful for the resolution of issues that typically occur in Classical Test Theory (CTT), such as sample dependencies of the items and test indices (P values, discrimination and reliability indices), where each examinee’s ability depends on particular items used in the test. MFRM has three distinct advantages which can overcome these issues. First, in the context of standard setting, the severity of the expert judge could threaten the appropriateness of the cutoff scores. Typically, the accounting for parameters such as panelist severity was attempted through techniques such as averaging across multiple panelists’ ratings and iteration. Through the use of MFRM, the severity can be modeled within the equation instead of making adjustments afterwards (Stone, Beltyukova, & Fox, 2008). Second, Rasch fit statistics can be used to determine the aberrant response patterns, that is, the extent to which items, tasks, or panelists are inconsistent with the item response model. In the context of standard setting, the fit statistics can serve as feedback for panelists to adjust their decisions. Third, the Rasch model places each facet of the measurement context on a common underlying linear scale that is relevant to the difficulty of the items as estimated by the judges (Linacre, 1999). The scores are reported in units named logits, which can be used to make statistical analyses and comparisons that are useful for examination of the educational gains, for display of the strengths and weaknesses, and for comparison of groups.
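To illustrate the second advantage, the sketch below computes the infit and outfit mean-square statistics commonly reported in Rasch analyses for one panelist's dichotomous judgments. It assumes the model logit for each judgment has already been estimated; the function names and example values are hypothetical, and the exact computations in any particular program may differ in detail.

```python
import math


def rasch_probability(logit):
    """Probability of a 'Yes' (score 1) for a given model logit."""
    return 1.0 / (1.0 + math.exp(-logit))


def fit_mean_squares(observed, logits):
    """Return (infit, outfit) mean squares for dichotomous observations.

    outfit = mean of squared standardized residuals
    infit  = sum of squared residuals / sum of model variances
    """
    probs = [rasch_probability(l) for l in logits]
    variances = [p * (1.0 - p) for p in probs]
    sq_resid = [(x - p) ** 2 for x, p in zip(observed, probs)]
    outfit = sum(r / v for r, v in zip(sq_resid, variances)) / len(observed)
    infit = sum(sq_resid) / sum(variances)
    return infit, outfit


# Hypothetical logits: 'Yes' is very likely on two items, very unlikely on two.
example_logits = [3.0, 3.0, -3.0, -3.0]

# Judgments that follow the model closely.
consistent = fit_mean_squares([1, 1, 0, 0], example_logits)

# One highly unexpected 'No' on an item where 'Yes' was very likely.
aberrant = fit_mean_squares([0, 1, 0, 0], example_logits)
```

Values near 1 indicate judgments about as variable as the model expects, whereas the single unexpected judgment in the second pattern inflates both statistics, outfit in particular, which is the kind of signal that could be fed back to a panelist.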
The MFRM approach has been used in many applications in the fields of language testing, psychological measurement, and health science (Bond & Fox, 2007; Engelhard, 2002; Harasym, Woloschuk, & Cunning, 2008; McNamara, 1996; Wolfe & Dobria, 2008). In particular, this approach has formed the cornerstone of the descriptor scales advanced by the CEFR (North, 2000, 2008; North & Jones, 2009; North & Schneider, 1998). Section H of the reference supplement to the Manual for relating language examinations to the CEFR (Council of Europe, 2009) provides clear illustrations of how to use MFRM to measure the severity (or leniency) of raters, assess the degree of rater consistency, correct examinee scores for rater severity difference, examine the functioning of the rating scale, and detect the interactions between facets from writing assessment data (Eckes, 2009).
Several studies have employed MFRM to evaluate panelists’ standard setting judgments (Kecker & Eckes, 2010; O’Sullivan, 2010). For example, Kecker and Eckes (2010) examined the relation between the Test of German as a Foreign Language (TestDaF) and the CEFR levels. The Basket procedure, which is derived from the Angoff methods and the benchmarking method, was employed. Based on the final item judgments provided in the standard setting session, a FACETS analysis was conducted to eliminate inconsistent judgments when calculating the final cut scores.
Kozaki (2010) also conducted a standard setting study for a certification test of medical translation from Japanese to English, utilizing MFRM to develop a new method for performance assessment. Her study provided a promising procedure for use in low-budget and relatively low-stakes contexts when it is not possible to gather all the panelists together. In Kozaki’s procedure, the meeting materials, including the item booklets, are sent to the panelists for independent work, which entails a large element of risk for test security. In such a case, this approach may only be suitable for low-stakes assessment because items may be leaked to the public.
Engelhard’s (2009) study provides some criteria for evaluating the quality of standard setting in the context of MFRM. The rating data examined in that study were obtained with a new standard setting method called objective standard setting, which is not a commonly used approach. In addition, the panel utilizing objective standard setting was composed of a small number of individuals. It would be informative to know whether MFRM remains a promising approach for evaluating standard setting judgments when the circumstances involve a popular standard setting procedure (e.g. Yes/No Angoff) and more panelists.
MFRM extends the basic Rasch model by incorporating facets beyond the two typically included in a test (i.e. examinees and items), such as raters, scoring criteria, and tasks. This model has promising features for integrating theorists’ and practitioners’ interests in measuring language proficiency (Eckes, 2009). Besides the studies mentioned previously, further examples of MFRM applications in standard setting procedures can be found in Eckes (2009), Engelhard and Gordon (2000), Engelhard and Stone (1998), Kecker and Eckes (2010), Kozaki (2004), Lumley, Lynch, and McNamara (1994), Lunz (2000), Stone (2006), and Stone, Beltyukova, and Fox (2008). These studies show that the MFRM approach is a valuable instrument for evaluating standard setting data and setting cut scores on examinations.
Purpose of this study
The purpose of this article is to use the MFRM to evaluate the quality of judgments in an English assessment test for elementary students. Thus, the present study falls under the category of quantitative evaluation of internal criteria as envisioned by the Hambleton and Pitoniak (2006) framework. The criteria for evaluating the judgments of panelists are drawn from those developed for examining rater variability and bias within the context of Rasch measurement theory. The Yes/No Angoff method, a commonly used standard setting procedure, is implemented. Accordingly, the main research question addressed by this paper is: How can the MFRM help evaluate the quality of judgments in the Yes/No Angoff method?
Methods
Participants
A one-day workshop for setting cutoff scores for a sixth-grade English test was held for a panel of 32 experts. These panelists were selected from a recommendation list provided by the content specialists. All panelists were knowledgeable about the school curricula and test content, as well as the characteristics of the examinees. The demographic composition of the 32 panelists is shown in Table 1. The panel was composed of elementary school teachers (n = 24), university professors (n = 4), and school administrators (n = 4). Six were male and 26 were female. The panelists represented the geographic areas of Taiwan, with 20 from the northern area, seven from the middle area, two from the southern area, and three from the eastern area. The panelists’ average working experience was 10 years. Given these characteristics, the selected panelists can be considered reasonably representative of the target population.
Table 1. Demographic information of panelists.
Assessment description
The Taiwanese Assessment of Student Achievement (TASA) is a nationwide assessment to evaluate academic achievement for sixth, eighth, and tenth grades. The test battery includes five major tests: Mathematics, Science, Social Science, Mandarin, and English. Standard setting is planned to be conducted for each of the tested grades in the next few years. But for the current stage, the standard setting was only implemented for Grade 6.
The sixth-grade English test, which was constructed with content based on the nine-year curriculum plan issued by the Taiwanese Ministry of Education (MOE), was used in this study. The English test is composed of two main sections: listening and reading, which are subdivided into sections presenting different tasks. The listening section contains items which require testees to differentiate English words and sentences and to understand the meaning of a daily conversation, while the reading section contains items that require testees to identify the correct vocabulary and sentences and to understand short paragraphs, tables, and graphs. The purpose of the test is to determine the effectiveness of elementary school instruction and curriculum as well as the general performance of Taiwanese students. Roughly 10,000 students take this test annually. The students are selected based on the two-stage random sample design. All items in the test are multiple-choice items with three options. A total of 103 items were used operationally which includes 24 listening items and 79 reading items.
Because the item parameters are jointly calibrated in operational work, overall cut scores were set instead of separate cut scores for the reading and listening sections. In addition, the examinees received a composite score instead of separate listening and reading scores in the real testing situation. As this test consists of two subtests, which could threaten the IRT assumption of unidimensionality, the IRT computer software TESTFACT 4 (Wood, Wilson, Gibbons, Schilling, Muraki, & Bock, 2003) was used to examine unidimensionality. The results showed that the first latent root of this assessment was 23.72, the second was 1.65, the third was 1.03, the fourth was 0.89, the fifth was 0.80, and the sixth was 0.75. Based on the rules of thumb that (1) the first root is large compared to the second and (2) the second root is not much larger than any of the others (Lord, 1980, p. 21), the items are approximately unidimensional. Thus, it should be acceptable to set standards for the overall exam.
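The two rules of thumb can be checked directly against the reported latent roots. The short Python sketch below computes the relevant ratios; since the rules are qualitative, no numerical thresholds are applied, and the variable names are illustrative only.

```python
# Latent roots reported for the assessment, in descending order.
latent_roots = [23.72, 1.65, 1.03, 0.89, 0.80, 0.75]

# Rule 1: the first root should be large compared to the second.
first_to_second = latent_roots[0] / latent_roots[1]

# Rule 2: the second root should not be much larger than the remaining roots.
second_to_rest = [latent_roots[1] / root for root in latent_roots[2:]]

print(f"first / second root: {first_to_second:.2f}")
print("second / later roots:", [round(r, 2) for r in second_to_rest])
```

The first root is roughly 14 times the second, while the second is at most about 2.2 times any later root, which is consistent with the conclusion of approximate unidimensionality.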
Because the test was administered to sixth-grade students, who would find it difficult to take a test with many items, an equating design called the partially Balanced Incomplete Block (pBIB) approach was used to assign items to test forms and students and thereby reduce test-taking time (Allen, Donoghue, & Schoeps, 2001; Bose & Nair, 1939). The purpose of this design is to provide several booklets containing different blocks of items, so that no student receives too many items. In this design, the cognitive blocks are balanced. Each cognitive block appears an equal number of times in every possible position and each cognitive block is paired with every other cognitive block in at least one test form (van der Linden, Veldkamp, & Carlson, 2004). Through equating analysis, the scores on different forms of a test can be compared to each other. In this study, a total of 12 test forms were constructed and each test form consisted of 40 items.
Although TASA was initiated in 2005, there is no well-accepted learning principle that explicates the performance criteria for this assessment. To make the public more aware of the test purposes and standards, the program director invited experts in English assessment to write up the PLDs consisting of several paragraphs that provide full and precise descriptions of the knowledge, skills or attributes of test takers within each performance level. A standard setting conference was conducted to set three cutoff scores, which can be used to classify students into four categories – below basic, basic, proficient, and advanced.
Procedures
After confirming the attendance lists of the panelists, the researchers mailed out the informational packages which contained copies of the content standards, test specifications, sample tests, and statement of purpose for the standard setting meeting. These materials were delivered to the participants one week before the standard setting meeting to allow sufficient time for participants to read materials, familiarize themselves with the meeting agenda, understand the purpose of the standard setting, and communicate with the researchers if there were any questions or concerns.
In the standard setting meeting, the researcher illustrated the method they would use and demonstrated the importance of the activities. The researcher also explained the purpose of the tests and provided the panelists with a clear understanding of test content, which was followed by an illustration of the PLDs. In the training section, understanding the PLDs at each performance level and how they relate to student performance on the assessment is the primary feature. Several exercises were given to the panelists to help them practice applying the PLDs. They were asked to evaluate practice items that would not be part of their rating items to determine which PLD statement more closely matched the knowledge and skill requirements for correctly answering the item. The classification task was done independently and then followed by a group discussion. In addition, the panelists were given a set of actual student responses; they reviewed the papers and selected the one that best represented the borderline of each achievement level. They were informed that it was possible that none of the papers were thought to represent the borderline students. Again, the panelists worked independently and then participated in a group discussion of their work. The discussion was lively and elicited thoughtful exchange regarding the pros and cons of specific papers as examples of borderline student performance. These tasks helped panelists to form a clearer understanding of the PLDs.
After training and discussion of the characteristics of the borderline groups for each performance level, the panelists rated each item in the test booklet to complete the first round judgment. Four performance categories (i.e. three cutoff scores) were required, and the participants accordingly generated three sets of ratings in the first round. After completing all judgments in the first round, the working sheets of panelists were collected, data were entered, and averaged cutoff scores for each level were calculated.
Based on analysis of the worksheets, panelists were provided with feedback on their first round ratings, which included the holistic rating by each panelist for each item, the mean and median of the cutoff score for each level, item difficulty values and item variances, and the percentages of examinees in each performance category based on the 2009 real testing data. The panelists reviewed the feedback and began a one-hour discussion. The same procedures were repeated over three rounds. Unless otherwise indicated, the analyses reported in this study are based on data from the third round, since the standard setting decisions are made on the basis of the final round. Some analyses were conducted using data from all three rounds to show how the judgments changed over time.
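The percentages of examinees in each performance category, one element of the feedback described above, can be sketched in Python as follows. The example assumes raw scores on a single 40-item form and the convention that a score equal to a cut score reaches the higher category; the operational data were scaled and equated across forms, so the function name, scores, and cut scores here are purely hypothetical.

```python
from bisect import bisect_right
from collections import Counter

CATEGORIES = ("below basic", "basic", "proficient", "advanced")


def impact_percentages(scores, cuts):
    """Percentage of examinees in each category.

    `cuts` holds three ascending cut scores; a score at or above a cut score
    is placed in the higher category (an assumed convention).
    """
    counts = Counter(CATEGORIES[bisect_right(cuts, s)] for s in scores)
    return {c: 100.0 * counts[c] / len(scores) for c in CATEGORIES}


# Hypothetical raw scores for eight examinees and hypothetical cut scores.
example_scores = [10, 15, 20, 25, 30, 35, 38, 40]
example_cut_scores = [15, 25, 35]
impact = impact_percentages(example_scores, example_cut_scores)
```

Recomputing these percentages after each round shows panelists how a shift in any one cut score would reclassify examinees.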
Multifaceted Rasch model
The data used in this analysis comprise a 32 × 103 × 3 × 3 matrix, that is, 32 panelists rendering judgments for 103 items at three cut scores across three rounds. Accordingly, each panelist was labeled three times, once per cut score. For example, 1b represents the judgment at a basic level made by panelist 1, 1p represents a proficient level designation, and 1a represents an advanced level designation. These data were analyzed with the computer program FACETS (Linacre, 2007), which provides estimates for the parameters of the MFRM. The Multifaceted Rasch model, operationalized by the computer program FACETS, is an additive linear model based on a logistic transformation of the observed ratings to a logit scale. In this model, the dependent variable is the logistic transformation of ratios of successive category probabilities, and the independent variables are the facets of item difficulty and panelist severity. The MFRM used in this study can be written as follows:
where
$P_{nijk1}$ = the probability of a ‘Yes’ being rated on item $i$ by panelist $n$,
$P_{nijk0}$ = the probability of a ‘No’ being rated on item $i$ by panelist $n$,
$\beta_n$ = the severity of panelist $n$,
$\delta_i$ = difficulty of item $i$,
$\omega_j$ = judgment of performance standard for round $j$,
$\tau_k$ = difficulty of rating a ‘Yes’ relative to ‘No’.
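For reference, the definitions above are consistent with a log-odds equation of the following form. This is a sketch under an assumed sign convention, not a confirmed statement of the study's model: panelist severity is taken to enter positively (more severe panelists give more ‘Yes’ ratings and so set higher cutoff scores), and the item, round, and threshold terms are taken to enter negatively, as is conventional in MFRM applications.

```latex
\ln\!\left(\frac{P_{nijk1}}{P_{nijk0}}\right) = \beta_n - \delta_i - \omega_j - \tau_k
```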
As demonstrated in the model, each element of a facet is represented by one parameter. The FACETS program produces a logit scale that theoretically varies from negative infinity to positive infinity. The logit scale in MFRM eliminates the problem of sample dependence mentioned earlier, as the logarithmic transformations create an interval scale in which unit intervals between the locations on the variable map have uniform values, or meanings, rendering comparisons within and among various facets possible. In this case, when panelists’ facets are negatively oriented, indicated by negative logits, it means that relatively lenient judging practices established lower cutoff scores. On the other hand, positive logits indicate more severe judgments as panelists set higher cutoff scores.
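To make the logit scale concrete, the short sketch below converts a set of facet measures (in logits) into the model probability of a ‘Yes’ rating. The function name and the sign convention (severity positive, the other facets negative) are assumptions made for illustration and are not taken from the FACETS output.

```python
import math

def prob_yes(severity, difficulty, round_effect=0.0, threshold=0.0):
    """Model probability of a 'Yes' rating under an assumed additive logit form.

    All arguments are in logits. Assumed convention: higher panelist severity
    raises the log-odds of 'Yes'; higher item difficulty lowers it.
    """
    logit = severity - difficulty - round_effect - threshold
    return 1.0 / (1.0 + math.exp(-logit))

# A panelist whose severity equals the item difficulty has an even chance of 'Yes'.
print(prob_yes(0.0, 0.0))
# Equal one-logit steps in severity always change the log-odds by the same amount,
# which is the interval-scale property described above.
print(prob_yes(1.0, 0.0), prob_yes(2.0, 0.0))
```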
It is expected that no data could fit this model exactly, but it is possible to investigate the extent to which the data fit the specified model using fit statistics, which indicate how far the ratings actually observed differ from those predicted by the model, given the parameter estimates. Because the standardized fit statistics are more dependent on the sample size, two types of mean square fit statistics are examined instead: infit and outfit. The infit statistics are weighted mean square residual statistics and are less sensitive to outlying unexpected ratings, while the outfit statistics are unweighted mean square residual statistics that are particularly sensitive to outlying unexpected ratings (Linacre, 1994). Ranges for infit and outfit statistics depend on the testing situation, but the generally acceptable range is from 0.5 to 1.5 (Engelhard, 1992). The reliabilities of separation coefficients for each facet are also examined. This index provides information about how well the elements within a facet are separated in order to reliably define the facet. This coefficient is very similar to the estimates of internal consistency (Smith, 1991) that provide a measure of the extent to which components in each facet are separated along the continuum. A separation index of less than 2 indicates that the items may not be sensitive enough to distinguish between high and low performers or that the sample size is not large enough to confirm the item difficulty hierarchy of the instrument (Linacre & Wright, 2006). The statistical significance of the separation of components within a facet is given by a chi-square test with a null hypothesis of no variation (Schumacher & Lunz, 1997).
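The difference between the two mean square statistics can be illustrated with a small sketch for dichotomous (Yes/No) ratings. It uses the standard textbook definitions (outfit as the unweighted mean of squared standardized residuals, infit as the information-weighted version); the function name and the example data are hypothetical, and FACETS itself may apply further adjustments.

```python
def fit_mean_squares(observed, expected):
    """Infit and outfit mean squares for dichotomous ratings.

    observed: list of 0/1 ratings.
    expected: model probabilities of a 'Yes', each strictly between 0 and 1.
    """
    variances = [p * (1.0 - p) for p in expected]
    sq_resid = [(x - p) ** 2 for x, p in zip(observed, expected)]
    # Outfit: plain average of squared standardized residuals (outlier-sensitive).
    outfit = sum(r / v for r, v in zip(sq_resid, variances)) / len(observed)
    # Infit: squared residuals weighted by the model variance (information).
    infit = sum(sq_resid) / sum(variances)
    return infit, outfit

# Hypothetical example: the last rating is a 'No' where the model expected
# a 'Yes' with probability 0.95, i.e. an outlying unexpected rating.
infit, outfit = fit_mean_squares([1, 1, 1, 0], [0.5, 0.5, 0.5, 0.95])
print(round(infit, 2), round(outfit, 2))
```

In this example the single outlying rating inflates the outfit value much more than the infit value, which is the behavior described above.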
The panelists’ unexpected judgments for items and the standardized residuals are output for the analysis of aberrant judgments. These standardized residuals provide for a detailed examination of the correspondence between observed and expected rating patterns based on the model and can be summarized over different facets and elements within a facet in order to provide indices of how closely the data fit the FACETS model. Ideally, no more than 5% of the standardized residuals should fall outside the ±2 band when the data fit the model (Linacre & Wright, 2006). However, if the large residuals are concentrated on particular elements, they indicate local misfit that may not have a major impact on the overall summary fit statistics.
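A minimal sketch of this screening step for Yes/No ratings is given below. It assumes the usual definition of a standardized residual (observed minus expected, divided by the model standard deviation); the function names and example values are hypothetical and do not reproduce FACETS output.

```python
import math

def standardized_residual(x, p):
    """Standardized residual for a 0/1 rating x with model probability p of 'Yes'."""
    return (x - p) / math.sqrt(p * (1.0 - p))

def flag_unexpected(observed, expected, cutoff=2.0):
    """Return (index, residual) pairs whose absolute residual exceeds the cutoff."""
    flagged = []
    for idx, (x, p) in enumerate(zip(observed, expected)):
        z = standardized_residual(x, p)
        if abs(z) > cutoff:
            flagged.append((idx, z))
    return flagged

# Hypothetical ratings: only the third judgment (a 'Yes' where the model
# gave a probability of 0.1) falls outside the +/-2 band.
flags = flag_unexpected([1, 0, 1], [0.5, 0.5, 0.1])
print(flags)
```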
Results
Figure 1 displays the infit and outfit plots for each item at each level. It is readily apparent that there is strong agreement among the panelists regarding the judged difficulty of these items, with only a few items exhibiting infit and outfit mean square errors greater than 1.5. Similar plots have been constructed for panelists at each level, displayed in Figure 2. It can be seen that with regard to inter-rater agreement most panelists appear to be highly consistent in their judgments at each level. However, there are still some panelists (e.g. panelist 17 at the proficient level) displaying infit or outfit statistics greater than 1.5. In order to further explore the nature of the misfitting items and panelists, standardized residual plots should be constructed for each of them. As shown in Figures 1 and 2, panelist 17 has the largest infit values and Item 97 has the largest infit and outfit values among these misfitting panelists and items. Thus, this panelist and this item were selected to create sample residual plots to show the utilization of standardized residual plots.

Figure 1. Infit and outfit plots for all items at each level.

Figure 2. Infit and outfit plots for all panelists at each level.
Figure 3 shows the judgments of panelist 17 from round 1 to round 3. The standardized residual plots can be read as follows: items with standardized residuals above zero tend to be viewed by the panelist as more difficult than they were by the other panelists, while items below zero tend to be viewed by the panelist as easier. Items with standardized residuals outside the interval from −2 to +2 indicate problematic disagreement. The data presented in Figure 3 show that in round 1, panelist 17 viewed item 94 as significantly more difficult than the other panelists did, as indicated by a standardized residual larger than +2. Conversely, item 43, with a value less than −2, is an item that the panelist viewed as relatively easier. In round 2, panelist 17 viewed items 12 and 102 as significantly easier. In round 3, he viewed items 3, 12, 30, 36, 45, 49, and 93 as significantly easier. Figure 3 shows that panelist 17 was inconsistent and unpredictable in his judgments across rounds, and he may require more consultation on the PLDs. These standardized residual plots can be provided as feedback so that each panelist can reflect on their own judgments. Panelists can review the items with large standardized residuals, ascertain whether they made these decisions unintentionally, and revise the judgments accordingly.

Figure 3. Analysis of standardized residuals for panelist 17 at proficient level for each round.
As seen in Figures 1 and 2, item 97 is an example of an item that exhibits large infit and outfit statistics at the advanced level. The data presented in Figure 4 indicate that at round 1, panelists 3, 5, and 10 viewed this item as being significantly easier than the other panelists did, as indicated by standardized residuals of less than −2. At round 2, panelists 1, 2, 3, 5, and 6 viewed this item as significantly easier. However, at round 3, panelists 19, 20, 21, 22, 24, and 28 viewed this item as being relatively harder than the other panelists did, as indicated by standardized residuals greater than +2. Large standardized residuals in the last round are not unusual because some panelists used the feedback (e.g. percentages of examinees in each performance category) to adjust their judgments higher or lower than their initial cut scores. However, this may indicate that the panelists departed from their conceptualization of the borderline groups and focused on their personal beliefs about the appropriate cut scores. It may also reveal that panelists were unsure how to use the feedback to change their judgments to better reflect the performance of the borderline students (Buckendahl et al., 2002).

Figure 4. Analysis of standardized residuals for item 97 at advanced level for each round.
Figure 4 shows the judgments for item 97 are inconsistent and more discussion is needed for this item. In addition, there might also be a substantive reason for the apparent grouping of the panelists related to the background characteristics and personal experiences of each panelist, which requires further exploration. Similar standardized residual plots can be constructed for the other misfitting items with large infit and outfit statistics for better understanding of the historical change in these judgments.
Figure 5 provides the variable map for panelists’ views of the cutoff scores for each performance level. This variable map graphically presents the location of each panelist’s view of the cutoff scores on the latent variable scale of English ability. The Rasch scale of this map is from −5 to 7, with higher Rasch measures indicating greater severity. The numerals represent the panelist numbers and the letters represent the performance level. For example, 1b represents a basic level designation by panelist 1, 1p represents a proficient level designation, and 1a represents an advanced level designation. The overview of infit/outfit statistics in Figure 2 show that some panelists (e.g. panelist 31) are quite stable with respect to severity and some panelists (e.g. panelist 29) are unstable at different levels. This map can also indicate the highly discriminating panelists. For example, panelist 29 is an under-discriminating panelist, as indicated by severity of judgments at the advanced level, but leniency at the basic level. The consequence of such decisions, with a wide gap between advanced and basic levels, will be to place a large number of students into the proficient level. On the other hand, panelist 22 is lenient at the advanced level but severe at the basic level. Such decisions will result in a narrower distance between the two levels, meaning he is a ‘highly discriminating’ panelist. The result will be a greater propensity to spread students out across different performance levels.

Figure 5. Calibration of judges’ views of cutoff scores for three performance levels.
Table 2 presents a summary of the measurement report for the panelist and item facets at each performance level. Regarding the panelist facet, this summary table provides information on the spread of panelists’ judgments across the logit scale as well as the fit of the model. The FACETS analysis provides several statistics to show the magnitude of the differences among the elements of each facet, including reliability, the separation index, and the chi-square statistics (Weigle, 1998). The separation index is the ratio of the corrected standard deviation of element measures to the root mean square standard error (RMSE). If the panelists were equally severe, the standard deviation of the panelist severity estimates would be equal to or smaller than the RMSE. The separation index also indicates the number of measurably different levels among the elements of a facet. As shown in Table 2, the rater separation index is approximately equal to two, indicating that the corrected standard deviation among raters is about two times the error of the estimates. The reliability statistics indicate the degree to which the analysis reliably distinguishes between different levels of severity among the panelists. A low reliability would indicate that the panelists are about equally severe. However, in this case, the reliability ranged from 0.79 to 0.81, indicating that the analysis separates panelists fairly well into different levels of severity. Thus, it may be inferred that among the 32 raters included in the analysis, there are about two statistically distinct classes of rater severity. Moreover, the fixed chi-square tests the null hypothesis that the panelists are equal in severity; in this case, the chi-square indices are statistically significant (p < .001), indicating that the panelists are not equally severe. That is, different panelists presented with the same items did not render equally severe or lenient judgments.
Finally, the infit and outfit mean square close to 1 indicated that panelists demonstrated good fit, and on average, provided predictable responses. Overall, the panelists’ decision making seems quite consistent with the measurement model.
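The link between the separation index and the reliability of separation discussed above can be sketched as follows. The formulas are the standard Rasch relationships (separation G as corrected SD over RMSE, reliability as G² / (1 + G²)); the function names and the example inputs are hypothetical, and the values in Table 2 come from FACETS rather than from this code.

```python
import math

def separation_index(observed_sd, rmse):
    """Separation G: corrected ('true') SD of the measures divided by the RMSE."""
    true_variance = max(observed_sd ** 2 - rmse ** 2, 0.0)
    return math.sqrt(true_variance) / rmse

def reliability_from_separation(g):
    """Reliability of separation implied by a separation index G."""
    return g * g / (1.0 + g * g)

# A separation index of about 2 corresponds to a reliability of about 0.80,
# in line with the panelist reliabilities of 0.79 to 0.81 reported above.
print(reliability_from_separation(2.0))

# Hypothetical inputs: observed SD of 0.5 logits and RMSE of 0.3 logits.
print(round(separation_index(0.5, 0.3), 2))
```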
Table 2. Summary of estimated logits for each facet at each level.
Table 2 shows that the separation indices for the item facet ranged from 2.48 to 3.10, which suggested that among the 103 items, there are about three statistically distinct levels of item difficulty, corroborating the three-category scale; that is, the measurement system identified three different levels of performance. The item reliability ranged from 0.84 to 0.87 with significant chi-square values indicating the item difficulties are separated along the ability continuum. In order to establish the origin of the logit scale and make the model identifiable, the convention in the FACETS analysis is to center the mean item difficulty for each performance level. That is, the item facets were constrained to have a mean element measure of zero (Eckes, 2011).
Table 3 shows the frequency distribution of the unexpected observations for each performance level, with the threshold for unexpected observations set at an absolute value of two. It can be seen that the majority of the residuals are close to two, with more unexpected observations at the ‘proficient’ level than at the ‘basic’ and ‘advanced’ levels. The ratio of unexpected observations (277) to overall observations (9540) is approximately 3%. Linacre and Wright (2006) indicated that when the data fit the model, fewer than 5% of the standardized residuals should fall outside an absolute value of 2. The small proportion (3%) of outlying standardized residuals in this analysis indicates that the data fit the model quite well, and most of the judgments made by panelists seem quite reasonable based on the Rasch measurement model.
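The proportion quoted above can be checked with a few lines of arithmetic. The counts are those reported in the text; the function name and the 5% default are illustrative assumptions based on the guideline cited above.

```python
def unexpected_proportion(n_unexpected, n_total):
    """Share of observations whose standardized residual lies outside +/-2."""
    return n_unexpected / n_total

def within_guideline(n_unexpected, n_total, limit=0.05):
    """True if the share of unexpected observations is below the guideline limit."""
    return unexpected_proportion(n_unexpected, n_total) < limit

share = unexpected_proportion(277, 9540)
print(f"{share:.1%}", within_guideline(277, 9540))
```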
Table 3. Summary of unexpected observations.
Conclusions and discussion
As indicated in Hambleton and Pitoniak (2006), the standard setting process is composed of a mixture of judgments, psychometrics, and practicality. Thus, it is not possible to avoid subjective judgments in setting the performance standards. Although the types of judgments vary for different methods, the judgments play an important role. For the researcher, judgments are used to decide the method that is most suitable for standard setting and to determine the demographic composition of the panels. For the panelist, the judgments are the key to success for the whole standard setting process. The researchers should supply adequate training so the panelists can provide their judgments in an informed manner, and set up a proper context within which to interpret those judgments (Hambleton, 2001). The goal of good standard setting work is to make these judgments as informed as possible.
This study presented some statistical indices and plots that are not typically examined within the context of standard setting. However, they can provide evidence for examining the internal criteria in Hambleton and Pitoniak’s (2006) evaluation framework. Table 4 summarizes the types of information that can be provided regarding the quality of ratings for the item and panelist facets in the context of the MFRM. In addition, the questions that can be evaluated based on these indices and plots are provided. For the panelist facet, the variable map can be used to evaluate the relative severity of the panelists. The reliability of separation indices and chi-square statistics provide information about variation among panelists’ ratings. Next, large infit or outfit statistics can be used to identify possibly ‘less consistent’ panelists. Once a panelist is flagged, the standardized residual plot can be constructed to detect the aberrant decisions of that panelist. This information can also serve as the basis for diagnosing the sources of disagreement between the panelist and the group. After receiving the residual plot of items, the panelists can determine whether their decision for each item is reasonable or not. If the plot shows that the residuals are beyond the ±2 boundary for certain items, it indicates that those items produced problematic disagreement. The panelists can use this information to correct those judgments promptly prior to group discussion, while the standard setting designer can use this information to determine whether to delete the aberrant judgments from the final cutoff score calculations.
Table 4. Statistical indices and graphical plots used to evaluate the quality of panelists and items.
The item specific questions that can be used to evaluate the standard setting judgments based on the MFRM model indices and plots are also provided in Table 4. Similar information is provided for the item facet; however, it is worth noting that although the questions being addressed by the indices and plots vary somewhat, the item difficulties and other item-based indices also provided information on the panelists’ interpretations in the process of standard setting.
This study employed the MFRM to examine ratings obtained from the Yes/No Angoff method in the context of language assessment, and the analyses provided evidence regarding the quality of judgments from the standard setting process. In particular, the MFRM provides a framework for summarizing the results of a standard setting procedure that can help the decision maker or project designer better understand the recommended performance standards obtained from standard setting panelists.
In sum, the results of this study indicated that the MFRM provides a promising way to assess whether the requirement of invariant measurement is met (Engelhard, 2009). Invariant measurement is based on the idea that the psychometric quality of an assessment system should not be influenced by construct-irrelevant sources of variance. The panelist and item indices described in Table 4 are based on the theoretical requirements of invariant measurement and on the application of Rasch measurement theory to test whether the goal of invariant measurement has been met in the Yes/No Angoff standard setting method. The information provided in this study should be useful for panelists’ retrospection on their judgments and for standard setting designers’ evaluation of overall judgment quality.
For future study, it would be worthwhile to conduct a standard setting meeting during which the panelists and designers are presented with the information described in this study, and to query their perceptions of the usefulness of these indices and plots for the revision of item judgments. Implementing this analysis in real practice would allow further examination of the value of this kind of feedback in standard setting. It would be meaningful to compare how the judges change their judgments round by round with and without the feedback from the FACETS analysis.
Application of the MFRM model to other standard setting procedures also warrants future investigation. For example, it would be worthwhile to examine how to evaluate the judgments within the context of the bookmark procedure (Mitzel, Lewis, Patz, & Green, 2001), since this method has been used widely in many high-stakes assessments (Schulz & Mitzel, 2005). Each standard setting method involves some explicit and implicit conceptualizations of item or task difficulties, and employing the MFRM model can serve to illuminate different aspects of a variety of standard setting procedures.
Footnotes
Funding
This research received no specific grant from any funding agency in the public, commercial or not-for-profit sectors.
