Abstract
The many-faceted Rasch (MFR) model has been used to evaluate the quality of ratings on constructed response assessments; however, it can also be used to evaluate the quality of judgments from panel-based standard setting procedures. The current study illustrates the use of the MFR model for examining the quality of ratings obtained from a standard setting process employing the Multiple Yes-No, modified Angoff standard setting procedure to set the cut scores on the Advanced Placement Environmental Science exam. The panelists’ severity, judged item difficulties, rounds, and cut scores using the MFR model were examined, in addition to panelists’ characteristics (gender and level of course taught). Findings from the Rasch analyses conducted in this study provide evidence of acceptable quality of the ratings obtained from the panelists. Thus, the current study provides an important source of both internal and procedural validity evidence for the Advanced Placement Environmental Science standard setting as well as an illustration of the utility of the MFR model for evaluating standard setting judgments.
The process of establishing cut scores on achievement tests often necessitates that panel-based standard setting procedures are implemented. Since the publication of Livingston and Zieky’s (1982) foundational standard setting handbook, Passing Scores, numerous definitions for standard setting have appeared within the psychometric and measurement literature. In the words of Cizek and Bunch (2007), “. . . standard setting refers to the process of establishing one or more cut scores on examinations. The cut scores divide the distribution of examinees’ test performances in two or more categories” (p. 5). A common theme across the many definitions of standard setting is its role as a values clarification procedure during which examinee performance is categorized for decision-making purposes. Clarity in the information provided by standard setting judgments is essential for having trustworthy cut scores and for best documenting standard setting procedures. As such, invariant measurement—the requirement that the same construct be measured across the entire population (in this case, the entire population of standard setting panelists)—is necessary. In this study, models based on principles of invariant measurement (Engelhard, 2009) are used to examine the quality of standard setting judgments, and implications for findings are discussed.
Prior to 2011, The College Board had not used a formal judgmental standard setting process (Morgan, Reshetar, Matts, Kaliski, & Hendrickson, 2012). However, a judgmental standard setting procedure was implemented in 2011 for the Advanced Placement® Environmental Science (APES) examination. Specifically, a modified Angoff (Angoff, 1971) procedure known as the Multiple Yes-No (MYN) method (Plake, Impara, Buckendahl, & Ferdous, 2005) was employed for the multiple choice (MC) items, and an extended Angoff (Hambleton & Plake, 1995) procedure was used for the constructed response (CR) items. The current study uses data from the 2011 APES standard setting to explore panelist judgments within the framework of invariant measurement. The following section provides a theoretical framework based on Rasch measurement theory for evaluating the quality of standard setting judgments.
Theoretical Framework
Evaluating the quality of panelist judgments obtained from a standard setting is essential, given the subjective nature of the judgments. Furthermore, evaluative processes add to the documented validity evidence of the cut scores that result from a standard setting procedure. The criteria for evaluating panelist judgments from a standard setting are typically classified as procedural, internal, or external evidence of validity (Hambleton, Pitoniak, & Copella, 2012; Kane, 2001; Pitoniak & Morgan, 2012). Procedural validity evidence typically refers to the documentation, clarity, and explicitness of the implementation of the standard setting procedures and the gathering of evaluation feedback from the panelists themselves. Internal validity evidence typically refers to the consistency of ratings within a particular method (e.g., intrapanelist and interpanelist consistency). For example, generalizability theory studies can be conducted to examine variability in standard setting judgments across panelists (Brennan, 1995; Brennan & Lockwood, 1980; Lee & Lewis, 2008; Yin & Sconing, 2007). External validity evidence typically refers to comparing the results of multiple standard setting methods with one another, as well as evaluating the relationship between the resulting cut scores from a standard setting and other theoretically relevant sources of empirical information (e.g., another test measuring the same construct). Hambleton and Pitoniak (2006) summarize the various methods that fall into each of these three categories of evaluating standard setting judgments.
Many-Faceted Rasch Model
The current study is focused on a particular method for gathering internal and procedural evidence of validity. Specifically, the conceptual framework presented by Engelhard (2009) for using the many-faceted Rasch (MFR) model to evaluate the judgments from standard setting panelists guides this research. This framework is grounded in the Rasch measurement theory criteria for evaluating scores on rater-mediated assessments (Engelhard, 2002). For the current study, these criteria are interpreted using standard setting as the context as opposed to the scores from rater-mediated assessments. Within this framework, specific facets are specified as part of the MFR model, where each facet represents a source of variability in panelist ratings. Such a model provides a method for examining the quality of standard setting ratings that account for the important sources of variability (i.e., the facets) in panelist ratings. Previous studies have employed this model to evaluate panelist ratings based on the Bookmark method (Lewis, Mitzel, Mercado, & Schulz, 2012) using data from the Michigan Educational Assessment Program (Engelhard, 2011).
This study employs a rating scale formulation of the MFR model for evaluating standard setting judgments. Stated mathematically, the model used in this study is as follows:
$$\ln\left(\frac{P_{nijk}}{P_{nijk-1}}\right) = \theta_n - \delta_i - \omega_j - \tau_{jk}$$
where $P_{nijk}$ is the probability of panelist n giving a standard setting modified Angoff rating of k on item i in round j, and $P_{nijk-1}$ is the probability of panelist n giving a standard setting modified Angoff rating of k − 1 on item i in round j. Also, $\theta_n$ is the judged severity for panelist n, $\delta_i$ is the average judged item difficulty for item i, $\omega_j$ is the judged average performance level for round j, and $\tau_{jk}$ is the cut score, or threshold coefficient, from round j for standard setting ratings of k. The threshold coefficient, $\tau_{jk}$, describes the intersection where a panelist at a given severity level on the Rasch scale has an equal probability of selecting either of two adjacent categories on the standard setting rating task. The rating scale formulation of the MFR model assumes that the same rating scale structure is applied across the entire set of APES items.
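As an illustration of how the threshold coefficients operate, the following minimal sketch converts one set of parameters into rating category probabilities. It assumes the usual adjacent-category reading of a rating scale MFR model, in which the log odds of rating k rather than k − 1 equal the panelist severity minus the judged item difficulty, the round location, and the threshold. The function name and all numeric values are hypothetical and are not estimates from this study.

```python
import math

def category_probabilities(theta, delta, omega, taus):
    """Return P(rating = k) for k = 0..K under a rating scale MFR model.

    Assumes ln(P_k / P_{k-1}) = theta - delta - omega - tau_k, where
    taus = [tau_1, ..., tau_K] are the thresholds for one round.
    """
    # Cumulative sums of the adjacent-category logits give the
    # log-numerator of each category; category 0 is the reference.
    log_numerators = [0.0]
    for tau in taus:
        log_numerators.append(log_numerators[-1] + (theta - delta - omega - tau))
    # Shift by the maximum before exponentiating for numerical stability.
    shift = max(log_numerators)
    weights = [math.exp(value - shift) for value in log_numerators]
    total = sum(weights)
    return [weight / total for weight in weights]

# Hypothetical logit values, chosen only for illustration.
probs = category_probabilities(theta=0.0, delta=0.0, omega=0.0,
                               taus=[-1.5, -0.5, 0.5, 1.5])
```

When the severity minus the item and round locations equals a threshold, the two adjacent categories receive equal probability, which matches the interpretation of the threshold coefficient given above.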
A variety of rating quality indices can be used to evaluate the quality of ratings assigned by panelists in a standard setting context. Within the performance assessment literature, indices of rating quality typically focus on three broad categories: (a) rater agreement, (b) rater error and systematic bias, and (c) direct measures of rater accuracy (Murphy & Cleveland, 1991). Rating quality indices that are applied within a Rasch measurement theory framework typically focus on (a) panelist severity/leniency measures, (b) model–data fit, and (c) the creation of a visual display for comparing panelist judgments on the latent variable (Engelhard, 2009, 2011). In the following section, criteria from the MFR model that can be used to evaluate the quality of ratings are described as they apply within a standard setting context.
Based on invariant measurement, the MFR model used in this study models standard setting judgments as a combination of panelist severity ($\theta_n$), judged item difficulty ($\delta_i$), judged average performance level for round ($\omega_j$), and judged locations of individual recommended cut scores ($\tau_{jk}$). Through the use of this model, differences among panelist judgments are captured in estimates of panelist severity measures, and fluctuation in rating patterns is captured in infit and outfit mean square error (MSE) statistics.¹ Rather than emphasizing the percentage of exact panelist agreement across items, the Rasch-based approach focuses on rating quality indices related to significant differences in overall panelist severity calibrations (panelist severity/leniency), overall rating patterns (model–data fit), and the simultaneous placement of panelists, items, rounds, and cut score categories using a visual representation of a common logit scale (variable map). The variable map provides a useful display for presenting the recommendations of the standard setting panelists, along with differences in judgments across panelist subgroups (e.g., male and female panelists). Each of these three rating quality indices that are typical foci within a Rasch measurement theory framework (panelist severity measures, model–data fit, and the variable map) is described in more detail below.
Panelist severity measures
As they are presented in the performance assessment literature, rating errors and systematic biases are aberrant patterns of rating scale use that contribute to the assignment of scores different from those hypothesized as true reflections of student achievement. Within the context of standard setting, panelist severity can be directly modeled as a facet in the MFR model to explore differences among panelists in terms of degree of severity. These panelist severity measures provide a method for identifying panelists who consistently provide judgments that are either lower (more lenient) or higher (more severe) than is warranted by the panelists’ interpretations of the performance-level descriptors (PLDs). For example, a severe panelist is one who expects high achievement across items to be placed into a given performance-level category and will in turn judge the items to be more difficult, and a lenient panelist is one who expects low achievement across items and will in turn judge the items to be easier for students to answer correctly.
An examination of separation statistics for panelist severity measures (i.e., the severity/leniency of the panelists across all the items) on the logit scale that are computed using the MFR model informs whether or not there are statistically and practically significant differences among the panelist judgments across the set of items. Based on these severity measures, practically significant differences between panelists are identified using an index of the reliability of panelist separation, which ranges from 0 to 1. This separation statistic can be conceived similarly to Cronbach’s alpha because this value is the proportion of observed score variance that is due to true score variance. A large value indicates that the ordering of panelists in terms of severity across items is consistent and that one can differentiate between the panelists very well (Bond & Fox, 2007). In addition to the reliability of panelist separation index, statistically significant differences are identified using a chi-square statistic. A statistically significant value of the chi-square statistic suggests that differences among panelist severity measures are greater than would be expected due to chance alone. Engelhard (2011) describes how to interpret these values substantively within the context of standard setting:
The substantive interpretation of panelist differences as problematic depends on how the standard-setting process is viewed. If the goal of the standard-setting or governing board is to obtain agreement among the panelists in terms of their views of the performance standards, then panelist variability may be viewed as undesirable. An alternative way to view panelist variability is to recognize that panelists come from different settings with different experiences and that they may even have been deliberately selected to represent diverse viewpoints. From this perspective, the goal is to describe this variability and present the governing board with a description of the diverse views of the different panelists. (p. 912)
In other words, variability among ratings is viewed as undesirable in rater-mediated assessments, such as the scoring of essays on a high-stakes large-scale assessment. In standard setting, however, variability across panelist ratings might be appropriate, given that panels are often selected with the intention of representing diverse viewpoints.
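The reliability of panelist separation and the chi-square statistic described above can be sketched from a set of severity measures and their standard errors. The sketch below assumes the conventional definitions: reliability as the ratio of error-adjusted ("true") variance to observed variance, and a fixed-effects chi-square that weights each measure by its inverse error variance. The exact computations in the Facets program may differ in detail (e.g., in the variance denominator), and the measures and standard errors shown are hypothetical.

```python
def separation_reliability(measures, standard_errors):
    """Proportion of observed variance in measures not due to measurement error."""
    n = len(measures)
    mean = sum(measures) / n
    observed_var = sum((m - mean) ** 2 for m in measures) / (n - 1)
    if observed_var == 0:
        return 0.0  # no spread at all, so nothing to separate
    error_var = sum(se ** 2 for se in standard_errors) / n
    true_var = max(observed_var - error_var, 0.0)
    return true_var / observed_var

def fixed_chi_square(measures, standard_errors):
    """Chi-square test of the hypothesis that all measures are equal.

    Returns (chi_square, degrees_of_freedom).
    """
    weights = [1.0 / se ** 2 for se in standard_errors]
    sum_w = sum(weights)
    sum_wm = sum(w * m for w, m in zip(weights, measures))
    sum_wm2 = sum(w * m * m for w, m in zip(weights, measures))
    return sum_wm2 - sum_wm ** 2 / sum_w, len(measures) - 1

# Hypothetical severity measures (logits) and standard errors.
measures = [0.2, -0.4, -1.3]
ses = [0.1, 0.1, 0.1]
reliability = separation_reliability(measures, ses)
chi_square, df = fixed_chi_square(measures, ses)
```

A reliability near 1 together with a large chi-square relative to its degrees of freedom would suggest that panelists differ in severity by more than measurement error alone.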
Model–data fit
Fit statistics provide information about the amount of variability in the data compared with the variability that would be expected under perfect fit to the model (invariant measurement). These statistics provide information about the degree to which facets of the assessment situation diverge from the expectations of the Rasch model and about whether they are useful for measuring the intended construct (Engelhard, 2002). Findings of “misfit” suggest that the observed data are not summarized well by the model. High values indicate data patterns that are more haphazard or “noisy” than expected, and low values indicate more uniform or “muted” patterns than were expected. In the case of standard setting panelists, infit and outfit MSE statistics are used to identify rating patterns that are either noisy or muted and can help ensure that only those panelists who are productive for representing panelist judgment are included in the final cut score calculations.²
Although previous researchers have noted limitations of Rasch fit statistics (e.g., Karabatsos, 2000), infit and outfit MSE statistics are still useful within the context of rater-mediated assessments (Engelhard, 1994; Linacre, 2010). These fit statistics are not intended to evaluate whether or not the Rasch model fits better than another model; rather, they are used as a measure of internal consistency of the observed data (e.g., Are panelists “noisy” or “muted”? Are items judged inconsistently across raters or consistently across raters?), and these statistics are best used when supplementing the visual displays of residual analyses (described below). The outfit and infit MSE statistics provided in the output from Facets analyses index mean square residual differences between observed and expected patterns in rating data; outfit is unweighted, whereas infit is information weighted. The outfit MSE statistic for the rater facet is useful because it is particularly sensitive to “outliers,” or extreme unexpected patterns in rating. Infit MSE statistics are also useful for evaluating model–data fit but are less sensitive to outlying data. Thus, the outfit statistic is favored over the infit statistic in the current study. Outfit MSE is the unweighted mean square and is calculated by averaging squared standardized residuals over the appropriate observations; as given in Engelhard (1994), the outfit MSE for panelists is as follows:
where
where N is the number of panelists.
Infit MSE statistics are also useful for evaluating model–data fit but are less sensitive to outlying data because residuals are weighted by the variance of an individual facet, which reduces the impact of unexpected observations. It is also possible to calculate standardized versions of the infit and outfit MSE statistics based on a Z-score transformation. In a standard setting context, mean square and standardized residual fit MSE statistics can be used to identify panelists whose ratings tend to differ significantly from those expected based on the MFR model.
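The contrast between the two statistics can be made concrete with a minimal sketch. It uses the conventional Rasch definitions, in which outfit is the plain average of squared standardized residuals and infit divides the summed squared residuals by the summed model variances. These are generic textbook forms and are not necessarily the exact expressions given in Engelhard (1994); the function name and example values are hypothetical.

```python
def outfit_infit(observed, expected, variance):
    """Return (outfit MSE, infit MSE) for one panelist's ratings.

    observed: ratings the panelist assigned
    expected: model-expected ratings for the same observations
    variance: model variance of each observation
    """
    squared_residuals = [(x - e) ** 2 for x, e in zip(observed, expected)]
    # Outfit: unweighted mean of squared standardized residuals.
    outfit = sum(r / v for r, v in zip(squared_residuals, variance)) / len(observed)
    # Infit: information-weighted, so low-variance (extreme) observations
    # contribute less than they do to outfit.
    infit = sum(squared_residuals) / sum(variance)
    return outfit, infit

# Hypothetical data: the third rating is highly unexpected and has
# small model variance, so it inflates outfit more than infit.
outfit, infit = outfit_infit(observed=[3, 2, 4],
                             expected=[2.5, 2.5, 1.0],
                             variance=[1.0, 1.0, 0.2])
```

Both statistics have an expected value of 1.00 when the squared residuals match the model variances, which is why values well above or below 1.00 are read as noisy or muted rating patterns.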
Visual displays
One of the greatest advantages of using the MFR model to evaluate the quality of standard setting ratings is the synthesis of the results on a variable map. The Facets output (Linacre, 2010) provides a variable map that depicts all of the facets of interest that are specified in the MFR model on the underlying latent construct to facilitate visual comparison across the facets. This display provides a holistic approach to evaluate all of the facets that are related to panelist ratings.
Purpose of Study
Although the MFR model has been used to evaluate ratings from other standard setting procedures, including the modified Angoff (Engelhard, 2009) and modified Bookmark methods (Engelhard, 2011), it has not yet been employed on ratings from a modified Angoff standard setting procedure that used the MYN method (Plake et al., 2005). This study provides an illustration of how the MFR model can be used to evaluate ratings obtained from MYN standard setting methods. The purpose of this study is twofold. First, the MFR model is used to evaluate the quality of judgments on MC items provided by panelists who participated in a modified Angoff standard setting that used the MYN method for the MC items on the 2011 APES exam. Interpretations of the variable map, model–data fit, and severity/leniency measures are provided. Second, panelist characteristics (gender and level of teaching) are incorporated into the MFR model to determine whether or not these are explanatory variables that account for differences in panelist ratings (De Boeck & Wilson, 2004).
Method
Participants
Fifteen subject matter experts in environmental science participated in the 2011 panel-based standard setting for APES. Following a brief introduction at the beginning of the standard setting meeting, the subject matter experts were asked to complete a form that gathered biographical data for use in summarizing the representativeness of the panelists. Nine of the panelists were from higher education institutions across the country and 6 were APES high school teachers. Table 1 provides a summary of demographic information for these 15 panelists.
Table 1. Biographical Data of Standard Setting Panelists.
Note. AP = Advanced Placement; APES = Advanced Placement Environmental Science.
Measures and Data Sources
Advanced Placement program
The Advanced Placement (AP) program is composed of 34 courses and corresponding examinations in 22 subject areas. The AP program provides high school students with the opportunity to participate in college-level courses with the intention of preparing students for success in their college courses. For each examination, a weighted composite score based on the combined MC and CR items is converted to a score on a 5-point scale (1 = low, 5 = high).
The APES exam used for this study was first offered in 1998. In 2011, more than 98,000 students took the examination. A total of 95,642 students took the main form of the APES exam. When it was introduced in 1998, the initial cut scores for the APES exam were set based on results from a college comparability study. The AP program periodically conducts college comparability studies in all AP subjects. These studies compare the performance of AP students with that of college students in the courses for which successful AP students will receive credit.³ The intention of college comparability studies is to inform where the cut scores should be placed along the continuous score scale to classify students into AP performance categories of 1, 2, 3, 4, or 5. The AP program is currently moving toward the use of judgmental, or panel-based, standard setting procedures to set cut scores rather than continuing to conduct college comparability studies.
Instrument
Data used in this study come from the 2011 administration of the APES exam and the standard setting for this examination. The APES exam comprises 100 MC items and four CR items. The 2011 APES exam had 1 MC item that was removed from operational scoring. Therefore, only 99 MC items were examined during standard setting. The APES exam is scored on a raw scale from 0 to 150, with 90 points from the MC items and 60 points from the CR items. The final reported score for a student is a categorical score ranging from 1 to 5, with 5 being the highest score. The college course grade correlates of each AP score are as follows:
AP Score 5: Performance equivalent to grades of A and A+ in the corresponding college course
AP Score 4: Performance equivalent to grades of A−, B+, and B in the corresponding college course
AP Score 3: Performance equivalent to grades of B−, C+, and C in the corresponding college course
AP Score 2: Performance equivalent to grades of C−, D+, and D in the corresponding college course
AP Score 1: No recommendation is made about the student
APES performance-level descriptors
PLDs describe the knowledge, skills, and abilities that are required for a student to be placed into each score category—in this case, the five AP score categories. Creating the PLDs helps establish a common understanding across standard setting panelists for the meaning of each score category in terms of student knowledge, skills, and abilities. Prior to the standard setting meeting, the APES PLDs were developed in regard to the course grade correlates mentioned above, but they also elaborate on these generic categorizations with descriptions of the actual knowledge, skills, and abilities that examinees at each level should be able to demonstrate. During the standard setting meeting, the 15 panelists reviewed the PLDs and made small revisions to improve the clarity of the five PLDs.
After initial review of the PLDs and prior to the standard setting rating process, the concept of the Borderline Examinee was introduced, and this concept was discussed among the standard setting panelists within the context of APES. Borderline Examinees are students whose knowledge, skills, and abilities represent the minimal level of competence required for placement in each category. Panelists were asked to reexamine the PLDs from the perspective of identifying the knowledge, skills, and abilities that would be expected of the borderline examinee in APES, rather than across the full range of these characteristics within each performance level. Although minor changes were made to the PLDs during this discussion, the panelists felt that the PLDs generally represented the Borderline Examinee and did not need to be significantly changed.
Panelist ratings
Data used in this study are the ratings that resulted from two rounds of item-level judgments provided by the 15 APES panelists. Only MC item ratings were examined in this current study. The actual rating task that the panelists were given is described below.
Procedures
After finalizing the PLDs, the panelists took the 2011 APES exam in a 2-hour time period (1 hour less than allotted for students operationally), with the purpose of familiarizing themselves with the examination and the tasks that are asked of students. Then, the standard setting task was introduced to the panelists. The modified Angoff procedure conducted for this standard setting was the MYN method proposed by Impara and Plake (1997). This method is considered to be less taxing on the panelists than the traditional modified Angoff method. A specific variation of the MYN was used (Impara & Plake, 1997; Plake et al., 2005), which requires panelists to consider the borderline examinee at each cut score and to identify at which level the borderline examinee would be able to answer each item correctly. To complete the rating task, panelists considered each item and decided whether or not the borderline examinee in each category would be able to identify the correct answer. Specifically, panelists considered each MC item separately and completed the following thought process in relation to each item: Would a borderline-1/2 student be able to answer this item correctly? If yes, then the panelist would circle the 1/2 cut score on the rating form and move on to the next item. If no, then the panelists would consider the next question about the same item: Would a borderline-2/3 student be able to answer this item correctly? If yes, the 2/3 cut score would be circled for that item and the panelist would move on to the next item. If no, the panelist would consider the next question about the same item: Would a borderline-3/4 student be able to answer this item correctly? If yes, the 3/4 cut score would be circled for that item and the panelist would move on to the next item. If no, then the panelists would consider the next question about the same item: Would a borderline-4/5 student be able to answer this item correctly? 
If yes, the 4/5 cut score would be circled for that item and the panelist would move on to the next item. If no, then the panelist would consider the final question about the same item: Would the above borderline-5 student be able to answer this item correctly? If yes (which is likely given that all other possible borderline students have been considered), then the Above 5 score would be circled for that item. If no, the panelist would reconsider all of the questions until the panelist could make the best judgment about which borderline student would answer this item correctly (see the appendix for the rating form that panelists completed).⁴
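The sequence of questions that a panelist works through for each item can be sketched as a simple cascade. This is only an illustration of the thought process described above; the function name and the dictionary representation of a panelist's yes/no judgments are hypothetical and are not part of the operational rating form.

```python
# Rating categories in the order in which a panelist considers them.
CUT_LABELS = ["1/2", "2/3", "3/4", "4/5", "Above 5"]

def myn_rating(can_answer):
    """Return the first category whose examinee is judged able to answer.

    can_answer maps each label in CUT_LABELS to the panelist's yes/no
    judgment (True/False) for the corresponding borderline examinee.
    Returns None when every answer is "no", which corresponds to the
    panelist reconsidering all of the questions for the item.
    """
    for label in CUT_LABELS:
        if can_answer[label]:
            return label  # circle this cut score and move to the next item
    return None

# Hypothetical judgments for a single item.
rating = myn_rating({"1/2": False, "2/3": False, "3/4": True,
                     "4/5": True, "Above 5": True})
```

Because the panelist stops at the first "yes," each item receives exactly one circled category per round.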
Following the first round of ratings, each panelist was provided a feedback form with the number of panelists who selected each cut score for each item, along with the observed item difficulty information from the operational administration. Using this information, panelists worked in small groups to select items for discussion based on discrepant ratings, as determined by the extent to which ratings for an item differed among members of the small group and the extent to which ratings differed from the average rating for the total panel. Panelists were instructed that items with the most discrepant ratings should be chosen for discussion along with any other items that panelists would like to discuss. Following the small group discussions, panelists came together for a large group discussion to allow the small groups to hear the rationales of the other groups about items where the discrepancies were largest. Panelists were given the opportunity to suggest for the large group discussion any items that they wanted to discuss or for which they found it particularly difficult to make a rating. In addition, the facilitator chose several items with discrepant ratings for discussion. When sufficient discussion had occurred, impact data were provided to the large group. Impact data consisted of the percentage of examinees expected to fall within each performance level based on the Round 1 cut scores when applied to the 2011 APES composite score distribution. Then, the panelists completed a second round of ratings, followed by a second round of presentation of the impact data and large group discussion.
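The impact data described above amount to a tabulation of composite scores against a set of cut scores. The following minimal sketch assumes that each cut score is the minimum composite score required for the next AP score category; the text does not state this convention or how the panelists' ratings were converted to composite cut scores, and the scores and cut values shown are hypothetical.

```python
from bisect import bisect_right

def impact_data(composite_scores, cut_scores):
    """Percentage of examinees in each performance level.

    cut_scores: ascending minimum composite scores for AP scores 2, 3, 4, 5
    (assumed convention). Returns percentages for AP scores 1 through 5.
    """
    counts = [0] * (len(cut_scores) + 1)
    for score in composite_scores:
        # Number of cut scores at or below this score = level index.
        counts[bisect_right(cut_scores, score)] += 1
    total = len(composite_scores)
    return [100.0 * count / total for count in counts]

# Hypothetical composite scores and cut scores on the 0 to 150 scale.
percentages = impact_data([10, 40, 59, 60, 85, 120, 100, 99],
                          [40, 60, 80, 100])
```

Recomputing this tabulation after each round lets a panel see how a change in the recommended cut scores would shift the distribution of examinees across the five AP score categories.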
Data Analyses
All analyses were conducted using the Facets computer program (Linacre, 2010). The specified facets for the current study were (a) panelist severity/leniency, (b) judged item difficulty, (c) rounds of ratings (i.e., two rounds for the current study), and (d) cut scores on the rating task conducted by the panelists. For this study, a rating scale formulation of the MFR model was applied to the ratings, and various fit indices and displays were examined to address the research purposes stated above. Specifically, the locations of the levels of each facet on the variable map were examined, along with outfit and infit MSE statistics and reliability indices, to address the first purpose of this study. The same analyses were conducted with panelist gender and level of course taught (high school or college) to address this study’s second purpose. That is, the only difference between the analyses conducted to address the first and second research purposes was that two additional facets—gender and level of course taught—were incorporated into the analyses for the second part of the study.
Results
Research Purpose 1
In this section, results are described as they relate to the first research purpose of this twofold study, which was to illustrate how the MFR model was used to evaluate the quality of judgments on MC items provided by panelists who participated in a modified Angoff standard setting that used the MYN method.
The variable map shown in Figure 1 presents a graphical display of the spread of panelist severity measures, judged item difficulties, overall panelist severity across rounds, and judged locations of cut scores, all on the same logit scale. The logit scale represents the underlying latent construct of judged APES performance. The first column shows the logit scale, which ranged from −3 to +3 logits and is used to show the location of each of the elements within the facets in the MFR model. The second column presents the panelist severity measures on the logit scale. Severe panelists appear at the top of the column, and less severe panelists appear at the bottom. The third column shows the average location of the judged item difficulty for each item on the logit scale. Items that have higher judged difficulty are located near the top of the column, and easier items are located closer to the bottom. The fourth column shows the average location of panelist severity across the two rounds of ratings (R1 and R2 represent Round 1 and Round 2, respectively). This allows for comparison of panelist severity across rounds and provides an indication of whether or not panelists became more severe or more lenient across rounds. Finally, the last two columns display the location of cut scores for the panelist ratings on the logit scale. Horizontal lines represent the approximate location of cut scores that distinguish between the categories that were used by the panelists when conducting their rating tasks (i.e., Borderline 1/2, Borderline 2/3, Borderline 3/4, Borderline 4/5, Above 5). As can be seen in the figure, the location of these cut scores appears comparable across both rounds.

Figure 1. Variable map.
Summary statistics
To provide a frame of reference for interpreting differences among panelist severity measures, the explanatory facets (items and rounds) are centered at zero on the logit scale, and only the panelist facet is allowed to vary. The overall differences among panelists, items, and rounds are statistically significant (p < .05), with high reliabilities of separation (RelPanelist = .95, RelItem = .95, RelRound = .95). The reliability of separation statistics for panelists’ severity location measures from Facets is comparable with Cronbach’s coefficient alpha (e.g., the replicability of panelist severity location). For other facets, the reliability of separation statistics describes the spread of differences between elements within a facet. The significant separation statistics observed here indicate a spread of the elements within each of the facets across the latent variable (APES) beyond measurement error. Good fit to the model is evident for each of these facets (e.g., panelist MSE fit statistics were less than 1.50 and greater than 0.60; more details about model fit are described below; Engelhard, 2009). Acceptable model–data fit suggests that the MFR model is functioning as intended for these data.
Panelist severity measures facet
Table 2 presents the calibration of each of the panelists who participated in the 2011 standard setting for the APES examination. As evident from the variable map (Figure 1) and the summary statistics, the panelists vary from one another in severity. Panelists 3 and 6 were the most severe panelists (0.17 logits) and Panelist 8 was the most lenient panelist (−1.38 logits). An examination of the infit and outfit MSE statistics in Table 2 reveals that none of the panelists were extremely overfitting—indicated by fit statistics lower than about .60, which would suggest that the panelist was rating all items similarly, or extremely underfitting—indicated by fit statistics higher than about 1.50, which would suggest that a panelist was rating all items in different relative orders from the other panelists (Engelhard, 2009). When compared with the group of panelists, Panelist 8 demonstrates the most “muted,” or invariable, rating pattern, indicated by low infit and outfit MSE patterns (infit MSE = 0.65, outfit MSE = 0.67). As demonstrated in Engelhard (2009), low values of fit statistics in the context of standard setting may suggest that this panelist is displaying a halo effect, or assigning dependent ratings across items. This finding could also indicate that the panelist is not using the full range of cut score categories, or that the panelist is not able to distinguish between the five levels of performance on the APES exam. On the other hand, the highest values of infit and outfit MSE statistics are observed for Panelist 6 (infit MSE = 1.43, outfit MSE = 1.42). High values of fit statistics for this panelist suggest that the ratings assigned by Panelist 6 were more inconsistent than expected by the model in terms of judged item difficulty.
Table 2. Calibration of Panelists.
Note. SEM = standard error of the mean; MSE = mean square error. Panelists ordered by severity measure (logits).
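The screening rule described above can be expressed as a small helper. This is a minimal sketch: the function name and labels are illustrative, and the cutoffs of 0.60 and 1.50 are the approximate guidelines cited in the text rather than fixed critical values.

```python
def classify_fit(mse, lower=0.60, upper=1.50):
    """Label a mean square error (MSE) fit statistic using the approximate guidelines."""
    if mse < lower:
        return "overfit"    # muted ratings, possible halo effect
    if mse > upper:
        return "underfit"   # more inconsistency than the model expects
    return "acceptable"

# Infit and outfit MSE values reported in the text for the two most extreme panelists
panelist_fit = {
    "Panelist 6": (1.43, 1.42),
    "Panelist 8": (0.65, 0.67),
}
labels = {
    name: (classify_fit(infit), classify_fit(outfit))
    for name, (infit, outfit) in panelist_fit.items()
}
print(labels)
```

Under these cutoffs both panelists fall in the acceptable range, which matches the conclusion that no panelist was extremely overfitting or underfitting.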
Residual plots for panelists
Residual analyses within the MFR model framework are useful for visually displaying the consistency in judgments for a particular panelist across a set of items and between rounds of rating; that is, these plots provide a method for examining intrarater consistency. Residuals are the difference between observed ratings for each item and the expected MFR model–based rating for a given panelist. Figure 2 displays exemplar residual plots for the judgments of three panelists on the 99 APES MC items whose outfit MSE statistics indicate high, expected, and low values of inconsistency during rating. Standardized residuals are plotted for each of the APES MC items for the three panelists, and residual values of +2 and −2 indicate statistically significant differences in judged item difficulty for an individual panelist when compared with the overall judged difficulty of the item on the logit scale. Panelist 6 demonstrated the most inconsistency during rating (outfit MSE = 1.42), Panelist 8 showed the least inconsistency (outfit MSE = 0.67), and outfit MSE statistics for Panelist 11 most closely approximate the expected value of 1.00 (outfit MSE = 1.00). Residuals for these three panelists are shown for both rounds of rating. For example, for Item 10 during Round 2, Panelist 6 assigned a rating of “Above 5,” as compared with eight panelists who selected “3/4 (Borderline 4),” five panelists who selected “2/3 (Borderline 3),” and one panelist who selected “1/2 (Borderline 2).” Panelist 6 did not change much across rounds. Panelist 11 was most in line with the expected MFR values and did change ratings across rounds. Panelist 8 had values similar to those of the other panelists. For example, for Item 10 (described above), Panelist 8 selected “2/3 (Borderline 3),” which was similar to the judgments of other panelists.
Also, a halo effect tendency of Panelist 8 is evident when examining his or her item ratings, as the categories of “3/4 (Borderline 4)” and “Above 5” were never used. Panelist 8 also changed somewhat across rounds, particularly for the items near the beginning of the examination. Again, none of the panelists were indicated as extremely overfitting or underfitting by the MSE statistics; Panelists 6 and 8 simply showed the most underfit and the most overfit, respectively, in the observed data and were therefore selected for the illustrations provided above.

Figure 2. Residual plots for three panelists.
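The quantities plotted in such residual displays can be sketched as follows, using the standard Rasch definitions: a standardized residual is the observed rating minus the model-expected rating, divided by the square root of the model variance; outfit MSE is the mean of the squared standardized residuals; and infit MSE is the information-weighted counterpart. All numbers below are hypothetical and are not taken from the APES data.

```python
import math

def standardized_residuals(observed, expected, variances):
    """z = (observed - expected) / sqrt(model variance) for each rating."""
    return [(x - e) / math.sqrt(w) for x, e, w in zip(observed, expected, variances)]

def outfit_mse(observed, expected, variances):
    """Unweighted mean of squared standardized residuals."""
    z = standardized_residuals(observed, expected, variances)
    return sum(v * v for v in z) / len(z)

def infit_mse(observed, expected, variances):
    """Information-weighted mean square: squared residuals over summed variances."""
    squared = sum((x - e) ** 2 for x, e in zip(observed, expected))
    return squared / sum(variances)

# Hypothetical ratings from one panelist on four items
observed = [3, 2, 5, 3]
expected = [2.8, 2.4, 3.0, 3.1]    # model-expected ratings (hypothetical)
variances = [0.8, 0.7, 0.9, 0.8]   # model variances (hypothetical)

z = standardized_residuals(observed, expected, variances)
flagged = [i for i, v in enumerate(z) if abs(v) > 2]  # items outside the +/-2 band
infit = infit_mse(observed, expected, variances)
outfit = outfit_mse(observed, expected, variances)
print(flagged, round(infit, 2), round(outfit, 2))
```

In this toy example only the third item falls outside the ±2 band, in the way that Item 10 stands out for Panelist 6 in the description above.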
Judged item difficulty facet
Calibrations for the 99 APES MC items are displayed in the third column of the variable map shown in Figure 1. As evident in the visual display and the significant reliability of separation statistic for the item facet (RelItem = .95), the items on the APES vary significantly in terms of judged difficulty. Specifically, Item 65 was judged to be the most difficult item by the panelists (2.14 logits) and Item 1 was judged to be the easiest item by the panelists (−3.34 logits). The judged item difficulties are centered with a mean of 0.00, to provide a frame of reference for interpreting panelist severity measures. Specifically, panelists with a severity location greater than zero tend to be more severe on the average item (e.g., would be likely to select Borderline 4/5 and Above 5 more frequently), and panelists with a severity location less than zero tend to be more lenient on the average item (e.g., would be likely to select Borderline 1/2 and Borderline 2/3 more frequently). As with the panelist severity measures facet, values of outfit MSE can be used to identify individual items that have large residuals across panelists and across rounds of rating. Similar to the residual analyses described for the panelist facet, residual plots for judged item difficulties could also be plotted and examined to provide a visual display of the difference between model expectations and empirical observations.
Figure 3 shows a scatter plot of the judged item difficulties from the MFR model and the observed item difficulties from the actual APES 2011 cohort (i.e., the p values, or proportion of students getting the item correct). As expected, there is a positive relationship between judged item difficulty from the APES standard setting panel and p values (r = .54; 28.8% of variance explained). This indicates that the panelists’ judgments of item difficulty are related to the actual item difficulties based on student performance, adding validity evidence to the panelists’ judged item difficulties.

Figure 3. Scatterplot of judged item difficulties and observed p values.
Round facet
The round facet is displayed in the variable map just to the right of the items facet (see Figure 1), and the summary statistics are shown in Table 3. As indicated by a higher measure for Round 1 (0.16 logits) than for Round 2 (−0.16 logits), panelist severity decreased during the second round of ratings. This is not surprising, given that after Round 1, several panelists indicated in large group discussion their concern that the cut score was inappropriately high for the 4/5 cut. It is also interesting to note that the infit and outfit MSE statistics for the two rounds suggest that there was less consistency in the ratings assigned during Round 1 (infit MSE = 1.07, outfit MSE = 1.06) than in those assigned during Round 2 (infit MSE = 0.91, outfit MSE = 0.91).
Table 3. Calibration of Rounds.
Note. SE = standard error; MSE = mean square error.
Calibration of cut scores
Table 4 summarizes the calibration of cut scores that represent the five categories on the standard setting judgment rating scale. As can be seen in Table 4, the number of panelists who selected the “3/4,” “4/5,” and “Above Borderline-5” categories increased during Round 2, and the number of panelists who selected “1/2” and “2/3” decreased—suggesting that panelists became more lenient after the group discussion that followed Round 1 ratings. As expected, the outfit MSE statistics for each category reveal that the amount of consistency among panelists increased between rounds.
Table 4. Calibration of Cut Scores for Both Rounds (Rasch Threshold).
Note. MSE = mean square error; SE = standard error.
The locations of cut scores, which are referred to as thresholds in Facets output, between categories on the 5-point standard setting judgment rating scale are displayed visually for Round 1 and Round 2 in the variable map shown in Figure 1, and calibrations for these thresholds are provided in Table 4. In the context of MYN standard setting rating tasks, thresholds indicate the location on the logit scale at which the probability for item difficulty to transition between cut score categories (e.g., from Borderline 2/3 to Borderline 3/4) is approximately 50%. The interpretation of the thresholds is important to describe because of the MYN methodology that was employed, as well as the fact that our purpose of applying the MFR model to the standard setting ratings is to evaluate and describe the standard setting ratings—not to produce Rasch-based cut scores. These thresholds, which represent the cut score facet, are not interpreted as the cut scores between two AP performance scores (e.g., AP 3 and AP 4). Rather, these thresholds describe the intersection where a panelist at a given severity level on the Rasch scale has an equal probability of selecting either of the adjacent categories on the standard setting rating task. Figure 4 displays probability category curves for the five categories used by panelists during the APES standard setting procedure. Each intersection of category probability curves in Panel A of Figure 4 corresponds to a threshold coefficient in Table 4. Panel B of Figure 4 shows the expected average rating for panelists along the Rasch logit scale, with the empirical distribution plotted around it. The bands surrounding the empirical and model-based curves in Panel B indicate close fit to the model for these standard setting judgments.

Figure 4. Category and item functions.
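The threshold interpretation given above can be illustrated with a minimal sketch of rating scale category probabilities, in which the log odds of adjacent categories equal a combined location minus the threshold. The combined location `eta` stands in for the sum of the facet terms (panelist, item, and round); its sign conventions and the threshold values below are assumptions for illustration, not estimates from Table 4.

```python
import math

def category_probabilities(eta, thresholds):
    """Category probabilities when ln(P_k / P_(k-1)) = eta - tau_k."""
    # Cumulative sums of (eta - tau_k); the lowest category is the reference (0.0)
    logits = [0.0]
    for tau in thresholds:
        logits.append(logits[-1] + (eta - tau))
    peak = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(v - peak) for v in logits]
    total = sum(weights)
    return [w / total for w in weights]

def expected_rating(eta, thresholds):
    """Model-expected rating on a 0..K scale (the kind of curve shown in Panel B)."""
    probs = category_probabilities(eta, thresholds)
    return sum(k * p for k, p in enumerate(probs))

# Hypothetical thresholds separating five ordered categories
thresholds = [-2.0, -0.5, 0.5, 2.0]

# At a location equal to the second threshold, the two adjacent categories are equally likely
probs = category_probabilities(thresholds[1], thresholds)
print([round(p, 3) for p in probs])
print(round(expected_rating(0.0, thresholds), 2))
```

At `eta` equal to a threshold, the two adjacent categories have identical probabilities, which corresponds to an intersection of neighboring category probability curves in Panel A.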
Research Purpose 2
The second purpose of this study was to employ the MFR model, with the potential explanatory variables of gender and level of course taught added to the model. In this section, results are described related to this second research purpose regarding the relationship between panelist characteristics and their conceptualization of the underlying construct (APES).
The variable map that depicts the calibrations of the items, panelists, rounds, cut scores, gender, and level of course taught on the underlying latent construct of APES performance is shown in Figure 5. An examination of the variable map suggests that the two explanatory variables do not add much additional value to the MFR model; in other words, the close locations of these panelist subgroups suggest that there may not be a substantial difference in the severity of judgments between men and women or between high school teachers and college teachers in this sample. However, a closer examination of the panelist subgroup calibrations suggests otherwise.

Figure 5. Variable map with explanatory variables.
Table 5 provides calibrations for the average severity of panelist judgments within gender and teaching-level subgroups. The logit values for male and female panelists are 0.05 and −0.05, respectively—suggesting that the male panelists in this sample are slightly more severe in their ratings than female panelists; this difference is statistically significant, χ2(1) = 4.5, p < .05. In addition, infit and outfit MSE statistics are larger for females than males, which suggests that female panelists are more inconsistent in their ratings than males (i.e., outfit MSE = 1.13 for females, outfit MSE = 0.88 for males). However, the difference between the two groups was less than 0.50 logits, implying that the difference is not practically significant despite being statistically significant (Draba, 1977; Wright, Mead, & Draba, 1976).
Table 5. Calibration of Panelist Subgroups.
Note. SE = standard error; MSE = mean square error.
As shown in Table 5, differences in panelist severity are also evident between teaching-level subgroups. Specifically, the logit values for college teachers and high school teachers (0.05 and −0.05, respectively) indicate that the college teachers in this sample are slightly more severe in their ratings than high school teachers; this difference is statistically significant, χ2(1) = 4.0, p < .05. The infit and outfit MSE statistics are larger for high school teachers than for college teachers, which also suggests that high school teachers are more inconsistent in their ratings than college teachers (i.e., outfit MSE = 1.11 for high school teachers, outfit MSE = 0.90 for college teachers). The difference between teaching-level subgroups was less than 0.50 and therefore not practically significant (Draba, 1977; Wright et al., 1976).
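The distinction between statistical and practical significance drawn above can be sketched as follows. The chi-square shown is one common fixed-effects homogeneity statistic for a set of logit measures with standard errors; it is assumed here for illustration and is not confirmed to be the exact computation behind the reported values. The subgroup measures (0.05 and −0.05) come from the text, whereas the standard errors are hypothetical.

```python
def fixed_chi_square(measures, standard_errors):
    """Homogeneity chi-square: are all measures equal, apart from measurement error?"""
    weights = [1.0 / se ** 2 for se in standard_errors]  # information weights
    sum_w = sum(weights)
    sum_wd = sum(w * d for w, d in zip(weights, measures))
    sum_wd2 = sum(w * d * d for w, d in zip(weights, measures))
    chi_square = sum_wd2 - sum_wd ** 2 / sum_w
    return chi_square, len(measures) - 1  # statistic and degrees of freedom

measures = [0.05, -0.05]         # subgroup severity measures reported in the text
standard_errors = [0.03, 0.03]   # hypothetical standard errors

chi_square, df = fixed_chi_square(measures, standard_errors)
CRITICAL_DF1_05 = 3.841          # chi-square critical value for df = 1, alpha = .05
statistically_significant = chi_square > CRITICAL_DF1_05

difference = abs(measures[0] - measures[1])
practically_significant = difference >= 0.50  # the 0.50 logit guideline cited in the text
print(round(chi_square, 2), df, statistically_significant, practically_significant)
```

With small standard errors, a 0.10 logit difference can exceed the critical value while remaining well below the 0.50 logit guideline, which is the pattern reported for both subgroup comparisons.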
Discussion
Conclusions
An examination of the panelist severity measures, judged item difficulties, rounds, and cut scores using the MFR model provided evidence of acceptable quality of the ratings obtained from the first operational AP judgmental standard setting procedure. Thus, the current study provides an important contribution of validity evidence for the APES standard setting. Overall, the panelists showed a high degree of spread in severity, as indicated by a statistically significant reliability of separation statistic. This finding suggests significant differences among individual panelists in terms of their severity of item judgments during the standard setting. Moreover, no panelists exhibited a concerning level of a halo effect (which would have been seen if any panelists had an outfit MSE less than 0.60), or a concerning level of inconsistency relative to the other panelists (which would have been seen if any panelists had an outfit MSE greater than 1.50). As may be expected, the judged item difficulties on the APES represented a range of difficulty. The moderate positive correlation between the panelists’ judged item difficulties and the observed item difficulties based on student performance on the 2011 APES exam indicated that the panelist judgments were in line with the actual student performance data. The panelists recommended that cut scores be decreased across rounds, which was expected given the nature of the discussion that took place among the panelists between Round 1 and Round 2.
The second purpose of the study was to examine the impact of including two explanatory variables (gender and level of course taught) in the MFR model to determine whether these panelist characteristics explained patterns in their judgments. Although differences in severity between these panelist subgroups were statistically significant, neither of these explanatory variables demonstrated a practically significant effect (i.e., at least a 0.50 logit difference; Draba, 1977; Wright et al., 1976). However, the patterns of ratings were interesting to note for each of the panelists’ characteristics; specifically, men had higher recommended cut scores (i.e., higher severity in judgments) than women, and college teachers had higher recommended cut scores than high school teachers. The latter is particularly of interest. One possibility is that the college teachers hold a higher standard of what AP students should know and be able to do, whereas the high school teachers are more in touch with what the AP students would know and are able to do. In other words, the high school teachers may be more likely to give students the benefit of the doubt than the college teachers, because they have their own current AP students in mind.
In summary, the results of this study were favorable in regard to the quality of the ratings that resulted from the first operational AP standard setting. This study provides an example of the practical value of using the MFR model to evaluate panelist ratings from standard settings. Any testing program that establishes and uses cut scores to classify students into performance categories could benefit from following the methods that were conducted in this study.
Future Research
The current study opens the door to several lines of research that could be valuable contributions to the field of standard setting procedures. First, additional explanatory variables could be examined, such as region of the country where the panelist teaches and amount of teaching experience. Second, additional statistical models for evaluating the quality of panelist ratings, such as the use of generalizability theory, could be implemented using the same data to compare the results for methodological effects. Third, the MFR model can be applied to evaluate judgmental data from other AP standard settings that have employed different standard setting methodologies, such as other modified Angoff procedures, or Bookmark procedures. Fourth, as the AP program continues to conduct operational standard settings, the implementation of an MFR model using judgmental data can evaluate the usefulness of this approach across other AP exams. Fifth, a study evaluating the overall contribution of each facet could be conducted, leveraging log likelihood ratio fit statistics (e.g., Akaike information criterion, Bayesian information criterion, consistent Akaike information criterion) that were not the focus of this study. Finally, the current work should be extended to include the standard setting judgments from the CR questions.
Conclusions and Implications
Standard setting procedures produce subjective judgments that inform the cut scores for many large-scale assessments (e.g., AP, National Assessment of Educational Progress, Michigan Educational Assessment Program; Engelhard, 2011; Peterson, Schulz, & Engelhard, 2011). Given that these subjective judgments directly affect the decisions about students from high-stakes assessments, it is imperative to evaluate the quality of these judgments. The application of the MFR model demonstrated in this article provides testing programs and educational researchers with an example of how to evaluate the quality of judgments from a standard setting so that they are equipped to use the results from standard setting procedures in a psychometrically appropriate manner. For example, these results can be used to provide descriptive and statistical information about the standard setting results, in particular the quality of the ratings obtained from panelists in a standard setting. Also, these results can be used to identify panelists whose ratings are aberrant and who should possibly be omitted from final cut score computations. Finally, a variable map presents a holistic illustration of how all facets are related to one another and can be used to depict the important factors that are related to panelist ratings. All of the above contribute to gathering internal validity evidence for standard setting results because intrapanelist and interpanelist consistency are examined with infit and outfit MSE statistics, residual analyses, and the variable map. What is more, using the MFR model to evaluate judgmental data from standard settings contributes to gathering procedural validity evidence because the variable map holistically depicts the procedure overall and can serve as documentation of the panelists’ judgments.
In summary, implementation of the MFR model to evaluate the quality of ratings from a standard setting is an ideal method for gathering validity evidence and technical documentation for standard setting procedures and is an appropriate tool for contributing to the final step of a psychometrically defensible standard setting meeting, which is to compile technical documentation and validity evidence (e.g., Hambleton et al., 2012).
Authors’ Note
An earlier version of this article was presented at the annual meeting of the American Educational Research Association in Vancouver, British Columbia, Canada (April 2012). Researchers are encouraged to freely express their professional judgment. Therefore, points of view or opinions stated in College Board–supported research do not necessarily represent official College Board position or policy.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article. Support for this research was provided by the College Board.
