Abstract
Teacher evaluation systems often include classroom observations in which raters use rating scales to evaluate teachers’ effectiveness. Recently, researchers have promoted the use of multifaceted approaches to investigating reliability using Generalizability theory, instead of rater reliability statistics. Generalizability theory allows analysts to quantify the contribution of multiple sources of variance (e.g., raters and tasks) to measurement error. We used data from a teacher evaluation system to illustrate another multifaceted approach that provides additional indicators of the quality of observational systems. We show how analysts can use Many-Facet Rasch models to identify and control for differences in rater severity, identify idiosyncratic ratings associated with various facets, and evaluate rating scale functioning. We discuss implications for research and practice in teacher evaluation.
Teacher evaluation systems are intended to provide educational leaders and policymakers with insight into teachers’ teaching effectiveness in order to improve educational experiences and outcomes for students. Often, these evaluation systems include teacher observations (i.e., classroom observations) in which principals or other raters observe a sample of teachers’ classroom practices and judge their effectiveness in several areas (Casabianca et al., 2013). Because these assessments often inform a variety of high-stakes decisions related to teacher retention or promotions (Doherty & Jacobs, 2013), many researchers and practitioners have expressed concerns with the quality of rater judgment and its susceptibility to errors and systematic biases (Cohen & Goldhaber, 2016; Hill, Charalambous, & Kraft, 2012; Ho & Kane, 2013; Kane & Staiger, 2012). As a result, numerous researchers have advocated for empirical analyses of the reliability, or consistency, with which principals and other types of raters evaluate teacher effectiveness (Bell et al., 2012; Casabianca, Lockwood, & McCaffrey, 2015; Cohen & Goldhaber, 2016; Ho & Kane, 2013; Kane & Staiger, 2012).
Recently, Hill et al. (2012) discussed the issue of rater reliability in teacher observation systems. Using data from an observation-based evaluation of mathematics teaching, these researchers pointed out a variety of challenges in teacher observation systems that potentially compromise their psychometric quality. These researchers noted that differences in rater judgment related to evaluation instruments, rater training and qualification procedures, and scoring designs can impact the reliability of teaching evaluations. Furthermore, these researchers argued that rater reliability analyses, in which a single rater reliability coefficient (e.g., a kappa statistic) is used to summarize the quality of rater judgments, are not sufficient for informing the interpretation of raters’ ratings or improving the quality of teacher evaluation systems. Hill et al. (2012) concluded that analyses that take into account the multifaceted nature of observational systems and provide empirical evidence to inform the design and revision of these systems, such as Generalizability theory (Brennan, 2001; Cronbach, Gleser, Nanda, & Rajaratnam, 1972; Shavelson & Webb, 1991), are more useful analytic tools for evaluating the psychometric quality of these assessments than reliability coefficients. Briefly, Generalizability analyses allow researchers and practitioners to identify sources of construct-irrelevant variance that negatively impact the quality of observational systems, such as differences among individual raters or observation occasions, and identify ideal data collection designs to maximize the reliability of rater judgments.
We agree with Hill et al. (2012) that rater reliability coefficients are not sufficient as indicators of the psychometric quality of rater judgments in teacher evaluation contexts. We also agree that analytic approaches such as Generalizability theory that take into account the multifaceted nature of rater-mediated teacher evaluations are essential for the development and improvement of effective observational systems. In this manuscript, we argue that there are additional aspects of teacher observation systems beyond what analysts can detect using Generalizability analyses for which empirical evidence is needed to improve the quality of these observational systems. Specifically, we use a real-data illustration to present a case for the use of latent trait models in general, and Many-Facet Rasch (MFR) models (Eckes, 2015; Linacre, 1989) in particular, as a complementary approach to Generalizability theory.
MFR models are extensions of dichotomous or polytomous Rasch models (Rasch, 1960) that model the log of the odds for a response in a particular category as a function of a latent trait and other facets of an assessment system, such as item difficulty or rater severity. These models transform ordinal ratings to a continuous scale using a logit transformation, where the continuous scale (the logit scale) represents the latent trait that is the object of measurement. The estimation procedure results in “location” estimates for elements of each facet (e.g., individual examinees, raters, and items) on a common scale. As a result of the common, continuous scale, one can compare the elements of one facet (e.g., examinees) directly to the elements of another facet (e.g., raters) in order to interpret their relative locations on the latent variable. Like other Rasch models, the MFR model is based on the principles of invariant measurement. In a teacher evaluation system based on classroom observation, invariant measurement means that estimates of a teacher’s effectiveness should not depend on the particular principal who happened to observe them, and estimates of principal severity should not depend on the particular teacher(s) whom they happened to rate. As we show in this manuscript, MFR models provide analysts with a variety of indicators for evaluating the degree to which empirical data match what would be expected based on the principles of invariant measurement. As a result, this approach provides a framework within which to evaluate a measurement procedure. We provide more details about these models below; for a didactic introduction to Rasch models in general, including MFR models, please see Bond and Fox (2015), Eckes (2015), and Engelhard and Wind (2018). 
When applied to teacher observation systems, MFR models allow researchers and practitioners to empirically evaluate rating quality, statistically adjust for rater effects, and identify areas for improvement in teacher observation systems.
MFR models have a long history of use in rater-mediated language assessments, such as writing assessments in which raters use scoring rubrics to score students’ compositions (McNamara & Knoch, 2012; Wind & Peterson, 2017). These models are useful for these language assessments because they can be specified to match the particular assessment context of interest. For example, analysts can include a customized set of facets (independent variables) in the model that reflect the specific components of an assessment system, such as students, raters, prompts, and domains in an analytic scoring rubric. In addition, researchers and practitioners have used MFR models to evaluate rater-mediated language assessments because the models can be applied when there are large proportions of missing observations (e.g., when every rater does not rate every examinee), because they provide a method for empirically detecting and adjusting for differences in rater severity, and because they allow analysts to empirically check whether or not rater severity is invariant (i.e., consistent) over construct-irrelevant components of an observational system—that is, the fairness of rater judgments. MFR models also allow analysts to empirically check whether raters have used the rating scale categories appropriately (e.g., make sure that higher ratings correspond to higher achievement and that the categories meaningfully distinguish among students with different levels of achievement). However, relatively few researchers have used MFR models in the context of teacher observation systems.
Illustration: Principal Observations of Teacher Effectiveness
The illustrative data are from the Network for Educator Effectiveness (NEE), which is a teacher evaluation program used by schools in the state of Missouri. An outreach unit of the University of Missouri, NEE provides a teacher evaluation framework and training to principals in over 250 districts across the state. As the main component of NEE, principals are encouraged to complete at least four to six classroom observations for each teacher in their building annually; that is, each teacher ideally receives ratings and feedback from the principal at least four times per year. Principals also participate in annual recalibration training conducted by certified trainers from NEE; this training includes evaluation of “anchor” teachers that provide principals opportunities to rate examples of high-, mid-, and low-performing teachers. At the completion of training, principals take a certification exam, and their scores are compared to those established by the NEE rubric developers; please see Bergin, Wind, Grajeda, and Tsai (2017) for a detailed discussion of how principals are trained and evaluated prior to operational scoring. We used a random sample of classroom observation data from NEE principals during the 2016–2017 school year for the current analyses.
Our sample consisted of 114 principals in the NEE system. For the 2016–2017 school year, we obtained initial classroom observation scores for each teacher rated by these principals. The average number of classroom observations provided by principals was 13.76 (SD = 2.95), with the minimum being 10 and the maximum being 20. In all, the sample consisted of 1,569 individual teacher observations, with each teacher being rated once. All 114 principals also rated four example teachers as part of the initial recalibration training prior to the school year. These were included to provide common observations (i.e., connectivity) between raters across schools (Wind & Jones, 2018). In the psychometric literature, researchers often refer to common observations between elements of an assessment system as “anchors,” because they provide a reference point through which to compare otherwise disconnected elements of an assessment system. In our study, the common observations “anchor” principals across teachers. Thus, each principal contributed four anchor ratings and between 10 and 20 individual teacher ratings to the data.
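The role of the anchor ratings can be illustrated with a small connectivity check. The following is a minimal sketch, not part of the NEE analysis: the function name `connected_rater_groups` and the toy principal and teacher identifiers are hypothetical. It treats the design as a graph in which raters are linked whenever they rate a common teacher, and reports how many disconnected groups of raters remain.

```python
from collections import defaultdict


def connected_rater_groups(ratings):
    """Group raters that are linked, directly or indirectly, by shared teachers.

    `ratings` is an iterable of (rater_id, teacher_id) pairs.
    Returns a list of sets of rater ids; a single set means a connected design.
    """
    teachers_of = defaultdict(set)
    raters_of = defaultdict(set)
    for rater, teacher in ratings:
        teachers_of[rater].add(teacher)
        raters_of[teacher].add(rater)

    seen, groups = set(), []
    for start in list(teachers_of):
        if start in seen:
            continue
        # Walk the rater-teacher graph outward from this rater.
        group, stack = set(), [start]
        while stack:
            rater = stack.pop()
            if rater in group:
                continue
            group.add(rater)
            for teacher in teachers_of[rater]:
                stack.extend(raters_of[teacher] - group)
        seen |= group
        groups.append(group)
    return groups


# Hypothetical toy data: each principal rates only teachers in their own school.
own_school = [("P1", "T1"), ("P1", "T2"), ("P2", "T3"), ("P3", "T4")]
# Every principal also rates the same two anchor teachers during training.
anchors = [(p, a) for p in ("P1", "P2", "P3") for a in ("A1", "A2")]

groups_without_anchors = connected_rater_groups(own_school)
groups_with_anchors = connected_rater_groups(own_school + anchors)
print(len(groups_without_anchors), len(groups_with_anchors))
```

In this toy design, the principals fall into separate groups until the shared anchor observations are added, which is the situation the common training ratings are meant to prevent.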
During each observation, principals rated teachers on a variety of teaching practices that are included in the NEE rubric. Teachers received one rating per teaching practice on an eight-category rating scale; due to infrequent observations in the lowest category, we combined the first two rating scale categories prior to analysis. Not all principals used the same set of teaching practices; however, most of them rated teachers on three common teaching practices: (a) “the teacher cognitively engages students in the content” (Cognitive Engagement), (b) “the teacher uses instructional strategies that lead students to problem-solving and critical thinking” (Critical Thinking), and (c) “the teacher monitors the effect of instruction on the whole class and individual learning” (Formative Assessment). Although these are not the only teaching practices included in NEE, we selected these three for our analyses because they were used during the anchor ratings in calibration training and because they represented a range of teaching practices. Furthermore, limiting our analysis to these three practices allowed us to present a focused illustration.
Models
Similar to Generalizability theory, researchers can specify a variety of formulations of MFR models that match the characteristics of an observational system. Specifically, one can include facets in an MFR model that function similarly to explanatory variables in logistic regression models and reflect various components of an observational system, such as raters and teaching practices (i.e., items or domains) that are included in a scoring rubric. Furthermore, one can manipulate the way in which rating scale categories are modeled in order to examine the structure and comparability of rating scales as they relate to various facets.
Three common facets used in MFR analysis are rater severity, examinee ability, and task difficulty. These three facets describe components of rating systems that are present in most rater-mediated assessments. Rater severity refers to the overall magnitude of the ratings provided by Rater i (across all examinees and tasks) when compared with those provided by other raters. “Severe” raters provide consistently lower ratings, while “lenient” raters provide generally higher ratings. In the current context, we refer to this facet as principal severity. Examinee ability refers to the judged ability of Examinee n relative to other examinees; in this case, we refer to this facet as teachers’ judged effectiveness. Task difficulty refers to the relative difficulty of Task j, or the likelihood of receiving a high score when being rated on the task; in our study, we refer to this facet as the judged difficulty of teaching practices.
MFR models share the same basic requirements (i.e., model assumptions) as other Rasch models. First, MFR models require unidimensionality: One latent variable is sufficient for explaining most of the variation in responses (in this case, principals’ ratings of teacher effectiveness). These models also require local independence: After controlling for the primary latent variable, there is no significant relationship between responses. Finally, MFR models are based on two strict requirements related to invariance. In a rater-mediated assessment, these requirements can be stated as follows: (1) Estimates of rater (principal) severity are independent of the particular teachers whom they happened to rate, and (2) estimates of examinee achievement (judged teacher effectiveness) are independent of the particular principal who rated them. Analysts can use a variety of model-data fit analyses to check for adherence to these assumptions; we describe these as part of our illustrative analyses below. It should also be noted that there are no strict sample size requirements for Rasch analyses; rather, researchers have proposed sample size recommendations depending on the type of analysis (e.g., calibration only, checking for evidence of bias, dichotomous vs. polytomous), and the level of precision required for the intended interpretation and use of the results. For an assessment based on polytomous ratings that has high-stakes consequences, Linacre (1994) recommended a minimum sample size of around 250 to achieve logit-scale estimates that are stable within a 99% confidence interval.
For our illustrative analysis, we used two MFR models. We applied both models to the sample described above (N_principals = 114; N_teachers = 1,569; N_teaching practices = 3).
Model One
In the first model, we included three facets: teachers, raters (principals), and teaching practices. This model provides estimates of teachers’ judged teaching effectiveness, raters’ severity, and the judged difficulty of teaching practices. The model can be expressed mathematically as:
where the term on the left side of the equation is the log of the odds that Teacher n, when rated by Rater i on Teaching Practice j, receives a rating in category k, rather than in the category just below it (k – 1). On the right side of the equation, the first three terms represent Teacher n’s judged effectiveness (θ_n), Rater i’s observed severity (λ_i), and the judged difficulty of Teaching Practice j (δ_j). The last term (τ_k) is a rating scale category threshold that reflects the difficulty associated with receiving a rating in a given rating scale category, rather than the category just below it.
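Assembled from the term definitions above, Equation 1 can be written in the following form. The probability symbol P and the subtraction of the rater, teaching practice, and threshold terms follow the usual MFR convention (Linacre, 1989) and are assumptions about the original typesetting rather than a quotation of it.

```latex
\ln\!\left[\frac{P_{nijk}}{P_{nij(k-1)}}\right] = \theta_n - \lambda_i - \delta_j - \tau_k \tag{1}
```

Under this sign convention, a higher rater location (greater severity) lowers the odds of a rating in the higher of two adjacent categories, which matches the interpretation of severity used throughout.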
Model Two
The second model is very similar to the first model. The only difference is that we changed the way in which we estimated the structure of the rating scale. Specifically, we estimated separate rating scale category thresholds (τ) for each rater so that we could evaluate the degree to which the raters interpreted the rating scale in a similar fashion. Model Two is as follows:
where all of the terms are defined as they were in Equation 1, except for the last one. Specifically, the rating scale category threshold is defined as the difficulty associated with receiving a rating in category k, rather than in category k – 1, specific to Rater i. As a result of this formulation, Equation 2 produces separate rating scale category difficulty estimates for each rater.
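Under the same assumed notation and sign convention as the standard MFR formulation, Equation 2 differs from Equation 1 only in the subscript on the threshold term, which now varies by rater:

```latex
\ln\!\left[\frac{P_{nijk}}{P_{nij(k-1)}}\right] = \theta_n - \lambda_i - \delta_j - \tau_{ik} \tag{2}
```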
Indicators of Rating Quality for Teacher Observation Systems
Together, Model One and Model Two provide a variety of evidence related to each facet in the model (teachers [θ], raters [λ], and teaching practices [δ]). Because we are focusing on rating quality in this study, we will focus on the evidence that these models provide about the rater facet (λ); however, other situations often warrant the consideration of evidence related to other facets as well. For example, one might also include rating occasions as an additional facet in order to examine changes in teachers’ judged effectiveness or principal severity over time. Table 1 presents several categories of evidence that the two models provide that have important implications for teacher observation systems. We classified the sources of evidence from these two models into three broad categories: (A) rater severity, (B) rater fit, and (C) rating scale category use. One can use numeric and graphical indicators related to these categories to address important questions about the quality of ratings in teacher observation systems.
Table 1. Indicators of Rating Quality for Teacher Observation Systems Based on MFR Models
Note. MFR = Many-Facet Rasch.
The two models that we included in our analysis provide a suite of indicators of rating quality that can be applied to teacher evaluation systems based on classroom observations. Although we have included two models, the purpose of the analysis is not to determine which model fits the data best. Instead, each model provides different information that can alert analysts to different aspects of the teacher evaluation system.
To illustrate the use of these sources of evidence, we used the Facets software program (Linacre, 2015) to analyze the NEE ratings using Model One and Model Two (see Online Supplement A for our Facets syntax, available on the journal website). In the following sections, we present the results from our analyses to illustrate the three categories of evidence.
Evidence Category A: Rater severity
The first category of evidence about the quality of teacher observation ratings is related to the severity with which the raters rated teachers. Specifically, one can use evidence in this category to address the following question: To what extent do raters exhibit different levels of severity when they rate teachers? The MFR model provides several numeric and graphical indicators of rater severity that address this question: rater locations, separation statistics, fair averages, and visual representations of rater locations on a variable map.
Rater locations
For each rater, the MFR model uses observed ratings to estimate location values that reflect rater severity. These estimates reflect differences in the severity with which individual principals rated teachers (i.e., differences between raters). Rater location estimates are expressed on an interval-level scale (the logit scale) that represents the latent variable measured in a teacher observation system, such as teaching effectiveness. For each rater, a location estimate is calculated that represents their severity, where raters who are severe (i.e., give low ratings often) have higher estimated locations and raters who are lenient (i.e., give high ratings often) have lower estimated locations.
It is possible to calculate rater location estimates even if all of the raters did not rate all of the teachers on all of the tasks in a teacher observation system. As long as there are connections between raters through common elements (e.g., two different raters rate at least one teacher in common), estimates of rater severity are adjusted for differences in teacher achievement. The practical implication of this property is that it is not necessary for raters to exhibit high levels of interrater reliability in order to have a sound observational system, because teachers’ estimates are adjusted for differences in rater severity. Table 2 shows the estimated locations for the raters in our illustrative NEE dataset, with the raters ordered by severity. According to these estimates, the most severe rater was Rater 58 (λ = 2.26; mean rating = 4.33), and the most lenient rater was Rater 242 (λ = −3.68; mean rating = 5.41). These estimated locations also include standard errors (SEs) that reflect the precision with which each rater’s location has been estimated. Larger SEs reflect less precision, which may result from poor targeting between a rater and the teachers they observe. Large SEs may also occur when there are few observations of the rater, such as when a rater only rates a small number of teachers. One can use these SEs to calculate confidence intervals around rater locations.
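As a brief sketch of the confidence interval calculation mentioned above, the following uses the two rater locations reported in the text. The standard errors (0.30 and 0.45) and the function names are hypothetical, chosen only for illustration, and the interval assumes approximately normal estimation error.

```python
def location_interval(location, se, z=1.96):
    """Return an approximate confidence interval (lower, upper) in logits."""
    return (location - z * se, location + z * se)


def intervals_overlap(a, b):
    """True when two (lower, upper) intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Locations are taken from the text; the SE values are hypothetical.
severe = location_interval(2.26, se=0.30)    # Rater 58
lenient = location_interval(-3.68, se=0.45)  # Rater 242

print(severe, lenient, intervals_overlap(severe, lenient))
```

Non-overlapping intervals suggest that two raters' severity estimates differ by more than their estimation error; overlapping intervals suggest that the difference may not be distinguishable.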
Table 2. Rater Calibrations and Fit Statistics Calculated Using Model One (Results for Selected Raters)
Note. MSE = Mean square error.
Separation statistics
One can also use MFR model estimates to evaluate the degree to which raters’ locations are distinct—that is, the degree to which raters have different levels of severity. In contrast to Generalizability theory (Brennan, 2010; Cronbach et al., 1972), differences among raters are not treated as evidence of measurement error in the context of Rasch measurement theory (Rasch, 1960). As long as there is adequate model-data fit (described further later in the manuscript), differences among raters are considered valuable because these differences can lead to more precise estimates of examinee achievement (in this case, teaching effectiveness) by targeting a wide range of achievement levels. Because the Rasch model estimates of examinee achievement are adjusted for differences in rater severity, it is not necessary for all of the raters to exhibit the same level of severity in order to achieve desirable psychometric properties.
To gauge the degree to which raters exhibit similar levels of severity, researchers who use MFR models often calculate the reliability of separation statistic (Rel) for each facet in the model. In contrast to reliability statistics in Classical Test Theory and Generalizability Theory, values of Rel indicate the magnitude of differences among the locations of elements within a facet, where higher values of Rel indicate larger differences. In our illustrative dataset, the Rel statistic for raters was 0.95—indicating substantial severity differences among the raters. As we noted above, these differences in rater severity are not considered evidence of measurement error from the perspective of Rasch measurement theory. As long as there are links between the raters, as well as evidence of acceptable model-data fit, the teacher estimates are adjusted for differences in rater severity—such that the interpretation of each teacher’s teaching ability is not tied to the particular raters who happened to observe them.
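One common way to compute a reliability of separation statistic is as the share of observed variance in the location estimates that is not attributable to estimation error. The sketch below assumes that definition, uses the population variance of the locations, and relies on hypothetical rater locations and standard errors; software such as Facets may differ in details (for example, population versus sample variance).

```python
from statistics import fmean, pvariance


def separation_reliability(locations, standard_errors):
    """Estimate Rel = (observed variance - error variance) / observed variance."""
    observed_var = pvariance(locations)
    # Error variance is approximated by the mean of the squared standard errors.
    error_var = fmean([se ** 2 for se in standard_errors])
    true_var = max(observed_var - error_var, 0.0)
    return true_var / observed_var


# Hypothetical rater severity estimates (logits) and their standard errors.
locations = [-1.5, -0.5, 0.0, 0.5, 1.5]
standard_errors = [0.3, 0.3, 0.3, 0.3, 0.3]

rel = separation_reliability(locations, standard_errors)
print(round(rel, 2))
```

Values near 1 indicate that the spread of rater locations is large relative to their estimation error, as with the value of 0.95 reported for the NEE raters.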
Adjusted average ratings
A third numeric indicator related to rater severity is the adjusted average rating for each rater (i.e., “fair averages”; Eckes, 2005; Linacre, 2018). The adjusted average rating is a transformation of rater location estimates on the logit scale back to the original scale (i.e., raw score scale). These values are useful because they show each rater’s raw score average rating, adjusted for the level of the teachers that they happened to rate. As a result, adjusted averages are practically useful statistics for communicating the spread of rater severity to practitioners. Adjusted averages can be calculated for teachers in the same way, by transforming the teacher location estimates on the logit scale back to the original rating scale. For raters, the difference between unadjusted average ratings and adjusted average ratings shows the impact of the adjustment for differences in teacher achievement. Likewise, the difference between the unadjusted and adjusted average ratings for teachers shows the impact of adjustment for differences in rater severity.
Table 2 includes adjusted average ratings for the raters. For some raters, the adjusted averages are lower than their observed averages (e.g., Rater 58: observed average = 4.33, adjusted average = 3.99), indicating that these raters rated teachers with relatively high judged effectiveness. For other raters, the adjusted averages are higher than their observed averages (e.g., Rater 54: observed average = 4.09, adjusted average = 4.63), indicating that these raters rated teachers with relatively low judged effectiveness. Plot A in Figure 1 illustrates the correlation between the observed average ratings (x axis) and adjusted average ratings (y axis) for raters (r = 0.46). Some principals have an adjusted average rating that is higher than their observed average rating (points above the diagonal). For these raters, the adjustment reduced their estimated severity, indicating that they rated teachers with relatively low judged effectiveness. For other principals, the adjusted rating is lower than their observed rating (points below the diagonal). For these raters, the adjustment increased their estimated severity, indicating that they rated teachers with relatively high judged effectiveness. For teachers, the adjustments for differences in rater severity have similar consequences. As shown in Plot B in Figure 1 (r = 0.91), although there is a strong positive correlation, some teachers have an adjusted average rating that is higher than their observed average rating (points above the diagonal); for these teachers, the adjustment reduced the impact of rater severity on the estimate of their teaching proficiency. Other teachers have adjusted average ratings that are lower than their observed average ratings (points below the diagonal). For these teachers, the adjustment reduced the impact of rater leniency on the estimate of their teaching proficiency.
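The transformation from a logit-scale severity estimate to an adjusted average can be sketched as the expected rating a rater would give when all other facets are held at their mean values. The code below is an illustrative sketch under that assumption, using the rating scale form of the MFR model with facet terms subtracted; the threshold values, the default means of zero, and the function names are hypothetical and do not come from the NEE analysis.

```python
import math


def category_probabilities(theta, severity, difficulty, thresholds):
    """Category probabilities under a rating scale MFR model.

    Each threshold tau_k contributes (theta - severity - difficulty - tau_k)
    to the log-odds of category k versus category k - 1.
    """
    log_numerators = [0.0]
    for tau in thresholds:
        step = theta - severity - difficulty - tau
        log_numerators.append(log_numerators[-1] + step)
    peak = max(log_numerators)  # subtract the maximum for numerical stability
    weights = [math.exp(v - peak) for v in log_numerators]
    total = sum(weights)
    return [w / total for w in weights]


def fair_average(severity, thresholds, mean_theta=0.0, mean_difficulty=0.0, lowest=1):
    """Expected rating from a rater when other facets sit at their means."""
    probs = category_probabilities(mean_theta, severity, mean_difficulty, thresholds)
    return sum((lowest + k) * p for k, p in enumerate(probs))


# Hypothetical thresholds for a seven-category scale.
thresholds = [-2.5, -1.5, -0.5, 0.5, 1.5, 2.5]

for severity in (-1.0, 0.0, 1.0):
    print(severity, round(fair_average(severity, thresholds), 2))
```

Because every rater is evaluated against the same reference teacher and teaching practice, differences among these values reflect severity alone rather than the particular teachers each rater happened to observe.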

Figure 1. Observed and adjusted average ratings.
Variable map
A final source of evidence related to rater severity is a graphical display of raters’ estimated locations in a variable map. Figure 2 is a variable map that shows the location estimates for the NEE raters in our illustrative dataset. The first column is the logit scale—this is the linear scale on which locations for all of the principals, teachers, tasks, and rating scale categories have been estimated. The second column shows the severity estimates for each of the 114 principals in the NEE dataset. In this column, an asterisk (*) represents four principals, and a period represents between one and three principals. These locations are the same as the logit-scale locations given in Table 2, where higher locations indicate principals who were more severe and lower locations indicate principals who were less severe. To facilitate comparisons between the facets in the MFR model, we centered the locations of the principals around zero logits (i.e., set the mean equal to zero). The spread of the principal locations in Figure 2 reveals that the principals exhibited a range of severity levels when they conducted classroom observations.

Figure 2. Variable map.
The third and fourth columns show the estimated locations for the teachers and tasks. Although these facets are not the primary focus of our analysis, we will describe them briefly. The teacher locations are also represented using asterisks and periods, where an asterisk represents 16 teachers and a period represents between 1 and 15 teachers. Teachers with higher logit-scale locations were judged as more effective than teachers with lower logit-scale locations. There was a wider spread of teacher locations than rater locations, indicating that, in general, the raters judged most of the teachers as relatively effective. The fourth column shows the estimated locations for the tasks. Critical Thinking was judged as the most difficult teaching practice—that is, the principals required higher levels of teaching effectiveness to give high ratings on this teaching practice. Cognitive Engagement was judged as the least difficult practice—that is, the principals required relatively lower levels of teaching effectiveness to give high ratings on this teaching practice.
The last column in Figure 2 shows the estimated difficulty associated with the NEE rating scale categories. Horizontal lines between each pair of category labels show the logit-scale location at which there was an equal probability that a principal would give a rating in the adjacent categories. Although the distance between each of the rating scale categories is not equivalent, the category difficulties proceed in the expected order (increasing in difficulty from Category 1 to Category 7). We provide additional details about the rating scale categories in our discussion of Model Two below.
Evidence Category B: Rater fit
The second category of evidence from the MFR model for evaluating rating quality in teacher observation systems is rater fit. Whereas the evidence in Category A reflects the degree to which raters exhibit different levels of severity and leniency when they observe and score teachers, the evidence in Category B reflects the degree to which raters are internally consistent such that statistical adjustments for differences in their severity are trustworthy. Researchers and practitioners can use evidence in this category to address the following question: To what extent do raters give expected and unexpected ratings? In the context of latent trait models, such as the MFR model, “expected” and “unexpected” observations are defined using estimates for parameters in the model. In the case of teacher observation systems, expected and unexpected ratings are defined using rater location estimates, teacher location estimates, and estimates for other facets in the model. For example, a rater who frequently gives high ratings would have a relatively low location estimate on the logit scale (i.e., low severity). In contrast, a teacher who received high ratings very often would have a relatively high estimate on the logit scale. If this lenient rater observed the high-performing teacher, we would expect the rater to give the teacher relatively high ratings, and very low ratings would be unexpected. Statistically speaking, after location estimates have been calculated, they are used to calculate expected ratings using the MFR model (Equation 1). These expected ratings are then compared to the actual observed ratings. If the MFR model estimates are a reasonable summary of the data, then there will be a close correspondence between the expected and observed ratings.
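The comparison between expected and observed ratings can be sketched numerically for the lenient rater and high-performing teacher described above. This is an illustration under stated assumptions: the model form follows the rating scale MFR model with facet terms subtracted, and the teacher location, rater location, thresholds, and function name are hypothetical values rather than NEE estimates.

```python
import math


def expected_rating(theta, severity, difficulty, thresholds, lowest=1):
    """Return (expected rating, model variance) for one observation."""
    log_numerators = [0.0]
    for tau in thresholds:
        log_numerators.append(log_numerators[-1] + theta - severity - difficulty - tau)
    peak = max(log_numerators)  # subtract the maximum for numerical stability
    weights = [math.exp(v - peak) for v in log_numerators]
    total = sum(weights)
    probs = [w / total for w in weights]
    categories = [lowest + k for k in range(len(probs))]
    expected = sum(c * p for c, p in zip(categories, probs))
    variance = sum((c - expected) ** 2 * p for c, p in zip(categories, probs))
    return expected, variance


# Hypothetical values: an effective teacher (theta = 2.0) observed by a
# lenient rater (severity = -1.0) on a practice of average difficulty.
thresholds = [-2.5, -1.5, -0.5, 0.5, 1.5, 2.5]
expected, variance = expected_rating(2.0, -1.0, 0.0, thresholds)

residual_high = 6 - expected  # an observed rating of 6 is close to expectation
residual_low = 2 - expected   # an observed rating of 2 is far from expectation
print(round(expected, 2), round(residual_high, 2), round(residual_low, 2))
```

The large negative residual for the low rating is what the model would treat as unexpected for this rater and teacher combination.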
Discrepancies between the expected and observed ratings—described as “misfit”—can alert researchers and practitioners to instances where the MFR model is not an adequate summary of a rater’s judgments of a teacher’s teaching effectiveness on a particular teaching practice. Identifying these discrepancies can lead to improvements in various aspects of the observation system, such as revisions to rater training or scoring materials. Likewise, instances of misfit can alert researchers and practitioners to raters, teachers, or teaching practices for whom it may not be appropriate to interpret MFR model estimates or adjusted averages. There are many ways to evaluate model-data fit using latent trait models such as the MFR model. In this illustration, we focus on two numeric indicators and one graphical fit indicator.
Mean square error statistics
Popular numeric indicators of model-data fit for Rasch models are averages and weighted averages of the residuals (discrepancies between observed and expected ratings) associated with individual elements within facets, such as individual raters. In particular, researchers and practitioners who use Rasch models often calculate unweighted and weighted Mean Square Error (MSE) statistics. These statistics are based on standardized residuals: differences between the observed ratings in the data and the ratings that would be expected given the model estimates, divided by their model-based standard deviation. Standardized residuals are calculated as follows:
$$z_{ni} = \frac{x_{ni} - E_{ni}}{\sqrt{W_{ni}}} \tag{3}$$
where $x_{ni}$ is the observed rating that rater i gave to teacher n, $E_{ni}$ is the expected rating for teacher n when they were rated by rater i, and $W_{ni}$ is the variance of $x_{ni}$ under the model. It is possible to use standardized residuals to calculate fit statistics for any facet in a Rasch model. For raters, the unweighted MSE statistic, referred to as “outfit” MSE, is calculated as follows:
$$\text{Outfit MSE}_{i} = \frac{1}{N} \sum_{n=1}^{N} z_{ni}^{2} \tag{4}$$
In Equation 4, the outfit statistic is the average of the squared standardized residuals ($z_{ni}^{2}$) across all N teachers whom Rater i rated. Individuals who use Rasch models also frequently calculate a weighted MSE statistic, referred to as “infit” MSE, as follows:
$$\text{Infit MSE}_{i} = \frac{\sum_{n=1}^{N} W_{ni}\, z_{ni}^{2}}{\sum_{n=1}^{N} W_{ni}} \tag{5}$$
Infit MSE is the average of the squared standardized residuals across all of the teachers whom Rater i rated, where each squared standardized residual is weighted by its variance ($W_{ni}$). Relatively speaking, outfit MSE statistics are more sensitive to extreme unexpected ratings (e.g., when a very severe rater gives a high rating to a teacher whose teaching is judged as ineffective), and infit MSE statistics are more sensitive to unexpected ratings that are less extreme.
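The two statistics described above can be sketched in a few lines. The function below takes, for one rater, the observed ratings, the model-expected ratings, and the model variances for the teachers that rater scored. The function name and the example numbers are hypothetical. In practice these quantities would come from the software used to estimate the MFR model.

```python
import numpy as np

def rater_fit_statistics(observed, expected, variance):
    """Outfit and infit MSE for one rater.

    observed : ratings the rater assigned (one per teacher rated)
    expected : model-expected ratings for the same teachers
    variance : model variances of those ratings
    """
    observed = np.asarray(observed, dtype=float)
    expected = np.asarray(expected, dtype=float)
    variance = np.asarray(variance, dtype=float)

    # Standardized residuals: (observed - expected) / model standard deviation
    z = (observed - expected) / np.sqrt(variance)

    # Outfit: unweighted mean of the squared standardized residuals
    outfit = float(np.mean(z ** 2))
    # Infit: squared standardized residuals weighted by their variances
    infit = float(np.sum(variance * z ** 2) / np.sum(variance))
    return outfit, infit

# Hypothetical rater who scored three teachers
outfit, infit = rater_fit_statistics(
    observed=[3, 2, 4],
    expected=[2.5, 2.5, 3.0],
    variance=[0.5, 0.5, 1.0],
)
print(outfit, infit)
```

Because the weights cancel the denominators of the squared standardized residuals, infit reduces to the sum of squared raw residuals divided by the sum of the variances. Observations with little model variance, which are typically the extreme mismatches between rater and teacher locations, therefore influence infit less than they influence outfit.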
In the context of a teacher observation system, MSE statistics are useful because their values can help researchers and practitioners identify individual raters, teachers, and tasks that are associated with many or large unexpected ratings and that thus warrant additional investigation prior to interpreting rater judgments. A number of researchers have proposed recommendations for setting critical values (i.e., cut scores) above or below which to classify elements as “misfitting,” including recommendations that take into account the type of data (e.g., multiple-choice or rating scale data; Bond & Fox, 2015), sample size (Smith, Schumacker, & Bush, 1998), and the focus of the analysis (e.g., test-taker fit vs. item fit; DeAyala, 2009; Wu & Adams, 2013). Other researchers have recommended that critical values be established using the empirical distribution of fit statistics in a given sample (Seol, 2016; Wolfe, 2013). In our illustration, we treat MSE statistics as continuous variables and use the empirical distribution to inform our interpretation of values for individual raters.
Using the empirical critical values for the outfit MSE statistic (see Online Supplement B), we flagged eight raters as misfitting. Five of these raters had outfit statistics greater than our upper critical value (>1.98), indicating that these raters gave many extreme unexpected ratings, and three had outfit statistics lower than our lower critical value (<0.34), indicating that these raters gave overly consistent ratings. Similarly, using the empirical critical values for the infit MSE statistic, we flagged nine raters as misfitting. Seven of these raters had infit statistics greater than our upper critical value (>1.99), and two had infit statistics lower than our lower critical value (<0.35).
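To make the flagging step concrete, the following sketch applies a lower and an upper critical value to a set of rater fit statistics. It does not reproduce the procedure used to derive the empirical critical values, which is described in Online Supplement B. The cutoffs are simply passed in as arguments. The rater labels and MSE values in the example are invented for illustration, and only the two outfit cutoffs (0.34 and 1.98) come from the text above.

```python
def flag_misfitting_raters(mse_by_rater, lower, upper):
    """Classify raters whose MSE falls outside the critical values.

    mse_by_rater : dict mapping a rater label to an outfit or infit MSE value
    lower, upper : critical values supplied by the analyst
    """
    flags = {}
    for rater, mse in mse_by_rater.items():
        if mse > upper:
            # More unexpected ratings than the model predicts
            flags[rater] = "above_upper"
        elif mse < lower:
            # Less variation than the model predicts (overly consistent)
            flags[rater] = "below_lower"
    return flags

# Hypothetical outfit values for three hypothetical raters
example = {"rater_a": 2.40, "rater_b": 1.02, "rater_c": 0.21}
print(flag_misfitting_raters(example, lower=0.34, upper=1.98))
```

Keeping the direction of each flag, rather than a single misfit indicator, preserves the distinction drawn above between raters who give many unexpected ratings and raters whose ratings are overly consistent.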
Expected and observed response functions
In addition to numeric indicators of model-data fit, it is useful to examine graphical displays of the alignment between observed and expected responses. Graphical displays of model-data fit are particularly beneficial because they reveal characteristics of misfit that numeric fit statistics do not communicate, including the magnitude, direction, and location of unexpected ratings. A popular method for displaying model-data fit graphically is through plots of expected and observed response functions. Response functions illustrate the relationship between examinee location estimates and the probability for ratings in each category. One can create expected response functions that illustrate the modeled relationship between examinee achievement and ratings and compare these to observed response functions that illustrate the actual observed relationship. In this illustration, we focus on response functions for raters (rater response functions, RRFs). Specifically, one can plot an RRF for each rater that illustrates the relationship between logit-scale estimates of teacher effectiveness and the probability for a rating in each category of the rating scale, specific to the rater of interest.
Figure 3 shows the modeled (expected) and observed RRFs for selected raters from the NEE dataset; each plot corresponds to a different rater. We selected these raters to demonstrate graphical displays of model-data fit for raters whose ratings exhibited approximate fit to the model and raters whose ratings exhibited misfit. In each plot, the x axis shows teachers’ estimated effectiveness in logit-scale units. The y axis shows the rating scale categories. The solid line shows the expected relationship between the two axes given the Rasch model estimates, and the dashed line shows the observed relationship. Finally, the dotted lines show upper and lower limits of the 95% confidence interval around the expected response function.

Figure 3. Expected and observed rater response functions.
Rater 919 is an example of a rater whose ratings exhibited approximate fit to the model; the plot for Rater 919 indicates that all of this rater’s observed ratings were within the 95% confidence interval for the expected ratings, and the pattern of the observed RRF generally follows the shape of the expected RRF. Rater 884 exhibited slightly more variability in their observed ratings, but this rater’s ratings were also within the 95% confidence interval for the expected ratings. Figure 3 also includes plots for two raters who exhibit substantial model-data misfit. For example, the plot for Rater 948 indicates that this rater gave unexpected ratings to teachers with relatively low judged teaching effectiveness. Specifically, the empirical RRF for this rater indicates that they gave higher than expected ratings (empirical RRF is higher than the expected RRF) to teachers with estimated locations less than around −1 logits and lower than expected ratings (empirical RRF is lower than the expected RRF) to teachers with estimated locations around 0 logits. Rater 586 also gave unexpected ratings; this rater’s most extreme departures from model expectations occurred when this rater rated teachers who were otherwise judged to have high teaching effectiveness. Specifically, Rater 586 gave substantially lower than expected ratings to teachers with estimated locations greater than about 3 logits.
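The comparison shown in plots such as Figure 3 can be approximated numerically. The sketch below groups the teachers scored by one rater into intervals along the logit scale and, within each interval, compares the mean observed rating with the mean expected rating. The equal-width intervals and the normal-approximation 95% band are assumptions made for illustration. The procedure used to draw the confidence limits in Figure 3 is not described in this section and may differ. All names and example values are hypothetical.

```python
import numpy as np

def rater_response_function(theta, observed, expected, variance, n_bins=5):
    """Binned observed vs. expected ratings for one rater.

    theta    : teacher location estimates (logits)
    observed : ratings the rater assigned to those teachers
    expected : model-expected ratings
    variance : model variances of the ratings
    """
    theta = np.asarray(theta, dtype=float)
    observed = np.asarray(observed, dtype=float)
    expected = np.asarray(expected, dtype=float)
    variance = np.asarray(variance, dtype=float)

    # Equal-width intervals spanning the observed range of teacher locations
    edges = np.linspace(theta.min(), theta.max(), n_bins + 1)
    bin_index = np.digitize(theta, edges[1:-1])  # values 0 .. n_bins - 1

    rows = []
    for b in range(n_bins):
        mask = bin_index == b
        n = int(mask.sum())
        if n == 0:
            continue  # no teachers in this interval
        obs_mean = float(observed[mask].mean())
        exp_mean = float(expected[mask].mean())
        # Approximate 95% band for the mean rating (normal approximation)
        half_width = 1.96 * float(np.sqrt(variance[mask].sum())) / n
        rows.append({
            "theta_mean": float(theta[mask].mean()),
            "observed_mean": obs_mean,
            "expected_mean": exp_mean,
            "inside_band": bool(abs(obs_mean - exp_mean) <= half_width),
        })
    return rows

# Hypothetical rater whose ratings fall below expectation for high locations
rows = rater_response_function(
    theta=[-2, -1, 0, 1, 2, 3],
    observed=[1, 2, 3, 4, 3, 2],
    expected=[1.2, 1.8, 3.0, 3.6, 4.8, 5.4],
    variance=[0.5] * 6,
    n_bins=3,
)
for row in rows:
    print(row)
```

The sign of `observed_mean - expected_mean` in each interval indicates the direction of the misfit, and `theta_mean` indicates where along the scale it occurs. These are the features that the numeric fit statistics alone do not convey.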
Evidence Category C: Rating Scale Category Use
A third category of evidence for evaluating teacher observation systems is related to how raters use rating scale categories. Researchers and practitioners can use evidence in this category to address two questions: To what extent do raters interpret rating scale categories as expected? To what extent do raters share a similar interpretation of the difficulty of rating scale categories as other raters? Evidence in this category focuses on whether or not the raters have used the scale categories appropriately (e.g., higher ratings correspond to more effective teaching, and the categories describe distinct levels of teaching effectiveness). This evidence is distinct from the overall severity with which raters judge teachers’ effectiveness (Category A) and the reasonableness of adjustments for severity differences (Category B).
Evidence based on MFR models such as Model Two is particularly useful for evaluating rating scale functioning because these models allow researchers to explicitly examine and compare the structure of the rating scale across the individual raters in an observation system. There are several numeric and graphical indicators of rating scale functioning that one can calculate using Rasch models such as Equation 2.
Estimates of rating scale category thresholds
Rating scale category thresholds are numeric values that reflect the difficulty associated with receiving a rating in a particular rating scale category, rather than the category just below it. When an MFR model is specified as in Model Two, separate thresholds are calculated that are specific to each rater. Accordingly, for a rating scale with k categories, Model Two provides estimates of k – 1 thresholds for each rater. These threshold values allow researchers and practitioners to examine the degree to which individual raters interpret the difficulty of rating scale categories in a similar way to other raters. Specifically, one can examine the difference in the locations of these thresholds across raters to find out whether raters have a similar interpretation of the level of teaching effectiveness necessary to receive a rating in each category. Please see Online Supplement C for a presentation and discussion of the rating scale thresholds for the NEE data.
Rating scale category probability curves
A useful way to examine raters’ use of rating scale categories is to construct graphical displays of the probability that each rater gives a rating in each of the rating scale categories (see Figure 4). By examining plots of the probability associated with each rating scale category, it is possible to determine the extent to which individual raters have ordered the categories as expected, the extent to which raters use each category to represent a meaningful range of achievement, and the extent to which raters have used all of the categories.

Figure 4. Rating scale category probability curves for selected raters.
Figure 4 includes rating scale category probability curves for four raters from the NEE dataset who used the NEE rating scale categories differently. In each plot, the x axis shows the logit scale for judged teaching effectiveness, and the y axis shows the NEE rating scale. Different lines show the conditional probability that a teacher with a given logit-scale location will receive a rating in a particular rating scale category. Rater 48 (Figure 4A) used five of the seven NEE rating scale categories and interpreted the relative order of these categories as expected (the category probabilities proceed from left to right in increasing category order). Each of the category probability curves in this plot has a distinct peak—indicating that Rater 48 used the categories to reflect a distinct range of teaching effectiveness. Finally, the category probability curves are relatively evenly spaced—indicating that Rater 48 had a generally consistent interpretation of the relative difference between each of the categories in the scale. On the other hand, Rater 94 (Figure 4B) used all seven NEE rating scale categories, but the probability curve for Category 3 does not have a distinct peak. This result indicates that Rater 94 did not use Category 3 to describe a range of teaching effectiveness distinct from Category 2 or Category 4. The plots for Rater 36 (Figure 4C) and Rater 27 (Figure 4D) show similar patterns.
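Curves like those in Figure 4 can be generated from a rater's threshold estimates. The sketch below computes category probabilities over a grid of logit-scale locations for one rater and lists any category that is never the most probable at any location. That check is one common way to operationalize a category that lacks a distinct region of the scale, and it is an interpretation adopted here rather than a criterion stated in the text. The sketch assumes categories scored 0, 1, ..., m and assumes that the rater's severity has already been absorbed into the location values. The threshold values are hypothetical, not NEE estimates.

```python
import numpy as np

def category_probability_curves(theta_grid, thresholds):
    """Category probabilities at each grid location for one rater.

    Returns an array of shape (len(theta_grid), m + 1), where m is the
    number of rater-specific thresholds.
    """
    theta_grid = np.asarray(theta_grid, dtype=float)
    thresholds = np.asarray(thresholds, dtype=float)
    # steps[g, k] = theta_g - tau_k for each grid point and threshold
    steps = theta_grid[:, None] - thresholds[None, :]
    # Log-numerators: 0 for the lowest category, cumulative sums for the rest
    log_num = np.concatenate(
        [np.zeros((len(theta_grid), 1)), np.cumsum(steps, axis=1)], axis=1
    )
    log_num -= log_num.max(axis=1, keepdims=True)  # guard against overflow
    probs = np.exp(log_num)
    return probs / probs.sum(axis=1, keepdims=True)

def never_modal_categories(theta_grid, thresholds):
    """Categories that are not the most probable at any grid location."""
    probs = category_probability_curves(theta_grid, thresholds)
    modal = set(np.argmax(probs, axis=1).tolist())
    return [k for k in range(probs.shape[1]) if k not in modal]

grid = np.linspace(-6, 6, 241)
# Hypothetical rater with ordered thresholds: every category has its own range
print(never_modal_categories(grid, [-2.0, 0.0, 2.0]))
# Hypothetical rater with disordered second and third thresholds
print(never_modal_categories(grid, [-2.0, 1.0, 0.0, 2.0]))
```

Plotting each column of the returned array against `theta_grid` gives one curve per category. The gaps between adjacent thresholds correspond to the spacing of the curves discussed for Rater 48.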
Discussion
The purpose of this study was to illustrate how researchers and practitioners can use latent trait models in general, and multifaceted models based on Rasch measurement theory (Rasch, 1960) in particular, to evaluate the quality of ratings in observational systems for teacher evaluation. Similar to Generalizability theory (Brennan, 2010; Cronbach et al., 1972), MFR models allow researchers and practitioners to go beyond rater agreement statistics and interrater reliability coefficients to evaluate psychometric quality related to various components of teacher observation systems. The MFR model indicators provide additional information that can assist researchers and practitioners in ensuring that teachers receive ratings that are comparable across different raters. Although rater consistency and reliability are important components of teacher observation systems, investigating these additional aspects of rating quality can help researchers and practitioners more fully explore teacher evaluation systems from a psychometric perspective.
Specifically, MFR models provide researchers and practitioners with a variety of detailed indicators that facilitate the investigation of rating quality beyond indicators of consistency, including indicators of rater severity, rater fit, and rating scale category use. With regard to rater severity, it is important to note that differences in rater severity may exist even with high reliability. That is, rater reliability statistics, including reliability and dependability indicators from Generalizability theory, may mask the implications of rater severity differences because they focus on the consistency of the raters’ relative ordering of teachers or other elements of an assessment system. While rater consistency and reliability are important components of teacher observation systems, the persistence of rater severity differences in these contexts (Casabianca et al., 2013, 2015) suggests that researchers and practitioners need to design assessment systems that allow for adjustments for these differences and then consider their impact in practical settings.
With regard to rater fit, the MFR model provides indicators of the degree to which individual raters’ ratings of teaching effectiveness display psychometric properties that are consistent with fundamental measurement properties, including invariance. We illustrated how one can use numeric and graphical indicators to identify individual raters whose ratings may warrant additional investigation before they can be meaningfully interpreted and used to evaluate teachers. Such analyses are particularly important if one plans to use statistical adjustments for differences in rater severity (e.g., Figure 1 and Table 2), because evidence of acceptable fit is needed in order to meaningfully interpret adjusted ratings (Wind, Engelhard, & Wesolowski, 2016).
Finally, the MFR model provides a variety of indicators related to raters’ use of a rating scale. We showed how using a Partial Credit model formulation (Masters, 1982) of the MFR model allows researchers and practitioners to consider the extent to which raters interpret and use the scale categories in the intended order and the extent to which raters’ interpretation and use of the categories is consistent with other raters. Such analyses provide evidence beyond rater severity and rater fit to support the interpretation and use of teacher observation ratings. These indicators are particularly useful for informing revisions to scoring materials (e.g., rating scales and rubrics) and rater training.
In addition to the indicators that we highlighted in our illustrative analyses, MFR models include several other useful indicators of psychometric quality that are relevant to teacher observation systems. In particular, indicators of interactions between facets (e.g., interactions between raters and teaching practices, or between raters and teachers’ demographic characteristics) and indicators of teacher fit (i.e., person fit; see Meijer, Niessen, & Tendeiro, 2016) provide important information about the psychometric quality of these assessment systems and warrant consideration in practice and in research. We focused on a select set of rating quality indicators that are directly related to the rater facet in teacher observation systems so that we could make a case for the value of the MFR approach for evaluating rating quality in these contexts. Because rater judgments play a central role in conclusions about teaching quality in teacher evaluation systems based on classroom observations (Cohen & Goldhaber, 2016), analyses of rating quality beyond indicators of rater consistency are an essential component of evidence of the psychometric quality of these systems.
We agree with Hill et al.’s (2012) observation that “despite their common use, rater agreement rates do not provide a comprehensive picture of the reliability of scores generated from observational systems” (p. 57) and with their conclusion that “it is misleading to talk about the reliability of specific instruments; instead, reliability inheres in the joint combinations of instruments, rater training and certification systems, and specific scoring designs that contribute to an observational system” (pp. 62–63). In addition to the multifaceted nature of reliability, we argue that other aspects of psychometric quality also warrant investigation, including rater severity, model-data fit, and rating scale category use. Using the techniques that we demonstrated in this study, researchers and practitioners who design, implement, and evaluate teacher observation systems can gather additional evidence to promote psychometrically sound evaluations of teaching quality. Furthermore, when there is evidence of adequate model-data fit, it is possible to adjust estimates of teacher effectiveness for differences in rater severity and to adjust estimates of rater severity for differences in teacher effectiveness even when it is not possible for all of the raters to rate all of the teachers. Such analyses are crucial in contexts where conclusions about teaching quality, and the resulting consequences of these conclusions, depend on rater judgments. Accordingly, we recommend that rating quality analyses using techniques such as those that we have illustrated in this study be included as a routine part of the development (e.g., rater training), implementation, interpretation, and revision of all rater-mediated assessments, including teacher evaluation systems based on classroom observations.
Supplemental Material
Supplemental material, Wind_and_Jones_Online_Supplements_A-C for Not Just Generalizability: A Case for Multifaceted Latent Trait Models in Teacher Observation Systems by Stefanie A. Wind and Eli Jones in Educational Researcher