Abstract
When analysts evaluate performance assessments, they often use modern measurement theory models to identify raters who frequently give ratings that are different from what would be expected, given the quality of the performance. To detect problematic scoring patterns, two rater fit statistics, the infit and outfit mean square error (MSE) statistics, are routinely used. However, the interpretation of these statistics is not straightforward. Researchers commonly employ established rule-of-thumb critical values to interpret infit and outfit MSE statistics. Unfortunately, prior studies have shown that these rule-of-thumb values may not be appropriate in many empirical situations. Parametric bootstrapped critical values for infit and outfit MSE statistics provide a promising alternative approach to identifying item and person misfit in item response theory (IRT) analyses. However, researchers have not examined the performance of this approach for detecting rater misfit. In this study, we illustrate a bootstrap procedure that researchers can use to identify critical values for infit and outfit MSE statistics, and we used a simulation study to assess the false-positive and true-positive rates of these two statistics. We observed that the false-positive rates were highly inflated, and the true-positive rates were relatively low. Thus, we proposed an iterative parametric bootstrap procedure to overcome these limitations. The results indicated that using the iterative procedure to establish 95% critical values of infit and outfit MSE statistics had better-controlled false-positive rates and higher true-positive rates compared to using the traditional parametric bootstrap procedure and rule-of-thumb critical values.
Researchers have proposed psychometric models that help researchers identify and control for problematic scoring patterns in rater-mediated performance assessments, among which the Many-Facet Rasch (MFR) model (Linacre, 1989) has been widely applied in many areas, such as writing (e.g., Schaefer, 2008), speaking (e.g., Eckes, 2005), and music performance (e.g., Wind et al., 2016). Considered alongside other approaches to monitoring ratings, such as rater reliability coefficients (e.g., Morgan et al., 2014), kappa coefficients (e.g., Cohen, 1968), or generalizability theory analyses (e.g., Brennan, 2000), the Rasch modeling approach is notably restrictive. In particular, the Rasch measurement framework is characterized by strict requirements for rater judgments of test-taker performances. To meet the requirements of this approach, rater severity must be invariant over all levels of test-taker achievement, and test-taker achievement estimates must be invariant over all raters. Although these requirements are strict, they are useful because they provide a clear framework within which to identify raters whose scoring patterns substantially deviate from the requirements for invariance (Engelhard & Wind, 2018). Although real data are never expected to adhere perfectly to the Rasch model (Smith, 2004), the strict framework allows analysts to empirically evaluate the hypothesis that rater judgments approximate the requirements for measurement (Briggs, 2019).
A natural consequence of acceptable adherence to the Rasch model requirements is that it is possible to obtain estimates of test-taker achievement that are adjusted for differences in rater severity, even when every rater does not rate every test-taker. This adjustment is one of the major motivations for the prevalent use of Rasch models in rater-mediated assessments across disciplines. However, meaningful interpretation of estimates from these analyses is not possible unless there is evidence of acceptable fit for all of the facets in the model. Therefore, it is critical that researchers gather evidence of rater fit in addition to other indicators of rating quality (e.g., indicators of specific types of rater effects such as central tendency or bias) to ensure a sound interpretation and use of model estimates; we discuss this point further below.
Within the framework of Rasch measurement theory, analysts routinely use two rater fit statistics, infit and outfit mean square error (MSE) statistics, to identify rater misfit (see Online Supplement 1 for a review documenting the prevalence of these statistics in applied and methodological research published between 2015 and 2020). Rater misfit occurs when observed ratings deviate from the expectation of the model that is used to estimate parameters of an assessment procedure. Rater misfit makes it difficult to directly compare rater severity with student achievement and other facets on a common scale.
When considering the practical implications of rater fit analyses, it is important to acknowledge that rater fit statistics are different from diagnostic indicators of specific rating patterns such as severity, biases, or central tendency, because fit statistics do not identify the specific characteristics of raters’ unexpected ratings. Accordingly, it is important that analysts use rater fit indices alongside indicators of other rater effects to ensure that such systematic patterns of unexpected ratings can be accurately detected to inform rater remediation, as well as other decisions regarding raters and ratings. Although they do not always point toward specific effects, it is important to include rater fit indices as a routine component of rating quality analyses for several reasons. First, rater fit statistics can alert analysts to raters who may not exhibit the specific types of rating patterns that are captured in other rater effect indices, but nonetheless warrant further investigation or remediation. For example, Wolfe and McVay (2012) used rater fit statistics to identify patterns of “rater inaccuracy,” which they defined as “patterns that cause the assigned ratings to be inconsistent with accurate ratings in an unpredictable way” and thus “poorly represent the true abilities of most examinees” (p. 32). Along the same lines, Wind and Engelhard (2012, 2013) found that these fit statistics are moderately correlated with rater accuracy indices calculated by comparing rater judgments to expert rater judgments.
Moreover, using rater fit statistics alongside rater effect indices can help analysts identify raters who exhibit systematic rater effects that were not directly investigated, such as systematic biases related to a subgroup of students for whom bias was not originally investigated or variation in severity related to time (i.e., rater drift). Because Rasch model rater fit statistics detect deviations between observed and expected ratings, these statistics can help analysts identify potential rater effects that may not be included as part of routine analyses.
Finally, investigating rater fit is important in contexts where analysts use measurement models to adjust estimates of student achievement for differences in rater severity (e.g., when data are collected using incomplete rating designs). In these cases, the model-adjusted estimates of student achievement can only be meaningfully interpreted if there is evidence that raters exhibit acceptable fit to the model. Accordingly, evaluating rater fit is a critical step in ensuring appropriate interpretation and use of student achievement estimates. For a full discussion of this point, please see Wind et al. (2016).
Infit and outfit MSE statistics, which are used to describe discrepancies between observed ratings and expected ratings, are defined, for rater i, as
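In standard Rasch notation, which is assumed here rather than taken from the authors, these statistics can be sketched as follows. The symbols are illustrative: X_{nij} is the observed rating that rater i gives student n on domain j, E_{nij} and W_{nij} are the model-based expectation and variance of that rating, and the sums run over the N_i ratings assigned by rater i.

```latex
\[
z_{nij} = \frac{X_{nij} - E_{nij}}{\sqrt{W_{nij}}}, \qquad
\mathrm{Outfit}_i = \frac{1}{N_i} \sum_{n,j} z_{nij}^{2}, \qquad
\mathrm{Infit}_i = \frac{\sum_{n,j} W_{nij}\, z_{nij}^{2}}{\sum_{n,j} W_{nij}}
                 = \frac{\sum_{n,j} \left( X_{nij} - E_{nij} \right)^{2}}{\sum_{n,j} W_{nij}} .
\]
```

Both statistics have an expected value near 1 when the data fit the model.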
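Both infit and outfit MSE statistics are useful for evaluating model-data fit. Outfit MSE is sensitive to extreme unexpected observations because it is unweighted. Infit MSE is information-weighted: each squared standardized residual is weighted by its model variance, which makes it less sensitive to extreme unexpected observations.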
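The contrast between the unweighted and weighted statistics can be illustrated with a minimal sketch. The function name `rater_fit` and the toy numbers are hypothetical, and the code assumes that the model-expected rating and model variance for each observation have already been obtained from a fitted model.

```python
def rater_fit(observed, expected, variance):
    """Return (infit, outfit) mean square statistics for one rater.

    Each argument is a list over the ratings that the rater assigned:
    observed ratings, model-expected ratings, and model variances.
    """
    sq_resid = [(x - e) ** 2 for x, e in zip(observed, expected)]
    # Outfit: plain average of squared standardized residuals (unweighted).
    outfit = sum(r / w for r, w in zip(sq_resid, variance)) / len(sq_resid)
    # Infit: variance-weighted average, which simplifies to this ratio of sums.
    infit = sum(sq_resid) / sum(variance)
    return infit, outfit


# Made-up ratings: the last one is far from its expectation and has low variance.
obs = [3, 2, 4, 0]
exp = [2.5, 2.0, 3.0, 2.0]
var = [1.0, 0.5, 1.0, 0.25]
infit, outfit = rater_fit(obs, exp, var)
print(round(infit, 3), round(outfit, 4))  # 1.909 4.3125
```

In this toy case the single extreme, low-variance rating inflates outfit far more than infit, which is the sensitivity difference described above.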
To overcome the limitations of rule-of-thumb critical values, researchers have proposed a parametric bootstrap approach to establish critical values for infit and outfit MSE statistics to identify item and person (i.e., subject or examinee) misfit in item response theory (IRT) analyses. For example, Wolfe (2013) conducted a simulation study to demonstrate the efficiency of the parametric bootstrap procedure for identifying critical values for item and person fit statistics. The results showed that bootstrap item and person fit critical values resulted in well-calibrated Type-I error rates when data were simulated to fit a dichotomous Rasch model. Similarly, Seol (2016) demonstrated that bootstrap critical values for infit and outfit MSE statistics depended on sample sizes and test lengths, and thus, one-size-fits-all rule-of-thumb critical values were not appropriate.
These previous investigations of bootstrap critical values for MSE statistics have some limitations. First, they focused on whether bootstrap critical values for infit and outfit MSE statistics can identify person (i.e., subject or examinee) or item misfit. To the best of our knowledge, researchers have not examined the performance of bootstrap critical values for detecting rater misfit in a many-faceted latent trait model context. Rater fit provides information that is distinct from person fit (e.g., examinee fit) and item fit. Specifically, traditional person fit analysis indicates whether a subject has unexpected item-score patterns. For example, an unexpected pattern might occur if an examinee answers an item incorrectly when we expect them to get it correct, or vice versa. Similarly, item fit gives information about the match between the actual responses to individual items and the measurement model expectation. Rater fit indicates how well an individual rater’s ratings fit the expectation of the measurement model. Second, published studies do not include a thorough investigation of the performance of the parametric bootstrap procedure for infit and outfit MSE statistics. Wolfe (2013) only examined the false-positive rates of bootstrap infit and outfit MSE statistics under a single condition with 100 items and 1,000 persons using simulated data based on the dichotomous Rasch model, which limited the generalizability of his findings. Seol (2016) did not report false-positive rates of the bootstrap infit and outfit MSE statistics, although the sample size and test length were varied. Also, neither of these studies examined the true-positive rates of bootstrap infit and outfit MSE statistics. Third, previous studies on the performance of the parametric bootstrap approach were based on the Rasch Rating Scale model or the dichotomous Rasch model. The performance of these statistics should also be judged based on other models, such as the Generalized Rating Scale model that we define in the next section, “Many-Facet Rasch (MFR) Models.” Apart from the above limitations, our motivational example (presented later in the manuscript) demonstrates that the parametric bootstrap approach does not always perform well and may need to be modified.
Purpose
The purposes of this study are twofold: (a) to propose an iterative parametric bootstrap procedure to overcome limitations of the traditional bootstrap method and (b) to determine the false-positive rates and true-positive rates of infit and outfit MSE statistics based on the iterative and traditional parametric bootstrap and rule-of-thumb methods. To assess how well the infit and outfit MSE statistics control false-positive rates, we simulated and fitted raters using a rating scale model version of an MFR (RS-MFR) model that we define in the next section, “Many-Facet Rasch (MFR) Models.” We selected the RS-MFR model because it has been widely applied to rating data in many areas, such as writing (e.g., Engelhard & Myford, 2003), speaking (e.g., Bonk & Ockey, 2003), and music performance (e.g., Wind et al., 2016). The RS-MFR model has some assumptions. For example, the discrimination parameter of raters is fixed at 1. However, in practice, these assumptions may not be met, thus leading to a mismatch between the model and the data. To reflect this model-data mismatch in our simulation procedure and evaluate true-positive rates of infit and outfit MSE, we used a generalized rating scale model version of an MFR (GRS-MFR) model to simulate raters with different discrimination parameters.
We focused our analyses on the following research questions: (a) What are the false-positive rates of infit and outfit MSE statistics using rule-of-thumb, traditional, and iterative parametric bootstrap procedures? (b) What are the true-positive rates of infit and outfit MSE statistics using rule-of-thumb, traditional, and iterative parametric bootstrap procedures when the discrimination parameter for some raters differs from the Rasch model expected value of 1.0? (c) Do the false-positive rates and true-positive rates of infit and outfit MSE statistics change when different data collection designs are used?
This study contributes to the literature in three ways: (a) this is the first study to assess the performance of rule-of-thumb critical values and the traditional parametric bootstrap procedure for infit and outfit MSE statistics in the context of detecting rater misfit; (b) we propose an iterative parametric bootstrap procedure to overcome limitations of its traditional counterpart; and (c) we evaluate the performance of the proposed method in various simulation conditions that reflect realistic data collection procedures for performance assessment contexts.
Many-Facet Rasch (MFR) Models
The Rating Scale model (Andrich, 1978) version of a three-facet MFR model (Linacre, 1989; RS-MFR) with facets for student achievement, rater severity, and domain difficulty can be expressed as
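A common way to write this model is the adjacent-category logit form sketched below. The notation is an assumption made for illustration and is not taken from the authors: P_{nijk} is the probability that rater i assigns student n a rating in category k on domain j, \theta_n is student achievement, \lambda_i is rater severity, \delta_j is domain difficulty, and \tau_k is the threshold between categories k-1 and k.

```latex
\[
\ln\!\left( \frac{P_{nijk}}{P_{nij(k-1)}} \right) = \theta_n - \lambda_i - \delta_j - \tau_k
\]
```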
The Generalized Rating Scale model version of a three-facet MFR model (Linacre, 1989; GRS-MFR) is stated as
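One hedged way to write this generalization multiplies the Rasch logit by a rater-specific discrimination parameter \alpha_i, so that setting \alpha_i = 1 for every rater recovers the rating scale version. The notation is assumed for illustration (\theta_n for student achievement, \lambda_i for rater severity, \delta_j for domain difficulty, \tau_k for the category threshold), and whether the authors apply \alpha_i to the threshold term is also an assumption.

```latex
\[
\ln\!\left( \frac{P_{nijk}}{P_{nij(k-1)}} \right)
  = \alpha_i \left( \theta_n - \lambda_i - \delta_j - \tau_k \right)
\]
```

Under this form, a discrimination value below 1 flattens the category probabilities, so the rater's scores track student achievement less closely than the Rasch model expects.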
Motivational Example
In this motivational example, we examined the performance of the traditional parametric bootstrap procedure in one simulated condition, where each of 15 raters judged 150 students’ work in four domains using 5-point ordinal rating scales. The first rater was set to exhibit misfit using the discrimination parameter. The ratings of this misfitting rater were generated using the GRS-MFR model with a discrimination parameter of 0.2, whereas the ratings from the other raters were generated using the RS-MFR model. Student achievement parameters
We constructed upper and lower critical values for infit and outfit MSE statistics for each rater using the traditional parametric bootstrap procedure (von Davier, 1997; Wolfe, 2013): (a) we fit the RS-MFR model to the original data and obtained student, rater, domain, and threshold parameter estimates
Figure 1 is the plot of infit MSE statistics for 15 raters, where the x-axis shows the rater IDs, and the y-axis the values of the infit MSE statistic. The rater IDs were listed in descending order according to their infit MSE statistics estimated from the original sample. We used different plotting symbols to indicate whether raters were simulated to exhibit misfit (solid triangles) or not (solid dots). The interval bars represent the upper and lower bootstrap critical values for the infit MSE statistic of each rater. Based on these intervals, Rater 1 was identified correctly as a misfitting rater, whose value of infit MSE statistic of 2.03 was higher than the upper critical value (1.13). However, Raters 6, 9, 12, 13, 14, and 15 were incorrectly classified as showing substantial misfit. The values of the infit MSE statistics for these raters were 0.83, 0.86, 0.88, 0.85, 0.87, and 0.88, respectively, which were smaller than the lower limits of the corresponding critical values (i.e., 0.88, 0.89, 0.89, 0.89, 0.90, and 0.90, respectively). Similar results can be found in Figure 2, which shows outfit MSE statistics. These results indicated that the critical values constructed using the traditional bootstrap procedure might not be able to flag raters correctly.

Figure 1. A plot of infit MSE statistic for 15 raters with bootstrap 95% critical values.

Figure 2. A plot of outfit MSE statistic for 15 raters with bootstrap 95% critical values.
An Iterative Parametric Bootstrap Procedure
In Rasch model analyses, researchers have documented that the presence of a poor-fitting rater (or item) with large MSE statistics influences the estimates of MSE statistics for good-fitting raters, such that they are lower than 1 because the means of infit and outfit MSE statistics are usually forced to be near 1 (Linacre, 2019; Su et al., 2007). In addition, fit statistics are sample-dependent because the residuals from which they are calculated reflect model expectations, which are based in part on poor-fitting raters. This sample-dependent nature of MSE fit statistics implies that those good-fitting raters may be incorrectly flagged as “misfit.” As a result, the traditional bootstrap critical values are not appropriate. To make this more concrete, consider Rater 12 from the motivational example, who was simulated to be a fitting rater. The estimated infit MSE statistic for Rater 12 was 0.88, while the lower critical value identified using the traditional bootstrap procedure was 0.89. This rater was flagged as a misfitting rater, which was not correct because the estimation of infit MSE for Rater 12 was influenced by Rater 1, who was simulated to exhibit misfit and had a large value of infit MSE (2.03).
One solution to this problem is to remove poor-fitting raters from the analysis and recalibrate (Su et al., 2007). Since “underfit [MSE fit statistics greater than 1] is a much greater threat to measurement than overfit [MSE fit statistics less than 1]” (Linacre, 2018), we propose the following iterative parametric bootstrap procedure in which raters with high values of MSE statistics are removed and fit statistics are recalculated. Step 1: Apply the traditional parametric bootstrap procedure to the original sample and calculate the upper limit of the traditional parametric bootstrap critical values for infit and outfit MSE statistics. Flag the raters whose infit or outfit MSE statistic calculated from the original sample is higher than the upper limit of the bootstrap critical values. Step 2: If some raters are flagged in Step 1, remove these flagged raters and apply the traditional parametric bootstrap procedure to the remaining data. Otherwise, no action is needed.
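The control flow of the two steps can be sketched as follows. This is an illustration under stated assumptions, not the authors' R code: the `bootstrap` argument is a hypothetical callable that stands in for fitting the model and running the traditional parametric bootstrap on a set of raters, `toy_bootstrap` uses made-up numbers, and the rule that the second pass flags raters outside either limit is an assumption.

```python
from typing import Callable, Dict, Sequence, Set, Tuple

# (observed MSE, lower critical value, upper critical value) for one rater
Fit = Tuple[float, float, float]
Bootstrap = Callable[[Sequence[str]], Dict[str, Fit]]


def outside(fits: Dict[str, Fit]) -> Set[str]:
    """Raters whose observed MSE lies outside their bootstrap interval."""
    return {r for r, (mse, low, high) in fits.items() if mse < low or mse > high}


def iterative_flags(raters: Sequence[str], bootstrap: Bootstrap) -> Set[str]:
    # Step 1: traditional bootstrap on all raters; find raters above the upper limit.
    step1 = bootstrap(raters)
    too_high = {r for r, (mse, _low, high) in step1.items() if mse > high}
    if not too_high:
        # No rater exceeds the upper limit, so the Step 1 results stand.
        return outside(step1)
    # Step 2: remove those raters and rerun the bootstrap on the remaining data.
    remaining = [r for r in raters if r not in too_high]
    step2 = bootstrap(remaining)
    return too_high | outside(step2)


def toy_bootstrap(raters: Sequence[str]) -> Dict[str, Fit]:
    """Hypothetical stand-in: the presence of R1 depresses the other raters' MSE."""
    shift = -0.10 if "R1" in raters else 0.0
    fits = {r: (0.97 + shift, 0.89, 1.12) for r in raters if r != "R1"}
    if "R1" in raters:
        fits["R1"] = (2.03, 0.88, 1.13)
    return fits


flags = iterative_flags(["R1", "R2", "R3"], toy_bootstrap)
print(sorted(flags))  # ['R1']
```

With the toy stand-in, a single pass would flag all three raters, whereas the second pass clears R2 and R3 once R1 has been removed.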
To be more explicit, we used the data in the motivational example to illustrate how the iterative parametric bootstrap procedure works, taking the results for the infit MSE statistic as an example. In the motivational example, we applied the traditional parametric bootstrap procedure to the original sample and flagged Rater 1, whose infit MSE statistic (2.03) was higher than the upper limit of the critical values (1.13). We removed Rater 1 and conducted the parametric bootstrap procedure on the remaining data. The infit MSE statistics of the remaining 14 raters (Raters 2–15) and corresponding critical values were recalculated.
From Table 1, we observed that Raters 2 to 15 all had acceptable fit in Step 2, which was what we expected since we simulated these raters to fit the model. Note that if we only conducted Step 1 (i.e., the traditional parametric bootstrap procedure), Raters 6, 9, 12, 13, 14, and 15 would be incorrectly classified as misfitting raters in addition to Rater 1. After Step 2, these raters had acceptable fit.
Table 1. Infit MSE and 95% Critical Values Based on Iterative Parametric Bootstrap Procedure.
Note. MSE = mean square error. Bold-faced values indicate misfitting raters identified by the traditional parametric bootstrap procedure.
Simulation Study
Design
Manipulated variables
The rater sample size was set to three levels: N = 15, 30, or 60, which reflects the sample sizes reported in previous real-data and simulation studies of rater-mediated assessments (e.g., Wind & Engelhard, 2012; Wind & Guo, 2019). Under all conditions, we generated ratings for the good-fitting raters using the RS-MFR. Under those conditions with misfitting raters, either 5% or 10% of raters were simulated to exhibit misfit. We used the GRS-MFR to simulate misfitting raters. Specifically, the discrimination parameter for misfit rater i was drawn from U[0.4, 0.8] or U[0, 0.4], representing weak or strong misfit, respectively. The discrimination parameter for the ith misfit rater was below 1, which is the discrimination parameter of good-fitting raters. In practice, this type of misfit could occur if, for example, a rater did not completely understand the rubric and thus applied it less precisely. Finally, we considered two data collection designs: complete or incomplete with systematic links. In the complete rating design, we simulated raters’ ratings of all students on every domain. In the incomplete design with systematic links, each student was rated by two raters and each rater rated students in common with two other raters. These two designs are commonly used in rater-mediated assessments, such as language testing and music performance assessment (e.g., Wesolowski et al., 2015; Wind & Engelhard, 2013).
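A minimal sketch of how ratings might be generated under these conditions is shown below. It is illustrative only: the function names, the threshold values, the generating distributions for student, rater, and domain parameters, the rounding of the number of misfitting raters, and the specific linking pattern are all assumptions rather than the authors' settings. The category probabilities use a generalized rating scale form in which a discrimination value of 1.0 gives the RS-MFR model.

```python
import math
import random


def category_probs(theta, severity, difficulty, thresholds, alpha=1.0):
    """Probabilities of ratings 0..K; alpha = 1.0 gives the rating scale form."""
    logits = [0.0]
    for tau in thresholds:
        logits.append(logits[-1] + alpha * (theta - severity - difficulty - tau))
    peak = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(v - peak) for v in logits]
    total = sum(weights)
    return [w / total for w in weights]


def linked_design(n_students, n_raters):
    """Two raters per student; rater r shares students with raters r-1 and r+1."""
    return [(s % n_raters, (s % n_raters + 1) % n_raters) for s in range(n_students)]


def simulate(n_raters=15, ratio=10, n_domains=4, misfit_share=0.10,
             alpha_range=(0.4, 0.8), complete=True, seed=1):
    rng = random.Random(seed)
    n_students = ratio * n_raters
    thresholds = [-1.5, -0.5, 0.5, 1.5]                           # assumed, 5 categories
    theta = [rng.gauss(0, 1) for _ in range(n_students)]          # assumed N(0, 1)
    severity = [rng.uniform(-1, 1) for _ in range(n_raters)]      # assumed range
    difficulty = [rng.uniform(-1, 1) for _ in range(n_domains)]   # assumed range
    n_misfit = round(misfit_share * n_raters)                     # assumed rounding
    # Misfitting raters get a discrimination below 1; all others are fixed at 1.
    alpha = [rng.uniform(*alpha_range) if r < n_misfit else 1.0
             for r in range(n_raters)]
    pairs = linked_design(n_students, n_raters)
    data = []
    for s in range(n_students):
        for r in (range(n_raters) if complete else pairs[s]):
            for d in range(n_domains):
                p = category_probs(theta[s], severity[r], difficulty[d],
                                   thresholds, alpha[r])
                rating = rng.choices(range(len(p)), weights=p)[0]
                data.append((s, r, d, rating))
    return data, alpha


data, alpha = simulate(n_raters=15, complete=False)
print(len(data))  # 1200
```

With 15 raters and a 10:1 ratio, the incomplete design yields 150 students x 2 raters x 4 domains = 1,200 ratings, compared with 9,000 under the complete design.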
Variables held constant
In all conditions, we used a student-to-rater ratio of 10:1 to reflect previous studies in which there were many more students than raters (Brown et al., 2004; Wolfe et al., 2010). To match previous studies (e.g., Wolfe & McVay, 2012), we selected student achievement parameters
Data Analysis
We used R (R Core Team, 2018) to generate the data and the Facets software program (Linacre, 2015) to analyze the generated data according to the RS-MFR model. We examined the false-positive rates and the true-positive rates for raters flagged as misfitting or fitting for the infit and outfit MSE statistics based on the iterative parametric bootstrap, traditional parametric bootstrap, and rule-of-thumb approaches.
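As a small illustration of the two outcome measures, the sketch below computes them for a single data set. The function name `classification_rates` is hypothetical, and treating the rates as proportions of fitting and misfitting raters within one data set is an assumption; in a simulation study they would be aggregated over replications. The example input uses the traditional bootstrap flags reported for the infit MSE statistic in the motivational example.

```python
def classification_rates(flagged, misfitting, all_raters):
    """Return (false-positive rate, true-positive rate) for one data set."""
    flagged, misfitting = set(flagged), set(misfitting)
    fitting = set(all_raters) - misfitting
    # False positives: raters simulated to fit the model that were flagged anyway.
    fpr = len(flagged & fitting) / len(fitting) if fitting else float("nan")
    # True positives: raters simulated to misfit that were flagged.
    tpr = len(flagged & misfitting) / len(misfitting) if misfitting else float("nan")
    return fpr, tpr


raters = [f"R{i}" for i in range(1, 16)]
traditional = ["R1", "R6", "R9", "R12", "R13", "R14", "R15"]
fpr, tpr = classification_rates(traditional, ["R1"], raters)
print(round(fpr, 3), tpr)  # 0.429 1.0
```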
Results
Before we explored the rater fit results, we checked the estimates of rater severity in our simulated data and verified that they reflected the intended characteristics. To highlight how our focus on rater fit corresponds to other rater effects, we also evaluated the raters using specific rater effect indices for severity/leniency and centrality/extremism. Online Supplement 2 includes these results.
False-Positive Rates: No Misfit Raters Exist
Table 2 displays the false-positive rates of infit and outfit MSE statistics based on the iterative parametric bootstrap and the traditional parametric bootstrap when all raters were good-fitting raters. In general, for these simulated data, infit and outfit MSE statistics based on both the iterative and traditional parametric bootstrap procedures had well-calibrated false-positive rates across all conditions. However, the false-positive rates of these two statistics in the conditions in which we simulated complete ratings were closer to the nominal significance level (i.e., 0.05) compared to the conditions in which we simulated incomplete ratings.
Table 2. False-Positive Rates: When No Misfit Raters Exist.
Note. MSE = mean square error.
False-Positive Rates: Discrimination Misfit Raters Exist
The results in Table 3 showed that under the complete design, the performance of the iterative parametric bootstrap procedure was excellent. Specifically, the false-positive rates of infit and outfit MSE statistics were close to 0.05. In contrast, the traditional parametric bootstrap procedure resulted in inflated false-positive rates of infit and outfit MSE statistics. For example, the false-positive rates were as high as 0.34 when the degree of misfit was weak and 0.99 when it was strong. The false-positive rates also increased as the rater sample size or the percentage of misfitting raters increased.
Table 3. False-Positive Rates: When Discrimination Misfit Raters Exist.
Note. MSE = mean square error.
In the conditions in which we simulated incomplete ratings, overall, the results suggest that false-positive rates of infit and outfit MSE statistics were well controlled in both the iterative and traditional parametric bootstrap methods. However, the false-positive rates of outfit MSE with the iterative parametric bootstrap method (0.04 ≤ false-positive rates of outfit MSE ≤ 0.05) were closer to 0.05 than with the traditional parametric bootstrap method (0.03 ≤ false-positive rates of outfit MSE ≤ 0.13). For both parametric bootstrap methods, in most conditions, false-positive rates of the infit MSE statistic were higher than or equal to those of the outfit MSE statistic.
True-Positive Rates: Discrimination Misfit Raters Exist
From Table 4, we observed that under the complete rating design, the critical values established based on both the iterative and the traditional parametric bootstrap methods resulted in high true-positive rates, but when the degree of misfit was weak and the rater sample size was small, the iterative parametric bootstrap outperformed its traditional counterpart. Whichever bootstrap procedure was used, the outfit MSE statistic had a slight advantage over the infit MSE statistic in terms of detecting true misfitting raters when the degree of misfit was weak and the rater sample size was small.
Table 4. True-Positive Rates: When Discrimination Misfit Raters Exist.
Note. MSE = mean square error.
Although the iterative and the traditional parametric bootstrap methods produced high true-positive rates, especially in the presence of a strong degree of misfit, the iterative parametric bootstrap procedure was superior because it controlled the false-positive rates best. Under the incomplete rating design, in most conditions, the iterative parametric bootstrap procedure produced higher true-positive rates of infit and outfit MSE statistics (0.36 ≤ true-positive rates of infit MSE ≤ 0.98; 0.26 ≤ true-positive rates of outfit MSE ≤ 0.96) than its traditional counterpart (0.32 ≤ true-positive rates of infit MSE ≤ 0.98; 0.26 ≤ true-positive rates of outfit MSE ≤ 0.93). The true-positive rates were higher when the degree of misfit was strong than when it was weak. For brevity, we provide the false-positive rates and true-positive rates for the infit and outfit MSE statistics produced by the rule-of-thumb critical values in Online Supplement 3.
Discussion
Researchers who evaluate rater fit in performance assessments using an MFR model approach often rely on rule-of-thumb critical values or critical values obtained from traditional parametric bootstrap procedures to classify raters as “fitting” or “misfitting.” In this study, we illustrated a limitation of the traditional parametric bootstrap procedure for constructing critical values for infit and outfit MSE statistics using MFR models. To overcome the limitation, we proposed an iterative parametric bootstrap procedure for evaluating rater fit and compared the iterative parametric bootstrap procedure with the traditional parametric bootstrap procedure and rule-of-thumb critical values in terms of the false-positive and true-positive rates under a variety of simulated conditions.
Complete Rating Design
In our study, we observed that when all raters were simulated to exhibit acceptable fit, the parametric bootstrap and rule-of-thumb critical values yielded well-controlled false-positive rates, which is consistent with Wolfe (2013). However, in practice, researchers do not know whether misfitting raters exist, and to the best of our knowledge, researchers have not previously examined false-positive rates of MSE statistics when data include some poor-fitting raters. Our study indicated that under the complete rating design, the traditional parametric bootstrap procedure and rule-of-thumb critical values had inflated false-positive rates when rater misfit exists. In contrast, the proposed iterative parametric bootstrap procedure produced false-positive rates close to the nominal significance level under all simulation conditions. This suggests that the iterative parametric bootstrap procedure outperforms the traditional bootstrap procedure and rule-of-thumb method in terms of controlling the false-positive rates in general.
Another limitation in many previous studies is that the true-positive rates of the MSE statistics using traditional bootstrap and rule-of-thumb methods have not been systematically documented (Seol, 2016; Su et al., 2007; Wolfe, 2013). Our simulation study showed that the traditional bootstrap method produced higher true-positive rates than the rule-of-thumb method under the complete rating design. Our study also showed that the iterative parametric bootstrap method yielded similar or better true-positive rates in simulated conditions compared to the traditional bootstrap method for both infit and outfit MSE statistics. It is important to note that regardless of which of these three approaches was used, true-positive rates of the outfit MSE statistic were higher than those of the infit MSE statistic.
Incomplete Rating Design
Our finding that the iterative parametric bootstrap procedure performs better with complete ratings is unsurprising, since a complete design provides more evidence with which to detect rater misfit. Nonetheless, the iterative parametric bootstrap under the incomplete design still performed reasonably well, with well-controlled false-positive rates and slightly higher true-positive rates. Compared to the traditional parametric bootstrap method and rule-of-thumb critical values, the iterative parametric bootstrap procedure exhibits high true-positive rates while maintaining false-positive rates at the nominal level under almost every simulation condition.
Implications
Our findings have some implications for research and practice. First, this study provides insight into the performance of the traditional parametric bootstrap procedure and rule-of-thumb critical values in evaluating rater fit. To the best of our knowledge, ours is the first study that systematically investigates their performance. Our findings suggest that researchers and practitioners should be cautious about using the traditional bootstrap procedure to identify critical values for infit and outfit MSE statistics, although this approach has been advocated before (Seol, 2016; Su et al., 2007; Wolfe, 2013). In addition, our findings support previous researchers’ admonition that rule-of-thumb critical values for infit and outfit MSE statistics be used with caution (Seol, 2016; Smith et al., 1998; Wolfe, 2013). Although our findings showed that the traditional parametric bootstrap procedure and rule-of-thumb approach yielded high true-positive rates in some simulation conditions, practitioners should use them with care because these two approaches had inflated false-positive rates in some conditions.
Second, we proposed an iterative bootstrap procedure, which might serve as an attractive alternative to its traditional counterpart and the rule-of-thumb critical values. The iterative bootstrap procedure is easy to carry out. The R code for this procedure is available from the first author upon request. In this study, we demonstrated how researchers can apply the iterative parametric bootstrap method to obtain empirical critical values for the infit and outfit MSE statistics. Researchers and practitioners may adjust the nominal significance level and use different quantiles than 2.5% and 97.5% based on their own needs. Also, although we presented a feasible approach to flag misfitting raters, analysts should not rely solely on results from any single statistical technique to evaluate raters. Instead, we encourage analysts to incorporate a variety of analyses, as well as experience and opinions from experts to make a final decision as to whether raters should be removed, retrained, remediated, and so on.
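To illustrate how the nominal level maps onto quantiles, the sketch below derives lower and upper critical values from a list of bootstrap MSE values for one rater. The helper names are hypothetical, the linear-interpolation quantile rule is an assumption (other quantile definitions give slightly different limits), and the input values are made up.

```python
def quantile(values, q):
    """Linear-interpolation quantile of a non-empty list, with 0 <= q <= 1."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    frac = pos - lower
    return ordered[lower] * (1 - frac) + ordered[upper] * frac


def critical_values(bootstrap_mse, alpha=0.05):
    """Lower and upper limits at the alpha/2 and 1 - alpha/2 quantiles."""
    return quantile(bootstrap_mse, alpha / 2), quantile(bootstrap_mse, 1 - alpha / 2)


# Made-up bootstrap distribution of one rater's infit MSE: 0.80, 0.81, ..., 1.20
boot = [i / 100 for i in range(80, 121)]
low95, high95 = critical_values(boot)              # 2.5% and 97.5% quantiles
low90, high90 = critical_values(boot, alpha=0.10)  # 5% and 95% quantiles
```

Raising alpha narrows the interval, which flags more raters; lowering it has the opposite effect.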
In terms of practical implications, it is important to note that our study focused on rater fit as evaluated using MSE fit statistics. This perspective on rater fit may include, but is not limited to, specific types of rater effects. As we noted earlier in the manuscript and in Online Supplement 2, it is important to include indicators of rater fit alongside other indicators of rating quality, including diagnostic checks for specific types of rater effects (e.g., severity/leniency, centrality/extremism, bias), when evaluating ratings in performance assessment systems. Using effect-specific indices in addition to rater fit statistics provides a more comprehensive approach to evaluating rating quality than using either approach in isolation.
Limitations and directions for future research
Our study has some limitations that warrant consideration in future research. First, running the iterative bootstrap procedure is relatively inefficient in terms of computational time, particularly in simulation studies. For example, when the rater, student, and bootstrap sample sizes are 60, 600, and 200, respectively, the procedure takes about 3 hours 12 minutes to perform one replication. Fortunately, parallel computing can dramatically reduce the computing time. For example, on a computer with an 8-core CPU supporting hyper-threading, the computing time for a single data set would be about 12 minutes. Even so, in future studies, researchers should develop time-saving procedures. Second, the simulation conditions considered in our study do not reflect the full scope of rater-mediated performance assessments. For example, in real situations, other rater effects likely exist, such as centrality and differential rater functioning. In future studies, researchers may evaluate the performance of the iterative parametric bootstrap procedure when other rater effects exist. In addition, the focus of this study was rater misfit, but in practice, item and person misfit may also exist. Future research may evaluate the performance of the iterative parametric bootstrap approach when these types of misfit exist simultaneously. Third, we assessed the performance of the parametric bootstrap procedure. However, a nonparametric bootstrap procedure has been developed to establish 95% critical values for infit and outfit MSE statistics to evaluate item fit (Su et al., 2007). Future studies may assess rater fit using the nonparametric bootstrap method.
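The timing figures above are consistent with near-linear scaling: about 3 hours 12 minutes is 192 minutes, and 192 / 12 = 16, which matches the 16 logical threads of an 8-core hyper-threaded CPU. Bootstrap replicates are independent of one another, so they can be distributed across workers. The sketch below illustrates that idea under stated assumptions: toy_replicate is a hypothetical placeholder that only draws a seeded random number, whereas a real replicate would simulate ratings from the fitted model, refit it, and return fit statistics. It is not the authors' implementation.

```python
import random
from concurrent.futures import Executor, ProcessPoolExecutor, ThreadPoolExecutor
from typing import Callable, List, Optional, Sequence


def toy_replicate(seed: int) -> float:
    """Hypothetical stand-in for one bootstrap replicate (no model is fitted)."""
    rng = random.Random(seed)             # per-replicate seed keeps results reproducible
    return rng.gauss(1.0, 0.2)


def run_replicates(
    seeds: Sequence[int],
    worker: Callable[[int], float] = toy_replicate,
    max_workers: Optional[int] = None,
    executor_cls: Callable[..., Executor] = ProcessPoolExecutor,
) -> List[float]:
    """Run one replicate per seed across a pool of workers.

    Executor.map returns results in the order of the seeds, so the output
    does not depend on which worker finishes first.
    """
    with executor_cls(max_workers=max_workers) as pool:
        return list(pool.map(worker, seeds))


# Quick check with threads; CPU-bound work would normally use the default
# process pool to benefit from multiple cores.
seeds = list(range(20))
serial = [toy_replicate(s) for s in seeds]
pooled = run_replicates(seeds, max_workers=4, executor_cls=ThreadPoolExecutor)
```

When the default process pool is used on platforms that start workers by spawning a new interpreter, the call to run_replicates should sit under an `if __name__ == "__main__":` guard in a script.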
Supplemental Material
Supplemental material for An Iterative Parametric Bootstrap Approach to Evaluating Rater Fit by Wenjing Guo and Stefanie A. Wind in Applied Psychological Measurement:
sj-pdf-1-apm-10.1177_01466216211013105
sj-pdf-2-apm-10.1177_01466216211013105
sj-pdf-3-apm-10.1177_01466216211013105
sj-pdf-4-apm-10.1177_01466216211013105
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
Supplemental Material
Supplementary material is available for this article online.
References
