Abstract
Differences in rater judgments that are systematically related to construct-irrelevant characteristics threaten the fairness of rater-mediated writing assessments. Accordingly, it is essential that researchers and practitioners examine the degree to which the psychometric quality of rater judgments is comparable across test-taker subgroups. Nonparametric procedures for exploring these differences are promising because they allow researchers and practitioners to examine important characteristics of ratings without potentially inappropriate parametric transformations or assumptions. This study illustrates a nonparametric method based on Mokken scale analysis (MSA) that researchers and practitioners can use to identify and explore differences in the quality of rater judgments between subgroups of test-takers. Overall, the results suggest that MSA provides insight into differences in rating quality across test-taker subgroups based on demographic characteristics. Differences in the degree to which raters adhere to basic measurement properties suggest that the interpretation of ratings may vary across subgroups. The implications of this study for research and practice are discussed.
One of the most important tasks in research and practice related to rater-mediated writing assessments is evaluating the degree to which the assessment procedure is fair for all test-takers. From a psychometric perspective, rater-mediated writing assessments are fair when each test-taker has the same opportunity to demonstrate their level of achievement related to a particular construct (AERA, APA, & NCME, 2014). In other words, fairness occurs when rater judgments of test-taker performances are not systematically influenced by construct-irrelevant variables, such as biases related to test-takers’ demographic characteristics. Because fairness is a key concern in research and practice related to rater-mediated writing assessments, many researchers have proposed methods for identifying threats to fairness related to these assessments. Among the most popular methods are techniques related to identifying “rater effects,” such as rater severity/leniency, rater accuracy, and rater biases related to construct-irrelevant test-taker characteristics (i.e., differential rater functioning; DRF; Engelhard, 1994; Myford & Wolfe, 2003; Wolfe & McVay, 2012).
A popular technique for detecting rater effects in research and practice related to rater-mediated language assessments is to use parametric item response theory (IRT) models, such as Rasch models (Wind & Peterson, 2017). IRT models, including Rasch models, are considered parametric when the modeling procedure involves transforming ordinal ratings to an interval-level scale, and prescribing a specific parametric form (e.g., the logistic or normal ogive) for the relationship between estimates of test-taker achievement and rating scale categories. In order to interpret the results from any parametric latent trait model analysis of rater-mediated language assessments, including Rasch analyses, a number of assumptions must be met. For example, the estimates of student achievement, rater severity, and other facets depend on the assumption that a specific parametric form, namely the logistic ogive, is the best representation of the relationship between student achievement and the probability for ratings in each category (i.e., the response function). Parametric IRT models also depend on assumptions related to the dimensionality of an assessment procedure, or the number of underlying constructs that can be used to explain student responses, and assumptions about local independence, or the degree to which responses to individual items are statistically independent from responses to other items, after controlling for the construct or constructs that the assessment measures. Another important consideration in the use of parametric IRT models is that the sample must be large enough for the estimates of student achievement, rater severity, and other facets to be precise and stable for the entire sample. Without adherence to these assumptions, the resulting estimates from the parametric models do not have a clear interpretation and may not hold over replications of the assessment procedure.
Relatedly, these strong assumptions may lead individuals to remove students, raters, or items from an analysis if they exhibit violations of model assumptions for certain ranges of student achievement, but have acceptable and useful psychometric properties for other ranges of student achievement (Meijer & Baneke, 2004; Santor & Ramsay, 1998; Wind, 2018). Because of these strong assumptions, a number of researchers have expressed reservations about the application of parametric models to item response data in the social and behavioral sciences owing to the complex nature of response processes in these contexts (e.g., Meijer, Tendeiro, & Wanders, 2015; Molenaar, 2001; Sijtsma & Meijer, 2007). Such reservations are certainly warranted in rater-mediated performance assessments in which raters use complex judgmental processes informed by a variety of cues (e.g., rubrics, rating scales, student performances) to arrive at ratings (Myford, 2012). In sum, although it is mathematically possible to transform raters’ ordinal judgments to an interval scale, this transformation may not be theoretically appropriate for modeling ratings that result from complex judgment procedures.
As an alternative approach to parametric models, it is possible to use nonparametric scaling procedures to explore the psychometric characteristics of rater-mediated writing assessments. The key distinction between nonparametric IRT models and parametric IRT models is that response functions for nonparametric IRT models are based on ordering restrictions, rather than a specific parametric function (Sijtsma & Meijer, 2007). Compared to classical test theory methods such as kappa coefficients, nonparametric methods that are based on a theoretical framework allow individuals to evaluate rating quality using clear guidelines, with somewhat less restrictive assumptions about the relationship between student achievement and the probability for ratings in particular categories. Nonetheless, many nonparametric IRT models share key assumptions with parametric IRT models related to dimensionality and local independence. Because of these shared assumptions, one can use nonparametric IRT models to evaluate educational assessments for adherence to important measurement properties (Wind, 2017b).
Nonparametric methods are useful in assessment contexts in which interval-level estimates of student achievement, rater severity, and other facets are not needed for the intended interpretations, but ordinal scores are sufficient. For example, non-computer-adaptive assessments, assessments from which results will not be used to perform equating procedures, assessments in which the relative ordering of student achievement is sufficient to inform the intended interpretation and use, and rater training procedures may be appropriate contexts for the use of nonparametric IRT analyses. This study illustrates the use of a nonparametric procedure based on Mokken scale analysis (MSA) (Mokken, 1971) to explore systematic differences in rating quality related to test-taker characteristics. Briefly, MSA is a nonparametric scaling procedure that allows data analysts to consider many of the same issues as Rasch models, including the degree to which rater severity is consistent for all subgroups of students (i.e., invariance), but without parametric assumptions (discussed further below). MSA was originally presented as a method for scaling affective variables in political science. However, the relationship between MSA and invariant measurement also makes it a promising framework in which to explore rating quality in language assessments (Wind & Engelhard, 2016).
Purpose
The purpose of this study is to use a nonparametric IRT procedure to explore the degree to which there are differences in rating quality across subgroups of test-takers in the context of a rater-mediated writing assessment. This study contributes to previous research in two main ways. First, although researchers from a variety of disciplines have used MSA techniques to develop, revise, and interpret the results from measurement instruments (Sijtsma & Molenaar, 2002), only a few studies have been published related to the application of MSA to educational assessments. Second, only a few researchers have used MSA to consider systematically the differences across subgroups of test-takers (discussed further below), and these procedures have not been explored within the context of educational achievement testing.
What is Mokken scale analysis?
Mokken scale analysis (MSA; Mokken, 1971) is a nonparametric scaling procedure that researchers and practitioners can use to evaluate the psychometric characteristics of social science measurement instruments without applying potentially inappropriate parametric transformations to ordinal data, such as ratings obtained from rater-mediated writing assessments. It is based on a set of exploratory techniques that individuals can use to consider the degree to which their measurement procedures adhere to basic ordering properties that are essential to the interpretation of total scores. Mokken described his nonparametric modeling procedure as a relatively conservative approach to evaluating measurement instruments that is potentially better suited than parametric approaches to data collected using social science measurement instruments. He noted that, compared to parametric IRT models, MSA models “are less specific in their assumptions and more in harmony with our limited knowledge concerning the data” (Mokken, 1971, p. 173). Accordingly, many researchers have used MSA to explore constructs characterized by complex response processes, such as affective variables, either prior to the application of a parametric model or as a relatively conservative standalone analytic procedure (Meijer et al., 2015).
It is important to note that, although they are less strict than parametric models in terms of their underlying assumptions and scale transformations, Mokken scaling models are not free from assumptions. Instead, they require adherence to several key requirements, including a consistent ordering of test-takers across items, and a consistent ordering of items across test-takers. Using MSA techniques, researchers can use indices of model–data fit to identify individual test-takers or items that violate these requirements and thus may require additional investigation. For a didactic introduction to MSA, readers are encouraged to consult the tutorials published by Sijtsma and van der Ark (2017) and Wind (2017b).
Mokken models for ratings
In his original presentation of MSA, Mokken (1971) presented scaling procedures for dichotomous items only. Later, Molenaar (1982, 1997) presented polytomous versions of Mokken’s original scaling models that are appropriate for ordinal responses in more than two categories, such as attitude surveys with Likert-type response formats. Recently, Wind (2017a) presented an adaptation of Molenaar’s scaling procedures in which a slightly different approach is used to calculate rating scale category probabilities than the original version of the models. Specifically, Molenaar’s models are based on a cumulative probability formulation, where the probability for a rating in a given category is defined as the probability for a rating in the category of interest or any higher category. In contrast, Wind’s adjacent-categories adaptation of MSA scaling procedures (ac-MSA) are based on adjacent-categories probabilities, where the probability for a rating in a given category is defined as the probability for a rating in the category of interest, rather than the category just below it. Andrich (2015) summarized the substantive implications of these different probability formulations in the context of educational performance assessments as follows: The [cumulative probabilities] model specifies the probability that a person will be classified above any threshold … This does not seem consistent with performance assessment—judges locate a performance in one of the categories, not in and beyond any particular category. (p. 6)
The ac-MSA version of Mokken scaling is based on an interpretation of rating scale categories that is more closely aligned with the interpretation of rating scale categories in rater-mediated educational performance assessments than Molenaar’s cumulative models, and also provides more informative diagnostic information related to individual raters than the cumulative models (Wind & Schumacker, 2017). Recently, Wind and Engelhard (2017) compared ac-MSA and Many-Facet Rasch (MFR) model (Linacre, 1989) techniques for identifying rater effects related to rater severity, leniency, noisy ratings (i.e., haphazard ratings), and muted ratings (i.e., overly predictable ratings). In that study, the researchers observed that the MFR model and ac-MSA indicators identify many of the same raters related to these effects, but that ac-MSA provides some additional insight into rating quality that is not captured in Rasch statistics. The current study continues the investigation of ac-MSA to evaluate rating quality, but focuses on its usefulness as a tool for identifying differences in rating quality between student subgroups.
Comparing measurement quality across subgroups with Mokken scale analysis
When individuals apply parametric models such as Rasch models to rater-mediated assessments, they often examine the degree to which model estimates (e.g., rater severity) and the quality of these estimates (e.g., rater fit) are consistent between student subgroups. For example, researchers often use the MFR model (Linacre, 1989) to examine the degree to which rater judgments are similar between student subgroups. This study demonstrates how one can use MSA techniques to evaluate the degree to which rating quality is consistent between subgroups of test-takers in a rater-mediated writing assessment. The approach that is illustrated in this study provides a nonparametric alternative to parametric analyses, such as MFR models, through which it is possible to compare measurement quality between subgroups, without potentially inappropriate parametric model assumptions.
In previous studies, researchers have used MSA to compare the psychometric quality of attitude surveys between two or more subgroups of respondents (e.g., Gesthuizen, Scheepers, van der Veld, & Völker, 2013; Gillespie, Tenvergert, & Kingma, 1988; Quaranta, 2013; Van Der Veer, Yakushko, Ommundsen, & Higler, 2011; van Schuur, 2003). As pointed out by Molenaar and Sijtsma (2000), MSA techniques for evaluating measurement equivalence across subgroups focus on three questions: “(a) Does the measurement model fit equally well across subgroups?; (b) Are scale score and item scores similarly distributed?; and (c) Are there specific items for which the overall difficulty order is violated in one or more of the subgroups?” (p. 89). These differences can be described as a sort of interaction between subgroup membership and measurement quality that helps researchers identify items that warrant further investigation.
In general, MSA procedures for comparing the psychometric quality of scales across subgroups involve comparing indicators of measurement quality within each subgroup of test-takers. For example, Quaranta (2013) applied MSA to evaluate the equivalence of a political protest scale across 20 countries in Western Europe. In order to compare the quality of the scale across these national contexts, this researcher calculated indicators of measurement quality based on MSA using data from the complete sample and separately for the respondents within each country. Then, Quaranta compared the results from the separate analyses and considered the implications of differences in measurement quality. Researchers have used similar approaches to compare the psychometric properties of affective scales in a variety of contexts, including attitudes toward abortion (Gillespie et al., 1988), confidence in national institutions (van Schuur, 2003), measures of xenophobia (Van Der Veer et al., 2011), and perceptions of cultural capital (Gesthuizen et al., 2013). However, there are limited studies in which researchers have used MSA to compare differences in measurement quality between subgroups of test-takers in educational assessment procedures, including rater-mediated writing assessments.
Methods
This study presents an illustrative analysis of essay ratings from the Georgia High School Writing Test. The data include ratings from 20 raters on 365 essays composed by eighth-grade test-takers (nfemale = 171, nmale = 194). The raters rated students’ essays using a four-category scale (recoded to 0 = low, 3 = high) in four domains: Conventions, Organization, Sentence Formation, and Style. Before they could participate in operational scoring, all of the raters were required to complete a training program in which they learned how to apply the analytic scoring rubric to student essays. The training program materials were designed to reflect the specific prompt that was used in the administration of the writing assessment that the raters scored. At the end of the training program, the raters were required to pass a certification exam in which they demonstrated that they could accurately apply the analytic rubric as intended. All the raters rated all the students, such that the rating design was fully connected. In this analysis, I focused on the Conventions and Style domains in order to present a simple illustration. I selected these two domains to highlight unique aspects of writing proficiency for which differences in rating quality across test-taker subgroups may have interesting implications.
The writing assessment data used in this study were also analyzed by Wind and Engelhard (2016) and by Wind (2017a). In Wind and Engelhard (2016), the researchers focused on illustrating the use of MSA to examine rating quality generally, without an emphasis on differences related to student subgroups. Wind (2017a) used the data to illustrate the adjacent-categories adaptation of MSA, and compared the results with Wind and Engelhard (2016). In the current study, the data were used to illustrate how ac-MSA indicators can be used to explore differences in rating quality between student subgroups.
Data analysis
The R statistical software program was used for all the analyses. The analytic procedure involved two major steps. First, base programming in R was used to calculate rating quality indices based on ac-MSA for the overall sample and within the female test-taker and male test-taker subgroups (described below). Second, differences were explored between the ac-MSA indicators for female and male test-takers as evidence of differences in rating quality for these subgroups of test-takers.
Rating quality indicators based on Mokken scale analysis
Three types of rating quality indices were calculated, which were based on MSA: (a) rater monotonicity, (b) rater scalability, and (c) invariant rater ordering. For each domain, rating quality indices were calculated for the overall sample of test-takers, as well as within the sample of female test-takers and the sample of male test-takers. Each rating quality indicator is described below.
Rater monotonicity
The first indicator of rating quality based on ac-MSA is rater monotonicity. Rater monotonicity is an indicator of the degree to which individual raters order test-takers’ performances consistently with other raters. Evidence of rater monotonicity indicates that test-taker ordering on the construct is invariant across raters. A combination of statistical hypothesis tests and graphical displays is used to identify departures from monotonicity for individual raters. Both procedures involve calculating rest scores, which are a nonparametric estimate of student achievement levels. To evaluate monotonicity for an individual rater, researchers can calculate rest scores by subtracting the rating each student receives from the rater of interest from the student’s total score across all of the raters, so that the rest score is the sum of the ratings from the remaining raters. As a simple example, consider a situation where four raters (X1, X2, X3, and X4) have given a student’s essay the following ratings: X1 = 4, X2 = 3, X3 = 4, X4 = 2, with a total score of X+ = 4 + 3 + 4 + 2 = 13. To evaluate monotonicity for Rater 1, one would subtract Rater 1’s rating from the total score: Rest score = X+ – X1 = 13 – 4 = 9. After each student’s rest score is calculated for the rater of interest, one can combine students with the same or adjacent rest scores into rest score groups to evaluate model assumptions. The typical approach for combining test-takers into rest score groups in MSA is to set the minimum number of test-takers per group to the sample size (N) divided by 10 if the sample includes more than 500 test-takers, N divided by 5 if the sample includes between 250 and 500 test-takers, and N divided by 3 if the sample includes fewer than 250 test-takers (Molenaar & Sijtsma, 2000; van der Ark, 2007). Researchers use this grouping procedure to increase the statistical power of their analyses, so that fluctuations across individual test-takers do not have an undue influence on the results.
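The rest score calculation and grouping step can be sketched in a few lines. The following is a minimal Python illustration rather than the R code used in this study; the function names and the `min_size` parameter are hypothetical, and the grouping rule (merge adjacent rest scores until a group reaches a minimum size) is an assumed simplification of the procedure described above.

```python
def rest_scores(ratings, rater):
    """Return each student's rest score for one rater.

    ratings: list of rows, one per student; each row holds one rating per rater.
    rater:   column index of the rater of interest.
    """
    return [sum(row) - row[rater] for row in ratings]


def group_by_rest_score(rest, min_size):
    """Merge adjacent rest scores (low to high) into groups of at least min_size.

    Returns a list of groups, each a list of student indices. Students with the
    same rest score always stay together; a too-small final group is merged
    into the previous group.
    """
    order = sorted(range(len(rest)), key=lambda s: rest[s])
    groups, current = [], []
    for pos, student in enumerate(order):
        current.append(student)
        is_last = pos == len(order) - 1
        # Only close a group at a boundary between distinct rest scores.
        boundary = is_last or rest[order[pos + 1]] != rest[student]
        if boundary and len(current) >= min_size:
            groups.append(current)
            current = []
    if current:
        if groups:
            groups[-1].extend(current)
        else:
            groups.append(current)
    return groups


# The worked example from the text: X1 = 4, X2 = 3, X3 = 4, X4 = 2.
example = [[4, 3, 4, 2]]
print(rest_scores(example, rater=0))  # [9]
```

Under this sketch, the student indices in each group can then be used to look up the ratings that the rater of interest assigned within that group.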
Figure 1 illustrates the graphical procedure for evaluating monotonicity for a rater who scored test-takers using a rating scale with four ordered categories (0 = low; 3 = high). Rest scores are plotted along the x-axis in order of increasing achievement from left to right, and the y-axis shows the probability that test-takers within a particular rest score group received a rating in category k, rather than category k−1.

Figure 1. Illustrative category response function for evaluating rater monotonicity.
The plotted lines in Figure 1 represent the difficulty associated with the threshold between each pair of adjacent rating scale categories, in which the highest line (circle-shaped plotting symbols) in the plot shows the probability for a rating in Category 1 rather than Category 0, the middle line (triangle-shaped plotting symbols) shows the probability for a rating in Category 2 rather than Category 1, and the lowest line (cross-shaped plotting symbols) shows the probability for a rating in Category 3 rather than Category 2. The example rater in Figure 1 demonstrates adherence to monotonicity, because the adjacent-categories probabilities are non-decreasing across increasing rest scores. Conversely, violations of monotonicity occur when adjacent-categories probabilities decrease across increasing rest scores. If these violations are observed, researchers can gauge their severity using a statistical test of the null hypothesis that the probability for a rating in category k, rather than category k − 1, is equal between two rest score groups. Rejection of this null hypothesis suggests a statistically significant violation of invariant test-taker ordering across raters.
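The descriptive part of this check can be illustrated with a short Python sketch. It assumes that the ratings from the rater of interest have already been split into rest score groups ordered from low to high; the function names are hypothetical, and the sketch only flags decreases, without the hypothesis test described above.

```python
def adjacent_category_prob(ratings_in_group, k):
    """P(rating == k | rating is k or k - 1) in one rest score group.

    Returns None when no rating in the group falls in category k or k - 1.
    """
    n_k = sum(1 for x in ratings_in_group if x == k)
    n_below = sum(1 for x in ratings_in_group if x == k - 1)
    total = n_k + n_below
    return n_k / total if total else None


def monotonicity_violations(groups, k):
    """List (g, h, drop) for group pairs g < h where the probability decreases.

    groups: rating lists for the rater of interest, ordered from the lowest
            to the highest rest score group.
    """
    probs = [adjacent_category_prob(group, k) for group in groups]
    violations = []
    for g in range(len(probs)):
        for h in range(g + 1, len(probs)):
            if probs[g] is None or probs[h] is None:
                continue
            if probs[h] < probs[g]:
                violations.append((g, h, probs[g] - probs[h]))
    return violations
```

An empty list for every threshold k corresponds to non-decreasing lines of the kind shown for the example rater; any returned triple marks a pair of rest score groups that would then be examined with the significance test.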
Adjacent-categories rater scalability
The second ac-MSA indicator of rating quality is rater scalability. In traditional applications of MSA, many researchers have used scalability analyses to inform item selection procedures in order to construct a scale that facilitates the interpretation of total scores as indicators of person ordering on the construct (Hemker, Sijtsma, & Molenaar, 1995; Meijer, Tendeiro, & Wanders, 2015; Straat, van der Ark, & Sijtsma, 2013). Scalability coefficients for items are calculated by using the ratio of the observed-to-expected Guttman errors associated with each item. For dichotomous items, Guttman errors are defined as combinations of incorrect responses to easier items with correct responses to more-difficult items. These response patterns complicate the interpretation of a scale because they indicate that the item difficulty ordering cannot be interpreted in the same way across test-takers. Accordingly, evidence of few Guttman errors associated with an item suggests that it meaningfully contributes to test-taker ordering on the construct. A method for identifying Guttman errors is illustrated in Figure 2, using three example dichotomous items (Item i, Item j, and Item k). In the example, the items are ordered by difficulty from easy to difficult as follows: Item i < Item j < Item k; this ordering is used to define Guttman errors. In the figure, a score of “1” indicates a correct response, and a score of “0” indicates an incorrect response. The top part of the figure (Panel A) shows a response pattern without any Guttman errors because the responses proceed from correct to incorrect as the items proceed from easy to difficult. Panel B illustrates a response pattern with two Guttman errors; each error is marked with italics and an asterisk. In the figure, Guttman errors occur when a score of “1” appears to the right of a score of “0”.

Figure 2. Identifying Guttman errors for dichotomous items.
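The rule that a “1” to the right of a “0” marks a Guttman error can be expressed as a small counting routine. The Python sketch below is illustrative only; the function name is hypothetical, and the two response patterns are invented examples rather than the patterns shown in the figure.

```python
def count_guttman_errors(responses):
    """Count item pairs with a 0 on the easier item and a 1 on the harder item.

    responses: 0/1 scores for one test-taker, ordered from the easiest
               to the most difficult item.
    """
    errors = 0
    for easier in range(len(responses)):
        for harder in range(easier + 1, len(responses)):
            if responses[easier] == 0 and responses[harder] == 1:
                errors += 1
    return errors


print(count_guttman_errors([1, 1, 0]))  # 0: correct on the easier items only
print(count_guttman_errors([0, 1, 1]))  # 2: the easiest item is failed, both harder items passed
```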
In the context of a rater-mediated assessment, researchers can identify Guttman errors for individual raters at the level of rating scale categories. Specifically, rating scale category thresholds are treated as a sort of dichotomous “item,” where test-takers “pass” the threshold if they receive a rating in category k, and “fail” the threshold if they receive a rating in category k − 1. One can use the difficulty ordering of rating scale category thresholds across test-takers to identify Guttman errors. Based on this adjacent-categories conceptualization of rating scale category thresholds, Guttman errors are combinations of success on more-difficult thresholds in combination with failure on easier thresholds.
Figure 3 illustrates a procedure for identifying Guttman errors for polytomous ratings (i.e., ratings in three or more categories). A simple example is used, in which two raters (Rater i and Rater j) rated student performances by using a rating scale with three categories (1 = low, 3 = high). Panel A shows the difficulty ordering of the rating scale categories for the pair of raters calculated using all the students and raters; as in the dichotomous example, this ordering is used to identify Guttman errors. Cells with bold and underlined entries are the expected responses when there are no Guttman errors. In contrast, cells with italicized entries and asterisks indicate Guttman errors. Panel B and Panel C illustrate the response pattern associated with each cell in Panel A, where the ratings on dichotomous thresholds (0 = fail/1 = pass) were recorded. The sum of scores on these thresholds equals the observed rating. As in the dichotomous example, Guttman errors occur when a score of “1” appears to the right of a score of “0”.

Figure 3. Illustration of Guttman errors for raters.
In order to calculate adjacent-categories scalability coefficients for raters, it is necessary to identify the observed and the expected frequency of Guttman errors based on marginal independence within each pair of raters. Then, scalability coefficients for pairs of raters are calculated using the general form of scalability coefficients based on MSA:
$$H_{ij} = 1 - \frac{F_{ij}}{E_{ij}},$$
where Fij is the observed frequency of Guttman errors, and Eij is the expected frequency of Guttman errors based on marginal independence. For individual raters, scalability is calculated using the combination of each rater (i) with every other rater (j ≠ i):
$$H_{i} = 1 - \frac{\sum_{j \neq i} F_{ij}}{\sum_{j \neq i} E_{ij}}.$$
Researchers can use adjacent-categories scalability coefficients to evaluate the contribution of each rater’s judgments to the overall ordering of test-takers in terms of the construct. Higher values of scalability coefficients suggest fewer Guttman errors associated with a particular rater, and lower values of scalability coefficients suggest frequent Guttman errors associated with the rater.
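To make the ratio of observed to expected Guttman errors concrete, the following Python sketch computes pairwise and individual scalability coefficients for the simpler dichotomous case. It is not the adjacent-categories computation used in this study: it assumes 0/1 scores, treats the column with more 1s as the easier one, and uses hypothetical function names.

```python
def pair_errors(data, a, b):
    """Observed and expected Guttman error counts for columns a and b.

    data: list of rows of 0/1 scores (one row per test-taker).
    The expected count assumes the two columns are marginally independent.
    """
    n = len(data)
    total_a = sum(row[a] for row in data)
    total_b = sum(row[b] for row in data)
    # The column with more 1s is treated as the easier one.
    easy, hard = (a, b) if total_a >= total_b else (b, a)
    observed = sum(1 for row in data if row[easy] == 0 and row[hard] == 1)
    p_easy_fail = 1 - sum(row[easy] for row in data) / n
    p_hard_pass = sum(row[hard] for row in data) / n
    expected = n * p_easy_fail * p_hard_pass
    return observed, expected


def h_pair(data, a, b):
    """Pairwise coefficient: 1 - F/E (assumes E > 0)."""
    observed, expected = pair_errors(data, a, b)
    return 1 - observed / expected


def h_single(data, i):
    """Coefficient for column i: 1 - sum(F) / sum(E) over all other columns."""
    pairs = [pair_errors(data, i, j) for j in range(len(data[0])) if j != i]
    return 1 - sum(f for f, _ in pairs) / sum(e for _, e in pairs)


# A perfect Guttman pattern contains no errors, so every coefficient is 1.
perfect = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
print(h_pair(perfect, 0, 1), h_single(perfect, 0))
```

In this simplified setting, a coefficient of 0 means that errors occur as often as expected under independence, which is why lower values signal a weaker contribution to the ordering of test-takers.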
This study compared rater scalability across groups of test-takers by calculating rater scalability coefficients using the ratings that raters assigned within each subgroup. Next, a nonparametric bootstrapping procedure was used to obtain standard errors of the difference between individual rater scalability coefficients across subgroups that could be used to evaluate the degree to which raters provided comparable adjacent-categories scalability coefficients for the two gender groups under re-sampling. Nonparametric bootstrapping procedures involve drawing many random samples from a set of data. The individual observations in each random sample are selected from the original dataset with replacement, so the bootstrap samples may be different from the original sample. In each bootstrap sample, the statistic of interest is calculated (in this case, the difference in scalability coefficients between the male and female students). Then, one can use the distribution of the statistic of interest over the random samples to estimate standard errors and confidence intervals. In this analysis, the boot package for R (Canty & Ripley, 2016; Davison & Hinkley, 1997) was used to calculate standard errors and 95% confidence intervals for the difference between Hi values within the male and female test-taker subgroups for each rater using 1000 replications.
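The resampling logic can be sketched in Python as follows. This is an illustration of a generic nonparametric bootstrap for a difference between two groups, not a reproduction of the boot package analysis: the function name is hypothetical, resampling within each subgroup and the percentile confidence interval are assumptions, and any statistic (for example, a scalability coefficient) can be passed in place of the mean used in the toy call.

```python
import random
import statistics


def bootstrap_difference(group_a, group_b, statistic, n_boot=1000, seed=0):
    """Bootstrap the difference statistic(group_a) - statistic(group_b).

    Each replication resamples both groups with replacement, keeping the
    original group sizes. Returns the standard error of the difference and
    a percentile 95% confidence interval.
    """
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        sample_a = rng.choices(group_a, k=len(group_a))  # with replacement
        sample_b = rng.choices(group_b, k=len(group_b))
        diffs.append(statistic(sample_a) - statistic(sample_b))
    diffs.sort()
    se = statistics.stdev(diffs)
    lower = diffs[int(0.025 * n_boot)]
    upper = diffs[int(0.975 * n_boot) - 1]
    return se, (lower, upper)


# Toy ratings for two subgroups; the mean stands in for the statistic of interest.
female = [3, 2, 3, 1, 2, 3, 2, 2]
male = [2, 1, 2, 2, 1, 3, 1, 2]
se, ci = bootstrap_difference(female, male, statistics.mean)
print(se, ci)
```

A confidence interval that excludes zero would be read, under these assumptions, as evidence that the statistic differs between the two subgroups.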
Invariant rater ordering
The third category of Mokken rating quality indices is invariant rater ordering (IRO). In the context of rater-mediated educational performance assessments, invariant ordering analyses are used to evaluate the degree to which raters are ordered the same way across test-takers in terms of severity. Invariant ordering in the context of a rater-mediated assessment is a fairness concern, where evidence of IRO is needed to ensure that conclusions about test-taker achievement do not depend on the particular raters who happened to score their work, and conclusions about rater severity ordering do not depend on the particular performances the raters happened to score.
Similar to rater monotonicity analyses, researchers can evaluate IRO using graphical and statistical indices. Following Ligtvoet et al. (2010, 2011), the procedure for evaluating IRO is based on overall rater judgments aggregated across rating scale categories. First, average ratings from each rater are used to establish an overall severity (i.e., difficulty) ordering for the group of raters. IRO is observed when rater severity ordering is consistent across rest-score groups. Figure 4 illustrates a graphical method for evaluating IRO for pairs of raters. In each plot, the y-axis shows average observed ratings on a four-category scale (0 = low, 3 = high), and the x-axis shows test-taker rest-score groups, arranged from low to high. The lines are nonparametric rater response functions (RRFs) that illustrate the relationship between rater severity and test-taker achievement. The RRFs for Rater k and Rater l are invariantly ordered across test-taker rest-score groups, such that Rater k (solid line) is consistently more lenient (i.e., assigns higher average ratings) than Rater l (dashed line).

Figure 4. Illustrative rater response functions for evaluating invariant rater ordering.
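The graphical comparison of two rater response functions amounts to checking whether the difference in average ratings keeps the same sign in every rest-score group. The Python sketch below illustrates that descriptive check only; the function names are hypothetical, the rest-score groups are assumed to have been formed beforehand, and no significance test is included.

```python
def mean(values):
    """Arithmetic mean of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)


def iro_reversals(ratings, rater_a, rater_b, groups):
    """Find rest-score groups where the ordering of two raters reverses.

    ratings: list of rows, one per test-taker, one rating per rater.
    groups:  lists of test-taker indices, ordered from low to high rest score.
    Returns (group index, mean difference) for each group whose mean
    difference has the opposite sign to the overall mean difference.
    """
    overall = mean(row[rater_a] for row in ratings) - mean(row[rater_b] for row in ratings)
    reversals = []
    for g, students in enumerate(groups):
        diff = (mean(ratings[s][rater_a] for s in students)
                - mean(ratings[s][rater_b] for s in students))
        if overall * diff < 0:
            reversals.append((g, diff))
    return reversals


# Rater 0 is more lenient overall but more severe for the lowest group.
toy = [[0, 1], [0, 1], [3, 1], [3, 1]]
print(iro_reversals(toy, 0, 1, [[0, 1], [2, 3]]))  # [(0, -1.0)]
```

An empty result corresponds to response functions that do not cross, as for Rater k and Rater l in the illustration.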
Results
This section presents the results from the analyses, using the three ac-MSA indicators of rating quality. The results are summarized as they relate to the entire sample of test-takers, as well as the degree to which differences were observed in rating quality between test-taker subgroups.
Rater monotonicity
First, rater monotonicity was examined within the overall sample, the sample of female test-takers, and the sample of male test-takers within the Conventions and Style domains. Results from the monotonicity analyses revealed no significant violations of monotonicity for any of the 20 raters based on the overall sample or within the test-taker subgroups. However, examination of the graphical displays used to evaluate monotonicity revealed some interesting patterns related to test-taker subgroups. For example, Figure 5 illustrates differences in rating scale category use for Rater 12 within the Style domain. For the female test-taker subgroup (Panel A), all three of the rating scale category thresholds were ordered as expected, with no violations of monotonicity. However, for the male test-taker subgroup (Panel B), the second and third rating scale category thresholds were disordered within the first rest-score group. Specifically, the probability for a rating in category 3 rather than category 2 was higher than the probability for a rating in category 2 rather than category 1 for test-takers with the lowest achievement levels. This disordering corresponds to a violation of monotonicity related to the third rating scale category threshold. Although this violation of monotonicity was not statistically significant, this finding highlights a difference in rating scale category use across the female and male test-taker subgroups for Rater 12 that may warrant further investigation.

Category response functions for Rater 12 within female and male test-taker subgroups.
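The quantities plotted in category response functions of this kind can be tabulated directly. The sketch below is a minimal illustration in Python. It assumes that rest-score group labels have already been assigned and that they are coded so that sorted labels run from low to high achievement. For one rater, it computes the proportion of ratings in category m among ratings in categories m - 1 and m within each rest-score group, and then counts decreases across consecutive groups. The function names and the `minvi` tolerance are hypothetical, and no significance test is included.

```python
import numpy as np

def adjacent_category_probs(rater_ratings, group_labels, n_categories):
    """Return an (n_groups, n_categories - 1) array of adjacent-category
    proportions: P(X = m | X in {m - 1, m}) within each rest-score group."""
    rater_ratings = np.asarray(rater_ratings)
    group_labels = np.asarray(group_labels)
    groups = np.unique(group_labels)  # sorted, assumed low to high
    probs = np.full((len(groups), n_categories - 1), np.nan)
    for g_idx, g in enumerate(groups):
        x = rater_ratings[group_labels == g]
        for m in range(1, n_categories):
            n_upper = np.sum(x == m)
            n_lower = np.sum(x == m - 1)
            # Leave NaN when neither adjacent category was used in this group.
            if n_upper + n_lower > 0:
                probs[g_idx, m - 1] = n_upper / (n_upper + n_lower)
    return probs

def monotonicity_violations(probs, minvi=0.0):
    """Count decreases larger than minvi between consecutive groups."""
    drops = -np.diff(probs, axis=0)
    # Comparisons involving NaN evaluate to False and are not counted.
    return int(np.sum(drops > minvi))
```

Computing the table separately for each test-taker subgroup gives a numerical counterpart to comparing Panel A and Panel B, although the plots remain the more informative display for spotting localized irregularities.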
Rater scalability
Next, individual rater scalability coefficients (Hi) were calculated to explore differences in the frequency with which Guttman errors were associated with each rater across the two gender subgroups. Values of rater scalability coefficients and standard errors based on the nonparametric bootstrap procedure for the overall sample and within the two gender subgroups are given in Column B of Table 1 and Table 2 for the Conventions and Style domains, respectively. Overall, values of the adjacent-categories scalability coefficients suggest that there were differences in the degree to which individual raters were associated with Guttman errors based on the overall sample of test-takers, as well as within the gender subgroups for both domains.
Rating quality results: Conventions.
Rating quality results: Style.
Raters with statistically significant differences in scalability coefficients (p < 0.05) between female and male test-takers.
For the Conventions domain, Rater 10 had the highest scalability coefficient based on the overall sample of test-takers (Hi = 0.39, SE = 0.02), and Rater 18 had the lowest scalability coefficient based on the overall sample (Hi = 0.29, SE = 0.02). Across the raters, the range of scalability coefficients was slightly smaller within the male test-taker subgroup (0.27 ⩽ Hi ⩽ 0.35) compared to the range of values observed for the female test-taker subgroup (0.21 ⩽ Hi ⩽ 0.35). However, none of the scalability coefficients were significantly different between the gender subgroups at α = 0.05.
For the Style domain, Rater 16 had the highest scalability coefficient based on the overall sample of test-takers (Hi = 0.34, SE = 0.02), and Rater 17 had the lowest scalability coefficient (Hi = 0.25, SE = 0.02). For all the raters, scalability coefficients were slightly lower within the male test-taker subgroup (0.20 ⩽ Hi ⩽ 0.29) compared to the female test-taker subgroup (0.25 ⩽ Hi ⩽ 0.33). Furthermore, these differences in rater scalability across test-taker subgroups were statistically significant (p < 0.05) for four raters (Rater 2, Rater 6, Rater 11, and Rater 14).
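One simple way to compare a rater's scalability coefficients between two subgroups is a two-sided z-test built from the coefficients and their bootstrap standard errors. The sketch below is an assumption made for illustration. The text does not specify which test produced the reported p-values, and the approach presumes independent subgroups and approximately normal sampling distributions. The function name `compare_scalability` and the input values in the usage note are hypothetical and are not taken from Table 1 or Table 2.

```python
import math

def compare_scalability(h_a, se_a, h_b, se_b):
    """Two-sided z-test for the difference between two scalability
    coefficients estimated in independent subgroups."""
    # Standard error of the difference under independence.
    se_diff = math.sqrt(se_a ** 2 + se_b ** 2)
    z = (h_a - h_b) / se_diff
    # Two-sided p-value from the standard normal distribution:
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)).
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p
```

For example, `compare_scalability(0.33, 0.02, 0.25, 0.02)` uses made-up subgroup values and returns a z statistic and p-value that can be compared against α = 0.05.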
Invariant rater ordering
The results from the IRO analyses are summarized in Column C of Table 1 and Table 2 for the Conventions and Style domains, respectively. It is interesting to note that more-frequent violations of IRO were observed when the complete sample was used, compared to the analyses within the two subgroups; this pattern was consistent across the Conventions and Style domains. For example, in the Conventions domain (Table 1), five violations of IRO were observed for Rater 4 in the overall sample. However, no violations of IRO were observed when this property was evaluated separately within the female and male test-taker subgroups. Because no violations of IRO were observed within the gender subgroups, these findings suggest that the relative severity of Rater 4 remained consistent across achievement levels within subgroups, but that this rater was systematically more severe for male test-takers than for female test-takers. This pattern points to potential differential rater severity related to test-taker gender. Examination of the IRO results reveals similar patterns for several other raters across the two domains.
Another interesting pattern that was observed across both domains was the presence of significant violations of IRO within the overall sample and within only one of the test-taker subgroups. For example, in the Conventions domain (Table 1), Rater 13 had nine significant violations of IRO in the overall sample, zero violations of IRO for female test-takers, and four violations of IRO for male test-takers. This finding suggests that the relative severity ordering of Rater 13 remained consistent across achievement levels when scoring essays composed by female test-takers, but the same was not true when this rater scored essays composed by male test-takers.
Discussion
The purpose of this study was to illustrate and explore the use of ac-MSA to identify differences in rater judgment across test-taker subgroups within the context of a rater-mediated educational performance assessment. MSA provides a theory-based approach to exploring differences in rating quality between subgroups that does not depend on potentially inappropriate transformations of ordinal ratings. Unlike other popular nonparametric statistics for evaluating rating quality, such as kappa coefficients (Cohen, 1968), MSA indicators reflect a specific theoretical framework, namely invariant measurement, which asserts particular psychometric properties as requirements for meaningful interpretation of assessment results. As a result, one can use violations of MSA model assumptions to identify individual raters, students, or student subgroups whose ratings may not have psychometrically sound interpretations, and thus warrant additional investigation. Accordingly, MSA provides information about rating quality that is more diagnostically useful than summary-level indicators of rater consistency (e.g., kappa) but also more conservative than latent trait models whose interpretation depends on parametric transformations (e.g., the MFR model). The results from the illustrative analyses revealed differences in rater monotonicity, rater scalability, and IRO between the gender subgroups based on the Conventions and Style ratings. Together, these findings suggest that rating quality was not invariant across female and male test-takers. Specifically, violations of monotonicity, evidence of Guttman errors, and violations of invariant ordering can alert individuals to raters whose ratings do not have clear interpretations. The implications of these findings in terms of the three ac-MSA indicators of rating quality are considered below.
Rater monotonicity
Results from monotonicity analyses for the complete sample of test-takers as well as within the two gender subgroups suggested that the ordering of test-takers was consistent across the 20 raters, and that invariant test-taker ordering held within both of the test-taker subgroups. Although no significant violations of monotonicity were observed, the graphical monotonicity analyses highlighted some differences in the raters’ interpretation of the rating scale across the gender subgroups that may warrant additional investigation.
Rater scalability
Next, values of the rater scalability coefficients suggested that there were differences in the degree to which each of the 20 raters’ judgments included Guttman errors when they rated the two student subgroups. Although the differences in rater scalability coefficients between the gender subgroups were not statistically significant for the Conventions domain, results related to the Style domain indicated that all 20 raters had higher scalability coefficients for the sample of female test-takers than for the sample of male test-takers, and these differences were statistically significant for four raters. This finding suggests that rater judgments in the Style domain were associated with more Guttman errors when the raters evaluated male test-takers’ compositions compared to their judgments of female test-takers’ compositions.
Invariant rater ordering
Finally, the ac-MSA analyses revealed differences in the frequency of violations of IRO across gender subgroups for both domains. For several raters within each domain, the IRO analyses revealed more-frequent violations in the complete sample than were observed when the gender subgroups were analyzed separately. This finding suggests that violations of IRO may be related to differences in rater severity ordering related to test-taker gender. For other raters, the IRO analyses within subgroups revealed significant violations within only one subgroup. For these raters, the IRO results suggested systematic differences in rater severity across achievement levels within a subgroup, which may reflect idiosyncratic interpretations of test-taker achievement related to test-taker gender.
Implications
Together, the results from the illustrative analysis suggest that ac-MSA is a useful nonparametric technique through which one can evaluate the consistency of rating quality across subgroups of test-takers in rater-mediated writing assessments. In particular, these findings indicate that ac-MSA analyses conducted within subgroups of test-takers reveal differences in rating quality across test-taker subgroups that warrant further analysis. Specifically, violations of monotonicity for individual raters indicate that those raters have not interpreted the relative ordering of student achievement in the same way as other raters, with the result that conclusions about students’ relative ordering vary across raters. Likewise, evidence of frequent Guttman errors (low scalability) indicates that raters have not interpreted the difficulty of rating scale categories consistently with other raters. Violations of invariant rater ordering indicate that rater severity is inconsistent across levels of student achievement, such that it is not possible to compare raters’ relative ordering across students with different levels of achievement. Finally, differences in rater monotonicity, rater scalability, and invariant rater ordering between student subgroups indicate that the quality of rater judgments is not comparable between these subgroups. Importantly, these statistical indicators do not explain the cognitive processes that contributed to raters’ scoring patterns. As with any statistical approach to evaluating rating quality, additional investigations that incorporate qualitative components, such as mixed-methods studies (Creswell & Plano-Clark, 2007), are needed in order to provide potential explanations for observed violations of MSA assumptions.
This study has several implications for research and practice related to rater-mediated writing assessments. In terms of research, this study builds upon recent developments in the use of ac-MSA within the context of rater-mediated assessments to include an additional set of indices that can be used to identify differences in measurement quality across test-taker subgroups. The procedure illustrated in this study provides information regarding the degree to which individual raters conform to basic measurement properties, and how these properties vary across test-taker subgroups.
In the context of the current study, the lower values of scalability coefficients within the male test-taker subgroup for the Style domain are interesting in light of the persistent finding of lower achievement among male test-takers on direct writing assessments (Cole, 1997a, 1997b; Engelhard, Gordon, & Gabrielson, 1992). Specifically, the higher frequency of Guttman errors in rater judgments of male test-takers’ compositions in terms of Style compared to female test-takers’ compositions suggests that the interpretation of ratings as indicators of test-taker locations on the construct may differ between these subgroups. Additional research is needed to explore more fully the implications of these differences in the frequency of Guttman errors for validity, reliability, and fairness (AERA, APA, & NCME, 2014).
It is important to note that evidence of differences in rating quality between test-taker subgroups provides different information than the interactions that are often included in studies of differential item functioning, differential test functioning, or differential rater functioning based on parametric IRT models, such as bias/interaction analyses in MFR models. These analytic techniques focus on differences in rater severity while statistically controlling for test-taker locations on the latent variable. In contrast, the methods illustrated in this study provide a means of exploring differences in the degree to which the basic measurement properties are observed for individual raters across test-taker subgroups. For example, differences in an individual rater’s scalability coefficients between the female and male test-taker subgroups suggest differences in that rater’s interpretation of test-taker achievement when evaluating female and male test-takers’ compositions. This finding could suggest potential directions for rater remediation, or revisions of the scoring rubric or score-point exemplars to minimize the influence of construct-irrelevant aspects of test-taker performances on rater judgment.
Shortcomings of nonparametric IRT models
While recognizing the useful characteristics of MSA models discussed above, it is important to acknowledge the limitations of these models relative to popular parametric IRT models. As noted at the beginning of the paper, the lack of a parametric form prevents MSA models from providing interval-level parameter estimates, such as are needed for computer-adaptive assessment procedures, equating, and other parametric analyses. Whereas parametric IRT models result in interval-level estimates that are suitable for such analyses, NIRT models do not. Additionally, MSA models currently do not include a multi-faceted model similar to the MFR model through which analysts could examine more than two facets in a single analysis. Available MSA models are also limited to assessments in which all the items (or in this case, raters) use the same number of rating scale categories. Assessments in which items can take on different maximum values may be more appropriately analyzed with parametric models such as the Partial Credit model (Masters, 1982). Despite these shortcomings, the results from the analyses illustrated in the current study suggest that MSA models provide useful information about the quality of assessment procedures that can inform a variety of interpretations and uses that do not depend on these features of parametric IRT models. As noted in the beginning of the manuscript, such situations may include non-adaptive assessments that will be used to identify student or rater relative ordering, assessment development procedures, and rater training procedures.
Limitations and directions for future research
Several limitations are important to note when considering the results from this study. First, this study included an illustrative analysis based on an authentic dataset from a rater-mediated writing performance assessment. Accordingly, researchers should consider the match between the current sample and other assessment contexts before generalizing the results to these other contexts. In future studies, researchers could compare rating quality between test-taker subgroups in additional contexts, including rater-mediated assessments in content areas besides writing.
The methods for detecting differences in rating quality across subgroups presented in this study should be viewed as a continuation of the initial presentation of adjacent-categories MSA (Wind, 2017a) that complements existing methods for evaluating rating quality based on parametric models. Because they are based on the requirements for invariant measurement, researchers can use the methods illustrated in this study to evaluate ratings in terms of fundamental measurement properties with less-strict underlying model assumptions than parametric techniques. Additional studies, including simulation studies, are needed to understand more fully the sensitivity of ac-MSA indicators to differences in rater judgment between student subgroups. In particular, additional studies are needed in which researchers compare the sensitivity of ac-MSA indicators of differences in rater judgment between subgroups to other methods for comparing rating quality, such as MFR models. In addition, researchers could continue the exploration of ac-MSA in the context of rater-mediated writing assessments that include analytic scoring rubrics. This study analyzed the raters’ ratings separately by domain. The results indicated differences in rating quality between the Conventions and Style domains, suggesting that the raters interpreted these two domains as distinct aspects of student writing. However, additional research is needed to develop methods based on MSA for examining empirically the degree to which raters distinguish among domains in analytic rubrics (e.g., halo effects).
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
