Abstract
Models of modern test theory imply statistical independence among responses, generally referred to as local independence. One violation of local independence occurs when the response to one item governs the response to a subsequent item. Expanding on a formulation of this kind of violation as a process in the dichotomous Rasch model, this article generalizes the dependence process to the case of the unidimensional, polytomous Rasch model. It then shows how the magnitude of this violation can be estimated as a change in the location of thresholds separating adjacent categories in the second item caused by the response dependence on the first. As in the dichotomous model, it is suggested that this index is relatively more tangible in interpretation than other indices of dependence that are either a weight in the interaction term in a model or a correlation coefficient. One function of this method of assessing dependence is likely to be in the development of tests and assessment formats where evidence of the magnitude of dependence of one item on another in a pilot study can be used as part of the evidence in deciding which items will be retained in a final version of a test or which formats might need to be reconstructed. A second function might be to identify the magnitude of response dependence that may then need to be taken into account in some other way, perhaps by applying a model that takes account of the dependence.
Keywords
Models of modern test theory generally assume statistical independence among responses of persons to multiple items of an assessment instrument. In modern test theory, this independence is generally referred to as local independence (Andrich, 1991; Lazarsfeld & Henry, 1966). Andrich and Kreiner (2010) noted that, although generally assumed from the perspective of statistical modeling, this assumption of independence reflects an ideal design of an instrument composed of multiple items. They make the case that tests are composed of many locally independent items for two main reasons: the greater the number of different statistically independent items, the greater the potential for both the precision and the validity of measurement. Thus, if other relevant factors such as the number of items, the administration of the instrument, sample size, and the alignment of persons and items are held constant, responses to locally independent items provide more information than do responses that are locally dependent.
Andrich and Kreiner’s (2010) concern was with unidimensional variables in which there is only one person parameter, and they review a range of related topics in handling violations of local independence, including the structural sources of dependence and the approaches to identifying and accounting for it. Their focus was on responses that are dichotomous. Their review will not be repeated but only summarized in the present article. Regarding the sources of violations, they note the two kinds of generic violations of local independence identified in Marais and Andrich (2008): The first is where responses to subsets of items interact with subsets of persons so that more than one dimension plays a role in the responses; the second is where the response to an item is governed by a response to another, generally previously administered, item. Marais and Andrich give examples of each kind of violation.
Andrich and Kreiner (2010) also noted that there are a range of procedures for identifying and accounting for dependence, including ones that arise from tests of violations of fit when the model assumes unidimensionality and response independence, and ones in which models quantify local dependence as interaction terms (e.g., Haberman, 2007; Kreiner & Christensen, 2004).
Andrich and Kreiner (2010) identified and quantified response dependence between two dichotomously scored items using an approach different from those previously appearing in the literature. They used a recent characterization of the violation of local independence in the dichotomous Rasch model as a response process in which the location of a dependent item was changed by its dependence on another, independent item. They then showed how the magnitude of this change in location could be estimated. The advantage of estimating the magnitude of dependence in this way, they argued, was that it makes tangible its effect in terms of the values of the key parameters of the items, their respective locations. Because such an analysis is not based on simply modeling response dependence, but in identifying and quantifying its magnitude, the main benefits are likely to be in the improvement in the construction of items and their response formats.
This article generalizes the results shown in Andrich and Kreiner (2010) for dichotomously scored items using the dichotomous Rasch model to the case of response dependence between two polytomously scored items with more than two, putatively ordered categories, using the polytomous Rasch model (PRM). In achievement testing with dichotomously scored items, response dependence can arise when the correct response to one item in some way affects the probability of a correct response to a subsequent item. Practical guides on test construction include advice on how to construct items that do not have such a property, for example, obtaining the correct answer to one test item should not be based on having correctly answered a prior item (Mehrens & Lehman, 1991). In the case of polytomous response formats, response dependence may occur when a particular rating on one item either implies logically, or creates empirically, the same rating on another item. Marais and Andrich (2008) suggested a halo effect as an example of the latter where an assessment of some performance by a single assessor on multiple criteria provides a greater relationship among the criteria than if different assessors assessed the different criteria independently. Such assessments are very common in psychology and education. A brief analysis of a real example of this kind is provided in this article.
This article focuses on the dependence between two polytomous items that have the same number of ordered categories. A single simulated example and an example with real data are used to illustrate the procedure. It is stressed, however, that the simulation is not a study of the properties of the estimates, which in any case are well known to be consistent (Andersen, 1973; Zwinderman, 1995) when conditional maximum likelihood estimation is used, as it is in the examples. In addition, because the person parameter is conditioned out, the person distribution plays no role in the estimation or in the assessment of the magnitude of any dependence among these items. These prove to be important properties in the method described in this article for estimating the effect of dependence.
The rest of the article is structured as follows: The section titled “The PRM and the Formulation of Response Dependence” reviews the PRM, the formulation of response dependence in the dichotomous case, and its generalization to the polytomous case. Sections presenting an illustrative simulated example and an illustrative real example follow, and the last section is a summary.
The PRM and the Formulation of Response Dependence
There are a number of forms in which the PRM can be expressed. The expression convenient for this article (Andrich, 2010) takes the form
where (a)

Figure 1. Response probability curves for item j with and without dependence of d = 0.25 on item i, where x_ni = 2 and x_nj = 2
A special case of the model in which all items have the same
where
It is observed from Figure 1, and it can be proved, that the range in which the probability of the response
in which there is only the one threshold,
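As a concrete companion to the review of the PRM, the following minimal sketch computes category probabilities under a commonly used threshold parameterization, in which the exponent for a score of x is the sum over the first x thresholds of the person location minus the threshold. The function name, argument names, and example values are illustrative assumptions, and the parameterization may differ in notation from Equation 1. With a single threshold the function reduces to the dichotomous Rasch model.

```python
import numpy as np


def prm_probabilities(beta, thresholds):
    """Category probabilities for one person-item encounter under the PRM.

    beta: person location in logits.
    thresholds: the m thresholds of the item on the same scale as beta
        (item location plus centred thresholds), giving scores 0..m.
    """
    thresholds = np.asarray(thresholds, dtype=float)
    # Exponent for score x is the sum over k <= x of (beta - threshold_k);
    # the exponent for x = 0 is zero.
    exponents = np.concatenate(([0.0], np.cumsum(beta - thresholds)))
    # Subtracting the maximum guards against overflow and cancels in the ratio.
    numerators = np.exp(exponents - exponents.max())
    return numerators / numerators.sum()


# Hypothetical item with three thresholds and therefore four categories.
probs = prm_probabilities(beta=0.0, thresholds=[-1.0, 0.0, 1.0])
print(probs.round(3))

# With one threshold the model is the dichotomous Rasch model.
dichotomous = prm_probabilities(beta=0.5, thresholds=[0.0])
print(dichotomous.round(3))
```

The normalizing sum in the last line of the function plays the role of the denominator that makes the category probabilities sum to 1.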
Local Response Dependence With Dichotomous Responses
The formulation of response dependence (Andrich & Kreiner, 2010; Marais & Andrich, 2008) for the dichotomous Rasch model of Equation 3 is reviewed. Statistical independence for the response of person
Marais and Andrich (2008) introduced a violation of this condition in the dichotomous Rasch model as follows:
where
To interpret Equation 5 explicitly, Marais and Andrich (2008) supposed
Using the above formulation, Andrich and Kreiner (2010) showed that the dependence effect
Local Response Dependence With Polytomous Responses
Although response dependence in Equation 5 can be interpreted as a change in relative difficulty, a second interpretation, more readily generalized to the case of polytomous responses, is to consider the effect that the change in the point of dichotomization has on the range of the continuum of the dependent item.
First, consider a response to item
This interpretation of response dependence as inducing a change in the range of the continuum in which item
Let the response to independent item
In summary, let
where
Figure 1 provides a graphical depiction of the process where (a) item
Table 1 shows
Probabilities of xnj = 2 and xnji for Selected Values of β
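The effect of dependence on the probability of the same response in the dependent item can be illustrated numerically. The sketch below assumes one reading of the formulation: after a response of x to item i, the threshold of item j between scores x − 1 and x is lowered by d and the threshold between scores x and x + 1 is raised by d, which widens the range in which a score of x on item j is most likely. This reading, the function names, and the threshold values are assumptions made for illustration; only d = 0.25 and the response of 2 are taken from the caption of Figure 1.

```python
import numpy as np


def prm_probabilities(beta, thresholds):
    """Category probabilities under a threshold form of the PRM."""
    thresholds = np.asarray(thresholds, dtype=float)
    exponents = np.concatenate(([0.0], np.cumsum(beta - thresholds)))
    numerators = np.exp(exponents - exponents.max())
    return numerators / numerators.sum()


def dependent_thresholds(thresholds, x_i, d):
    """Thresholds of item j given a response x_i to item i (assumed form).

    thresholds[k - 1] is the threshold between scores k - 1 and k.
    """
    shifted = np.array(thresholds, dtype=float)
    m = len(shifted)
    if x_i >= 1:
        shifted[x_i - 1] -= d  # threshold below category x_i moves down
    if x_i < m:
        shifted[x_i] += d      # threshold above category x_i moves up
    return shifted


thresholds = [-1.0, 0.0, 1.0]  # hypothetical thresholds of item j
shifted = dependent_thresholds(thresholds, x_i=2, d=0.25)

for beta in (-1.0, 0.0, 1.0):
    p_without = prm_probabilities(beta, thresholds)[2]
    p_with = prm_probabilities(beta, shifted)[2]
    print(f"beta={beta:+.1f}  P(x_j=2) without: {p_without:.3f}  with: {p_with:.3f}")
```

Under this assumed form the probability of a score of 2 on item j is larger with dependence than without it at every person location, because only the numerator for that score is multiplied by exp(d).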
Estimation of the Magnitude of Dependence d
To estimate the magnitude of dependence
Of course, the original item
However, because their responses are all dependent on the responses to item
Clearly, having eliminated items
Estimates of the Thresholds for Each Resolved Item jix
In general, let the estimates of
Subset of Estimates of the Thresholds for Each Resolved Item
The mean of the successive differences of each of these pairs gives an estimate of
The estimate
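The arithmetic of averaging successive differences can be sketched as follows. The sketch assumes that each threshold k of the dependent item is estimated in two resolved items, the one formed from responses of k − 1 to the independent item (where, on the assumed reading, it is raised by d) and the one formed from responses of k (where it is lowered by d), so that half the difference between the two estimates is an estimate of d. This pairing, the function name, and the numerical values are assumptions for illustration and are not a statement of the exact estimator used in the article.

```python
import numpy as np


def estimate_d(resolved_thresholds):
    """Estimate d from threshold estimates of the resolved items.

    resolved_thresholds[x][k - 1] is the estimate of threshold k in the
    resolved item formed from persons who scored x on the independent item.
    Returns one estimate of d per threshold and their mean.
    """
    t = np.asarray(resolved_thresholds, dtype=float)  # shape (m + 1, m)
    m = t.shape[1]
    estimates = np.array(
        [(t[k - 1, k - 1] - t[k, k - 1]) / 2.0 for k in range(1, m + 1)]
    )
    return estimates, estimates.mean()


# Hypothetical error-free values: thresholds (-1, 0, 1) and d = 0.5.
resolved = [
    [-0.5, 0.0, 1.0],   # persons with a score of 0 on the independent item
    [-1.5, 0.5, 1.0],   # score of 1
    [-1.0, -0.5, 1.5],  # score of 2
    [-1.0, 0.0, 0.5],   # score of 3
]
per_threshold, d_hat = estimate_d(resolved)
print(per_threshold, d_hat)
```

With four categories this gives three estimates per item pair, which is consistent with the 15 estimates reported for the five dependent pairs in the simulated example.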
Standard Errors of the Magnitude of the Estimate of d
In general, the hypothesis is that
Estimated Variance Errors of Each
With
Then the variance of the mean of the
and
A Simulated Example
This section introduces a small simulated example that illustrates the procedure for estimating the magnitude of dependence based on the preceding formulation.
Simulation Design
The simulated example had the following features:
There were 30 items, all items had three thresholds, and therefore four categories and a maximum score of
20 items had no dependence structure;
5 items had response dependence on the immediately preceding item: Item 22 was dependent on Item 21, Item 24 on 23, Item 26 on 25, Item 28 on 27, and Item 30 on 29;
the dependence value in the preceding formulation was
the thresholds were generated randomly from a uniform distribution and then reordered to be in the natural order, in the range
there were 2000 persons, normally distributed with mean 0 and standard deviation 2.0
Results
The item parameters were estimated using the computer program RUMM2030 (Andrich, Sheridan, & Luo, 2011), which uses the conditional, pairwise maximum likelihood algorithm (Andrich & Luo, 2003) to eliminate the person parameters and from which the estimates of the item parameters are consistent (Zwinderman, 1995). RUMM2030 also accounts for missing responses routinely, a necessary feature given that the resolved items constructed from each original item have structurally missing responses.
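The structurally missing responses arise directly from the way a dependent item is resolved. The following minimal sketch constructs the resolved items from the responses to an independent item and a dependent item: one new item is created for each category of the independent item, and a person's response to the dependent item is recorded only in the resolved item that matches that person's response to the independent item. The function name and the small data set are hypothetical, and NaN is used here as the missing-data code purely for illustration.

```python
import numpy as np


def resolve_item(x_i, x_j, max_score):
    """Resolve dependent item j into max_score + 1 items, one per category of item i.

    x_i: responses to the independent item (integers 0..max_score).
    x_j: responses to the dependent item.
    Returns an array of shape (n_persons, max_score + 1) in which each
    person's response to item j appears in the column matching x_i and
    every other entry is NaN (structurally missing).
    """
    x_i = np.asarray(x_i)
    x_j = np.asarray(x_j, dtype=float)
    resolved = np.full((len(x_j), max_score + 1), np.nan)
    for x in range(max_score + 1):
        mask = x_i == x
        resolved[mask, x] = x_j[mask]
    return resolved


# Hypothetical responses of five persons to items with scores 0..3.
x_i = [0, 1, 3, 3, 2]
x_j = [1, 1, 2, 3, 2]
resolved = resolve_item(x_i, x_j, max_score=3)
print(resolved)
```

In an analysis the resolved columns would replace both the original dependent item and the independent item, so that the remaining items provide the common frame of reference for the thresholds of the resolved items.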
Using the structure of Table 3 to obtain an estimate of
Estimates of
It is evident from Table 5 that an excellent estimate
Given five item pairs which are statistically independent, and each pair having the same dependence value of
This mean estimate,
and
The estimated empirical standard deviation of these 15 estimates is 0.072, giving a standard error of the mean of
Finally, to consolidate the illustration of estimating the value of
A Real Example
To use the preceding approach to quantify response dependence in the analysis of real data, two preliminary conditions need to be satisfied. First, it is necessary to have a hypothesis as to which items might be dependent. Second, the remaining items of the test should fit the Rasch model reasonably well. Of course, if the dependence between the hypothesized pairs of items is not sufficiently large to be detected at the level of power provided by the sample size, then the estimate of dependence
One way of generating a hypothesis of whether two specific items show response dependence is to analyze the data with the PRM, form standardized residuals between the observed and expected responses from the parameter estimates and the model, and study the correlations among these residuals. A correlation of high magnitude between the residuals of two items would suggest some form of dependence between them that is not accounted for by the PRM. To ensure reasonable fit of the data to the model for the remaining items, items that show gross misfit to the model can be eliminated. Then the estimate of the magnitude of the dependence between two items is relative to that set of items which fit the model. In conjunction with such a quantitative estimate, it is necessary to consider the substantive features of the items and to understand the dependence in these terms.
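The residual-based screening described above can be sketched as follows. For each person and item, the expected score and its variance are computed from the model probabilities, the standardized residual is the observed minus the expected score divided by the standard deviation, and the residuals are then correlated across items. The function names, the simulated data, and the use of generating values in place of parameter estimates are assumptions made to keep the sketch self-contained; in practice the estimates from the analysis would be used.

```python
import numpy as np


def prm_probabilities(beta, thresholds):
    """Category probabilities under a threshold form of the PRM."""
    thresholds = np.asarray(thresholds, dtype=float)
    exponents = np.concatenate(([0.0], np.cumsum(beta - thresholds)))
    numerators = np.exp(exponents - exponents.max())
    return numerators / numerators.sum()


def standardized_residuals(responses, betas, item_thresholds):
    """(observed - expected) / sqrt(variance) for every person and item."""
    responses = np.asarray(responses, dtype=float)
    z = np.empty_like(responses)
    for n, beta in enumerate(betas):
        for i, thresholds in enumerate(item_thresholds):
            p = prm_probabilities(beta, thresholds)
            scores = np.arange(len(p))
            expected = (scores * p).sum()
            variance = ((scores - expected) ** 2 * p).sum()
            z[n, i] = (responses[n, i] - expected) / np.sqrt(variance)
    return z


# Hypothetical data: 200 persons, 4 locally independent items.
rng = np.random.default_rng(0)
betas = rng.normal(0.0, 2.0, size=200)
item_thresholds = [
    [-1.0, 0.0, 1.0],
    [-0.5, 0.2, 0.9],
    [-1.2, -0.1, 1.4],
    [-0.8, 0.3, 0.6],
]
responses = np.array(
    [[rng.choice(4, p=prm_probabilities(b, t)) for t in item_thresholds]
     for b in betas]
)

z = standardized_residuals(responses, betas, item_thresholds)
corr = np.corrcoef(z, rowvar=False)  # items are columns
print(corr.round(2))
```

Off-diagonal correlations of high magnitude would flag item pairs as candidates for a specific hypothesis of dependence, to be considered together with the substantive features of the items.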
This way of generating a hypothesis of response dependence between items is based on a post hoc analysis of data. However, these kinds of approaches involve general tests of fit, and if all items have some common feature, the misfit may not be detected (Smith, 2002). The power of detecting misfit with a specific hypothesis is greater than the power of a general test of fit. The test of a hypothesis of dependence is one such specific test which might show significance even though general tests of fit would indicate no or little violation of the model. Thus, an alternative, theoretically driven, method for generating a hypothesis of response dependence is from the structure and format of the items in relation to each other. The real data analyzed in the following are one such example where the hypothesis of response dependence was generated from the response structure and format and where response dependence is detected even though general tests of fit did not point to any response dependence.
The Gyagenda and Engelhard (2009) Data Set
The example used for illustration in the following involves a data set analyzed comprehensively by Gyagenda and Engelhard (2009), whose tests of fit did not indicate misfit to the Rasch model. However, as shown in this article, a specific test of the hypothesis of response dependence showed that the dependence was significantly different from zero.
The data set contains ratings of 366 eighth-grade students’ essays in the Georgia High School Writing Test. Each student was rated by 20 raters on four criteria: content/organization, style, conventions, and sentence formation. On each criterion, a rater assigned a rating of “inadequate,” scored 1; “minimal,” scored 2; “good,” scored 3; and “very good,” scored 4. Thus, in the notation of the PRM,
Gyagenda and Engelhard (2009) analyzed the data to explore rater, criterion, and student gender influences on writing ability. They used classical and modern test theory approaches in their analysis and compared the strengths and limitations of the two approaches.
Following the classical test theory approach, test reliability, indicated through alpha coefficients with the criteria as items, was .87. Interrater reliability, with the raters as items, was .99. Total and criterion scores were analyzed using ANOVA with rater and gender as explanatory variables. Results indicated statistically significant mean differences among raters. In addition, overall, female students scored higher than did males, resulting in a statistically significant gender difference. There was also a significant rater by gender interaction effect.
Following the modern test theory approach, they used the multifacet PRM with the proficiency of the person, the difficulty of each criterion, the severity of the rater, and student gender as facets (Linacre & Wright, 2002; Lunz, Wright, & Linacre, 1990). The person separation index of reliability, based on a single person estimate across all criteria, was 0.99. Results from this analysis also revealed effects of rater and gender on student writing. Despite intensive training, raters rated with different severity. This effect has often been reported and has led to the development of the multifacet Rasch model (e.g., Lunz et al., 1990), used in the analysis. Female students scored on average 1 logit higher than did males. This female superiority over males in writing performance is well documented (e.g., Hedges & Nowell, 1995).
General tests of fit indicated that all criteria and raters fitted the model. Gyagenda and Engelhard (2009) did not report any evidence of local dependence, in particular no suggestion of a possible halo effect. When individual raters display a halo effect but the majority do not, then those raters typically misfit the multifacet Rasch model (Myford & Wolfe, 2004). However, when there is a common, possibly a structural reason for the presence of halo, such tests of fit are unlikely to identify any halo effect (cf. Smith, 2002).
These data satisfy the two preliminary conditions for using the approach outlined in this article. First, because a halo effect has been well documented in cases where raters rate on multiple criteria (e.g., Myford & Wolfe, 2004), there is a basis for an a priori hypothesis of local dependence. Second, Gyagenda and Engelhard (2009) reported that the data fit the Rasch model reasonably well, which allows for the estimation of relative dependence among items.
In the next section, the data are reanalyzed and the possibility of response dependence between two of the four criteria is considered. In a comprehensive analysis, for example, for purposes of improving ratings, dependence between all the criteria would be investigated. However, because in this article the analysis is only an illustration of the methodology, a comprehensive analysis of their data is left for a later date.
Estimation of the Magnitude of Dependence d
Data were reanalyzed using the PRM of Equation 1 with the proficiency of the person and the difficulty of each rater–criterion combination as items. That is, the data were “racked” (Wright, 2003) with each person having ratings on 80 items (20 raters by 4 criteria). In Gyagenda and Engelhard (2009), the first rater was numbered 2 and the last 21, a numbering retained in this article. Thus, for any one person, Items 1 to 4 contained the ratings of Rater 2, Items 5 to 8 contained the ratings of Rater 3, Items 9 to 12 contained the ratings of Rater 4, and so on.
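The racked layout described above can be sketched as an index mapping. The sketch assumes that the ratings are held in an array indexed by person, rater, and criterion; the array, the function names, and the randomly generated ratings are hypothetical, while the numbering of raters from 2 to 21 and the ordering of items follow the description in the text.

```python
import numpy as np


def racked_item_number(rater, criterion, first_rater=2, n_criteria=4):
    """Item number (1-based) of a rater-criterion combination in the racked data."""
    return (rater - first_rater) * n_criteria + criterion


def rack(ratings):
    """Reshape (persons, raters, criteria) ratings to (persons, raters * criteria).

    Row-major reshaping keeps the criteria of one rater in adjacent columns.
    """
    ratings = np.asarray(ratings)
    n_persons, n_raters, n_criteria = ratings.shape
    return ratings.reshape(n_persons, n_raters * n_criteria)


# Hypothetical ratings: 366 persons, 20 raters, 4 criteria, scores 1..4.
rng = np.random.default_rng(1)
ratings = rng.integers(1, 5, size=(366, 20, 4))
racked = rack(ratings)

print(racked.shape)
print(racked_item_number(rater=2, criterion=1))   # first item of the first rater
print(racked_item_number(rater=21, criterion=4))  # last item of the last rater
```

Under this mapping, the items for Criterion 1 and Criterion 2 of any one rater are adjacent, which makes it straightforward to list the item pairs for a hypothesis of dependence between those two criteria.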
The specific hypothesis tested was of response dependence between Criterion 1 (content/organization) and Criterion 2 (style) ratings for each rater; that is, the hypothesis that Items 2, 6, 10, and so on were, respectively, dependent on Items 1, 5, 9, and so on. Each of the hypothesized dependent items was resolved into four distinct items, one for each category of the independent item. The independent item was deleted, as was the complete dependent item. Using the structure of Table 3, an estimate of
Estimates of
Note: All
It is clear from Table 6 that the estimates of d are all positive with the range of estimates being
Summary
Methods for identifying violations of local independence in unidimensional models of modern test theory have been considered in the literature. This article notes that there are two generic sources of violation of local independence; one results from multidimensionality, and the other from response dependence between pairs of items. In the presence of response dependence, which is the concern of this article, the response of one item governs the response of one or more items which are dependent on it.
The possibility of clarifying and understanding response dependence when using the Rasch model is helped by having a mathematical formulation of a process that creates such dependence, a formulation which has been specified in the literature only relatively recently (Marais & Andrich, 2008). Andrich and Kreiner (2010) took this formulation of response dependence and showed how, using the dichotomous Rasch model, an estimate of the magnitude of dependence between two dichotomous items could be summarized as a change in the difficulty of an item because of its dependence on another item.
This article generalized the procedure shown in Andrich and Kreiner (2010) to the case of response dependence between two polytomously scored items with the same maximum score. Specifically, response dependence was formulated as inducing a change in the range of the thresholds separating adjacent scores of the dependent item such that a given response in one item increases the probability of the same response in the dependent item. Second, the estimation procedure of the magnitude of dependence, which was also a generalization of Andrich and Kreiner (2010), was derived.
This procedure for estimating the magnitude of dependence was illustrated in two ways: first in a simulation study and, second, with real data. In the first study, data were simulated with and without dependence. A detailed illustrative analysis of the responses simulated to have dependence recovered the simulated value of dependence very accurately. In the case of no dependence, as expected, none of the estimates of dependence were statistically significantly different from 0.
The procedure was illustrated in a real data set that was analyzed comprehensively by Gyagenda and Engelhard (2009) using classical test theory and the PRM of modern test theory. The data consist of ratings of students’ writing on four criteria by 20 raters. General tests of fit, shown by Gyagenda and Engelhard, did not indicate general misfit to the PRM. For purposes of this article, an a priori hypothesis of response dependence of ratings of Criterion 2 on Criterion 1 for all 20 raters was tested. Statistically significant response dependence was diagnosed for all 20 raters. The significant magnitudes of response dependence ranged from 0.519 to 1.810 logits, a relatively large effect. This study provides an example where general tests of fit do not indicate misfit but where a specific test of a hypothesis of dependence shows significant misfit.
Although the discussion has focused on identifying and quantifying response dependence in terms of an increase in distance between thresholds defining scores, this article was not concerned with accounting for this dependence. There are a number of procedures in the literature for doing so, including forming a polytomous item by summing the dependent dichotomous items (Andrich, 1985; Kreiner, 2007; Kreiner & Christensen, 2007; Wang, Bradlow, & Wainer, 2002; Wilson & Adams, 1995). However, as noted in the introduction, the ideal is to have no response dependence. Therefore, identifying dependence is particularly important in studies when raters are being trained and where criteria and formats are clarified. The magnitude of the effect on the thresholds of an item because of its dependence on another item is very tangible evidence of the impact of dependence. The evidence in this article suggests that at least reexamining the intrinsic relationship between the first two criteria in the Gyagenda and Engelhard (2009) data set would be useful in understanding the dependence. In addition, response dependence among other criteria can be studied.
Footnotes
Tim Dunne and Josh McGrane read the paper and each made excellent suggestions. The data in the real example were provided by George Engelhard.
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The research reported in this article was supported in part by Pearson and in part by an Australian Research Council Linkage grant with the Curriculum Council of Western Australia, Pearson Research and Assessment, and the Australian National Ministerial Council on Employment, Education, Training and Youth Affairs’ Performance Measurement and Reporting Task Force as Industry Partners.
