Abstract
Local independence is a central assumption of commonly used item response theory models. Violations of this assumption are usually tested using test statistics based on item pairs. This study presents two quasi-exact tests based on the
Introduction
Many commonly used item response theory (IRT) models, such as the Rasch model (Rasch, 1960) and the more general two- and three-parameter logistic (2PL and 3PL) models (Birnbaum, 1968), assume that the probability of a positive response of a person to an item does not depend on this person’s response to any other item, conditional on the person’s ability. This assumption is generally known as the local independence assumption (De Ayala, 2009).
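Stated formally, local independence means that the joint probability of a person's response pattern factorizes into item-wise probabilities once ability is held fixed. The notation below is an assumption made for illustration and is not taken from the article: \theta_v is the ability of person v, X_{vi} is the response of person v to item i, and k is the number of items.

```latex
P(X_{v1}=x_{v1},\dots,X_{vk}=x_{vk}\mid\theta_v)
  \;=\; \prod_{i=1}^{k} P(X_{vi}=x_{vi}\mid\theta_v)
```

Tests for item pairs examine the consequence of this factorization for two items at a time.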
Violations of this assumption can be observed if learning effects, testlet effects (Wainer, Bradlow, & Wang, 2007), multidimensionality, and similar phenomena are not accounted for by the IRT model that is used to describe the data (Levy, Mislevy, & Sinharay, 2009). In practical evaluations of psychological tests, the assumption of local independence is usually tested for pairs of items (Kim, De Ayala, Ferdous, & Nering, 2011). For this purpose, numerous approaches have been described in the literature and evaluated in several simulation studies (Chen & Thissen, 1997; Edwards, Houts, & Cai, 2018; Kim et al., 2011; Yen, 1984), often testing the local independence assumption in the 2PL and 3PL models. For the special case of the Rasch model, practical suggestions have been presented by Koller and Hatzinger (2013).
This article aims at evaluating two new quasi-exact nonparametric methods for testing the local independence assumption for the Rasch model, which can also be applied to small data sets. The proposed methods can be seen as quasi-exact variations of the
Parametric Model Tests for Testing Local Independence in the Rasch Model
Numerous statistics have already been proposed to detect violations of local independence in the Rasch model. Among them, one may distinguish statistics that aim at detecting model violations on the level of the whole item set from statistics that aim at detecting violations for specific item groups or item pairs. Test statistics that aim at detecting violations of local independence on the level of the item set include the
Instead, this study focuses on statistics that are calculated for item pairs. Examples include the
Recent overviews of tests for the detection of local dependence were provided by Kim et al. (2011) and Edwards et al. (2018). Of these procedures, the authors will describe the
Chen and Thissen (1997) proposed two asymptotically equivalent statistics that allow testing the hypothesis that the distribution of the responses observed for each item pair does not differ significantly from that expected under a specific IRT model. Using the notation of Kim et al. (2011), let
and
Both of these statistics can be assumed to follow a
Another widely applied method for detecting local dependence is the
The practical application of the
Quasi-Exact Model Tests for the Rasch Model
Ponocny (2001) described a framework for nonparametric, quasi-exact tests for testing the Rasch model in small samples. This framework, which does not require the estimation of item or person parameters, was further evaluated by Koller and colleagues in several simulation studies (Koller & Hatzinger, 2013; Koller, Maier, & Hatzinger, 2015; Koller, Wiedermann, & Glück, 2015).
The general idea of this framework is to test the Rasch model against a more general IRT model, which can be described as an exponential family. A distinct feature of the proposed model tests is that they are uniformly most powerful tests for the Rasch model against these alternative models (Ponocny, 2001). The test consists of comparing the observed value of a statistic, whose form can be deduced from the IRT model against which the Rasch model is tested, with its distribution in a bootstrap sample of data matrices with the same marginal sums as the observed data set (i.e., the same number of positive responses in every row and every column). This comparison leads to the calculation of p values. A p value near 0 or 1 indicates a violation of the Rasch model.
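The logic of such a test can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure used in the study: the sampler uses simple 2 x 2 "checkerboard" swaps, which preserve all row and column sums, rather than the algorithm of Verhelst (2008). The statistic `equal_responses`, the example matrix `data`, and the parameters `n_samples`, `burn_in`, and `step` are all hypothetical choices made for this sketch.

```python
import random


def margins(matrix):
    """Return the row sums and column sums of a 0/1 response matrix."""
    rows = [sum(row) for row in matrix]
    cols = [sum(col) for col in zip(*matrix)]
    return rows, cols


def swap_step(matrix, rng):
    """Attempt one checkerboard swap in place; margins never change."""
    r1, r2 = rng.sample(range(len(matrix)), 2)
    c1, c2 = rng.sample(range(len(matrix[0])), 2)
    a, b = matrix[r1][c1], matrix[r1][c2]
    c, d = matrix[r2][c1], matrix[r2][c2]
    # Only the patterns [[1, 0], [0, 1]] and [[0, 1], [1, 0]] can be flipped.
    if a == d and b == c and a != b:
        matrix[r1][c1], matrix[r1][c2] = b, a
        matrix[r2][c1], matrix[r2][c2] = d, c


def equal_responses(matrix, i, j):
    """Illustrative statistic: persons with identical responses to items i and j."""
    return sum(1 for row in matrix if row[i] == row[j])


def monte_carlo_p(matrix, i, j, n_samples=200, burn_in=500, step=20, seed=1):
    """Share of sampled matrices whose statistic is at least the observed value."""
    rng = random.Random(seed)
    observed = equal_responses(matrix, i, j)
    current = [row[:] for row in matrix]  # work on a copy of the data
    for _ in range(burn_in):  # initial swaps whose matrices are discarded
        swap_step(current, rng)
    hits = 0
    for _ in range(n_samples):
        for _ in range(step):  # several swaps between two retained matrices
            swap_step(current, rng)
        if equal_responses(current, i, j) >= observed:
            hits += 1
    return hits / n_samples


# Hypothetical data: 6 persons (rows) by 4 items (columns).
data = [
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 1, 1, 0],
    [0, 0, 0, 1],
    [1, 1, 0, 1],
]
p_value = monte_carlo_p(data, 0, 1)
```

Because every sampled matrix shares the margins of the observed one, no item or person parameters have to be estimated; only the statistic is recomputed for each matrix.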
To obtain the bootstrap sample of response matrices, a Markov chain Monte Carlo (MCMC) algorithm developed by Verhelst (2008) is used. This algorithm is available in the software package eRm (Mair, Hatzinger, & Maier, 2018) for the R software environment (R Core Team, 2017). Starting from the original response matrix, it generates artificial response matrices with identical marginal sums in a stepwise procedure; for a more detailed description, see Koller and Hatzinger (2013). It is important that the generated response matrices are neither too similar to the original response matrix nor to each other, as either could bias the calculation of the p values. To this end, not all response matrices generated by the algorithm are used. Discarding the response matrices at the beginning of the sequence, known as the burn-in phase, avoids a high similarity with the original response matrix. After the burn-in phase, every
Ponocny (2001) and Koller et al. (2015) already described several test statistics for this nonparametric framework, which may be used to detect specific violations of the Rasch model. For testing against local independence, Ponocny (2001) described the
Koller and Hatzinger (2013) described another variation of the
Further simulation results were presented by Koller, Wiedermann, and Glück (2015), who investigated quasi-exact tests in the context of change measurement.
The authors now discuss how the nonparametric testing framework of Ponocny (2001) can be used for testing the local independence assumption based on the
Comparable to the aforementioned
Aim of This Study
This study aimed at comparing several methods for detecting violations of local independence with regard to their Type I error rate and their power against local dependence. Based on the review of the literature, the authors chose to evaluate the following statistics besides
Previous work on tests for detecting local dependence considered at least two types of violations of this assumption. The first type occurs when more than one latent ability is necessary to describe the observed person–item interactions (i.e., the items measure more than one trait); this is the case of multidimensionality. The second type can be expected if specific item pairs are unusually similar with regard to their content (e.g., solving the first item helps in solving the second one). Both types of model violations are closely related to the concepts of trait and response dependence, as proposed by Andrich and Kreiner (2010) for the Rasch model.
All test statistics were evaluated with regard to their power against these two forms of local dependence. In the following section, the design of the simulation studies used for this evaluation is discussed.
Simulation Studies
Overall, three simulation studies were carried out. The first evaluated the Type I error rate, the second the sensitivity to response dependence, and the third the sensitivity to testlet effects.
General Settings
The three simulation studies shared the following general characteristics:
Simulation Study I: Type I Error Rate
Aim of Simulation Study I
Simulation Study I aimed at investigating the Type I error rate of the various tests. Tests with a high Type I error rate typically also have higher power against model violations; therefore, the results of this investigation also affect the interpretation of the remaining simulation studies.
Design of Simulation Study I
In this simulation study, all data were generated according to the Rasch model. As is commonly known, this model uses the following item response function for describing the probability of a response of
For all test statistics, the rates of p values lower than 0.01 and 0.05 were evaluated across all item pairs.
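The data generation and the evaluation step can be sketched as follows. The sketch assumes the standard Rasch response function, in which the log-odds of a positive response equal the person's ability minus the item's difficulty. The standard normal distributions for abilities and difficulties, the sample sizes, and the list `example_p_values` are illustrative assumptions and not the settings or results of the study.

```python
import math
import random


def rasch_probability(theta, beta):
    """Probability of a positive response for ability theta and difficulty beta."""
    return 1.0 / (1.0 + math.exp(-(theta - beta)))


def simulate_rasch(n_persons, n_items, seed=1):
    """Draw a 0/1 response matrix (persons x items) under the Rasch model."""
    rng = random.Random(seed)
    thetas = [rng.gauss(0.0, 1.0) for _ in range(n_persons)]  # assumed N(0, 1)
    betas = [rng.gauss(0.0, 1.0) for _ in range(n_items)]  # assumed N(0, 1)
    return [
        [1 if rng.random() < rasch_probability(t, b) else 0 for b in betas]
        for t in thetas
    ]


def rejection_rate(p_values, alpha):
    """Share of item pairs whose p value falls below the level alpha."""
    return sum(1 for p in p_values if p < alpha) / len(p_values)


responses = simulate_rasch(n_persons=100, n_items=10)

# Made-up p values for four item pairs, only to show the computation.
example_p_values = [0.001, 0.02, 0.5, 0.9]
rate_01 = rejection_rate(example_p_values, 0.01)
rate_05 = rejection_rate(example_p_values, 0.05)
```

When the data follow the Rasch model, a well-calibrated test should produce rejection rates close to the chosen level.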
Results and conclusions of Simulation Study I
Overall, all tests were found to have Type I error rates at or below the nominal alpha level. For interpreting the sensitivity of statistical tests against various model violations, it is important to establish that the Type I error rate is not inflated. From this perspective, the results were satisfactory. For the
Simulation Study II: Response Dependence
Aim of Simulation Study II
Simulation Study II aimed at evaluating the power of the test statistics to detect violations of local independence on the level of single item pairs. This type of local dependence is commonly observed in the presence of learning effects, similar item content, and similar causes of model violations.
Design of Simulation Study II
This study simulated a violation of local independence by generating items with response dependence (Marais & Andrich, 2008). Similar designs were used in a number of studies investigating local dependence in the Rasch model (Andrich & Kreiner, 2010; Marais & Andrich, 2008). In the data generating model, there are item pairs that violate the assumption of local independence. Let
This model can be compared with a Rasch model in which the item parameter of item
Depending on the simulated condition,
There were two additional conditions in this simulation study, which differed in the number of item pairs for which local independence was violated. Depending on the simulated condition, the violation of the item parameters affected only one item pair
Results and conclusions of Simulation Study II
This type of model violation leads to a higher correlation between specific item pairs than is predicted by the Rasch model. Therefore, it is of central interest whether statistics such as
Overall, the nonparametric test based on
The authors conclude this section with results on the parametric bootstrap approach.
Simulation Study III: Rasch Testlet Model
Aim of Simulation Study III
As was already stated, the test statistics evaluated in this article aim at detecting local dependence on the level of item pairs. The presence of multidimensionality affects all item pairs to some degree and can therefore be regarded as a model violation that these statistics were not originally designed to detect. Such model violations are fairly common and include cases where groups of items measure an additional trait, as is also modeled in testlet models (Wainer et al., 2007). Therefore, the authors chose to investigate the power of all tests under a testlet model as well in Simulation Study III.
It seems important to note that multidimensionality was found to lead to a pattern of positive and negative local dependence in previous studies. A heuristic explanation for this observation was provided by Yen (1984), whereas a formal proof for the special case of normally distributed person ability parameters in multidimensional compensatory IRT models was provided by Habing and Roussos (2003). Therefore, the authors expected all test statistics to be sensitive against this model violation. Comparable simulation designs, which were based on multidimensional IRT models, were used in the studies of Edwards et al. (2018), Kim et al. (2011), and others for evaluating tests of local independence.
Design of Simulation Study III
This simulation study evaluated the power of the tests for local dependence in data sets generated from the Rasch testlet model (Wang & Wilson, 2005). The response function of this model is given by the following:
As can be seen from a comparison with Equation 1, this model contains an additional parameter
The data sets simulated in Simulation Study III contained five item testlets of equal size, which each contained one fifth of the overall item set. The corresponding five random effects
Results and conclusions of Simulation Study III
As this simulation study analyzed item sets that measured multiple traits, the authors could distinguish item pairs in which both items measured the same latent trait from item pairs in which the two items measured distinct traits. Again, the authors only present a summary here, whereas detailed results are found in the Online Appendix.
A comparison of the test results for the two different types of item pairs indicates that the p values tend to decrease in both types of item pairs for the tests based on
These effects can be explained by considering how these test statistics are calculated:
In statistics like
As was already the case for response dependence, the parametric bootstrap of Christensen et al. (2017) led to a lower rate of item pairs that were considered as critical than the other approaches. As in the previous studies, the results obtained for
The authors also found that the distinct behavior of the nonparametric tests in the data sets with trait dependence can be illustrated graphically. The first option is a QQ plot, which compares the distribution of the p values for all item pairs against a uniform distribution. A second option is a plot that presents the matrix of p values as a matrix of circles, in which p values near 1 are represented by full circles and p values near 0 by empty circles. Similar plots were presented by Sinharay (2005) and Bechger and Maris (2015). Figure 1 provides two examples for each of these plots, which were generated using the car (Fox & Weisberg, 2011) and corrplot (Wei & Simko, 2017) packages for R. The right-hand plots were generated based on the results of the

Graphical model tests (top row: graphical illustration of p values; bottom row: QQ plots) based on the patterns of p values for the
Empirical Data Analyses
In this section, the authors briefly illustrate the application of the quasi-exact
In this subtest, the first two items ask the children to estimate a number of black dots, which are presented separately for 2 s. In Items 3 and 4, the numbers in two sets of three-dimensional objects, which are presented for 5 s, should be estimated. The fifth item requires a quantitative comparison of the objects presented in Items 3 and 4 (Question: “Were there more . . .?”). Based on the item content, the authors further investigate these two item pairs

Graphical model tests (top row: graphical illustration of p values; bottom row: QQ plot) based on the patterns of p values for the
As a comparison with Figure 1 shows, these results may indicate multidimensionality in the data set. The authors do not investigate this hypothesis further.
Discussion
The results indicate that the proposed tests based on
Another distinct feature of
There are two additional points that should be further considered in practical application of these statistics: First, like the original
A second point concerns the detection of multidimensionality. As was reported in Simulation Study III, the p values of the tests based on
If local dependence has been detected on the level of specific item pairs, possible causes for the model violation should be examined (Kim et al., 2011). To address local dependence, the item set may be altered (e.g., by changing or removing items) or, to avoid the loss of information, an IRT model that allows the modeling of local dependence may be used. Examples of such models that are closely related to the Rasch model are the Rasch testlet model (Wang & Wilson, 2005), the multidimensional Rasch model, and the locally dependent Rasch model (Kelderman, 2007).
Software Used in This Study
All simulation studies were carried out in the R statistical software environment (R Core Team, 2017). The test statistics
Supplemental Material
Supplemental material, Appendix_PDF for Testing the Local Independence Assumption of the Rasch Model With Q3-Based Nonparametric Model Tests by Rudolf Debelak and Ingrid Koller in Applied Psychological Measurement
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
Supplemental Material
Supplemental material for this article is available online.
References
