Abstract
Local dependence (LD) refers to the violation of the local independence assumption of most item response models. Statistics that indicate LD between a pair of items on a test or questionnaire that is being fitted with an item response model can play a useful diagnostic role in applications of item response theory. In this article, a new score test statistic, Sb, based on the bifactor logistic model is described. To compare the performance of Sb with the score test statistic (St) based on the threshold shift model, and the LD X2 statistic, data were simulated under locally independent, bifactor, and threshold shift conditions. The results summarize the null distributions of all three diagnostic statistics, and their power for various degrees of bifactor and threshold shift LD. Future research directions are discussed, including the straightforward generalization of Sb for polytomous item response models, and the challenges involved in the corresponding generalizations of St and LD X2.
Item response theory (IRT) provides a collection of latent variable models and statistical procedures (including parameter estimation techniques, model evaluation methods and diagnostics) used for item analysis and test scoring (Thissen & Steinberg, 2009). One basic assumption of IRT models is local independence (LI), which requires that responses to different items are independent conditional on the latent variable of interest,
1
θ. Formally, LI implies that the probability of observing responses
Equation 1 formalizes the strong version of LI (McDonald, 1982); it serves as the basis for both estimating item parameters and scoring, where the likelihood function of the overall response pattern for a given θ is usually treated as though the contribution of each item is independent. Although strong LI is important, there is no reliable statistical procedure to test it, because it requires no conditional dependency among item responses in marginal subtables of any order. Consequently, only violations of pairwise independence in certain parametric forms, which are introduced in the next section, are tested in practice.
When pairwise local dependence (LD) is of interest, it is natural to distinguish between the positive and negative valence of LD. Positive LD indicates that positive association remains between the pair after the interdependency attributable to the latent variable(s) is taken into account. One frequently cited example of positive LD involves redundant items, that is, (nearly) the same question asked more than once in a test. Meanwhile, negative LD indicates negative conditional association, which is in general less common and more difficult to explain.
Existing Methods to Identify LD Models
Models
Two parametric models are considered in the current study for the purpose of identifying LD: the bifactor model (Gibbons & Hedeker, 1992) and a threshold shift model (Glas & Suárez Falcón, 2003). Figure 1 illustrates the two models as path diagrams. 2

Path diagrams: left, bifactor model; right, threshold shift model
The bifactor model is a special multidimensional factor analysis model (Gibbons & Hedeker, 1992; for the more general two-tier model, see Cai, 2010). The specification of the item bifactor model implies that an extra latent variable θ2 produces residual covariances for a subset of items when the data are fitted with a unidimensional model. A bifactor logistic model for dichotomous item pair p and q with responses xp, xq = {0, 1} is as follows:
in which ap and aq are the slope parameters on the primary latent dimension θ1, cp and cq are intercept parameters, and apq is the slope parameter on the secondary dimension θ2, representing LD.
To identify the model when the LD subset comprises only two items, some constraint must be imposed on the two secondary dimension slope parameters; a convenient restriction imposed in Equation 2 is equality of the absolute value of bifactor slopes. The positivity/negativity of the apqθ2 term in the second formula is determined by positive/negative LD to be modeled: A positive sign implies positive LD, and a negative sign implies negative LD. The locally independent two-parameter logistic (2PL) model is a special case of bifactor model obtained by setting apq = 0 in Equation 2.
The secondary factor loading can be understood analogously as representing error covariance in the linear-normal structural equation model: It can be proved (see Appendix A) that the LD bifactor model is equivalent to an error-covariance model if the indicator variables are normally distributed and the link function is identity. Another parameterization of a restricted bifactor model was given by Bradlow, Wainer, and Wang (1999), in which bifactor slopes are fixed to be equal to their corresponding primary slope, and the variance of the secondary factor is estimated.
The development of a threshold shift model may be traced to work by Kelderman (1984) and Jannarone (1986) on the Rasch model. It also appeared in work by Hoskens and De Boeck (1997), where it was called an “ordered-constant” interaction model, among four types of item response models using additional pairwise interaction parameters. Glas and Suárez Falcón (2003) derived a score test for this model’s interaction term (see also van der Linden & Glas, 2010).
In the 2PL case, for example, the trace line function of the second item in a locally dependent item pair p and q (p < q) may be written as follows:
where δ pq can be considered a threshold shift for the second response xq when the first item response xp is positive. In other words, endorsing the first item makes the second item less or more likely to be endorsed depending on δ pq being negative or positive. δ pq is also an ANOVA-like interaction term when the log odds of the pairwise response pattern is considered (Hoskens & De Boeck, 1997). Notice that if δ pq = 0, the threshold shift model reduces to the locally independent model.
In Appendix A, it is proved that for the linear-normal structural model, the threshold shift model with the shift parameter taken as a regression coefficient is equivalent to the error-covariance model, and thus to the bifactor model. However, in IRT where the indicator variables are Bernoulli distributed and the link function is logistic, the bifactor model and the threshold shift model are in general not exactly equivalent (with an exception of a degenerate 2PL case).
Apart from these models, there are other limited or full information parametric models for LD. For example, both LISREL and Mplus can include error covariances in a factor analytic model based on the polychoric correlation matrix of ordered categorical indicators (Muthén, 1978; see Wirth and Edwards, 2007, for a review). Braeken, Tuerlinckx, and De Boeck (2007) proposed to estimate the dependency parameter of the bivariate logistic distribution constructed with copula functions. Recently, Kim, De Ayala, Ferdous, and Nering (2011) compared the performance of 10 existing indices of LD under various simulated conditions. However, Kim et al.’s set of statistics did not overlap with those in the current study, 3 so no attempt is made to compare the results of this study with theirs. Interested readers are referred to their article for more information.
Statistical Procedures
Score test: Let
which is a special type of nested-model comparison in which the alternative model is reduced to the null after fixing one parameter to zero. The connection of this formal statistical test with the bifactor LD model can be established by setting η1 = apq and
The score test (Lehmann, 1999, also known as Lagrange multiplier test) statistic for testing Problem 4 is defined as
in which
To compute the LD score test statistics, partial derivatives of log-likelihood under each LD model will be derived in the next section. When LI is true (i.e., under H0), it can be proved (e.g., Lehmann, 1999) that
One advantage of the score test over other asymptotic tests, such as the Wald test and the likelihood ratio test, is that the computation of the test statistic does not rely on the MLE under the alternative, which would require model refitting. In LD detection, the hypothesis is usually tested for all pairs of items; the process of obtaining MLE for each pairwise LD model would be very time-consuming for long tests. Therefore, the score test is more efficient computationally.
Residual measures
Summary statistics of bivariate marginal residuals also provide evidence about the assumption of LI. One popular choice in this category is the LD X2 (Chen & Thissen, 1997) statistic. Taking the dichotomous case as an example, for each pair of items the 2× 2 tables as shown in Table 1 can be constructed for observed and expected probabilities; ps denote the observed cell proportions and πs the expected ones.
Bivariate Marginal Tables for Observed Proportions
Expected cell proportions can be calculated under the locally independent item response model:
in which ϕ(θ) is the probability density function, assumed to be N(0, 1). The LD X2 statistic is the Pearson’s X2, a quadratic form of the discrepancy of these observed and expected marginal probabilities (i.e., residuals):
The asymptotic null distribution for X2 is difficult to compute. Chen and Thissen suggested the use of
Other residual diagnostics have been proposed. For example, Yen’s (1984) Q3 is defined as the sample Pearson correlation between paired residuals. However, it has the same problem as LD X2, that the asymptotic distribution of the residual correlation is not clear under the null hypothesis. 4 Another residual diagnostic, the DIMTEST T statistic, was proposed by Stout (1987) for testing unidimensionality. As compared with Yen’s Q3, DIMTEST is a nonparametric procedure testing whether the average residual covariance in a given item subset is significantly larger than zero. Since all the procedures investigated in the current study are parametric, further details of the DIMTEST statistic are not included here.
The research described here investigates the feasibility and usefulness of a score test based on the bifactor model. Whether the performance of the LD score test depends on the choice of alternative models is also examined by comparing this statistic with the one based on the threshold shift alternative (Glas & Suárez Falcón, 2003) in various simulated conditions. LD X2 is also involved in the simulation study, which is intended to verify and extend the results of Chen and Thissen (1997). Because LD X2 statistics are reported by the computer software IRTPRO (Cai, Thissen, & du Toit, 2011) as item-pair level diagnostics, their performance would be of interest to many substantive users.
Theory of a Score Test
Derivatives in General Form
Let the response for each item be dichotomous (i.e., xij is 0 or 1). Bock (2010) provided, for the multidimensional dichotomous IRT model, a general form of the first derivatives of the marginal log-likelihood function log L(
where
In practice, however, this computation may become impossible for long tests because the number of operations to be performed increases as an exponential function of test length. One computationally feasible consistent estimate is the cross-product (XPD) approximation (Bock & Lieberman, 1970; Kendall & Stuart, 1961):
in which the sum is taken over all observed response patterns. Because XPD approximation is a consistent estimate of Fisher information, by Slutsky’s theorem (e.g., Lehmann, 1999) and Equation 6, an asymptotically equivalent version of score test statistic can be defined as
In Equations 9 and 11, only the first derivatives of trace line with respect to the parameters of the alternative model are involved when calculating the XPD estimates of information.
Derivatives for Multidimensional Logistic Model
Considering the bifactor LD model (and then 2PL model) as a special case of the multidimensional logistic model, the derivatives of the trace line function of the latter are needed. The trace line function has the following form:
in which
Derivatives for Threshold Shift Model
In the threshold shift LD model, the trace line for the second item in an item pair is given by Equation 3, and the corresponding first derivative with respect to δ pq is
The derivatives of item qs trace line with respect to its slope and intercept are in the same form of Equations 14 and 15, except that the multidimensional logistic trace line function is replaced by Equation 3. Partial derivatives of trace lines are inserted in Equations 9 and 11 to yield the gradient and XPD information, which are further used in Equation 12 to compute the score test statistic, St, based on the threshold shift model (Glas & Suárez Falcón, 2003).
A Special Issue With the Score Test for Bifactor Slopes
It was mentioned earlier that the bifactor LD model reduces to the locally independent 2PL model when the secondary slope is zero; however, the test statistic cannot be computed at exactly apq = 0. A detailed proof is provided in Appendix B that the score function is zero when apq is set to zero, which indicates a local maximum or saddle point. Figure 2 provides a visualization of this point by plotting the profile log-likelihood 6 as a function of apq. The data used to compute this profile log-likelihood were generated from a bifactor positive LD model. When the true model is used (i.e., positive LD model, solid line in Figure 2), apq = 0 is a saddle point; when the wrong model is used (i.e., negative LD model, dashed line of Figure 2), it is a local maximum.

Profile log-likelihood for bifactor model as a function of secondary slope
As a result, the null hypothesis apq = ε is used, with ε some small value (0.0001 in this research), and the score test statistic based on the bifactor model is computed as
Due to the symmetry (shown as the dotted line in Figure 2) of the likelihood function with respect to apq = 0, it is only necessary to consider apq > 0 when computing the score statistics for both positive and negative LD models. After the two statistics associated with these two models are obtained, the sign of score function at apq = ε is checked to determine which model to keep: The model with a positive score function value is retained; if both models produce negative values, then the LI model is taken as approximately correct.
Method
Simulation Design
To investigate the performance of LD statistics, dichotomous item response data were generated using R (R Development Core Team, 2010) from the locally independent 2PL model (null conditions), the threshold shift model (TS conditions), and the bifactor model (BF conditions), respectively.
For the null case, data were generated with 10, 25, and 50 items for the sample sizes of 200, 500, and 1,000 (see Table 2). Item parameters were sampled from the same distributions as Chen and Thissen (1997) used, which were believed to resemble the empirical distributions seen in practice. Specifically, the slopes were drawn from log-normal distribution loga ~ N(0, 0.5), and the thresholds (b = −c/a) were drawn from N(0, 1:5). The data generation procedure was replicated 1,000 times for each condition; LD statistics were extracted only for the first item pair of each replication.
Simulation Conditions
Note: TS = threshold shift model; BF = bifactor model. Design: No. of Items × Sample Sizes (× Strength).
To evaluate the power of statistics when LD is present, data were generated from both the threshold shift model and the bifactor model. Apart from sample size and number of items, there was also variation of the strength of LD, as shown in Table 2. The association parameters were generated on the factor loading scale, and then transformed to produce apq and δ pq parameters. Specifically, secondary factor loadings λ pq were generated from N(μλ, 0.01), where μλ = 0.3, 0.5, 0.7, indicating low, moderate, and strong LD. To obtain valid values for loadings and control the dispersion of item parameters, all these sampling distributions were truncated to the interval μλ − 0,2, μλ + 0,2.
The bifactor slopes can be transformed from loadings using the following exact formula (see Wirth & Edwards, 2007):
The transformation between δ pq and θ pq is not exact because no apparently equivalent factor analysis model is defined for the threshold shift model. One approximation is to use the result proved (in Appendix A) with continuous indicators:
where λ p is the primary slope for the first item in each LD pair.
Although negative LD does not have the same substantive interpretation as positive LD, neither of them is mathematically more important. Thus, under each simulated condition, 500 negative LD pairs were generated out of the total 1,000 replications by switching the sign of apq and δ pq to negative in the trace line of item q.
A modified version of the computational engine of the software IRTPRO was used for estimating item parameters and computing LD statistics.
Evaluation of LD Statistics
The distribution of the bifactor score test statistic Sb, the threshold shift score test statistic St, and LD X2 under the null hypothesis can be obtained by pooling over all replications within each cell. For the locally independent conditions, the empirical distributions are compared to the
The receiver operating characteristic (ROC) curves of all three statistics for all conditions are presented to provide information about the power of these tests. The horizontal axis of each ROC curve represents the false positive rate, or the alpha level in the setting of hypothesis testing; the vertical axis represents the true positive rate, which is the power of the statistics.
Results
Null Conditions
Several important quantiles (i.e., 0.25, 0.5, 0.75, 0.9, 0.95, and 0.99
8
) of the empirical distributions are tabulated for each condition of the simulation in Table 3; the corresponding quantiles of the
Empirical Quantiles of the Null Distribution of the Three Statistics
Note: LD = local dependence. The corresponding quantiles for
A general trend is that both Sb and St tend to be liberal, whereas LD X2 is conservative if the

Empirical CDF plots under the null conditions
From Figure 3, it can be concluded that
Threshold Shift Conditions
Figures 4 to 6 show the ROC curves for the three statistics with data generated by the threshold shift model.

ROC curves for weak TS condition (δ pq = 0.16)

ROC curves for medium TS condition (δ pq = 0.34)

ROC curves for strong TS condition (δ pq = 0.66)
In Figure 4, the results for weak threshold shift of only 0.16 standard units, show that there is very little difference between the empirical curves and the diagonal line of the ROC plots (i.e., random rejection of the null hypothesis) for any of the statistics, which indicates low power. Even though some power is gained with increasing sample size, it is no more than 0.2 with 1,000 observations if 0.05 is used as the nominal level.
Figure 5 shows that when δ pq = 0.34, there is, again, more power to detect LD in larger samples; the power is affected very little by test length. To illustrate, choosing 0.05 as the nominal level and only considering 50-item tests, the power of bifactor statistics is 0.11 for sample size 200, 0.28 for sample size 500, and 0.43 for sample size 1,000. The empirical ROC curves for all three statistics are very similar.
Figure 6 shows that as the threshold shift parameter increases to 0.66, the power of all statistics becomes higher (i.e., greater than 0.6) if sample size is 500 or 1,000, but remains low/moderate (i.e., about 0.25) for small samples.
To summarize, the ROC curves of all three statistics are roughly identical, which reflects their similar performance in all the TS conditions. The power increases when sample size increases, but remains about the same when test length changes.
Bifactor Conditions
Figures 7 to 9 show the ROC curves for the three statistics with data generated by bifactor LD model.

ROC curves for weak BF condition (apq = 0.54)

ROC curves for medium BF condition (apq = 0.96)

ROC curves for strong BF condition (apq = 1.67)
Figure 7 shows that when the impact of secondary factor is weak (apq = 0.54; error covariance is 0.09), the power is relatively low or moderate for all combinations of test lengths and sample sizes; the power at α = 0.05 ranges from (approximately) 0.1 to 0.3. However, the pattern of results is similar: Power increases as the number of observations increases and is not influenced very much by the number of items; all three statistics have nearly the same ROC curves.
Figure 8 shows that the power at nominal level 0.05 falls between 0.4 and 0.8 for all the statistics when apq = 0.96 on average (error covariance is 0.25), which is high compared with the medium TS conditions. Again, all three statistics yield similar results.
Figure 9 shows that when the residual association is strong (i.e., apq = 1.67, error covariance is 0.49), the power for all statistics is high for all conditions; especially with large samples, all three procedures have nearly perfect power to detect LD, which is shown by their extremely steep ROC curves.
In all, the results for BF conditions are similar to those obtained for TS conditions; the only difference is that all statistics seem to be more powerful. That may be attributed to the approximate formula (i.e., Equation 19) used to transform loadings to threshold shift parameters, which may lead to the unmatched LD levels between the threshold shift and bifactor generating models.
Discussion
For locally independent data, the empirical distributions of both score test statistics closely resemble
With respect to power, all three statistics exhibit similar patterns in terms of ROCs, whether a secondary factor or a threshold shift is the source of LD. Both score test statistics are very close numerically. It may suggest that the two mathematically distinct LD models are not empirically distinguishable; however, further investigation is required before any conclusion can be reached.
In summary, all three statistics considered in this study provide useful diagnostic information about LD. LD X2 is the easiest to compute; its power is comparable with the score statistics if its null distribution is known. In practice, however, unless resampling methods (e.g., jackknife or bootstrap) are invoked, it is in general not easy to approximate its null distribution. Both score statistics behave similarly in all the conditions considered in the present study; however, the computation of the bifactor statistic Sb requires numerically integrating over one more dimension, which makes it computationally more expensive than the threshold shift statistic St.
In future research, it would be interesting to generalize both score test statistics to polytomous IRT models, for example, the graded response model (Samejima, 1969). The generalization of bifactor statistic is straightforward; that involves adding another dimension to the original model. In contrast, because graded response model has more than one threshold, there are complications in generalizing the threshold shift statistic. Suppose the item pair has K response categories; one possibility is to incorporate different shift parameters for each of the K − 1 nonzero responses of the first item on each of the K − 1 thresholds of the second item, which leads the number of additional parameters to increase from 1 to (K − 1)2. It might be unwieldy to explain each of these shift parameters; nevertheless, it is acceptable to compute a multivariate score test statistic (i.e., with limiting distribution,
As one of the reviewers pointed out, further simulation studies considering multiple LD items should be done to investigate the performance of the proposed LD indices under more realistic circumstances. Following the paradigm of Kim et al. (2011) in this situation, the false-detection rates with respect to locally independent pairs in the test should also be incorporated as a criterion of evaluation. Nevertheless, it should be noticed that introducing multiple LD pairs (or subsets) blurs the boundary between the primary latent variable representing ability or trait and the secondary latent variables representing LD. Caution is required when interpreting simulation results of this kind.
Conclusions that can be drawn from the results of the current study are that (a) LD X2 is the easiest to compute, performs well, and has been implemented in commercial software; (b) score test statistic Sb based on the threshold shift model is also easy to compute, and works better than LD X2 with larger samples and shorter tests; (c) score test statistic St based on the bifactor model behaves as well as St—however, St is still preferred for 2PL model due to its computational efficiency; and (d) despite being distinct statistical models with different likelihood functions, the bifactor model and the threshold shift model are essentially indistinguishable as in asymptotic tests.
Footnotes
Appendix A
Appendix B
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
