Use of the Lagrange Multiplier Test for Assessing Measurement Invariance Under Model Misspecification

Abstract

This article studies the Type I error rates, false positive rates, and power of four versions of the Lagrange multiplier test for detecting measurement noninvariance in item response theory (IRT) models for binary data under model misspecification. The tests considered are the Lagrange multiplier test computed with the Hessian approach and with the cross-product approach, the generalized Lagrange multiplier test, and the generalized jackknife score test. The two model misspecifications considered are local dependence among items and a nonnormal distribution of the latent variable. The power of the tests is computed in two ways: empirically, through Monte Carlo simulation, and asymptotically, using the asymptotic distribution of each test under the alternative hypothesis. The performance of these tests is evaluated by means of a simulation study. The results show that, under mild model misspecification, all tests perform well, whereas under strong model misspecification their performance deteriorates, especially in terms of false positive rates under local dependence and of power for small sample sizes under misspecification of the latent variable distribution. In general, the Lagrange multiplier test computed with the Hessian approach and the generalized Lagrange multiplier test perform better in terms of false positive rates, while the Lagrange multiplier test computed with the cross-product approach has the highest power for small sample sizes. The asymptotic power turns out to be a good alternative to the classic empirical power because it is less time consuming to compute. The Lagrange multiplier tests studied here are also applied to a real data set.

Keywords

MIMIC models, binary data, generalized Lagrange multiplier test

Introduction

Item response theory (IRT) models are used in psychological and educational research to measure unobserved constructs, also known as factors or latent variables, from correlated observed variables (items). The main assumptions and features of an IRT model are (1) local independence among items conditional on the latent variable(s); (2) a usually parametric model for the probability of responding “correctly/positively” to an item given the latent variable(s), also known as the response category probability or item characteristic curve (ICC); and (3) a normal distribution for the latent variable(s) (Bartholomew et al., 2011). As with any statistical model, some of these assumptions may be violated. The likelihood-ratio, Wald, and Lagrange multiplier or score (LM) test statistics (Cox & Hinkley, 1979) can be used to check model fit, and they are asymptotically equivalent. Unlike the likelihood-ratio and Wald tests, the LM test requires only the restricted estimator (i.e., the model under the null hypothesis). The LM test can therefore be very convenient in IRT models, where multiple model violations (e.g., local dependence, nonnormality of the latent distribution) can occur (Fox & Glas, 2005), because it does not need an alternative model to be estimated for each of these violations. Moreover, some model violations, such as differential item functioning (DIF), require testing items sequentially (Glas, 1998). The LM test does not require new parameter estimates for every tested item, which makes it computationally less intensive, especially in long tests. For these reasons, the LM test is used in IRT to detect DIF (Fox & Glas, 2005; Glas, 1998), local dependence (LD) (Fox & Glas, 2005; Glas, 1999; Glas & Falcón, 2003; Kim et al., 2011; Liu & Maydeu-Olivares, 2013; Liu & Thissen, 2012, 2014; Oberski et al., 2013; van der Linden & Glas, 2010), and deviation from the parametric model (i.e., the ICC) (Glas, 1999; Glas & Falcón, 2003; Ranger & Kuhn, 2012).

The LM test depends on the Fisher information matrix. Different approximations of this matrix lead to different test performances. Accurate results for the LM test can be obtained by considering the expected Hessian and cross-product matrix, as shown in Liu and Maydeu-Olivares (2013), but they are unfeasible in long tests. For this reason, the observed versions of these matrices are preferred for the computation of the LM test. Some authors (Glas, 1998; Oberski et al., 2013) use the observed Hessian matrix, which we denote with LM(H), and others (Liu and Maydeu-Olivares, 2013; Liu & Thissen, 2012, 2014) the observed cross-product matrix, which we denote with LM(CP). Falk and Monroe (2018) compare both approaches. The LM(CP) test shows more inflated Type I error rates than the LM(H) test, especially with long tests and small sample size, but it is fast to compute (Falk & Monroe, 2018; Liu & Maydeu-Olivares, 2013; Liu & Thissen, 2012, 2014). In some works, the LM test statistic is applied in the case of model misspecification under the null and the alternative hypotheses, showing a good performance when the amount of model misspecification is overall small (Falk & Monroe, 2018; Glas & Falcón, 2003; Guastadisegni et al., in press). Different versions of the LM test are also derived under model misspecification (Boos, 1992; White, 1982). White (1982) proposes the generalized Lagrange multiplier (LM(S)) test, whose expression involves the sandwich variance and covariance matrix. Similarly Boos (1992) derives a generalized score (GS) test for least squares, robust M-estimation, and quasi-likelihood estimation methods that is equivalent to the LM(S) test when maximum likelihood (ML)-based methods are used. The generalized jackknife score (GS(J)) test is a version of the GS test, derived under model misspecification, where the covariance matrix of the score is computed using the jackknife estimates (J. Rao et al., 1998). 
The GS(J) test has not been studied in the IRT context. As far as we know, the LM(S) test has been studied only by Falk and Monroe (2018) and Guastadisegni et al. (in press). Falk and Monroe (2018) compare the performance of the LM(S), LM(CP), and LM(H) tests for a single omitted cross-loading, and Guastadisegni et al. (in press) compute the empirical and asymptotic power of the LM(S) and LM(H) tests to assess measurement invariance under misspecification of the latent variable distribution, without studying the Type I error or false positive rates of these two tests. Unlike these works, we assess measurement invariance in a more general framework, where the model misspecification is due to local dependence among items and to different nonnormal latent variable distributions.

In the case of a one-factor model, an item is measurement invariant if the conditional distribution of the item given the latent variable is independent of group membership, identified by an external group variable (e.g., sex, age, country) (Mellenbergh, 1982, 1983). An item is measurement noninvariant (i.e., it exhibits DIF) if it measures different abilities in different groups. In this case, the expected score of the item differs across the subgroups for the same level of the latent variable. Measurement invariance can be studied either in a multiple-group analysis setup (Jöreskog, 1971) or with the multiple indicator multiple causes (MIMIC) model (Jöreskog & Goldberger, 1975). The MIMIC model allows direct and indirect effects of a binary group covariate on the probability of giving a “correct/positive” response to an item and on the latent variable, respectively.

The contribution of this article is twofold. First, we assess item measurement invariance under model misspecification, using four versions of the LM test. The four versions differ in the form of the covariance matrix of the estimators. Mainly, the Hessian estimator (LM(H)), the cross-product estimator (LM(CP)), the sandwich estimator (LM(S)), and the jackknife estimator (GS(J)) are discussed and studied here. Second, we compute the power of the LM(H), LM(CP), and LM(S) tests in two ways, empirically through Monte Carlo simulation methods and asymptotically using the distribution of each test under the alternative hypothesis, which depends on a noncentrality parameter often difficult to compute (Gudicha et al., 2017). The noncentrality parameter is approximated using the procedure derived by Gudicha et al. (2017) for the Wald and likelihood-ratio tests and it is applied in Guastadisegni et al. (in press) to the LM(H) and LM(S) tests under misspecification of the latent variable distribution. We extend this method to the case of local dependence and to the LM(CP) test.

Through an extensive simulation study, we compare the performance of the different versions of the LM test in terms of Type I error rate, false positive rate, and empirical and asymptotic power, varying the type and level of misspecification and considering single and multiple parameter hypothesis tests for measurement invariance. Moreover, we illustrate the use of these tests on a real data set.

The article is organized as follows. First, we present the MIMIC model with covariate effects. Second, we describe the four versions of the LM tests and the procedure to estimate the asymptotic power for the LM(H), LM(CP), and LM(S) tests. Next, we present a Monte Carlo simulation study and the results from the real data analysis. Finally, some concluding remarks are presented and discussed.

The MIMIC Model for Binary Data

Let us denote by $y_{1}, \dots, y_{p}$ a set of observed binary variables/items, by $z$ the latent variable, and by $x$ a binary variable such as sex, country, or any other group variable. Given $n$ individuals, the $i$th subject belongs to the focal group when $x_{i} = 1$ and to the reference group when $x_{i} = 0$. To test the measurement invariance of one or more items, we consider the MIMIC model with the group variable $x$ affecting both the item(s) $y$ and the latent variable $z$. Group differences can be present only on the item intercept (uniform DIF) or simultaneously on the item intercept and slope (nonuniform DIF) (Fox & Glas, 2005; Glas, 1998). The response probability of the $i$th individual to the $j$th item is modelled with a logistic model (measurement model), and the latent variable follows a linear model (structural model); together they are defined by:

\begin{matrix} P (y_{ij} = 1 | z_{i}, x_{i}) = π_{ij} (z_{i}, x_{i}) = \frac{\exp (α_{0 j} + α_{1 j} z_{i} + γ_{1 j} x_{i} + γ_{2 j} x_{i} z_{i})}{1 + \exp (α_{0 j} + α_{1 j} z_{i} + γ_{1 j} x_{i} + γ_{2 j} x_{i} z_{i})} \\ z_{i} = β x_{i} + ϵ_{i}, \quad ϵ_{i} \sim N (0, 1) \end{matrix}

(1)

where $i = 1, \dots, n$ and $j = 1, \dots, p$. Under nonuniform DIF, the intercept and factor loading parameters are $(α_{0 j}, α_{1 j})$ for the reference group and $(α_{0 j} + γ_{1 j}, α_{1 j} + γ_{2 j})$ for the focal group (Glas, 1998). The parameter $β$ allows the mean of the latent variable $z$ to differ between the two groups, while the distribution of $z$ is fixed to $N (0, 1)$ in the reference group for identification purposes. For a random sample of size $n$, the log-likelihood is:

l (y, θ) = \sum_{i = 1}^{n} \ln f (y_{i}, θ) = \sum_{i = 1}^{n} \ln \int \prod_{j = 1}^{p} π_{ij} (z_{i}, x_{i})^{y_{ij}} (1 - π_{ij} (z_{i}, x_{i}))^{1 - y_{ij}} ϕ (z_{i} | x_{i}) \, d z_{i},

(2)

where $θ$ is the vector of the unknown parameters and the model assumes conditional/local independence among the items. Equation (2) is maximized using either an expectation–maximization (EM) algorithm (Bock & Aitkin, 1981) or a direct maximization, such as the Newton–Raphson algorithm (Skrondal & Rabe-Hesketh, 2004).
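As a rough illustration of Equations (1) and (2), the following Python sketch simulates responses from the MIMIC model and evaluates the marginal log-likelihood with a simple rectangle rule over a grid of $z$ values. The parameter values, the integration grid, and the function names (`simulate_mimic`, `marginal_loglik`) are illustrative assumptions; they are not the settings or the estimation code used in this article.

```python
import math
import random

def expit(t):
    return 1.0 / (1.0 + math.exp(-t))

def item_prob(z, x, a0, a1, g1, g2):
    """Response probability pi_ij(z, x) from the measurement model in Equation (1)."""
    return expit(a0 + a1 * z + g1 * x + g2 * x * z)

def simulate_mimic(n, alpha0, alpha1, gamma1, gamma2, beta, p_focal, rng):
    """Draw (x_i, y_i) for n subjects, with z_i = beta * x_i + eps_i and eps_i ~ N(0, 1)."""
    data = []
    for _ in range(n):
        x = 1 if rng.random() < p_focal else 0
        z = beta * x + rng.gauss(0.0, 1.0)
        y = [1 if rng.random() < item_prob(z, x, a0, a1, g1, g2) else 0
             for a0, a1, g1, g2 in zip(alpha0, alpha1, gamma1, gamma2)]
        data.append((x, y))
    return data

def marginal_loglik(data, alpha0, alpha1, gamma1, gamma2, beta,
                    half_width=6.0, n_grid=121):
    """Equation (2): integrate z out numerically (rectangle rule, assumed grid)."""
    step = 2.0 * half_width / (n_grid - 1)
    total = 0.0
    for x, y in data:
        mean_z = beta * x  # z | x ~ N(beta * x, 1)
        integral = 0.0
        for k in range(n_grid):
            z = mean_z - half_width + k * step
            dens = math.exp(-0.5 * (z - mean_z) ** 2) / math.sqrt(2.0 * math.pi)
            lik = 1.0
            for yj, a0, a1, g1, g2 in zip(y, alpha0, alpha1, gamma1, gamma2):
                pr = item_prob(z, x, a0, a1, g1, g2)
                lik *= pr if yj == 1 else 1.0 - pr
            integral += lik * dens * step
        total += math.log(integral)
    return total

# Hypothetical parameter values: four items, uniform DIF on the last item only.
rng = random.Random(1)
alpha0 = [0.0, 0.5, -0.5, 0.2]
alpha1 = [1.0, 1.2, 0.8, 1.5]
gamma1 = [0.0, 0.0, 0.0, 0.7]
gamma2 = [0.0, 0.0, 0.0, 0.0]
data = simulate_mimic(200, alpha0, alpha1, gamma1, gamma2,
                      beta=0.9, p_focal=0.7, rng=rng)
loglik = marginal_loglik(data, alpha0, alpha1, gamma1, gamma2, beta=0.9)
print(round(loglik, 2))
```

In practice a Gauss–Hermite rule would normally replace the rectangle rule; the grid is used here only to keep the sketch dependency-free.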

Uniform and nonuniform DIF for an item $y_{j}$ are assessed by testing the statistical significance of the parameters $γ_{1 j}$ and $(γ_{1 j}, γ_{2 j})$, respectively. We consider situations where the parameters $γ_{1 j}$ or $(γ_{1 j}, γ_{2 j})$ are fixed to zero and to constants different from zero under the null hypothesis. Moreover, the performance of the LM tests is assessed under violations of local independence and of normality of the latent variable distribution.

Lagrange Multiplier Tests

The Classical Lagrange Multiplier Test

The LM test (C. R. Rao, 1948) evaluates the statistical significance of imposed restrictions on model parameters. We consider a sample $y_{1}, \dots, y_{n}$ from a model $f (y, θ)$. The true parameter vector is denoted by $θ_{0}$. Let $θ_{0}$ be divided into two sub-vectors, $θ_{0}^{'} = (θ_{01}^{'}, θ_{02}^{'})$. The sub-vector $θ_{01}$ includes the intercept parameters ($α_{0 j}, j = 1, \dots, p$) and factor regression coefficients ($α_{1 j}, j = 1, \dots, p$). When uniform DIF is assessed, $θ_{02}$ includes the parameters $γ_{1 j}$, and when nonuniform DIF is assessed, $θ_{02}$ includes $γ_{1 j}$ and $γ_{2 j}$, where $j = 1, \dots, p$. The hypotheses $H_{0}$ and $H_{1}$ can be formalized as follows:

H_{0} : θ_{02}^{'} = c vs H_{1} : θ_{02}^{'} \neq c,

(3)

where c is a vector of constants.

The LM statistic is (C. R. Rao, 1948):

LM = S (\tilde{θ})' A_{n} (\tilde{θ})^{- 1} S (\tilde{θ}),

(4)

where $\tilde{θ}' = ({\tilde{θ}}_{1}^{'}, c)$ denotes the restricted maximum likelihood estimate of the parameters $θ$, $S (\tilde{θ}) = \frac{\partial l (y, θ)}{\partial θ}$ is the score vector evaluated at $\tilde{θ}$, and $A_{n} (\tilde{θ}) = - E [\frac{\partial^{2} l (y, θ)}{\partial θ \partial θ'}]$ is the Fisher information matrix evaluated at $\tilde{θ}$. Because the part of the score vector corresponding to $θ_{01}$ is $0$ at the restricted estimate $\tilde{θ}$, the LM statistic in Equation (4) reduces to

LM = S_{2} (\tilde{θ})' A_{n}^{22} (\tilde{θ})^{- 1} S_{2} (\tilde{θ}),

(5)

where $S_{2} (\tilde{θ})$ is a subset of $S (\tilde{θ})$ that corresponds to the parameters $θ_{02}$ evaluated at $\tilde{θ}$ and $A_{n}^{22} (\tilde{θ})$ is a block of the partitioned Fisher information matrix computed as (Engle, 1984)

A_{n}^{22} = A_{n 22} - A_{n 21} A_{n 11}^{- 1} A_{n 12},

(6)

and evaluated at $\tilde{θ}$ . The partition of $A_{n}$ into $A_{n 22}, A_{n 21}, A_{n 11}, A_{n 12}$ is derived from the partition of $θ_{0}^{'}$ into $(θ_{01}^{'}, θ_{02}^{'})$ .

Two different versions of the LM test are studied here depending on which matrix is used for estimating $A_{n} (\tilde{θ})$ . The Hessian approach (LM(H)), uses the observed Hessian matrix given by

{\hat{A}}_{n} (θ) = - \sum_{i = 1}^{n} \frac{\partial^{2} l_{i} (y_{i}, θ)}{\partial θ \partial θ'}

(7)

whereas the cross-product approach (LM(CP)) uses the observed cross-product matrix

{\hat{B}}_{n} (θ) = \sum_{i = 1}^{n} \frac{\partial l_{i} (y_{i}, θ)}{\partial θ} \frac{\partial l_{i} (y_{i}, θ)}{\partial θ'}

(8)

Under correct model specification, the expectations of ${\hat{A}}_{n} (θ)$ and ${\hat{B}}_{n} (θ)$ coincide (information matrix equality; White, 1982), so the LM(H) and LM(CP) tests are asymptotically equivalent.

Under a correctly specified likelihood and under $H_{0}$ , the LM test statistic, computed with the Hessian and cross-product approaches, is asymptotically distributed as a $χ_{r}^{2}$ , with degrees of freedom ( $r$ ) equal to the dimension of $θ_{02}$ .
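To make Equations (5) to (8) concrete, the sketch below computes LM(H) and LM(CP) in a deliberately simplified setting: a logistic regression of a single binary item on the group variable, with no latent variable, testing that the group slope is zero. This toy model is an assumption chosen because the restricted estimate has a closed form; it is not the MIMIC model of this article, and the function names are hypothetical.

```python
import random

def schur_22(m11, m12, m22):
    """Equation (6) with scalar blocks: M^{22} = M_22 - M_21 * M_11^{-1} * M_12."""
    return m22 - m12 * m12 / m11

def lm_h_and_cp(x, y):
    """LM(H) and LM(CP) for H0: g = 0 in logit P(y = 1) = a + g * x (toy model)."""
    n = len(y)
    p = sum(y) / n                       # restricted ML fit: intercept-only model
    resid = [yi - p for yi in y]         # per-observation score for the intercept
    s2 = sum(xi * r for xi, r in zip(x, resid))   # score for the tested slope
    w = p * (1.0 - p)
    # Observed Hessian (Equation 7), blocks ordered as (intercept, slope).
    a11, a12, a22 = n * w, w * sum(x), w * sum(xi * xi for xi in x)
    # Observed cross-product matrix (Equation 8).
    b11 = sum(r * r for r in resid)
    b12 = sum(r * r * xi for xi, r in zip(x, resid))
    b22 = sum(r * r * xi * xi for xi, r in zip(x, resid))
    lm_h = s2 * s2 / schur_22(a11, a12, a22)
    lm_cp = s2 * s2 / schur_22(b11, b12, b22)
    return lm_h, lm_cp

# Small data set that can be checked by hand.
print(lm_h_and_cp([0, 0, 1, 1], [0, 1, 1, 1]))

# Data simulated under H0 (no group effect on the item); values are illustrative.
rng = random.Random(7)
x_sim = [1 if rng.random() < 0.7 else 0 for _ in range(500)]
y_sim = [1 if rng.random() < 0.4 else 0 for _ in range(500)]
lm_h_sim, lm_cp_sim = lm_h_and_cp(x_sim, y_sim)
print(lm_h_sim, lm_cp_sim)
```

Both statistics share the same score in the numerator and differ only in the information estimate, which is why their finite-sample behaviour can diverge even when the model is correct.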

To compute the local asymptotic power of the LM test, a standard approach is to consider a set of local alternatives close to the null value for large $n$, $H_{1} : θ_{02} = c + \frac{ξ}{\sqrt{n}}$, where $ξ$ is an arbitrary vector with the same dimension as $θ_{02}$ (Boos & Stefanski, 2013). When the model defined under $H_{1}$ is true, the LM test statistic is asymptotically distributed as a noncentral chi-square that depends on two parameters, namely the degrees of freedom (equal to the dimension of $θ_{02}$) and a noncentrality parameter $λ$ given by (Cox & Hinkley, 1979):

λ = \frac{1}{n} ξ' A_{n}^{22} (θ_{0}) ξ

(9)

The asymptotic power is computed as $P (χ_{r}^{2} (λ) > χ_{r, 1 - α}^{2})$, where $χ_{r, 1 - α}^{2}$ is the $(1 - α)$ quantile of the central $χ_{r}^{2}$ distribution.

Approximation Procedure for the Asymptotic Power

The asymptotic distribution of the LM test as a non-central chi-square with noncentrality parameter in equation (9) holds when the model defined under the set of local alternatives is true, that is, when the model under the null hypothesis is barely incorrect for large $n$ (see Agresti, 2002; Reiser, 2008). In practice, it is often reasonable to adopt an alternative hypothesis for fixed and finite $n$ (Agresti, 2002), as $H_{1} : θ_{02} = c + ξ$ , or to use hypotheses as in (3) (Gudicha et al., 2017). Here, we consider the approximation procedure for the asymptotic power derived by Gudicha et al. (2017) for the likelihood-ratio and the Wald tests. This procedure is extended to the LM(H) test in Guastadisegni et al. (in press). The method can also be used for the LM(CP) test and can be summarized in the following steps:

1. From the model defined under the alternative hypothesis, create a large data set (e.g., $N = 10000$ observations).

2. Fit the model under $H_{0}$ to the data generated in Step 1.

3. Take the value of the LM(H) or LM(CP) statistic as the estimate of the noncentrality parameter $λ$ (Bollen, 1989; Satorra, 1989).

4. Compute the noncentrality parameter for a sample of size 1 as $λ_{1} = \frac{λ}{N}$.

5. Compute the noncentrality parameter for a sample of size $n$ as $λ_{n} = n λ_{1}$.

6. Determine the asymptotic power of the LM(H) or LM(CP) test by comparing the $λ_{n}$ obtained in Step 5 with the tabled values of the noncentral chi-square with $df$ equal to the number of parameters constrained under $H_{0}$ and significance level $α$ (Bollen, 1989).
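The last three steps of the procedure can be sketched in Python for the single-parameter case ($df = 1$), where the noncentral chi-square tail probability has a closed form through the standard normal distribution, so no table lookup is needed. The value of the LM statistic from the large data set (`lm_large = 150.0`) is a made-up number for illustration, and the function names are hypothetical; for $df > 1$ a noncentral chi-square routine from a statistical library would be required.

```python
from statistics import NormalDist

def power_df1(lam, alpha=0.05):
    """P(chi2_1(lam) > central chi2_1 critical value), using chi2_1(lam) = (Z + sqrt(lam))^2."""
    std = NormalDist()
    crit_root = std.inv_cdf(1.0 - alpha / 2.0)  # square root of the central chi2_1 quantile
    root = lam ** 0.5
    return (1.0 - std.cdf(crit_root - root)) + std.cdf(-crit_root - root)

def asymptotic_power(lm_large, n_large, n, alpha=0.05):
    """Rescale the noncentrality estimate from the large data set and return the power."""
    lam_1 = lm_large / n_large       # noncentrality for a sample of size 1
    lam_n = n * lam_1                # noncentrality for a sample of size n
    return power_df1(lam_n, alpha)   # power at level alpha, df = 1

# Hypothetical LM statistic obtained from a data set of N = 10000 observations.
lm_large, n_large = 150.0, 10000
for n in (200, 500, 1000):
    print(n, round(asymptotic_power(lm_large, n_large, n), 3))
```

Because only one large data set has to be generated and fitted, the power for any sample size $n$ follows from a rescaling of the same noncentrality estimate.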

The Generalized Lagrange Multiplier Test

Consider a sample $y_{1}, \dots, y_{n}$ from a model with true density $g (y)$ that exhibits either local dependence among the items or a nonnormal distribution of the latent variable. The model with density $f (y; θ)$, which assumes both local independence among the items and a normal distribution of the latent variable, is erroneously taken to be the true model for the data and is used for ML analysis. If assumptions A1 to A6 of White (1982, pp. 2-6), which ensure the existence, consistency, asymptotic normality, and identifiability of the quasi-ML estimator, are fulfilled, the parameter vector ${\hat{θ}}_{n}$ that maximizes the log-likelihood function based on model $f (y; θ)$ converges in probability to $θ_{*}$, the parameter vector that minimizes the Kullback–Leibler information criterion. Moreover, the covariance matrix of ${\hat{θ}}_{n}$, based on $n$ observations, is estimated by the so-called sandwich estimator ${\hat{C}}_{n} ({\hat{θ}}_{n}) = {\hat{A}}_{n}^{- 1} ({\hat{θ}}_{n}) {\hat{B}}_{n} ({\hat{θ}}_{n}) {\hat{A}}_{n}^{- 1} ({\hat{θ}}_{n})$, where the matrices ${\hat{A}}_{n}$ and ${\hat{B}}_{n}$ are the observed Hessian matrix and the observed cross-product matrix defined in Formulas (7) and (8), respectively, evaluated at ${\hat{θ}}_{n}$.

Under model misspecification, the null and the alternative hypotheses are specified in terms of $θ_{*}$. Let $θ_{*}$ be divided into two subvectors, $θ_{*}^{'} = (θ_{* 1}^{'}, θ_{* 2}^{'})$. To test for uniform and nonuniform DIF, the parameters in $θ_{* 1}$ and $θ_{* 2}$ are grouped as in The Classical Lagrange Multiplier Test section. The hypotheses in (3) become:

H_{0} : θ_{* 2}^{'} = c vs H_{1} : θ_{* 2}^{'} \neq c,

(10)

where c is a vector of constants.

The Generalized Lagrange Multiplier test is defined as (Engle, 1984; White, 1982):

LM (S) = S_{2} ({\tilde{θ}}_{n})' {\hat{A}}_{n}^{22} ({\tilde{θ}}_{n})^{- 1} {\hat{C}}_{n 22} ({\tilde{θ}}_{n})^{- 1} {\hat{A}}_{n}^{22} ({\tilde{θ}}_{n})^{- 1} S_{2} ({\tilde{θ}}_{n}),

(11)

where ${\hat{A}}_{n}^{22} ({\tilde{θ}}_{n})$ is computed as in (6), replacing $A_{n}$ with ${\hat{A}}_{n}$ evaluated at ${\tilde{θ}}_{n}$, and ${\hat{C}}_{n 22} ({\tilde{θ}}_{n})$ is the block of the matrix ${\hat{C}}_{n}$ corresponding to $θ_{* 2}$, evaluated at ${\tilde{θ}}_{n}$. Under $H_{0}$, LM(S) is asymptotically distributed as a $χ_{r}^{2}$, with degrees of freedom $r$ equal to the dimension of $θ_{* 2}$. If the model is correctly specified, the LM(S) statistic is asymptotically equivalent to the LM test computed with either the Hessian or the cross-product approach (White, 1982).
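Formula (11) can be illustrated with a minimal Python sketch for the smallest possible case: one free parameter and one tested parameter, so that every block is a scalar and the full matrices are $2 \times 2$. The numerical values of the score and of the two matrices are hypothetical, and the helper names are assumptions made for this example only.

```python
def inv_2x2(m):
    """Inverse of a 2 x 2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def matmul_2x2(m, k):
    """Product of two 2 x 2 matrices."""
    return [[sum(m[i][t] * k[t][j] for t in range(2)) for j in range(2)]
            for i in range(2)]

def lm_s(s2, a_hat, b_hat):
    """Formula (11) when theta_1 and theta_2 are both scalars."""
    a_schur = a_hat[1][1] - a_hat[1][0] * a_hat[0][1] / a_hat[0][0]  # Equation (6)
    a_inv = inv_2x2(a_hat)
    c_hat = matmul_2x2(matmul_2x2(a_inv, b_hat), a_inv)  # sandwich A^{-1} B A^{-1}
    c22 = c_hat[1][1]                                    # block for the tested parameter
    return s2 * (1.0 / a_schur) * (1.0 / c22) * (1.0 / a_schur) * s2

# Hypothetical observed Hessian, cross-product matrix, and score.
a_hat = [[4.0, 1.0], [1.0, 2.0]]
b_hat = [[5.0, 1.0], [1.0, 3.0]]
print(lm_s(1.5, a_hat, b_hat))

# When the two matrices coincide, LM(S) collapses to the Hessian-based LM statistic.
a_schur = a_hat[1][1] - a_hat[1][0] * a_hat[0][1] / a_hat[0][0]
print(lm_s(1.5, a_hat, a_hat), 1.5 ** 2 / a_schur)
```

The second call shows numerically why the sandwich correction has no effect when the Hessian and cross-product matrices agree.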

As before, the local asymptotic power of the LM(S) test is obtained by considering a set of local alternatives given by $H_{1} : θ_{* 2} = c + \frac{ξ}{\sqrt{n}}$, where $ξ$ is an arbitrary vector with the same dimension as $θ_{* 2}$. Under $H_{1}$, LM(S) converges in distribution to a $χ_{r}^{2} (λ)$, where the degrees of freedom $r$ equal the dimension of $θ_{* 2}$ and $λ$ is the noncentrality parameter given by (Bera et al., 2020):

λ = \frac{1}{n} ξ' A_{n}^{22'} (B_{n 22} - A_{n 21} A_{n 11}^{- 1} B_{n 12} - B_{n 21} A_{n 11}^{- 1} A_{n 12} + A_{n 21} A_{n 11}^{- 1} B_{n 11} A_{n 11}^{- 1} A_{n 12})^{- 1} A_{n}^{22} ξ

(12)

where $A_{n 11}, A_{n 12}, A_{n 21}$ are the blocks of the expected Fisher information matrix $A_{n}$ and $B_{n 11}, B_{n 12}, B_{n 21}, B_{n 22}$ are the blocks of the expected cross-product matrix $B_{n}$, derived from the partition of $θ_{*}^{'}$ into $(θ_{* 1}^{'}, θ_{* 2}^{'})$. $A_{n}^{22}$ is computed as in (6). All matrices in Formula (12) are evaluated at $θ_{*}$. The estimation method described in the Approximation Procedure for the Asymptotic Power section is used here to estimate the asymptotic power of the LM(S) test. In Step 3, the LM(S) statistic is taken as the estimate of the noncentrality parameter (the proof of this result can be found in Satorra, 1989). Moreover, the model fitted under $H_{0}$ in Step 2 is assumed to be misspecified. Under correct model specification, the LM(S) and the LM(H)/LM(CP) tests have the same noncentrality parameter and, consequently, the same asymptotic power.

The Jackknife Generalized Score Test

When ML-based methods are used, the LM(S) test derived by White (1982) is equivalent to the GS test derived by Boos (1992) under model misspecification and valid under different types of estimation methods, such as least squares, quasi-ML, and robust M-estimation. The Generalized Score test for the hypothesis testing given in (10) is

GS = S_{2} (\tilde{θ})' V_{S_{2}}^{- 1} (\tilde{θ}) S_{2} (\tilde{θ}),

(13)

where $S_{2} (\tilde{θ})$ and $\tilde{θ}$ are defined as in The Generalized Lagrange Multiplier Test section, but $S_{2}$ does not necessarily come from the derivative of a log-likelihood because it depends on the estimation method chosen. $V_{S_{2}} (\tilde{θ})$ is the covariance matrix of $S_{2}$, evaluated at $\tilde{θ}$.

When likelihood-based methods are used, $V_{S_{2}} (\tilde{θ})$ is equal to ${\hat{A}}_{n}^{22} (\tilde{θ}) {\hat{C}}_{n 22} (\tilde{θ}) {\hat{A}}_{n}^{22} (\tilde{θ})$ and Formulas (13) and (11) are equivalent. Under $H_{0}$, the GS test is asymptotically distributed as a $χ_{r}^{2}$, where the degrees of freedom $r$ equal the dimension of $θ_{* 2}$.

J. Rao et al. (1998) proposed a version of the generalized score test in a general estimating equations framework (Godambe and Thompson, 1986) for a stratified multistage sampling design, based on a consistent jackknife estimator of $V_{S_{2}} (\tilde{θ})$ . We use the test proposed by J. Rao et al. (1998), for independent and identically distributed (i.i.d.) observations and maximum likelihood estimation methods and we refer to this test as the jackknife generalized score (GS(J)) test. The GS(J) test is given in Formula (13), where $V_{S_{2}} (\tilde{θ})$ is estimated with the delete-1 jackknife method as:

{\hat{V}}_{S_{2}} ({\tilde{θ}}_{n}) = \frac{n}{n - 1} \sum_{i = 1}^{n} ({\tilde{S}}_{2 (i)} - {\tilde{S}}_{2}) ({\tilde{S}}_{2 (i)} - {\tilde{S}}_{2})' .

(14)

${\tilde{S}}_{2 (i)}$ is the score function computed without the $i$th observation and evaluated at ${\tilde{θ}}_{n (i)}$ (i.e., the restricted ML estimate obtained by maximizing the log-likelihood without the $i$th observation), and ${\tilde{S}}_{2}$ is the score function of the original sample evaluated at ${\tilde{θ}}_{n}$. Shao (1992) proved the consistency of the jackknife method for an estimator of a parameter $θ$ with i.i.d. responses, while J. Rao et al. (1998) sketched a proof of the consistency of the jackknife score variance estimator for basic survey weights.
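The delete-1 jackknife in Formula (14) can be sketched in Python for a toy problem in which the restricted estimate has a closed form, so that no iterative refitting is needed: a logistic regression of one binary item on the group variable, testing that the group slope is zero. This simplified model (no latent variable) and the function name `gs_jackknife` are assumptions for illustration; in the IRT setting of this article each leave-one-out step would require refitting the restricted model.

```python
def gs_jackknife(x, y):
    """GS(J) for H0: g = 0 in logit P(y = 1) = a + g * x (toy model, scalar score)."""
    n = len(y)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    # Score for the tested slope at the full-sample restricted fit (p = mean of y).
    s2_full = sum_xy - (sum_y / n) * sum_x
    total = 0.0
    for xi, yi in zip(x, y):
        p_loo = (sum_y - yi) / (n - 1)                       # restricted fit without observation i
        s2_loo = (sum_xy - xi * yi) - p_loo * (sum_x - xi)   # score without observation i
        total += (s2_loo - s2_full) ** 2
    v_hat = n / (n - 1) * total                              # Formula (14), scalar case
    return s2_full ** 2 / v_hat                              # Formula (13)

# Small data set that can be checked by hand.
print(gs_jackknife([0, 0, 1, 1], [0, 1, 1, 1]))
```

The loop makes the computational cost visible: the statistic needs $n$ restricted fits, which is cheap here but can be demanding when each fit involves numerical integration over the latent variable.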

Simulation Study

We study the performance of the LM(H), LM(CP), LM(S), and GS(J) test statistics under no misspecification and misspecification either due to local dependence or in the latent variable distribution. Since the main focus of this work is the case of model misspecification, the results under correct model specification are reported in the Supplemental Material. Under a correct model specification, data are generated from the two-parameter logistic (2-PL) model (Birnbaum, 1968) with a linear structural model. When the model is correctly specified, we find results in line with the literature. In particular, the LM(CP) test shows inflated Type I error rates whereas the LM(H) and LM(S) tests have simulated Type I error rates quite close to the nominal level $α$ and similar power. Moreover, the power of the tests increases with the sample size and the number of items. Similar results are found by Liu and Maydeu-Olivares (2013), Liu and Thissen (2014), and Falk and Monroe (2018).

In the Violation of Local Independence and the Misspecification of the Latent Variable Distribution sections, uniform and nonuniform DIF are studied in the simulation as well as single and multiple parameter hypotheses. The performance of the GS(J) test is evaluated in a separate simulation study in The study on the GS(J) Test section.

We consider the following simulation conditions: number of items ($p = 10, 20$) × sample size ($n = 200, 500, 1000$) × test statistic (LM(H), LM(CP), LM(S)). To evaluate the asymptotic behaviour of the tests, $n = 5000$ is also considered in some cases. In some cases, the asymptotic power is computed in addition to the empirical power. Direct maximization through the Newton–Raphson method is used to obtain the ML estimates under the null hypothesis, and numerical derivatives are used to compute the Hessian and cross-product matrices.

The optimization is conducted in R with the function “optim”, and numerical derivatives are obtained with the “numDeriv” R package. In all the simulation scenarios, $N = 500$ replications are considered and the nominal level $α$ is fixed to 0.05. Only for the results under correct model specification, reported in the Supplemental Material, do we consider $N = 200$.

Under model misspecification, in hypothesis testing we should account for the true data generating value $θ_{0}$ and for the parameter value $θ_{*}$ as follows:

when $H_{0} : θ_{*} = c$ , provided that $θ_{0} = c$ and $θ_{*} = c$ , the Type I error rate is obtained. The null hypothesis is true under model misspecification and the parameter is correctly fixed to its data generating value.

when $H_{0} : θ_{*} = c$ , provided that $θ_{0} = c$ and $θ_{*} \neq c$ , the false positive rate is obtained. The null hypothesis is not true under model misspecification, but the parameter is correctly fixed to its data generating value. Some authors, such as Green et al. (1998), treat rejections of a parameter fixed to its data generating value as Type I errors rather than false positives, even under model misspecification. For this reason, we expect the tests to have false positive rates close to the nominal level $α$ if they perform well.

when $H_{0} : θ_{*} = c$ , provided that $θ_{0} \neq c$ and $θ_{*} \neq c$ , the power is obtained. The null hypothesis is not true under model misspecification and the parameter is not fixed to its data generating value.

the case $H_{0} : θ_{*} = c$ , provided that $θ_{0} \neq c$ and $θ_{*} = c$ , is not examined in this study.

To estimate the unknown parameters $θ_{*}$ , we fit the unconstrained model under hypothesis $H_{1}$ to a sample of 5,000 observations generated from the true model. Under model misspecification we always study the false positive rates instead of the Type I error rates $(θ_{0} \neq θ_{*})$ . Nonvalid statistics, for example, negative statistics, are excluded from the analysis. The Type I error, false positive, and power rates are computed as $\hat{p} = \sum_{l = 1}^{N_{v}} \frac{I (T_{l} \geq c)}{N_{v}}$ , where $N_{v}$ is the number of valid statistics out of the number of replications, $I$ is an indicator function, $T_{l}$ is the value of the test statistic evaluated in the $l$ -th replication and $c$ is the theoretical asymptotic critical value corresponding to the 95th percentile of the $χ_{df}^{2}$ distribution, with degrees of freedom equal to the number of constrained parameter(s) under $H_{0}$ . The confidence interval (CI) of each rate $\hat{p}$ is computed as $\hat{p} \pm 1.96 \sqrt{\frac{0.05 (1 - 0.05)}{N_{v}}}$ .
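The rate $\hat{p}$ and its interval can be computed as in the following Python sketch. The list of simulated statistics is a made-up example, the function name is hypothetical, and the constant 3.841459 is the 95th percentile of the $χ_{1}^{2}$ distribution, so the example covers a single constrained parameter only.

```python
import math

CHI2_1_95 = 3.841459  # 95th percentile of the central chi-square with 1 df

def rejection_rate(stats, crit=CHI2_1_95, nominal=0.05):
    """Rejection rate over valid statistics and the interval p_hat +/- 1.96 * SE at the nominal level."""
    # Nonvalid statistics (here: missing, NaN, or negative values) are excluded.
    valid = [t for t in stats if t is not None and not math.isnan(t) and t >= 0.0]
    n_v = len(valid)
    p_hat = sum(1 for t in valid if t >= crit) / n_v
    half = 1.96 * math.sqrt(nominal * (1.0 - nominal) / n_v)
    return p_hat, (p_hat - half, p_hat + half), n_v

# Hypothetical test statistics from five replications; one is invalid.
p_hat, ci, n_v = rejection_rate([0.5, 4.2, -1.0, 2.0, 5.1])
print(p_hat, ci, n_v)
```

Note that the interval width depends only on the nominal level and on $N_{v}$, so it is the same for every test compared under a given number of valid replications.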

Violation of Local Independence

Conditional dependence among certain items is introduced in the data generating model via a common individual-specific random variable $u$ in the logistic measurement model. Data are generated from the following model:

\begin{matrix} \text{logit} (π_{ij}) = α_{0 j} + α_{1 j} z_{i}, \quad i = 1, \dots, n, \quad j = 1, \dots, d, \quad 1 \leq d \leq p \\ \text{logit} (π_{iJ}) = α_{0 J} + α_{1 J} z_{i} + u_{i}, \quad J = d + 1, \dots, p, \quad u_{i} \sim N (0, σ_{u}^{2}) \\ z_{i} = β x_{i} + ϵ_{i}, \quad ϵ_{i} \sim N (0, 1) \end{matrix}

(15)

For both $p = 10$ and $p = 20$, the intercept parameters are generated from a multivariate log-normal distribution with mean 0 and standard deviation (SD) 0.1, the slope parameters are generated from a multivariate log-normal distribution with mean 0 and SD 0.5, the values of the covariate $x$ are generated from a Bernoulli distribution with success probability equal to 0.7, and the residuals $ϵ$ are generated from a standard normal distribution. The parameter $β$ is fixed to 0.9. The random effects $u$ induce the local dependence among the items $y_{d + 1}, \dots, y_{p}$. The percentages of locally dependent items considered in the simulations are 20% and 50%. For example, when LD = 20% and $p = 10$, two items are locally dependent. The variance $σ_{u}^{2}$ also influences the amount of misspecification in the simulation study. The random effects are generated from a normal distribution with mean 0 and three different values of $σ_{u}^{2}$: 0.25, 1, and 2.25. The data generating model contains neither uniform nor nonuniform DIF.
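The data generating model in (15) can be sketched in Python as follows. The sketch assumes that the log-normal parameters (mean 0 and SD 0.1 or 0.5) refer to the log scale and that the item parameters are drawn independently; both are assumptions about details not spelled out above, and the function name is hypothetical.

```python
import math
import random

def expit(t):
    return 1.0 / (1.0 + math.exp(-t))

def simulate_local_dependence(n, p, ld_share, sigma_u, beta, p_focal, rng):
    """Generate binary responses from model (15); the last p - d items share u_i."""
    d = p - round(ld_share * p)          # items 1..d are locally independent
    # Assumed: independent log-normal draws with log-scale mean 0 and the stated SDs.
    alpha0 = [rng.lognormvariate(0.0, 0.1) for _ in range(p)]
    alpha1 = [rng.lognormvariate(0.0, 0.5) for _ in range(p)]
    rows = []
    for _ in range(n):
        x = 1 if rng.random() < p_focal else 0
        z = beta * x + rng.gauss(0.0, 1.0)       # structural model
        u = rng.gauss(0.0, sigma_u)              # individual-specific random effect
        y = []
        for j in range(p):
            eta = alpha0[j] + alpha1[j] * z + (u if j >= d else 0.0)
            y.append(1 if rng.random() < expit(eta) else 0)
        rows.append((x, y))
    return rows, d

# Example: p = 10 items, 20% locally dependent, sigma_u^2 = 1, n = 500.
rows, d = simulate_local_dependence(n=500, p=10, ld_share=0.2, sigma_u=1.0,
                                    beta=0.9, p_focal=0.7, rng=random.Random(3))
print(d, len(rows))
```

No DIF parameters appear in the generator, matching the statement that the data generating model is free of uniform and nonuniform DIF.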

To test for nonuniform DIF under model misspecification, we consider the following unconstrained model:

\begin{matrix} \text{logit} (π_{ij}) = α_{0 j} + α_{1 j} z_{i}, \quad i = 1, \dots, n, \quad j = 1, 2, \dots, k, \quad 1 \leq k \leq p \\ \text{logit} (π_{ij}) = α_{0 j} + α_{1 j} z_{i} + γ_{1 j} x_{i} + γ_{2 j} x_{i} z_{i}, \quad j = k + 1, \dots, p \\ z_{i} = β x_{i} + ϵ_{i}, \quad ϵ_{i} \sim N (0, 1), \end{matrix}

(16)

where items $k + 1, \dots, p$ are tested for measurement invariance. In the case of uniform DIF, equation (16) does not include the parameter $γ_{2 j}$ for the items $k + 1, \dots, p$.

In our simulations, the model fitted to the data is given in (16) with parameters $γ_{1 j}$ and $γ_{2 j}$ fixed to constant values. The false positive rates are studied using Hypotheses A, B, and C and the empirical power using Hypotheses D, E, and F. The asymptotic power is studied for Scenario D.

A $H_{0} : γ_{1 j *} = 0 \text{ vs } H_{1} : γ_{1 j *} \neq 0$,

One item is tested for uniform DIF.

B $H_{0} : γ_{1 *}^{'} = 0 \text{ vs } H_{1} : γ_{1 *}^{'} \neq 0$,

where $γ_{1 *}$ is a $5 \times 1$ vector (i.e., five items are tested for uniform DIF).

C $H_{0} : (γ_{1 j *}, γ_{2 j *}) = 0 \text{ vs } H_{1} : (γ_{1 j *}, γ_{2 j *}) \neq 0$,

One item is tested for nonuniform DIF.

D $H_{0} : γ_{1 j *} = 0.7 \text{ vs } H_{1} : γ_{1 j *} \neq 0.7$,

One item is tested for uniform DIF.

E $H_{0} : γ_{1 *}^{'} = c \text{ vs } H_{1} : γ_{1 *}^{'} \neq c$, where $c = (0.7, 0.7, 0.7, 0.7, 0.7)$,

Five items are tested for uniform DIF.

F $H_{0} : (γ_{1 j *}, γ_{2 j *}) = c \text{ vs } H_{1} : (γ_{1 j *}, γ_{2 j *}) \neq c$, where $c = (0.7, 1)$,

One item is tested for nonuniform DIF.

Table 1 presents the false positive rates for the LM(H), LM(CP), and LM(S) tests under local dependence for Scenarios A, B, and C.

Table 1.

False Positive Rates of the LM(H), LM(CP), and LM(S) Tests Under Scenarios A, B, and C, $p = 10, 20$, $n = 200, 500, 1000, 5000$.

				$σ_{u}^{2} = 0.25$			$σ_{u}^{2} = 1$			$σ_{u}^{2} = 2.25$
SC	$p$	LD	$n$	LM(H)	LM(CP)	LM(S)	LM(H)	LM(CP)	LM(S)	LM(H)	LM(CP)	LM(S)
A	10	20%	200	0.05	0.066	0.052	0.044	0.066	0.034	0.044	0.082	0.052
			500	0.072	0.08	0.074	0.072	0.084	0.078	0.086	0.104	0.088
			1000	0.064	0.076	0.07	0.05	0.054	0.052	0.09	0.112	0.104
			5000	0.046	0.05	0.048	0.092	0.098	0.094	0.23	0.246	0.246
		50%	200	0.042	0.078	0.044	0.044	0.08	0.052	0.092	0.168	0.112
			500	0.072	0.082	0.074	0.116	0.148	0.134	0.256	0.298	0.282
			1000	0.076	0.08	0.072	0.152	0.184	0.17	0.412	0.458	0.446
	20	20%	200	0.04	0.094	0.05	0.056	0.09	0.056	0.06	0.118	0.068
			500	0.044	0.06	0.048	0.058	0.078	0.07	0.092	0.108	0.096
			1000	0.046	0.054	0.052	0.076	0.088	0.078	0.152	0.174	0.162
		50%	200	0.052	0.11	0.06	0.074	0.13	0.088	0.15	0.242	0.178
			500	0.052	0.076	0.058	0.132	0.168	0.148	0.334	0.388	0.358
			1000	0.054	0.07	0.064	0.188	0.224	0.212	0.58	0.622	0.604
B	10	20%	200	0.1	0.122	0.052	0.092	0.106	0.036	0.074	0.112	0.044
			500	0.062	0.07	0.042	0.066	0.082	0.054	0.076	0.088	0.058
			1000	0.064	0.064	0.048	0.046	0.066	0.05	0.094	0.094	0.086
		50%	200	0.062	0.124	0.036	0.11	0.190	0.078	0.394	0.386	0.148
			500	0.05	0.092	0.044	0.236	0.298	0.226	0.796	0.71	0.61
			1000	0.068	0.096	0.08	0.492	0.456	0.426	0.978	0.954	0.942
	20	20%	200	0.03	0.162	0.032	0.06	0.194	0.05	0.082	0.208	0.068
			500	0.048	0.074	0.048	0.06	0.09	0.056	0.144	0.114	0.08
			1000	0.04	0.054	0.046	0.082	0.084	0.066	0.246	0.16	0.132
		50%	200	0.036	0.178	0.04	0.11	0.26	0.098	0.288	0.442	0.214
			500	0.058	0.096	0.066	0.206	0.244	0.18	0.648	0.608	0.518
			1000	0.064	0.096	0.072	0.418	0.384	0.34	0.946	0.916	0.886
C	10	20%	200	0.06	0.104	0.04	0.058	0.094	0.046	0.066	0.112	0.046
			500	0.068	0.092	0.068	0.056	0.08	0.054	0.06	0.118	0.08
			1000	0.064	0.068	0.056	0.042	0.06	0.052	0.086	0.128	0.112
		50%	200	0.062	0.102	0.036	0.056	0.122	0.05	0.094	0.214	0.086
			500	0.062	0.086	0.062	0.084	0.14	0.098	0.2	0.278	0.22
			1000	0.058	0.08	0.068	0.11	0.154	0.142	0.34	0.398	0.364
	20	20%	200	0.056	0.156	0.052	0.056	0.138	0.06	0.062	0.172	0.066
			500	0.072	0.092	0.07	0.05	0.098	0.074	0.06	0.11	0.07
			1000	0.048	0.068	0.052	0.06	0.09	0.072	0.122	0.17	0.146
		50%	200	0.064	0.16	0.058	0.052	0.17	0.068	0.124	0.286	0.146
			500	0.064	0.086	0.062	0.112	0.172	0.112	0.256	0.36	0.284
			1000	0.064	0.078	0.07	0.132	0.172	0.156	0.494	0.538	0.52

Note. Values in boldface indicate that the nominal level $α$ is not included in their confidence interval. SC = scenario; LD = local dependence; LM(H) = Lagrange multiplier test using observed Hessian matrix; LM(CP) = Lagrange multiplier test using observed cross-product matrix; LM(S) = Lagrange multiplier test using sandwich variance and covariance matrix.
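The check described in the table note can be illustrated as follows. The sketch assumes a normal-approximation (Wald) interval for a Monte Carlo rejection rate and 500 replications per condition; the type of interval is an assumption made here for illustration, and the function names are hypothetical.

```python
import math


def rejection_rate_ci(rate, replications, z=1.96):
    """Approximate 95% interval for a simulated rejection rate (Wald form, assumed)."""
    half_width = z * math.sqrt(rate * (1.0 - rate) / replications)
    return rate - half_width, rate + half_width


def covers_nominal(rate, replications, alpha=0.05):
    """True when the nominal level alpha lies inside the interval."""
    lower, upper = rejection_rate_ci(rate, replications)
    return lower <= alpha <= upper
```

Under these assumptions, a rate of 0.066 gives an interval that still contains 0.05, whereas a rate of 0.23 does not, so only the latter would be flagged as departing from the nominal level.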

In the majority of cases, when the variance of the random effect is low ($σ_{u}^{2} = 0.25$), the false positive rates of the LM(H) and LM(S) tests are quite close to the nominal level $α = 5\%$, while the LM(CP) test rejects more often than expected. As the model misspecification increases ($σ_{u}^{2} = 1$ with $LD = 50\%$, and $σ_{u}^{2} = 2.25$ with $LD = 20\%, 50\%$), the false positive rates increase with the sample size, and there are no significant differences in the tests' behaviour between 10 and 20 items. The false positive rates are dramatically affected by the variance of the random effect and the number of items that are conditionally dependent. Moreover, the LM(CP) test has the most inflated false positive rates under all conditions of the study, while no improvement has been found when using the LM(S) test. LM(S) and LM(H) show very similar behaviour under all scenarios.

Table 2 presents the empirical and asymptotic power for the LM(H), LM(CP), and LM(S) tests under local dependence for Scenario D.

Table 2.

Empirical Power (EP) and Asymptotic Power (AP) of the LM(H), LM(CP), and LM(S) Tests Under Scenario D, $p = 10, 20$ , $n = 200, 500, 1000, 5000$ .

					$σ_{u}^{2} = 0.25$			$σ_{u}^{2} = 1$			$σ_{u}^{2} = 2.25$
SC	$p$	LD	$n$		LM(H)	LM(CP)	LM(S)	LM(H)	LM(CP)	LM(S)	LM(H)	LM(CP)	LM(S)
D	10	20%	200	EP	0.308	0.398	0.32	0.38	0.452	0.388	0.484	0.55	0.494
				AP	0.459	0.506	0.485	0.473	0.514	0.493	0.543	0.584	0.562
			500	EP	0.702	0.724	0.71	0.776	0.806	0.798	0.864	0.878	0.872
				AP	0.836	0.877	0.859	0.849	0.884	0.867	0.905	0.930	0.917
			1000	EP	0.936	0.942	0.938	0.97	0.974	0.974	0.994	0.994	0.994
				AP	0.985	0.993	0.990	0.988	0.994	0.991	0.996	0.998	0.997
			5000	EP	1	1	1	1	1	1	1	1	1
				AP	1	1	1	1	1	1	1	1	1
		50%	200	EP	0.324	0.44	0.356	0.49	0.57	0.516	0.637	0.706	0.624
				AP	0.497	0.552	0.527	0.586	0.649	0.621	0.723	0.777	0.739
			500	EP	0.752	0.774	0.758	0.888	0.898	0.89	0.956	0.96	0.956
				AP	0.870	0.911	0.893	0.931	0.959	0.948	0.981	0.990	0.984
			1000	EP	0.952	0.956	0.952	0.992	0.994	0.992	1	1	1
				AP	0.992	0.997	0.995	0.998	0.999	0.999	1	1	1
	20	20%	200	EP	0.382	0.528	0.392	0.484	0.606	0.484	0.574	0.66	0.582
				AP	0.473	0.506	0.492	0.523	0.557	0.542	0.570	0.603	0.588
			500	EP	0.824	0.858	0.83	0.886	0.910	0.889	0.94	0.946	0.936
				AP	0.849	0.877	0.866	0.891	0.914	0.904	0.922	0.939	0.932
			1000	EP	0.982	0.986	0.982	0.994	0.994	0.994	1	1	1
				AP	0.988	0.993	0.991	0.995	0.997	0.996	0.997	0.998	0.998
		50%	200	EP	0.416	0.558	0.42	0.59	0.68	0.592	0.74	0.832	0.742
				AP	0.497	0.531	0.517	0.624	0.668	0.649	0.752	0.794	0.772
			500	EP	0.844	0.866	0.846	0.962	0.97	0.964	0.992	0.994	0.992
				AP	0.870	0.896	0.886	0.949	0.966	0.959	0.986	0.992	0.989
			1000	EP	0.992	0.994	0.994	1	1	1	1	1	1
				AP	0.992	0.995	0.994	0.999	1	0.999	1	1	1

Note. SC = scenario; LD = local dependence; LM(H) = Lagrange multiplier test using observed Hessian matrix; LM(CP) = Lagrange multiplier test using observed cross-product matrix; LM(S) = Lagrange multiplier test using sandwich variance and covariance matrix.

Overall, there are some numerical differences between the asymptotic and empirical power, which decrease as the number of items and the sample size increase. It is worth noting that the empirical and asymptotic power behave in the same way. Indeed, according to both methods, LM(CP) has the highest power and LM(H) and LM(S) have very similar power under all conditions. Both the empirical and the asymptotic power increase with the sample size and the number of items. Since there are no substantial differences between the two procedures, only the empirical power is computed for Scenarios E and F. Table 3 presents the empirical power for the LM(H), LM(CP), and LM(S) tests under local dependence for Scenarios E and F.
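To make the notion of asymptotic power concrete, the following sketch evaluates the power of a one-parameter test (as in Scenario D) when the statistic is assumed to follow a noncentral chi-square distribution with one degree of freedom under the alternative. The noncentrality parameter is treated as a given input; how it is approximated for each test is not reproduced here, and the function name is hypothetical.

```python
import math
from statistics import NormalDist


def asymptotic_power_df1(noncentrality, alpha=0.05):
    """Power of a 1-df chi-square test with a given noncentrality parameter.

    Uses the identity that a noncentral chi-square(1) variable equals
    (Z + sqrt(noncentrality))**2 with Z standard normal.
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1.0 - alpha / 2.0)  # square root of the critical value
    shift = math.sqrt(noncentrality)
    upper_tail = 1.0 - std_normal.cdf(z_crit - shift)
    lower_tail = std_normal.cdf(-z_crit - shift)
    return upper_tail + lower_tail
```

With a noncentrality of zero the function returns the nominal level, and the power grows monotonically with the noncentrality parameter. Because no data sets have to be simulated and refitted, this calculation is much cheaper than the empirical power.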

Table 3.

Empirical Power of the LM(H), LM(CP), and LM(S) Tests Under Scenarios E and F, $p = 10, 20$ , $n = 200, 500, 1000, 5000$ .

				$σ_{u}^{2} = 0.25$			$σ_{u}^{2} = 1$			$σ_{u}^{2} = 2.25$
SC	$p$	LD	$n$	LM(H)	LM(CP)	LM(S)	LM(H)	LM(CP)	LM(S)	LM(H)	LM(CP)	LM(S)
E	10	20%	200	0.449	0.58	0.37	0.502	0.606	0.412	0.538	0.624	0.432
			500	0.9	0.926	0.902	0.934	0.948	0.928	0.966	0.974	0.958
			1000	0.998	1	1	0.996	0.998	0.998	0.998	1	0.998
		50%	200	0.518	0.606	0.364	0.730	0.716	0.372	0.858	0.779	0.3
			500	0.948	0.954	0.926	0.994	0.984	0.968	0.998	0.998	0.978
			1000	1	1	0.998	1	1	1	1	1	1
	20	20%	200	0.742	0.876	0.722	0.802	0.856	0.692	0.834	0.866	0.722
			500	0.994	0.996	0.994	1	0.998	0.994	1	1	0.994
			1000	1	1	1	1	1	1	1	1	1
		50%	200	0.814	0.906	0.966	0.9	0.934	0.818	0.966	0.962	0.894
			500	1	1	0.998	1	1	1	1	1	1
			1000	1	1	1	1	1	1	1	1	1
F	10	20%	200	0.660	0.632	0.416	0.674	0.662	0.486	0.743	0.758	0.598
			500	0.957	0.946	0.898	0.978	0.976	0.944	0.992	0.99	0.98
			1000	0.998	0.998	0.998	1	1	1	1	1	1
		50%	200	0.637	0.61	0.388	0.641	0.636	0.398	0.662	0.617	0.381
			500	0.945	0.932	0.902	0.951	0.94	0.91	0.940	0.926	0.894
			1000	0.998	0.998	0.996	1	0.998	0.998	1	1	1
	20	20%	200	0.807	0.848	0.666	0.860	0.888	0.756	0.896	0.91	0.802
			500	0.992	0.996	0.982	0.996	0.996	0.996	1	1	0.998
			1000	1	1	1	1	1	1	1	1	1
		50%	200	0.803	0.844	0.664	0.852	0.872	0.696	0.823	0.862	0.682
			500	0.992	0.996	0.984	0.996	0.996	0.992	0.991	0.996	0.99
			1000	1	1	1	1	1	1	1	1	1

Under the multiple-parameter scenarios (E and F) and small sample sizes ($n = 200$), the LM(S) test has the lowest power. Moreover, under all scenarios and for small sample size, LM(H) and LM(CP) have similar power whereas, in the majority of cases for large sample sizes, all tests reach the same power. Thus, the power seems less affected by the degree of local dependence than the false positive rate, and it increases with both the sample size and the number of items. Moreover, in terms of power, LM(CP) has the best performance because it has the highest power under most simulation conditions and it produces valid results for all replications. It is worth noting that, under Scenarios E and F, in some cases the LM(H) test produces nonvalid results, ranging from 0.2% to 22.4% of the replications, where the highest percentages correspond to small sample sizes, $σ_{u}^{2} = 2.25$, and $LD = 50\%$.

Misspecification of the Latent Variable Distribution

The data are generated from the following model:

\begin{matrix} \operatorname{logit}(π_{ij}) = α_{0 j} + α_{1 j} z_{i} \\ z_{i} = β x_{i} + ϵ_{i}, \quad i = 1, \ldots, n, \quad j = 1, 2, \ldots, p \end{matrix}

(17)

Three different distributions are assumed for the latent variable. Namely, the error term is generated from a mixture of normals, $ϵ \sim f(ϵ) = 0.3 N(-1.5, 0.2) + 0.7 N(1, 0.4)$, and from a skew-normal distribution with parameter $κ = 1$ or $κ = 3$. The probability density function of a skew-normal with skewness parameter $κ$ is the following (Azzalini, 1985):

ϕ (ϵ; κ) = 2 ϕ (ϵ) Φ (κ ϵ)

where $ϕ$ and $Φ$ are the standard normal density and distribution function, respectively. The parameter $κ$ can take values from $-\infty$ to $+\infty$, and for $κ = 0$ the density reduces to that of a standard normal distribution.
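As an illustration, skew-normal error terms can be drawn with the standard stochastic representation $ϵ = δ |u_{0}| + \sqrt{1 - δ^{2}}\, u_{1}$, where $u_{0}$ and $u_{1}$ are independent standard normals and $δ = κ / \sqrt{1 + κ^{2}}$. The sketch below is an assumption about how such draws could be produced, not a description of the authors' code; the function names are hypothetical. The density function implements the standard Azzalini form $2 ϕ(ϵ) Φ(κ ϵ)$.

```python
import math
import random


def draw_skew_normal(kappa, rng):
    """One draw from a skew-normal with location 0, scale 1, skewness kappa."""
    delta = kappa / math.sqrt(1.0 + kappa * kappa)
    u0 = rng.gauss(0.0, 1.0)
    u1 = rng.gauss(0.0, 1.0)
    return delta * abs(u0) + math.sqrt(1.0 - delta * delta) * u1


def skew_normal_pdf(eps, kappa):
    """Skew-normal density 2 * phi(eps) * Phi(kappa * eps)."""
    phi = math.exp(-0.5 * eps * eps) / math.sqrt(2.0 * math.pi)
    big_phi = 0.5 * (1.0 + math.erf(kappa * eps / math.sqrt(2.0)))
    return 2.0 * phi * big_phi
```

For $κ = 0$ the factor $2 Φ(0)$ equals 1, so `skew_normal_pdf` returns the standard normal density, in line with the remark above. Note that a skew-normal variable with $κ \neq 0$ has nonzero mean $δ \sqrt{2 / π}$, which is one way the fitted standard normal assumption is violated.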

Intercepts ( $α_{0 j}$ ), factor coefficients ( $α_{1 j}$ ), regression coefficient ( $β$ ), and group variable $x$ are generated as in the Violation of Local Independence section. Similarly here, we consider the model in Equation (16) as the unconstrained model. The simulation scenarios of the Violation of Local Independence section are considered here to study the false positive rates and the empirical power of the tests. As before, the asymptotic power is studied for Scenario D.

Table 4 reports the false positive rates for the LM(H), LM(CP), and LM(S) tests under misspecification of the latent variable distribution for Scenarios A, B, and C.

Table 4.

False Positive Rates of the LM(H), LM(CP), and LM(S) Tests Under Scenarios A, B, and C, $p = 10, 20$ , $n = 200, 500, 1000$ .

			$ϵ ~ 0.3 N (- 1.5, 0.2) + 0.7 N (1, 0.4)$			$ϵ ~ SN (1)$			$ϵ ~ SN (3)$
SC	$p$	$n$	LM(H)	LM(CP)	LM(S)	LM(H)	LM(CP)	LM(S)	LM(H)	LM(CP)	LM(S)
A	10	200	0.048	0.066	0.042	0.046	0.076	0.024	0.089	0.132	0.008
		500	0.046	0.052	0.04	0.05	0.066	0.042	0.076	0.07	0.022
		1000	0.048	0.052	0.05	0.06	0.062	0.056	0.06	0.058	0.042
	20	200	0.054	0.082	0.056	0.054	0.116	0.044	0.06	0.112	0.026
		500	0.05	0.058	0.05	0.054	0.066	0.058	0.056	0.07	0.044
		1000	0.042	0.04	0.038	0.052	0.07	0.066	0.054	0.06	0.054
B	10	200	0.06	0.10	0.046	0.134	0.156	0.016	0.198	0.242	0.002
		500	0.058	0.066	0.048	0.112	0.09	0.032	0.195	0.082	0.004
		1000	0.066	0.066	0.058	0.086	0.06	0.042	0.196	0.066	0.002
	20	200	0.058	0.140	0.042	0.066	0.222	0.04	0.119	0.293	0.002
		500	0.044	0.064	0.034	0.056	0.102	0.044	0.066	0.114	0.016
		1000	0.064	0.076	0.054	0.042	0.064	0.05	0.072	0.09	0.042
C	10	200	0.07	0.118	0.048	0.065	0.164	0.026	0.133	0.216	0.012
		500	0.066	0.072	0.036	0.05	0.078	0.042	0.075	0.092	0.032
		1000	0.062	0.068	0.056	0.066	0.068	0.052	0.076	0.084	0.026
	20	200	0.076	0.154	0.046	0.062	0.218	0.042	0.087	0.235	0.02
		500	0.05	0.094	0.044	0.044	0.084	0.046	0.046	0.09	0.03
		1000	0.068	0.084	0.056	0.044	0.064	0.042	0.07	0.098	0.048

Note. Values in boldface indicate that the nominal level $α$ is not included in their confidence interval. SC = scenario; LM(H) = Lagrange multiplier test using observed Hessian matrix; LM(CP) = Lagrange multiplier test using observed cross-product matrix; LM(S) = Lagrange multiplier test using sandwich variance and covariance matrix.

The misspecification of the latent variable distribution in the case of a mixture of normals does not affect the false positive rates of the LM(H) and LM(S) tests, whereas the LM(CP) test has inflated false positive rates, especially under Scenarios B and C. When $ϵ ~ SN (1)$ , only the LM(S) test never shows inflated false positive rates, even if it rejects less than it should for small sample sizes and 10 items. The performance of the tests deteriorates with the increase of skewness from $κ = 1$ to $κ = 3$ . For some of our simulation scenarios, the LM(H) and the LM(CP) tests have inflated false positive rates and the LM(S) test rejects less than expected. When $ϵ$ is distributed as a skew-normal under all scenarios, the LM(H) test produces a considerable number of nonvalid results, ranging from 0.2% to 43.4% of the replications. The number of nonvalid LM(H) statistics increases with the skewness of the latent variable distribution and for small sample sizes.

Table 5 presents the empirical and asymptotic power for LM(H), LM(CP), and LM(S) tests under misspecification of the latent variable distribution for Scenario D.

Table 5.

Empirical Power (EP) and Asymptotic Power (AP) of the LM(H), LM(CP), and LM(S) Tests Under Scenario D, $p = 10, 20$ , $n = 200, 500, 1000$ .

				$ϵ ~ 0.3 N (- 1.5, 0.2) + 0.7 N (1, 0.4)$			$ϵ ~ SN (1)$			$ϵ ~ SN (3)$
SC	$p$	$n$		LM(H)	LM(CP)	LM(S)	LM(H)	LM(CP)	LM(S)	LM(H)	LM(CP)	LM(S)
D	10	200	EP	0.316	0.396	0.324	0.195	0.28	0.15	0.129	0.186	0.03
			AP	0.425	0.459	0.443	0.307	0.326	0.301	0.226	0.208	0.170
		500	EP	0.684	0.71	0.7	0.424	0.462	0.406	0.235	0.244	0.094
			AP	0.772	0.799	0.835	0.632	0.664	0.623	0.480	0.440	0.354
		1000	EP	0.95	0.958	0.952	0.748	0.762	0.75	0.406	0.402	0.328
			AP	0.977	0.986	0.982	0.902	0.921	0.895	0.771	0.725	0.611
	20	200	EP	0.38	0.488	0.382	0.292	0.414	0.282	0.197	0.299	0.092
			AP	0.385	0.400	0.392	0.397	0.421	0.391	0.232	0.237	0.218
		500	EP	0.76	0.804	0.768	0.596	0.64	0.586	0.406	0.464	0.354
			AP	0.751	0.770	0.759	0.766	0.794	0.759	0.492	0.502	0.461
		1000	EP	0.98	0.98	0.978	0.902	0.906	0.898	0.662	0.692	0.644
			AP	0.961	0.968	0.965	0.967	0.976	0.964	0.783	0.794	0.749

Note. SC = scenario; LM(H) = Lagrange multiplier test using observed Hessian matrix; LM(CP) = Lagrange multiplier test using observed cross-product matrix; LM(S) = Lagrange multiplier test using sandwich variance and covariance matrix.

Overall, the numerical differences between the asymptotic and empirical power are small. As in the case of local dependence, the empirical and asymptotic power give the same information. For Scenario D and large sample sizes, the power of all tests is not affected by the latent variable having a mixture of normal distributions. When $ϵ ~ SN (1)$ , LM(CP) has the highest power while LM(H) and LM(S) have a very similar power. When $ϵ ~ SN (3)$ , the power is lower for all tests, especially for LM(S) and small sample sizes, and LM(H) produces a considerable number of nonvalid results for small sample size (11.6% of the replications). Since there are no substantial differences between the two procedures, only the empirical power is computed for Scenarios E and F.

Table 6 presents the empirical power for the LM(H), LM(CP), and LM(S) tests under misspecification of the latent variable distribution for Scenarios E and F.

Table 6.

Empirical Power of the LM(H), LM(CP), and LM(S) Tests Under Scenarios E and F, $p = 10, 20$ , $n = 200, 500, 1000$ .

			$ϵ ~ 0.3 N (- 1.5, 0.2) + 0.7 N (1, 0.4)$			$ϵ ~ SN (1)$			$ϵ ~ SN (3)$
SC	$p$	$n$	LM(H)	LM(CP)	LM(S)	LM(H)	LM(CP)	LM(S)	LM(H)	LM(CP)	LM(S)
E	10	200	0.516	0.614	0.402	0.218	0.446	0.124	0.100	0.313	0.02
		500	0.926	0.93	0.91	0.627	0.756	0.632	0.347	0.408	0.09
		1000	0.998	0.998	0.998	0.946	0.972	0.962	0.642	0.7	0.312
	20	200	0.674	0.853	0.646	0.524	0.782	0.456	0.385	0.642	0.076
		500	0.992	0.996	0.99	0.946	0.968	0.946	0.739	0.81	0.488
		1000	1	1	1	1	1	1	0.974	0.98	0.954
F	10	200	0.588	0.547	0.318	0.356	0.484	0.188	0.223	0.462	0.158
		500	0.916	0.89	0.838	0.834	0.844	0.722	0.585	0.772	0.532
		1000	0.99	0.988	0.988	0.974	0.982	0.972	0.867	0.966	0.882
	20	200	0.449	0.48	0.174	0.713	0.787	0.52	0.608	0.783	0.434
		500	0.826	0.784	0.7	0.988	0.986	0.97	0.921	0.984	0.952
		1000	0.978	0.97	0.952	1	1	1	0.958	1	1

Similarly to the false positive rates study, the power of all tests studied here is not affected by the latent variable having a mixture of normal distributions and it is lower for small sample sizes. Interestingly, when $ϵ ~ SN (1)$ , the LM(CP) test has the highest power whereas, when $ϵ ~ SN (3)$ , the power is lower for all tests, particularly for LM(S) in the case of small sample sizes. However, the power, even for $κ = 3$ , increases with the increase of sample size and number of items. When $ϵ$ is distributed as a skew-normal, the LM(H) test produces nonvalid results in some of the simulation scenarios, ranging from 0.2% to 30.2% of the replications and, as in the previous setting, the number of nonvalid LM(H) statistics increases with the skewness of the latent variable distribution and decreases as the sample size increases.

The Study on the GS(J) Test

The GS(J) test is computationally expensive compared with the other tests. Indeed, in each replication with a sample of size $n$, the jackknife score covariance matrix given in (14) requires the ML estimates of the parameters to be computed $n$ times. To reduce the computation time for this method, a faster model estimation is obtained by using the “ltm” R package, which uses a combination of the E-M algorithm and direct maximization. As before, numerical derivatives for the Hessian and cross-product matrix are obtained with the “numDeriv” R package. We conduct a small-scale simulation to compare the performance of the LM(H), LM(CP), and LM(S) tests with the GS(J) test under no misspecification, misspecification due to local dependence, and misspecification of the latent variable distribution. All models considered here have only a measurement model and no structural model. We consider the following simulation conditions: number of items ($p = 10$) × sample size ($n = 200, 500, 1000$) × test statistic (LM(H), LM(CP), LM(S), GS(J)), with 500 replications for each scenario. To study the Type I error/false positive rates, we consider three data generating models (DGMs): (1) under a correct model specification, data are generated from the 2-PL model (Birnbaum, 1968), (2) under local dependence, from the model given in Equation (15), and (3) under misspecification of the latent variable distribution, from the model given in Equation (17). To study the power, we set the parameter $γ_{1 j}$ equal to 0.5 or 2 on the last item of each of the three DGMs (2-PL, Equations 15 and 17). For all of them, the covariate $x$ does not affect the latent variable ($β = 0$), and intercepts, factor loadings, and the values of the group variable $x$ are generated as in the Violation of Local Independence section. When data are generated from (15), we consider $σ_{u}^{2} = 1$ and $LD = 20\%$. For data generated from (17), we assume $ϵ \sim SN(3)$.
We consider the model in Equation (16), without the structural model, as the unconstrained model. Under Scenario A, $γ_{1 j}$ is fixed to 0 under the null hypothesis. Scenario A is used to study the Type I error/false positive rate, because all items in the data generating models are measurement invariant, and to study the power, because a uniform DIF parameter is introduced on the last item of all DGMs. Table 7 reports the Type I error/false positive rates of the GS(J), LM(H), LM(CP), and LM(S) tests under correct model specification, local dependence, and misspecification of the latent variable distribution, for Scenario A.
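The computational burden of a jackknife procedure comes from re-estimating the quantity of interest once per left-out observation. The following generic sketch shows a leave-one-out jackknife variance for an arbitrary scalar estimator; it is not the score covariance matrix of Equation (14), which is not reproduced in this section, and the function name is hypothetical.

```python
from statistics import mean, variance


def jackknife_variance(sample, estimator):
    """Leave-one-out jackknife variance of a scalar estimator.

    The estimator is recomputed n times, once for each deleted observation,
    which is what makes jackknife-based tests costly when every
    re-estimation is a full model fit.
    """
    n = len(sample)
    leave_one_out = [estimator(sample[:i] + sample[i + 1:]) for i in range(n)]
    centre = mean(leave_one_out)
    return (n - 1) / n * sum((value - centre) ** 2 for value in leave_one_out)
```

For the sample mean, this jackknife variance coincides with the usual estimate $s^{2} / n$, which gives a simple sanity check. In the IRT setting each call to the estimator would be a full ML fit, so a sample of size $n = 1000$ implies 1000 model fits per replication.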

Table 7.

Type I Error/False Positive Rate of the GS(J), LM(H), LM(CP), and LM(S) Tests Under Scenario A, $p = 10$, $n = 200, 500, 1000$.

Data generating model	SC	$p$	$n$	GS(J)	LM(H)	LM(CP)	LM(S)
2-PL	A	10	200	0.042	0.048	0.064	0.046
			500	0.06	0.06	0.072	0.06
			1000	0.062	0.062	0.062	0.062
(15)	A	10	200	0.034	0.042	0.054	0.034
			500	0.056	0.058	0.064	0.056
			1000	0.056	0.058	0.064	0.058
(17)	A	10	200	0.036	0.044	0.072	0.036
			500	0.044	0.048	0.058	0.044
			1000	0.048	0.052	0.056	0.048

Note. Values in boldface indicate that the nominal level $α$ is not included in their confidence interval. SC = scenario; GS(J) = generalized jackknife score test; LM(H) = Lagrange multiplier test using observed Hessian matrix; LM(CP) = Lagrange multiplier test using observed cross-product matrix; LM(S) = Lagrange multiplier test using sandwich variance and covariance matrix; 2-PL = two-parameter logistic.

The GS(J) test and the LM(S) test perform similarly under all conditions. In general, all tests have good performance and only the LM(CP) test shows inflated false positive rates under some conditions.

Table 8 presents the empirical power for the GS(J), LM(H), LM(CP), and LM(S) tests under correct model specification, local dependence, and incorrect distribution of the latent variable, for Scenario A.

Table 8.

Empirical Power of the GS(J), LM(H), LM(CP), and LM(S) Tests Under Scenario A, $p = 10$ , $n = 200, 500, 1000$ .

Data generating model	SC	$p$	$γ_{1 j}$	$n$	GS(J)	LM(H)	LM(CP)	LM(S)
2-PL	A	10	0.5	200	0.23	0.292	0.296	0.238
				500	0.488	0.53	0.52	0.494
				1000	0.754	0.778	0.772	0.758
			2	200	0.962	0.98	0.982	0.962
				500	1	1	1	1
				1000	1	1	1	1
(15)	A	10	0.5	200	0.176	0.236	0.234	0.186
				500	0.394	0.434	0.422	0.396
				1000	0.67	0.686	0.676	0.67
			2	200	0.956	0.978	0.978	0.962
				500	1	1	1	1
				1000	1	1	1	1
(17)	A	10	0.5	200	0.11	0.200	0.196	0.13
				500	0.344	0.414	0.392	0.344
				1000	0.62	0.678	0.634	0.622
			2	200	0.634	0.893	0.903	0.732
				500	0.996	1	0.998	0.996
				1000	1	1	1	1

Note. SC = scenario; GS(J) = generalized jackknife score test; LM(H) = Lagrange multiplier test using observed Hessian matrix; LM(CP) = Lagrange multiplier test using observed cross-product matrix; LM(S) = Lagrange multiplier test using sandwich variance and covariance matrix; 2-PL = two-parameter logistic.

Under all conditions for small sample size, the power of the GS(J) test is always equal to or lower than that of the LM(S) test. When the sample size increases, the two tests reach the same power. As in the Type I error/false positive rate study, the performance of the GS(J) test is never superior to that of the other tests. For this reason, and because of its high computational cost, we do not use the GS(J) test in the real data analysis.

An Application to a Real Data Set

In this section, we assess measurement invariance under model misspecification through the LM(H), LM(CP), and LM(S) tests on a real data set, taken from Miller et al. (1984). We select the same sample of observations and items analyzed by Duncan (1979). In 1953, in the Detroit Area, the following questions regarding sex role expectations were put to a sample of 257 women:

Here are some things that might be done by a boy or a girl. As I read each of these to you, I would like you to tell me if it should be done as a regular task by a boy, by a girl, or by both: (1) Shoveling walks, (2) Washing the car, (3) Dusting furniture, (4) Making beds.

Responses of “boy” to Items 1 and 2 and “girl” to Items 3 and 4 are coded as “0” and refer to “traditional” answers. Responses of “both” for all items are coded as “1” and refer to “egalitarian” answers. For the same sample of women, in addition to the four binary items, we consider a group variable, which we call “Work,” taken from the original data set (Miller et al., 1984). The following question was put to the sample of mothers: “What is your occupation? What kind of business is that in?” The possible responses were the following: “Professional, technical, and kindred workers”, “Managers, officials and proprietors, except farm”, “Clerical and kindred workers”, “Sales workers”, “Operatives and kindred workers”, “Private household workers, service workers”, “Laborers, except farm and mine”, and “Not in labor force”. We group these responses into two classes:

Class coded as “0”, which includes only answers “Not in labor force”. This class includes the group of nonworking women ( $n_{0} = 199$ ).

Class coded as “1”, which includes all the other responses. This class includes the group of working women ( $n_{1} = 58$ ).

The percentages of “egalitarian” answers among the group of nonworking women are 31%, 31%, 29%, and 42% for Items 1 to 4, respectively. The percentages of “egalitarian” answers among the group of working women are 43%, 29%, 50%, and 55% for Items 1 to 4, respectively. Women in the working group give more “egalitarian” answers than women in the nonworking group on all items except Item 2, especially on Items 3 and 4. The data set is analyzed by Mavridis and Moustaki (2009) and Irincheeva (2011). They show that the classical unidimensional IRT model with the latent variable distributed as a standard normal fits this data set poorly. Irincheeva (2011) fits a semi-nonparametric (SNP) unidimensional IRT model to the data, which allows for more flexibility in the shape of the latent variable distribution and gives a better fit than the classic unidimensional IRT model. Moreover, the results found by Irincheeva (2011) suggest that the shape of the true latent variable distribution is right skewed or even more complex.

Starting from these results, in this study we consider a unidimensional IRT model for binary data based on the assumption of a standard normal latent variable distribution under the null hypothesis, which we know to be misspecified. Measurement invariance on the intercept of each item is tested through $H_{0} : γ_{1 j *} = 0 \text{ vs } H_{1} : γ_{1 j *} \neq 0$, where $γ_{1 j *}$ is the effect of the group variable “Work” on the item intercept. Measurement invariance on the slope of each item is tested through $H_{0} : γ_{2 j *} = 0 \text{ vs } H_{1} : γ_{2 j *} \neq 0$, where $γ_{2 j *}$ is the effect of the group variable “Work” on the item slope. Rejecting the null hypothesis implies that the item intercept, or slope, is measurement noninvariant. Due to the small sample size and low number of items, we do not consider multiple-parameter hypothesis testing. The p values of the tests are computed in two ways: using the asymptotic distribution of the tests under the null hypothesis, and using bootstrap hypothesis testing (Efron and Tibshirani, 1994). As observed in the Simulation Study section, under high misspecification of the latent variable distribution, the LM tests do not match their theoretical distributions under the null hypothesis. In particular, the LM(H) and LM(S) tests have the worst performance in terms of power under small sample sizes. Bootstrap hypothesis testing does not depend on the asymptotic distribution of the test statistic under the null hypothesis and can be a good alternative under model misspecification (Lu and Young, 2012).

The first step of the bootstrap hypothesis testing procedure is to generate $B$ bootstrap samples, or simulated data sets, indexed by $h$, that should satisfy the null hypothesis (Efron and Tibshirani, 1994). We consider a parametric bootstrap, where the bootstrap samples are generated from a classical unidimensional IRT model with the latent variable distributed as a standard normal and with parameter estimates obtained by fitting the same model to the original sample of observations. Under the null hypothesis, the group variable “Work” has no effect on the intercept and slope of each item. For this reason, the values of the group variable in each bootstrap sample are randomly drawn from a Bernoulli variable with success probability estimated on the original sample of observations. The parametric bootstrap can be used even when the model under the null hypothesis is misspecified (Lu and Young, 2012). The bootstrap hypothesis testing procedure consists of the following steps (Efron and Tibshirani, 1994):

Calculate the statistic $\hat{τ}$ (the LM(H), LM(CP) and LM(S) tests) in the original sample of observations.

Calculate the statistic $τ$ in each bootstrap sample, called $τ_{h}^{*}$ .

Compute the bootstrap p value as ${\hat{p}}^{*} (\hat{τ}) = \frac{1}{B} \sum_{h = 1}^{B} I (τ_{h}^{*} > \hat{τ})$ , where $I$ is the indicator function.

Reject the null hypothesis if ${\hat{p}}^{*} (\hat{τ}) < α$ .

When $τ$ is pivotal, that is, its distribution does not depend on unknown parameters, and the number of bootstrap samples $B$ is such that $α (B + 1)$ is an integer, the bootstrap hypothesis testing procedure can yield an exact test (Dwass, 1957). We choose $B = 999$, which is usually a good choice for the number of bootstrap samples to be used in hypothesis testing (MacKinnon, 2002).
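The last two steps of the procedure amount to counting how many bootstrap statistics exceed the observed one. A minimal sketch follows; the function names are hypothetical, and the bootstrap statistics are assumed to have been computed already from samples generated under the null hypothesis.

```python
def bootstrap_p_value(tau_hat, tau_star):
    """Share of bootstrap statistics strictly greater than the observed one."""
    exceed = sum(1 for tau_h in tau_star if tau_h > tau_hat)
    return exceed / len(tau_star)


def rejects(tau_hat, tau_star, alpha=0.05):
    """Reject the null hypothesis when the bootstrap p value is below alpha."""
    return bootstrap_p_value(tau_hat, tau_star) < alpha
```

With $B = 999$ and $α = 0.05$, the product $α (B + 1) = 50$ is an integer, which is the condition mentioned above for the procedure to be able to yield an exact test.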

Table 9 presents the p values for the LM(H), LM(CP), and LM(S) tests based on their theoretical distributions (TD) under the null hypothesis and on bootstrap hypothesis testing (BH) for measurement invariance on the item intercept and slope.

Table 9.

Theoretical Distributions (TD) and Bootstrap Hypothesis Testing (BH) p Values of the LM(H), LM(CP), and LM(S) Tests for Measurement Invariance on the Item Intercept and Slope.

Parameter tested	Item	Method	LM(H)	LM(CP)	LM(S)
$γ_{1 j *}$	1	TD	0.387	0.390	0.391
		BH	0.397	0.404	0.398
	2	TD	0.107	0.082	0.097
		BH	0.114	0.102	0.105
	3	TD	-	0.014	0.059
		BH	-	0.023	0.020
	4	TD	0.78	0.795	0.801
		BH	0.800	0.811	0.811
$γ_{2 j *}$	1	TD	0.399	0.351	0.353
		BH	0.393	0.346	0.337
	2	TD	0.116	0.112	0.131
		BH	0.124	0.118	0.114
	3	TD	0.048	0.038	0.098
		BH	0.101	0.049	0.031
	4	TD	0.050	0.118	0.223
		BH	0.083	0.163	0.172

Note 1. Values in boldface indicate p values less than the nominal level $α$ . LM(H) = Lagrange multiplier test using observed Hessian matrix; LM(CP) = Lagrange multiplier test using observed cross-product matrix; LM(S) = Lagrange multiplier test using sandwich variance and covariance matrix.

For all tests, TD and BH do not reject the null hypothesis of intercept and slope invariance for Items 1, 2, and 4. This is consistent with the simulation results, in which the false positive rates are less affected than the power of the tests by the misspecification of the latent variable distribution. However, BH and TD disagree for Item 3. Interestingly, only the TD p values of the LM(CP) test lead to the same conclusion as the BH p values of the LM(S) test, rejecting the null hypothesis of measurement invariance on the intercept and slope. This is consistent with the simulation results, where the LM(CP) test has the highest power for small sample sizes under misspecification of the latent variable distribution. The bootstrap hypothesis testing procedure for the LM(S) and LM(CP) tests turns out to be a good instrument to make a clearer decision on the acceptance or rejection of the null hypothesis, especially when these tests show contradictory results. By contrast, the LM(H) test gives negative statistics in the real data set and in a large number of bootstrap replications, as in some simulation scenarios under high misspecification of the latent variable distribution and small sample size. This makes the results difficult to interpret and worsens the performance of the bootstrap hypothesis testing procedure. Indeed, for measurement invariance on the intercept of Item 3, the TD and BH p values of the LM(H) test cannot be computed because the statistic calculated in the real data set is negative. Moreover, when testing measurement invariance on the slope of Item 3, the BH p value of the LM(H) test is not stable because in 11.5% of the bootstrap replications we obtain nonvalid statistics, which have been excluded from the BH p value computation.

Discussion

In this work, we evaluated the performance of the LM(H), LM(CP), LM(S), and GS(J) tests to assess measurement invariance under both correct model specification and different types of model misspecification by means of an extensive simulation study and a real data analysis. Moreover, we computed the empirical and asymptotic power of the LM(H), LM(CP), and LM(S) tests, obtaining the latter from the asymptotic distributions of the statistics under the alternative hypothesis.

Under model misspecification, there are some differences among the three tests that depend on the type and the strength of the model misspecification. Under low local dependence, and when the latent variable is generated from a mixture of normals or from a moderate skew-normal, all tests have good performance in terms of false positive rates and power for large sample sizes. Only the LM(CP) test shows inflated false positive rates in some cases. For this reason, we discourage the use of the LM(CP) test under mild model misspecification. When the misspecification is high, the tests' performance deteriorates. Indeed, under high local dependence the false positive rates for all tests are seriously inflated, while when the latent variable is highly skewed, with 10 items and for small sample sizes, the LM(H) and LM(S) tests have very low power. Under high model misspecification, the LM(CP) test has the highest power for small sample sizes. It is worth noting that the LM(S) test, although derived under model misspecification, does not perform better than the LM(H) test, particularly in terms of power, but it always produces valid statistics. Under all types of misspecification considered, we do not find significant differences in the tests' behavior between the case of measurement invariance on the intercept and that on the intercept and slope, both in single and multiple parameter hypothesis testing.

The simulation study highlights that there are small numerical differences between the asymptotic power, computed through the approximation method for the noncentrality parameter, and the empirical power. However, the results given by the two procedures are consistent with each other, and the asymptotic power can be a valid alternative for obtaining the power of a test, since it reduces the computation time compared with the empirical power.
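To illustrate the two routes to power, the sketch below contrasts them for a single-parameter test. It assumes, for illustration only, that the statistic follows a noncentral chi-square distribution with one degree of freedom under the alternative; the noncentrality value `ncp = 6.0` is hypothetical, and the function names are not taken from the article. With one degree of freedom the statistic can be written as $(Z + \sqrt{\lambda})^2$ for a standard normal $Z$, so the rejection probability has a closed form in terms of the normal distribution function.

```python
import random
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def asymptotic_power(ncp, alpha=0.05):
    """Power of a 1-df chi-square test with noncentrality parameter ncp.

    Uses P((Z + sqrt(ncp))**2 > c) with Z standard normal and
    c the (1 - alpha) quantile of the central chi-square(1).
    """
    root_c = _STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)  # sqrt of critical value
    shift = ncp ** 0.5
    upper = 1.0 - _STD_NORMAL.cdf(root_c - shift)  # P(Z + shift > root_c)
    lower = _STD_NORMAL.cdf(-root_c - shift)       # P(Z + shift < -root_c)
    return upper + lower


def empirical_power(ncp, alpha=0.05, n_rep=20000, seed=1):
    """Monte Carlo estimate of the same rejection probability."""
    rng = random.Random(seed)
    crit = _STD_NORMAL.inv_cdf(1.0 - alpha / 2.0) ** 2
    shift = ncp ** 0.5
    rejections = sum(
        1 for _ in range(n_rep) if (rng.gauss(0.0, 1.0) + shift) ** 2 > crit
    )
    return rejections / n_rep


ncp = 6.0  # hypothetical noncentrality parameter
print(round(asymptotic_power(ncp), 3), round(empirical_power(ncp), 3))
```

In this toy setting each simulated statistic is a single random draw. In the article's setting every Monte Carlo replication also requires generating data and refitting the model, which is where the time saving of the asymptotic route comes from.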

Concerning the GS(J) test, it is never superior to the other tests and, due to its high computational cost, we do not recommend the use of this test to assess measurement invariance under model misspecification.

Consistent with the simulation results, in the real data analysis the LM(CP) test has the highest power to detect item measurement noninvariance under high misspecification of the latent variable distribution. The bootstrap hypothesis testing procedure proves to be a useful tool under model misspecification. Indeed, it helps to reach a clearer decision on whether to reject the null hypothesis when the asymptotic tests provide contradictory results.

For further studies on the performance of the LM tests under model misspecification, different types of estimation methods could be considered. Moreover, we found that when data are generated assuming a skew-normal distribution for the latent variable, parameter estimates are seriously biased with respect to the true parameter values. Further research should be devoted to exploring misspecified models where the parameter estimates are consistent for the true parameter values. In these cases, the LM tests should perform better.

Footnotes

Declaration of Conflicting Interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

Supplemental Material

Supplemental material for this article is available online.

ORCID iD

Silvia Cagnone

