Abstract
The linear logistic test model (LLTM) has been widely applied to investigate the effects of item covariates on item difficulty. The LLTM was extended with random item residuals to account for item differences not explained by the item covariates. This extended LLTM is called the LLTM-R. In this article, statistical inference methods are investigated for these two models. Type I error rates and power are compared via Monte Carlo studies. Based on the simulation results, the likelihood ratio test (LRT) is recommended over the paired-sample t test based on sum scores, the Wald z test, and information criteria; it is also recommended over the profile likelihood confidence interval because of its simplicity. In addition, the LLTM-R is concluded to be the better general modeling approach. Inferences based on the LLTM when the LLTM-R is the true model appear to be strongly biased in the liberal direction, whereas inferences based on the LLTM-R when the LLTM is the true model are biased only in a very minor and conservative way. Furthermore, in the absence of residual variance, Type I error rate and power were acceptable, except for power when both the number of items (10 items) and the number of persons (200 persons) are small. In the presence of residual variance, however, the number of items needs to be large (80 items) to avoid an inflated Type I error rate and to reach a power level of .90 for a moderate effect.
Introduction
The linear logistic test model (LLTM; Fischer, 1973) has been widely applied when researchers are interested in the effects of item covariates on item difficulty (e.g., Embretson & Wetzel, 1987; Freund, Hofer, & Holling, 2008; Green & Smith, 1987; Hornke & Habon, 1986; Sheehan & Mislevy, 1990; Spada & McGaw, 1985; Whitely & Schneider, 1981). The LLTM was extended to the LLTM with random item residuals (LLTM-R) to account for unexplained item differences, either within a two-stage empirical Bayes regression model framework (Mislevy, 1988) or within a generalized linear mixed model (GLMM) framework (De Boeck, 2008; Janssen, Schepers, & Peres, 2004). To determine whether the effects of item covariates in the LLTM and the LLTM-R are statistically significant, null hypothesis significance testing has been used in most LLTM applications (e.g., Embretson & Wetzel, 1987; Gorin, 2005; Whitely & Schneider, 1981). Test statistics (e.g., t- or z-statistics) and corresponding p values are used to test whether the population value of the effect differs from a specified value (generally 0). In addition to testing the effect, the two models, the LLTM(-R) with and without residual variance, are compared to detect item covariate effects using model comparison approaches such as the likelihood ratio test (LRT) or information criteria (e.g., Hohensinn & Kubinger, 2011; Kubinger, 2009).
The LLTM and the LLTM-R can be formulated as GLMMs (De Boeck, 2008; De Boeck & Wilson, 2004). Null hypothesis significance testing for fixed effects in GLMMs has been addressed using t- and z-statistics. In applications of the LLTM and the LLTM-R, the challenges are as follows. First, because the number of items (often fewer than 50) is smaller than the number of persons in many applications, small-sample inference may be inaccurate when the z-statistic is used as a Wald test. The z-statistic can be used only when the number of items is sufficiently large. Second, the z-statistic is based on estimated standard errors. Because the estimated standard errors do not take into account the sampling variability introduced by estimating the unknown variance of the estimator, they underestimate the true variability of fixed effects (Dempster, Rubin, & Tsutakawa, 1981). Although the t-statistic may be more appropriate for a smaller number of items, it suffers from the same downward bias of the estimated standard error, and in addition it is not clear what the degrees of freedom are. For example, suppose that there is one item covariate and 20 items. When the LLTM-R is applied, the number of item parameters to be estimated is two, that is, one fixed effect of the item covariate and one variance of random item residuals, and in addition there is a fixed intercept. Does this mean that there are 17 (
As an alternative to null hypothesis significance testing, effect sizes (i.e., estimates of item covariate effects in the LLTM and the LLTM-R) and their corresponding confidence intervals (CIs) can be used (e.g., Hecht et al., 2015). CIs can be more useful than null hypothesis significance testing, especially when researchers address issues that involve the magnitude of the effects (Steiger, 2004). In such a case, it is important to obtain sufficiently narrow CIs, which can be evaluated with the accuracy in parameter estimation approach (Kelley & Maxwell, 2003). The Wald CI works properly when the distribution of the parameter estimator is symmetric and the standard error is a good estimate of the standard deviation of the estimator (Wald & Wolfowitz, 1939). These are the same conditions as for the z test. Because the standard error is based on asymptotic variances obtained from the information matrix in maximum likelihood estimation, the Wald CI may not perform well for small sample sizes, for the same reason as the z test. Profile likelihood CIs are recommended for small sample sizes because such CIs do not assume the normality of the estimator (e.g., Cox & Hinkley, 1974). To the authors’ knowledge, the degree of improvement in using a profile likelihood CI when the LLTM and the LLTM-R are applied to a small number of items (e.g., 10) has not been shown.
Regarding model comparison approaches, Weirich, Hecht, and Böhme (2014) examined the Type I error rate and the power of the LRT when using the LLTM-R to detect (fixed) item position effects in a large-scale assessment such as the National Assessment of Educational Progress (NAEP). They found that the LLTM-R provided adequate power (e.g., >.90) when there are a large number of persons (e.g., 4,000) and a large number of items (e.g., 80). However, the performance of other model comparison approaches such as the Akaike information criterion (AIC; Akaike, 1974) and the Bayesian information criterion (BIC; Schwarz, 1978) has not been investigated for the LLTM and the LLTM-R.
Table A1 in the Online Appendix gives a descriptive overview of LLTM and LLTM-R studies. Item covariates have been used to examine the effects of cognitive variables on item difficulty (e.g., Fischer, 1973; Whitely & Schneider, 1981) and the effects of item characteristics on item difficulty (e.g., De Boeck, 2008), to design tests based on cognitive rules (e.g., Hornke & Habon, 1986; Gorin, 2005), and to investigate design effects in large-scale tests (e.g., Hartig, Frey, Nold, & Klieme, 2012; Hecht, Weirich, Siegle, & Frey, 2015). As one can see, various approaches have been used for inferences and no consensus has developed. An additional complication is that the estimates and standard errors of the LLTM can be substantially underestimated if random residuals are ignored (e.g., Hartig et al., 2012).
Therefore, the purpose of this study is to address the inferential qualities (Type I error rate and power) of different approaches: null hypothesis significance testing based on the estimates of covariate effects, profile likelihood CI, and model comparison approaches (i.e., the LRT and information criteria). Although one may also be interested in profile likelihood CIs for reasons other than significance testing, such as effect size considerations, the focus here is on their use for significance testing. Specifically, this study investigated whether the inferences hold nominal Type I error rates (
In the current study, a prototypical case with one binary covariate and a balanced design was chosen. In theory, a very extensive study is possible, with different types of item covariates (binary-, ordered-, or nominal-category covariates), with different designs (balanced, slightly unbalanced, or largely unbalanced), with different numbers of covariates, and with and without collinearity between covariates. Here, a simple design was considered for several reasons. First, in the more traditional approach, the analogue of the LLTM is a repeated-measures design with a within-subjects factor, and a balanced design is the more prototypical case. For binary data, one would then commonly use sum scores for the two levels of the repeated-measures factor, because binary variables are difficult to handle in a traditional approach. In this case, a paired-sample t test for the sum scores (the equivalent of a simple within-subjects repeated-measures analysis of variance [ANOVA]) would be the natural way to test the effect of the factor. For similar reasoning, see Kubinger (2009). This allows us to make a link and a useful comparison between two very different approaches: an LLTM approach and a traditional paired-sample t test. Second, Johnson, Barry, Ferguson, and Müller (2015) have found that the type of covariate is not a factor that affects power, and other influences such as correlation between covariates seem very similar to those in multiple regression (Green & Smith, 1987). There are no special reasons why well-known effects in multiple linear regression (such as the effect of collinearity) would not play a role in the LLTM(-R). For the present study, the focus is not on the types of covariate configurations but on a comparison of inferential approaches. As a first step in investigating this issue, it is useful to concentrate on a simple-to-interpret case that avoids various complications and a huge number of conditions in the simulation design. As will become clear from the following, there are already very many comparisons to make for the simple case. Third, because more than one item covariate is often used in practice, a real data example with multiple covariates is provided, to make the contribution less abstract and to show its relevance for a broader situation than the one covered in the simulation study.
In the following, the LLTM and the LLTM-R and their parameter estimation methods are described. Next, null hypothesis significance testing, profile likelihood CI, and model comparison approaches are presented for inferences regarding an item covariate effect. Then follows the simulation study: design, hypotheses, and results. The authors end with a summary and a discussion.
LLTM and Parameter Estimation
The LLTM is a constrained Rasch model, and the difficulty parameters of the LLTM are estimated as linear combinations of a smaller number of item covariates such as cognitive processing demands of tasks and item positions in a test. The LLTM is described as follows:
where p is a person index (
The LLTM-R is expressed in the following equation:
where
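As a sketch in commonly used GLMM notation, the two models can be written as below. The symbols are assumptions and need not match the notation of the original equations: $y_{pi}$ is the binary response of person $p$ to item $i$, $\theta_p$ is the person trait, $X_{ik}$ is the value of item covariate $k$ for item $i$ (with $X_{i0} = 1$ for the intercept), $\beta_k$ is the effect of covariate $k$, and $\varepsilon_i$ is the random item residual. The sign convention (difficulty subtracted from the person trait) is also an assumption.

```latex
% LLTM: item difficulty is a linear function of the item covariates
\operatorname{logit}\left[ P(y_{pi} = 1 \mid \theta_p) \right]
  = \theta_p - \sum_{k=0}^{K} \beta_k X_{ik},
  \qquad \theta_p \sim N(0, \sigma_\theta^2)

% LLTM-R: the same linear function plus a random item residual
\operatorname{logit}\left[ P(y_{pi} = 1 \mid \theta_p, \varepsilon_i) \right]
  = \theta_p - \left( \sum_{k=0}^{K} \beta_k X_{ik} + \varepsilon_i \right),
  \qquad \varepsilon_i \sim N(0, \sigma_\varepsilon^2)
```

Under this sketch, the LLTM is the special case of the LLTM-R with $\sigma_\varepsilon^2 = 0$.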
Parameter Estimation
In this study, maximum likelihood estimation implemented in glmer of the lme4 R package (Bates, Maechler, & Bolker, 2011) was chosen to estimate the parameters of the LLTM and the LLTM-R. The LLTM-R has crossed-random effects (i.e.,
Because the model comparison methods chosen in this article (LRT and information criteria) use maximized log-likelihood values, it may be instructive to present marginalized likelihood functions for the LLTM and the LLTM-R. The marginal likelihood for the LLTM is as follows:
where
where
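The structure of the two marginal likelihoods can be sketched as follows. The notation is an assumption and may differ from the original equations: $\pi_{pi}$ denotes the model-implied probability of a correct response, $\phi(\cdot\,; 0, \sigma^2)$ a normal density with mean 0 and variance $\sigma^2$, $\theta_p$ the person trait, and $\varepsilon_i$ the random item residual.

```latex
% LLTM: persons are independent, so the likelihood is a product of
% one-dimensional integrals over the person trait
L(\boldsymbol{\beta}, \sigma_\theta^2)
  = \prod_{p=1}^{P} \int \prod_{i=1}^{I}
    \pi_{pi}^{\,y_{pi}} (1 - \pi_{pi})^{1 - y_{pi}}
    \, \phi(\theta_p; 0, \sigma_\theta^2) \, d\theta_p

% LLTM-R: person and item random effects are crossed, so the integral
% does not factor into per-person (or per-item) integrals
L(\boldsymbol{\beta}, \sigma_\theta^2, \sigma_\varepsilon^2)
  = \int \cdots \int
    \prod_{p=1}^{P} \prod_{i=1}^{I}
    \pi_{pi}^{\,y_{pi}} (1 - \pi_{pi})^{1 - y_{pi}}
    \prod_{p=1}^{P} \phi(\theta_p; 0, \sigma_\theta^2)
    \prod_{i=1}^{I} \phi(\varepsilon_i; 0, \sigma_\varepsilon^2)
    \, d\boldsymbol{\theta} \, d\boldsymbol{\varepsilon}
```

The high-dimensional integral in the second expression is the reason an approximation is needed to evaluate the likelihood when random effects are crossed.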
Detecting Item Covariate Effects
In this section, the null hypothesis significance testing, profile likelihood CI, and model comparison approaches are described for detecting
Null Hypothesis Significance Testing
z test for
For the effect of an item covariate,
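A minimal sketch of a Wald z test is given below, assuming that an estimate of the item covariate effect and its estimated standard error are already available from a fitted model. The function name `wald_z_test` and the numerical inputs are hypothetical and only illustrate the computation.

```python
import math


def wald_z_test(estimate, std_error, null_value=0.0):
    """Return the Wald z statistic and its two-sided p value.

    estimate and std_error are assumed to come from a fitted model;
    null_value is the hypothesized effect (0 for "no covariate effect").
    """
    if std_error <= 0:
        raise ValueError("std_error must be positive")
    z = (estimate - null_value) / std_error
    # Two-sided p value under N(0, 1): P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


# Hypothetical estimate and standard error, for illustration only.
z, p = wald_z_test(estimate=0.2, std_error=0.08)
```

The p value relies on the normal approximation, which is the assumption that becomes questionable when the number of items is small.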
Paired-sample t test based on sum scores
A paired-sample t test can be used to test whether differences between the sum scores for subcategories of items defined on the basis of the item covariate are significantly different from 0. For example, an item covariate has values (0 and 1) and they correspond to two conditions, with the first five items in Condition 1 (covariate value of 0) and the remaining items in Condition 2 (covariate value of 1). With this example, the difference score can be calculated for each person p, that is,
where
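The paired-sample t test on sum scores can be sketched as follows. The function name and the toy data are hypothetical; the covariate is assumed to be coded 0 and 1 as in the example above, and the difference score is taken as the sum score for covariate value 1 minus the sum score for covariate value 0.

```python
import math


def paired_t_from_sum_scores(responses, covariate):
    """Paired-sample t statistic for sum-score differences.

    responses: one list of 0/1 item responses per person.
    covariate: one 0/1 value per item (assumed coding).
    Returns (t, df) with df = number of persons - 1.
    """
    diffs = []
    for person in responses:
        sum_1 = sum(y for y, x in zip(person, covariate) if x == 1)
        sum_0 = sum(y for y, x in zip(person, covariate) if x == 0)
        diffs.append(sum_1 - sum_0)
    n = len(diffs)
    mean_d = sum(diffs) / n
    var_d = sum((d - mean_d) ** 2 for d in diffs) / (n - 1)
    if var_d == 0:
        raise ValueError("difference scores have zero variance")
    t = mean_d / math.sqrt(var_d / n)
    return t, n - 1


# Toy data: 4 persons, 4 items, first two items in Condition 1 (value 0).
toy_covariate = [0, 0, 1, 1]
toy_responses = [[1, 1, 0, 0], [1, 1, 1, 0], [1, 0, 0, 0], [1, 1, 1, 1]]
t_stat, df = paired_t_from_sum_scores(toy_responses, toy_covariate)
```

The statistic is referred to a t distribution with the returned degrees of freedom. Note that only persons enter as the unit of analysis, so variation between items within a condition is ignored.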
Profile Likelihood CI
CI approaches are in the first place of interest in an effect size context but they can also be used indirectly to test the null hypothesis of no effect of an item covariate,
The
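The logic of a profile likelihood CI can be sketched on a deliberately simple stand-in model: a single logit parameter for a binomial outcome, so that the profile log-likelihood coincides with the log-likelihood. In the LLTM(-R), the remaining parameters would instead be re-estimated for each fixed value of the covariate effect. All names and data below are hypothetical, and the bounds are found by bisection under the assumption of a unimodal log-likelihood.

```python
import math

CHI2_95_DF1 = 3.841458820694124  # 0.95 quantile of chi-square with 1 df


def deviance_drop(loglik, mle, value):
    """Twice the drop in log-likelihood when moving from the MLE to value."""
    return 2.0 * (loglik(mle) - loglik(value))


def find_bound(loglik, mle, far, tol=1e-10):
    """Bisection between the MLE (inside the CI) and far (outside the CI)."""
    if deviance_drop(loglik, mle, far) <= CHI2_95_DF1:
        raise ValueError("far must lie outside the interval")
    inside, outside = mle, far
    while abs(outside - inside) > tol:
        mid = 0.5 * (inside + outside)
        if deviance_drop(loglik, mle, mid) <= CHI2_95_DF1:
            inside = mid
        else:
            outside = mid
    return 0.5 * (inside + outside)


# Stand-in model: k successes out of n trials, parameter b on the logit scale.
k, n = 30, 100


def loglik(b):
    return k * b - n * math.log1p(math.exp(b))


mle = math.log(k / (n - k))
lower = find_bound(loglik, mle, far=-5.0)
upper = find_bound(loglik, mle, far=5.0)
excludes_zero = not (lower <= 0.0 <= upper)
```

The null hypothesis of a zero parameter is rejected at the 5% level when the interval excludes 0, which is how the interval serves as a test. The interval need not be symmetric around the estimate.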
Model Comparisons
To test
LRT
LRT is for comparisons of the two nested models. Assume that the null hypothesis is given by
The LRT statistic is defined as
where L is the likelihood function,
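The LRT computation can be sketched as follows, assuming that the maximized log-likelihoods of the null model (without the item covariate) and the alternative model (with the item covariate) are available, and that the two models differ by one parameter so that the reference distribution is chi-square with 1 degree of freedom. The function name and the log-likelihood values are hypothetical.

```python
import math


def likelihood_ratio_test(loglik_null, loglik_alt):
    """LRT statistic and p value for two nested models differing by 1 df."""
    stat = -2.0 * (loglik_null - loglik_alt)
    # Guard against tiny negative values caused by numerical error.
    stat = max(stat, 0.0)
    # Upper tail of chi-square(1): P(X > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Hypothetical maximized log-likelihoods, for illustration only.
lrt_stat, lrt_p = likelihood_ratio_test(loglik_null=-1002.0, loglik_alt=-1000.0)
```

The closed-form tail probability holds only for 1 degree of freedom; testing several covariates at once would require a general chi-square distribution function.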
Information criteria
The AIC is an estimate of the expected relative Kullback–Leibler (K-L) divergence (Akaike, 1974). Thus, the AIC implicitly estimates the divergence between the true model and the candidate model. Even though the actual K-L divergence is unknown because of the unknown true model, it was shown that the candidate model with the lowest AIC has the lowest expected K-L divergence (e.g., Burnham & Anderson, 2002). The AIC is efficient but it is not strongly consistent unless the true model is among the candidate models. The AIC penalizes for the number of parameters as follows:
$$\mathrm{AIC} = -2 \log L + 2\,\mathrm{Num},$$
where Num is the number of parameters.
Schwarz (1978) derived the BIC to serve as an asymptotic approximation to a transformation of the Bayesian posterior probability of a candidate model. The BIC is consistent, which means that it selects the true model as N tends to infinity (when the true model is among the candidate models or the number of parameters in the true model is finite) (see Claeskens & Hjort, 2008, Chapter 4, for details). The lowest BIC value is taken to indicate the best-fitting model. The BIC penalizes for the number of parameters (Num) more strongly the larger the sample size (N) is:
$$\mathrm{BIC} = -2 \log L + \mathrm{Num} \cdot \log(N).$$
In this study, the number of item responses (
Burnham and Anderson (2002, pp. 298-301) compared the performance of the AIC and the BIC in selecting covariates in the linear regression model (
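Model selection with the two criteria can be sketched as follows. The function names and all numerical values are hypothetical, including the choice of 200 persons by 10 items as the number of item responses used for N.

```python
import math


def aic(loglik, num_params):
    """AIC = -2 log L + 2 Num."""
    return -2.0 * loglik + 2.0 * num_params


def bic(loglik, num_params, n_obs):
    """BIC = -2 log L + Num log(N)."""
    return -2.0 * loglik + num_params * math.log(n_obs)


def prefers_alternative(ic_null, ic_alt):
    """The model with the lower criterion value is selected."""
    return ic_alt < ic_null


# Hypothetical maximized log-likelihoods; the alternative model adds one
# parameter (the item covariate effect) to the null model.
n_obs = 200 * 10  # assumed: persons x items
aic_says_effect = prefers_alternative(aic(-1002.0, 2), aic(-1000.0, 3))
bic_says_effect = prefers_alternative(bic(-1002.0, 2, n_obs), bic(-1000.0, 3, n_obs))
```

With one extra parameter, the AIC selects the alternative model whenever the LRT statistic exceeds 2, whereas the BIC requires it to exceed log(N). In this illustration the LRT statistic is 4 and log(2000) is about 7.6, so the two criteria disagree.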
The detection of item covariate effects using the methods described above is illustrated with an empirical dataset in the Online Appendix. The empirical dataset includes multiple item covariates, in contrast with the simulation study that is described next.
Simulation Study
A simulation study was designed to investigate the Type I error rate and the power for detecting an item covariate effect in various designs that may influence power and precision for the effects. The data generating models were the LLTM and the LLTM-R. The LLTM is a special case of the LLTM-R when the residual variance is 0. The same generated datasets were fit to the LLTM and the LLTM-R to investigate (a) misfitting, ignoring random residuals and thus fitting the LLTM when there is indeed residual variance, and (b) overfitting and thus fitting the LLTM-R when there is no residual variance.
Simulation Design
The following factors were used in the design of the simulation study: (a) number of items, (b) number of persons, (c) size of the item covariate effect
The number of items
The number of items was selected as
The number of persons
The number of persons was set to P = 200, 2,000, and 4,000. The value of 200 was chosen to approximate sample sizes in LLTM applications (e.g., Freund et al., 2008; Gorin, 2005; Kubinger, 2009; Medina Díaz, 1993; Whitely & Schneider, 1981; see Table A1 in the Online Appendix). Based on the findings in Weirich et al. (2014), the value of 4,000 was included to examine the large-sample properties of the investigated methods. The value of 2,000 was chosen as a moderate sample size; this condition was also considered in Weirich et al. (2014) for comparison with 4,000.
Magnitude of covariate effect
Three levels of magnitudes (
Residual variances
The residual variances were set to
Combining the magnitude of the item covariate effect with the residual variance yields different values of the variance between items as well as different proportions of these variances explained by the item covariate. For a zero effect, the variances are 0, 0.2, and 0.4, depending on the three values of the residual variance (0, 0.2, 0.4), and of course, the proportion explained variance is either undefined (0/0) or zero (0/0.2 and 0/0.4). For an effect of 0.2, 0.04 needed to be added to the residual variance so that the variances are 0.04 (
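The relation between effect magnitude, residual variance, and proportion of explained item variance can be sketched as follows. The sketch assumes, in line with the example of an effect of 0.2 adding 0.04, that the covariate contributes the squared effect to the variance between items. The function name is hypothetical, and the value 0.5 in the last line is an illustrative input.

```python
import math


def explained_proportion(effect, residual_variance):
    """Proportion of between-item variance explained by the covariate.

    Assumption: the covariate contributes effect ** 2 to the item variance.
    Returns None when the total variance is 0 (the undefined 0/0 case).
    """
    covariate_variance = effect ** 2
    total_variance = covariate_variance + residual_variance
    if total_variance == 0:
        return None
    return covariate_variance / total_variance


undefined_case = explained_proportion(0.0, 0.0)   # 0/0
zero_effect = explained_proportion(0.0, 0.2)      # 0/0.2
small_effect = explained_proportion(0.2, 0.2)     # 0.04/0.24
# Correlation between covariate and item parameters for an illustrative case.
illustrative_corr = math.sqrt(explained_proportion(0.5, 0.2))
```

For the illustrative effect of 0.5 with a residual variance of 0.2, the explained proportion is 0.25/0.45 and the implied correlation is about 0.745.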
The person trait was generated with a standard normal distribution in each replication,
The four simulation conditions were fully crossed, yielding 81 (
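Data generation for one replication can be sketched as follows. This is a minimal sketch under stated assumptions, not the generating code of the study: the covariate is coded 0 and 1 with half of the items at each value, the intercept is a free argument, and difficulty is subtracted from the person trait on the logit scale. The function name and arguments are hypothetical.

```python
import math
import random


def simulate_lltm_r(n_persons, n_items, intercept, effect,
                    residual_variance, seed=None):
    """Generate binary responses under an LLTM-R (LLTM if variance is 0)."""
    rng = random.Random(seed)
    half = n_items // 2
    covariate = [0] * half + [1] * (n_items - half)  # balanced binary covariate
    resid_sd = math.sqrt(residual_variance)
    # Item difficulty = intercept + covariate effect + random item residual.
    difficulty = [intercept + effect * x + rng.gauss(0.0, resid_sd)
                  for x in covariate]
    data = []
    for _ in range(n_persons):
        theta = rng.gauss(0.0, 1.0)  # person trait from a standard normal
        row = []
        for b in difficulty:
            prob = 1.0 / (1.0 + math.exp(-(theta - b)))
            row.append(1 if rng.random() < prob else 0)
        data.append(row)
    return data, covariate, difficulty


# One replication of a small condition (values chosen for illustration).
data, covariate, difficulty = simulate_lltm_r(
    n_persons=200, n_items=10, intercept=0.0, effect=0.2,
    residual_variance=0.0, seed=1)
```

Setting `residual_variance=0.0` yields LLTM data, so the same function covers both data generating models.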
Evaluation Measures
The proportion of 1,000 replications leading to an inference that
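The evaluation measure can be sketched as a rejection rate over replications, together with its Monte Carlo standard error. The function name, the nominal level of .05, and the toy p values are assumptions for illustration.

```python
import math


def rejection_rate(p_values, alpha=0.05):
    """Proportion of replications with p < alpha and its Monte Carlo SE.

    When the true effect is 0, the rate estimates the Type I error rate;
    when the true effect is nonzero, it estimates power.
    """
    n_rep = len(p_values)
    rate = sum(1 for p in p_values if p < alpha) / n_rep
    mc_se = math.sqrt(rate * (1.0 - rate) / n_rep)
    return rate, mc_se


# Toy p values standing in for the results of four replications.
rate, mc_se = rejection_rate([0.01, 0.20, 0.04, 0.50])
```

With 1,000 replications and a true rejection rate of .05, the Monte Carlo standard error is about .007, which indicates how far an observed rate can deviate from the nominal level by chance alone.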
Hypotheses
The planned comparisons are more than exploratory. Expectations are based on the characteristics of the approaches. Therefore, the simulation study can be considered partly a hypothesis-testing study with potentially important results for inferential practices. If the hypotheses turn out to be supported by the results, current practices may have to change. The study may suggest that some inferential approaches should be preferred over others or that one model is a better basis for testing covariate effects than another.
Type I error rate
To formulate expectations regarding the Type I error rate, two categories of methods were differentiated: statistical tests and information criteria. The first category consists of the paired-sample t test with sum scores, the z test of the estimated effect, and the LRT for the comparison of two models. Although the profile likelihood CI is primarily meant as an effect size approach, it can also be used for null hypothesis testing. The second category consists of the AIC and the BIC.
In principle, the Type I error rate should be maintained at the nominal level. However, for some parts of the design, an inflated Type I error rate was expected for the paired-sample t test and the z test. For the paired-sample t test, the basis of the expectation is that when the item variance is larger than zero, the means as a function of the item covariate will be randomly different with a standard deviation of
Also for the z test, an inflated Type I error rate is expected when the number of items is small, because the z test assumes a normal (and thus a narrower) distribution of the estimated effect, which is not realized for smaller numbers of items in the presence of residual variance. This is a problem for the LLTM-R, and more so the smaller the number of items is. However, because the z test does not ignore the item source of variance while the paired-sample t test does, the inflation is not expected to be nearly as large for the former as for the latter. Because the LRT and the profile likelihood CI rely on the likelihood of the whole model and the number of df is clear (difference in number of parameters), it is expected that these approaches respect the nominal
The information criterion logic does not allow us to formulate an
Power
Power is expected to depend on two factors: the amount of information available in the data and the effect size. Therefore, it is expected that the power increases as the number of persons and the number of items increase, and that the power increases with the magnitude of the effect and decreases with the residual variance. This is because the explained variance (and thus the standardized effect size calculated in terms of correlations) increases with the magnitude of the item covariate effect and decreases with the residual variance. Furthermore, because of the same problems as with the Type I error inflation, it is expected that the paired-sample t test has a much larger power and the z test has a slightly larger power compared with the LRT and the profile likelihood CI when the residual variance is larger than zero. For the same reasons as for the Type I error rate, the AIC is expected to have a higher power (strictly speaking, a higher proportion of true positives) than the BIC, and the power of the LRT is expected to be lower than that of the AIC and higher than that of the BIC.
Using a different model than the data generation model
To examine what happens when the estimation model differs from the data generation model, the two estimation models were compared for the same data: the LLTM and the LLTM-R were both used as estimation models for data generated under the LLTM and under the LLTM-R. Using the LLTM for LLTM-R generated data means that a clearly misfitting (too constrained) model is used because the true residual variance is ignored. Using the LLTM-R for LLTM generated data means that an overfitting (not sufficiently constrained) model is used because it allows for a residual variance that does not exist. For the LLTM-R used for LLTM data, a somewhat deflated Type I error rate is expected because the estimate of the residual variance, even when very small, will contribute to the uncertainty of the estimated item covariate effect. For the LLTM used for LLTM-R data, an inflated Type I error rate and an overestimated power are expected because the uncertainty due to the residual variance is not taken into account.
The hypotheses for the Type I error rate and power are highly related. This is of course because the power depends on the Type I error rate. Still it is possible to have an adequate Type I error rate, one that corresponds with the nominal
Results
The results are presented in Figure 1 for the Type I error rate, in Figure 2 for power, in Figure 3 for the Type I error rate and power of a misfitting model (using the LLTM for LLTM-R data), and in Figure 4 for the Type I error rate and power of an overfitting model (using the LLTM-R for LLTM data). For comparison purposes, results in the case of

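Figure 1. Type I error rate.
Figure 2. Power.
Figure 3. Type I error rate and power of a misfitting model (LLTM used for LLTM-R data).
Figure 4. Type I error rate and power of an overfitting model (LLTM-R used for LLTM data).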
Type I error rate
Type I error rate as presented in the top half of Figure 1 was calculated for the case of
For the case of
The following patterns were observed for
Power
Power as presented in the top half of Figure 2 was calculated for the case of
The following trends were observed in the case of
Using a different model than the data generation model
When the LLTM-R was used in the absence of residual variance (see Figure 3), the Type I error rate was slightly deflated and the power was also slightly lower than when the LLTM was used. The differences appeared primarily for the small sample size (
The results are more drastic when the LLTM was used in the case of
Summary and Discussion
The LLTM and LLTM-R have been used to investigate the effects of item covariates (e.g., cognitive variables, item characteristics, item design factors). In this study, the paired-sample t test based on sum scores, the z test, profile likelihood CI, LRT, AIC, and BIC were evaluated via Monte Carlo studies. Based on the simulation results, the following conclusions can be drawn. First, the LRT seems a better test than the z test. It does respect the nominal
Second, the information criteria are not a good basis for inference regarding the effect of a covariate. The AIC leads to a large proportion of false positives, and the BIC results in a large proportion of false negatives. These findings are in line with the common knowledge that the AIC favors more complex models and that the BIC is conservative if the sample size is not very large (e.g., Burnham & Anderson, 2002). Although the number of persons was very large in one of the simulation conditions (
Third, the LLTM-R is clearly the better choice as an estimation model. In the presence of residual variance, the LLTM leads to largely invalid tests and an extremely high proportion of false positives. The better approach is to start with the LLTM-R to check whether there is residual variance. Only when there is no indication at all for residual variance should the LLTM be used. Note that even when the correlation between the item covariate and the item parameters is 0.745 (
Fourth, to reach the nominal Type I error rate and the desired power rate of .90, a very large number of items (e.g.,
Not all these recommendations are equally strong. For example, although the profile likelihood CI and the LRT seem to do better than the z test, there is still room for improvement because they still show some Type I error inflation for smaller numbers of items. Regarding power, a major concern is that either the LLTM is needed, which implies a (nearly) perfect explanation of the item difficulties, or a really large number of items. It is worth considering other possible purposes that can be fulfilled with the LLTM-R, other than null hypothesis testing of item covariate effects. For example, the LLTM-R can be used for item generation purposes (De Boeck, Cho, & Wilson, 2016) and the model may still be useful for the item part of a model when the model is used in the first place to derive ability scores and not so much to estimate item effects. Finally, the authors do not want to depreciate the AIC and BIC as general model selection methods. The information criteria were used in a similar way as null hypothesis testing methods, which they are not, as explained earlier.
The simulation conditions employed in the study are limited to one item covariate. When multiple item covariates are considered, one may hope that residual item variance is further reduced, so that the ideal situation is better approached. However, the residual variance as such seems important, and adding item covariates does not imply a high proportion of item variance can be explained. In case multiple item covariates are available, a new problem arises, which is the selection of the best model, with the best set of item covariates. The number of comparison models is
Another limitation is the type of item covariate. In this study, a balanced item covariate with an equal number of items per value of the item covariate was used. One may expect that the power is affected in unbalanced designs. In addition, the item covariate was binary, but that does not seem to be a factor that affects power in the GLMM framework (e.g., Johnson et al., 2015). Finally, Green and Smith (1987) found that collinearity in continuous item covariates (e.g., frequency with which an item attribute d is needed) affected the accuracy of item covariate effect estimates in the LLTM, which is similar to the result for linear regression. In the presence of collinearity, they found via a simulation study that the item covariate estimates and standard errors were not accurate when the sample size was small (less than 200) (see Table 3, Green & Smith, 1987, p. 378).
In sum, the authors believe that the results in the current study are informative even though they are limited to a balanced binary covariate. The results concern specific but basic issues regarding the LLTM and the LLTM-R, such as ignoring residual item variation. However, one may expect that factors that play a role in regular linear regression would also play a role in the LLTM and LLTM-R, such as model selection, collinearity, and heteroscedasticity of error terms. To evaluate the effects of these factors more precisely, further research with more complex simulation designs would be required.
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
