Abstract
A modified and improved inductive inferential approach to evaluate item discriminations in a conditional maximum likelihood and Rasch modeling framework is suggested. The new approach involves the derivation of four hypothesis tests. It implies a linear restriction of the assumed set of probability distributions in the classical approach that represents scenarios of different item discriminations in a straightforward and efficient manner. Its improvement is discussed, compared to classical procedures (tests and information criteria), and illustrated in Monte Carlo experiments as well as real data examples from educational research. The results show an improvement of power of the modified tests of up to 0.3.
1. Introduction
A crucial problem in psychometrics is the evaluation of item discriminations. In applications of psychometric models ranging from factor analysis and structural equation modeling to item response theory, dealing with it is almost inevitable. Some item response models allow for different item discriminations, and others do not. The Rasch model (Fischer & Molenaar, 1995; Rasch, 1960) assumes equal item discriminations. Testing this assumption on the basis of data is central to applications of the model. Inductive inference on assumptions and parameters of the Rasch model is often based on a conditional maximum likelihood (CML) approach (Andersen, 1970; Molenaar, 1995; Pfanzagl, 1993). A discussion and justification of this approach in a wider research context has recently been given by Skrondal and Rabe-Hesketh (2022). In the CML framework, there is essentially only one well-established approach suitable to investigate differences in item discriminations. It is the comparison of item parameters between groups of persons yielding different scores, that is, numbers of correct responses, which are sufficient statistics for the person parameters of the model. Rasch (1960) suggested a purely descriptive and graphical check comparing the estimates of item parameters. Andersen (1973) derived the first hypothesis testing procedure serving this purpose, that is, a likelihood ratio test of the hypothesis of invariance of item parameters across different person score groups. Both are contained in the R (R Core Team, 2022) package eRm (Mair & Hatzinger, 2007). Other asymptotically equivalent test statistics are straightforward, as pointed out by Glas and Verhelst (1995) and Draxler and Alexandrowicz (2015), that is, the Rao score or Lagrange multiplier test (Rao, 1948; Silvey, 1959) and the Wald test (Wald, 1943). Recently, Draxler et al. (2022) showcased the application of another relatively new test statistic in this context, that is, the gradient test (Lemonte, 2016; Terrell, 2002). All four test statistics are easily accessible in the R package tcl (Draxler & Kurz, 2023). The comparison of item parameters between person score groups makes sense immediately. For example, if an item was easier relative to the other items for persons with a high score (persons with high ability or proficiency) and at the same time the respective item was more difficult for persons with a low score, the item would have higher discrimination than the other items.
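The reasoning behind the score group comparison can be illustrated numerically. The following is a minimal sketch, not part of the original study: it assumes a 2PL-type linear predictor of the form discrimination × ability + easiness, and the item values are hypothetical. It shows that an item with a higher discrimination is relatively harder than a reference item at low ability and relatively easier at high ability.

```python
import math

def prob_correct(theta, easiness, discrimination=1.0):
    """Response probability with assumed logit = discrimination * theta + easiness."""
    return 1.0 / (1.0 + math.exp(-(discrimination * theta + easiness)))

def logit(p):
    """Log-odds of a probability."""
    return math.log(p / (1.0 - p))

def relative_easiness(theta, item, reference):
    """Log-odds of solving `item` minus log-odds of solving `reference` at ability theta."""
    return logit(prob_correct(theta, *item)) - logit(prob_correct(theta, *reference))

# Hypothetical items given as (easiness, discrimination).
reference_item = (0.0, 1.0)
steep_item = (0.0, 1.5)  # higher discrimination than the reference item

low_ability, high_ability = -1.0, 1.0
gap_low = relative_easiness(low_ability, steep_item, reference_item)
gap_high = relative_easiness(high_ability, steep_item, reference_item)

# The sign of the gap flips between the low and the high ability person.
print(f"relative easiness at low ability:  {gap_low:+.3f}")
print(f"relative easiness at high ability: {gap_high:+.3f}")
```

Under equal discriminations, the gap would be the same at every ability level, which is the invariance that the tests discussed in this article examine.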
This article suggests a new modified procedure to evaluate item discriminations in the CML framework. It discusses four statistical tests of the hypothesis of invariance of item parameters across person score groups. The modification implies the specification of a linearly restricted alternative hypothesis, that is, an appropriately chosen subset of probability distributions that represents the scenario of different item discriminations in a more straightforward and more efficient manner. The article discusses the underlying theory and illustrates, in an extensive Monte Carlo study, the improvement over the original Andersen likelihood ratio test (ALRT), showing a gain in power of up to 0.3.
The remainder of this text is organized as follows. In Section 2, the ideas and concepts are formalized, and the respective test statistics are derived. Section 3 describes the design of a Monte Carlo investigation into relevant properties of the tests (like type I and II error rates), and Section 4 shows its results. Section 5 presents real data examples from educational research. Section 6 gives a discussion and final remarks.
2. Theoretical Foundation
Assume that the true unknown probability distribution generating the data in a sample space is a member of a parametric family of distributions that is specified by a psychometric model indexed by parameters taking values in parameter space
The psychometric model is given by
where
where
and
Hence, further considerations can be restricted to the distributions of the sufficient statistics. Since the parameter
It does not depend on the nuisance parameter vector
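The elimination of the person parameter by conditioning on the person score can be checked numerically in a small example. The sketch below is an illustration under stated assumptions, not the authors' code: it uses the binary Rasch model in an easiness parametrization (logit = person parameter + item parameter), four hypothetical item parameters, and brute-force enumeration of response patterns instead of an efficient algorithm for the elementary symmetric functions.

```python
import itertools
import math

def pattern_prob(pattern, theta, easiness):
    """Rasch probability of a complete response pattern of one person."""
    prob = 1.0
    for x, beta in zip(pattern, easiness):
        p = 1.0 / (1.0 + math.exp(-(theta + beta)))
        prob *= p if x == 1 else 1.0 - p
    return prob

def same_score_patterns(pattern):
    """All binary patterns of the same length that share the person score."""
    score = sum(pattern)
    return [p for p in itertools.product((0, 1), repeat=len(pattern)) if sum(p) == score]

def conditional_prob(pattern, theta, easiness):
    """P(pattern | person score), computed from the unconditional probabilities."""
    total = sum(pattern_prob(p, theta, easiness) for p in same_score_patterns(pattern))
    return pattern_prob(pattern, theta, easiness) / total

def conditional_prob_closed_form(pattern, easiness):
    """exp(sum_i x_i * beta_i) divided by the elementary symmetric function of the score."""
    def weight(p):
        return math.exp(sum(x * b for x, b in zip(p, easiness)))
    return weight(pattern) / sum(weight(p) for p in same_score_patterns(pattern))

easiness = [-1.0, -0.3, 0.4, 0.9]  # hypothetical item parameters
pattern = (1, 0, 1, 0)

# The conditional probability is the same for very different person parameters.
for theta in (-2.0, 0.0, 1.5):
    print(theta, round(conditional_prob(pattern, theta, easiness), 6))
print("closed form:", round(conditional_prob_closed_form(pattern, easiness), 6))
```

The closed form contains only the item parameters, which is why the conditional likelihood can be maximized without estimating or specifying a distribution for the person parameters.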
2.1. Andersen’s Likelihood Ratio Test
Let
where
with
and
(as assumed by the Rasch model) against the alternative
The test statistic is based on the likelihood ratio or, equivalently, the log likelihood difference. The key to it is to maximize
Andersen (1973) proved that the limiting distribution of $T_1$ is the central
In practical data analysis, usually only two groups of persons are considered instead of all
2.2. A Modified Procedure
The suggested modification implies the specification of a proper subset of
Instead of
The restricted model is then given by
and the joint distribution of the responses of all persons to all items by
For identifiability, let
and
A conditional distribution and conditional likelihood can again be derived by conditioning on
and
All three of them are also the members of multiparameter exponential families (as can immediately be seen). The function
Thus,
where the additive constant
The score function is given by
Note that the score function
The Fisher information matrix can be obtained as follows. Let
be the Jacobian matrix of the linear function
It is simply a linear function of the information matrix
The CML estimate of
which is obtained by solving
and
when the number of persons
2.3. Statistical Tests
Consider the two subclasses of distributions or (equivalently) subsets of the restricted parameter space
which is the same as
(as assumed by the Rasch model) against the alternative
which represents the scenario of a constant difference of the easiness or attractiveness of at least one item between successive person score groups and thus a different discrimination of at least one item relative to the others.
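The alternative described here can be made concrete with a small numerical sketch. It assumes, for illustration only, that the item parameter of item i in the person score group r has the linear form beta_i + r * delta_i; the names `beta` and `delta`, the values, and the normalization (last item fixed at zero) are hypothetical and not taken from the article's equations.

```python
def group_item_parameters(beta, delta, scores):
    """Item parameters per person score group under the assumed form beta_i + r * delta_i."""
    return {r: [b + r * d for b, d in zip(beta, delta)] for r in scores}

# Hypothetical values for k = 4 items; the last item serves as the reference.
beta = [0.6, -0.2, 0.3, 0.0]
delta = [0.15, 0.0, -0.10, 0.0]
scores = [1, 2, 3]  # non-extreme person scores for 4 items

params = group_item_parameters(beta, delta, scores)

# Differences between successive score groups are constant and equal to delta.
successive_differences = {
    r: [hi - lo for hi, lo in zip(params[r + 1], params[r])] for r in scores[:-1]
}
for r, diffs in successive_differences.items():
    print(f"groups {r} -> {r + 1}:", [round(d, 3) for d in diffs])

# Setting every delta to zero gives identical item parameters in all groups,
# which is the invariance assumed by the Rasch model.
rasch_params = group_item_parameters(beta, [0.0] * len(beta), scores)
```

In this sketch, item 1 becomes relatively easier and item 3 relatively harder with increasing score, which corresponds to a higher and a lower discrimination than the reference item, respectively.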
Four test statistics derived from asymptotic theory may be used. Denote by
A Rao score test statistic can be obtained by
where both the score function
where the notation
where
Note that the gradient test is a relatively recent development and thus is still little known in the context of psychological and educational research. To the best of the authors’ knowledge, a discussion of the gradient test in respect of psychometric problems can only be found in two papers. Draxler et al. (2022) discuss it in a CML and Zimmer et al. (2022) in a marginal maximum likelihood framework. It shows similar asymptotic and finite sample size properties and performance to the other three likelihood-based tests not only in a psychometric but also in a much broader context (Lemonte, 2016; Terrell, 2002). The test statistic can be derived from a combination of Rao score and Wald test statistics. One of the most striking features is that, unlike Rao score and Wald test statistics, it does not depend on an information matrix, which can entail computational and practical advantages in several scenarios and models. A peculiarity of it is that the test statistic is not necessarily nonnegative in finite sample sizes (since the test statistic is composed of the scalar product of two vector-valued random variables), but it is asymptotically.
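The relation between the four statistics can be seen in a deliberately simple toy problem. The sketch below does not implement the CML-based tests of this article; it assumes a single binomial proportion tested on the logit scale (the natural parameter of this exponential family), and the function name and data are hypothetical. It only shows the generic form of each statistic, in particular that the gradient statistic is the product of the score at the null value and the difference between the estimate and the null value, without any information matrix.

```python
import math

def binomial_tests(successes, n, p0):
    """LR, Rao score, Wald, and gradient statistics for H0: p = p0 on the logit scale.

    Requires 0 < successes < n so that the maximum likelihood estimate is finite.
    """
    p_hat = successes / n
    eta_hat = math.log(p_hat / (1.0 - p_hat))  # estimate of the natural parameter
    eta0 = math.log(p0 / (1.0 - p0))           # value under the null hypothesis

    def loglik(p):
        return successes * math.log(p) + (n - successes) * math.log(1.0 - p)

    score_at_null = successes - n * p0         # derivative of loglik w.r.t. eta at eta0
    info_at_null = n * p0 * (1.0 - p0)         # Fisher information at eta0
    info_at_mle = n * p_hat * (1.0 - p_hat)    # Fisher information at the estimate

    return {
        "LR": 2.0 * (loglik(p_hat) - loglik(p0)),
        "Rao": score_at_null ** 2 / info_at_null,
        "Wald": (eta_hat - eta0) ** 2 * info_at_mle,
        "Gradient": score_at_null * (eta_hat - eta0),
    }

stats = binomial_tests(successes=60, n=100, p0=0.5)
for name, value in stats.items():
    print(f"{name:9s}{value:.4f}")
```

All four values are close to each other in this example and would be referred to the same limiting chi-square distribution with one degree of freedom.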
All four test statistics of the modified approach are accessible through the R package tcl (Draxler & Kurz, 2023).
2.4. Rationale for the Restricted Model
The more parsimonious parametrization in the restricted model implies more information concerning the parameters of interest, that is, the
If the total sample is divided into only two person score groups (e.g., low vs. high), as is often the case in practical data analysis (see Section 2.1), and thus, one obtains only two vectors of
An obvious question concerns the similarity of the restricted model given by (1) and the well-known two parameter logistic (2PL) model (Birnbaum, 1968), which is typically applied in scenarios of different item discriminations. The 2PL model assumes not only a parameter for each person and a parameter for each item but a discrimination parameter for each item as well. A brief discussion on the similarity of the two models is as follows. Consider the linear predictor for a person with parameter
where
Consider the difference in the linear predictor between two arbitrary persons with person parameters
and for the linearly restricted model
Setting these two differences equal yields a condition for the two models to be equivalent, that is
Hence, if the difference between the scores of any two persons is proportional to the difference in their person parameters, the two models will be equivalent. In case of
2.5. Remarks on Generalizations
The extension of the modified procedure introduced so far to more general psychometric models, which consider the responses of persons to items in more than two ordered (or unordered, i.e., nominal) categories, is basically straightforward. An example of one of the most popular and frequently applied models is the partial credit model (Masters, 1982). It assumes that all response categories have the same discrimination for all items. This assumption can be tested along the same lines as in the binary Rasch model.
Let
Assume a division of the sample of persons according to the potential values an element of
be the linear decomposition that allows the person score group specific item/category parameters to differ between successive person score groups only by constants represented by the vector
Of interest is to test the hypothesis
3. Monte Carlo Design
To investigate finite sample size properties of the four tests of the modified procedure, Monte Carlo experiments are carried out. This includes the approximation of the limiting distribution and type I and type II error or power rates. Compared to the classical ALRT, an improvement of the power of the tests is expected. Various practically relevant scenarios are considered to get an idea of how much of an increase of power can be expected with sufficient accuracy.
3.1. Fundamentals of the Design
The design considers different numbers of persons and items, that is
Note that
The data have been generated using the 2PL model. The item parameters have been chosen from an interval of
Additionally, scenarios are considered, in which all discrimination parameters vary in four different ranges: from 0.9 to 1.1, from 0.8 to 1.2, from 0.7 to 1.3, and from 0.6 to 1.4. Thus, in these cases, each single item differs in its discrimination from each other item, which are very natural and realistic scenarios. The exact values of the discrimination parameters are shown in Online Appendix B.
The person parameters have been drawn from the standard normal or Gaussian distribution
The test statistics computed for every replication in each scenario are the classical ALRT $T_1$ and the four based on the suggested modified approach: $T_2$ (LRT), $T_3$ (Rao score), $T_4$ (Wald), and $T_5$ (gradient). Note that for the computation of the ALRT $T_1$, the sample is divided into only two score groups, that is, persons with scores below or equal to the median of the frequency distribution of the person scores and persons with scores above it.
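The data generation and the median split can be sketched as follows. This is a minimal illustration and not the code from the authors' repository: the linear predictor discrimination × ability + easiness, the item parameter values, the seed, and the exclusion of persons with extreme scores (0 or k) are assumptions made for the sketch.

```python
import math
import random
import statistics

def simulate_2pl(n_persons, easiness, discrimination, rng):
    """Binary responses from a 2PL-type model with standard normal person parameters."""
    data = []
    for _ in range(n_persons):
        theta = rng.gauss(0.0, 1.0)
        row = [
            1 if rng.random() < 1.0 / (1.0 + math.exp(-(a * theta + b))) else 0
            for a, b in zip(discrimination, easiness)
        ]
        data.append(row)
    return data

def median_split(data):
    """Split persons into score groups: at or below the median score vs. above it."""
    k = len(data[0])
    # Assumption: persons with extreme scores (0 or k) are dropped beforehand.
    kept = [row for row in data if 0 < sum(row) < k]
    median = statistics.median([sum(row) for row in kept])
    low = [row for row in kept if sum(row) <= median]
    high = [row for row in kept if sum(row) > median]
    return low, high, median

rng = random.Random(1)
k = 10
easiness = [-1.0 + 2.0 * i / (k - 1) for i in range(k)]  # hypothetical item parameters
discrimination = [1.0] * k
discrimination[0] = 1.5                                   # one item deviates from the others

data = simulate_2pl(500, easiness, discrimination, rng)
low, high, median = median_split(data)
print(f"median score: {median}, low group: {len(low)}, high group: {len(high)}")
```

The two groups returned by `median_split` are the ones between which the ALRT compares the item parameter estimates.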
Additionally, a likelihood ratio test based on a marginal likelihood procedure that (directly) compares Rasch and 2PL models is considered. The CML procedure is not applicable in this case. Instead, a marginal ML approach is suitable, which requires the predetermination of a parametric family of distributions for the person parameters. Hence, such a marginal likelihood is not a function of the individual person parameters but the parameters of the assumed distribution and the item and discrimination parameters (individual person parameters need not be estimated). In this work, the standard normal distribution (Gaussian) is assumed (for the person parameters). Thus, the marginal likelihood is only a function of the k item and k discrimination parameters. In respect of the Rasch model, which is a special case of the 2PL by setting all discrimination parameters equal, the marginal likelihood is then a function of the k item parameters and a common discrimination parameter for all items, that is, it is a function of
Furthermore, the so-called information criteria or indices that are derived from information theory are considered. Such an approach does not provide a hypothesis test with error probabilities. The indices are widely used in general statistical modeling problems, that is, in selecting an (empirically) appropriate model. They are also popular in psychometrics. Thus, it is worthwhile to consider them too. Some specifics and technical details on these are provided in the next subsection.
An overview of the Monte Carlo study design is given in Online Appendix B, and the complete code used for all computations can be downloaded from an online repository https://github.com/canguerer/JEBS_item_discrimination_cml.
3.2. Specifics on Information Criteria
Two well-known and well-established indices are the Akaike information criterion (AIC; Akaike, 1974) and the Bayesian information criterion (BIC; Schwarz, 1978). In their standard form, they are given by $\mathrm{AIC} = -2\log \hat{L} + 2p$ and $\mathrm{BIC} = -2\log \hat{L} + p\log n$,
where $\hat{L}$ denotes the maximized (here, conditional) likelihood, n the sample size, and p the number of free parameters of the model. They add to minus twice the maximized conditional log-likelihood a penalizing term that is a function of the number of parameters and, in case of the BIC, additionally a function of the sample size. These indices are computed for both the model allowing for different item discriminations given by (1) and the Rasch model, which is a special case by setting
Hence, the difference in AIC values between the two models can be expressed as a function of the likelihood ratio test statistic $T_2$ and the number of items k. When $T_2$ is greater than the number of free or estimated parameters of model (1), the AIC decides in favor of (1) and rejects the Rasch model. In respect of the BIC, one obtains
Thus, the BIC selects model (1) when $T_2$ is greater than
When comparing the likelihood ratio test $T_2$ with the AIC in respect of selection rates of model (1), the following can immediately be derived. The test $T_2$ selects model (1), that is, it rejects the hypothesis of equal item discriminations, when the test statistic $T_2$ is greater than the
In respect of the BIC, it follows from analogous considerations that it is, in all scenarios considered (regarding n and k), a more conservative procedure than the test $T_2$ and the AIC, that is, it yields lower selection rates of model (1). It simply penalizes much more than the AIC.
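The three decision rules can be compared by computing the value that the likelihood ratio statistic has to exceed under each of them. The sketch below rests on assumptions that should be checked against the model definitions: model (1) is taken to have k − 1 more free parameters than the Rasch model, the sample size in the BIC is taken to be the number of persons, and the test is carried out at the 5% level. The chi-square quantile is obtained with a simple series expansion and bisection so that only the standard library is needed.

```python
import math

def chi2_cdf(x, df):
    """Chi-square distribution function via the series of the lower incomplete gamma function."""
    if x <= 0.0:
        return 0.0
    a, z = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    n = 0
    while term > 1e-16 * total:
        n += 1
        term *= z / (a + n)
        total += term
    return math.exp(a * math.log(z) - z - math.lgamma(a)) * total

def chi2_quantile(prob, df):
    """Quantile of the chi-square distribution by bisection on the distribution function."""
    lo, hi = 0.0, df + 10.0 * math.sqrt(2.0 * df) + 20.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if chi2_cdf(mid, df) < prob:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def selection_thresholds(k, n, alpha=0.05):
    """Values the LR statistic must exceed for model (1) to be selected by each rule."""
    extra = k - 1  # assumed number of additional free parameters of model (1)
    return {
        "LRT": chi2_quantile(1.0 - alpha, extra),
        "AIC": 2.0 * extra,
        "BIC": extra * math.log(n),
    }

# Dimensions of the mathematics data example: 11 items, 565 students.
many_items = selection_thresholds(k=11, n=565)
few_items = selection_thresholds(k=5, n=565)
for label, thresholds in (("k = 11", many_items), ("k = 5", few_items)):
    print(label, {name: round(value, 2) for name, value in thresholds.items()})
```

Under these assumptions, the AIC is more liberal than the 5% test for few items and more conservative for many items, whereas the BIC threshold is far above both in either case.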
All these considerations apply in the same manner when analyzing selection rates of the 2PL compared to the Rasch model by AIC and BIC based on the marginal likelihood function. Hence, these selection rates are also computed and compared with type I error and power rates of the likelihood ratio test based on the marginal likelihood (that directly compares Rasch and 2PL models).
4. Results
The complete results are uploaded in an online repository and can be downloaded from https://github.com/canguerer/JEBS_item_discrimination_cml. The main results are threefold.
The first part refers to the observation that the common limiting distribution of the test statistics assuming all item discriminations as equal is approximated fairly well already for sample sizes from about
Table 1. Observed Distributions of the Six Test Statistics Compared to the Limiting Chi Square Distribution With
Note. ALRT is used as an abbreviation for Andersen’s likelihood ratio test; Mod.LRT for the likelihood ratio test based on the modified approach; Score, Wald, and Gradient for the respective other tests of the modified approach; Marg.LRT for the likelihood ratio test based on the marginal maximum likelihood procedure; and Perc. for percentile.
Table 2. Observed Distributions of the Six Test Statistics Compared to the Limiting χ2 Distribution With
Note. The upper block refers to results obtained from the assumption of a rectangular distribution for the true distribution of person parameters and the lower block to results referring to a mixed Gaussian distribution. ALRT is used as an abbreviation for Andersen’s likelihood ratio test; Mod.LRT for the likelihood ratio test based on the modified approach; Score, Wald, and Gradient for the respective other tests of the modified approach; Marg.LRT for the likelihood ratio test based on the marginal maximum likelihood procedure; and Perc. for percentile.
In scenarios with sample sizes of
The second part of the results refers to observed type I error and power rates of the tests. The comparison of the modified tests and the ALRT reveals a clear picture. The modified tests outperform the ALRT in all scenarios considered. When the power of the tests is neither close to 0 nor 1, a gain of power of up to
Table 3. Observed Type I Error and Power Rates in Percentages Referring to Scenarios of
Note. The discrimination parameters of two items (middle and extreme) are varied. All other discrimination parameters are set to 1. The assumed
The third part of the results refers to a comparison of the four modified tests (based on CML) with the LRT based on marginal ML (directly comparing Rasch and 2PL models). As expected, the marginal LRT performs slightly better in respect of observed power rates in case of small numbers of items. For larger numbers of items, hardly any differences of practical relevance are observed anymore. Furthermore, Table 2 shows the results of scenarios in which the marginal likelihood-based LRT is severely biased and practically useless, that is, when the distribution of the person parameters is misspecified. In such scenarios, the CML-based tests are undoubtedly preferable.
5. Real Data Examples
Finally, two real data examples from educational research have been considered, which are contained in the R package sirt (Robitzsch, 2022). The names of the files are data.pisaMath.rda and data.pisaRead.rda. The former contains binary responses of 565 students to 11 mathematics items and the latter binary responses of 623 students to 12 reading items. The results shown in Table 4 confirm the main observations from the Monte Carlo investigation and what is to be expected from theory. The p values are all
Table 4. Results of Real Data Examples
Note. AIC and BIC based on conditional likelihood computed for both model (1) and Rasch model. AIC and BIC based on marginal likelihood computed for both 2PL and Rasch model. ALRT is used as an abbreviation for Andersen’s likelihood ratio test; Mod.LRT for the likelihood ratio test based on the modified approach; Score, Wald, and Gradient for the respective other tests of the modified approach; and Marg.LRT for the likelihood ratio test based on the marginal maximum likelihood procedure. AIC = Akaike information criterion; 2PL = two parameter logistic; BIC = Bayesian information criterion.
Regarding the mathematics example, the estimates of the
Table 5. Estimates With Standard Errors of Real Data Examples
Note. In both data examples, the last
6. Final Remarks
The linearly restricted model and the modified approach of investigating item discriminations in the CML framework presented in this work can be viewed as a special case of a model already utilized on a number of occasions, for example, by Gürer and Draxler (2022) as well as Draxler et al. (2022), when the person score (number of correct responses of a person) is considered as a covariate (or predictor or explanatory variable) and is assumed to have a linear effect on the log odds of response probabilities. As already noted in Section 2.2, the modified procedure is compatible with conditioning not only on person but also on item scores, that is, column sums of the response matrix. This procedure eliminates both
The case of missing data may also briefly be discussed. Basically, one can distinguish between data not missing at random, for example, by design of the investigation, and data missing at random. The latter entails scenarios in which data are missing completely at random and missing at random (conditional on covariates). Generally, the modified approach can be applied in missing data scenarios insofar as the classical approach of comparing different person score groups by Rasch (1960) and Andersen (1973) is appropriate. In case of data missing completely at random or missing data depending on any covariates unrelated to the person parameters and thus missing at random conditional on the covariates, its application is straightforward. If the frequencies of missing responses depend on the person parameters, one may obtain biased results and/or a reduction of the power of the tests. Consider, for instance, that persons with higher ability tend to have more missing responses and consequently obtain lower person scores than persons with lower ability but fewer missing responses. In such a case, the dependence between person parameters and person scores may no longer be monotone. A further discussion on the suitability of the modified approach of investigating item discriminations in missing data scenarios (also missing by design) shall be kept for another occasion.
The outperformance of the modified procedure compared to the classical approach (ALRT) by Andersen (1973) in terms of power rates of the statistical tests is undoubtedly impressive. Nevertheless, a brief word of caution in respect of the practice of psychometric analyses is in order. Unlike the ALRT, the modified procedure is only suggested in case of evaluating item discriminations, that is, testing the hypothesis of equal discriminations for all items against the alternative that at least one item possesses a different discrimination (than the others). The higher power rates are achieved by specifying a subset of probability distributions in respect of the alternative hypothesis. In case of the ALRT, the alternative hypothesis is more general. Hence, one can expect a reasonable power of the ALRT against other violations of assumptions of the Rasch model as well, not only a violation of equal item discriminations. Therefore, it is reasonable to suggest both the modified tests (or at least one of the four) and the ALRT (or another asymptotically equivalent
When the discriminations of the items are not equal (and thus, the Rasch model is not applicable), one has several choices of models considering a discrimination parameter. Most popular are the well-known 2PL and 3PL models (Birnbaum, 1968) or, in case of ordinal responses in more than two categories, the generalized partial credit model (Muraki, 1992). The CML approach is not compatible with these models since they do not belong to a multiparameter exponential family. As an alternative, one may consider a less common model, that is, the so-called one parameter logistic model (OPLM) for binary and ordinal responses (Verhelst & Glas, 1995; Verhelst et al., 1994). The only difference from the 2PL model and the generalized partial credit model is that the discrimination parameters are treated as given (or known) constants called discrimination indices. Thus, they need not be estimated. In the binary case, the OPLM has only one parameter per person and one parameter per item just like the Rasch model, and consequently, the sufficient statistic for the person parameters is obtained as a weighted sum of correct responses for each person, where the given discrimination indices are used as weights. Thus, one can condition on the observed values of these weighted scores, eliminate the person parameters from the conditional distribution, and use CML. The practical problem is the specification of the discrimination indices, which are usually unknown. Information on the items’ discriminations can be obtained from the score function
The final remark is in respect of the conditional approach to statistical inference, which is at the core of this work. The strongest philosophical and theoretical argument for conditioning on a statistic is met when the statistic is ancillary (e.g., Lehmann & Romano, 2005), that is, when its probability distribution does not depend on the parameters of interest (in this case, the
Supplemental Material
Supplemental Material, sj-pdf-1-jeb-10.3102_10769986231183335 for An Improved Inferential Procedure to Evaluate Item Discriminations in a Conditional Maximum Likelihood Framework by Clemens Draxler, Andreas Kurz, Can Gürer and Jan Philipp Nolte in Journal of Educational and Behavioral Statistics
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.