Abstract
This paper focuses on two likelihood-based indices of person fit, the index lz and Snijders’s modified index lz*. The first is commonly used in the practical assessment of person fit, although its asymptotic standard normal distribution does not hold when true abilities are replaced by sample ability estimates. The lz* index is a generalization of lz that corrects for this sampling variability. Surprisingly, it is not yet popular in the psychometric and educational assessment community. Moreover, there is some ambiguity about which types of item response model and ability estimation method can be used to compute the lz* index. The purpose of this article is to present the index lz* in a simple and didactic way. Starting from the relationship between lz and lz*, we develop the framework according to the type of logistic item response theory (IRT) model and the likelihood-based estimators of ability. The practical calculation of lz* is illustrated by analyzing a real data set on language skill assessment.
1. Introduction
One of the most important problems in educational measurement is to ensure that the responses of an examinee to a given test are in agreement with the design of that test. This issue is often called appropriateness measurement or person fit. Two particular patterns of misfit should ideally be detected. The first arises when the examinee tries to perform better than his or her actual ability level, for instance, by cheating during the examination. The other is the opposite behavior: Some examinees with high ability levels may deliberately answer easy items incorrectly in order to get lower scores. This surprising tendency can occur if, for instance, the examinees are classified into groups of various difficulty levels and their interest is to be placed in an “easier” group to improve their chances of performing very well at graduation. This situation was detected, among others, by Dodeen (2003), Meijer (1998), Raîche and Blais (2003a, 2003b, 2005), and Zickar and Drasgow (1996).
Other abnormal patterns might be detected, such as those of examinees who respond randomly to multiple-choice questionnaires or who are distracted during the test. Several classifications exist to distinguish between the different types of misfit (e.g., Cronbach, 1946; Ro, 2001; Smith, 1982). However, one is usually interested in detecting these abnormal response patterns first, before a more detailed examination of each particular case (Levine & Drasgow, 1980).
A vast literature on person fit detection methods is available, but most of these methods rely on the same principle: A person fit index is computed for each examinee and compared with some preestablished threshold to draw a conclusion about possible misfit. There are two main categories of person fit indices: those based on a parametric probability distribution of the item responses, usually an item response theory (IRT) model, and those based on summary statistics from the data set that do not require an IRT solution. We refer to these categories as the parametric and the nonparametric indices, respectively. For a broad overview of the different indices, parametric or not, see Karabatsos (2003), Klauer (1995), Levine and Drasgow (1982, 1988), and Meijer and Sijtsma (2001), as well as the first issue of the 1996 volume of Applied Measurement in Education.
The aim of this article is to concentrate on one well-known parametric index of person fit, called the lz index (Drasgow, Levine, & Williams, 1985), and its extended version, the lz* index (Snijders, 2001). The lz index is very popular because it is derived from the log likelihood function and is easy to compute. Moreover, lz has an asymptotic standard normal distribution when the true ability levels are considered (Molenaar & Hoijtink, 1990, 1996), so the detection of person misfit can be performed by comparing the values of lz to some appropriate quantiles of the N(0,1) distribution. In fact, lz appears to be very powerful in detecting aberrant response patterns relative to other person fit indices (Armstrong, Stoumbos, Kung, & Shi, 2007; Drasgow, Levine, & McLaughlin, 1987; Li & Olejnik, 1997; Nering & Meijer, 1998).
Unfortunately, the main drawback of lz is that its asymptotic distribution is no longer standard normal when the true ability levels are replaced by sample ability estimates (Molenaar & Hoijtink, 1990; Nering, 1995, 1997; Reise, 1995). To overcome this problem, Snijders (2001) proposed a slight modification of the lz index to obtain the desired asymptotic distribution with sample estimates of ability instead of the true (unknown) values. His index is nowadays referred to as the lz* index, since it is an extension of lz. Moreover, Snijders established that lz* could be computed with various estimators of ability and different IRT models and that the process for correcting the lz index can actually be adapted to other person fit indices. In the following, and unless explicitly stated, we simply mention Snijders to refer to his 2001 paper.
In this context, it is very surprising that the lz* index has not become more popular in recent years. For instance, the lz* index is not considered by Karabatsos (2003) in the comparison of 36 person fit statistics. A brief review of the literature indicated only a few papers that make reference to Snijders’s approach, either by explicitly describing the index (de la Torre & Deng, 2008; van Krimpen-Stoop & Meijer, 1999) or by simply citing it (Emons, 2009; Ferrando, 2004; Meijer & Sijtsma, 2001). Hence, despite its appealing properties, Snijders’s approach remains largely unknown in the psychometric and educational assessment community.
In our opinion, the main reason why lz* is so little known lies in the way Snijders’s article was written. It is a rather sophisticated technical article and, although the developments are concise and well explained, they remain hardly accessible to a wide range of applied researchers. A concrete illustration is that the choice of both an IRT model and an estimator of ability implies different calculations for lz*, and although this is clearly stated by Snijders, it becomes difficult for the practitioner to understand the precise form of the index. Moreover, some texts referring to Snijders’s work are also confusing. For instance, van Krimpen-Stoop and Meijer’s (1999) paper suggests that the distribution of lz* is established in the particular framework of the two-parameter logistic model with Warm’s weighted likelihood estimator of ability, whereas this is only a particular case that was (partly) discussed by Snijders. The presentation of lz* was actually based on a more general framework, in which both the IRT model and the estimator of ability, though unspecified, are constrained to fulfill some hypotheses (this is discussed later in this article).
In sum, the corrected index lz* is a promising indicator of person fit, because it incorporates the estimator of ability instead of the true ability level, but it is still largely ignored. The purpose of this paper is to remedy the limited diffusion of lz*, mostly by proposing a didactic review of Snijders’s approach. First, we set the IRT framework of person fit by fixing the notation and briefly reviewing the lz index, using the notation of van Krimpen-Stoop and Meijer (1999) for the sake of simplicity. Then, we present the lz* index from Snijders’s general framework, in connection with the notation for lz. In the next section, we discuss some possible choices of item response models and estimators of ability and their relationship with the form of the lz* index. We end with the practical analysis of an example on language skill assessment.
2. IRT Assessment of Person Fit and the lz Index
Since our main goal is to present some parametric person fit indices, we consider the following framework throughout the paper. A sample of examinees from a target population is administered a test of n items. The underlying model that describes the probability of answering item i (i = 1, . . . , n) correctly takes the form of the three-parameter logistic (3PL) model (Birnbaum, 1968):
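For reference, the 3PL response probability in its usual parameterization, which is also the form coded in the appendix function Pi, can be written as below. Here a_i, b_i, and c_i denote the discrimination, difficulty, and pseudo-guessing parameters of item i; labeling the display as Equation 1 is an assumption based on the later references to that equation.

```latex
P_i(\theta) = \Pr(X_i = 1 \mid \theta)
            = c_i + (1 - c_i)\,
              \frac{\exp[a_i(\theta - b_i)]}{1 + \exp[a_i(\theta - b_i)]},
\qquad i = 1, \ldots, n. \tag{1}
```

Setting c_i = 0 gives the two-parameter logistic (2PL) model, and additionally fixing a_i = 1 gives the one-parameter logistic (1PL) model.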
An important assumption in the person fit framework is that the knowledge about the item parameters is sufficiently good so that they can be considered as fixed values. For instance, this assumption is meaningful if prior study or research was carried out in order to get very precise estimates of these parameters. In other words, the item characteristic curves are completely determined as simple functions of the ability parameter θ only.
Many indices have been proposed to detect person fit, but as already explained, we restrict ourselves to the lz index of Drasgow et al. (1985). To present this index in the simplest and clearest way, we use the notation of van Krimpen-Stoop and Meijer (1999). Given a test of n items with fixed item parameters and item characteristic curves described by Equation 1, the likelihood function of any response pattern (X1, . . . , Xn) is given by
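Under the usual assumption of local independence, and writing P_i(θ) for the response probability of the 3PL model, the likelihood and its logarithm l0 take the standard form sketched below. The displays are written in our notation and are left unnumbered.

```latex
L(\theta) = \prod_{i=1}^{n} P_i(\theta)^{X_i}\,[1 - P_i(\theta)]^{1 - X_i},
\qquad
l_0 = \log L(\theta)
    = \sum_{i=1}^{n} \bigl\{ X_i \log P_i(\theta)
      + (1 - X_i) \log[1 - P_i(\theta)] \bigr\}.
```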
However, l0 is not standardized, and its distribution thus depends on θ. Drasgow et al. (1985) proposed the following standardized version of l0:
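In the form usually attributed to Drasgow et al. (1985), the standardization subtracts the expectation of l0 and divides by its standard deviation, both evaluated at θ. The sketch below is consistent with the weights log[P_i/(1 − P_i)] used in the appendix code; the equation numbers follow the later references to Equations 5 and 7 and are otherwise an assumption.

```latex
\begin{align}
l_z &= \frac{l_0 - E(l_0)}{\sqrt{V(l_0)}}, \tag{5} \\
E(l_0) &= \sum_{i=1}^{n} \bigl\{ P_i(\theta) \log P_i(\theta)
          + [1 - P_i(\theta)] \log[1 - P_i(\theta)] \bigr\}, \tag{6} \\
V(l_0) &= \sum_{i=1}^{n} P_i(\theta)\,[1 - P_i(\theta)]
          \left[ \log \frac{P_i(\theta)}{1 - P_i(\theta)} \right]^2. \tag{7}
\end{align}
```

With these definitions the numerator simplifies to the sum over items of [X_i − P_i(θ)] log{P_i(θ)/[1 − P_i(θ)]}, which is the weighted-residual form used in the next section.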
However, the asymptotic normality of the lz index is only valid when the true ability values θ are used to compute this index, but unfortunately the latter are usually unknown. Considering the use of
3. The lz* Index
In his 2001 article, Snijders first discusses the inadequacy of the distribution of a general class of person fit indices (including lz) with estimated ability levels. Then, he proposes a correction for these indices such that, once computed with some convenient estimate of ability, they asymptotically follow a standard normal distribution. This class of indices is built on a general framework that incorporates the choice of an IRT model and the selection of an ability estimator. In particular, the corrected version of lz is nowadays referred to as the lz* index (van Krimpen-Stoop & Meijer, 1999) since it represents an extension of lz. To present it, we start from Snijders’s general notation. Whenever appropriate, the connection with the lz index will be stated.
We start by considering that the lz index is actually a particular case of the broader class of standardized person fit statistics of the following form:
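A sketch of that general form, written with the quantities coded in the appendix (functions Wn, sig2n, and wi, whose comments refer to Equations 9, 11, and 12), is given below. The weights w_i(θ) are arbitrary in general; the last line gives the choice that recovers lz.

```latex
\begin{align}
W_n(\theta) &= \sum_{i=1}^{n} [X_i - P_i(\theta)]\, w_i(\theta), \\
\sigma_n^2(\theta) &= \frac{1}{n} \sum_{i=1}^{n} w_i(\theta)^2\,
                      P_i(\theta)\,[1 - P_i(\theta)], \\
\frac{W_n(\theta)}{\sqrt{n}\,\sigma_n(\theta)} &= l_z
\quad \text{when} \quad
w_i(\theta) = \log \frac{P_i(\theta)}{1 - P_i(\theta)}.
\end{align}
```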
In order to derive the distribution of
In particular, by setting the weights wi(θ) equal to those in Equation 12, Equation 19 is nothing other than the standardized version of lz, in other words the lz* index. At this step, it is useful to rewrite Equation 19 with the notation of van Krimpen-Stoop and Meijer (1999). We make use of the notation
In sum, the correction introduced by Snijders modifies both the expectation and the variance of the statistic Wn(θ) to take the sampling variability of the ability estimate θ̂ into account.
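Concretely, and as a sketch that follows the appendix functions cn, wiTilde, tau2n, and Lz (whose comments refer to Equations 15, 16, and 18), the corrected index can be written as below. The functions r_0 and r_i are those of the estimating equation of the ability estimator (Equation 14), P_i' is the first derivative of P_i with respect to θ, and \hat\theta is the ability estimate.

```latex
\begin{align}
c_n(\theta) &= \frac{\sum_{i=1}^{n} P_i'(\theta)\, w_i(\theta)}
                    {\sum_{i=1}^{n} P_i'(\theta)\, r_i(\theta)}, \\
\tilde{w}_i(\theta) &= w_i(\theta) - c_n(\theta)\, r_i(\theta), \\
\tau_n^2(\theta) &= \frac{1}{n} \sum_{i=1}^{n} \tilde{w}_i(\theta)^2\,
                    P_i(\theta)\,[1 - P_i(\theta)], \\
l_z^* &= \frac{W_n(\hat\theta) + c_n(\hat\theta)\, r_0(\hat\theta)}
              {\sqrt{n}\,\tau_n(\hat\theta)}.
\end{align}
```

The term c_n r_0 shifts the expectation of W_n, and replacing σ_n by τ_n changes its variance; with the ML estimator (r_0 = 0) only the variance is affected.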
4. Model Selection and Ability Estimation
The previous section outlined the different steps involved in the calculation of the lz* index with any suitable ability estimator
4.1. Estimation of Ability
First, we consider the ML estimator
In fact, it is equivalent to consider a constant prior distribution. Hence, the ML estimator is suitable for computing the lz* index, since it satisfies the condition in Equation (14) with
The last estimator to be considered is the weighted likelihood estimator
In conclusion, these three likelihood-based estimators can be considered as suitable methods for computing the lz* index and ensuring its asymptotic standard normal distribution. They differ in the function r0(θ) but share the same form of the functions ri(θ), the latter being determined by the logistic model under consideration. The maximum a posteriori method can be considered with any type of prior distribution, although the usual normal density yields a simple form for r0(θ).
It is important to note that these estimators may not be the only suitable ones in this framework. As already mentioned, any estimator that satisfies the condition in Equation 14 is a potential estimator for use with lz*. For instance, the expected a posteriori (EAP) estimator
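As a compact summary, sketched from the appendix function r0, all three estimators solve an estimating equation of the same form and differ only in r_0(θ). The normal prior with mean μ and standard deviation σ for the maximum a posteriori case, and the symbols I(θ) and J(θ), are notational assumptions made here.

```latex
\begin{gather}
r_0(\hat\theta) + \sum_{i=1}^{n} [X_i - P_i(\hat\theta)]\, r_i(\hat\theta) = 0,
\qquad
r_i(\theta) = \frac{P_i'(\theta)}{P_i(\theta)\,[1 - P_i(\theta)]}, \\
r_0(\theta) =
\begin{cases}
0 & \text{maximum likelihood}, \\
(\mu - \theta)/\sigma^2 & \text{maximum a posteriori, normal prior}, \\
J(\theta) / [2\, I(\theta)] & \text{weighted likelihood},
\end{cases} \\
I(\theta) = \sum_{i=1}^{n} \frac{P_i'(\theta)^2}{P_i(\theta)\,[1 - P_i(\theta)]},
\qquad
J(\theta) = \sum_{i=1}^{n} \frac{P_i'(\theta)\, P_i''(\theta)}{P_i(\theta)\,[1 - P_i(\theta)]}.
\end{gather}
```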
4.2. Model Selection
Since the functions ri(θ) in Equations 22, 27, and 33 have the same form for the three estimators studied previously, it is of primary interest to look at their particular form with some specific IRT models. Only logistic models are considered in this presentation.
We start from the 3PL model, given by Equation 1. To simplify the notation, set ei(θ) = exp[ai(θ − bi)]. One has then
If we consider the 2PL model, some calculation steps of lz* can also be simplified. For instance, since it comes from Equations 37 and 38 that
In sum, the selection of a logistic IRT model has some impact on the functions ri(θ), which are required to compute the person fit index lz*. Moreover, except for weighted likelihood estimation, these models do not influence the function r0(θ), which is entirely determined by the method of ability estimation.
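The dependence of ri(θ) on the model can be checked numerically. The sketch below is illustrative only: the helper names p3pl and ri_3pl and the item parameters are hypothetical. It uses ri(θ) = Pi'(θ)/{Pi(θ)[1 − Pi(θ)]}, as in the appendix, and verifies that with zero pseudo-guessing the function reduces to the discrimination ai.

```r
# 3PL response probability; a, b, g are vectors of discrimination, difficulty,
# and pseudo-guessing parameters (hypothetical values are used below)
p3pl <- function(th, a, b, g) g + (1 - g) * plogis(a * (th - b))

# r_i(theta) = P_i'(theta) / (P_i(theta) * (1 - P_i(theta))) under the 3PL model
ri_3pl <- function(th, a, b, g) {
  e  <- exp(a * (th - b))
  P  <- p3pl(th, a, b, g)
  dP <- (1 - g) * a * e / (1 + e)^2   # first derivative of P with respect to theta
  dP / (P * (1 - P))
}

a <- c(0.8, 1.2, 2.0)      # discriminations (made up)
b <- c(-1.0, 0.0, 0.5)     # difficulties (made up)
g <- c(0.20, 0.25, 0.10)   # pseudo-guessing parameters (made up)

r_2pl <- ri_3pl(0.3, a, b, g = rep(0, 3))  # 2PL case: no guessing
r_3pl <- ri_3pl(0.3, a, b, g = g)          # 3PL case

print(rbind(a = a, r_2pl = r_2pl, r_3pl = r_3pl))
```

In the 2PL row the values coincide with the discriminations, while in the 3PL row they equal ai ei(θ)/[ci + ei(θ)] and therefore fall below ai and vary with θ.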
5. lz* in Practice
It is now time to summarize the developments of Sections 3 (The lz* Index) and 4 (Model Selection and Ability Estimation) and to propose a schematic overview of the computational process of lz*. This process is sketched in Figure 1.

Figure 1. Schematic process to compute lz*.
First, the selection of an IRT model determines the form of the ri(θ) functions. This initial step does not actually have to be carried out when computing lz*, because the model is selected and the item parameters are estimated prior to the person fit investigation. The second step consists in selecting a suitable estimator of ability, which yields the estimate
Once these steps are performed and the functions r0(θ) and ri(θ) are set, the next steps are straightforward and hierarchically ordered. Using the weight functions wi(θ) in Equation 17, one successively computes
The latter point merits some additional discussion. First, Equation 20 indicates that lz* is simply a shifted and rescaled version of lz, when the latter is computed with sample estimates of θ. Thus, extremely negative values of lz* also correspond to extremely negative values of lz, and the same holds for extreme positive values. This is the reason why the same rule for detecting person misfit applies to both lz and lz*. In addition, large negative values of lz (far below zero) indicate aberrant patterns for which misfit obviously occurs, while large positive values of lz indicate an almost perfect fit of the response pattern to the test characteristics.
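In code, this detection rule amounts to a one-sided comparison with a lower quantile of the standard normal distribution. The function name flag_misfit and the default α = .05 are illustrative assumptions, and the three index values are made up.

```r
# Flag a response pattern as misfitting when the index (lz or lz*) falls
# below the lower alpha-quantile of the N(0, 1) distribution.
flag_misfit <- function(index, alpha = 0.05) index < qnorm(alpha)

cutoff  <- qnorm(0.05)                          # lower 5% quantile of N(0, 1)
flagged <- flag_misfit(c(-2.57, -1.00, 0.80))   # made-up index values

print(round(cutoff, 2))
print(flagged)
```

Only the first value lies below the cutoff and is flagged; large positive values are never flagged by this one-sided rule.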
To highlight this trend, consider an examinee with true ability level θ. The variance of l0 (Equation 7) is then a positive constant value, and the sign of lz is determined by the numerator of Equation 5, which can be rewritten as
6. An Example
We illustrate the computation of lz* with a real test of English-language aptitude. The TCALS-II (Test de Classement en Anglais, Langue Seconde - Version II [Placement Test of English as a Second Language]) questionnaire aims at evaluating the level of English knowledge of French-speaking students entering college, in order to assign them to groups of appropriate difficulty level (Laurier, Froio, Paero, & Fournier, 1998). The test consists of 85 multiple-choice items, divided into eight subgroups of questions, and is identical for all French-speaking colleges of the province of Quebec (Canada). The data under consideration are the responses of 1,373 students (749 females and 624 males) entering the College of Outaouais (Gatineau, Quebec, Canada) in the academic year 1998.
In the present analysis, we will start by giving the detailed calculation of both lz and lz* indices for one randomly selected examinee among the 1,373 students of the study. Then, the full distribution of lz and lz* will be displayed to provide a global overview of the TCALS-II data set.
First, the items were calibrated by marginal ML, using the 2PL model. The R package ltm (Rizopoulos, 2006) was used for this item calibration. The estimated discrimination and difficulty parameters are reported in Table 1. It can be noticed that the TCALS-II is an easy test: The average difficulty level is −1.34 (SE = 0.83), with minimum and maximum difficulties of −4.49 and 0.66, respectively. Moreover, the average discrimination level is 1.63 (SE = 0.62), with a minimum discrimination of 0.14 and a maximum discrimination of 4.07. It is therefore expected that many examinees will respond correctly to many test items, as noticed by Raîche (2002).
Table 1. Item Discrimination and Difficulty Estimates of the TCALS-II Data Set and One Randomly Selected Response Pattern
Table 1 also holds the complete response pattern of the randomly selected examinee who was assigned the TCALS-II questionnaire. This particular pattern will be considered in the next calculations.
Since the 2PL model is considered, the ri(θ) functions reduce to the discriminations ai displayed in Table 1. Moreover, the weighted likelihood estimator was used to estimate the examinee’s proficiency level. All forthcoming calculations were performed using an implementation of the person fit indices in the R software (R Development Core Team, 2009). The main commands of the R code are displayed in the Appendix.
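The following compact sketch strings these steps together for the 2PL model with the weighted likelihood estimator. It is not the appendix implementation: the item parameters and the response pattern are hypothetical (they are not the TCALS-II values), the object names are ours, and no bound handling is included for patterns whose estimating equation has no root in [−4, 4].

```r
# Hypothetical 2PL item parameters and response pattern (NOT the TCALS-II values)
a <- c(1.2, 0.8, 1.5, 1.0, 2.0, 0.6, 1.3, 1.7)       # discriminations
b <- c(-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5)   # difficulties
x <- c(1, 1, 0, 1, 1, 0, 1, 0)                        # item responses

P2 <- function(th) plogis(a * (th - b))               # 2PL response probabilities

# r0(theta) for weighted likelihood: J / (2 * I), with the 2PL derivatives
r0_wl <- function(th) {
  P <- P2(th); Q <- 1 - P
  I <- sum(a^2 * P * Q)                  # sum of P'^2 / (P * Q)
  J <- sum(a^3 * P * Q * (1 - 2 * P))    # sum of P' * P'' / (P * Q)
  J / (2 * I)
}

# Estimating equation r0(theta) + sum(r_i * (x - P)), with r_i = a_i under the 2PL
score  <- function(th) r0_wl(th) + sum(a * (x - P2(th)))
th_hat <- uniroot(score, c(-4, 4), tol = 1e-8)$root

# lz at the estimate
P  <- P2(th_hat); Q <- 1 - P
w  <- log(P / Q)                         # weights defining lz
Wn <- sum((x - P) * w)
lz <- Wn / sqrt(sum(w^2 * P * Q))

# Snijders's correction: modified weights, shifted mean, new variance
dP <- a * P * Q                          # first derivative of P under the 2PL
cn <- sum(dP * w) / sum(dP * a)
wt <- w - cn * a                         # modified weights
lz_star <- (Wn + cn * r0_wl(th_hat)) / sqrt(sum(wt^2 * P * Q))

print(c(theta = th_hat, lz = lz, lz_star = lz_star))
```

By construction the modified weights satisfy sum(dP * wt) = 0, which is what removes the first-order effect of estimating θ from the statistic.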
By solving Equation 14, using Equations 38 and 40 due to the 2PL model, it turned out that
Probability Estimates Pi(θ̂), Weights wi(θ̂), and Modified Weights w̃i(θ̂) of the TCALS-II Data Set
Eventually, successive calculations with the previous findings led to the following values:
Thus, after this detailed calculation of both person fit indices, one may conclude that the response pattern of the selected examinee is not flagged as abnormal by the lz index, since the quantile zα with lower tail probability α = 5% is equal to −1.64. On the other hand, when correcting for the ability estimate, the lz* index is equal to −2.57, which is well below this threshold. In conclusion, the examinee has an aberrant response pattern with respect to the lz* criterion.
For completeness, the indices lz and lz* were computed for all examinees of the TCALS-II data set. Figure 2 displays the density estimates of the distributions of lz (solid line) and lz* (dashed line) together with the standard normal density (dotted line). The density estimates are obtained by kernel density estimation (Silverman, 1986), using the Gaussian kernel and an optimal bandwidth selection (Sheather & Jones, 1991). One can observe that the variance of lz* (Equation 21) is much larger than that of lz (Equation 7), as expected. Neither distribution is close to the standard normal density, which means for lz* that there is much more misfit than expected under the normal distribution. This fact has indeed been validated by Raîche and Blais (2002). On the other hand, according to Drasgow et al. (1985), the test is sufficiently long (85 items) to consider the asymptotic framework as valid. Therefore, one possible explanation is that the 2PL model is not suitable for this real data set.
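Density estimates of this kind can be obtained in base R with density(), whose bw = "SJ" option implements the Sheather and Jones bandwidth selector. The sketch below uses simulated standard normal values as a stand-in for a vector of person fit indices; the object names are hypothetical and the simulated values are not the TCALS-II results.

```r
set.seed(1)
idx <- rnorm(500)   # placeholder for a vector of lz or lz* values

# Gaussian kernel with the Sheather-Jones bandwidth
dens <- density(idx, bw = "SJ", kernel = "gaussian")

plot(dens, lty = 2, main = "", xlab = "Index value")   # estimated density
curve(dnorm(x), add = TRUE, lty = 3)                    # N(0, 1) reference density
```

Overlaying the N(0, 1) curve gives a quick visual check of how far the empirical distribution of an index departs from its nominal reference distribution.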

Figure 2. Density estimates of lz and lz* (2PL model, weighted likelihood estimator) and standard normal density, TCALS-II data set.
7. Conclusion
In this article, we tried to provide an overview of Snijders’s approach to correct the lz index of person fit. This correction is necessary when ability estimates are used instead of the true proficiency levels. Although promising, the corrected index lz* is not yet popular in the field of appropriateness measurement, possibly because of the technical nature of Snijders’s 2001 article. Our presentation will hopefully provide some clearer insight into that topic and be helpful for interested researchers in the future.
The analysis of the TCALS-II data set not only illustrated the differences between the distributions of lz and lz* but also highlighted that lz* may have a somewhat different distribution than the expected standard normal density. It would then be useful to perform a simulation study to control for examinees who follow the item response model perfectly. In addition, the distributions of lz and lz* could be compared for various choices of item response models and ability estimators, and this would constitute an extension of Snijders’s (2001) simulated results.
One of the most striking advances of Snijders’s article is that other person fit indices could be adjusted similarly to lz in order to get a better asymptotic framework for identifying misfit. Some promising indices are the Zeta index (Tatsuoka, 1984, 1996), U (Wright & Stone, 1979), W (Wright & Masters, 1982), and the class of Extended Caution Indices (ECI; Tatsuoka & Linn, 1983). With an appropriate selection of the weight functions, these indices could be adjusted to take the replacement of θ by its estimate θ̂ into account.
Recently, some person fit indices and tests for misfit have been introduced for polytomous items (Emons, 2009; Glas & Dagohoy, 2007). In this framework, the lz index can be naturally extended as the standardized log likelihood function from a polytomous item response model (Ro, 2001), for example, the graded response model (Samejima, 1969) or the partial credit model (Masters, 1982). Until now, however, Snijders’s correction has not been developed for polytomous items. Such a correction would probably also improve the identification of person misfit, as is the case for dichotomous items.
Finally, very little has been done to adapt parametric person fit indices to the Bayesian paradigm of model and parameter estimation. Most of the developments related to misfit measurement are made under the usual likelihood framework. However, the increase in Bayesian methods and software for solving complex issues should not be disregarded. Instead of computing person fit indices such as lz or lz* and comparing them to theoretical cutoff scores, one could imagine computing posterior fit (or misfit) probabilities based on posterior estimates of the item and subject parameters. This field of application seems promising and will hopefully receive more attention in the years to come.
Acknowledgments
The authors wish to thank the editor and the referees for their helpful comments and suggestions. This work was supported by a research grant “Chargé de recherches” from the National Funds for Scientific Research (FNRS), Belgium, the grant 122750 from the Social Sciences and Humanities Research Council of Canada (SSHRC), and the grant 103426 from the Fonds Québécois de Recherche sur la Société et la Culture (FQRSC).
Appendix: R Code for Computing lz and lz*
## Response probabilities, first and second derivatives under the 3PL model (equation 36)
## th: ability value
## it: matrix of item parameters: one row per item, three columns
## (discrimination, difficulty, pseudo-guessing)
Pi <- function(th, it) {
  res1 <- res2 <- res3 <- NULL
  for (i in 1:nrow(it)) {
    e <- exp(it[i, 1] * (th - it[i, 2]))  # e_i(theta) = exp[a_i (theta - b_i)]
    res1[i] <- it[i, 3] + (1 - it[i, 3]) * e / (1 + e)
    res2[i] <- (1 - it[i, 3]) * it[i, 1] * e / (1 + e)^2
    res3[i] <- (1 - it[i, 3]) * it[i, 1]^2 * e * (1 - e) / (1 + e)^3
  }
  RES <- list(Pi = res1, dPi = res2, d2Pi = res3)
  return(RES)
}
## Functions ri and r0 (equations 24, 27 and 33)
## method: "ML" for maximum likelihood, "BM" for Bayesian modal (or MAP), "WL" for weighted likelihood
## mu, sigma: prior mean and standard deviation parameters of the normal distribution
ri <- function(th, it) Pi(th, it)$dPi / (Pi(th, it)$Pi * (1 - Pi(th, it)$Pi))
r0 <- function(method = "ML", th, it, mu = 0, sigma = 1) {
  res <- switch(method,
    ML = 0,
    BM = (mu - th) / sigma^2,
    WL = sum(ri(th, it) * Pi(th, it)$d2Pi) / (2 * sum(ri(th, it) * Pi(th, it)$dPi)))
  return(res)
}
## Ability estimation (constrained to range [-4; 4]) (equation 14)
thetaEst <- function(x, it, method = "ML", mu = 0, sigma = 1) {
  ## f: estimating function whose root is the ability estimate
  f <- function(th) r0(method = method, th = th, it = it, mu = mu, sigma = sigma) +
    sum((x - Pi(th, it)$Pi) * ri(th, it))
  if (f(-4) < 0 && f(4) < 0) res <- -4
  else {
    if (f(-4) > 0 && f(4) > 0) res <- 4
    else res <- uniroot(f, c(-4, 4))$root
  }
  return(res)
}
## Weight function wi (equation 12)
wi <- function(th, it) log(Pi(th, it)$Pi / (1 - Pi(th, it)$Pi))
## Function Wn(theta) (equation 9)
## x: response pattern (same length as nrow(it), zeros and ones as entries)
Wn <- function(x, th, it) sum((x - Pi(th, it)$Pi) * wi(th, it))
## Function sig2n (equation 11)
sig2n <- function(th, it) {
  sum(wi(th, it)^2 * Pi(th, it)$Pi * (1 - Pi(th, it)$Pi)) / nrow(it)
}
## Function cn (equation 15)
cn <- function(th, it) sum(Pi(th, it)$dPi * wi(th, it)) / sum(Pi(th, it)$dPi * ri(th, it))
## Function wi 'tilde' (equation 16)
wiTilde <- function(th, it) wi(th, it) - cn(th, it) * ri(th, it)
## Function tau2n (equation 18)
tau2n <- function(th, it) {
  sum(wiTilde(th, it)^2 * Pi(th, it)$Pi * (1 - Pi(th, it)$Pi)) / nrow(it)
}
## Indices lz and lz*
## snijders: logical argument: FALSE returns lz, TRUE returns lz*
Lz <- function(x, it, method = "ML", mu = 0, sigma = 1, snijders = FALSE) {
  th <- thetaEst(x = x, it = it, method = method, mu = mu, sigma = sigma)
  if (snijders == TRUE) {
    ## corrected expectation and variance of Wn
    EWn <- -cn(th, it) * r0(method = method, th = th, it = it, mu = mu, sigma = sigma)
    VWn <- nrow(it) * tau2n(th, it)
  }
  else {
    EWn <- 0
    VWn <- nrow(it) * sig2n(th, it)
  }
  res <- (Wn(x, th, it) - EWn) / sqrt(VWn)
  return(res)
}
