Abstract
A multidimensional Bayesian item response model is proposed for modeling item position effects. The first dimension corresponds to the ability that is to be measured; the second dimension represents a factor that allows for individual differences in item position effects called persistence. This model allows for nonlinear item position effects on the item side as well as on the person side. Moreover, a flexible loading structure on the two dimensions is allowed. A fully Bayesian estimation procedure is proposed, and its performance is investigated by a simulation study. Further, the model is applied to empirical data collected in the Programme for International Student Assessment 2000 in the reading domain. The additional value of the model’s extended flexibility compared to more restrictive models is shown. The findings show that the linear hypothesis of change in performance during a test does not hold in general.
Introduction
Educational testing generally aims to measure individual traits or states that represent abilities or competencies in various domains of education. In order to derive these measures, a sequence of items is usually administered to persons. In an ideal situation of educational testing, the behavior a person shows in responding to items within a test is influenced only by their trait or state, which is what the test intended to measure. However, in realistic situations of educational testing, response behavior is influenced by several other factors besides the trait or state of the testees. For instance, the behavior a person shows in responding to items within a test that consists of several items depends on the position within this test in which the item is administered to that person. Almost seven decades ago, Mollenkopf (1950) investigated the influence of item positions on the response behavior of persons in speed and power tests. He already found that items tended to be more difficult when the item appeared later in the test rather than in earlier item positions.
In the literature, several approaches based on item response theory (IRT) have been proposed for analyzing item position effects. Item position effects can be viewed from the item side and from the person side (De Boeck & Wilson, 2004; Hartig & Buchholz, 2012). Viewed from the item side, the position effects of items are modeled as additional effects on item parameters and, therefore, are constant across persons. In the Rasch model, item position effects can be analyzed using the linear logistic test model (Alexandrowicz & Matschinger, 2008; Hohensinn et al., 2008; Kubinger, 2008). Viewed from the person side, test takers may be affected differently by item position effects. To allow for individual differences, item position effects can be specified in such a way that the item position loads on a factor that can vary across persons (Debeer & Janssen, 2013; Hartig & Buchholz, 2012; Schweizer et al., 2009). The effect, that is the loading of the item position, is usually held constant across items. The factor that the item position loads on is often called persistence (e.g., Debeer & Janssen, 2013). It should be noted that, to analyze position effects on the item side, items have to be administered to the test takers in different orders (Hartig & Buchholz, 2012). However, to analyze position effects on the person side, tests with a fixed item order (i.e., using only one test booklet; see Schweizer et al., 2009) can be used and individual differences in persistence can also be identified. In this article, we extend the multidimensional model for item position effects proposed by Debeer and Janssen (2013) by allowing for a more flexible functional form of persistence and for a loading structure of the items.
The remainder of the article is organized as follows. First, the model is introduced and its main features are summarized. Second, the identifiability of the parameters under ideal conditions is discussed, that is, for data arising from a test design in which every item occurs in every position and in which there are substantial links of items in different positions within a booklet. Third, a Bayesian estimation procedure to fit the proposed model is described. Fourth, the statistical properties of the Bayesian parameter estimates are evaluated in terms of bias, root mean square error of approximation, and coverage rates, calculated based on the results of a simulation study. Fifth, the model is applied to data collected from the first seven booklets, including the first three block positions, of the Programme for International Student Assessment (PISA) 2000. Finally, further extensions of the model and prospective research directions are discussed.
Modeling Item Positions in an IRT Framework
Item Response Models for Item Position Effects
We assume a test consisting of items
Example for a Booklet Design Consisting of Four Booklets
Note. There are four blocks (B1, B2, B3, and B4), and no two blocks have one item in common.
Let
where bi
is a fixed effect representing the item-specific difficulty. Note that we used the
The vector
where the parameter
To ensure the identifiability of the model,
The model in Equation 1 can be considered to be a generalized nonlinear mixed effects model (Debeer & Janssen, 2013; De Boeck & Wilson, 2004) with fixed item effects. However, a generalization to models with random item effects, resulting in a cross-classification, could easily be made (e.g., Fox & Verhagen, 2010). To estimate the parameters of such models becomes computationally demanding when using maximum likelihood methods. Therefore, we propose a Markov chain Monte Carlo (MCMC) approach that easily allows the model to be extended to even more complicated models.
The model in Equation 1 can be restricted in several ways, and three versions of these model restrictions are discussed below.
Regarding the item position effects, we could assume that these effects have a certain shape. For instance, we could assume that the effects can be described by a function
The parameter
Further, the model in Equation 1 can be restricted to have constant item-specific loadings. The loading parameters are set to
In this model,
Finally, the position effects can be modeled on the item side only. This means that the second person factor
In this model, the person factor
Identification of the Parameters
We show that the parameters of the statistical model in Equation 1 can be identified after integrating out the random effects. The main principle of statistical identification is that unknown parameters can be uniquely estimated based on observed data only (see San Martin et al., 2013) and not by another source of information (like prior distributions, see San Martin & González, 2010). This means that parameters can be determined if the probability distribution of the observed data is fully known (i.e., there is no sampling error).
In the following, we assume that the model restrictions described in the previous section are made, that is,
Let the index
The identification proof consists of six steps. In each step, it is shown that a parameter or a group of parameters can be calculated as a function of the observed data distribution.
First, we show how the item loadings
Second, we show how the item loadings
Third, we identify the correlation
This equation has the only unknown
Fourth, we show how to identify
Fifth, we show how to identify item intercepts bi
. Using data from Item i in the first position, it holds that
Sixth, we show how to identify the position parameters
A detailed analysis of the requirements of the six identification steps reveals that specific test designs are needed to identify all of the parameters. Such sufficient requirements can formulated as follows: All booklets consist of blocks containing more than 1 item. The booklets are designed such that the total number of block positions is greater than 2. For every item exist two booklets k and l such that the item appears in booklet k at Block Position 1 and in booklet l at Block Position 2. For every block position
The booklets we selected in the empirical example from the PISA 2000 design (see Table 2) fulfill these requirements, and therefore, it is ensured that all of the univariate and bivariate distributions needed in the six steps could be estimated. 2
Booklet Design of PISA 2000 Assessments and the Position-Specific Relative Frequencies of Correct Responses Regarding Reading Items and the First Three Block Positions Only
Note. R1–R9 are clusters consisting of reading items. M and S represent clusters consisting of mathematics and science items, respectively. Cells displayed in gray were not considered for the analyses.
MCMC Estimation Procedure
In the following, a full Bayesian procedure to estimate the proposed model is described, namely, an MCMC procedure using the Gibbs sampler (Gelfand & Smith, 1990). To make the estimation procedure applicable to stratified clustered samples, we extended the usual MCMC procedure by incorporating sampling weights (e.g., Goldstein, 2011). The procedure was implemented in R (R Core Team, 2018). MCMC procedures are often a facilitation of estimation (see Gelman et al., 2013; Hoff, 2009; Levy & Mislevy, 2016), and, in general, researchers can implement MCMC methods relatively easily.
This is one reason for why the MCMC approach has found widespread use in the estimation of item response models (e.g., Fox, 2010; Rupp et al., 2004). Using the Gibbs sampling approach, a complicated high-dimensional estimation problem can be reduced to stepwise evaluations of lower dimensional posterior distributions (which are often unidimensional). However, compared to maximum likelihood estimation, the Bayesian approach additionally requires the specification of prior distributions (with densities denoted by
Our Gibbs sampler employed augmented latent data z which can be interpreted as continuous versions of corresponding dichotomous item responses y (Albert & Chib, 1993). The use of augmented data simplifies the Gibbs sampling steps (Levy & Mislevy, 2016). By incorporating the augmented data, the joint posterior density of the parameters given the data is specified as
where the proportional sign is used because a normalizing constant is left out of the equation.
For notational reasons, let
It is easily seen that the posterior consists of a likelihood part involving data from all persons and a joint prior part
Conventional statistical inference is based on the assumption that all elements of a sample are independent. However, especially in educational testing, samples often exhibit a stratified clustered structure, and, consequently, the assumption of independence is violated. Ignoring such a sample structure might lead to underestimated standard errors and might threaten the validity of the inferences drawn from those samples (Rust, 2013). To meet this challenging sample structure, sample weights in conjunction with appropriate resampling methods are usually applied. In order to be in line with these techniques, as a first step, we incorporated sample weights into our estimation procedure (for a description of the procedures regarding inference and model assessment see the next section). Therefore, we weighted the likelihood part that led to a pseudo-likelihood part and an associated pseudo-posterior density (e.g., Goldstein, 2011; Savitsky & Toth, 2016) in Equation 7:
where wp
is the sample weight of person p. To be consistent with usual likelihood expressions, we set
To apply the Bayesian procedure, the parameter vector is divided into components. Then, for each component, the respective sampling step from the corresponding conditional posterior distribution can be derived by using common Bayesian techniques (Hoff, 2009). A conditional distribution of parameters related to one component is a distribution of parameters conditional on the sampled values for all other parameters (related to all other components). Therefore, the procedure can be described by several estimation steps with one step for each component. To incorporate the sample weights mentioned above into the MCMC steps, let
For notational reasons, we define
Step 1
In the first step, the data
For the sake of better readability, we omitted the indices p, i, and t in some cases and used matrix or vector notation. Then, vectors and matrices consist only of components of appropriate combinations of persons, items, and item positions, that is, combinations that are represented in the data.
Step 2
In this step, the item parameters
where
Step 3
The parameters
Incorporating the prior distribution, the conditional distribution can be written as
with
Step 4
The parameters
For reasons of identifiability,
Step 5
The parameters
where
Step 6
To draw
where
Let
where
Step 7
The conjugate prior distribution of
This MCMC sampling procedure provides approximations of the posterior distributions of the model parameters. The mode or mean of the corresponding marginal distributions serves as the point estimate for the respective parameters. Statistical inference regarding these estimates when accounting for sampling weights can be considered as a Bayesian version of pseudo-likelihood estimation (Rabe-Hesketh & Skrondal, 2006).
Inference and Model Assessment for Complex Samples
To calculate the likelihood of the model parameters, we follow Merkle et al. (2018) and use the marginal likelihood to derive the model selection criteria because we want to infer to a population beyond the specific sample considered. The correlation
We consider the likelihood part with sampling weights from Equation 8 in which augmented data are integrated out:
which can be seen as a Bayesian version of the pseudo-likelihood (e.g., Rabe-Hesketh & Skrondal, 2006). As
which is a weighted version of the marginal likelihood described in Merkle et al. (2018) with corresponding log pseudo-likelihood
where
The weighted analogue of the Fisher information
and can be approximated numerically using the central difference approximation (e.g., Abramowitz & Stegun, 1972).
Let
where A is a scaling constant. For instance, in PISA 2000, the balanced repeated replication was used as the replication method with
A weighted version of the design effect
where
To assess the models fitted in the empirical example, we used a design-based version of the Akaike information criterion (dAIC; Lumley & Scott, 2015) to account for the complex structure of samples in large-scale assessments. Using the expected a posteriori (EAP)
Simulation Study
Design and Method
To investigate the performance of the implemented Gibbs sampler, a simulation study was conducted. The model in Equation 1 was used as the data-generating model. As the true population parameters, we chose one fixed set of parameters for 20 items in booklets with four block positions for the data-generating model. Each block consisted of 5 items, and the underlying booklet design (see Table 1) formed a generalized Youden design (e.g., Kiefer, 1975).
The latent factors
The simulations were conducted for sample sizes of
Noninformative priors were specified for all parameters. We used Gibbs sampling with a burn-in period of 2,500 iterations and a chain of 10,000 iterations. Effective sample sizes were calculated to ensure a sufficiently small MCMC error (Hoff, 2009). In all conditions and for all parameters, the average effective sample size was larger than 200.
The statistical behavior was evaluated in terms of bias, root mean square error (RMSE), and coverage rates. To investigate the frequentist behavior of the parameter estimates, the respective Bayesian credibility interval (BCI) was calculated by using the
Results
Bias
In the left part of Table 3, the observed bias for the item parameters is displayed in aggregate form. The estimates for the sample size of 2,500, for the loadings
Summary of the Results of the Simulation Study
Note. For the data-generating model, one fixed set of parameters was used and sample sizes varied from 500 to 2,500.
RMSE
The summaries regarding the RMSE are displayed in the middle part of Table 3. For all parameters, the overall accuracy increased with increasing sample size. For the sample size of
Coverage Rate
To evaluate the accuracy of the standard errors of the estimates, coverage rates were calculated based on the 90% BCI. Again, we briefly summarize the results for the a
i1, a
i2, and b
i
parameters. The estimates of all item parameters showed rates very close to or above the nominal level. Additionally, the sample size did not seem to have an influence on the coverage rates (range for sample size of
The coverage rates for the remaining parameters are shown in Table 3. All parameters except for
Some coverage rates showed a tendency to be overestimated. This might be traced back to the fact that we only used noninformative priors in this simulation study. The use of noninformative priors lead to posterior distributions that represent higher variability. Sampling parameters from informative prior, which are used in the Gibbs sampler, can lead to more accurate coverage rates in simulations.
Discussion
The statistical properties of the proposed model regarding bias, RMSE, and coverage rates were acceptable for sample sizes of
Altogether, for sample sizes
Empirical Example: PISA 2000 Reading Data
To illustrate the proposed approach, we applied the model in Equation 1 and restricted versions of it to a subsample of the published data from the PISA 2000 reading assessment (OECD, 2002).
Data
The sample we analyzed comprised 2,853 students from New Zealand. Note that the sample size differs from that reported in OECD (2002) because students to whom Booklets 8 and 9 were administered were not included in our analyses (see below). As booklets were assigned to students randomly, this subsample can still be considered to be a representative sample of 15-year-old students from New Zealand.
In PISA 2000, items were administered in nine different booklets. Because not all booklets contained reading items in the first three block positions, we confined the set of booklets and items considered in our analyses in two ways: We considered (1) the first three block positions instead of all four block positions in the PISA test and (2) the first seven booklets only. The arrangement of these selected booklets (see Table 2) forms a balanced incomplete block design (BIBD; Frey et al., 2009), that is, every block (or cluster) of items appears in as many booklets as every other block, and any pair of blocks appears in as many booklets as any other pair of blocks. Furthermore, this subdesign forms a generalized Youden design (e.g., Kiefer, 1975), consisting of seven different blocks, which means that, beyond the key constraints of the BIBD, every block is assigned to every block position exactly once. This confinement provides empirical data that make it possible to illustrate the identifiability of the model parameters. Originally, 17 items were polytomously scored with three categories (incorrect, partially correct, and full credit). These items were dichotomized by scoring the full credit as correct and partially correct and incorrect as incorrect. This resulted in data from 96 dichotomously scored items. 3 All missing responses coded as omitted or not reached were treated as incorrect responses. All other missing responses were treated as missing by design.
Further, to specify the aforementioned prior distribution for the
Data analysis
To demonstrate the superiority of the proposed model over more restricted ones, we applied the models described in Equation 1 (M1), Equation 3 (M2), Equation 4 (M3), and Equation 5 (M4) to the data. After inspecting the block position-specific percentage of correct responses (see right part of Table 2), we decided to fit a further model (M2b) with the restrictions than the linear one from Equation 3:
Model M2b assumes that the effects of Position 2 are the same as those of Position 3. One has to note that, in PISA 2000, there was a 15-minute break between Block 2 and Block 3. This might be (at least partly) the reason for why the pattern of the position-specific percentage of correct responses differs from a linear form.
For all models, a Gibbs sampler was applied, while for M2, M2b, M3, and M4, the procedure described in the previous section was adapted. The Gibbs sampling had a burn-in period of 25,000 iterations, and the whole chain was 50,000 iterations long. All reported estimators are the respective means of the posterior distribution (EAP).
To assess the convergence of the MCMC algorithm, we inspected the trace plots of every estimated parameter and the respective autocorrelation function with an increasing lag between the successive draws. All models met the respective diagnostic criteria. Further, effective sample sizes and the proportional scale reduction factor (PSR; Gelman & Rubin, 1992) were calculated for each model parameter. To calculate the PSR, the respective chains were cut in half. Only two estimated parameters (of 872 in total) showed effective sample sizes smaller than 400 but with a minimum of 329. And only one estimated parameter showed (with a value of
To be able to compare the standard deviations of
Results
The results of the analysis regarding the distribution of
Summary of the Results of the Analyses Based on the PISA 2000 Reading Data of New Zealand, Confined to the First Three Blocks of Booklets 1–7
Note.
Considering the mean of the loading parameters
A positive correlation between
To compare the mean changes in difficulty from one position to the other, we used
The standardized mean change in difficulty from Position 1 to Position 3 ranged from
The estimate of
Inspecting the dAIC, M1 is preferable even though the difference from M2b was not very high (see Table 4). This small difference in the model-fit statistics of these two models is at least partly due to the much smaller design effect of M2b. Another noticeable finding is that Model M2 shows the worst fit even if its structure regarding item loadings is more flexible than the structure of Model M3. Also the unidimensional Model M4 shows a better fit than Model M2.
Discussion
Altogether, we can summarize that there were substantial item position effects. Further, the assumption that there are linear decreases in the mean changes in performance during a test did not hold in general. This might be traced back, in our case, to the relatively long break of 15 minutes between Blocks 2 and 3. Finally, there are indications of a factor on the person side that can be interpreted as persistence. In general, the degree to which the person’s persistence influences their response behavior cannot be modeled linearly.
Discussion
We introduced a flexible two-dimensional model for modeling nonlinear position effects on the item side as well as on the person side. Further, the proposed model allows for a flexible loading structure of the items on the two modeled dimensions. A fully Bayesian estimation procedure of the model was proposed, and its performance was demonstrated on simulated data. Further, we applied our approach to reading data collected in PISA 2000 and showed the additional value of the extended flexibility compared to more restrictive models.
The algorithm was implemented in R (R Core Team, 2018). Of course, the usage of other software such as WinBUGS (Lunn et al., 2000) or JAGS (Plummer, 2017) and Stan (Carpenter et al., 2017) was also considered. Handling complex survey data in MCMC estimation as a general-purpose Bayesian software requires the function of weighing individual posterior contributions (Savitsky & Toth, 2016). This seems to be possible in Stan (https://groups.google.com/forum/#!msg/stan-dev/5pJdH72hoM8/GLW_mTeaObA), but it is not easily applicable in WinBUGS or JAGS.
In the following, we discuss the limitations of the proposed model and suggest further developments. Our proposed IRT model for position effects is limited to dichotomous item responses. However, an extension allowing for ordinal response data is possible. For an Item i with C ordered response categories, in the data augmentation step
Further, the model could be extended by introducing the residual item-specific position effects of item intercepts as random effects. This strategy can be extended by including random effects for item-specific position effects for item loadings (Le, 2009).
The central assumption underlying the proposed model was a bivariate normal distribution for ability and persistence. Consequently, persistence takes negative values as well as positive values that decrease and increase the success probability of a correct item response, respectively. In contrast, if persistence is intended to represent performance decline only, restricting persistence to negative values could be reasonable. Hence, the persistence component in the bivariate distribution could be assumed to follow a log-normal distribution multiplied by −1, which only admits negative values. This model is related to a performance-decline item response model (Jin & Wang, 2014). Although the model is formulated for investigating declines in each item in a fixed-order test, the model could also be applied to the situation of position effects in this article, and the decline of a student can occur at the beginning of the test or in the second or third positions.
Supplemental Material
Supplemental Material, 01_JEBS-2019-MAN-0038.R2_Appendix - A Bayesian Item Response Model for Examining Item Position Effects in Complex Survey Data
Supplemental Material, 01_JEBS-2019-MAN-0038.R2_Appendix for A Bayesian Item Response Model for Examining Item Position Effects in Complex Survey Data by Matthias Trendtel and Alexander Robitzsch in Journal of Educational and Behavioral Statistics
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
Notes
References
Supplementary Material
Please find the following supplemental material available below.
For Open Access articles published under a Creative Commons License, all supplemental material carries the same license as the article it is associated with.
For non-Open Access articles published, all supplemental material carries a non-exclusive license, and permission requests for re-use of supplemental material or any part of supplemental material shall be sent directly to the copyright owner as specified in the copyright notice associated with the article.
