Abstract
Item position effects can seriously bias analyses in educational measurement, especially when multiple matrix sampling designs are deployed. In such designs, item position effects may easily occur if not explicitly controlled for. Still, in practice it usually turns out to be rather difficult, or even impossible, to completely control for effects due to the position of items. The objectives of this article are to show how item position effects can be modeled using the linear logistic test model with additional error term (LLTM +ε) in the framework of generalized linear mixed models (GLMMs), to explore in a simulation study how well the LLTM +ε holds the nominal Type I risk threshold, to conduct power analysis for this model, and to examine the sensitivity of the LLTM +ε to designs that are not completely balanced concerning item position. Overall, the LLTM +ε proved suitable for modeling item position effects when a balanced design is used. With decreasing balance, the model tends to be more conservative in the sense that true item position effects are less likely to be detected. Implications for linking and equating procedures which use common items are discussed.
In educational assessment and in large-scale assessments in particular, item response theory (IRT) models are often applied to estimate the difficulty of items as well as the ability of examinees. Linking and equating procedures are used to define common scales, which allow achievement scores of two nonequivalent samples to be compared even if the two samples share only a subset of common items. The idea is to use a subset of so-called anchor items that are common to both samples to define a common scale. The corresponding design is known as the common-item nonequivalent groups equating design (Kolen & Brennan, 2004), which is employed in a great number of empirical applications (e.g., Beaton, 1988; Frey, Hartig, & Rupp, 2009). Although this design offers administrative flexibility, several requirements of the common items have to be met to interpret mean ability differences in both samples as mean ability differences in the corresponding populations from which the samples are drawn. If these requirements are not fully guaranteed, the estimation of mean ability difference may be biased, for example, by context effects—that is, effects of specific characteristics of test administration (Brennan, 1992; Zwick, 1992). It is crucial to eliminate or to control context effects to obtain valid item parameters that do not function differently from one administration to the next (Yen, 1980). The assumption of item parameter invariance is fundamental to linking procedures that are based on common items (Kolen & Brennan, 2004).
The so-called National Assessment of Educational Progress (NAEP) reading anomaly (Beaton, 1988; Zwick, 1991) demonstrated potential consequences of context effects. Between 1984 and 1986, estimates of reading proficiency showed a surprisingly large decrease, especially for students of age 17. In an assessment that followed in 1988, a common-population equating design was used to explain this decrease with a different composition of reading item blocks and differences in item order and item context between 1984 and 1986. These attributes of items were not taken into account in the 1986 assessment as the underlying three-parameter logistic model used in NAEP implies that the probability of answering an item correctly is a function of only the examinee’s proficiency and a set of fixed item characteristics. The context attributes, which were assumed to have no effect at all, caused a violation of the item parameter invariance assumption. This underlines the need to minimize context effects, often achieved by keeping all context variables at a constant level. To guarantee constant context conditions, achievement tests are conducted under highly standardized conditions regarding test instruction and administration (Allen, Donoghue, & Schoeps, 2001; Organisation for Economic Co-Operation and Development [OECD], 2003).
Because a test usually contains more than one item, a presentation sequence is consequently implied; this leads to each item occurring in one or several specific positions. In this context, item position effects refer to the phenomenon that the difficulty of an item in an achievement test depends on the position of the item (OECD, 2012; Robitzsch, 2009). Often, an item administered at the end of a test is more difficult than the same item administered at the beginning of the test. This effect is usually explained by increasing fatigue of examinees. As the position in a test is part of the context the item appears in, item position effects may be considered to be a special case of context effects. Even when each examinee responds to an identical set of items in an identical sequence, position effects may occur, for instance, because the fatigue of the examinees increases as well. As some items always have to be administered subsequent to others, position effects cannot be excluded and may occur on each achievement test.
However, when each examinee responds to an identical set of items in an identical sequence, position effects are unobservable because each item occurs only at a specific position for all examinees. Furthermore, when using designs with a single constant item sequence, the meaningfulness of discussing item position effects is questionable. Brennan (1992) argued that it would not be sensible to claim that a context effect due to item position exists simply because a constant item order exists. Consideration of context effects is appropriate only in reference to a universe of (at least two) possible item presentation sequences, as item position effects are incorporated into the measured construct when there is only a single constant sequence of items.
In the context of linking and equating, the administration of items in an identical order across test forms is recommended to obtain item parameter invariance, which is frequently utilized in IRT applications (Cook & Petersen, 1987; Kolen & Brennan, 2004; Meyers, Miller, & Way, 2009). However, in the context of large-scale assessments, tests are often composed of items obtained from a large item pool. Consequently, actual tests may vary in several features, for instance item positions. Due to complex design requirements one cannot necessarily assume that the position of an item is identical in every possible test compilation. Moreover, in educational assessment, large samples of items are used to comprehensively cover the underlying test constructs, and only a subset of items is presented to each examinee. In such multiple matrix sampling designs (Gonzalez & Rutkowski, 2010), the selection of items and the order in which they are presented inevitably vary across examinees. Likewise, one item may occur in several positions; the effect of the item is thus separable from the effect of its position.
Which consequences can be expected if position effects occur? Meyers et al. (2009) found that position effects are a threat to item parameter invariance, which is an essential assumption of common-item linking procedures (Kolen & Brennan, 2004). In each context in which common scales are used to measure change in achievement or to compare achievement scores with normative scales, item position effects may cause a serious bias when two tests are linked. Zwick (1992) has pointed out that the NAEP reading anomaly was caused primarily by changes in the order and context in which the items appeared.
As the presentation order of items cannot be held constant when multiple matrix sampling is applied, most large-scale educational assessments (e.g., the Trends in International Mathematics and Science Study [TIMSS], the Progress in International Reading Literacy Study [PIRLS], and the NAEP) use balancing methods to control for position effects. Often items are grouped into blocks in which the selection and sequence of items is invariant. If the items in a block are related to a common stimulus, they are referred to as a testlet. The blocks are arranged in booklets. In this process, each block may be assigned to several booklets and possibly to differing positions in the booklets. Thus, there is a chance to balance the occurrence of blocks over positions. One design that allows for item position balancing is the balanced incomplete block (BIB) design (Gonzalez & Rutkowski, 2010; Lord, 1965). In a BIB design, every item block appears an equal number of times in all block positions; thus, there is no dependence between an item and its position in the test. The block positions (instead of the positions of single items) vary across several test booklets, but the selection of items and their order within a block are constant. Therefore, as no variation of item position within a block takes place in a BIB design, all items within a block are treated as though they share the same position (i.e., the position within the block). Note that, in the following, the term ‘item position,’ strictly speaking, refers to the position of the block in which the item occurs. Therefore, the number of possible positions at which an item can occur is reduced to the number of blocks in a booklet. Still, as we are interested in the effects of positions of items, we assign each item the block position of the block the item is occurring in.
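The position-balancing property described above can be checked mechanically. The following Python sketch counts how often each block occurs at each block position and reports whether all counts are equal. The four-block cyclic booklet layout is a hypothetical toy design, not the design of any assessment cited here, and the check covers position balance only, not the other properties of a full BIB design.

```python
from collections import Counter


def position_counts(booklets):
    """Count how often each block occurs at each block position."""
    counts = Counter()
    for booklet in booklets:
        for position, block in enumerate(booklet, start=1):
            counts[(block, position)] += 1
    return counts


def is_position_balanced(booklets):
    """Return True if every block occurs equally often at every position."""
    counts = position_counts(booklets)
    blocks = {block for booklet in booklets for block in booklet}
    n_positions = max(len(booklet) for booklet in booklets)
    frequencies = {
        counts[(block, position)]
        for block in blocks
        for position in range(1, n_positions + 1)
    }
    return len(frequencies) == 1


# Hypothetical toy design: 4 blocks rotated cyclically over 4 booklets.
blocks = ["A", "B", "C", "D"]
balanced = [blocks[shift:] + blocks[:shift] for shift in range(4)]

# Using only some of the booklets destroys the position balance.
reduced = balanced[:2]
```

In the toy design, every block occurs exactly once at each of the four positions, whereas the reduced design leaves some block-by-position combinations empty.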
Furthermore, the frequency of blocks on each position is equal among all blocks in a BIB design. The idea is that—despite the item parameters possibly being affected by position effects—the invariance of item parameter estimates is guaranteed as the BIB design controls for the position effect. For the purpose of illustration, let us assume a single item i placed at a fixed position in a block X administered to two different Samples A and B, where a BIB design was applied both times. Therefore, in both Samples A and B, X occurs at each block position an equal number of times. We will measure two item difficulty parameters: βi(A) (obtained in Sample A for item i) and βi(B) (obtained in Sample B for item i). Both parameters may potentially be biased by position effects. The extent of this bias is not known, though. If position effects appear in both samples in the same magnitude and a BIB design is applied, the extent of the position effect is assumed to be equal for both item parameters βi(A) and βi(B). Thus, even if item parameters are biased due to position effects, these equal biases of βi(A) and βi(B) still guarantee an unbiased difference βi(A)−βi(B). Therefore, position effects are not responsible for potential differences in item parameter difficulties of βi(A) and βi(B) and are no longer a threat to the item parameter invariance assumption. Hence, differences in βi(A) and βi(B) may be interpreted as due to a difference in the mean abilities of A and B, but this difference is intended in common-item nonequivalent groups equating designs.
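The cancellation argument can be written out explicitly. As a sketch, assume that the difficulty estimated in each sample is the sum of a position-free difficulty and a mean position bias; this additive form and the symbols \beta_i^{*} and b are illustrative assumptions, not notation taken from the article:

```latex
\begin{align}
\beta_i^{(A)} &= \beta_i^{*(A)} + b_A, \qquad
\beta_i^{(B)} = \beta_i^{*(B)} + b_B, \\
\beta_i^{(A)} - \beta_i^{(B)}
  &= \bigl(\beta_i^{*(A)} - \beta_i^{*(B)}\bigr) + (b_A - b_B).
\end{align}
```

If the position effects have the same magnitude in both samples and both designs are balanced, then b_A = b_B and the second term vanishes, so the difference reflects only the position-free part.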
But how justifiable is the assumption of equal position effects in different populations? Wise, Chia, and Park (1989) found that low-achieving examinees were more affected by item-order and other item-context alterations than high-achieving examinees in a test of work-related knowledge and arithmetic reasoning. Analyzing data of Turkish students from the Programme for International Student Assessment (PISA) 2006, Debeer and Janssen (2013) found that position effects decrease for students with higher ability. Considering such an effect, it is rather unlikely that position effects will appear in two samples in the same magnitude, especially if the two samples vary in their mean abilities.
Consequently, prior to any IRT linking procedure based on common items, two issues should be addressed: The equivalence of the design should be ensured and position effects should be modeled to determine whether they appear in both samples at least congruently. This is especially important for the common items used to link the scales.
Modeling Item Position Effects With an Extended Linear Logistic Test Model
To model item position effects, explanatory item response models (Wilson & De Boeck, 2004) may be applied. Within the framework of generalized linear mixed models (GLMMs; Molenberghs & Verbeke, 2004), explanatory item response models propose that item responses should be modeled as a function of predictors of various kinds. For instance, predictors can consist of characteristics of items, persons, or a combination of items and persons. Furthermore, the effects of various predictors can be modeled as fixed or random effects. To specify position effects in an explanatory model, the position of an item is considered to be a predictor on the item side that accounts for some of the variance in item difficulty. An example of such a model is the linear logistic test model (LLTM; Fischer, 1973).
Before specifying position effects, let us first consider the Rasch model:
where ηpi is a function of the ability parameter θp and the item difficulty βi:
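In the usual textbook form, the Rasch model and its linear predictor can be written as follows. The displayed equations are not reproduced in this text, so this is a sketch of the standard formulation that presumably corresponds to Equations 1 and 2, with X_{pi} denoting the response of person p to item i:

```latex
\begin{align}
P(X_{pi} = 1) &= \frac{\exp(\eta_{pi})}{1 + \exp(\eta_{pi})}, \\
\eta_{pi} &= \theta_p - \beta_i .
\end{align}
```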
In a similar notation as in Debeer and Janssen (2013, Equation 4), we add a parameter δ for the position effect:
ηpik is now the logit of a correct response of person p to item i at position k. δk can be explained in terms of item positions, applying the linear combination (see De Boeck et al., 2011):
The item position is represented in X, and dk is the linear effect of covariate Xk. In the LLTM, several items are expected to share a common property, that is, a common position. Because the prediction of δk in Equation 4 does not have an error term, δk is assumed to be perfectly predicted by the item’s position.
Hohensinn et al. (2008) investigated whether the LLTM is appropriate for modeling item position effects. Using simulation studies, they found that the likelihood ratio test (LRT) has rather low power to detect position effects. De Boeck (2008) criticized the LLTM because it is not able to model the unexplained variance of the item difficulty prediction and thus ignores the uncertainty associated with the unexplained item variance. Especially when the LLTM is applied for modeling item position effects, this uncertainty may be substantial, as the prediction of item difficulty by item position is far from perfect. The LLTM then is most certainly a misspecified model. This model misspecification can be minimized if additional predictors—for example, item indicators as used in the Rasch model—are included in X to make the prediction nearly perfect. However, it is not possible to test whether this prediction is as exhaustive as expected.
Hence, an LLTM with an additional error term (in the following denoted as LLTM +ε) is suggested to allow and to test for an imperfect prediction (De Boeck, 2008). First, as the model in Equation 3 assumes that the position parameters δk are not item dependent but only position dependent, we modify Equation 3 to allow for item-dependent position effects. Again, this is equivalent to Equation 2 in Debeer and Janssen (2013):
Next, δik is explained in terms of item properties (i.e., positions), applying the linear combination:
Especially for modeling item position effects, an LLTM +ε seems reasonable, as the item difficulty cannot be completely explained by item position.
In the GLMM framework, the introduction of εik implies that the effect of δik is decomposed into a fixed part and a random part. The fixed part incorporates the prediction of δik in terms of item properties (i.e., positions), whereas the random part εik reflects the uncertainty in this prediction.
The effects of persons and the effects of items may also be considered to be random effects, assuming that both person parameters and item parameters are sampled from a distribution that is hypothesized to be normally distributed. Only the parameters of this distribution, rather than the individual person parameters, are estimated. The model formulation of the LLTM +ε we used is the following (Janssen, Schepers, & Peres, 2004; Wilson & De Boeck, 2004):
with
with
When applying the LLTM +ε to model item position effects, one modification of the data structure is necessary to specify the units across which the effect of εik is random. If the data are in the long format, a factor variable with as many levels as there are items would normally define these units, so that ε would be random across items. However, as every item occurs at several positions, a new factor variable is created in the data matrix with as many levels as there are combinations of items and positions. This position-specific item indicator defines the units across which the effect of εik is random.
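The data restructuring described above can be sketched in a few lines of Python. The records and the column names (person, item, position, response, item_position) are hypothetical and serve only to illustrate how a position-specific item indicator is derived from long-format data.

```python
# Hypothetical long-format records: one row per person-by-item response.
rows = [
    {"person": 1, "item": "i01", "position": 1, "response": 1},
    {"person": 1, "item": "i02", "position": 2, "response": 0},
    {"person": 2, "item": "i01", "position": 2, "response": 0},
    {"person": 2, "item": "i02", "position": 1, "response": 1},
]


def add_item_position_factor(rows):
    """Return copies of the rows with a position-specific item indicator."""
    out = []
    for row in rows:
        new_row = dict(row)  # leave the input records untouched
        new_row["item_position"] = f"{row['item']}_pos{row['position']}"
        out.append(new_row)
    return out


long_data = add_item_position_factor(rows)

# One level per observed item-by-position combination.
levels = sorted({row["item_position"] for row in long_data})
```

In lme4 formula syntax, a random intercept over such a factor would take the form (1 | item_position). The exact model formula the authors used is not shown in this section, so this term is an illustration only.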
To estimate the parameters in an LLTM +ε, the likelihood has to be maximized via approximation methods, as the integral appearing in the marginal likelihood function has no analytical solution (Tuerlinckx et al., 2004). Several approximation methods are available. In the present article, the authors used the approximation to the integrand via Laplace’s method (Tierney & Kadane, 1986) as implemented in the R package lme4 (Bates, Maechler, & Bolker, 2011).
When applying the LRT in an LLTM +ε, De Boeck (2008) pointed out that, in a very strict way, relative fit indices used for the LRT do not apply when methods based on approximation of the integrand are used. Therefore, as advised by Hohensinn et al. (2008), simulation studies seem to be appropriate for examining whether the LRT applied in an LLTM +ε holds the nominal Type I risk and supplies sufficient test power.
Research Scope
This article presents the results of a simulation study which examines the methodological adequacy of the LLTM +ε to model position effects. Three questions are addressed: Does the LRT applied in an LLTM +ε hold the nominal Type I risk, and does it supply sufficient test power in a BIB design? Does the test power depend on whether the design is more or less balanced? Does the Rasch model provide unbiased item parameters in spite of the occurrence of position effects if the design is balanced?
An empirical example illustrating how item position effects can be modeled in a reading comprehension test is provided in the Online Appendix to this article.
Method
Procedure
The authors investigated whether the LRTs (see the Online Appendix, p. 5) of the LLTM +ε hold the nominal Type I risk corresponding to the nominal significance level of α = .05 and whether their test power would be sufficient in three different designs, which vary in the degree to which they are balanced. The question is relevant for three reasons: First, according to Hohensinn et al. (2008), the LRT of two LLTMs cannot hold the nominal Type I risk, nor can it supply sufficient test power. Second, strictly speaking, relative fit indices such as the deviance, the Akaike information criterion (AIC), and the Bayesian information criterion (BIC) do not apply when methods based on approximation of the integrand, such as Laplace’s method, are used for parameter estimation (De Boeck, 2008). Third, as pointed out by Frey et al. (2009), a BIB design often requires numerous booklets, which is not always feasible due to administrative limitations. Thus, it is not always possible to completely balance a design with respect to positions.
The authors employed a simulation in which three factors were manipulated:
the design (completely balanced, partially balanced, unbalanced),
the magnitude of position effects (none, linear, weak nonlinear, medium nonlinear), and
the sample size (2,000, 4,000).
The combination of these factors results in a 3 × 4 × 2 design. For each cell of the design, 1,000 simulated data sets (replications) were generated and analyzed using each of the three models introduced in the empirical example in the Online Appendix:
Model 1 incorporates no position effects.
Model 2 incorporates linear position effects.
Model 3 incorporates nonlinear position effects.
In preparation for the main analysis, ACER ConQuest 2.0 (Wu, Adams, Wilson, & Haldane, 2007) was used to apply a Rasch model in the marginal maximum likelihood (MML) formulation (Embretson & Reise, 2000; Wilson & De Boeck, 2004) to the data of the empirical example to estimate the Rasch item difficulty (RID) and the variance of the latent trait distribution which equaled
First, let us consider the completely balanced design: The BIB design from the empirical example, consisting of 80 items that were clustered in nine blocks and distributed among 20 test booklets, was used for simulation. In the other design conditions, we used only some of the booklets to obtain a design in which each item no longer occurred at each block position. Such designs are often preferred for economic reasons because the number of booklets is reduced. Consequently, such designs can only be partially balanced. The authors define the partially balanced incomplete block (PBIB) design as a design in which each item occurs at only two different block positions with equal frequency. This required only slightly more than half of the booklets needed in the BIB design: In the PBIB design, 12 (instead of 20) booklets were used. To (partially) balance for a mean position effect, each item occurred at Positions 1 and 4 or at Positions 2 and 3.
An unbalanced incomplete block (UIB) design refers to a test design in which each item occurs at one or more different positions and at least half of the items occur in at least two different positions. The design is not expected to be balanced for a mean item position effect. In the UIB design, only eight booklets were used. The number of items, however, remains unchanged in the PBIB and the UIB conditions. In the BIB, PBIB, and UIB designs, the blocks were used to link the items (see Table A1).
To summarize, each design condition refers to a different number of booklets which are randomly distributed to the virtual examinees. Assuming the RID from the empirical data as true item parameters
Table 1. Simulated Conditions.
Substituting Equation 9 into Equation 1 gives the probability P(Xpik = 1) for each item response:
Each response Xpik was generated by sampling a value from a uniform distribution over the interval [0, 1]. If the sampled value was between 0 and P(Xpik = 1), Xpik was set to 1. Otherwise, Xpik was set to 0.
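The response-generation step can be sketched in Python as follows. The ability and difficulty values, the seed, and the function names are hypothetical; the logit used here is the plain Rasch logit without a position term, because the data-generating equation is not reproduced in this text.

```python
import math
import random


def rasch_probability(eta):
    """Inverse logit: probability of a correct response given the logit eta."""
    return 1.0 / (1.0 + math.exp(-eta))


def simulate_response(p_correct, rng):
    """Draw u from a uniform distribution and score 1 if u falls below P(X = 1)."""
    u = rng.random()
    return 1 if u < p_correct else 0


rng = random.Random(2024)  # fixed seed, chosen arbitrarily for reproducibility
eta = 0.5 - (-0.3)         # hypothetical theta = 0.5 and beta = -0.3
p = rasch_probability(eta)

# Over many draws, the share of correct responses approaches p.
responses = [simulate_response(p, rng) for _ in range(10_000)]
observed_rate = sum(responses) / len(responses)
```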
Each replication in each condition was analyzed using each of the three models introduced in the Online Appendix, and two LRTs were conducted: Model 2 versus Model 1 and Model 3 versus Model 2. The proportion of replications in which the p value fell below the nominal significance level of α = .05 was computed. Depending on the simulated condition, this proportion estimates either the Type I error rate or the test power of the LRT. For example, let us consider Condition A in Table 1, in which a sample was drawn from a population in which no position effects occur at all. The sample was analyzed using Model 1, assuming no position effects, and Model 2, assuming a linear position effect. Comparing the fit of the two models using an LRT, we would expect Model 2 to fit the data significantly better than Model 1 with a relative frequency corresponding to the nominal significance level of α = .05. Conversely, in Condition B, a sample was drawn from a population in which a linear position effect is present. The sample was analyzed using Models 1, 2, and 3. Comparing the fit indices of the models, we would expect Model 2 to fit the data significantly better than Model 1 almost always (i.e., this frequency represents the test power). For the same Condition B, a comparison of Models 2 and 3 regarding the fit to the data should render a significant result only at a relative frequency corresponding to the nominal significance level of α = .05.
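The computation of the rejection rate can be sketched as follows. The deviance values are hypothetical, and the sketch assumes that the two nested models differ by exactly one parameter, so that the LRT statistic is referred to a chi-square distribution with one degree of freedom; the degrees of freedom of the actual model comparisons are not stated in this section.

```python
import math


def lrt_p_value_df1(deviance_restricted, deviance_general):
    """p value of a likelihood ratio test with one degree of freedom.

    For df = 1, the chi-square survival function equals erfc(sqrt(x / 2)).
    """
    statistic = max(deviance_restricted - deviance_general, 0.0)
    return math.erfc(math.sqrt(statistic / 2.0))


def rejection_rate(p_values, alpha=0.05):
    """Share of replications whose p value falls below alpha."""
    return sum(p < alpha for p in p_values) / len(p_values)


# Hypothetical deviances (restricted model, general model) per replication.
deviances = [(1000.0, 999.5), (1000.0, 995.0), (1000.0, 1000.0), (1000.0, 990.0)]
p_values = [lrt_p_value_df1(d0, d1) for d0, d1 in deviances]
rate = rejection_rate(p_values)
```

If the more restrictive model generated the data, the rate estimates the Type I error; if the more general model generated the data, the same quantity estimates the test power.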
Finally, the simulated data were analyzed using a Rasch model to compare the true and the estimated item difficulty parameters in each condition. The root mean square error (RMSE) between the true and estimated item parameters and the bias were estimated using classic formulas (see, for instance, Babcock & Albano, 2012):
where N was the number of item parameters, β̂i was the estimated item parameter, and βi was the true item parameter.
Finally, the RMSE and the bias were averaged over replications.
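As a minimal sketch of these two criteria, the following Python code computes the RMSE as the square root of the mean squared difference between estimated and true parameters, and the bias as the mean signed difference (estimated minus true, so that positive values mean the estimates are more difficult), and then averages both over replications. The parameter values are hypothetical.

```python
import math


def rmse(estimated, true):
    """Root mean square error between estimated and true item parameters."""
    n = len(true)
    return math.sqrt(sum((e - t) ** 2 for e, t in zip(estimated, true)) / n)


def bias(estimated, true):
    """Mean signed difference, estimated minus true."""
    n = len(true)
    return sum(e - t for e, t in zip(estimated, true)) / n


# Hypothetical true difficulties and estimates from two replications.
true_beta = [-1.0, 0.0, 1.0]
replications = [
    [-0.9, 0.1, 1.1],  # every item estimated 0.1 logits too difficult
    [-1.1, 0.1, 1.0],  # errors that cancel on average
]

# Average both criteria over replications.
mean_rmse = sum(rmse(est, true_beta) for est in replications) / len(replications)
mean_bias = sum(bias(est, true_beta) for est in replications) / len(replications)
```

The second replication shows why both criteria are reported: its bias is zero because the errors cancel, while its RMSE is not.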
Results
Table 2 lists the percentage of significant results (α = .05) of the LRTs for the 1,000 replications for each of the two model comparisons in each of the four conditions of item position effects. Whereas the values always represent a percentage of significant results, they are listed in two different columns, depending on their interpretation as Type I error or test power. Values related to the Type I error rate should meet the nominal significance level of α = .05; values related to test power should ideally approach values close to 1.
Table 2. Percentage of Significant Results of the LRT in Various Conditions for the Three Designs.
Note. LRT = likelihood ratio test; BIB = balanced incomplete block; PBIB = partially balanced incomplete block; UIB = unbalanced incomplete block.
For example, let us consider the column labeled “α” and the comparison between Model 3 and Model 2 for Condition B and a sample size of 2,000 in the PBIB design: A linear position effect was simulated in a PBIB design. The data were analyzed using Model 3, which modeled nonlinear item position effects, and Model 2, which modeled only a linear item position effect. We would expect that Model 3 would not fit the data better than Model 2. A Type I error (i.e., a better fit of Model 3) occurred in 7% of the 1,000 replications, so the Type I error marginally exceeds the nominal significance level of α = .05.
Considering the BIB design, all LRTs held the nominal Type I error quite well or even fell below the nominal significance level. The Type I error rate also does not depend on the sample size. As expected, test power strongly depends on sample size. With N = 4,000, the test power reaches acceptable values of at least 0.5 if the model that was appropriate to the simulated conditions was compared with the preceding more restrictive model. The test power also depends on the magnitude of the position effects in the population, with stronger effects being more easily detected.
In the PBIB design, the Type I risk rises to as much as 0.08 and the test power is a little lower in all conditions. Surprisingly, the model is still reasonably able to detect nonlinear position effects, although each item occurs at only two different block positions.
In the UIB design, however, the test power fell below acceptable values. If nonlinear position effects were simulated, the test often preferred an “incorrect” model that assumed only linear position effects. Likewise, the LRT was not able to supply sufficient test power when nonlinear position effects occurred in the population. Whereas a BIB design provides a design that is completely balanced with regard to item position, the degree of balancing is lower in the PBIB and UIB designs. Therefore, it is more difficult to differentiate item position effects from the effects of the items themselves as the position effect is incorporated into the item effect.
Table 3 lists the RMSE between the true and the estimated Rasch item difficulty parameters and the bias in each simulated condition. Positive values in the bias indicate that estimated item parameters are more difficult than are true item parameters. The bias depends on item position effects, which is quite plausible as we expect that item parameters are conjointly affected by position effects. The bias is slightly higher in the PBIB and the UIB designs, whereas the sample size does not affect the bias. Without any position effects the bias vanishes, no matter whether the design is balanced or not. All these results are in line with the rationale which lies behind the concept of balancing. However, for the RMSE a different picture emerges. Remember, if position effects act homogeneously on items and if a BIB design is employed, we would not expect that item parameters are affected by position effects distinctly from each other. Consequently, we would not expect that the RMSE depends on position effects in a BIB design. However, in all three designs, the RMSE increases if position effects occur. Although the BIB design yielded the lowest RMSE, these findings contradict the rationale which lies behind the concept of balancing. The authors have no conclusive explanation for this phenomenon. It might be an issue related to the scaling in logistic models because effects in such models are expressed relative to the standard logistic error standard deviation. If position effects occur which are not included in the analysis model (i.e., the Rasch model), the variance due to positions is not captured by the model. Consequently, the model deviance will increase and the scale may be deflated, which may be responsible for the increasing RMSE.
Table 3. Descriptive Results Concerning Item Parameter Recovery of the Rasch Model for the Three Designs.
Note. RMSE = root mean square error; BIB = balanced incomplete block; PBIB = partially balanced incomplete block; UIB = unbalanced incomplete block.
Discussion
After item-context effects were demonstrated in the research on the NAEP reading anomaly (Beaton, 1988; Zwick, 1991), test construction and design began to gain particular importance as their consequences for trait measurement became apparent. Item position effects, among other effects of item context, are considered to be potential threats to item parameter invariance (Meyers et al., 2009), which is for instance fundamental to linking procedures that are based on common items. Therefore, the estimation of item position effects and the need of balancing designs are of particular importance in educational assessments, especially when tests should be linkable to a common scale, as in the common-item nonequivalent groups equating design (Kolen & Brennan, 2004).
This article presented a simulation study which revealed that the LRTs applied in the LLTM +ε framework are able to hold the nominal Type I risk (significance level) of α = .05 and supply sufficient test power when a BIB design is applied. When using the LLTM for modeling item position effects, the uncertainty due to an imperfect prediction has to be taken into account. The remaining residual variance may be modeled by an additional error term. More important, results of the simulation underline the relevance of the test design when modeling item position effects. The advantageous properties of the LLTM +ε strictly hold only in a BIB design. This is true even if the LLTM +ε is a “perfect” model, that is, if the model used for data generation and the analysis model are congruent. The authors believe that this is not a weakness of the LLTM +ε, but a problem that applies to IRT models in general. The design has to be appropriate to the model, otherwise even the best model may fail—less appropriate models may fail then even more severely. Concerning position effects, results of the simulation study therefore emphasize the need for and interdependence of both approaches: balancing via a BIB design and modeling item position effects within an appropriate framework. To check if balancing the design yields unbiased Rasch item parameters, it is necessary to model item position effects. In turn, a BIB design has to be used to ensure that the modeling via the LLTM +ε yields unbiased results. In other words, unbiased estimation of position effects depends not only on the appropriateness of the model but on the appropriateness of the test design as well.
Two basic limitations of the current study have to be mentioned. The GLMM framework only allows for modeling linear effects, which means that multiplication of model parameters is not permitted. Thus, the Rasch model, which connects person and item parameters additively, can easily be modeled and extended to, for instance, position effects models. However, models that additionally assume multiplicative terms, for instance item discrimination parameters as in the two-parameter logistic (2PL) model (Birnbaum, 1968), cannot be estimated. Still, such models can be easily conceptualized within the nonlinear generalized linear mixed models (NGLMMs) approach (De Boeck & Wilson, 2004; Tuerlinckx et al., 2004) and estimated with reliable software like the NLMIXED procedure in SAS (SAS Institute, Inc., 2008). In fact, Debeer and Janssen (2013) proposed 2PL models that incorporate position effects but did not investigate their statistical properties.
A second limitation of the models proposed in this article is that they do not consider individual differences in position effects, which may occur between persons or between items. Meulders and Xie (2004) described crossed-factor models that may allow for the modeling of dependencies between item and position. Debeer and Janssen (2013) showed that there was considerable individual variance in position effects in an empirical sample of Belgian students who took a listening comprehension test, and even proposed a person-specific trait that “indicates how a person is affected by the sequencing of items in a specific test” (p. 177). Modeling and, even more, explaining such individual differences would be of particular importance if the achievement scores of two nonequivalent samples are to be compared. Identifying examinee-level variables (e.g., motivation) that moderate position effects may help to reduce item position effects when considered in the conceptualization of the study. This matters because we have seen that even in a completely balanced design, the RMSE is affected by item position effects.
The results of the simulation study underline the fact that an LLTM +ε applied in a BIB design allows researchers to model item position effects. Thus, it is possible to investigate position effects across subpopulations to verify whether these effects occur with the same magnitude in both subpopulations, which is a necessary condition for the application of linking procedures that are based on common items.
But what can be done if item position effects do not appear with the same magnitude in the two samples that are to be linked? Going back to the example presented in the introduction, we would then obtain two item difficulty parameters for item i: βi(A) obtained in Sample A and βi(B) obtained in Sample B. The two parameters may be biased by position effects to a different degree. If position effects in both samples are not related to item difficulty or student ability, there is just a mean difference of position effects between A and B. In this case, it may be sufficient to adjust the linking constant by the mean difference in item position effects. However, if position effects in one or both samples are related to item difficulty and/or student ability, the linking procedure may result in biased estimates of the mean ability difference. It would be worthwhile to investigate whether and to what extent these hypothetical cases occur in empirical applications and how linking procedures can be modified to adequately perform under such conditions.
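The simple case, in which position effects amount to a mean difference between the samples, can be illustrated with a small numerical sketch. It is not a procedure taken from this article. It assumes a mean-mean linking constant (the difference between the mean anchor item difficulties in the two samples), assumes that position effects add a constant to every estimated difficulty within a sample, and uses invented values throughout. The function names are hypothetical.

```python
from statistics import mean


def mean_mean_linking_constant(b_a, b_b):
    """Difference between mean anchor item difficulties (Sample B minus Sample A)."""
    return mean(b_b) - mean(b_a)


def adjusted_linking_constant(b_a, b_b, pos_a, pos_b):
    """Remove the mean difference in position effects from the linking constant.

    pos_a and pos_b are the mean position effects in Samples A and B,
    assumed here to be known (e.g., estimated by a position effect model).
    """
    return mean_mean_linking_constant(b_a, b_b) - (pos_b - pos_a)


# Invented values for illustration only (assumption, not data from the study).
true_beta = [-1.0, -0.5, 0.0, 0.5, 1.0]  # anchor item difficulties
true_shift = 0.30                         # true scale difference between samples
pos_a, pos_b = 0.10, 0.25                 # mean position effects per sample

# Difficulties as they would be estimated if position effects were ignored.
b_a = [b + pos_a for b in true_beta]
b_b = [b + true_shift + pos_b for b in true_beta]

raw = mean_mean_linking_constant(b_a, b_b)
adjusted = adjusted_linking_constant(b_a, b_b, pos_a, pos_b)
print(f"raw constant: {raw:.2f}, adjusted constant: {adjusted:.2f}")
```

Under these assumptions the raw constant absorbs the difference in mean position effects, and subtracting that difference recovers the true shift. The correction no longer works once position effects depend on item difficulty or student ability, which is the problematic case described above.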
Footnotes
Acknowledgements
The authors thank two anonymous reviewers and the editor for their constructive help and support with improving this article.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the Institute for Educational Quality Improvement, Humboldt-Universität zu Berlin, Germany.
