Abstract
Some large-scale testing requires examinees to select and answer a fixed number of items from given items (e.g., select one out of the three items). Usually, they are constructed-response items that are marked by human raters. In this examinee-selected item (ESI) design, some examinees may benefit more than others from choosing easier items to answer, and so the missing data induced by the design become missing not at random (MNAR). Although item response theory (IRT) models have recently been developed to account for MNAR data in the ESI design, they do not consider the rater effect; thus, their utility is seriously restricted. In this study, two methods are developed: the first one is a new IRT model to account for both MNAR data and rater severity simultaneously, and the second one adapts conditional maximum likelihood estimation and pairwise estimation methods to the ESI design with the rater effect. A series of simulations was then conducted to compare their performance with those of conventional IRT models that ignored MNAR data or rater severity. The results indicated a good parameter recovery for the new model. The conditional maximum likelihood estimation and pairwise estimation methods were applicable when the Rasch models fit the data, but the conventional IRT models yielded biased parameter estimates. An empirical example was given to illustrate these new initiatives.
Introduction
In large-scale testing, it is not uncommon to require examinees to choose and answer a fixed number of items (e.g., two) from a given set of items (e.g., four), which are referred to as examinee-selected items (ESIs). For example, several subjects in the 2016 Hong Kong Diploma of Secondary Education Examination consist of ESIs. The biology test requires examinees to choose and answer two out of the four constructed-response (CR) items. The chemistry test requires examinees to select two out of the three given sections and answer all CR items in the chosen sections. The physics, integrated science, geography, information and communication technology, and history tests consist of ESIs as well. Other large-scale tests with ESIs include the chemistry tests in 1968 and 1969 and the history test in 2010 of the Advanced Placement Examination in the United States (Lukhele, Thissen, & Wainer, 1994; Wainer & Thissen, 1994), the Maryland School Performance Assessment Program in the United States (Fitzpatrick & Yen, 1995), and the National Higher Education Entrance Examination in China (W. C. Wang, Jin, Qiu, & Wang, 2012). In these tests, all ESIs are in the CR format and graded by human raters.
Although several educational advantages of the ESI design have been identified, such as increasing learner autonomy, reducing test anxiety, and boosting learning (Wainer & Thissen, 1994), the measurement with the ESI design encounters two challenges—one is the problem of missing not at random (MNAR) data, and the other is the effect of rater severity on CR items. The first challenge indicates that the missing data in the ESI design (i.e., those responses to unselected items) are not ignorable in likelihood inference (Rubin, 1976). For example, more capable students (on the intended latent ability) tend to choose easier items more often than less capable students, and such choice effect makes test scores incomparable across examinees who choose different items (Lukhele et al., 1994; Wainer & Thissen, 1994). The second challenge is that the ESI design usually comprises CR items that require raters to give scores, and raters often have different degrees of severity (Linacre, 1989). Although there are attempts to address, avoid, or overcome the first challenge (Allen, Holland, & Thayer, 2005; Bradlow & Thomas, 1998; Culpepper & Balamuta, 2017; Fitzpatrick & Yen, 1995; Liu & Wang, 2017a, 2017b; Livingston, 1988; Lukhele et al., 1994; Pena, Costa, & Braga Oliveira, 2018; Powers & Bennett, 1999; W. C. Wang et al., 2012), no research has been conducted to address or overcome the second challenge for ESIs, to the best of our knowledge.
Rater errors may come from consistently giving ratings that are higher or lower than the examinees should receive (leniency/severity), overusing middle or extreme categories of a rating scale (centrality/extremity), the rater’s general impression of an examinee (halo effect), or the interaction between raters (dependency; for a review, see Myford & Wolfe, 2003). Because ESIs are usually CR items that are marked by human raters, and human raters usually exhibit very different degrees of severity, it is important to consider both choice effect and rater severity to increase the feasibility of the ESI design, which is the main purpose of this study.
There are several approaches to the choice effect in the ESI design, including pattern-mixture models (Wainer & Thissen, 1994), item response theory (IRT) models with prespecified choice behaviors (Mislevy & Wu, 1996), and bifactor IRT models (W. C. Wang et al., 2012). Wainer and Thissen’s pattern-mixture models assume that Pr(Y|Q = 0) = Pr(Y|Q = 1), where Y is the response and Q is a missing datum indicator, in which Q = 1 if the datum is observed, and Q = 0 otherwise. However, Pr(Y|Q = 0) is unknown unless a researcher can acquire the missing data, so the assumption of Pr(Y|Q = 0) = Pr(Y|Q = 1) cannot be verified empirically. Mislevy and Wu’s models assume that examinees’ choice behaviors are known prior to data analysis, but such an assumption is unlikely to hold true in practice. In W. C. Wang et al.’s (2012) study, a latent propensity is incorporated to account for the choice effect, but the data are assumed to be missing completely at random (MCAR), which may not hold true in the ESI design.
In addition to these approaches, two others have been recently proposed to deal with the choice effect in the ESI design. One is the nonignorable missingness ESI model (NESIM; Liu & Wang, 2017a); the other is the conditional maximum likelihood estimation (CMLE) and pairwise estimation for the Rasch models (Liu & Wang, 2017b). The NESIM combines two IRT models: an ordinary one for substantive measures and the nominal response model (NRM; Bock, 1972) for the missingness patterns. The CMLE and pairwise estimation methods are feasible for ESIs because of the measurement property of specific objectivity in the family of Rasch models, in which the estimations of item and person parameters are mutually independent. Unfortunately, both approaches fail to account for rater severity.
The purpose of this study was to advance the previous approaches to accommodate both choice effect and rater severity in the ESI design. Specifically, the authors propose (a) a new nonignorable missingness ESI rater model (NESIRM) and (b) adapted CMLE and pairwise estimation methods for ESIs to examine whether an MNAR effect or rater effect exists. The former is a new IRT model for MNAR data, whereas the latter are estimation methods that can be applied to the Rasch model for MNAR data.
The authors present simulation results in the subsequent sections to demonstrate the detriments of ignoring the MNAR effect and/or rater effect on item parameter estimators. The new methods are expected to perform well in recovering the “true” parameters when the MNAR effect and/or rater effect is not ignored. Also, the new methods are expected to perform similarly to conventional methods if the MNAR effect and/or rater effect is ignorable. Details are presented in the following simulation studies. In addition, the new and conventional methods were applied to empirical data to compare their item estimates. A significant difference might imply that an MNAR effect and/or rater effect exists. In such situations, the new methods should yield more reliable estimates than the conventional methods. Details are presented in the empirical example section.
This study is organized as follows. The NESIRM and its relationship with the NESIM are introduced. Then, Rasch models for rater severity are outlined. Next, it is demonstrated that the missingness mechanism and the substantive latent trait in the ESI design can be eliminated in the CMLE and pairwise estimation methods when specific objectivity holds true. How to estimate the parameters of the NESIRM and implement the CMLE and pairwise estimation methods is then described. Then, the results of a series of simulations are summarized; these simulations investigated the parameter recovery of the NESIRM, the effectiveness of the CMLE and pairwise estimation methods, and the consequences of ignoring choice effect and rater severity on parameter estimation when conventional IRT models are used. An empirical example is provided online in Appendix O to illustrate the implications and applications of the new initiatives. Finally, conclusions are drawn and suggestions for future studies are given.
The NESIRM for Choice Effect and Rater Severity
First, the authors introduce the necessary components of a general framework of missingness modeling for item response models. Generally speaking, for the NESIRM, an item response model has to be specified for the observed responses and a missingness model for the missing data indicators. Conventionally, the missing data indicator is a binary random variable used to indicate whether a response is missing (coded “0”) or observed (coded “1”). However, such coding is not appropriate for the ESI design. Liu and Wang (2017a) indicated that the missing data indicators are statistically dependent due to the nature of the ESI design. Take the “choose one from two items” design as an example. The resulting missing data indicators will be (1, 0) or (0, 1) for the two items. The other patterns such as (1, 1) or (0, 0) are not allowed in such a design. Thus, the two missing data indicators are dependent on each other. A better choice is to regard the missing patterns as nominal variables, as shown in the following paragraphs.
Let Ycom denote the complete data, consisting of an observable part Yobs and a missing part Ymis. Yobs and Ymis are categorical variables in this study. Let b denote the index of a block, where a block is a group of items from which students have to select. Let Mb ∈ {1, . . . , k, . . . , Wb} signify the index of the selection pattern within block b, where Wb represents the number of selection patterns in block b. The random variable Mb takes on a set of possible different values. Take “choose two out of four items” as an example. There are six possible patterns, so Wb = 6, and the realized value, mb, of Mb can be any one of the six values (1, 2, 3, 4, 5, and 6). As a result, using Mb avoids the statistical dependence mentioned previously.
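The nominal coding of selection patterns can be illustrated with a short sketch. The function name `selection_patterns` and the ordering of the patterns are assumptions made here for illustration; the article does not prescribe a particular ordering.

```python
from itertools import combinations


def selection_patterns(n_items, n_choose):
    """List every admissible selection pattern as a tuple of 0/1 indicators."""
    patterns = []
    for chosen in combinations(range(n_items), n_choose):
        patterns.append(tuple(1 if i in chosen else 0 for i in range(n_items)))
    return patterns


# "Choose two out of four items": W_b is the number of admissible patterns.
patterns = selection_patterns(4, 2)
W_b = len(patterns)

# A single nominal index m_b = 1, ..., W_b replaces the dependent 0/1 indicators.
# The ordering used here is arbitrary (an assumption of this sketch).
index_of = {pattern: k for k, pattern in enumerate(patterns, start=1)}
```

For the “choose one from two items” design, the same function returns only the two admissible patterns (1, 0) and (0, 1), so the disallowed patterns (1, 1) and (0, 0) never enter the coding.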
Given that Ycom and Mb are both random variables, the joint probability of Ycom and M, Pr(Ycom, M) can be factorized as follows:
By marginalizing over the unobservable Ymis, the joint probability becomes,
where S∈ (0, 1, . . . , C) and C denotes the total number of rating points minus one. By employing the parameters of interest, Equation 2 becomes
where θ and γ are the target latent trait and some latent propensity (e.g., individual’s tendency, which can be related to θ), respectively, ξ is the collection of all item parameters, ζ denotes the collection of all structural parameters of the missingness model. The prior distributions of item parameters (ξ and ζ) are omitted due to the absence of prior information in this paper. Let ρ denote the linear correlation between θ and γ to account for the MNAR effect, and assume that θ and γ follow a bivariate normal distribution; then Equation 3 becomes
Furthermore, based on the local independence assumption, Mb, Yobs, and Ymis are assumed stochastically independent given γ; Ymis is then marginalized out, and Equation 4 simplifies to
which is the NESIM (Liu & Wang, 2017a). Moreover, Pr(Yobs| θ, ξ) is assumed to follow an IRT model such as the partial credit model (PCM; Masters, 1982), whereas
which means the missingness model can be ignored (i.e., missing data are ignorable), and this is an MCAR mechanism. The missing at random (MAR) mechanism, Pr(Mb|γ, ζ, Yobs), is not considered in the NESIM because of the local independence assumption (i.e., Yobs is ignored given γ).
Notice that ρ does not convey the information about whether the choice effect in a specific block is related to θ (i.e., nonignorable). In this study, the authors relax the linear correlation assumption between θ and γ by introducing a block-specific parameter to detect the choice effect in each block. Thus, γ is decomposed as a linear combination of θ and a new random effect, ε (e.g., individual’s tendency which is not related to θ), in the multidimensional NRM (MNRM; see Equation 8) to account for the choice effect. Different from the NESIM, Equation 5 is changed as follows:
where Pr(θ) and Pr(ε) are the distributions of θ and ε, respectively, which are assumed stochastically independent of each other because a linear additive relationship between θ and ε is assumed. Based on our experience, if θ and ε are assumed dependent in addition to the linear relationship, model identification problems will occur. Moreover, Pr(Mb| θ, ε, ζ) for person n and choice pattern k in block b follows the MNRM:
where ζ ∈ {ω, λ, τ}, ωbk is a slope parameter for θn, λbk represents a slope parameter for εn, τbk signifies an intercept parameter, and εn accounts for the examinee’s comprehensive propensity and is assumed to be statistically independent of θn. The parameter ωbk is the key indicator of the choice effect for pattern k in block b: testing H0: ωbk = 0 determines whether the choice effect related to θn is ignorable. This information could help test designers organize the items in the blocks to reduce the choice effect in preliminary analysis or further test development.
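A minimal sketch of the selection-pattern probabilities follows. It assumes the usual multinomial-logit form of the nominal response model, with a linear predictor ωbk θn + λbk εn + τbk for pattern k; the function name `mnrm_probs` is hypothetical, and the numerical values in the comments are illustrative only.

```python
import math


def mnrm_probs(theta, eps, omega, lam, tau):
    """Probabilities of the W_b selection patterns for one examinee in one block.

    omega, lam, tau are lists of length W_b holding the slopes for theta, the
    slopes for eps, and the intercepts. The first entry of each list would be
    fixed at 0 under the identification constraint omega_b1 = lambda_b1 = tau_b1 = 0.
    """
    logits = [w * theta + l * eps + t for w, l, t in zip(omega, lam, tau)]
    top = max(logits)  # subtract the maximum for numerical stability
    expo = [math.exp(z - top) for z in logits]
    total = sum(expo)
    return [e / total for e in expo]


# "Choose one from two" block with a nonzero omega for the second pattern:
# the probability of the second pattern then depends on theta (nonignorable).
p_low = mnrm_probs(theta=-1.0, eps=0.0, omega=[0.0, 1.5], lam=[0.0, 1.0], tau=[0.0, 0.0])
p_high = mnrm_probs(theta=1.0, eps=0.0, omega=[0.0, 1.5], lam=[0.0, 1.0], tau=[0.0, 0.0])
```

When every ωbk is zero, the returned probabilities no longer change with θ, which corresponds to an ignorable choice effect in that block.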
To account for rater severity, Pr(Yobs| θ, ξ) is assumed to follow the facets model (Linacre, 1989):
where δic is the threshold c of item i, C denotes the total number of rating points minus one (c = 0, . . . , k, . . . , C), ynis ∈ {0, 1, . . . , C}, ηs indicates the severity of rater s, and δi0 ≡ 0. Combining Equations 7 to 9 creates the NESIRM.
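The category probabilities implied by a facets model can be sketched as follows. This assumes the common adjacent-category form in which the log-odds of score c versus score c − 1 equal θn − δic − ηs, so that a more severe rater lowers the expected score; the function name `facets_probs` is hypothetical.

```python
import math


def facets_probs(theta, deltas, eta):
    """Probabilities of scores 0..C for one examinee, item, and rater.

    deltas: thresholds [delta_i1, ..., delta_iC] (delta_i0 is fixed at 0).
    eta: severity of the rater; larger values mean a harsher rater.
    """
    cum = [0.0]  # score 0 corresponds to an empty sum
    for d in deltas:
        cum.append(cum[-1] + (theta - d - eta))
    top = max(cum)  # subtract the maximum for numerical stability
    expo = [math.exp(v - top) for v in cum]
    total = sum(expo)
    return [e / total for e in expo]


def expected_score(probs):
    """Expected rating given the category probabilities."""
    return sum(c * p for c, p in enumerate(probs))


# The same examinee and three-point item judged by a lenient and a severe rater.
lenient = facets_probs(theta=0.5, deltas=[-0.4, 0.6], eta=-1.0)
severe = facets_probs(theta=0.5, deltas=[-0.4, 0.6], eta=1.0)
```

Setting eta to 0 for every rater reduces the sketch to the partial credit model for the same thresholds.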
The NESIRM is a new model for ESIs, which subsumes the NESIM as a special case in two respects. First, the NESIM assumes that all raters have the same level of severity (i.e., no rater effect), whereas the NESIRM recognizes that different raters may have varying degrees of severity. Second, the ρ parameter in the NESIM indicates a universal choice effect across blocks, whereas the ω parameter in the NESIRM describes the choice effect in each block. The ρ parameter cannot indicate which blocks of ESIs have no, weak, or strong choice effects. In a preliminary study, ω could help practitioners rearrange the items in the blocks to reduce the choice effect in further studies. The NESIRM is a generalized MNAR model, which can be simplified to the NESIM when ωbk = λbk ρ and
Parameter Estimation for the NESIRM
The NESIRM is basically a two-dimensional IRT model because it includes two latent variables (θ and ε). For parameter estimation, a researcher can use the marginal maximum likelihood with expectation-maximization (MML-EM) algorithm to integrate over the θ and ε distributions in the likelihood function. Given a specific selection pattern vector
where N is the number of examinees. The random variable
In addition to the MML-EM, a researcher can adopt the Markov chain Monte Carlo (MCMC) method, which is available for various IRT models and has been implemented in freeware such as Just Another Gibbs Sampler (JAGS; Plummer, 2003). In this study, the MCMC method was adopted via JAGS because it is easy to set up model constraints. Specifically, the NESIRM is identified by constraining ωb1 = λb1 = τb1 = 0 for the first category (alternatively,
Apart from model constraints, the ESI design must meet the following requirements to establish a common scale (Liu & Wang, 2017a). First, at least two blocks of ESIs are needed, and there are some overlaps among examinees between blocks. Second, if there is only one block of ESIs, at least two items must be chosen (e.g., choose two out of the three items). Third, if there is only one block of two ESIs, at least one compulsory item (all examinees must answer) should be included. Fourth, if there is only one block of two ESIs but no compulsory item, at least some examinees must answer both ESIs. These requirements are not too harsh to meet in practice because multiple blocks or compulsory items are usually included in the ESI design. In addition to these requirements, the rating design should be implemented well to ensure linkage among raters for parameter estimation (Linacre, 1989).
Conditional Estimation of Rasch Models for Choice Effect and Rater Severity
In this section, the aim is not to develop another MNAR model for the choice effect with a rater severity component, as was done for the NESIRM. Instead, it is shown that, by specifying an IRT model from the Rasch family for the item responses and using appropriate estimation methods, one does not need to explicitly specify the missingness model. The idea is that a researcher may find an estimator that does not involve θ and the missingness mechanism, so that the item parameter estimation is independent of θ and the missingness mechanism.
Liu and Wang (2017b) showed that CMLE for the Rasch models leverages the property of sufficient statistics for θ, so that item parameter estimation is independent of θ and the missingness mechanism in the ESI design (Fischer, 1973; Mair & Hatzinger, 2007). In this study, CMLE is adapted to deal with both choice effect and rater severity. The details of the derivation can be found online in Appendix B. Specifically, the likelihood function of the item parameters and the rater effect parameters for examinee n, given rater s, is shown in Equation 1 of Appendix B. The sum of the likelihoods of all possible response patterns is shown in Equation 2 of Appendix B. Dividing Equation 1 by Equation 2 yields a likelihood function that involves neither θ nor the missingness model. Thus, the choice effect can be ignored, and the rater effect can be estimated along with the item parameters.
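The cancellation of θ can be illustrated with a simplified sketch. It uses the dichotomous Rasch model without a rater facet, which is a deliberate simplification and not the derivation in Appendix B; the function names are hypothetical. The conditional probability of a response pattern given the raw score on the answered items is the pattern's term divided by the elementary symmetric function of the same order.

```python
import math


def elementary_symmetric(eps):
    """gamma[r] = sum over all size-r subsets of the product of eps values."""
    gamma = [1.0] + [0.0] * len(eps)
    for e in eps:
        # Update from high to low order so each eps value is used at most once.
        for r in range(len(eps), 0, -1):
            gamma[r] += e * gamma[r - 1]
    return gamma


def conditional_prob(pattern, deltas):
    """Pr(pattern | raw score) over the items an examinee actually answered."""
    eps = [math.exp(-d) for d in deltas]
    gamma = elementary_symmetric(eps)
    numerator = 1.0
    for y, e in zip(pattern, eps):
        if y == 1:
            numerator *= e
    return numerator / gamma[sum(pattern)]


def joint_prob(pattern, deltas, theta):
    """Unconditional Rasch probability; used only to check that theta cancels."""
    prob = 1.0
    for y, d in zip(pattern, deltas):
        p_correct = 1.0 / (1.0 + math.exp(-(theta - d)))
        prob *= p_correct if y == 1 else 1.0 - p_correct
    return prob
```

Because only the answered items enter `deltas`, the unselected items and the reason they were not selected never appear in the conditional probability, which is the property the adapted CMLE relies on.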
Pairwise Estimation of Rasch Models for Choice Effect and Rater Severity
Three variants of pairwise estimation algorithms for rater data were introduced for the Rasch models (Garner & Engelhard, 2009). However, they neither considered the missing data that were induced by an incomplete rating design nor attempted to handle ESI data. In this study, pairwise estimation algorithms are adapted to take into account both choice and rater effects in the ESI design.
In the case of a large number of items or raters, CMLE may become inefficient because of the costly, recursive computation of the elementary symmetric function (Andersen, 1970). Rasch proposed pooling all item pairs to obtain a pairwise noniterative (PWN) estimation for the item parameters (Choppin, 1968, 1985). The main purpose of the pairwise estimation is to eliminate θ by calculating the odds ratio of paired items. Choppin elaborated on the PWN and the pairwise iterative (PWI) approaches based on pairwise likelihood (Zwinderman, 1995). The pairwise eigenvector (PWE) approach proposed by Garner and Engelhard (2009) could produce item parameter estimates that were nearly identical to those from the PWN and the PWI approaches. In this study, all three approaches were adapted to the ESI design, and their recovery of the item and the rater parameters was investigated. The details of the derivation can be found in online Appendix C. Specifically, the idea of pairwise estimation is to use distinct paired response patterns such as (yi = 0, yj = 1) and (yi = 1, yj = 0) on items i and j. The paired response patterns are tabulated in a paired comparison matrix
In summary, the CMLE and the pairwise estimation methods are not affected by the choice effect and the rater severity in the ESI context, given that the Rasch models could fit the data. On the contrary, the NESIRM is not restricted to the Rasch models and could accommodate other IRT models, such as the generalized facets model (W. C. Wang & Liu, 2007); however, specification of the MNAR mechanism is required (e.g., the flexible MNRM).
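The pairwise idea can be sketched for the simplest case. The code below assumes dichotomous Rasch items without a rater facet and uses hypothetical function names; it is not the adapted PWN, PWI, or PWE algorithm of Appendix C. Unselected items are coded as `None`, so an examinee contributes to a pair only when both items were answered.

```python
import math


def paired_counts(responses):
    """b[i][j] = number of examinees scoring 1 on item i and 0 on item j.

    responses: list of rows, one per examinee; entries are 1, 0, or None
    (None marks an item the examinee did not select).
    """
    n_items = len(responses[0])
    b = [[0] * n_items for _ in range(n_items)]
    for row in responses:
        for i in range(n_items):
            for j in range(n_items):
                if i != j and row[i] == 1 and row[j] == 0:
                    b[i][j] += 1
    return b


def pairwise_difference(b, i, j):
    """Estimate delta_i - delta_j as log(b[j][i] / b[i][j]).

    Under the Rasch model the odds of (y_i = 1, y_j = 0) versus
    (y_i = 0, y_j = 1) equal exp(delta_j - delta_i) for every theta,
    so the person parameter drops out of the ratio.
    """
    return math.log(b[j][i] / b[i][j])


# Toy data: five examinees, three items, with unselected items as None.
responses = [
    [1, 0, None],
    [0, 1, None],
    [0, 1, 1],
    [1, None, 0],
    [0, 1, None],
]
b = paired_counts(responses)
```

In this toy data set, three examinees pass item 1 while failing item 0 and one shows the reverse pattern, so the log ratio is positive and item 0 is estimated to be the harder of the two.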
Comparison Between NESIRM and Conditional/Pairwise Estimation
The major difference between the NESIRM and the conditional/pairwise estimations is that the former must specify a missingness model for missing data patterns, but it is free to specify the IRT model for the item responses. In contrast, the conditional/pairwise estimations must specify one of the Rasch models (e.g., PCM) for the item responses, but they do not need to explicitly specify the missingness model because the missingness model can be eliminated during conditional and pairwise estimations.
Both methods are summarized in Table D of online Appendix D. For example, the CMLE, PWN, PWI, and PWE are appropriate when the (facet) Rasch models are used and the missingness model does not have to be specified. The NESIRM can include any IRT model when rater effect is involved, whereas the NESIM can also include any IRT model but it ignores the rater effect. The NESIRM/NESIM can accommodate various IRT models for item responses such as the PCM, NRM (Bock, 1972), and so on. The choice of the IRT model for item responses depends on the research interest or model fit. Although the NESIRM/NESIM must specify the missingness model, Liu and Wang (2017a) found the NRM flexible to nominal missing data indicators and robust to unknown missingness models based on simulation studies. On the contrary, conditional/pairwise estimations (not new IRT models) are somewhat restricted in practice because the Rasch models must be able to fit the item responses reasonably, although their significant advantage is that the missingness model does not need to be specified.
In summary, it is suggested that practitioners use conditional/pairwise estimations first to check whether the Rasch models can fit the item responses well. The choice effect and rater effect have been tackled in the conditional/pairwise estimations; thus, one needs to check the model fit to data alone. If the Rasch models fail to characterize the data, one can resort to the NESIRM, where both the IRT model for the item responses and the missingness model must be specified by the user. The researchers’ responsibility is to find an IRT model that fits the data reasonably. For the missingness model, the (M)NRM is recommended due to its flexibility and robustness (Liu & Wang, 2017a).
Simulations
The motivation of the simulations was to demonstrate the detriments of ignoring the MNAR effect and/or rater effect on item parameter estimators and to show that the NESIRM and the conditional/pairwise methods could perform well in recovering “true” parameters regardless of whether a choice effect and/or rater effect exists. A series of simulations was conducted to compare the CMLE, pairwise estimation, NESIRM, NESIM, and PCM in terms of the recovery of the item and the rater parameters in the ESI context.
Design and Analysis
In the ESI design, the examinees were required to choose and answer one item from a pair of items. There were four blocks (pairs) of three-point items and four raters. The thresholds δi for item i were generated with increasing difficulty. Specifically, δi1 was generated from a uniform distribution ranging from −1.5 to 0, whereas δi2 was generated from a uniform distribution ranging from 0 to 1.5. The average of δ across items was rescaled to zero as a model constraint. In total, the abilities of 500 examinees were sampled from the standard normal distribution. Such a sample size was found sufficient to demonstrate the impact of ignoring MNAR data and the rater effect (to be shown in the “Results” section), although in practice, the sample size used in the ESI design is usually far larger than 500.
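The generating values described above can be sketched as follows. The function name `generate_design`, the seed, and the choice to center on the grand mean of all thresholds are assumptions of this sketch; the article states only that the average of δ across items was rescaled to zero.

```python
import random


def generate_design(n_items=8, n_examinees=500, seed=1):
    """Draw thresholds for three-point items and abilities for the examinees."""
    rng = random.Random(seed)
    first = [rng.uniform(-1.5, 0.0) for _ in range(n_items)]   # delta_i1
    second = [rng.uniform(0.0, 1.5) for _ in range(n_items)]   # delta_i2
    # Center all thresholds so that their average is zero (assumed reading
    # of the model constraint); a common shift preserves delta_i1 <= delta_i2.
    grand_mean = (sum(first) + sum(second)) / (2 * n_items)
    thresholds = [(a - grand_mean, b - grand_mean) for a, b in zip(first, second)]
    thetas = [rng.gauss(0.0, 1.0) for _ in range(n_examinees)]  # standard normal
    return thresholds, thetas


thresholds, thetas = generate_design()
```

Four blocks of two items give the eight items used as the default here.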
Three missingness mechanisms were considered: (a) random selection (RS), (b) linear selection (LS), and (c) nonlinear selection (NS). In the RS condition, examinees chose items randomly (each item in a pair had a 0.5 probability of being chosen). The RS served as the baseline for performance comparison. In the LS condition, the more proficient the examinee is, the higher the probability of choosing the first item in a pair, regardless of its difficulty. In total, 500 probabilities were randomly drawn from a uniform distribution ranging from 0 to 1 and sorted from low to high. Likewise, the ability levels of the 500 examinees were sorted from low to high. Then, the sorted probabilities were assigned to the sorted examinees, so that the higher the ability, the higher the probability of selecting the first item, which could be either easier or more difficult in each pair. In the NS condition, the responses were generated according to the NESIRM, in which ωb1 was drawn from a uniform distribution between −2 and 2, λb1 was set at 1 for simplicity (it was less interesting than other parameters), and τb1 was set at 0, so the relationship between item selection M and θ was nonlinear.
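The LS mechanism described above can be sketched as a rank-matching step. The function names and the seed are hypothetical, and the final Bernoulli draw that turns a probability into an actual choice is an assumption about how the selection was realized.

```python
import random


def linear_selection_probs(thetas, seed=1):
    """Assign sorted uniform probabilities to examinees sorted by ability.

    Returns, for each examinee in the original order, the probability of
    choosing the first item in a pair.
    """
    rng = random.Random(seed)
    probs = sorted(rng.random() for _ in thetas)                # low to high
    order = sorted(range(len(thetas)), key=lambda n: thetas[n])  # low to high ability
    assigned = [0.0] * len(thetas)
    for rank, n in enumerate(order):
        assigned[n] = probs[rank]
    return assigned


def choose_item(p_first, rng):
    """Return 0 if the first item in the pair is chosen, 1 otherwise."""
    return 0 if rng.random() < p_first else 1


example_thetas = [0.3, -1.2, 2.0, 0.0]
example_probs = linear_selection_probs(example_thetas)
```

Under the RS baseline, the same `choose_item` call would simply be made with a probability of 0.5 for every examinee.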
The severity levels of the four raters were set as η = (0, –1, 0, 1) for simplicity, so that the sum of η was 0 as a model constraint. In addition, η = (0, 0, 0, 0) was used to represent no rater effect. Incomplete and spiral-like rating designs were adopted (DeCarlo, 2010; Engelhard, 1997). In the incomplete rating design, each examinee was judged by two raters, and each rater judged a subset of examinees on all eight items, with overlaps between raters. Specifically, the first two raters (severity = 0, –1) judged examinees 1 to 275, whereas the last two raters (severity = 0, 1) judged examinees 226 to 500, and the overlaps between the first and the last two raters comprised 50 examinees. In the spiral-like rating design, each rater judged four of the eight items, as shown in Table E of online Appendix E. Specifically, Item 1 was marked by Raters 1 and 2, Item 2 by Raters 2 and 3, and so on until Item 8 by Raters 1 and 4. The authors did not consider a complete rating design in this study because it is seldom adopted in large-scale testing.
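The two rating designs can be written down as small lookup functions. The function names are hypothetical, and the modular rule for the spiral-like design is inferred from the examples given in the text (Item 1 by Raters 1 and 2, Item 2 by Raters 2 and 3, Item 8 by Raters 1 and 4); Table E of online Appendix E remains the authoritative layout.

```python
def spiral_raters(item, n_raters=4):
    """Raters (1-based) who mark a given item (1-based) in the spiral-like design."""
    first = (item - 1) % n_raters + 1
    second = item % n_raters + 1
    return (first, second)


def incomplete_raters(examinee):
    """Raters (1-based) who judge a given examinee (1..500) in the incomplete design.

    Raters 1 and 2 judge examinees 1 to 275; Raters 3 and 4 judge examinees
    226 to 500. Examinees 226 to 275 are therefore seen by all four raters,
    which links the two rater pairs onto a common scale.
    """
    raters = []
    if examinee <= 275:
        raters += [1, 2]
    if examinee >= 226:
        raters += [3, 4]
    return raters


# Number of items marked by each rater in the spiral-like design.
items_per_rater = {
    r: sum(1 for item in range(1, 9) if r in spiral_raters(item)) for r in range(1, 5)
}
# Number of examinees shared by the two rater pairs in the incomplete design.
n_overlap = sum(1 for n in range(1, 501) if len(incomplete_raters(n)) == 4)
```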
There were altogether 12 conditions, including two rating designs × two rater effects × three missingness mechanisms. In total, 70 replications were conducted under each condition, which appeared sufficient based on our preliminary studies. The item and the rater parameters were fixed across replications, whereas the person parameters were randomly drawn in each replication. For the assessment of parameter recovery, the bias and the root mean square error (RMSE) of estimator
The CMLE method implemented in the
It was anticipated that all methods would perform satisfactorily when there was no choice effect and no rater effect. The CMLE, the pairwise estimation, and the NESIRM would yield practically unbiased estimates in all conditions. The NESIM would produce biased estimates when the rater severity existed but was ignored. The PCM would suffer more serious bias when both choice effect and rater severity existed but were ignored.
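The recovery criteria can be sketched with the standard definitions of bias and RMSE over replications. These definitions are assumed here; the function names are hypothetical, and the numbers in the usage lines are illustrative, not results from the study.

```python
import math


def bias(estimates, true_value):
    """Mean estimate across replications minus the generating value."""
    return sum(estimates) / len(estimates) - true_value


def rmse(estimates, true_value):
    """Root mean square error of the estimates around the generating value."""
    squared = [(e - true_value) ** 2 for e in estimates]
    return math.sqrt(sum(squared) / len(squared))


# Illustrative values only: three replications of one threshold estimate.
example_estimates = [1.1, 0.9, 1.0]
example_bias = bias(example_estimates, 1.0)
example_rmse = rmse(example_estimates, 1.0)
```

With these definitions, the squared RMSE equals the squared bias plus the variance of the estimates across replications, so an estimator can have near-zero bias and still show a sizable RMSE.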
Results
This section summarizes the bias and the RMSE for the parameter estimates yielded by the seven methods (CMLE, PWN, PWI, PWE, NESIRM, NESIM, and PCM) under the incomplete and spiral-like rating designs.
Incomplete Rating Design
The bias and the RMSE of

[Figure: Bias in item estimators]
Spiral-Like Rating Design
The bias and the RMSE of
Appendices J and K online show the bias and the RMSE for
To summarize, the CMLE, pairwise estimation, and NESIRM methods are very robust against different missingness mechanisms (i.e., RS, LS, and NS). On the contrary, the PCM and the NESIM methods always yield biased estimates when the choice effect and/or the rater effect is present. The CMLE and the pairwise methods are useful for the ESI design with CR items when the Rasch models can reasonably fit the data. If the Rasch models do not fit the data, the NESIRM should be adopted, in which the IRT model for the observed data can be more general than the Rasch models, such as the generalized facets model. Generally, using the MNRM to account for the missingness mechanism is sufficiently robust.
Other Simulation Conditions
The above simulation studies did not address the following conditions: (a) random assignment of examinees to raters, (b) small sample size, and (c) the recovery of person parameters. Thus, three follow-up simulations were added. The goal was to examine whether the bias patterns would be affected under the three conditions, compared with the previous simulation results. Details of the designs and results are described in online Appendices L, M, and N.
In summary, the bias patterns were similar to the previous simulation results irrespective of random assignment of examinees to raters or small sample size, and the person parameters were well recovered by maximum likelihood estimation.
Conclusion and Discussion
Currently, the ESI design is rarely used in Western countries because the missing data are usually MNAR, which invalidates the use of common IRT models (X. B. Wang, Wainer, & Thissen, 1995). This conundrum inevitably overrides the practical advantages of the ESI design. Although some Asian countries still adopt the ESI design in high-stakes and large-scale tests, number-correct scores are usually used for score reports, completely ignoring both choice and rater effects. Despite the development of a few IRT models and approaches to tackle the choice effect in the ESI design (Liu & Wang, 2017a, 2017b), the rater effect is not considered in the literature. Because ESIs are usually in the CR format and thus graded by human raters, approaches that do not consider the rater effect become inapplicable.
In this study, the authors developed the NESIRM and adapted the CMLE and three pairwise estimation methods to account for both choice and rater effects in the ESI design. With this approach, the person parameters become comparable among examinees who choose different ESIs and whose answers are graded by different raters. Simulation studies confirm the advantages. The CMLE and the three pairwise methods require a good fit of the Rasch models, whereas the NESIRM is more flexible to accommodate other IRT models for observed data. Conventional approaches, such as the PCM and the NESIM, fail to consider the choice effect and/or the rater effect and yield biased estimates for the item parameters. With these approaches, the person parameters are not comparable among examinees who choose different ESIs and/or whose answers are graded by different raters.
The empirical example in online Appendix O illustrates a way to adopt the NESIRM, CMLE, and pairwise estimation methods to analyze ESI data with the rater effect. Taking those parameter estimates obtained from the NESIRM-C as representing the gold standard, the authors find that those obtained from the NESIRM, CMLE, and pairwise estimation methods are almost identical to the gold standard, justifying their applicability to the ESI design with the rater effect. In contrast, ignoring the choice effect and/or the rater effect by adopting the NESIM or the Rasch model results in misleading parameter estimates.
In practice, the missingness mechanisms in the ESI design may be more complex than those manipulated in this and past studies. Future studies should investigate how the NESIRM, CMLE, and pairwise estimation methods perform under various choice effects. There are rater effects other than severity, such as inconsistency, centrality/extremity, halo, and rater dependency. The hierarchical rater model with signal detection theory rater components for the ESI design is another interesting route to explore (Patterson, 2013). Several IRT models have tackled these rater effects (DeCarlo, Kim, & Johnson, 2011; Patz, Junker, Johnson, & Mariano, 2002; W. C. Wang, Su, & Qiu, 2014; W. C. Wang & Wilson, 2005). It is important for future studies to incorporate these models into the NESIRM and evaluate its performance.
Supplemental Material
Supplemental material, Appendix12_SUPPLEMENTAL_MATERIAL_PDF for Item Response Theory Modeling for Examinee-selected Items with Rater Effect by Chen-Wei Liu, Xue-Lan Qiu and Wen-Chung Wang in Applied Psychological Measurement
Footnotes
Acknowledgements
The first author would like to thank Prof. Wen-Chung Wang and Dr. Xue-Lan Qiu for their valuable revisions and data collection for this article. This work is the first author’s last collaboration with Prof. Wang. Goodbye, we will miss you.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the Research Grants Council of the Hong Kong SAR under GRF Project No. 18613716.
Supplemental Material
Supplemental material is available for this article online.
References