Estimating Probabilities of Passing for Examinees With Incomplete Data in Mastery Tests

Abstract

Administrative problems such as computer malfunction and power outage occasionally lead to missing item scores and hence to incomplete data on mastery tests such as the AP and U.S. Medical Licensing examinations. Investigators are often interested in estimating the probabilities of passing of the examinees with incomplete data on mastery tests. However, there is a lack of research on this estimation problem. The goal of this article is to suggest two new approaches—one each based on classical test theory and item response theory—for estimating the probabilities of passing of the examinees with incomplete data on mastery tests. The two approaches are demonstrated to have high accuracy and negligible misclassification rates.

Keywords

item response theory (IRT) models, logistic regression, regression imputation

Introduction

Measurement professionals are familiar with the phenomenon of missing item scores, or incomplete tests, on educational assessments. A wide variety of problems related to missing item scores in educational or psychological measurement have been tackled by researchers such as De Ayala et al. (2001), Finch (2008), Feinberg (2020), Sinharay (2021), Smits et al. (2002), and Xiao and Bulut (2020).

This article focuses on tests such as the AP (e.g., Patterson & Ewing, 2013), Praxis (Educational Testing Service, 2020), and U.S. Medical Licensing Examination or USMLE (2020) that are used to decide whether the test-takers have attained a specific level of knowledge or mastery of a given subject and to report pass–fail statuses instead of or in addition to scaled scores. For convenience, tests that report pass–fail statuses will henceforth be referred to as mastery tests. These tests primarily include licensure and certification tests. Occasionally, administrative problems such as computer malfunction, natural calamities, noise or disruption in the testing area (e.g., American Educational Research Association, American Psychological Association, & National Council for Measurement in Education, 2014), and power outage lead to missing item scores in mastery testing (e.g., Feinberg, 2020; USMLE, 2020). Missing item scores for an examinee lead to an incomplete test for the examinee.

While several options such as retests are available to the test administrators for handling incomplete mastery tests, one option is to report estimated pass–fail statuses to the examinees with incomplete tests after estimating their probabilities of passing. There is a lack of research on this topic with the exception of Feinberg (2020), who suggested four approaches that can be used to estimate the passing probabilities of examinees based on their performance on the part of the test that they were able to complete. However, Feinberg (2020) considered the case where the test includes only dichotomous items and all the missing item scores are located at the end of the test. Further, the misclassification rates of the best performing approaches of Feinberg (2020) were often large, especially for the examinees whose actual scores were close to the passing score, and only one data set was analyzed by Feinberg (2020).

The objective of this article is to extend the research of Feinberg (2020) on estimating passing probabilities on incomplete mastery tests in several ways. First, this article considers mastery tests that include both dichotomous and polytomous items; these tests are often referred to as mixed-format tests (e.g., Kolen & Lee, 2011), and AP tests (e.g., Patterson & Ewing, 2013) are examples. Second, this article considers the case when the missing item scores do not necessarily occur at the end of the test. For example, on a test such as the Praxis® Spanish World Language test (ets.org/s/praxis/pdf/5195.pdf) that includes speaking items, poor audio quality or excessive background noise during recording may render some responses to the speaking items unscorable, leading to missing scores on these items, which do not necessarily appear at the end of the test. Third, while the three approaches based on item response theory (IRT) of Feinberg (2020) were based on the Rasch model, this article considers more general IRT models including the three-parameter logistic model (3PLM; Birnbaum, 1968) and the generalized partial credit model (GPCM; Muraki, 1992). Fourth, Feinberg (2020) reported results from only one data set, whereas this article reports results from several simulated data sets and one real data set. Finally, two new approaches, one based on classical test theory (CTT) and one on IRT, are suggested for estimating the passing probability on incomplete mastery tests in an attempt to overcome the large misclassification rates that some approaches of Feinberg (2020) suffered from; two comparison studies, one based on simulated data and one based on real data, are employed to demonstrate the superior performance of the two new approaches compared with the approaches of Feinberg (2020).

The next section includes a review of the literature on the analysis of missing data for educational and psychological tests and of the four approaches of Feinberg (2020). The “Method: . . .” section includes a description of two new approaches for estimating the passing probabilities for the examinees with incomplete data on mastery tests. The “Data Analysis” section includes a comparison of the two new approaches and three approaches of Feinberg (2020) using simulated and real data. The last section includes discussions and conclusions.

Omitted and not-reached responses, which also lead to missing item scores, are quite common in educational assessments. But these responses occur due to various types of examinee behavior and not due to administrative problems—so they are not considered in this article. Instead, the assumption is made that the omitted and not-reached responses are treated using the operational approaches of the tests concerned. Researchers such as De Ayala et al. (2001) and Glas and Pimentel (2008) explored various approaches for handling omitted and not-reached responses.

Analysis of Missing Data in Educational Measurement

Researchers such as Enders (2010), Graham (2009, 2012), Schafer (1997), Schafer and Graham (2002), Sinharay et al. (2001), and Vriens and Sinharay (2006) provided extensive reviews of the literature on missing data analysis in general. While Holman and Glas (2005), Glas and Pimentel (2008), Sulis and Porcu (2017), and Rose et al. (2017) suggested various advanced IRT models for modeling missing data in educational and psychological measurement, measurement researchers and practitioners considered several additional problems related to missing data. Some of the research is summarized below, categorized by the problem of interest.

Various Types of Missing Data Analysis in Educational Measurement

Estimation of Model Parameters or Summary Statistics

Finch (2008) and Sulis and Porcu (2017) compared several missing-data imputation approaches with respect to their accuracy in estimating parameters of dichotomous and polytomous IRT models, respectively, in the presence of missing item scores. Sijtsma and van der Ark (2003) compared several missing-data imputation approaches with respect to their accuracy in the presence of missing data in estimating the coefficient $α$ , Mokken’s scalability coefficient $H$ , and two goodness-of-fit statistics for the Rasch model.

Imputation of Raw/Scaled Score and Grade Point Average

Huisman and Molenaar (2001) compared several missing-data imputation approaches for imputation of the total/raw score in psychology tests in the presence of missing item scores. Sinharay (2021) compared several missing-data imputation approaches for imputation of scaled scores in the presence of missing item scores in educational tests. Smits et al. (2002) compared several missing-data imputation approaches with respect to their accuracy in imputation of grade point averages in the presence of missing grades on several courses.

Ability Estimation in the Presence of Missing Data

Xiao and Bulut (2020) compared several missing-data imputation approaches with respect to their accuracy in estimating the ability parameters of dichotomous IRT models in the presence of missing item scores. Cetin-Berber et al. (2019) compared several missing-data imputation approaches with respect to their accuracy in estimating the examinee ability in computerized adaptive multistage testing in the presence of missing item scores.

Various Treatments of Omitted Responses and Their Impact

Shin (2009) examined how omitted responses should be handled in IRT equating. Kohler et al. (2017) investigated how different treatments of item nonresponse affect relevant outcome measures such as estimates of coefficients of a linear regression of examinee ability on explanatory variables such as gender and socioeconomic status, on educational assessments. De Ayala et al. (2001) examined how omitted responses affect common IRT ability parameter estimates such as the maximum likelihood estimate and the posterior mean of ability for dichotomous IRT models. They also examined whether the omitted responses should be treated as incorrect responses. Pohl et al. (2014) attempted to find the best approaches for item and person parameter estimation in the presence of omitted responses on competency tests.

Estimation of the Passing Probability on an Incomplete Mastery Test

Feinberg (2020) suggested four approaches for estimating the probability of passing of an examinee on an incomplete mastery test. More details of the four approaches are provided shortly.

Brief Summary of Missing Data Analysis in Educational Measurement

The above discussion indicates that (a) missing data could lead to various types of problems for educational tests, (b) researchers have suggested various approaches for tackling these problems, and (c) there is a lack of research on estimating the passing probability of an examinee on an incomplete mastery test, with the exception of Feinberg (2020).

The Four Approaches of Feinberg (2020)

Feinberg (2020) considered the case where, due to administrative problems, the scores on items $S + 1$ to $I$ are missing for an examinee on a test that includes $I$ dichotomously scored items. Thus, scores on items 1 to $S$ are available for the examinee. Consider that an IRT model has been fitted to data from the test and that the raw score, proportion correct score, IRT ability estimate, and the corresponding standard deviation (SD) based on Items 1 to $S$ for the examinee are denoted by $R_{S}$ , $p_{S}$ , ${\hat{θ}}_{S}$ , and ${\hat{σ}}_{S}$ , respectively. Suppose that $S_{S}$ is the scaled score that is equivalent to the examinee ability of ${\hat{θ}}_{S}$ . Suppose that the passing/cut score on the raw-score scale is $C$ so that the examinee needs to score at least $C - R_{S}$ on items $S + 1$ to $I$ to pass the test. Suppose that the passing score on the examinee ability ( $θ$ ) scale is $θ_{C}$ . The four approaches that Feinberg (2020) suggested for estimating the passing probability for the examinee are discussed below:

• Model-based standard error (MSE) approach: This approach involves the assumption that the posterior distribution of the examinee ability $θ$ can be accurately approximated by a $N (θ | {\hat{θ}}_{S}, {\hat{σ}}_{S}^{2})$ distribution, that is, by the normal distribution with mean ${\hat{θ}}_{S}$ and variance ${\hat{σ}}_{S}^{2}$ , followed by the estimation of the passing probability of the examinee by the probability of a value larger than $θ_{C}$ in the normal distribution, or, as

$$PP_{MSE} = P(θ \geq θ_{C}) = 1 - P(θ < θ_{C}) = 1 - P\left(\frac{θ - {\hat{θ}}_{S}}{{\hat{σ}}_{S}} < \frac{θ_{C} - {\hat{θ}}_{S}}{{\hat{σ}}_{S}}\right) = 1 - Φ\left(\frac{θ_{C} - {\hat{θ}}_{S}}{{\hat{σ}}_{S}}\right),$$

where $Φ$ denotes the cumulative distribution function of the standard normal distribution.

• Bayesian approach: This approach employs a reference data set collected from past administrations of the test and involves the estimation of the probability of passing as

$$PP_{Bayesian} = \frac{P(S_{S} \mid Pass) \, P_{Pass}}{P_{S_{S}}},$$

where $P_{S_{S}}$ is the (prior) proportion of examinees in the reference data set whose estimated scaled score based on items 1 to $S$ is equal to $S_{S}$ , $P_{Pass}$ is the proportion of examinees in the reference data set who passed the test, and $P (S_{S} | Pass)$ is the proportion, among the examinees in the reference data set who passed the test, who obtained a scaled score estimate of $S_{S}$ based on items 1 to $S$ .

• Binomial distribution (BD) approach: This approach involves (a) the assumption that the raw score of the examinee on items $S + 1$ to $I$ follows the binomial distribution with $I - S$ trials and success probability $p_{S}$ and (b) computation of the estimated passing probability of the examinee as the probability of a value of at least $C - R_{S}$ in the binomial distribution, or, as

$$PP_{Binomial} = \sum_{t = C - R_{S}}^{I - S} \binom{I - S}{t} p_{S}^{t} (1 - p_{S})^{I - S - t} . \tag{1}$$

• Lord–Wingersky (LW) or Recursion approach: This approach involves the computation of $p_{LW} (t | {\hat{θ}}_{S})$ , the estimated probability of obtaining a total raw score of $t$ on items $S + 1$ to $I$ given ability ${\hat{θ}}_{S}$ , using the LW recursion algorithm (Lord & Wingersky, 1984) and the IRT item parameter estimates. Then one computes the estimated passing probability of the examinee as

$$PP_{LW} = \sum_{t = C - R_{S}}^{I - S} p_{LW} (t | {\hat{θ}}_{S}) . \tag{2}$$
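The MSE, BD, and LW computations listed above can be illustrated with a minimal Python sketch. It is not the implementation of Feinberg (2020): the function names and the toy numbers in the usage lines are hypothetical, the Rasch model is assumed for the LW approach, and the Bayesian approach is omitted because it requires a reference data set.

```python
from math import comb, exp
from statistics import NormalDist


def pp_mse(theta_hat, sigma_hat, theta_cut):
    """MSE approach: 1 - Phi((theta_C - theta_hat) / sigma_hat)."""
    return 1.0 - NormalDist().cdf((theta_cut - theta_hat) / sigma_hat)


def pp_binomial(r_s, s, n_items, cut):
    """BD approach, Equation (1): P(X >= C - R_S), X ~ Binomial(I - S, p_S)."""
    n_left = n_items - s
    p_s = r_s / s                      # proportion correct on items 1..S
    needed = max(cut - r_s, 0)         # points still required to pass
    return sum(comb(n_left, t) * p_s**t * (1 - p_s)**(n_left - t)
               for t in range(needed, n_left + 1))


def lw_score_dist(probs):
    """Lord-Wingersky recursion: distribution of the number-correct score
    over independent dichotomous items with success probabilities `probs`."""
    dist = [1.0]                       # before any item: score 0 with probability 1
    for p in probs:
        new = [0.0] * (len(dist) + 1)
        for score, mass in enumerate(dist):
            new[score] += mass * (1 - p)      # item answered incorrectly
            new[score + 1] += mass * p        # item answered correctly
        dist = new
    return dist


def pp_lw(theta_hat, difficulties, r_s, cut):
    """LW approach, Equation (2), under the Rasch model (assumed here).
    `difficulties` holds the difficulty estimates of items S+1..I."""
    probs = [1.0 / (1.0 + exp(-(theta_hat - b))) for b in difficulties]
    dist = lw_score_dist(probs)
    return sum(dist[max(cut - r_s, 0):])


# Hypothetical examinee: 5 of the first 10 items correct on a 12-item test
# with raw cut score 6, so at least 1 of the 2 missing items is needed.
print(pp_binomial(r_s=5, s=10, n_items=12, cut=6))
print(pp_lw(theta_hat=0.0, difficulties=[0.0, 0.0], r_s=5, cut=6))
```

When every missing item has the same success probability, the LW distribution reduces to the binomial distribution, so the two printed values agree in this toy case.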

Three of the aforementioned four approaches are based on an IRT model—Feinberg (2020) used the Rasch model because the model is used for the USMLE whose data were analyzed by him. After estimating the passing probability for an examinee with an incomplete test, one assigns a “pass” or “fail” status to the examinee if the estimated probability is larger than or smaller than cutoffs of $1 - α / 2$ and $α / 2$ , respectively, for a prespecified $α$ , and an “indeterminate” status if the estimated probability falls between the cutoffs. In this article, $α$ was set equal to .05, which is one of the two values considered by Feinberg (2020); this choice leads to assigning an examinee a “pass” or “fail” status depending on whether the estimated passing probability for the examinee is larger or smaller than .975 and .025, respectively, and an “indeterminate” status if the estimated probability is between .025 and .975. Use of the other value of $α$ (.0000001) considered in Feinberg (2020) did not affect the conclusions of this article—so results for that value are not reported and can be obtained from the authors.
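The three-way decision rule described above can be sketched as follows. The function name is hypothetical, and assigning a probability that falls exactly on a cutoff to the indeterminate status is an assumption, because the text does not say how ties are handled.

```python
def classify(passing_probability, alpha=0.05):
    """Map an estimated passing probability to a reported status.

    Cutoffs are 1 - alpha/2 and alpha/2 (.975 and .025 for alpha = .05).
    Values exactly on a cutoff are treated as indeterminate (an assumption).
    """
    if passing_probability > 1 - alpha / 2:
        return "pass"
    if passing_probability < alpha / 2:
        return "fail"
    return "indeterminate"


for pp in (0.99, 0.50, 0.01):
    print(pp, classify(pp))
```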

Feinberg (2020) used a data set from the USMLE in 2015 to compare the accuracy, indeterminate, and misclassification rates of the four approaches that he suggested. The comparisons of Feinberg (2020) revealed that

• The MSE approach performed the worst among his four approaches; the power of the approach was too low on average.

• The Bayesian approach may be difficult to implement in practice due to the problem of finding appropriate prior information that is required by the approach and was less powerful than the LW and BD approaches.

• Among the LW and BD approaches, the BD approach performed as well as the LW approach in several cases, but somewhat worse in a case when the items were ordered by difficulty (see his Figure 5).

Therefore, Feinberg (2020) concluded that the LW approach is the most robust option for estimating passing probabilities on incomplete mastery tests. However, Feinberg (2020) pointed to the large misclassification rates of the LW approach (and those of two other approaches that he considered), especially for borderline pass–fail examinees and for a large number of missing item scores. For example, in his Figure 3, the misclassification rate of the LW approach was about 10% for the borderline pass–fail examinees who answered only the first 100 items on a 252-item test. While Feinberg (2020) commented that a high-stakes testing program would likely want to create a policy that results in close to, if not exactly, an expected 0% misclassification rate, the misclassification rate of his LW approach was often considerably larger than 0%, so the ideal imputation approach remained elusive in his study.¹ In addition, Feinberg (2020) analyzed only one data set and considered the case where the test includes only dichotomous items and all the missing item scores occur at the end of the test. Finally, his IRT-based approaches were based on the Rasch model. So, there is considerable scope for further extensions of the research of Feinberg (2020) on estimating passing probabilities on incomplete mastery tests; this article reports results from such extensions.

Methods: Two New Approaches for Estimating Passing Probabilities on Incomplete Mastery Tests

In this section, two new approaches, one based on CTT and another based on IRT, are suggested for estimating probabilities of passing on incomplete mastery tests. The approaches apply to mixed-format tests and can handle missing item scores that are not necessarily at the end of the test.

Some Notation

Let us consider a mixed-format test that includes several dichotomously scored items and several polytomously scored items with potentially different numbers of score categories. An example of such a test is the 71-item AP Spanish test in 2011, which was considered by Y. Kang and Lee (2011)—the test included 65 multiple-choice (MC) and dichotomously scored items, three polytomous items with maximum scores of 5 each, and three more polytomous items with maximum scores of 9 each. Let us denote the sets of items on which scores are available and missing for an examinee as $A$ and $M$ , respectively. For example, for an examinee who took the above-mentioned AP test, $A$ may comprise 50% of all MC items and one polytomous item and $M$ may comprise the remaining 50% of the MC items and the remaining five polytomous items. In practice, $A$ and $M$ may differ over the examinees, but a subscript for examinees was not used for convenience. Note that the set $M$ does not necessarily appear at the end of the test. Let us denote the raw score, IRT ability estimate, and the corresponding SD based on the scores on the items in $A$ for the examinee as $R_{A}$ , ${\hat{θ}}_{A}$ , and ${\hat{σ}}_{A}$ , respectively. Suppose $C$ is the cut score on the raw score scale for the test. Thus, the examinee needs to score at least $C - R_{A}$ on the items in $M$ to pass the test. Let $p_{A}$ denote the proportional score on $A$ , that is,

$$p_{A} = \frac{\text{Total observed raw score based on all the items in } A}{\text{Maximum possible raw score over all the items in } A} . \tag{3}$$

A New Approach Based on IRT

The LW approach of Feinberg (2020) was modified in multiple ways to estimate the passing probability on incomplete and mixed-format mastery tests. First, while the LW approach of Feinberg (2020) utilizes the LW algorithm (Lord & Wingersky, 1984) that can be used to compute the probability $p_{LW} (t | θ)$ of a raw score $t$ on the missing portion of a test with only dichotomous items, this article used the extension of the LW algorithm to mixed-format tests that was suggested by Hanson (1994) and Thissen et al. (1995) to obtain the probability of a raw score $t$ on the missing portion ( $M$ ) of a mixed-format test for a given value of examinee ability. The extension of Hanson (1994) and Thissen et al. (1995) is briefly described in Appendix A. The R package irtplay (Lim, 2020) was used to compute these probabilities, which are denoted as $p_{MLW} (t | θ)$ , where “MLW” denotes “modified Lord–Wingersky.” Second, while the LW approach of Feinberg (2020) was predicated on the Rasch model, it is possible to use less restrictive IRT models instead of the Rasch model. A combination of the 3PLM and the GPCM was used in this article for the dichotomously and polytomously scored items, respectively. Third, the LW approach of Feinberg (2020) uses $p_{LW} (t | {\hat{θ}}_{A})$ to estimate the passing probability, but does not take into account the variability of ${\hat{θ}}_{A}$ in computing $p_{LW} (t | {\hat{θ}}_{A})$ . The variability of ${\hat{θ}}_{A}$ may be large, especially when $A$ is small (that is, when many item scores are missing), and may cause $p_{LW} (t | {\hat{θ}}_{A})$ to be imprecise; this imprecision may have led to the large misclassification rate of the LW approach in Feinberg (2020). In this article, the variability in ${\hat{θ}}_{A}$ was taken into account by estimating the probability of a raw score $t$ on the missing portion ( $M$ ) of a mixed-format test as

$$p_{MLW} (t) = \int p_{MLW} (t | θ) \, N (θ | {\hat{θ}}_{A}, {\hat{σ}}_{A}^{2}) \, dθ, \tag{4}$$

where $p_{MLW} (t | θ)$ is computed using the recursive approach of Hanson (1994) and Thissen et al. (1995). Thus, the MLW approach takes into account the variability inherent in the ability estimate by integrating $p_{MLW} (t | θ)$ with respect to the examinee’s estimated posterior distribution. Because the $N (θ | {\hat{θ}}_{A}, {\hat{σ}}_{A}^{2})$ distribution is supposed to provide a close approximation of the examinee posterior distribution (e.g., Chang & Stout, 1993), the estimates ${\hat{θ}}_{A}$ and ${\hat{σ}}_{A}^{2}$ were set equal to the posterior mode and posterior variance based on $A$ , respectively, in this article.² The integration in Equation (4) was approximated using Gauss–Hermite integration (e.g., Naylor & Smith, 1982). The standard normal distribution was used as the prior distribution on the examinee ability in the computation of the posterior mean and variance of the examinees. The passing probability for the examinee under the MLW approach is then estimated as

$$PP_{MLW} = \sum_{t = C - R_{A}}^{M_{\max}} p_{MLW} (t),$$

where $M_{\max}$ is the maximum possible raw score on $M$ . A few lines of R code (R Core Team, 2020) for applying the MLW approach to estimate the passing probability for an examinee on an incomplete and mixed-format mastery test are provided in Appendix B. The code uses the R package mirt (Chalmers, 2012) to fit an IRT model to the data and compute the posterior modes and standard deviations of the examinees, the R package irtplay (Lim, 2020) to implement the recursive approach of Hanson (1994) and Thissen et al. (1995), and the function “gauss.quad” in the R package statmod (Giner & Smyth, 2016) to implement the Gauss–Hermite integration.
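The structure of the MLW computation can be illustrated with a short Python sketch. It is not the Appendix B code: all function names and item parameters are hypothetical, each polytomous item is assumed to be scored 0, 1, …, m with unit weights, no scaling constant is used in the 3PLM, and the integral in Equation (4) is approximated on an equally spaced grid with normalized normal-density weights rather than by Gauss–Hermite integration.

```python
from math import exp
from statistics import NormalDist


def p_3pl(theta, a, b, c):
    """3PLM category probabilities [P(score 0), P(score 1)]."""
    p = c + (1 - c) / (1 + exp(-a * (theta - b)))
    return [1 - p, p]


def p_gpcm(theta, a, steps):
    """GPCM category probabilities for scores 0..len(steps)."""
    cum, logits = 0.0, [0.0]
    for d in steps:
        cum += a * (theta - d)
        logits.append(cum)
    top = max(logits)                          # subtract max for stability
    w = [exp(z - top) for z in logits]
    total = sum(w)
    return [x / total for x in w]


def score_dist(cat_probs):
    """Recursion extended to mixed formats: convolve each item's category
    probabilities into the distribution of the summed raw score."""
    dist = [1.0]
    for probs in cat_probs:
        new = [0.0] * (len(dist) + len(probs) - 1)
        for s, mass in enumerate(dist):
            for k, pk in enumerate(probs):
                new[s + k] += mass * pk
        dist = new
    return dist


def pp_mlw(theta_hat, sigma_hat, items, needed, n_grid=201, width=6.0):
    """Sum over t >= needed of p_MLW(t) from Equation (4), with the integral
    replaced by a weighted average over an equally spaced theta grid."""
    post = NormalDist(theta_hat, sigma_hat)
    lo = theta_hat - width * sigma_hat
    step = 2 * width * sigma_hat / (n_grid - 1)
    nodes = [lo + i * step for i in range(n_grid)]
    weights = [post.pdf(x) for x in nodes]
    total_w = sum(weights)
    pp = 0.0
    for x, w in zip(nodes, weights):
        dist = score_dist([item(x) for item in items])
        pp += (w / total_w) * sum(dist[max(needed, 0):])
    return pp


# Hypothetical missing portion M: one 3PLM item and one three-category GPCM item.
missing_items = [lambda th: p_3pl(th, 1.0, -1.0, 0.2),
                 lambda th: p_gpcm(th, 1.0, [-0.5, 0.5])]
# `needed` plays the role of C - R_A.
pp = pp_mlw(theta_hat=0.0, sigma_hat=0.5, items=missing_items, needed=2)
print(round(pp, 3))
```

As the posterior SD shrinks toward zero, the weighted average collapses onto the plug-in value computed at ${\hat{θ}}_{A}$ alone, which is the quantity the original LW approach uses.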

A New Approach Based on CTT

The approach of Hanson (1994) and Thissen et al. (1995) is an IRT-based way to compute $P (T = t | θ)$ , the probability of a raw score $t$ for an examinee on a mixed-format test for a given examinee ability $θ$ . Similarly, Lee (2007) and Lee et al. (2006) suggested a CTT-based approach to compute $P (T = t | π_{1}, π_{2}, \dots, π_{L})$ , the probability of a raw score $t$ for an examinee on a mixed-format test for a set of given CTT-based true scores $π_{1}, π_{2}, \dots, π_{L}$ .³ An approach that involves averaging out of the uncertainty involved in the estimated CTT-based true scores, somewhat in the same manner in which the uncertainty in the IRT-based ability estimate was integrated out in Equation (4), can be used to estimate the probability of the raw score $t$ on $M$ for an examinee as

$$P_{CTT} (T = t) = \int_{π_{1}, π_{2}, \dots, π_{L}} P (T = t | π_{1}, π_{2}, \dots, π_{L}) \, p (π_{1}, π_{2}, \dots, π_{L}) \, dπ_{1} \, dπ_{2} \dots dπ_{L}, \tag{5}$$

for an appropriate choice of $p (π_{1}, π_{2}, \dots, π_{L})$ . The passing probability under this approach is then estimated as

$$PP_{CTT} = \sum_{t = C - R_{A}}^{M_{\max}} P_{CTT} (T = t), \tag{6}$$

where $P_{CTT} (T = t)$ is computed based on all the items in $M$ . A more detailed description of the approach is provided in Appendix C. If a test includes only dichotomous items, then this approach can be considered as an extension of the BD approach.
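To make the averaging in Equation (5) concrete for the simplest case, the following Python sketch treats a test with only dichotomous items and a single true proportion-correct score $π$ . The choice of $p (π)$ here is purely an illustrative assumption and is not taken from Appendix C: $π$ is given a Beta distribution with parameters $R_{A} + 1$ and $n_{A} - R_{A} + 1$ , where $n_{A}$ is the number of items in $A$ , so that averaging the binomial distribution over $π$ yields a beta-binomial distribution. All function names are hypothetical.

```python
from math import comb, exp, lgamma


def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def p_ctt_dichotomous(t, n_missing, r_a, n_a):
    """P(T = t) on n_missing dichotomous items after averaging
    Binomial(n_missing, pi) over pi ~ Beta(r_a + 1, n_a - r_a + 1).
    The Beta choice is an assumption made only for illustration."""
    alpha, beta = r_a + 1.0, n_a - r_a + 1.0
    log_p = log_beta(t + alpha, n_missing - t + beta) - log_beta(alpha, beta)
    return comb(n_missing, t) * exp(log_p)


def pp_ctt_dichotomous(r_a, n_a, n_missing, cut):
    """Analogue of Equation (6): probability of at least cut - r_a points."""
    needed = max(cut - r_a, 0)
    return sum(p_ctt_dichotomous(t, n_missing, r_a, n_a)
               for t in range(needed, n_missing + 1))


# Hypothetical examinee: 5 of 10 observed items correct, 2 items missing, cut 6.
print(pp_ctt_dichotomous(r_a=5, n_a=10, n_missing=2, cut=6))
```

Compared with the BD approach, which plugs in $p_{S}$ as if it were known, this averaged distribution places more probability on extreme raw scores and so tends to give passing probabilities that are closer to .5 when few items were observed.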

Data Analysis

The performances of the two new approaches were compared with those of three approaches of Feinberg (2020) in two comparison studies: one using data simulated to look like the data set of Feinberg (2020) and another using a real data set from a mixed-format mastery test.

Comparison Using Simulated Data

While Feinberg (2020) used a data set from the 2015 USMLE in his comparisons, data simulated so as to be similar to those were used in this article to compare the approaches for estimating passing probabilities. Several data sets comprising scores of 42,413 examinees on 252 dichotomously scored items (the same numbers as in Feinberg, 2020) were simulated from the 3PLM. The true examinee abilities were simulated from a standard normal distribution. The passing rate on the USMLE was 85% in 2015 (USMLE, 2020). Therefore, the passing score on the examinee ability scale ( $θ_{C}$ ) was assumed to be the 15th percentile of the standard normal distribution, which is equal to −1.04. The true pass–fail status of each simulated examinee in each data set was “pass” or “fail” depending on whether the examinee’s true ability was larger or smaller than this passing score. To ensure that the item difficulties on the test match the passing score on average, the true item-difficulty parameters were simulated from a normal distribution with mean −1 and variance 1.⁴ The true item slope and guessing parameters were simulated from uniform distributions with ranges (0.5, 1.5) and (0.0, 0.3), respectively. One set of 252 true item parameters was drawn once and used to simulate all the data sets. The passing raw score ( $C$ ) was assumed to be the integer that is immediately larger than the expected value of the raw score for ability $θ_{C}$ on a 252-item test whose item parameters are identical to the (aforementioned) true item parameters. The value of $C$ turned out to be equal to 124 for the simulations here.
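The simulation design described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the code used in the article: the seed and all names are arbitrary, no scaling constant is included in the 3PLM, and only one examinee is simulated. Because the item parameters are drawn afresh, the resulting cut score need not equal 124.

```python
import random
from math import exp, floor
from statistics import NormalDist

rng = random.Random(2020)        # arbitrary seed (assumption)
N_ITEMS = 252

# Passing score on the ability scale: 15th percentile of N(0, 1).
theta_cut = NormalDist().inv_cdf(0.15)

# True item parameters (slope a, difficulty b, guessing c), drawn once.
items = [(rng.uniform(0.5, 1.5),     # a ~ Uniform(0.5, 1.5)
          rng.gauss(-1.0, 1.0),      # b ~ Normal(mean -1, variance 1)
          rng.uniform(0.0, 0.3))     # c ~ Uniform(0.0, 0.3)
         for _ in range(N_ITEMS)]


def p_correct(theta, a, b, c):
    """3PLM probability of a correct response."""
    return c + (1 - c) / (1 + exp(-a * (theta - b)))


# Raw cut score C: the integer immediately larger than the expected
# raw score of an examinee whose ability equals theta_cut.
expected_at_cut = sum(p_correct(theta_cut, a, b, c) for a, b, c in items)
cut_raw = floor(expected_at_cut) + 1


def simulate_examinee():
    """Draw a true ability, 252 item scores, and the true pass status."""
    theta = rng.gauss(0.0, 1.0)
    scores = [int(rng.random() < p_correct(theta, a, b, c))
              for a, b, c in items]
    return theta, scores, theta > theta_cut


theta, scores, passed = simulate_examinee()
print(round(theta_cut, 2), cut_raw, sum(scores), passed)
```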

A total of 100 replications of the following steps were performed for $S$ = 10, 11, …, 241, 242, where $S$ represents the number of items out of 252 that the examinees completed before administrative problems: (a) simulate a data set from the 3PLM, (b) randomly select 5% of the examinees in the data set and mark their scores as missing on the last $252 - S$ items (the choice of 5% of the examinees in this step was inspired by the fact that about 5% of the examinees typically have some missing scores for the real data set considered in the next section), (c) fit the 3PLM and the CTT model of Lee (2007) and Lee et al. (2006) to the subset of the data comprising the 95% of the examinees with no missing scores, and (d) estimate the pass–fail statuses of the 5% of the examinees using the models fitted in the previous step and the five approaches and compare them with the corresponding true pass–fail statuses.⁵ Because no information from prior years was available, the Bayesian approach of Feinberg (2020) was not considered in the comparison. As in Feinberg (2020), the estimated and true pass–fail statuses were used to compute the percentages of accurate classification, misclassification, and indeterminate classification for all examinees, for borderline pass–fail examinees (whose raw score on the whole test was within half an SD of the passing raw score, where the SD is computed over all examinees in the sample), and for extreme examinees (whose raw scores were below the 2.5th percentile or above the 97.5th percentile of the distribution of the raw scores for the sample) over all replications. Thus, for example, the percentage of accurate classifications for all examinees was computed from $100 \times 42{,}413 \times 0.05 \approx 212{,}065$ examinees (which ensured that the standard errors of the percentages were always smaller than 0.1).

Figure 1 shows the percentages of accurate classification (left column), indeterminate classification (middle column), and misclassification (right column) along the vertical axis for all approaches for various values of $S$ (plotted along the horizontal axis). The top, middle, and bottom rows of plots, respectively, show results for all examinees (like Figure 2 of Feinberg, 2020), borderline examinees (like Figure 3 of Feinberg, 2020), and extreme examinees (like Figure 4 of Feinberg, 2020). The values for the BD, LW, MLW, CTT, and MSE approaches are shown using solid, short-dashed, dotted, long-dashed, and dotted-and-dashed lines, respectively. For convenience of viewing, the lines were smoothed using the loess function in R software (R Core Team, 2020) before plotting.

Figure 1. Comparison of the accuracy, indeterminate, and misclassification percentages of the five approaches for estimating probabilities of passing for simulated data.

Figure 2. Comparison of the expected losses of the five approaches for three values of $ℓ$ for the simulated data.

Figure 3. Comparison of the accuracy, indeterminate, and misclassification percentages of the five approaches for estimating probabilities of passing for the real data set.

Figure 4. Comparison of the expected losses of the five approaches for three values of $ℓ$ for the real data example.

The findings from Figure 1 regarding the three approaches of Feinberg (2020) are mostly in agreement with those from Figures 2 to 4 of Feinberg (2020). For example, (a) the curves for any group of examinees for any approach in Figure 1 look very much like the corresponding curves in Feinberg (2020), which indicates that the comparison of Feinberg (2020) was successfully replicated in this article in spite of the use of simulated (instead of real) data; (b) as in Figures 2 to 4 of Feinberg (2020), Figure 1 shows that the accuracy and misclassification rates are the largest for the BD and LW approaches and smallest for the MSE approach, the accuracy rates of all the approaches are the largest for the extreme examinees and smallest for the borderline examinees, and the misclassification rates of all the approaches are the smallest for the extreme examinees and largest for the borderline examinees.

Regarding the comparative performance of the two new approaches (the MLW and CTT approaches) and the approaches of Feinberg (2020), Figure 1 indicates that the misclassification rates of both new approaches are very close to zero (like that of the MSE approach) for all values of $S$ and all groups of examinees. The figure also shows that the accuracy rates of the two new approaches are smaller than those of the BD and LW approaches for $S$ smaller than about 150, but are as large as or larger than those of the BD and LW approaches for larger values of $S$ . Finally, the indeterminate rates of the new approaches are large for $S$ smaller than about 150, but small for larger values of $S$ . Thus, the new approaches seem to combine the good features of the approaches of Feinberg (2020). While the new approaches have low misclassification rates, as does the MSE approach, their accuracy rates are large for large values of $S$ , as is the case with the BD and LW approaches of Feinberg (2020). Also, the decisions from the new approaches seem reasonable, especially for high-stakes tests, in the sense that unless an examinee has completed a large part of the test, the new approaches are conservative and leave many examinees unclassified (hence their indeterminate rates are large), whereas the BD and LW approaches are liberal enough to classify more examinees at the expense of a large misclassification rate. When an examinee has completed a large part of the test, the new approaches have small indeterminate rates, misclassification rates of near 0%, and large accuracy rates.

Of the two new approaches, the MLW approach seems slightly more favorable than the CTT approach; while the accuracy, indeterminate, and misclassification rates for the two approaches are very close for all examinees and for extreme examinees, the MLW approach has slightly larger accuracy rates and slightly smaller indeterminate and misclassification rates than the CTT approach for the borderline examinees.

While Figure 1 provides detailed information about the performance of the approaches, a single index combining the information from the accuracy, indeterminate, and misclassification rates may make it easier for practitioners to compare the approaches. Therefore, a measure, referred to as the expected loss, was defined as

$\text{Expected Loss} = \text{Percent of Misclassification} \times ℓ + 0.5 \times \text{Percent of Indeterminate Decisions}.$

The measure is motivated by the idea of loss functions in decision theory (e.g., Ferguson, 1967) and reflects the idea that the losses resulting from an accurate decision, an indeterminate decision, and a misclassification are 0, 0.5, and $ℓ$, respectively.⁶ While the value of $ℓ$ may be assumed to be small, such as 1, for low-stakes tests, it may be much larger for high-stakes tests, where a misclassification can result in the testing company having to pay compensation (if the examinee fails and later wins a legal battle with the company) or in the candidate causing harm to society (if the examinee passes the test and, for example, becomes a poor teacher).
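As a minimal sketch of how the expected loss behaves, the measure can be computed directly from the misclassification and indeterminate percentages. The function name `expected_loss` and the two sets of rates below are hypothetical and chosen only for illustration; they are not results from the article.

```python
def expected_loss(pct_misclassified: float, pct_indeterminate: float, ell: float) -> float:
    """Expected loss with losses 0 (accurate), 0.5 (indeterminate), and ell (misclassified)."""
    return pct_misclassified * ell + 0.5 * pct_indeterminate


# Hypothetical rates (in percent) for two made-up approaches; NOT taken from the article.
liberal = {"pct_misclassified": 10.0, "pct_indeterminate": 0.0}
conservative = {"pct_misclassified": 0.0, "pct_indeterminate": 30.0}

for ell in (1, 5, 20):
    print(ell, expected_loss(ell=ell, **liberal), expected_loss(ell=ell, **conservative))
```

With these made-up rates, the approach that classifies everyone has the smaller expected loss for $ℓ$ = 1, while the approach that leaves some examinees unclassified has the smaller expected loss for $ℓ$ = 5 and $ℓ$ = 20, because its loss does not grow with $ℓ$.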

Figure 2 shows the values of the expected loss for $ℓ$ = 1 (left panels), $ℓ$ = 5 (middle panels), and $ℓ$ = 20 (right panels) for the five approaches for various values of $S$ for all examinees (topmost panels), borderline examinees (middle panels), and extreme examinees (bottom panels). The figure shows that for small values of $ℓ$ (that is, for low-stakes tests) and $S$ smaller than 150, the BD and LW approaches are optimal because they lead to the smallest expected loss for all examinees and extreme examinees. However, for larger values of $ℓ$ (that is, for high-stakes tests) or $S$ larger than 150 (that is, fewer missing item scores), the CTT and MLW approaches are optimal.

While Figures 1 and 2 provide a comparison of the approaches at an overall level, Table 1 is intended to provide a deeper look at how the LW and MLW approaches function and differ for individual examinees. The table shows, for six simulated examinees, the true ability ( $θ_{Tr}$ ), the raw score on the whole test ( $R_{W}$ ), the proportional score on the whole test ( $p_{W}$ ), and the true pass–fail status ( $S t_{Tr}$ ). The table also shows, for $S = 10$ , the examinees’ $R_{S}$ , $p_{S}$ , ${\hat{θ}}_{S}$ , ${\hat{σ}}_{S}$ , $P P_{LW}$ , the estimated pass–fail status as “P” (pass), “F” (fail), or “I” (indeterminate) from the LW approach ( $S t_{LW}$ ), $P P_{MLW}$ , and the estimated pass–fail status from the MLW approach ( $S t_{MLW}$ ). The first two of these examinees are extreme examinees and the last four are borderline pass–fail examinees. The first examinee has a true ability of −2.14, actually failed the test, performed poorly on the first 10 items, and, consequently, is estimated to fail the test according to both the LW and MLW approaches. The second examinee has a true ability of 2.21, actually passed the test, performed very well on the first 10 items, and, consequently, is estimated to pass the test according to both the LW and MLW approaches. The third and fourth examinees are very close with respect to their true abilities and both actually passed the test. However, the third examinee performed considerably better than the fourth on the first 10 items (raw score of 6 versus 3 on those items), and ${\hat{θ}}_{S}$ is considerably larger for the third examinee than for the fourth. Consequently, the estimated probability of passing from the LW approach is .99 for the third examinee and .00 for the fourth examinee; that is, according to the LW approach, the third examinee is estimated to pass the test (the correct decision), but the fourth examinee is estimated to fail it (the incorrect decision).
On the other hand, the MLW approach takes into account the fact that the estimated SD of the estimated ability is as large as 0.58 (so that the passing $θ$ -score of −1.04 is within the 95% confidence band ${\hat{θ}}_{S} \pm 2 {\hat{σ}}_{S}$ ) and estimates the passing probabilities as .90 and .31, respectively; consequently, the passing status is indeterminate for the MLW approach for both examinees. Similarly, the true abilities of the fifth and sixth examinees are very close and both of these examinees actually failed the test. However, the fifth examinee performed considerably worse than the sixth on the first 10 items, and, consequently, according to the LW approach, the fifth examinee is estimated to fail the test (the correct decision), but the sixth examinee is estimated to pass it (the incorrect decision). For the MLW approach, in contrast, the passing status is indeterminate for both of these examinees.

Table 1.

Some Details Regarding Six Simulated Examinees.

$θ_{Tr}$	$R_{W}$	$p_{W}$	$S t_{Tr}$	$R_{S}$	$p_{S}$	${\hat{θ}}_{S}$	${\hat{σ}}_{S}$	$P P_{LW}$	$S t_{LW}$	$P P_{MLW}$	$S t_{MLW}$
−2.14	69	.27	F	0	.0	−2.3	.62	.00	F	.01	F
2.21	243	.96	P	10	1	1.3	.71	1.00	P	1.00	P
−.90	132	.52	P	6	.6	−0.3	.58	.99	P	.90	I
−.88	133	.53	P	3	.3	−1.3	.58	.00	F	.31	I
−1.07	125	.50	F	4	.4	−1.0	.59	.00	F	.54	I
−1.07	125	.50	F	7	.7	0.0	.60	1.00	P	.97	I
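The confidence band mentioned in the discussion of Table 1 can be checked directly from the tabled values. The sketch below is only an arithmetic check of whether the passing $θ$ -score of −1.04 lies within ${\hat{θ}}_{S} \pm 2 {\hat{σ}}_{S}$ for each row; it is not the MLW decision rule itself, and the helper name `band_contains_cut` is hypothetical.

```python
# Rows of Table 1: (theta_hat_S, sigma_hat_S, St_MLW)
rows = [
    (-2.3, 0.62, "F"),
    (1.3, 0.71, "P"),
    (-0.3, 0.58, "I"),
    (-1.3, 0.58, "I"),
    (-1.0, 0.59, "I"),
    (0.0, 0.60, "I"),
]
THETA_PASS = -1.04  # passing theta-score quoted in the text


def band_contains_cut(theta_hat: float, sigma_hat: float, cut: float = THETA_PASS) -> bool:
    """True if the cut lies within theta_hat +/- 2 * sigma_hat."""
    return theta_hat - 2 * sigma_hat <= cut <= theta_hat + 2 * sigma_hat


for number, (theta_hat, sigma_hat, status) in enumerate(rows, start=1):
    print(number, band_contains_cut(theta_hat, sigma_hat), status)
```

For these six examinees, the band contains the passing score exactly for the four examinees whose MLW status is indeterminate, which agrees with the explanation given for the third and fourth examinees.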

Comparison Using Real Data

A data set comprising scores of about 6,000 examinees on one recent form of a high-stakes, mixed-format mastery test was used in a comparison of the approaches. The test is computerized, measures the knowledge, skills, and abilities of the examinees in a language, and comprises 64 dichotomously scored MC items and eight polytomously scored items. The polytomous items are all constructed-response (CR) items; four involve writing and four involve speaking. Each polytomous item is scored 0, 1, 2, 3, or 4. Thus, the maximum possible raw score on the test is 64 + 8 × 4 = 96. No item scores were missing due to administrative problems in the available data set, which is a subset of the data set for an entire administration of the test. For the data set, the average raw score is 69 and the reliability of the raw score is .91; the average percent correct scores on the MC and CR items are 76 and 61, respectively; and the average interitem correlations among the MC and CR items are .11 and .57, respectively. The raw score is equated to the raw score on a reference form and then converted to a pass–fail status using a passing score recommended by a standard-setting committee. The passing score on the raw score scale ( $C$ ) for this particular form was obtained and used in the comparison. A unidimensional IRT model comprising the 3PLM for the MC items and the GPCM for the CR items was fitted to the data set. A model-fit analysis using the generalized $S - χ^{2}$ item-fit statistic (e.g., T. Kang & Chen, 2008), a statistic for testing local dependence (Maydeu-Olivares & Liu, 2015), and the Poly-DIMTEST statistic (e.g., Stout, 1987) indicated that the IRT model fit the data set adequately.
The passing score on the examinee ability scale ( $θ_{C}$ ) was equal to about −0.24.⁷ This passing score and the actual raw scores of the examinees on the whole test were used to compute the actual/true pass–fail status of each examinee in the data set. The actual passing rate was about 60% on the form analyzed here.

On rare occasions, computer problems lead to missing item scores for the mastery test. In addition, on rare occasions, the speaking items are unscorable due to background noise or poor audio quality. Thus, various patterns of missing item scores can be observed for the test. Consequently, 11 patterns of missing item scores, with varying extents of missingness, were considered in the comparison study. Each pattern is defined by a pair of percentages: the percent of missing MC item scores and the percent of missing CR item scores. The patterns are shown in Table 2. For example, the first pattern corresponds to 100% missing MC item scores and 50% missing CR item scores; this pattern represents a severe extent of missingness, and all the approaches are expected to perform relatively poorly for it. The reliability of the total score on the available items is shown in the last column of the table.⁸ Roughly, as one goes down the table, the proportion of missingness decreases, the reliability increases, and the approaches are expected to estimate the passing probabilities more accurately.

Table 2.

The 11 Patterns of Missing Item Scores Considered in the Comparison Using Real Data.

Pattern	Percent missing MC item scores	Percent missing CR item scores	Reliability
1	100	50	.78
2	50	100	.78
3	25	100	.84
4	100	0	.85
5	0	100	.88
6	50	0	.88
7	0	50	.89
8	25	0	.89
9	0	25	.89
10	0	12.5	.90
11	10	0	.90

Note. MC = multiple-choice; CR = constructed-response.

A total of 100 replications of the following steps were performed for each pattern of missing item scores in Table 2: (a) randomly select 5% of the examinees in the data set and mark their scores as missing on the items determined by the missing score pattern (for example, when the percentages of missing CR and MC item scores are 100 and 50, respectively, a random set of 50% of the MC items, which varies over the examinees, is drawn for each of the selected examinees, and the scores on these randomly drawn items and on all CR items are treated as missing for these examinees); (b) fit the IRT model (that is, a combination of the 3PLM and GPCM) and the CTT model of Lee (2007) and Lee et al. (2006) to the subset of the data comprising the 95% of examinees with no missing scores; and (c) estimate the pass–fail statuses of the selected 5% of examinees using the models fitted in the previous step and the five imputation approaches, and compare them to their actual pass–fail statuses. The BD approach of Feinberg (2020), which is meant to be used for dichotomous items, was applied with a slight modification to accommodate the polytomous items: the success probability $p_{A}$ (which plays the role that $p_{S}$ plays in Equation 1) was computed using Equation 3, and the probability of a raw score $t$ on $M$ was computed as the probability of the value $t$ in a binomial distribution with $M_{\max}$ trials and success probability $p_{A}$ . The application of the LW approach of Feinberg (2020) involved the computation of $p_{LW} (t | {\hat{θ}}_{A})$ using the combination of the 3PLM and GPCM and the recursive approach of Hanson (1994) and Thissen et al. (1995).
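Step (a) of the replication procedure can be sketched as follows. This is only an illustration under stated assumptions: the function name `mask_scores`, the use of `None` to represent a missing score, the rounding of item counts, and the toy data are all hypothetical, and the model fitting in steps (b) and (c) is not shown.

```python
import random


def mask_scores(scores, mc_items, cr_items, pct_mc_missing, pct_cr_missing,
                frac_examinees=0.05, seed=0):
    """Return (masked copy of scores, sorted indices of the masked examinees).

    scores: list of per-examinee lists of item scores.
    mc_items, cr_items: lists of column indices of the MC and CR items.
    Missing scores are represented by None (an assumption of this sketch).
    """
    rng = random.Random(seed)
    n_masked = round(frac_examinees * len(scores))
    masked_ids = sorted(rng.sample(range(len(scores)), n_masked))
    n_mc = round(pct_mc_missing / 100 * len(mc_items))
    n_cr = round(pct_cr_missing / 100 * len(cr_items))
    out = [list(row) for row in scores]
    for i in masked_ids:
        # The set of items with missing scores is drawn separately for each examinee.
        for j in rng.sample(mc_items, n_mc) + rng.sample(cr_items, n_cr):
            out[i][j] = None
    return out, masked_ids


# Toy data with made-up scores: 200 examinees, 64 MC items (0/1), 8 CR items (0-4).
toy_rng = random.Random(1)
mc = list(range(64))
cr = list(range(64, 72))
data = [[toy_rng.randint(0, 1) for _ in mc] + [toy_rng.randint(0, 4) for _ in cr]
        for _ in range(200)]

# Pattern 2 of Table 2: 50% of MC and 100% of CR item scores missing.
masked, ids = mask_scores(data, mc, cr, pct_mc_missing=50, pct_cr_missing=100)
id_set = set(ids)
complete = [row for k, row in enumerate(masked) if k not in id_set]  # input to step (b)
```

Repeating the call with different seeds would give the independent replications; the rows in `complete` correspond to the 95% of examinees to whom the models are fitted.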

As in the simulations, the accuracy, indeterminate, and misclassification rates were computed for all examinees, borderline pass–fail examinees (whose raw score on the whole test was within half an SD of the raw passing score, where the SD was computed for the full examinee sample), and extreme examinees (whose raw scores were below the 2.5th percentile or above the 97.5th percentile of the distribution of the raw scores for the sample).
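The two examinee groups can be flagged as in the sketch below. The function name `examinee_groups`, the use of the population SD, the percentile method (the default of `statistics.quantiles`), the toy scores, and the passing score of 60 are assumptions made for illustration; the article does not specify these details here.

```python
import statistics


def examinee_groups(raw_scores, passing_score):
    """Return (borderline_flags, extreme_flags) for a list of whole-test raw scores."""
    sd = statistics.pstdev(raw_scores)  # SD over the full examinee sample
    # With n=40, quantiles() returns 39 cut points at 2.5%, 5%, ..., 97.5%.
    cuts = statistics.quantiles(raw_scores, n=40)
    low, high = cuts[0], cuts[-1]
    borderline_flags = [abs(r - passing_score) <= 0.5 * sd for r in raw_scores]
    extreme_flags = [r < low or r > high for r in raw_scores]
    return borderline_flags, extreme_flags


# Toy example: one examinee at each raw score from 0 to 96, hypothetical passing score of 60.
raw = list(range(97))
borderline, extreme = examinee_groups(raw, passing_score=60)
print(sum(borderline), sum(extreme))
```

The accuracy, indeterminate, and misclassification rates for each group are then simply the corresponding percentages computed over the flagged examinees.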

Figure 3 shows the percentages of accurate classification, indeterminate classification, and misclassification for all the 11 missing score patterns for all examinees (top row), for borderline pass–fail examinees (middle row), and extreme examinees (bottom row) for the five approaches.

The lines in Figure 3 are not as smooth as those in Figure 1 because, unlike in Figure 1, the numbers of items with missing scores do not form a continuum in Figure 3. However, several patterns in Figure 3 are similar to those in Figure 1. For example, (a) as one goes from the left to the right of any panel, the accuracy rates mostly increase and the indeterminate and misclassification rates mostly decrease for any approach; (b) for all the approaches, the accuracy rates are the largest for the extreme examinees and smallest for the borderline examinees, and the indeterminate and misclassification rates are the smallest for the extreme examinees and largest for the borderline examinees; (c) the accuracy rates are the largest for the BD and LW approaches for more missing items, but are the largest for all approaches except the MSE approach for fewer missing items; (d) the misclassification rates are the largest for the BD and LW approaches and are close to zero for the new approaches and the MSE approach for all cases. The misclassification rates of the BD approach are somewhat larger on average than those of the other approaches; this could be because the proportional score on the MC items is not a good predictor of that on the CR items, and vice versa, whereas the BD approach treats these quantities as equivalent.

Figure 4, like Figure 2, shows the values of the expected loss for $ℓ$ = 1 (left panels), $ℓ$ = 5 (middle panels), and $ℓ$ = 20 (right panels) for the five approaches for the 11 missing score patterns for all examinees (topmost panels), borderline examinees (middle panels), and extreme examinees (bottom panels). The figure shows that the LW and BD approaches are favorable for $ℓ$ = 1, but the MLW approach is favorable for larger values of $ℓ$ .

Discussion on the Comparison of the Approaches

The results from the simulated and real data sets seem to favor the use of the two new approaches (the MLW and CTT approaches) for high-stakes mastery tests. Feinberg (2020) stated that a high-stakes testing program would likely want to create a policy that resulted in close to, if not exactly, an expected 0% misclassification rate at the cost of additional indeterminate classifications. The new approaches are exactly the types of approaches that Feinberg (2020) called for because of their small misclassification rates, modestly large indeterminate classification rates, and large accuracy rates when only a few item scores are missing. For tests for which misclassifications do not have serious consequences, the BD or LW approaches of Feinberg (2020), which are computationally simpler, may be preferred for the examinees who completed a small part of the test (if the test administrators decide to report scores for such examinees).

Conclusions

There is a lack of research on estimating the passing probability of examinees with incomplete data on mastery tests. This article presents two new approaches to estimate the aforementioned probabilities and demonstrates the benefit of the approaches compared with the existing approaches for high-stakes mastery tests—the misclassification rates for the new approaches are shown to be very close to zero in two comparison studies based on simulated and real data sets. The superior performance of the new approaches compared with the existing approaches is partially due to the fact that the new approaches take into account the uncertainty inherent in the estimates of the examinees’ true abilities by averaging out the uncertainty.

The data sets considered in this article were large. We repeated our analysis with smaller subsamples and found that the comparative performance of the approaches remains very similar to that reported in the article for sample sizes of about 1,000 or more. In the simulated and real data examples above, the items with missing scores are, on average, as difficult as those with no missing scores. Appendix D includes a simulation in which the items with missing scores are more difficult on average than those with no missing scores. The MLW and LW approaches perform well in that simulation because they are based on IRT and take into account the difficulties of the items with missing scores. However, the BD and CTT approaches perform poorly because they cannot take the item difficulties into account.

While there is no guarantee that the results found for the new approaches in this article will generalize to all data sets, the strong theoretical basis of the new approaches and the results from the simulated and real data sets presented above provide convincing evidence in favor of the approaches. A testing program that is considering an application of the new approaches to their test should evaluate the properties of the approaches for their data in the same manner as in the above comparison using real data. In addition, the testing programs applying the MLW approach should ensure that the IRT model fits the data adequately, the item parameter estimates are accurate, and the examinee population has not shifted.

The new approaches are more computation-intensive compared with those suggested by Feinberg (2020), but they did not require more than a couple of seconds for one replication for the real data set considered in this article. Considering the potentially serious consequences of misclassifications, the extra computation required in the new approaches is probably worth it, especially for high-stakes mastery tests.

The findings of this article are especially important given the several recent instances of technical difficulty during large-scale tests (e.g., Byrne, 2017; Sinharay et al., 2014; Sinharay et al., 2015) and the potential for future technical difficulties due to poor internet access on at-home tests (e.g., Michel, 2020), because technical difficulties are a major cause of missing item scores.

The findings of this article have the practical implication that for tests (such as professional licensure tests) for which misclassifications have serious consequences, the new approaches are preferable for estimating the passing probability of examinees with incomplete data on mastery tests. However, for tests for which misclassifications do not have serious consequences, the BD or LW approaches of Feinberg (2020) may be preferred for the examinees who completed a small part of the test (if the test administrators decide to report scores for such examinees).

The new approaches were based on classical measurement techniques: one was based on IRT and the other on CTT. It is possible to use other approaches, especially those based on classical statistical methods such as linear regression and logistic regression, in future research. In addition, further research could estimate passing probabilities using multiple imputation approaches, such as those suggested by Schafer (1997) and Raghunathan et al. (2001) and applied to measurement research by Edwards and Finch (2018) and Xiao and Bulut (2020). Another potential line of related research is the application of data mining methods such as random forests and gradient boosting machines (e.g., Hastie et al., 2009; Sinharay, 2016) to estimate passing probabilities.

The choice of $α$ in determining a pass–fail classification would depend on the test administrators. Because those with no missing item scores typically have to obtain a score above the cut score to pass (thus, the requirement for them effectively is a passing probability larger than .5), one could argue that to be fair to the examinees with missing item scores, $α$ should be set equal to 1 and anyone with a passing probability larger than .5 should be classified as “pass.” However, such a strategy effectively acknowledges that the items with missing scores are redundant and a classification decision can be made without those scores and may be unfair to those without missing scores (they may wonder why they had to answer some redundant items). On the other hand, one could argue that because the examinee may have performed differently on the items with missing scores compared with the rest of the test, the passing probability has to be considerably larger than .5 for the examinee to be classified as a “pass.”
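The role of $α$ can be illustrated with a short sketch. The decision rule is defined earlier in the article and is not restated in this section, so the rule coded below (pass if the passing probability exceeds $1 - α/2$, fail if it is below $α/2$, and indeterminate otherwise) is an assumption, as are the function name `classify` and the value $α$ = .05. The assumed rule is at least consistent with the statement above that $α$ = 1 classifies anyone with a passing probability larger than .5 as a pass.

```python
def classify(passing_probability: float, alpha: float) -> str:
    """Return "P", "F", or "I" under the ASSUMED rule with thresholds alpha/2 and 1 - alpha/2."""
    if passing_probability > 1 - alpha / 2:
        return "P"
    if passing_probability < alpha / 2:
        return "F"
    return "I"


# MLW passing probabilities of the six examinees in Table 1.
pp_mlw = [0.01, 1.00, 0.90, 0.31, 0.54, 0.97]
print([classify(p, alpha=0.05) for p in pp_mlw])  # conservative choice of alpha
print([classify(p, alpha=1.0) for p in pp_mlw])   # threshold of .5
```

With the assumed $α$ = .05, this rule reproduces the MLW statuses listed in Table 1, which is consistent with, but does not confirm, the assumption. Setting $α$ = 1 removes almost all indeterminate decisions, which illustrates the trade-off discussed above.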

The research presented in this article can be extended in several additional ways. First, the comparison of approaches can be performed for more simulated and real data sets including data from different types of tests. Second, there is a lack of comparison of approaches for estimating probabilities of passing when the item scores are missing not at random (MNAR; e.g., Little & Rubin, 2002). Only missing at random (MAR) data were considered in this article, as well as in Feinberg (2020). However, recent research on other missing data problems for item-response data by, for example, Xiao and Bulut (2020) and Sinharay (2021), implies that the accuracy of the reported pass–fail statuses from the approaches considered in this article will be smaller (than those found in this article) for MNAR data, but probably not by much as long as the number of missing item scores is not too large. Also, the comparative performance of the approaches for MNAR data is expected to be similar to that for MAR data—limited simulations (whose results are not reported here and can be obtained from the author) supported this claim. Third, while this article focuses on tests that report pass and fail classifications, it is possible to extend the research to tests like AP that report multiple classifications. Finally, while this article focuses on estimating passing probabilities, it is possible to estimate the pass–fail classifications themselves in future research.

Author’s Note

Any opinions expressed in this publication are those of the author and not necessarily of Educational Testing Service.

Declaration of Conflicting Interests

The author declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author received no financial support for the research, authorship, and/or publication of this article.

ORCID iD

Sandip Sinharay

References

American Educational Research Association, American Psychological Association, & National Council for Measurement in Education. (2014). Standards for educational and psychological testing. American Educational Research Association.

Birnbaum (1968). Some latent trait models and their use in inferring an examinee’s ability. In F. M. Lord & M. R. Novick (Eds.), Statistical theories of mental test scores (pp. 397-479). Addison-Wesley.

Byrne, M. R. (2017, October 23). Decisions, concerns, and questions pertaining to two 2017 statewide assessment events involving Algebra I and English II EOCs of the Missouri assessment program [Report submitted to Governor Eric Grietens]. https://www.moagainstcommoncore.com/2017EOCAssessmentIssues-10-23-17.pdf

Cetin-Berber, D. D., Sari, H. I., & Huggins-Manley, A. C. (2019). Imputation methods to deal with missing responses in computerized adaptive multistage testing. Educational and Psychological Measurement, 79(3), 495-511. https://doi.org/10.1177/0013164418805532

Chalmers, R. P. (2012). mirt: A multidimensional item response theory package for the R environment. Journal of Statistical Software, 48(6), 1-29. https://doi.org/10.18637/jss.v048.i06

Chang, H. H., & Stout (1993). The asymptotic posterior normality of the latent trait in an IRT model. Psychometrika, 58(1), 37-52. https://doi.org/10.1007/BF02294469

De Ayala, R. J. (2009). The theory and practice of item response theory. Guilford Press.

De Ayala, R. J., Plake, B. S., & Impara, J. C. (2001). The impact of omitted responses on the accuracy of ability estimation in item response theory. Journal of Educational Measurement, 38(3), 213-234. https://doi.org/10.1111/j.1745-3984.2001.tb01124.x

Educational Testing Service. (2020). The Praxis test information bulletin 2020-21.

Edwards, J. M., & Finch, W. H. (2018). Recursive partitioning methods for data imputation in the context of item response theory: A Monte Carlo simulation. Psicológica Journal, 39(1), 88-117. https://doi.org/10.2478/psicolj-2018-0005

Enders, C. K. (2010). Applied missing data analysis. Guilford Press.

Feinberg (2020). Estimating classification decisions for incomplete tests. Educational Measurement: Issues and Practice. Advance online publication. https://doi.org/10.1111/emip.12412

Ferguson, T. S. (1967). Mathematical statistics: A decision theoretic approach. Academic Press.

Finch (2008). Estimation of item response theory parameters in the presence of missing data. Journal of Educational Measurement, 45(3), 225-245. https://doi.org/10.1111/j.1745-3984.2008.00062.x

Giner, & Smyth, G. K. (2016). statmod: Probability calculations for the inverse Gaussian distribution. R Journal, 8(1), 339-351. https://doi.org/10.32614/RJ-2016-024

Glas, C. A. W., & Pimentel, J. L. (2008). Modeling nonignorable missing data in speeded tests. Educational and Psychological Measurement, 68(6), 907-922. https://doi.org/10.1177/0013164408315262

Graham, J. W. (2009). Missing data analysis: Making it work in the real world. Annual Review of Psychology, 60, 549-576. https://doi.org/10.1146/annurev.psych.58.110405.085530

Graham, J. W. (2012). Missing data. Springer.

Hanson, B. A. (1994). Extension of Lord-Wingersky algorithm to computing test scores for polytomous items [Unpublished manuscript].

Hastie, Tibshirani, & Friedman, J. H. (2009). The elements of statistical learning: Data mining, inference, and prediction. Springer.

Holman, & Glas, C. A. W. (2005). Modelling non-ignorable missing-data mechanisms with item response theory models. British Journal of Mathematical and Statistical Psychology, 58(1), 1-17. https://doi.org/10.1111/j.2044-8317.2005.tb00312.x

Huisman, & Molenaar, I. W. (2001). Imputation of missing scale data with item response models. In Boomsma, M. A. J. van Duijn, & T. A. B. Snijders (Eds.), Essays on item response theory (pp. 221-244). Springer. https://doi.org/10.1007/978-1-4613-0169-1_13

Kang, & Chen, T. T. (2008). Performance of the generalized S-χ² item-fit index for polytomous IRT models. Journal of Educational Measurement, 45(4), 391-406. https://doi.org/10.1111/j.1745-3984.2008.00071.x

Kang, & Lee, W.-C. (2011). Comparison of IRT linking and equating methods with mixed-format tests. In M. J. Kolen & W.-C. Lee (Eds.), Mixed-format tests: Psychometric properties with a primary focus on equating (pp. 47-76). University of Iowa.

Kohler, Pohl, & Carstensen, C. H. (2017). Dealing with item nonresponse in large-scale cognitive assessments: The impact of missing data methods on estimated explanatory relationships. Journal of Educational Measurement, 54(4), 397-419. https://doi.org/10.1111/jedm.12154

Kolen, M. J., & Lee, W.-C. (2011). Psychometric properties of scores on mixed-format tests. Educational Measurement: Issues and Practice, 30(2), 15-24. https://doi.org/10.1111/j.1745-3992.2011.00201.x

Lee, W.-C. (2007). Multinomial and compound multinomial error model for tests with complex item scoring. Applied Psychological Measurement, 31(4), 255-274. https://doi.org/10.1177/0146621606294206

Lee, W.-C., Wang, Kim, & Brennan, R. L. (2006). A strong true-score model for test scores based on polytomous items (CASMA Research Report No. 16). University of Iowa.

Lim (2020). irtplay: Unidimensional item response theory modeling. R package Version 1.6.2.

Little, R. A., & Rubin (2002). Statistical analyses with missing data (2nd ed.). John Wiley. https://doi.org/10.1002/9781119013563

Lord, F. M., & Wingersky, M. S. (1984). Comparison of IRT true-score and equipercentile observed-score “equatings”. Applied Psychological Measurement, 8(4), 453-461. https://doi.org/10.1177/014662168400800409

Maydeu-Olivares & Liu (2015). Item diagnostics in multivariate discrete data. Psychological Methods, 20(2), 276-292. https://doi.org/10.1037/a0039015

Michel, R. S. (2020). Remotely proctored K-12 high stakes standardized testing during COVID-19: Will it last? Educational Measurement: Issues and Practice, 39(3), 28-30. https://doi.org/10.1111/emip.12364

Muraki (1992). A generalized partial credit model: Application of an EM algorithm. Applied Psychological Measurement, 16(2), 159-176. https://doi.org/10.1177/014662169201600206

Naylor, J. C., & Smith, A. F. M. (1982). Applications of a method for the efficient computation of posterior distributions. Applied Statistics, 31(3), 214-225. https://doi.org/10.2307/2347995

Patterson, B. F., & Ewing (2013). Validating the use of AP exam scores for college course placement [College Board Research Report No. 2013-2]. College Board.

Pohl, Graefe, & Rose (2014). Dealing with omitted and not-reached items in competence tests. Educational and Psychological Measurement, 74(3), 423-452. https://doi.org/10.1177/0013164413504926

R Core Team. (2020). R: A language and environment for statistical computing. R Foundation for Statistical Computing.

Raghunathan, Lepkowski, J. M., van Hoewyk, & Solenberger, P. W. (2001). A multivariate technique for multiply imputing missing values using a sequence of regression models. Survey Methodology, 27, 85-96.

Robitzsch (2019). sirt: Supplementary item response theory models. R package version 3.6-21.

Rose, von Davier, & Nagengast (2017). Modeling omitted and not-reached items in IRT models. Psychometrika, 82(3), 795-819. https://doi.org/10.1007/s11336-016-9544-7

Schafer, J. L. (1997). Analysis of incomplete multivariate data. Chapman & Hall. https://doi.org/10.1201/9781439821862

Schafer, J. L., & Graham, J. W. (2002). Missing data: Our view of the state of the art. Psychological Methods, 7(2), 147-177. https://doi.org/10.1037/1082-989X.7.2.147

Shah, B. K. (1994). On the distribution of the sum of independent integer valued random variables. American Statistician, 27, 123-124.

Shin, S.-H. (2009). How to treat omitted responses in Rasch model-based equating. Practical Assessment, Research, and Evaluation, 14, Article 1.

Sijtsma, & van der Ark, L. A. (2003). Investigation and treatment of missing item scores in test and questionnaire data. Multivariate Behavioral Research, 38(4), 505-528. https://doi.org/10.1207/s15327906mbr3804_4

Sinharay (2016). An NCME instructional module on data mining methods for classification and regression. Educational Measurement: Issues and Practice, 35(3), 38-54. https://doi.org/10.1111/emip.12115

Sinharay (2021). Score reporting for examinees with incomplete data on large-scale assessments. Educational Measurement: Issues and Practice, 40(1), 79-91. https://doi.org/10.1111/emip.12396

Sinharay, Stern, H. S., & Russell (2001). The use of multiple imputation for the analysis of missing data. Psychological Methods, 6(4), 317-329. https://doi.org/10.1037/1082-989X.6.4.317

Sinharay, Wan, Choi, S. W., & Kim (2015). Assessing individual-level impact of interruptions during online testing. Journal of Educational Measurement, 52(1), 80-105. https://doi.org/10.1111/jedm.12064

Sinharay, Wan, Whitaker, Kim, Zhang, & Choi, S. W. (2014). Determining the overall impact of interruptions during online testing. Journal of Educational Measurement, 51(4), 419-440. https://doi.org/10.1111/jedm.12052

Smits, Mellenbergh, G. J., & Vorst, H. C. M. (2002). Alternative missing data techniques to grade point average: Imputing unavailable grades. Journal of Educational Measurement, 39(3), 187-206. https://doi.org/10.1111/j.1745-3984.2002.tb01173.x

Stout (1987). A nonparametric approach for assessing latent trait unidimensionality. Psychometrika, 52(4), 589-617. https://doi.org/10.1007/BF02294821

Sulis & Porcu (2017). Handling missing data in item response theory. Assessing the accuracy of a multiple imputation procedure based on latent class analysis. Journal of Classification, 34(2), 327-359. https://doi.org/10.1007/s00357-017-9220-3

Thissen, Pommerich, Billeaud, & Williams (1995). Item response theory for scores on tests including polytomous items with ordered responses. Applied Psychological Measurement, 19(1), 39-49. https://doi.org/10.1177/014662169501900105

U.S. Medical Licensing Examination. (2020). 2020 bulletin of information. https://www.usmle.org/pdfs/bulletin/bulletin2020.pdf

Vriens & Sinharay (2006). Dealing with missing data in surveys and databases. In The handbook of marketing research (pp. 178-191). Sage. https://doi.org/10.4135/9781412973380.n10

Xiao & Bulut (2020). Evaluating the performances of missing data handling methods in ability estimation from sparse data. Educational and Psychological Measurement, 80(5), 932-954. https://doi.org/10.1177/0013164420911136